Performance of Merge Stage vs Hashed Lookup to combine files

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

jreddy
Premium Member
Posts: 202
Joined: Tue Feb 03, 2004 5:09 pm

Performance of Merge Stage vs Hashed Lookup to combine files

Post by jreddy »

I need to merge 3 fixed-width sequential files, each with about 15 currency fields and approximately 1.5 million records, which may or may not be sorted on the key values. The number of records may slowly grow over time. I ultimately need to combine the 3 files to be loaded into a staging table. Before choosing the Merge stage over a hashed-file lookup or vice versa, I wanted to find out whether there are any performance issues or pros/cons of either approach that should be kept in mind.

1. Is there a known limit to the size or record count of the files being merged?
2. When two files are being merged, regarding temporary space: is it enough if the temporary directory has at least twice the combined size of the two files available? i.e., if the files are 1 GB each, would 4 GB free in the temporary directory be enough?
3. If we manually created 64-bit hashed files to avoid the 2 GB limit, would that perform better than the Merge stage?
4. If the Merge stage were picked, would time spent pre-sorting the 3 files speed up the merge significantly?

Code:

Example:
File A
KEY A1 A2 ... A15
V   0  1  ... 2
X   1  3  ... 5
Y   4  5  ... 7 
.
.

File B
KEY B1 B2 .... B15
Y   1  7  ...  14
Z   3  4  ...  2
V   1  5  ...  2

File C
KEY C1 C2 .... C15
X   3   5  ... 6
W   3   8  ... 6
V   3   6  ... 7

Result Required:

KEY  A1 ... A15   B1 ... B15   C1 ... C15
X    1  ... 5                  3  ... 6
Y    4  ... 7     1  ... 14
W                              3  ... 6
Z                 3  ... 2
V    1  ... 2     1  ... 2     3  ... 7
The above example may not correctly depict the actual dataset, but all three files will have at least 90-95% of the key values in common, and I need the extra rows (outer join) from each file as well.
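
To make the combine logic concrete, here is a rough sketch of the full outer join on KEY that I am after, in plain Python rather than DataStage. The file names, the delimited format, and the column handling are only placeholders standing in for the real fixed-width definitions.

Code:

import csv
from collections import defaultdict

def load(path, prefix, n_cols=15):
    """Read a delimited stand-in for one fixed-width file into {KEY: {column: value}}."""
    rows = {}
    with open(path, newline="") as f:
        for rec in csv.reader(f):
            key, values = rec[0], rec[1:1 + n_cols]
            rows[key] = {f"{prefix}{i + 1}": v for i, v in enumerate(values)}
    return rows

# Placeholder file names for the three inputs.
a = load("fileA.csv", "A")
b = load("fileB.csv", "B")
c = load("fileC.csv", "C")

combined = defaultdict(dict)           # KEY -> merged row (outer-join semantics)
for source in (a, b, c):
    for key, cols in source.items():
        combined[key].update(cols)     # a key absent from a file simply lacks those columns

for key in sorted(combined):
    print(key, combined[key])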

Thanks in Advance.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

1. No.
2. Yes.
3. Yes.
4. Maybe. But the Merge stage is limited to two inputs so you would need two jobs.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
narasimha
Charter Member
Posts: 1236
Joined: Fri Oct 22, 2004 8:59 am
Location: Staten Island, NY

Post by narasimha »

Some guidance on adjusting for input file size with the Merge stage:

Code:

The Merge stage supports 64-bit files. But you must change the value of the property Max Space in VM for Hash Table to accommodate extremely large input files. Failure to do so results in abnormal termination of jobs. The default value of Max Space in VM for Hash Table is 12. This value is appropriate for many file sizes. As the size of the larger of the two input files grows, you must increase the value of Max Space in VM for Hash Table. For files of 2 GB or larger, you must set the value of Max Space in VM for Hash Table to its maximum value of 512.
Narasimha Kade

Finding answers is simple, all you need to do is come up with the correct questions.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Take the final staging table metadata and map each file into that metadata. The jobs should look like SEQ --> XFM --> SEQ. For each column that doesn't have a source in the respective file, derive the value as "".

Then concatenate the three output files together, maybe with a before-stage/before-job routine call in this next job. Now write a job that looks like SEQ --> AGG --> SEQ. Group by the key column and use a MAX(link.column) derivation on all other columns. Any blanks are ignored, leaving you with just the populated values.
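
As a rough illustration of those two steps, here is a sketch in plain Python rather than DataStage; the file names, the per-file column lists, and the six-column staging layout are hypothetical placeholders for the real metadata.

Code:

import csv

ALL_COLS = ["A1", "A2", "B1", "B2", "C1", "C2"]    # placeholder for the full staging layout

def normalize(path, own_cols):
    """Step 1 (SEQ --> XFM --> SEQ): map one file into the full staging layout,
    deriving every column the file does not supply as ''."""
    out = []
    with open(path, newline="") as f:
        for rec in csv.reader(f):
            key, supplied = rec[0], dict(zip(own_cols, rec[1:]))
            out.append([key] + [supplied.get(col, "") for col in ALL_COLS])
    return out

# Concatenate the three normalized outputs (the "cat the files together" step).
stacked = (normalize("fileA.csv", ["A1", "A2"])
           + normalize("fileB.csv", ["B1", "B2"])
           + normalize("fileC.csv", ["C1", "C2"]))

# Step 2 (SEQ --> AGG --> SEQ): group by KEY and take MAX of every other column.
# Because "" sorts below any non-blank string, MAX keeps the populated value.
merged = {}
for key, *cols in stacked:
    prev = merged.setdefault(key, [""] * len(ALL_COLS))
    merged[key] = [max(p, c) for p, c in zip(prev, cols)]

for key in sorted(merged):
    print(key, merged[key])

The max() over blanks here plays the same role as the MAX(link.column) derivations in the Aggregator stage described above.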

This is the fastest solution, especially compared to joining, because you pass through the data only once for the merge operation. If you prefer a simpler design, you can avoid the separate jobs and put it all in one job using a Link Collector stage to merge the three streams together, but the Link Collector is not a fully stable stage and I avoid it.

You need to NOT use the MERGE stage. The solution I just gave you is much faster and cleaner. It also can be used with a partitioning strategy and multiple job instances in case you need to scale to hundreds of millions of rows.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
I_Server_Whale
Premium Member
Posts: 1255
Joined: Wed Feb 02, 2005 11:54 am
Location: United States of America

Post by I_Server_Whale »

kcbland wrote: but the Link Collector is not a fully stable stage and I avoid it.
Ken, What are the reasons for the Link Collector Stage not to be stable? I have used it many a times without any problem. Are there any specific issues with it?

Thanks,
Whale.
Anything that won't sell, I don't want to invent. Its sale is proof of utility, and utility is success.
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Fundamental law of nature: anything that has a timeout is inherently unstable. A Link Collector is simply a fancy wrapper around a FIFO. You have to set a timeout so that the job won't sit there forever waiting on input to the stage. This means that anything that delays the arrival of rows at the stage could put you in jeopardy of a timeout.
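
As a loose analogy for that timeout risk (Python, not DataStage; the timings and names are made up purely for illustration): a reader pulling from a FIFO with a timeout shorter than an upstream delay gives up even though the data would eventually have arrived.

Code:

import queue
import threading
import time

fifo = queue.Queue()

def slow_producer():
    time.sleep(5)                # upstream stage is delayed (slow source, loaded box, ...)
    fifo.put("row 1")

threading.Thread(target=slow_producer, daemon=True).start()

try:
    row = fifo.get(timeout=2)    # the collector's timeout is shorter than the delay
    print("collected", row)
except queue.Empty:
    print("timed out waiting for input -- the job aborts even though valid data was on its way")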

I prefer metaphysical certitude. A once-in-a-blue-moon tolerance for failure in programming is unacceptable to me. I would rather stream output to a sequential file, concatenate the files together, and then process the entire set as the next milestone in the transformation.

Here's another reason: you cannot easily involve multiple job instances in divide-and-conquer processing. If you want multiple extraction pipelines, that's difficult when you put a Link Collector stage into your design. You've effectively bottlenecked your design at that point.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
jreddy
Premium Member
Posts: 202
Joined: Tue Feb 03, 2004 5:09 pm

Post by jreddy »

Thank you all for your very helpful replies. I will try both the Merge stage and the Aggregator approach kcbland suggested, and choose whichever runs faster.