Performance of Merge Stage vs Hashed Lookup to combine files

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

jreddy
Premium Member
Posts: 202
Joined: Tue Feb 03, 2004 5:09 pm

Performance of Merge Stage vs Hashed Lookup to combine files

Post by jreddy »

I need to merge 3 fixed-width sequential files, each with about 15 currency fields and approximately 1.5 million records, which may or may not be sorted on the key values. The number of records may slowly grow over time. I ultimately need to combine the 3 files to be loaded into a staging table. Before choosing the Merge stage over a hashed-file lookup or vice versa, I wanted to find out whether there are any performance issues or pros/cons of either approach that should be kept in mind.

1. Is there a known limit to the size or record count of the files being merged?
2. When two files are being merged, regarding temporary space: is it enough if the temporary directory has at least twice the combined size of the two files available? i.e., if the files are 1 GB each, would 4 GB free in the temporary directory be enough?
3. If we manually created 64-bit hashed files to avoid the 2 GB limit, would that perform better than the Merge stage?
4. If the Merge stage were picked, would time spent pre-sorting the 3 files speed up the merge significantly?

Code:

Example:
File A
KEY A1 A2 ... A15
V   0  1  ... 2
X   1  3  ... 5
Y   4  5  ... 7 
.
.

File B
KEY B1 B2 .... B15
Y   1  7  ...  14
Z   3  4  ...  2
V   1  5  ...  2

File C
KEY C1 C2 .... C15
X   3   5  ... 6
W   3   8  ... 6
V   3   6  ... 7

Result Required:

KEY  A1 ... A15   B1 ... B15   C1 ... C15
X    1  ... 5                  3  ... 6
Y    4  ... 7     1  ... 14
W                              3  ... 6
Z                 3  ... 2
V    1  ... 2     1  ... 2     3  ... 7
The above example may not correctly depict the actual dataset, but all three files will have at least 90-95% of the key values in common, and I need the extra rows (outer join) from each file as well.
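
To make the combine logic concrete, here is a rough sketch of the full outer join on KEY that I am after, in plain Python rather than DataStage. The file names, the delimited format, and the column handling are only placeholders standing in for the real fixed-width definitions.

Code:

import csv
from collections import defaultdict

def load(path, prefix, n_cols=15):
    """Read a delimited stand-in for one fixed-width file into {KEY: {column: value}}."""
    rows = {}
    with open(path, newline="") as f:
        for rec in csv.reader(f):
            key, values = rec[0], rec[1:1 + n_cols]
            rows[key] = {f"{prefix}{i + 1}": v for i, v in enumerate(values)}
    return rows

# Placeholder file names for the three inputs.
a = load("fileA.csv", "A")
b = load("fileB.csv", "B")
c = load("fileC.csv", "C")

combined = defaultdict(dict)           # KEY -> merged row (outer-join semantics)
for source in (a, b, c):
    for key, cols in source.items():
        combined[key].update(cols)     # a key absent from a file simply lacks those columns

for key in sorted(combined):
    print(key, combined[key])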

Thanks in Advance.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

1. No.
2. Yes.
3. Yes.
4. Maybe. But the Merge stage is limited to two inputs so you would need two jobs.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
narasimha
Charter Member
Posts: 1236
Joined: Fri Oct 22, 2004 8:59 am
Location: Staten Island, NY

Post by narasimha »

Some guidance on adjusting for input file size with the Merge stage:

Code:

The Merge stage supports 64-bit files. But you must change the value of the property Max Space in VM for Hash Table to accommodate extremely large input files. Failure to do so results in abnormal termination of jobs. The default value of Max Space in VM for Hash Table is 12. This value is appropriate for many file sizes. As the size of the larger of the two input files grows, you must increase the value of Max Space in VM for Hash Table. For files of 2 GB or larger, you must set the value of Max Space in VM for Hash Table to its maximum value of 512.
Narasimha Kade

Finding answers is simple, all you need to do is come up with the correct questions.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Take the final staging table metadata and map each file into that metadata. The jobs should look like SEQ --> XFM --> SEQ. For each column that doesn't have a source in the respective file, derive the value as "".

Then concatenate the three output files together, maybe with a before-stage/before-job routine call in this next job. Now write a job that looks like SEQ --> AGG --> SEQ. Group by the key column and use a MAX(link.column) derivation on all other columns. Any blanks are ignored, leaving you with just the populated values.
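
As a rough illustration of those two steps, here is a sketch in plain Python rather than DataStage; the file names, the per-file column lists, and the six-column staging layout are hypothetical placeholders for the real metadata.

Code:

import csv

ALL_COLS = ["A1", "A2", "B1", "B2", "C1", "C2"]    # placeholder for the full staging layout

def normalize(path, own_cols):
    """Step 1 (SEQ --> XFM --> SEQ): map one file into the full staging layout,
    deriving every column the file does not supply as ''."""
    out = []
    with open(path, newline="") as f:
        for rec in csv.reader(f):
            key, supplied = rec[0], dict(zip(own_cols, rec[1:]))
            out.append([key] + [supplied.get(col, "") for col in ALL_COLS])
    return out

# Concatenate the three normalized outputs (the "cat the files together" step).
stacked = (normalize("fileA.csv", ["A1", "A2"])
           + normalize("fileB.csv", ["B1", "B2"])
           + normalize("fileC.csv", ["C1", "C2"]))

# Step 2 (SEQ --> AGG --> SEQ): group by KEY and take MAX of every other column.
# Because "" sorts below any non-blank string, MAX keeps the populated value.
merged = {}
for key, *cols in stacked:
    prev = merged.setdefault(key, [""] * len(ALL_COLS))
    merged[key] = [max(p, c) for p, c in zip(prev, cols)]

for key in sorted(merged):
    print(key, merged[key])

The max() over blanks here plays the same role as the MAX(link.column) derivations in the Aggregator stage described above.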

This is the fastest solution, especially compared to joining, because you pass through the data only once for the merge operation. If you prefer a simpler design, you can avoid the separate jobs and put it all in one job using a Link Collector stage to merge the three streams together, but the Link Collector is not a fully stable stage and I avoid it.

You need to NOT use the MERGE stage. The solution I just gave you is much faster and cleaner. It also can be used with a partitioning strategy and multiple job instances in case you need to scale to hundreds of millions of rows.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
I_Server_Whale
Premium Member
Posts: 1255
Joined: Wed Feb 02, 2005 11:54 am
Location: United States of America

Post by I_Server_Whale »

kcbland wrote: but the Link Collector is not a fully stable stage and I avoid it.
Ken, What are the reasons for the Link Collector Stage not to be stable? I have used it many a times without any problem. Are there any specific issues with it?

Thanks,
Whale.
Anything that won't sell, I don't want to invent. Its sale is proof of utility, and utility is success.
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Fundamental law of nature: anything that has a timeout is inherently unstable. A Link Collector is simply a fancy wrapper around a FIFO. You have to set a timeout so that the job won't sit there forever waiting on input to the stage. This means that anything that delays the arrival of rows at the stage could put you in jeopardy of a timeout.
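
As a loose analogy for that timeout risk (Python, not DataStage; the timings and names are made up purely for illustration): a reader pulling from a FIFO with a timeout shorter than an upstream delay gives up even though the data would eventually have arrived.

Code:

import queue
import threading
import time

fifo = queue.Queue()

def slow_producer():
    time.sleep(5)                # upstream stage is delayed (slow source, loaded box, ...)
    fifo.put("row 1")

threading.Thread(target=slow_producer, daemon=True).start()

try:
    row = fifo.get(timeout=2)    # the collector's timeout is shorter than the delay
    print("collected", row)
except queue.Empty:
    print("timed out waiting for input -- the job aborts even though valid data was on its way")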

I prefer metaphysical certitude. A once-in-a-blue-moon tolerance for failure in programming is unacceptable to me. I would rather stream output to a sequential file, concatenate the files together, and then process the entire set as the next milestone in the transformation.

Here's another reason: you cannot easily involve multiple job instances in divide-and-conquer processing. If you want multiple extraction pipelines, that's difficult when you put a Link Collector stage into your design. You've effectively bottlenecked your design at that point.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
jreddy
Premium Member
Posts: 202
Joined: Tue Feb 03, 2004 5:09 pm

Post by jreddy »

Thank you all for your very helpful replies. I will try both the Merge stage and the Aggregator approach kcbland suggested, and choose whichever runs faster.