How to merge multiple sequential files together effectively

Post questions here related to DataStage Server Edition, covering such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

xli
Charter Member
Posts: 74
Joined: Fri May 09, 2003 12:31 am

How to merge multiple sequential files together effectively

Post by xli »

Hi,

I have four huge flat files that need to be merged together. They all have the same number of records, 24 million, with the same key fields key1, key2 and key3, and each file has on average 50 fields.

These files must first be sorted by key1, key2 and key3 and merged together; the output then goes through further complex transformation functions in a DataStage job.

I tried loading these files into Oracle tables and applying a simple inner join on an indexed key, ordered by key1, key2 and key3, but the performance is too bad. It is also very slow even without the "order by" applied. :evil:

My questions are:

1. Is there a before-job routine that can sort and merge these files, and how would it perform?

2. What is the best way to sort and merge such huge files?

Thanks in advance for your advice.

xli :roll:
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

The best advice I can give you is to upgrade to Enterprise Edition! It is far more efficient at sorting and joining large sequential files. It will be easier to build and will run a lot faster.

If you stay on server jobs, have a look at my message in this post:
viewtopic.php?p=123234#123234

The four jobs writing to the one hash file let you process all four files at the same time through a multiple-instance job. Instead of splitting your single data field into a key field and non-key fields, you will need to split it into three key fields and the non-key fields. The FIELD function is useful for this.
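
As a minimal sketch of that split in DataStage BASIC, assuming pipe-delimited rows whose first three fields are key1, key2 and key3 (the delimiter and field positions are assumptions; adjust them to your layout):

Code: Select all

* RowData is one whole input row held as a single field
Delim = "|"
Key1 = Field(RowData, Delim, 1)
Key2 = Field(RowData, Delim, 2)
Key3 = Field(RowData, Delim, 3)
* Field with a count returns that many consecutive fields, delimiters
* included, so 9999 simply means "everything from field 4 onwards"
NonKey = Field(RowData, Delim, 4, 9999)

The three keys together form the hash file key, and NonKey becomes the single data column.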
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You can write the routine, the essence of which is:

Code: Select all

* Assumes File1 through File5 have already been opened with OpenSeq (see below)
Loop
While ReadSeq Line1 From File1 Do
   ReadSeq Line2 From File2 Else Exit
   ReadSeq Line3 From File3 Else Exit
   ReadSeq Line4 From File4 Else Exit
   * Adjust the next statement to handle an intervening delimiter character
   NewLine = Line1 : Line2 : Line3 : Line4
   WriteSeq NewLine To File5 Else Exit
Repeat
This depends on the files having been sorted (UNIX sort) and having the same number of rows (which forms part of your specification).
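
For completeness, a sketch of the OpenSeq set-up the loop assumes; the paths and the "MergeRoutine" name in the log calls are illustrative only:

Code: Select all

OpenSeq "/data/file1.srt" To File1 Else Call DSLogFatal("Cannot open file1", "MergeRoutine")
OpenSeq "/data/file2.srt" To File2 Else Call DSLogFatal("Cannot open file2", "MergeRoutine")
OpenSeq "/data/file3.srt" To File3 Else Call DSLogFatal("Cannot open file3", "MergeRoutine")
OpenSeq "/data/file4.srt" To File4 Else Call DSLogFatal("Cannot open file4", "MergeRoutine")
OpenSeq "/data/merged.txt" To File5 Else
   * The Else clause is taken when the output file does not exist yet
   Create File5 Else Call DSLogFatal("Cannot create merged.txt", "MergeRoutine")
End
WeofSeq File5 ;* truncate any earlier contents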
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
talk2shaanc
Charter Member
Posts: 199
Joined: Tue Jan 18, 2005 2:50 am
Location: India

Post by talk2shaanc »

Hi,
I think the Link Collector stage in DataStage will also serve the purpose.
It can take multiple files with the same structure as input and produce one single output, and in the Properties tab you can specify the collection algorithm.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

Unfortunately in this case the sequential files have different column structures, so the Link Collector won't work. The collector does a union of input streams; in this case a join is required.

Your main problem here is the size of the files: 24 million rows with a large number of columns and multiple keys. A routine will have scalability problems; you don't get row buffering or memory management, and you can't monitor the rows. DataStage Server Edition cannot join sequential files directly, and loading that much data into hash files is going to be slow on the hash file build and slow on the hash file lookup. That's why I suggested a design with a single consolidated hash file and multiple instances.

Enterprise Edition is great at this work. Another option is one of the sort tools such as CoSort or SyncSort. They can sort and join sequential files very efficiently.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Re: How to merge multiple sequential files together effectively

Post by kcbland »

xli wrote: They all have the same number of records, 24 million, with the same key fields key1, key2 and key3, and each file has on average 50 fields.
This is going to be brutal. 4 files X 50 columns merged together = ~200 columns. Rows that wide mean you are going to take serious performance penalties just reading all of those characters.
xli wrote: These files must first be sorted by key1, key2 and key3 and merged together; the output then goes through further complex transformation functions in a DataStage job.
Why sort? Merging requires sorting, but if you take the hash file route sorting is a waste of time, because a hash file doesn't care about sort order when loading; it puts the data wherever it wants. If you're not using hash files, then you're using a database for reference purposes.
xli wrote: I tried loading these files into Oracle tables and applying a simple inner join on an indexed key, ordered by key1, key2 and key3, but the performance is too bad. It is also very slow even without the "order by" applied. :evil:
Define "too bad". Did you load the tables, analyze them, and use a parallel query to do the work? Did you enforce a minimum degree of parallelism? How about partitioning the work tables to assist the query parser in dividing the work?
xli wrote: My questions are:
1. Is there a before-job routine that can sort and merge these files, and how would it perform?
None from Ascential, and if you use Unix commands (sort, sed, etc.) you are single-threading and not using all of your resources.
xli wrote: 2. What is the best way to sort and merge such huge files?
A parallel database, such as Oracle 9i. Or DataStage PX. Since you have Oracle already....
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

Maybe this is a case for using the simple 'sort' and 'paste' commands in Unix, with effective use of the tmp area!!
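
A rough sketch of that idea as a before-job routine driving the shell through DSExecute; the file names, the pipe delimiter and the key field positions are all assumptions:

Code: Select all

* Sort each file on its three leading key fields, then glue the
* sorted files together line by line with paste
Cmd  = "sort -t'|' -k1,3 /data/file1.txt -o /tmp/file1.srt && "
Cmd := "sort -t'|' -k1,3 /data/file2.txt -o /tmp/file2.srt && "
Cmd := "sort -t'|' -k1,3 /data/file3.txt -o /tmp/file3.srt && "
Cmd := "sort -t'|' -k1,3 /data/file4.txt -o /tmp/file4.srt && "
Cmd := "paste -d'|' /tmp/file1.srt /tmp/file2.srt /tmp/file3.srt /tmp/file4.srt > /data/merged.txt"
Call DSExecute("UNIX", Cmd, Output, SysRet)
If SysRet <> 0 Then Call DSLogFatal("sort/paste failed: " : Output, "BeforeJobMerge")

Note that paste only lines the files up row by row, so this relies on all four files sorting into the same key order and having identical row counts, as Ray pointed out above.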