How to merge multiple sequential files together effectively

Post questions here related to DataStage Server Edition, covering such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

xli
Charter Member
Posts: 74
Joined: Fri May 09, 2003 12:31 am

How to merge multiple sequential files together effectively

Post by xli »

Hi,

I have four huge flat files that need to be merged together. They all have the same number of records, 24 million, with the same key fields key1, key2 and key3, and each file has on average 50 fields.

These files must first be sorted by key1, key2 and key3 and merged together; the output then goes through further complex transformation functions in a DataStage job.

I tried loading these files into Oracle tables and applying a simple inner join on an indexed key, ordered by key1, key2 and key3, but the performance is too bad. It is also very slow even without the "order by" applied. :evil:

My questions are:

1. Is there a before-job routine that can sort and merge these files, and how would it perform?

2. What is the best way to sort and merge such huge files?

Thanks in advance for your advice.

xli :roll:
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

The best advice I can give you is to upgrade to Enterprise Edition! It is far more efficient at sorting and joining large sequential files. It will be easier to build and will run a lot faster.

If you stay on server jobs, have a look at my message in this post:
viewtopic.php?p=123234#123234

The four jobs writing to the one hash file let you process all four files at the same time through a multiple-instance job. Instead of splitting your single data field into a key field and non-key fields, you will need to split it into three key fields and the non-key fields. The FIELD function is useful for this.
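
As a minimal sketch of that split in DataStage BASIC, assuming pipe-delimited rows whose first three fields are key1, key2 and key3 (the delimiter and field positions are assumptions; adjust them to your layout):

Code: Select all

* RowData is one whole input row held as a single field
Delim = "|"
Key1 = Field(RowData, Delim, 1)
Key2 = Field(RowData, Delim, 2)
Key3 = Field(RowData, Delim, 3)
* Field with a count returns that many consecutive fields, delimiters
* included, so 9999 simply means "everything from field 4 onwards"
NonKey = Field(RowData, Delim, 4, 9999)

The three keys together form the hash file key, and NonKey becomes the single data column.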
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You can write the routine, the essence of which is:

Code: Select all

* Assumes File1 through File5 have already been opened with OpenSeq (see below)
Loop
While ReadSeq Line1 From File1 Do
   ReadSeq Line2 From File2 Else Exit
   ReadSeq Line3 From File3 Else Exit
   ReadSeq Line4 From File4 Else Exit
   * Adjust the next statement to handle an intervening delimiter character
   NewLine = Line1 : Line2 : Line3 : Line4
   WriteSeq NewLine To File5 Else Exit
Repeat
This depends on the files having been sorted (UNIX sort) and having the same number of rows (which forms part of your specification).
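
For completeness, a sketch of the OpenSeq set-up the loop assumes; the paths and the "MergeRoutine" name in the log calls are illustrative only:

Code: Select all

OpenSeq "/data/file1.srt" To File1 Else Call DSLogFatal("Cannot open file1", "MergeRoutine")
OpenSeq "/data/file2.srt" To File2 Else Call DSLogFatal("Cannot open file2", "MergeRoutine")
OpenSeq "/data/file3.srt" To File3 Else Call DSLogFatal("Cannot open file3", "MergeRoutine")
OpenSeq "/data/file4.srt" To File4 Else Call DSLogFatal("Cannot open file4", "MergeRoutine")
OpenSeq "/data/merged.txt" To File5 Else
   * The Else clause is taken when the output file does not exist yet
   Create File5 Else Call DSLogFatal("Cannot create merged.txt", "MergeRoutine")
End
WeofSeq File5 ;* truncate any earlier contents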
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
talk2shaanc
Charter Member
Posts: 199
Joined: Tue Jan 18, 2005 2:50 am
Location: India

Post by talk2shaanc »

Hi,
I think the Link Collector stage in DataStage will also serve the purpose.
It can take multiple files with the same structure as input and produce one single output, and in the Properties tab you can specify the collection algorithm.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

Unfortunately in this case the sequential files have different column structures, so the Link Collector won't work. The collector does a union of input streams; in this case a join is required.

Your main problem here is the size of the files: 24 million rows with a large number of columns and multiple keys. A routine will have scalability problems; you don't get row buffering or memory management, and you can't monitor the rows. DataStage Server Edition cannot join sequential files directly, and loading that much data into hash files is going to be slow on the hash file build and slow on the hash file lookup. That's why I suggested a design with a single consolidated hash file and multiple instances.

Enterprise Edition is great at this work. Another option is one of the sort tools such as CoSort or SyncSort. They can sort and join sequential files very efficiently.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Re: How to merge multiple sequential files together effectively

Post by kcbland »

xli wrote: They all have the same number of records, 24 million, with the same key fields key1, key2 and key3, and each file has on average 50 fields.
This is going to be brutal. 4 files X 50 columns merged together = ~200 columns. Rows that wide mean you are going to take serious performance penalties just reading all of those characters.
xli wrote: These files must first be sorted by key1, key2 and key3 and merged together; the output then goes through further complex transformation functions in a DataStage job.
Why sort? Merging requires sorting, but if you take the hash file route sorting is a waste of time, because a hash file doesn't care about sort order when loading; it puts the data wherever it wants. If you're not using hash files, then you're using a database for reference purposes.
xli wrote: I tried loading these files into Oracle tables and applying a simple inner join on an indexed key, ordered by key1, key2 and key3, but the performance is too bad. It is also very slow even without the "order by" applied. :evil:
Define "too bad". Did you load the tables, analyze them, and use a parallel query to do the work? Did you enforce a minimum degree of parallelism? How about partitioning the work tables to assist the query parser in dividing the work?
xli wrote: My questions are:
1. Is there a before-job routine that can sort and merge these files, and how would it perform?
None from Ascential, and if you use Unix commands (sort, sed, etc.) you are single-threading and not using all of your resources.
xli wrote: 2. What is the best way to sort and merge such huge files?
A parallel database, such as Oracle 9i. Or DataStage PX. Since you have Oracle already....
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

Maybe this is a case for using the simple 'sort' and 'paste' commands in Unix, with effective use of the tmp area!!
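
A rough sketch of that idea as a before-job routine driving the shell through DSExecute; the file names, the pipe delimiter and the key field positions are all assumptions:

Code: Select all

* Sort each file on its three leading key fields, then glue the
* sorted files together line by line with paste
Cmd  = "sort -t'|' -k1,3 /data/file1.txt -o /tmp/file1.srt && "
Cmd := "sort -t'|' -k1,3 /data/file2.txt -o /tmp/file2.srt && "
Cmd := "sort -t'|' -k1,3 /data/file3.txt -o /tmp/file3.srt && "
Cmd := "sort -t'|' -k1,3 /data/file4.txt -o /tmp/file4.srt && "
Cmd := "paste -d'|' /tmp/file1.srt /tmp/file2.srt /tmp/file3.srt /tmp/file4.srt > /data/merged.txt"
Call DSExecute("UNIX", Cmd, Output, SysRet)
If SysRet <> 0 Then Call DSLogFatal("sort/paste failed: " : Output, "BeforeJobMerge")

Note that paste only lines the files up row by row, so this relies on all four files sorting into the same key order and having identical row counts, as Ray pointed out above.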