join stage performance



wuruima
Participant
Posts: 65
Joined: Mon Nov 04, 2013 10:15 pm


Post by wuruima »

Because of a business requirement, I need to join two large files.

The left link has about 30,000,000 records and the right link has about 10,000,000 records.

Performance is still not good enough, even though I split the input files by one primary key column (e.g. the Key1 records go to File1 and File2, and then File1 is joined to File2).

Besides removing unneeded columns, what else can I do to improve the performance of the Join stage? Thanks.

Is there any setup/config or steps to do to reduce the processing time?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

If the files can be pre-sorted by the join key(s), that would help. Use an explicit Sort stage with Sort Mode set to "Don't sort (previously sorted)".
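
To illustrate why pre-sorted inputs help, here is a minimal Python sketch of a sort-merge inner join. It is not DataStage code; the function name `merge_join` and the (key, value) row layout are assumptions made for the example. When both inputs already arrive in key order, the join is a single forward pass over each input, with no sorting inside the join itself.

```python
from typing import Iterator, List, Tuple

Row = Tuple[int, str]  # hypothetical layout: (join key, payload)


def merge_join(left: List[Row], right: List[Row]) -> Iterator[Tuple[int, str, str]]:
    """Inner join of two inputs that are ALREADY sorted by key."""
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # left key has no match yet, advance left
        elif lk > rk:
            j += 1          # right key has no match yet, advance right
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lk:
                for _, rv in right[j:j_end]:
                    yield (lk, left[i][1], rv)
                i += 1
            j = j_end


if __name__ == "__main__":
    left_rows = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
    right_rows = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
    for joined in merge_join(left_rows, right_rows):
        print(joined)
```

Only the rows for the current key need to be held at once, which is why sorted input keeps memory use low on large links.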

You could try throwing two or three or four reader processes per node at it, but this could interfere with partitioning by at least the first join key (which should be done on the input of the Sort stage).

Set partitioning on the Join stage to Same.
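
The idea behind partitioning both inputs on the join key and then using Same on the Join can be sketched in Python. This is only an illustration, not how the engine is implemented; `partition_by_key`, `join_partition`, and the CRC32-based hash are assumptions chosen for the example, and the per-partition join uses a dictionary lookup for brevity rather than a sort-merge.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, str]  # hypothetical layout: (join key, payload)


def partition_by_key(rows: List[Row], num_partitions: int) -> List[List[Row]]:
    """Send each row to a partition chosen by a deterministic hash of its key."""
    parts: List[List[Row]] = [[] for _ in range(num_partitions)]
    for key, value in rows:
        idx = zlib.crc32(key.encode("utf-8")) % num_partitions
        parts[idx].append((key, value))
    return parts


def join_partition(left_part: List[Row], right_part: List[Row]) -> List[Tuple[str, str, str]]:
    """Inner join of one left partition with the matching right partition."""
    lookup: Dict[str, List[str]] = defaultdict(list)
    for key, value in right_part:
        lookup[key].append(value)
    return [(k, lv, rv) for k, lv in left_part for rv in lookup.get(k, [])]


def partitioned_join(left: List[Row], right: List[Row], num_partitions: int) -> List[Tuple[str, str, str]]:
    """Partition both inputs the same way, then join partition i with partition i."""
    left_parts = partition_by_key(left, num_partitions)
    right_parts = partition_by_key(right, num_partitions)
    out: List[Tuple[str, str, str]] = []
    for lp, rp in zip(left_parts, right_parts):
        out.extend(join_partition(lp, rp))  # no data moves between partitions here
    return out
```

Because the same hash is applied to both inputs, rows with equal keys always land in the same partition number, so each partition can be joined independently and nothing has to be repartitioned at the join.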

Choose a target that can be written to fast, such as a Data Set.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
maypandh
Participant
Posts: 13
Joined: Mon Apr 18, 2016 8:37 am

Post by maypandh »

When reading from a sequential file, set the Read From Multiple Nodes option. Also set Keep File Partitions to True.
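
As a rough picture of what reading one file from several nodes means, the Python sketch below splits a newline-delimited file into byte ranges that each start and end on a record boundary, so several readers could each take one range. This is a conceptual illustration only and makes no claim about how the Sequential File stage does it internally; `split_ranges` and `read_range` are hypothetical helper names.

```python
import os
from typing import List, Tuple


def split_ranges(path: str, num_readers: int) -> List[Tuple[int, int]]:
    """Return (start, end) byte ranges aligned to line boundaries."""
    size = os.path.getsize(path)
    # Start from an even split, then push each interior cut to the next line end.
    cuts = [size * i // num_readers for i in range(num_readers + 1)]
    with open(path, "rb") as f:
        for i in range(1, num_readers):
            f.seek(cuts[i])
            f.readline()        # consume the rest of the record we landed in
            cuts[i] = f.tell()  # the next range starts on a fresh record
    return [(cuts[i], cuts[i + 1]) for i in range(num_readers)]


def read_range(path: str, start: int, end: int) -> bytes:
    """Read one byte range; each reader would call this for its own range."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)
```

The ranges are contiguous and never cut a record in half, although some may be empty when the file has fewer records than readers.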