For a business requirement I need to join two large files. The left input has about 30,000,000 records and the right input has about 10,000,000 records.
Performance is still not good enough, even though I split the input into separate files by one primary-key column (e.g., all Key1 records go to File1 and File2, then File1 is joined with File2).
Besides removing unused columns, what else can I do to improve the performance of the Join stage? Thanks.
Is there any setup/configuration step that would reduce the processing time?
join stage performance
Moderators: chulett, rschirm, roy
If the files can be pre-sorted by the join key(s), that will help. Use an explicit Sort stage with Sort Mode set to "Don't sort (previously sorted)" so the data are not re-sorted.
You could try two, three, or four reader processes per node, but be aware that this can interfere with partitioning by (at least) the first join key, which should be done on the input link of the Sort stage.
Set the partitioning on the Join stage to Same.
Choose a target that can be written to quickly, such as a Data Set.
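The reason pre-sorting helps: once both inputs are ordered by the join key, the join reduces to a single streaming merge pass with no large in-memory lookup. A minimal sort-merge inner-join sketch in plain Python, illustrating the principle rather than any DataStage internals:

```python
def merge_join(left, right):
    """Inner join of two (key, value) sequences, each pre-sorted by key.
    A single forward pass over both inputs; runs of duplicate keys
    on either side are expanded as a cross product."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # advance the side with the smaller key
        elif lk > rk:
            j += 1
        else:
            # find the run of equal keys on each side
            i2 = i
            while i2 < len(left) and left[i2][0] == lk:
                i2 += 1
            j2 = j
            while j2 < len(right) and right[j2][0] == lk:
                j2 += 1
            for a in range(i, i2):
                for b in range(j, j2):
                    out.append((lk, left[a][1], right[b][1]))
            i, j = i2, j2
    return out
```

If the inputs arrive unsorted, the sort dominates the cost; if they are already sorted (and the Join stage is told so), that cost disappears entirely, which is why the "Don't sort (previously sorted)" setting matters.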
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.