For a business requirement I need to join two large files. The left input has about 30,000,000 records and the right input has about 10,000,000 records.
Performance is still not good enough, even though I split the input into separate files by one primary-key column (e.g., all Key1 records go to File1 and File2, then File1 is joined with File2).
Besides removing unused columns, what else can I do to improve the performance of the Join stage? Thanks.
Is there any setup/configuration step that would reduce the processing time?
join stage performance
Moderators: chulett, rschirm, roy
If the files can be pre-sorted by the join key(s), that will help. Use an explicit Sort stage with Sort Mode set to "Don't sort (previously sorted)" so the data are not re-sorted.
You could try two, three, or four reader processes per node, but be aware that this can interfere with partitioning by (at least) the first join key, which should be done on the input link of the Sort stage.
Set the partitioning on the Join stage to Same.
Choose a target that can be written to quickly, such as a Data Set.
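The reason pre-sorting helps: once both inputs are ordered by the join key, the join reduces to a single streaming merge pass with no large in-memory lookup. A minimal sort-merge inner-join sketch in plain Python, illustrating the principle rather than any DataStage internals:

```python
def merge_join(left, right):
    """Inner join of two (key, value) sequences, each pre-sorted by key.
    A single forward pass over both inputs; runs of duplicate keys
    on either side are expanded as a cross product."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # advance the side with the smaller key
        elif lk > rk:
            j += 1
        else:
            # find the run of equal keys on each side
            i2 = i
            while i2 < len(left) and left[i2][0] == lk:
                i2 += 1
            j2 = j
            while j2 < len(right) and right[j2][0] == lk:
                j2 += 1
            for a in range(i, i2):
                for b in range(j, j2):
                    out.append((lk, left[a][1], right[b][1]))
            i, j = i2, j2
    return out
```

If the inputs arrive unsorted, the sort dominates the cost; if they are already sorted (and the Join stage is told so), that cost disappears entirely, which is why the "Don't sort (previously sorted)" setting matters.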
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.