join stage performance
Posted: Wed May 04, 2016 11:47 pm
For a business requirement, I need to join two large files.
The left link is about 30,000,000 records and the right link is 10,000,000 records.
The performance is not good enough, even though I split the input files by one primary key column (e.g. Key1 records go to File1 and File2, and then File1 is joined to File2).
Besides removing the unused columns, what else can I do to improve the performance of the join stage? Thanks.
Is there any setup, configuration, or other step that would reduce the processing time?
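To make the split-then-join idea concrete, here is a minimal Python sketch of it. This is not the actual job, only an illustration: the function names (partition_by_key, hash_join, partitioned_join) are made up, and it assumes both inputs are partitioned on the same key, so matching records always land in the same pair of partitions, and that each right-side partition fits in memory.

```python
from collections import defaultdict


def partition_by_key(records, key, num_partitions):
    """Route each record to a bucket chosen by hashing its join key."""
    buckets = [[] for _ in range(num_partitions)]
    for rec in records:
        buckets[hash(rec[key]) % num_partitions].append(rec)
    return buckets


def hash_join(left, right, key):
    """Inner join: index the smaller (right) side, then stream the left side."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    for l in left:
        for r in index.get(l[key], []):
            # Merge the two records; left-side values win on name clashes.
            yield {**r, **l}


def partitioned_join(left, right, key, num_partitions=4):
    """Partition both inputs on the same key, then join partition pairs."""
    left_parts = partition_by_key(left, key, num_partitions)
    right_parts = partition_by_key(right, key, num_partitions)
    out = []
    for lp, rp in zip(left_parts, right_parts):
        out.extend(hash_join(lp, rp, key))
    return out
```

Each pair of partitions is independent, so in principle the pairs could be joined in parallel; whether that helps in practice depends on the number of nodes and on how evenly the key values are distributed.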