Job Performance
Posted: Mon Aug 24, 2009 8:58 am
Hi All,
I have a job that currently uses about 11 Join stages to load a dataset, joining about 14 tables. The job really does need that many Join stages, because each join is based on the result of the previous join, and its join keys also come from the previous Join stage's output.
Although the jobs here are parallel jobs, they currently run on just 1 node; going forward, the plan is to run them on 4 nodes.
All the Join stages currently use Auto as the partitioning method, and no sorting is done.
Since we are now planning for 4 nodes, we need to sort and partition the data correctly. My concern is that I would have to sort and hash-partition the data at each Join stage, because the keys differ from one join to the next, and I think doing that would hurt performance.
If it does, are there any hints on what I can do to improve performance?
Will I have to sort and partition the data at each Join stage?
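To illustrate the concern, here is a rough Python sketch. It is not DataStage code, only a toy model of hash partitioning across 4 nodes. The function names and the sample columns cust_id and region_id are made up for the example, and a simple modulo on integer keys stands in for the real hash function. It shows that data hash-partitioned on one join key is generally not co-located on a different key, so the next join would need the data repartitioned on its own key.

```python
NODES = 4  # assumed target node count from the post

def hash_partition(rows, key, nodes=NODES):
    """Distribute rows across partitions by a toy hash (modulo) of row[key]."""
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[int(row[key]) % nodes].append(row)
    return parts

def is_colocated(parts, key):
    """True if every distinct value of `key` appears in only one partition."""
    seen = {}
    for idx, part in enumerate(parts):
        for row in part:
            if seen.setdefault(row[key], idx) != idx:
                return False
    return True

def flatten(parts):
    """Collect all rows back into a single list, as a repartition would read them."""
    return [row for part in parts for row in part]

# Hypothetical sample data: the first join uses cust_id, the second region_id.
rows = [{"cust_id": i, "region_id": i % 3} for i in range(12)]

by_cust = hash_partition(rows, "cust_id")
print(is_colocated(by_cust, "cust_id"))    # rows agree on the first join key
print(is_colocated(by_cust, "region_id"))  # but not on the second join key

# Repartitioning on the second key restores co-location for the second join.
by_region = hash_partition(flatten(by_cust), "region_id")
print(is_colocated(by_region, "region_id"))
```

In this toy model, the repartition step before the second join is the extra cost I am worried about repeating at every Join stage.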
Any help is highly appreciated
Thanks in advance