Job Performance

sshettar · Post by **sshettar** » Mon Aug 24, 2009 8:58 am

Hi All,

I have a job that currently uses 11 Join stages to load a dataset, joining 14 tables in all. The job does need that many Join stages, because each join works on the result of the previous join, and the join keys also come from the previous Join stage's output.
The jobs here are parallel jobs, but they currently run on just 1 node; going forward, the plan is to use 4 nodes.
All the Join stages currently use Auto as the partitioning method, with no explicit sorting.
Since we are now planning for 4 nodes, we need to sort and partition the data correctly. My concern is that I would have to sort and hash-partition the data at every Join stage, because the keys differ from one join to the next, and I think that would hurt performance.
If it does, any hints on what I can do to improve performance?
Will I have to sort and partition at each Join stage?

Any help is highly appreciated

Thanks in advance

miwinter · Post by **miwinter** » Mon Aug 24, 2009 9:09 am

Yes, as a matter of course, your data will need to be sorted and partitioned appropriately - that's always true.
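To illustrate why the partitioning has to follow the join key, here is a minimal Python sketch (not DataStage code). The table names, column names, and the `hash_partition` / `partitioned_join` helpers are all hypothetical, made up for the example. It shows that rows with the same key value always land on the same partition, and that a second join on a different key means the intermediate result has to be partitioned again on that new key.

```python
from collections import defaultdict
from zlib import crc32


def hash_partition(rows, key, num_nodes):
    """Assign each row to a partition by hashing its join-key value."""
    partitions = defaultdict(list)
    for row in rows:
        node = crc32(str(row[key]).encode()) % num_nodes
        partitions[node].append(row)
    return partitions


def partitioned_join(left, right, key, num_nodes):
    """Inner-join two row lists, one partition at a time (illustration only)."""
    left_parts = hash_partition(left, key, num_nodes)
    right_parts = hash_partition(right, key, num_nodes)
    joined = []
    for node in range(num_nodes):
        # Sorting mirrors what a sort-merge join would need on each node;
        # the simple nested loop below does not actually depend on it.
        l_rows = sorted(left_parts[node], key=lambda r: r[key])
        r_rows = sorted(right_parts[node], key=lambda r: r[key])
        for l_row in l_rows:
            for r_row in r_rows:
                if l_row[key] == r_row[key]:
                    joined.append({**l_row, **r_row})
    return joined


# Hypothetical sample data: each join uses a key produced by the previous one.
orders = [{"order_id": 1, "cust_id": 10}, {"order_id": 2, "cust_id": 20}]
customers = [{"cust_id": 10, "region_id": 7}, {"cust_id": 20, "region_id": 8}]
regions = [{"region_id": 7, "region": "EMEA"}, {"region_id": 8, "region": "APAC"}]

# First join is partitioned on cust_id ...
step1 = partitioned_join(orders, customers, "cust_id", num_nodes=4)
# ... the second join uses a different key, so step1 is repartitioned on region_id.
step2 = partitioned_join(step1, regions, "region_id", num_nodes=4)
```

Each call repartitions both inputs on its own key, which is the per-join cost the original question is worried about. Whether that cost matters depends on the data volume.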

Consider, if volumes are amenable to it, the use of a lookup instead of a join. Given the number of stages in this process, I'd also be keen to look at splitting the process down and landing datasets at intermediate points instead.
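As a rough picture of the difference, a lookup can be thought of as holding the reference data in memory and probing it row by row, so the main stream needs no sort. The Python sketch below is only an analogy under that assumption, not how the Lookup stage is implemented; `build_lookup`, `lookup_stream`, and the sample rows are hypothetical names for illustration.

```python
def build_lookup(reference_rows, key):
    """Load the (small) reference data into an in-memory table keyed on `key`.

    If the reference data has duplicate keys, the last row wins in this sketch.
    """
    return {row[key]: row for row in reference_rows}


def lookup_stream(stream_rows, lookup, key):
    """Probe the in-memory table once per stream row; no sorting is needed.

    Rows without a match are dropped here (an assumption of this sketch).
    """
    out = []
    for row in stream_rows:
        match = lookup.get(row[key])
        if match is not None:
            out.append({**row, **match})
    return out


# Hypothetical sample data.
stream = [
    {"order_id": 1, "cust_id": 10},
    {"order_id": 2, "cust_id": 99},  # no matching reference row
]
reference = [{"cust_id": 10, "name": "Acme"}]

enriched = lookup_stream(stream, build_lookup(reference, "cust_id"), "cust_id")
```

This only pays off while the reference data fits comfortably in memory, which is why it depends on the volumes.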

Oh, and... testing, testing, testing :D

sshettar · Post by **sshettar** » Mon Aug 24, 2009 11:02 am

Thanks Mark!
I will try to use the Lookup stage wherever possible.

arnabdey · Post by **arnabdey** » Wed Aug 26, 2009 11:07 am

The Join stage itself sorts the incoming data on both links. Using so many lookups in cascade may also degrade performance, even if the volume of data in the reference datasets is low. So I feel the best thing is to break the job up into two, with 5-6 lookups in each, and use datasets as intermediate storage.

sshettar · Post by **sshettar** » Wed Aug 26, 2009 12:22 pm

Well, currently the job takes just about 4 minutes to complete. I checked with my lead, and he says the volume of data this job deals with is very small and is not expected to grow much.

So I was thinking it would be better, for now, to just leave the job as it is, with the 11 Join stages and Auto partitioning. When I changed the partitioning to Hash for all the Join stages, sorted the data accordingly, and ran on 2 nodes instead of 1, the job took longer than the older version with Auto partitioning on 1 node.
That version takes about 6 to 7 minutes.

After going through a couple of sites, I learned something new: if the partitioning is left as Auto on the input links of the Join stages, Auto partitioning takes care of both the partitioning and the sorting. Please correct me if I am wrong.

Do you think that keeping the partitioning as Auto and using 2 nodes would solve the problem for me?

Thanks in advance

miwinter · Post by **miwinter** » Thu Aug 27, 2009 2:44 am

If you used auto partitioning over 2 nodes, you would get the same effect as you did with 2 nodes and hash partitioning, so no, this won't solve your 'problem'. That said, having now realised we are talking about a job that doesn't process a high volume, I'm not sure where the need to tune this process comes from. A job that takes 4 or 5 minutes shouldn't really be a focus of tuning effort, I would surmise, unless it is expected to handle volumes which will grow significantly (something you have already ruled out).