Should I Repartition or Not?

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.


hitmanthesilentassasin
Participant
Posts: 150
Joined: Tue Mar 13, 2007 1:17 am

Should I Repartition or Not?

Post by hitmanthesilentassasin »

Hi Experts,

1. My understanding is that pipeline parallelism is restricted when a Sort stage is used. Would it help if I used hash partitioning in the Sort stage?

2. I have a series of Join stages, one after another. I have sorted and hash partitioned the data in the first Join stage. When the second Join stage comes in, should I hash partition both input links again, or just sort and hash partition the data coming from the new source? Since the data was already sorted and hash partitioned in the previous Join stage, can I not simply select the preserve-partitioning option so that the partitioning is kept and the logic still works? And after the join, will the previously sorted data become unsorted?

Thanks a lot for your answers.

Regards,
Waseem
bcarlson
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

It is all about colocation. Hashing gets like keys onto the same node: all the 1's on one node, all the 2's on another, and so on. Hash, then sort, for joins. You shouldn't need to rehash or re-sort unless you change your join criteria. I believe the output stream from your first join should retain its partitioning and sorting going into the next join. Just make sure the other stream going into your second join is hashed and sorted the same way.
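To make the colocation idea concrete, here is a minimal Python sketch of it. This is not DataStage code: the function names (`hash_partition`, `merge_join`, `first_join`, `next_join`), the four-partition setup, and the sample data are all assumptions made up for illustration. It hash partitions and sorts both inputs for the first join, then reuses the first join's output as-is for the second join and prepares only the new source.

```python
from operator import itemgetter

NUM_PARTITIONS = 4  # assumed number of nodes, for illustration only


def hash_partition(rows, key, num_partitions=NUM_PARTITIONS):
    """Send every row to the partition chosen by hashing its key value."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def merge_join(left, right, key):
    """Inner join of two lists that are both already sorted on `key`."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-hand rows that share this key value.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair each left-hand row carrying this key with that run.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out  # still sorted on `key`


def first_join(left_rows, right_rows, key):
    """Hash partition and sort both inputs, then join partition by partition."""
    joined = []
    for lp, rp in zip(hash_partition(left_rows, key),
                      hash_partition(right_rows, key)):
        lp.sort(key=itemgetter(key))
        rp.sort(key=itemgetter(key))
        joined.append(merge_join(lp, rp, key))
    return joined


def next_join(joined_parts, new_rows, key):
    """Keep the existing partitioning and sort order; prepare only the new input."""
    result = []
    new_parts = hash_partition(new_rows, key, len(joined_parts))
    for lp, rp in zip(joined_parts, new_parts):
        rp.sort(key=itemgetter(key))
        result.append(merge_join(lp, rp, key))
    return result


# Hypothetical sample data.
orders = [{"cust_id": c, "order": n} for n, c in enumerate([3, 1, 2, 1, 5, 3])]
customers = [{"cust_id": c, "name": f"c{c}"} for c in [1, 2, 3, 4]]
ratings = [{"cust_id": c, "rating": c * 10} for c in [1, 3, 5]]

first = first_join(orders, customers, "cust_id")
second = next_join(first, ratings, "cust_id")
```

In this sketch the second join works without touching its left input because the join key did not change between the two joins. If the second join used a different key, both inputs would have to be partitioned and sorted again.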

Brad.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

If all of the Join stages join on the same key field, or on a subset of the keys specified in the first Join stage, you don't need to repartition and re-sort the data.
In that case, you can wisely choose the superset of keys in the initial Join stage. Otherwise, you need to sort and repartition for each Join.
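One detail worth checking when relying on a key subset is which columns the hash itself was computed on. The sketch below is a toy model in Python, not DataStage behaviour: `partition_of` is a made-up deterministic hash over integer columns and `colocated` is a hypothetical helper. It suggests that hashing on column `a` alone keeps rows together for a join on `a` or on `(a, b)`, while hashing on `(a, b)` does not guarantee colocation for a later join on `a` alone.

```python
def partition_of(row, keys, num_partitions=4):
    """Toy deterministic hash over the chosen key columns (integers only)."""
    h = 0
    for k in keys:
        h = h * 31 + row[k]
    return h % num_partitions


def colocated(rows, partition_keys, join_keys, num_partitions=4):
    """True if all rows sharing the same join-key values land in one partition."""
    seen = {}
    for row in rows:
        join_value = tuple(row[k] for k in join_keys)
        p = partition_of(row, partition_keys, num_partitions)
        if seen.setdefault(join_value, p) != p:
            return False
    return True


# Hypothetical data: every combination of a in 0..2 and b in 0..3.
rows = [{"a": a, "b": b} for a in range(3) for b in range(4)]

hash_a_join_a = colocated(rows, ["a"], ["a"])
hash_a_join_ab = colocated(rows, ["a"], ["a", "b"])
hash_ab_join_ab = colocated(rows, ["a", "b"], ["a", "b"])
hash_ab_join_a = colocated(rows, ["a", "b"], ["a"])
```

Sorting behaves the other way around: data sorted on `(a, b)` is also sorted on `a`, so a sort on the wider key list can be reused by a join on the leading key.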
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

"Blocking" stage types, like Sort and Aggregator, necessarily "interfere with pipeline parallelism" - it's unavoidable. Basically, they cannot output their first row as soon as the first input row has arrived - they have to wait until the sorting or grouping has been completed.

Partitioning is unrelated to this dilemma. You must use correct partitioning to get correct results, but partitioning of data affects partition parallelism, not pipeline parallelism.
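The blocking behaviour can be sketched with Python generators. This is only an analogy, not how the parallel engine is implemented, and the names `source`, `transform`, `sort_stage`, and `reads_before_first_output` are invented for the example. It counts how many input rows have to be read before a stage can emit its first output row.

```python
def source(rows, log):
    """Upstream stage: records each row as it is handed downstream."""
    for row in rows:
        log.append(row)
        yield row


def transform(stream):
    """Row-at-a-time stage: emits a row as soon as one arrives."""
    for row in stream:
        yield row * 10


def sort_stage(stream):
    """Blocking stage: has to consume all of its input before emitting anything."""
    for row in sorted(stream):
        yield row


def reads_before_first_output(stage, rows):
    """Return the first output row and how many input rows were read to get it."""
    log = []
    out = stage(source(rows, log))
    first = next(out)
    return first, len(log)


streaming = reads_before_first_output(transform, [3, 1, 2])
blocking = reads_before_first_output(sort_stage, [3, 1, 2])
```

Here `streaming` reports a single input row read before the first output, while `blocking` reports all three. Splitting the input into partitions would shrink each sort but would not change the fact that every sort has to see its whole partition first.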
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
hitmanthesilentassasin
Participant
Posts: 150
Joined: Tue Mar 13, 2007 1:17 am

Post by hitmanthesilentassasin »

Hi All,

Thanks a lot for the answers. :D

Regards,
Waseem