Should I Repartition or Not?

hitmanthesilentassasin · Fri Mar 21, 2008 12:43 pm

Hi Experts,

1. As I understand it, pipeline parallelism is restricted when a Sort stage is used. Would it help if I used hash partitioning in the Sort stage?

2. I have a series of Join stages, one followed by another. I sorted and hash partitioned the data in the first Join stage. When the second Join stage comes, should I hash partition both input links again, or just sort and hash partition the data coming from the new source? Since the data was already sorted and hash partitioned in the previous Join stage, can I not just select the preserve partitioning option so that the partitioning is preserved and the logic still works? Also, will the previously sorted data be unsorted after the join?

Thanks a lot for your answers.

Regards,
Waseem

bcarlson · Fri Mar 21, 2008 2:54 pm

It is all about colocation. Hashing puts like keys on the same node: all the 1s on one node, all the 2s on another, and so on. Hash, then sort, for joins. You shouldn't need to rehash or re-sort unless you change your join criteria. I believe the output stream from your first join should retain its partitioning and sorting going into the next join. Just make sure the other stream going into your second join is also hashed and sorted the same way.

Brad.
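
To make the hash-then-sort idea concrete, here is a minimal Python sketch. It is not DataStage code: the function names, the three-partition count, and the sample rows are all hypothetical, and the "nodes" are plain lists. It only illustrates why the output of the first join can go into a second join on the same key without being rehashed or re-sorted, while the new input still has to be hashed and sorted the same way.

```python
NUM_PARTITIONS = 3  # hypothetical node count, chosen only for illustration


def hash_partition(rows, key, n=NUM_PARTITIONS):
    """Route each row to a partition by hashing its key, so equal keys are colocated."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts


def sort_partition(rows, key):
    """Sort one partition on the join key."""
    return sorted(rows, key=lambda r: r[key])


def merge_join(left, right, key):
    """Inner join of two inputs that are already sorted on key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-hand rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left-hand row with that run.
            while i < len(left) and left[i][key] == lk:
                out.extend({**left[i], **r} for r in right[j:j_end])
                i += 1
            j = j_end
    return out  # rows come out in key order, so the result stays sorted


def first_join(left, right, key):
    """Hash and sort both inputs, then join partition by partition."""
    pairs = zip(hash_partition(left, key), hash_partition(right, key))
    return [merge_join(sort_partition(l, key), sort_partition(r, key), key)
            for l, r in pairs]


def next_join(joined_parts, new_source, key):
    """Reuse the existing partitioning and sort order; prepare only the new input."""
    new_parts = hash_partition(new_source, key, len(joined_parts))
    return [merge_join(p, sort_partition(n, key), key)
            for p, n in zip(joined_parts, new_parts)]


# Hypothetical sample data, all joined on "id".
customers = [{"id": 3, "name": "c"}, {"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
orders = [{"id": 1, "amt": 10}, {"id": 3, "amt": 30}, {"id": 1, "amt": 11}]
regions = [{"id": 1, "region": "east"}, {"id": 3, "region": "west"}]

result = next_join(first_join(customers, orders, "id"), regions, "id")
```

If the second join used a different key, the partitions produced by `first_join` would no longer colocate matching rows, and both inputs would need to be hashed and sorted again.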

kumar_s · Fri Mar 21, 2008 3:07 pm

If all of the Join stages join on the same key field, or on a subset of the keys specified in the first Join stage, you don't need to repartition and sort the data again.
In that case, you can wisely choose the superset of keys in the initial Join stage. Otherwise, you need to sort and repartition for each Join.

ray.wurlod · Fri Mar 21, 2008 5:19 pm

"Blocking" stage types, like Sort and Aggregator, necessarily interfere with pipeline parallelism; it's unavoidable. They cannot output their first row as soon as the first input row has arrived. They have to wait until the sorting or grouping has been completed.

Partitioning is unrelated to this dilemma. You must use correct partitioning to get correct results, but partitioning of data affects partition parallelism, not pipeline parallelism.
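
The difference between a streaming stage and a blocking stage can be sketched with Python generators. This is only an analogy, not how DataStage is implemented; the stage functions and the `trace` helper are hypothetical names. Each stage records when it reads and when it emits, so the log shows whether output can start before the input is exhausted.

```python
def source(rows, log):
    """Feed rows downstream one at a time, logging each read."""
    for row in rows:
        log.append(("read", row))
        yield row


def transform_stage(stream, log):
    """Streaming stage: emits each row as soon as it arrives."""
    for row in stream:
        log.append(("emit", row))
        yield row


def sort_stage(stream, log):
    """Blocking stage: must consume every input row before emitting the first one."""
    for row in sorted(stream):
        log.append(("emit", row))
        yield row


def trace(stage, rows):
    """Run one stage over the rows and return the order of reads and emits."""
    log = []
    for _ in stage(source(rows, log), log):
        pass
    return log


streaming_log = trace(transform_stage, [3, 1, 2])
blocking_log = trace(sort_stage, [3, 1, 2])
```

In `streaming_log` the reads and emits alternate, while in `blocking_log` all three reads come before the first emit. Splitting the rows across partitions would make each sort smaller, but every partition's sort would still have to finish reading before it could emit.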

hitmanthesilentassasin · Fri Mar 21, 2008 10:05 pm

Hi All,

Thanks a lot for the answers. :D

Regards,
Waseem