
Partition and sort

Posted: Thu Mar 29, 2012 9:30 am
by pdntsap
Hello,

We have a requirement to sort on 10 keys, then remove duplicates based on the first 8 of those 10 keys, and then join on the first 9 of them. We have two Sort stages followed by a Join stage, but the partitioning method we chose does not seem to give us the right output. I am looking for the partitioning and sorting approach that should be used across these stages. We have tried several partitioning options but are still confused.

Thanks.

Re: Partition and sort

Posted: Thu Mar 29, 2012 10:38 am
by kwwilliams
Your partitioning requirement is not the same as your sort requirement. Choose one field that has high cardinality and is used in both sorts and as a key in the join, and partition on that field alone. Rows that match on the sort, duplicate-removal, or join keys then always land on the same node. The high cardinality will give you an even spread across your nodes; however, any field that the two sorts and the join have in common will work.
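
To illustrate the idea outside DataStage, here is a minimal Python sketch. It is not how the parallel engine is implemented; the function names are made up for this example, and a simple hash modulo the node count is assumed as the key-based partitioner. It partitions on a single common key, sorts each partition on all 10 keys, and removes duplicates on the first 8.

```python
def hash_partition(rows, key_index, num_nodes):
    """Assign each row to a node using only the chosen partition key (hypothetical helper)."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[hash(row[key_index]) % num_nodes].append(row)
    return parts


def dedupe_sorted(rows, n_keys):
    """Keep the first row of each group; rows must already be sorted on the first n_keys columns."""
    out = []
    for row in rows:
        if not out or out[-1][:n_keys] != row[:n_keys]:
            out.append(row)
    return out


def partitioned_dedupe(rows, key_index, num_nodes, sort_keys=10, dedupe_keys=8):
    """Partition on one key, then sort and remove duplicates independently in each partition."""
    result = []
    for part in hash_partition(rows, key_index, num_nodes):
        part.sort(key=lambda r: r[:sort_keys])
        result.extend(dedupe_sorted(part, dedupe_keys))
    return result
```

Because every row in a duplicate group shares the partition key, the whole group sits on one node, so the per-node result matches what a single sequential sort and duplicate removal would keep. The same reasoning applies to the join, provided the other input is partitioned on the same field.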

On your second sort, are you using "Don't Sort (Previously Sorted)" for the fields already sorted in the first sort? This won't affect the data outcome, but it would be a huge performance improvement in your job if you are not already using this method.
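
As a rough analogy for why this helps (again a Python sketch with a made-up function name, not the Sort stage's actual implementation): when the leading keys are already in order, only the remaining keys need sorting inside each group, so the sort works on many small groups instead of the whole data set.

```python
from itertools import groupby


def sort_within_groups(rows, presorted_keys, total_keys):
    """Rows are already sorted on the first presorted_keys columns;
    order only the remaining key columns inside each group."""
    out = []
    for _, group in groupby(rows, key=lambda r: r[:presorted_keys]):
        out.extend(sorted(group, key=lambda r: r[presorted_keys:total_keys]))
    return out
```

The output is the same as a full sort on all the keys, which is why the data outcome does not change.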