Keeping data sorted whilst repartitioning.

dohertys · Post by **dohertys** » Thu Jan 29, 2009 7:40 am

I'm trying to understand what happens to sorted data as it gets repartitioned.

I'm fairly happy to assume that if I have sorted data on multiple nodes, and then repartition it, then it cannot still be sorted. However, I'm not sure what would happen if I had data sorted on a single node and then repartition it to multiple nodes.

For example...
If I have a dataset which contains sorted data and was written using just 1 node, and I then read that dataset using a job that is running on multiple nodes, will that data still be sorted?

Is there a way I can confirm this? If I use a sort stage, with the setting 'Don't sort - already sorted' will it generate an error if the data is not sorted correctly?

Thanks

Mike · Post by **Mike** » Thu Jan 29, 2009 8:32 am

Depends on your partitioning method...

Hash partition will certainly disrupt the sort order. Round robin partitioning will maintain sort order within the partitions, but round robin is not suitable for key-based operations.
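
To make the round robin point concrete, here is a small Python simulation of going from one sorted sequential stream to two partitions. It is only a sketch and not how the parallel engine is implemented: `round_robin`, `partition_by_key`, `is_sorted` and `partitions_per_key` are made-up helper names, and `key % n` is just a stand-in for a real hash function. It shows that each round robin partition stays sorted on its own, but rows with the same key end up on different partitions, which is why round robin does not suit key-based operations.

```python
from collections import defaultdict


def round_robin(rows, n):
    """Deal rows to n partitions in turn (toy round robin partitioner)."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def partition_by_key(rows, n):
    """Toy key-based partitioner: key modulo n stands in for a real hash."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[row % n].append(row)
    return parts


def is_sorted(rows):
    """True if every row is >= the row before it."""
    return all(a <= b for a, b in zip(rows, rows[1:]))


def partitions_per_key(parts):
    """Map each key to the set of partition numbers it landed on."""
    seen = defaultdict(set)
    for idx, part in enumerate(parts):
        for key in part:
            seen[key].add(idx)
    return seen


# One sorted stream, as if written on a single node (rows are just keys here).
rows = [1, 1, 2, 2, 3, 3, 4, 4]

rr = round_robin(rows, 2)
keyed = partition_by_key(rows, 2)

print("round robin partitions:", rr)
print("each sorted on its own:", all(is_sorted(p) for p in rr))
print("sorted when concatenated:", is_sorted(rr[0] + rr[1]))
print("partitions holding key 1 (round robin):", partitions_per_key(rr)[1])
print("partitions holding key 1 (key-based):", partitions_per_key(keyed)[1])
```

With round robin, key 1 shows up on both partitions, so a downstream join, aggregation or remove-duplicates on that key would not see all of its rows together.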

In general, when you go from sequential to parallel, it's best to make no assumptions about sort order.

Mike

ray.wurlod · Post by **ray.wurlod** » Thu Jan 29, 2009 3:19 pm

dohertys wrote: If I use a sort stage, with the setting 'Don't sort - already sorted' will it generate an error if the data is not sorted correctly?

Yes
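
If you want to see what this kind of check amounts to, the idea can be sketched in a few lines of Python. This is an illustration of the concept only, not the Sort stage's actual code: `first_out_of_order` is a hypothetical helper that walks the rows of one partition in order and reports the first row whose key is lower than the key of the row before it.

```python
def first_out_of_order(rows, key=lambda r: r):
    """Return the index of the first row whose key is lower than the
    previous row's key, or None if the rows are in ascending order."""
    prev = None
    for i, row in enumerate(rows):
        k = key(row)
        if i > 0 and k < prev:
            return i
        prev = k
    return None


# Hypothetical rows for one partition: (customer_id, amount), keyed on customer_id.
partition = [(10, 5.0), (20, 1.5), (15, 7.25), (30, 2.0)]

bad = first_out_of_order(partition, key=lambda r: r[0])
if bad is not None:
    print(f"row {bad} is out of order: {partition[bad]}")
```

A check like this has to be run per partition, since data can be sorted within each partition without being sorted overall.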

dohertys · Post by **dohertys** » Tue Feb 10, 2009 2:59 am

Thanks

DSXchange
