Keeping data sorted whilst repartitioning.

Posted: Thu Jan 29, 2009 7:40 am
by dohertys
I'm trying to understand what happens to sorted data as it gets repartitioned.

I'm fairly happy to assume that if I have sorted data on multiple nodes and then repartition it, it will no longer be sorted. However, I'm not sure what would happen if I had data sorted on a single node and then repartitioned it to multiple nodes.

For example...
If I have a dataset which contains sorted data and was written using just 1 node, and I then read that dataset using a job that is running on multiple nodes, will that data still be sorted?

Is there a way I can confirm this? If I use a sort stage with the setting 'Don't sort - already sorted', will it generate an error if the data is not sorted correctly?

Thanks

Posted: Thu Jan 29, 2009 8:32 am
by Mike
Depends on your partitioning method...

Hash partition will certainly disrupt the sort order. Round robin partitioning will maintain sort order within the partitions, but round robin is not suitable for key-based operations.

In general, when you go from sequential to parallel, it's best to make no assumptions about sort order.
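To illustrate the round robin point, here is a minimal sketch in plain Python. It is not DataStage code: `round_robin` and `is_sorted` are hypothetical helper names, and the sketch assumes a single sorted input stream dealt out to partitions in turn. It shows each partition staying sorted, and also why round robin is a poor fit for key-based work, since equal keys can end up on different partitions.

```python
def round_robin(records, num_partitions):
    """Deal records to partitions in turn (hypothetical helper, not a DataStage API)."""
    partitions = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        partitions[i % num_partitions].append(rec)
    return partitions


def is_sorted(seq):
    """True if every element is >= the one before it."""
    return all(a <= b for a, b in zip(seq, seq[1:]))


# A single sorted stream dealt to 3 partitions: each partition is still sorted.
parts = round_robin(list(range(1, 11)), 3)
print(parts)                              # [[1, 4, 7, 10], [2, 5, 8], [3, 6, 9]]
print(all(is_sorted(p) for p in parts))   # True

# Duplicate keys get split across partitions, which is what breaks
# key-based operations such as joins or duplicate removal.
dup_parts = round_robin([1, 1, 2, 2, 3, 3], 2)
print(dup_parts)                          # [[1, 2, 3], [1, 2, 3]]
```

Note that this only models one sequential input. With several input partitions feeding the repartition, arrival order is not fixed, so the same guarantee does not follow from this sketch.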

Mike

Re: Keeping data sorted whilst repartitioning.

Posted: Thu Jan 29, 2009 3:19 pm
by ray.wurlod
dohertys wrote: If I use a sort stage, with the setting 'Don't sort - already sorted' will it generate an error if the data is not sorted correctly?
Yes
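
As a rough picture of what such a check amounts to, here is a sketch in plain Python. This is an assumption about the general idea only, not how the Sort stage is implemented: `assert_sorted` is a hypothetical name, and the sketch simply passes records through while comparing each key with the previous one and failing on the first record that is out of order.

```python
def assert_sorted(records, key=lambda r: r):
    """Yield records unchanged; raise ValueError on the first out-of-order key.

    Hypothetical helper for illustration, not a DataStage API.
    """
    previous = None
    first = True
    for position, rec in enumerate(records):
        k = key(rec)
        if not first and k < previous:
            raise ValueError(
                f"record {position} out of order: {k!r} follows {previous!r}"
            )
        previous, first = k, False
        yield rec


# Sorted input passes straight through.
ok = list(assert_sorted([1, 2, 2, 5]))

# Unsorted input fails as soon as the offending record is reached.
try:
    list(assert_sorted([1, 3, 2]))
    failed = False
except ValueError:
    failed = True
```

Run against a single partition's records, a check like this confirms order within that partition only; it says nothing about order across partitions.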

Posted: Tue Feb 10, 2009 2:59 am
by dohertys
Thanks