Keeping data sorted whilst repartitioning
Posted: Thu Jan 29, 2009 7:40 am
I'm trying to understand what happens to sorted data when it gets repartitioned.
I'm fairly happy to assume that if I have sorted data on multiple nodes, and then repartition it, then it cannot still be sorted. However, I'm not sure what would happen if I had data sorted on a single node and then repartition it to multiple nodes.
For example:
If I have a dataset which contains sorted data and was written using just 1 node, and I then read that dataset in a job that is running on multiple nodes, will the data still be sorted?
Is there a way I can confirm this? If I use a Sort stage with the setting 'Don't sort - already sorted', will it generate an error if the data is not sorted correctly?
Thanks
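To make the question concrete, here is a toy model in Python of what I think is going on. It is not DataStage code and says nothing definite about how the parallel engine behaves; the function names (round_robin, is_sorted) and the 4-node and 3-node layouts are made up for illustration. It assumes rows are dealt out to partitions in the order they arrive.

```python
from itertools import chain

def round_robin(rows, n):
    """Deal rows out to n partitions in arrival order (hypothetical partitioner)."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def is_sorted(rows):
    """True if every row is <= the row that follows it."""
    return all(a <= b for a, b in zip(rows, rows[1:]))

# Case 1: data sorted on a single node, then spread over 4 nodes.
single_node = list(range(20))
four_nodes = round_robin(single_node, 4)
each_partition_sorted = all(is_sorted(p) for p in four_nodes)

# Reading the partitions back one after another is not a global sort.
concatenated = list(chain.from_iterable(four_nodes))
globally_sorted = is_sorted(concatenated)

# Case 2: data already spread over 4 nodes, repartitioned to 3 nodes.
# Rows now arrive interleaved from several partitions, so order is lost.
three_nodes = round_robin(concatenated, 3)
still_sorted_after_second_repartition = all(is_sorted(p) for p in three_nodes)

print(each_partition_sorted, globally_sorted, still_sorted_after_second_repartition)
```

In this model each partition stays sorted in the single-node case, because every partition receives its rows in the original order, but the data is only sorted within each partition, not across them. Whether the real engine preserves arrival order in the same way is exactly what I am unsure about.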