I'm trying to understand what happens to sorted data as it gets repartitioned.
I'm fairly happy to assume that if I have sorted data on multiple nodes, and then repartition it, then it cannot still be sorted. However, I'm not sure what would happen if I had data sorted on a single node and then repartitioned it to multiple nodes.
For example...
If I have a dataset which contains sorted data and was written using just 1 node, and I then read that dataset using a job that is running on multiple nodes, will that data still be sorted?
Is there a way I can confirm this? If I use a sort stage with the setting 'Don't sort - already sorted', will it generate an error if the data is not sorted correctly?
Thanks
Keeping data sorted whilst repartitioning.
Depends on your partitioning method...
Hash partition will certainly disrupt the sort order. Round robin partitioning will maintain sort order within the partitions, but round robin is not suitable for key-based operations.
In general, when you go from sequential to parallel, it's best to make no assumptions about sort order.
Mike
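To make the round robin point concrete, here is a minimal Python sketch. It is a toy in-memory model, not DataStage code: the function names are made up for illustration, `key % n` stands in for a real hash function, and a single reader is assumed to hand out records in their original order. It shows round robin keeping each partition in order while spreading every key across partitions, which is why it does not suit key-based operations, whereas key-based partitioning keeps each key in one partition.

```python
from collections import defaultdict

def round_robin(records, n):
    """Deal records to n partitions in turn, preserving arrival order."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts

def hash_partition(records, n, key=lambda r: r[0]):
    """Route each record by its key; key % n is a stand-in for a real hash."""
    parts = [[] for _ in range(n)]
    for rec in records:
        parts[key(rec) % n].append(rec)
    return parts

def partitions_per_key(parts, key=lambda r: r[0]):
    """Map each key to the set of partition numbers it appears in."""
    seen = defaultdict(set)
    for p, part in enumerate(parts):
        for rec in part:
            seen[key(rec)].add(p)
    return dict(seen)

# Sorted input as a single node might have written it: (key, sequence) pairs.
records = [(k, s) for k in range(1, 5) for s in range(3)]

rr = round_robin(records, 2)
hp = hash_partition(records, 2)

# Round robin: every partition is still in sorted order...
print(all(part == sorted(part) for part in rr))
# ...but each key is split across partitions.
print(partitions_per_key(rr))
# Key-based partitioning: each key lives in exactly one partition.
print(partitions_per_key(hp))
```

This model says nothing about how a real parallel engine orders records after a hash repartition, so the advice above still applies: make no assumptions and verify or re-sort.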
Re: Keeping data sorted whilst repartitioning.
dohertys wrote: If I use a sort stage, with the setting 'Don't sort - already sorted' will it generate an error if the data is not sorted correctly?
Yes
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
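As an illustration of what an "already sorted" check amounts to, here is a minimal Python sketch. It is a conceptual model only, not how the Sort stage is implemented: `check_sorted` is a hypothetical helper that passes records through unchanged and raises an error at the first record whose key is lower than the previous one.

```python
def check_sorted(records, key=lambda r: r):
    """Yield records unchanged; raise ValueError at the first record
    that breaks ascending key order."""
    prev = None
    first = True
    for i, rec in enumerate(records):
        k = key(rec)
        if not first and k < prev:
            raise ValueError(f"record {i} out of order: {k!r} follows {prev!r}")
        prev, first = k, False
        yield rec

# Sorted input passes straight through.
print(list(check_sorted([1, 2, 2, 3])))

# Unsorted input fails as soon as the out-of-order record is reached.
try:
    list(check_sorted([1, 3, 2]))
except ValueError as err:
    print("error:", err)
```

Because the check is streaming, it only fails when an out-of-order record is actually read, so a clean run confirms the order of the data that passed through it.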