Hello,
I have a sequential file with 2000 million records that is already sorted on key1, key2 and key3. I am reading this file with a Sequential File stage running in sequential mode and then hash partitioning the data on key1 after the read.
Seq stage -> copy stage (input hash on key 1) -> Dataset stage
From the test I did, the data going into the dataset is sorted within each partition.
Is my understanding from the test correct, i.e. that sorted data remains sorted even after partitioning?
Thanks,
Ds
Read very high volume sorted sequential file
-
- Premium Member
- Posts: 72
- Joined: Mon Jul 06, 2009 9:34 pm
- Location: Sydney
The partitioning done on the input link of the Copy stage sorts the data on KEY1 within each partition, and I think when you say the data is sorted, you mean the KEY1 data is sorted per partition. So effectively the sort order in the dataset will be different from that of your source file.
Prakash Dasika
ETL Consultant
Sydney
Australia
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Thanks Prakash and Ray.
Ray,
When you say "Re-partitioning is almost guaranteed to destroy parallel-sorted data's sorted order.", I guess you mean the case where the input is a dataset containing parallel-sorted data. Correct?
In my scenario, since I am reading a CSV file in sequential mode and then partitioning it, I am guessing the data will stay sorted within any given partition (because of the first-in, first-out pipeline processing of the stage, and because the data is moving from a sequential stage to a parallel stage).
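To make that reasoning concrete, here is a small Python simulation rather than actual DataStage behaviour. It assumes the partitioner appends each record to its target partition in arrival order, and it uses a plain modulo on key1 as a stand-in for the real hash function. The names hash_partition and is_sorted are made up for this illustration. Case 1 models a single sequential reader; case 2 models two already-sorted upstream partitions whose records arrive interleaved, which is only one of many possible arrival orders in a parallel job.

```python
from collections import defaultdict


def hash_partition(records, num_partitions):
    """Send each record to partition (key1 % num_partitions), in arrival order.

    The modulo is a stand-in for the engine's real hash function (assumption).
    """
    partitions = defaultdict(list)
    for rec in records:
        partitions[rec[0] % num_partitions].append(rec)
    return partitions


def is_sorted(rows):
    """True if rows are in non-decreasing order on (key1, key2, key3)."""
    return all(a <= b for a, b in zip(rows, rows[1:]))


# Case 1: one sequential reader, input fully sorted on key1, key2, key3.
sorted_stream = [(k1, k2, k3)
                 for k1 in range(6) for k2 in range(2) for k3 in range(2)]
case1 = hash_partition(sorted_stream, 3)
case1_ok = all(is_sorted(p) for p in case1.values())

# Case 2: two upstream partitions, each sorted on its own, whose records
# reach the partitioner interleaved (one possible arrival order).
upstream_a = [(0, 0, 0), (2, 0, 0), (4, 0, 0)]
upstream_b = [(6, 0, 0), (8, 0, 0), (10, 0, 0)]
interleaved = [rec for pair in zip(upstream_a, upstream_b) for rec in pair]
case2 = hash_partition(interleaved, 2)
case2_ok = all(is_sorted(p) for p in case2.values())

print("sequential source keeps order per partition:", case1_ok)
print("parallel source keeps order per partition:", case2_ok)
```

Under these assumptions each partition in case 1 receives a subsequence of an already sorted stream, so it stays sorted on all three keys. In case 2 the records for one target partition come from two different sorted streams, so nothing guarantees their relative order.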
Please let me know your thoughts.
Thanks,
Ds