Read very high volume sorted sequential file

Posted: Mon Nov 15, 2010 5:09 pm
by ds_infy
Hello,

I have a sequential file with 2,000 million records that is already sorted on key1, key2 and key3. I read this file with a Sequential File stage running in sequential mode, and then hash partition the data on key1 after reading.

Seq stage -> copy stage (input hash on key 1) -> Dataset stage

From the test I ran, the data going into the Data Set is sorted within each partition.

Is my understanding from this test correct, that sorted data remains sorted even after partitioning?

Thanks,
Ds

Posted: Mon Nov 15, 2010 9:23 pm
by prakashdasika
The partitioning done on the input link of the Copy stage sorts the data on key1 within each partition, and I think when you say the data is sorted, you mean it is sorted on key1 per partition. So effectively the sort order in the Data Set will be different from your source file.

Posted: Mon Nov 15, 2010 11:07 pm
by ray.wurlod
Re-partitioning is almost guaranteed to destroy parallel-sorted data's sorted order. Partitioning of a sequential stream will tend to preserve sorted order.
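
To illustrate the point about a sequential stream, here is a minimal Python sketch (not DataStage code). It assumes integer keys and uses `key1 % num_partitions` as a stand-in for the hash partitioner; `hash_partition` is a hypothetical helper written for this example. Records are routed one at a time in arrival order, so each partition receives a subsequence of the sorted input and stays sorted.

```python
from itertools import product

def hash_partition(records, num_partitions):
    """Route each record to a partition by key1, preserving arrival order."""
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        # Assumption: modulo on an integer key1 stands in for the real hash.
        partitions[rec[0] % num_partitions].append(rec)
    return partitions

# Small stand-in for the sorted file: (key1, key2, key3) in ascending order.
records = sorted(product(range(10), range(3), range(2)))

partitions = hash_partition(records, 4)

# Each partition is a subsequence of a sorted stream, so it is still sorted.
all_sorted = all(p == sorted(p) for p in partitions)
print(all_sorted)
```

The check holds for any number of partitions, because a single reader hands records to the partitioner in file order.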

Posted: Tue Nov 16, 2010 12:15 am
by ds_infy
Thanks Prakash and Ray.

Ray,
When you say "Re-partitioning is almost guaranteed to destroy parallel-sorted data's sorted order.", I guess you mean the case where the input is a Data Set containing parallel-sorted data. Correct?

In my scenario, since I am reading a CSV file in sequential mode and then partitioning it, I am guessing the data will be sorted within a given partition (because of the first-in, first-out pipeline processing of the stage, and because the data is moving from a sequential stage to a parallel stage).

Please let me know your thoughts.

Thanks,
Ds

Posted: Tue Nov 16, 2010 1:21 am
by ray.wurlod
That's correct. However, if you had re-partitioned the data on the input to the Data Set, the data in the Data Set would almost certainly not be sorted.
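
As a rough sketch of why re-partitioning breaks the order, the Python below (again not DataStage code) starts from partitions that are each sorted and redistributes them on a different key. The assumptions are loud ones: integer keys, modulo as the hash, re-partitioning on key2, and a round-robin arrival order from the upstream partitions. In a real job the arrival order is not deterministic. `partition_by` and `interleave` are hypothetical helpers for this example.

```python
from itertools import product

def partition_by(records, key_index, num_partitions):
    """Route records to partitions by one key, preserving arrival order."""
    parts = [[] for _ in range(num_partitions)]
    for rec in records:
        parts[rec[key_index] % num_partitions].append(rec)
    return parts

def interleave(partitions):
    """Assumed arrival order: one record from each upstream partition in turn."""
    for row in zip(*partitions):
        yield from row

# Sorted input of (key1, key2, key3), partitioned on key1 from a single stream.
records = list(product(range(8), range(3), range(2)))
upstream = partition_by(records, 0, 4)

# Re-partition the parallel data on key2: each target now receives records
# from several upstream partitions, mixed together in arrival order.
downstream = partition_by(interleave(upstream), 1, 3)

upstream_sorted = all(p == sorted(p) for p in upstream)
downstream_sorted = all(p == sorted(p) for p in downstream)
print(upstream_sorted, downstream_sorted)
```

Each downstream partition is fed by several upstream partitions at once, so records with higher keys from one source can arrive before lower keys from another.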

Posted: Wed Nov 17, 2010 10:39 am
by ds_infy
Thanks Ray for clarifying!!