Read very high volume sorted sequential file

ds_infy · Post by **ds_infy** » Mon Nov 15, 2010 5:09 pm

Hello,

I have a sequential file with 2000 million records that is already sorted on key1, key2 and key3. I am reading this file with a Sequential File stage running in sequential mode and then hash partitioning the data on key1.

Seq stage -> copy stage (input hash on key 1) -> Dataset stage

From the test I ran, the data going into the dataset is sorted within each partition.

Is my understanding correct that sorted data remains sorted even after partitioning?

Thanks,
Ds

prakashdasika · Post by **prakashdasika** » Mon Nov 15, 2010 9:23 pm

The partitioning done on the input link of the Copy stage sorts the data on key1 for each partition, and I think when you say the data is sorted, you mean it is sorted on key1 within each partition. So effectively the sort order in the dataset will be different from your source file.

ray.wurlod · Post by **ray.wurlod** » Mon Nov 15, 2010 11:07 pm

Re-partitioning is almost guaranteed to destroy parallel-sorted data's sorted order. Partitioning of a sequential stream will tend to preserve sorted order.
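
To illustrate the second sentence, here is a minimal Python sketch (not DataStage code). It assumes a single sequential reader feeding a hash partitioner; the function name `hash_partition` and the sample rows are hypothetical. Each partition receives a subsequence of the sorted stream in arrival order, so it stays sorted.

```python
from collections import defaultdict

def hash_partition(rows, key_index, num_partitions):
    """Route each row to a partition by its key, keeping arrival order."""
    partitions = defaultdict(list)
    for row in rows:  # one sequential reader: rows arrive in file order
        partitions[hash(row[key_index]) % num_partitions].append(row)
    return partitions

# Hypothetical sample data, sorted on (key1, key2, key3) like the source file.
rows = sorted((k1, k2, k3) for k1 in range(6) for k2 in range(3) for k3 in range(2))

parts = hash_partition(rows, key_index=0, num_partitions=4)

# Every partition is a subsequence of a sorted stream, so it is still sorted.
still_sorted = all(p == sorted(p) for p in parts.values())
print(still_sorted)  # True
```

This only models the ordering argument; a real job has buffering and operator combination that the sketch ignores.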

ds_infy · Post by **ds_infy** » Tue Nov 16, 2010 12:15 am

Thanks Prakash and Ray.

Ray,
When you say "Re-partitioning is almost guaranteed to destroy parallel-sorted data's sorted order.", I guess you mean the case where the input is a dataset containing parallel-sorted data. Correct?

In my scenario, since I am reading a CSV file in sequential mode and then partitioning it, I am guessing the data will be sorted within a given partition (due to the first-in, first-out pipeline nature of the stage and because the data is moving from a sequential stage to a parallel stage).

Please let me know your thoughts.

Thanks,
Ds

ray.wurlod · Post by **ray.wurlod** » Tue Nov 16, 2010 1:21 am

That's correct. However, if you had re-partitioned the data on the input to the Data Set, the data in the Data Set would almost certainly not be sorted.
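
The re-partitioning case can be sketched the same way (again plain Python, not DataStage code). Assumptions: two upstream partitions that are each sorted, a re-partition on key2, and arrival order modelled as a round-robin interleave of the upstream partitions. A real engine gives no ordering guarantee across partitions, so the interleave is only one possible outcome. The name `repartition` and the rows are hypothetical.

```python
from itertools import zip_longest

def repartition(upstream, key_index, num_partitions):
    """Re-route rows that arrive from several upstream partitions at once."""
    downstream = [[] for _ in range(num_partitions)]
    # Assumed arrival order: one row from each upstream partition in turn.
    for batch in zip_longest(*upstream):
        for row in batch:
            if row is not None:
                downstream[row[key_index] % num_partitions].append(row)
    return downstream

# Two upstream partitions, hash-partitioned on key1 and each sorted on (key1, key2).
upstream = [
    [(0, 0), (0, 1), (4, 0), (4, 1)],
    [(1, 0), (1, 1), (3, 0), (3, 1)],
]

downstream = repartition(upstream, key_index=1, num_partitions=2)
print(downstream[0])  # [(0, 0), (1, 0), (4, 0), (3, 0)]
print([p == sorted(p) for p in downstream])  # [False, False]
```

Rows from different upstream partitions interleave in each downstream partition, which is why the sorted order is lost.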

ds_infy · Post by **ds_infy** » Wed Nov 17, 2010 10:39 am

Thanks Ray for clarifying!!