Read very high volume sorted sequential file

ds_infy
Premium Member
Posts: 59
Joined: Tue Jun 09, 2009 4:17 am
Location: India

Post by ds_infy »

Hello,

I have a sequential file with 2,000 million (2 billion) records that is already sorted on Key1, Key2 and Key3. I am reading this file with a Sequential File stage running in sequential mode and then hash partitioning the data on Key1 after the read.

Sequential File stage -> Copy stage (input link hash partitioned on Key1) -> Data Set stage

In the test I ran, the data going into the Data Set is sorted within each partition.

Is my conclusion from this test correct, that is, does sorted data remain sorted within each partition even after partitioning?

Thanks,
Ds
prakashdasika
Premium Member
Posts: 72
Joined: Mon Jul 06, 2009 9:34 pm
Location: Sydney

Post by prakashdasika »

The partitioning done on the input link of the Copy stage sorts the data on Key1 within each partition, so when you say the data is sorted, I expect you mean that the Key1 data is sorted per partition. Effectively, the sort order in the Data Set will be different from that of your source file.
Prakash Dasika
ETL Consultant
Sydney
Australia
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Re-partitioning is almost guaranteed to destroy parallel-sorted data's sorted order. Partitioning of a sequential stream will tend to preserve sorted order.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
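
To illustrate the second point above, here is a minimal Python sketch. It is not DataStage code. The function name `hash_partition`, the use of `zlib.crc32` as the hash and the tiny generated data are all assumptions made for illustration, and the real engine's hash partitioner will differ. The sketch assumes a single reader that hands rows to the partitioner in file order, which is the situation when the Sequential File stage runs in sequential mode.

```python
import zlib
from collections import defaultdict

def hash_partition(rows, key_index, num_partitions):
    """Route each row to a partition by hashing one key column.

    Rows are appended in arrival order, so each partition is a
    subsequence of the incoming stream.
    """
    partitions = defaultdict(list)
    for row in rows:
        key = str(row[key_index]).encode()
        partitions[zlib.crc32(key) % num_partitions].append(row)
    return [partitions[p] for p in range(num_partitions)]

# Hypothetical stand-in for the source file, sorted on (key1, key2, key3).
rows = sorted((k1, k2, k3)
              for k1 in range(20) for k2 in range(3) for k3 in range(2))

# Partition the sequential stream on key1 only.
parts = hash_partition(rows, key_index=0, num_partitions=4)

for number, part in enumerate(parts):
    # Report whether each partition is still in sorted order.
    print(number, len(part), part == sorted(part))
```

Because each partition receives its rows in the same relative order as the sorted input, every partition comes out sorted. The overall order across partitions is of course no longer that of the source file.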
ds_infy
Premium Member
Posts: 59
Joined: Tue Jun 09, 2009 4:17 am
Location: India

Post by ds_infy »

Thanks Prakash and Ray.

Ray,
When you say "Re-partitioning is almost guaranteed to destroy parallel-sorted data's sorted order.", I take it you mean the case where the input is a Data Set containing parallel-sorted data. Correct?

In my scenario, I am reading a CSV file in sequential mode and then partitioning it, so I am guessing the data will stay sorted within any given partition (because of the first-in, first-out pipeline processing of the stage and because the data moves from a sequential stage to a parallel stage).

Please let me know your thoughts.

Thanks,
Ds
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

That's correct. However, if you had re-partitioned the data on the input to the Data Set, the data in the Data Set would almost certainly not be sorted.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
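
The re-partitioning case can be sketched the same way. Again this is plain Python and not DataStage code. Modulus partitioning is swapped in for hash partitioning to keep the routing easy to follow, and a round-robin interleave stands in for the arrival order of rows coming from several upstream partitions, which in a real parallel job is not deterministic. All names here (`modulus_partition`, `repartition`) are hypothetical.

```python
from itertools import zip_longest

def modulus_partition(rows, key_index, num_partitions):
    """Route each row by key modulo partition count, in arrival order."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[row[key_index] % num_partitions].append(row)
    return parts

def repartition(partitions, key_index, num_partitions):
    """Re-route rows that arrive from several upstream partitions.

    Assumption: arrival is modelled as a round-robin interleave of the
    upstream partitions; a real engine gives no such guarantee.
    """
    arrival = [row
               for group in zip_longest(*partitions)
               for row in group if row is not None]
    return modulus_partition(arrival, key_index, num_partitions)

# Hypothetical sorted source with two integer keys (key1, key2).
rows = sorted((k1, k2) for k1 in range(6) for k2 in range(4))

# Sequential stream partitioned on key1: each partition stays sorted.
first = modulus_partition(rows, key_index=0, num_partitions=2)

# Parallel data re-partitioned on key2: rows from different upstream
# partitions interleave, so the sorted order is lost.
second = repartition(first, key_index=1, num_partitions=2)

print([part == sorted(part) for part in first])
print([part == sorted(part) for part in second])
```

In this sketch both partitions in `first` are still sorted, while neither partition in `second` is, because each target partition now mixes rows from two independently ordered upstream partitions.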
ds_infy
Premium Member
Posts: 59
Joined: Tue Jun 09, 2009 4:17 am
Location: India

Post by ds_infy »

Thanks for clarifying, Ray!