Sorting data - partition design question

dscon9128
Participant
Posts: 25
Joined: Fri Jun 13, 2008 9:11 am

Sorting data - partition design question

Post by dscon9128 »

Hi,

I have a question about the ideal partitioning strategy for sorting data and then landing it in two datasets.

My design is as follows:

(Inputdata) --> Sort stage --> Copy --> dataset1
                                |
                                +-----> dataset2

That is to say, sort the data and then copy the sorted data so that it lands in two different datasets.

The Sort stage sorts on two keys: sortcode (ranging from 1 to 12) and store#.

I have been experimenting with the partitioning to accomplish this, but I don't get what I need. When I use Auto partitioning everywhere in the stages above, the results look like this:

sortcode    store number    column1 ... column 'n'
4           A
4           B
4           C
6           X
6           Y
2           U
2           V
1           I
1           J
1           K
4           D
4           E

Basically, I need all the records with sort code 1 to come ahead of those with sort code 2, and so on.

The execution mode of the datasets is set to parallel.

Any help on how I could make this work with a single Sort stage is greatly appreciated!

Thanks in advance!
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Re: Sorting data - partition design question

Post by priyadarshikunal »

I can't make sense of that output. Generally a sort doesn't give output like that unless you have changed the partitioning.
Use hash partitioning on sortcode, unless you are using these datasets as the reference for a lookup.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Partition by sortcode (modulus or hash as the algorithm) and sort by sortcode then by store number. You did not give us the rule for what determines the data set into which a particular row goes - but presumably you can use a Switch, Filter or Transformer stage to effect that.
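
The partition-then-sort advice above can be pictured with a small plain-Python model. This is only an illustrative sketch, not DataStage code: the function name `partition_and_sort`, the two-partition setup, and the sample rows are assumptions made for the example. It uses modulus partitioning on sortcode, so every row with a given sortcode lands in the same partition, and then sorts each partition by sortcode and store number.

```python
from collections import defaultdict

def partition_and_sort(rows, num_partitions):
    """Toy model: modulus-partition on sortcode, then sort each partition.

    rows are (sortcode, store) tuples; this is a hypothetical helper,
    not a DataStage API.
    """
    partitions = defaultdict(list)
    for sortcode, store in rows:
        # Modulus partitioning: the key alone decides the partition.
        partitions[sortcode % num_partitions].append((sortcode, store))
    # Sort within each partition by sortcode, then store.
    return {p: sorted(part) for p, part in sorted(partitions.items())}

rows = [(4, "A"), (6, "X"), (2, "U"), (1, "I"),
        (4, "B"), (1, "J"), (6, "Y"), (2, "V")]
result = partition_and_sort(rows, 2)
for p, part in result.items():
    print(p, part)
```

In this model each partition is fully ordered and no sortcode is split across partitions, but nothing orders one partition relative to another.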
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
zulfi123786
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Re: Sorting data - partition design question

Post by zulfi123786 »

Hi,

Basically, the Sort stage sorts the data within each partition. At the output of the Sort stage the data is sorted within the partitions, and when it is written to the dataset the data from all partitions gets collected, so if you look at the entire set of data the sorting appears to be lost. To avoid this you can set the Sort stage to execute sequentially.

I am not sure of the requirement, but you don't need to have the data completely sorted in the dataset. Had the target been a sequential file, a complete sort would only serve to make the data look the way you want.
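
The effect described in this post, data sorted within each partition yet looking unsorted as a whole, can be reproduced with a toy model in plain Python. This is a sketch under assumptions: the two hard-coded partitions and the helper `is_sorted` are invented for illustration, and reading the partitions one after another is only a stand-in for however the rows actually get viewed.

```python
from itertools import chain

# Two partitions, each already sorted by (sortcode, store).
partitions = [
    [(2, "U"), (2, "V"), (4, "A"), (6, "X")],
    [(1, "I"), (1, "J"), (4, "D"), (4, "E")],
]

def is_sorted(rows):
    """Return True if rows are in non-decreasing order."""
    return all(a <= b for a, b in zip(rows, rows[1:]))

# Read the partitions one after another, with no merging.
concatenated = list(chain.from_iterable(partitions))

each_sorted = all(is_sorted(p) for p in partitions)
overall_sorted = is_sorted(concatenated)
print(each_sorted, overall_sorted)
```

Here every partition passes the check while the concatenated view does not, because sortcode 6 from the first partition is followed by sortcode 1 from the second.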
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The given job design uses Data Sets as targets. Please don't introduce "red herrings". You can use a sort/merge collector if you need a sequential file to preserve sorting; it is not necessary to force the Sort stage to execute in sequential mode.
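
The idea behind a sort/merge collector can be sketched in plain Python with `heapq.merge`, which repeatedly takes the smallest head row among inputs that are already sorted. This is an analogy under assumptions, not the DataStage implementation: the two sample partitions are invented, and each is assumed to be sorted on the same keys (sortcode, then store) that the merge compares.

```python
import heapq

# Two partitions, each already sorted by (sortcode, store).
partitions = [
    [(2, "U"), (2, "V"), (4, "A"), (6, "X")],
    [(1, "I"), (1, "J"), (4, "D"), (4, "E")],
]

# Merge the sorted partitions into one sorted stream without re-sorting.
collected = list(heapq.merge(*partitions))
for row in collected:
    print(row)
```

Because each input is already ordered, the merge only compares the current head of each partition, which is why the sort itself can stay parallel.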
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.