
Sorting data - partition design question

Posted: Tue Dec 02, 2008 7:48 pm
by dscon9128
Hi,

I have a question regarding the ideal partitioning strategy for sorting data and then landing it in 2 datasets.

My design is as follows:

(Inputdata) --> Sort stage --> Copy --> dataset1
                                |
                                +-----> dataset2

That is to say, sort the data and then copy the sorted data to land in 2 different datasets.

The Sort stage sorts on 2 keys, namely sortcode (ranging from 1-12) and store#.

I have been experimenting with the partitioning to accomplish this, but I don't get what I need. The results when I use Auto partitioning everywhere in the specified stages are similar to what is shown:

sortcode    store number    column1    column 'n'
4           A
4           B
4           C
6           X
6           Y
2           U
2           V
1           I
1           J
1           K
4           D
4           E

I basically need all the records with sort code 1 to be ahead of those with 2 and so on.

The execution mode of the datasets is set to be parallel.

Any help on how I could make this work with a single Sort stage is greatly appreciated!

Thanks in advance!!

Re: Sorting data - partition design question

Posted: Tue Dec 02, 2008 10:37 pm
by priyadarshikunal
I can't make sense of that output. Generally the job doesn't produce output like that unless you have changed the partitioning.
Use hash partitioning on sortcode, unless you are using these datasets as the reference input for a lookup.

Posted: Tue Dec 02, 2008 10:39 pm
by ray.wurlod
Partition by sortcode (modulus or hash as the algorithm) and sort by sortcode then by store number. You did not give us the rule for what determines the data set into which a particular row goes - but presumably you can use a Switch, Filter or Transformer stage to effect that.
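
As a rough illustration of that advice, here is a minimal sketch in plain Python (not DataStage). The function name, the sample rows and the partition count are made up for the example; it only mimics modulus partitioning on sortcode followed by a per-partition sort on sortcode and then store number.

```python
from collections import defaultdict

def partition_and_sort(rows, num_partitions):
    """Modulus-partition rows on sortcode, then sort each partition locally."""
    partitions = defaultdict(list)
    for sortcode, store in rows:
        # Every row with the same sortcode lands in the same partition.
        partitions[sortcode % num_partitions].append((sortcode, store))
    # Tuples compare by sortcode first, then by store number.
    return {p: sorted(part) for p, part in partitions.items()}

# Hypothetical sample rows: (sortcode, store number)
rows = [(4, "A"), (6, "X"), (2, "U"), (1, "I"), (4, "D"), (1, "K"), (2, "V")]
result = partition_and_sort(rows, num_partitions=2)

for p in sorted(result):
    print(p, result[p])
```

Each partition comes out ordered by sortcode and then store number, and no sortcode is split across partitions. The ordering holds within a partition, not across the partitions taken together.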

Re: Sorting data - partition design question

Posted: Sat Jan 10, 2009 6:44 am
by zulfi123786
Hi,

Basically, the Sort stage sorts the data within each partition. At the output of the Sort stage the data is sorted within the partitions, and when it is written to the dataset the data from all partitions gets collected. So if you look at the entire set of data, it looks as if the sorting has been lost. To avoid this you can set the Sort stage to execute sequentially.
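
The effect described above can be reproduced with a small Python sketch (an analogy, not DataStage code). Round-robin distribution across two partitions is an assumption chosen for illustration, as are the sample rows; the point is only that per-partition sorting does not give a global order when the partitions are read one after another.

```python
# Hypothetical sample rows: (sortcode, store number)
rows = [(4, "A"), (1, "I"), (6, "X"), (2, "U"), (4, "B"), (1, "J")]

num_partitions = 2
# Assumed round-robin distribution: row i goes to partition i % num_partitions.
partitions = [rows[i::num_partitions] for i in range(num_partitions)]

# Each partition is sorted independently, as a parallel sort would do.
sorted_partitions = [sorted(part) for part in partitions]

# Reading the partitions back to back, one after another.
combined = [row for part in sorted_partitions for row in part]

print(combined)
print(combined == sorted(rows))
```

Every partition is internally ordered, yet the combined view starts with sortcode 4 and only later reaches sortcode 1.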

I am not sure of the requirement, but you don't need to have the data completely sorted in the dataset. Had it been a sequential file, a full sort would only serve to make the data look the way you want it to.

Posted: Sat Jan 10, 2009 3:39 pm
by ray.wurlod
The given job design uses Data Sets as targets. Please don't introduce "red herrings". You can use a sort/merge collector if you need a sequential file to preserve sorting; it is not necessary to force the Sort stage to execute in sequential mode.
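
The idea behind a sort/merge collector can be sketched in Python with `heapq.merge` from the standard library. This is an analogy under stated assumptions, not DataStage code: the two partitions below are made-up sample data, already sorted on (sortcode, store number), and the merge reads them in key order to produce one fully sorted stream.

```python
import heapq

# Hypothetical partitions, each already sorted on (sortcode, store number).
partition_0 = [(4, "A"), (4, "B"), (6, "X")]
partition_1 = [(1, "I"), (1, "J"), (2, "U")]

# heapq.merge repeatedly takes the smallest head row among the sorted inputs,
# so the per-partition order is preserved and extended to a global order.
merged = list(heapq.merge(partition_0, partition_1))

print(merged)
```

The merge only works because each input is already sorted on the same keys; the sort itself can still run in parallel.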