Sorting in Datastage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
anu123
Premium Member
Premium Member
Posts: 143
Joined: Sun Feb 05, 2006 1:05 pm
Location: Columbus, OH, USA

Sorting in Datastage

Post by anu123 »

I have a sequential file with the below data.

column names: a,b,c,d,e,f
101,5,A,20,30,X
101,4,S,10,26,Z
101,1,C,15,24,R
101,3,B,20,44,S
102,7,W,20,20,D
102,2,F,40,53,T
102,1,Z,75,34,A
102,3,V,10,34,S
103,4,V,20,31,X
103,11,W,90,22,D
103,2,C,35,32,D
103,9,S,30,45,S

I am trying to get highest and second highest values for each group. To do that I am using the following steps.

After reading from Seq_File, passing to Sort Stage (key column a, assigning key change column) , filtering 1's and 0's (2 out put links) based of value in key change column and passing records with value 0 in key change column to another sort stage (key column a, assigning key change column) , filtering out 1's and passing joining the records based on key.

I think there should be a better approach than this. Can any one please suggest me the best approach ?

Expected Output:
101,5,A,20,30,X,4,S,10,26,Z
102,7,W,20,20,D,3,V,10,34,S
103,4,V,20,31,X,9,S,30,45,S

Thanks,
Anu
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

First off, highest and second highest based on what? For example, if it's column a and column b, then your expected output is incorrect.

Second, why does your expected output not include six rows (highest and second highest from each of three groups)?

Perhaps you need to give more thought to exactly what you are trying to accomplish. Document the transformation rules and this will form a specification for your ETL design.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ssnegi
Participant
Posts: 138
Joined: Thu Nov 15, 2007 4:17 am
Location: Sydney, Australia

Getting Highest, second highest and so on

Post by ssnegi »

Hash Partition colA and sort ascending. Dont partition, only Sort colB,C... descending. Then use stage variable to keep count. So for key 101 the first record would have count 1, second 2 and so on. The stage variable should get reset when 101 becomes 102. Then put constraint in transformer where stage variable is 1 or 2. This will ensure that only the first two records are output.
svcnt --> if InputColumnA = svcolAOld then svcntold+1 else 1
svcolAold --> svcolA
svcntold --> svcnt
Constraint1 --> svcnt = 1
Constraint 2 --> svcnt = 2
Dummy Output Column --> svcnt
Then join based on the keyA to get the joined desired output.
Input--> transformer --> join--> output
Else you can also get individual output :
Constraint : svcnt = 1 or svcnt = 2
This will output single records for highest and second highest.
You can use this solution to get any number of highest. Only the constraints condition would change accordingly.
Last edited by ssnegi on Thu Feb 27, 2014 5:43 pm, edited 10 times in total.
anu123
Premium Member
Premium Member
Posts: 143
Joined: Sun Feb 05, 2006 1:05 pm
Location: Columbus, OH, USA

Post by anu123 »

Ray, I am trying to get first and second highest records for each group, where a is key column and highest is decided based on column b.

Ans to your second question, I am trying to get the both the records as 1 records, that is why I am performing a join in the end.

Now I realized that my output will be wrong.
Thank you,
Anu
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You don't need sequential mode; just make sure that your data are partitioned on your first sorting/grouping key.

The rest is managed with a Sort stage (generating a Key Change column) and a Transformer stage (using stage variables). You not only have to detect the change, you also have to detect whether the current row is first after the change (use more stage variables for this).

The remainder follows pretty much as you'd planned.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
anu123
Premium Member
Premium Member
Posts: 143
Joined: Sun Feb 05, 2006 1:05 pm
Location: Columbus, OH, USA

Post by anu123 »

Ray, I did understood what you said. Can you please explain it clrealy.

Thanks
Thank you,
Anu
anu123
Premium Member
Premium Member
Posts: 143
Joined: Sun Feb 05, 2006 1:05 pm
Location: Columbus, OH, USA

Re: Getting Highest, second highest and so on

Post by anu123 »

Thank you, snegi.

I have partiotined Based on a and sorted on both a (asc) and on b (desc) in the input partitioning tab in transformer.

svcnt -> if i/p.a = svcolA then svcntOld+1 else 1
svcolA -> i/p.a
svcntOld -> svcnt

This helped me to generate sequence numbers in the order of values in col b, grouped by col a.....
Thank you,
Anu
Post Reply