I have a sequential file with the below data.
column names: a,b,c,d,e,f
101,5,A,20,30,X
101,4,S,10,26,Z
101,1,C,15,24,R
101,3,B,20,44,S
102,7,W,20,20,D
102,2,F,40,53,T
102,1,Z,75,34,A
102,3,V,10,34,S
103,4,V,20,31,X
103,11,W,90,22,D
103,2,C,35,32,D
103,9,S,30,45,S
I am trying to get highest and second highest values for each group. To do that I am using the following steps.
After reading from Seq_File, passing to Sort Stage (key column a, assigning key change column) , filtering 1's and 0's (2 out put links) based of value in key change column and passing records with value 0 in key change column to another sort stage (key column a, assigning key change column) , filtering out 1's and passing joining the records based on key.
I think there should be a better approach than this. Can any one please suggest me the best approach ?
Expected Output:
101,5,A,20,30,X,4,S,10,26,Z
102,7,W,20,20,D,3,V,10,34,S
103,4,V,20,31,X,9,S,30,45,S
Thanks,
Anu
Sorting in Datastage
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
First off, highest and second highest based on what? For example, if it's column a and column b, then your expected output is incorrect.
Second, why does your expected output not include six rows (highest and second highest from each of three groups)?
Perhaps you need to give more thought to exactly what you are trying to accomplish. Document the transformation rules and this will form a specification for your ETL design.
Second, why does your expected output not include six rows (highest and second highest from each of three groups)?
Perhaps you need to give more thought to exactly what you are trying to accomplish. Document the transformation rules and this will form a specification for your ETL design.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Getting Highest, second highest and so on
Hash Partition colA and sort ascending. Dont partition, only Sort colB,C... descending. Then use stage variable to keep count. So for key 101 the first record would have count 1, second 2 and so on. The stage variable should get reset when 101 becomes 102. Then put constraint in transformer where stage variable is 1 or 2. This will ensure that only the first two records are output.
svcnt --> if InputColumnA = svcolAOld then svcntold+1 else 1
svcolAold --> svcolA
svcntold --> svcnt
Constraint1 --> svcnt = 1
Constraint 2 --> svcnt = 2
Dummy Output Column --> svcnt
Then join based on the keyA to get the joined desired output.
Input--> transformer --> join--> output
Else you can also get individual output :
Constraint : svcnt = 1 or svcnt = 2
This will output single records for highest and second highest.
You can use this solution to get any number of highest. Only the constraints condition would change accordingly.
svcnt --> if InputColumnA = svcolAOld then svcntold+1 else 1
svcolAold --> svcolA
svcntold --> svcnt
Constraint1 --> svcnt = 1
Constraint 2 --> svcnt = 2
Dummy Output Column --> svcnt
Then join based on the keyA to get the joined desired output.
Input--> transformer --> join--> output
Else you can also get individual output :
Constraint : svcnt = 1 or svcnt = 2
This will output single records for highest and second highest.
You can use this solution to get any number of highest. Only the constraints condition would change accordingly.
Last edited by ssnegi on Thu Feb 27, 2014 5:43 pm, edited 10 times in total.
Ray, I am trying to get first and second highest records for each group, where a is key column and highest is decided based on column b.
Ans to your second question, I am trying to get the both the records as 1 records, that is why I am performing a join in the end.
Now I realized that my output will be wrong.
Ans to your second question, I am trying to get the both the records as 1 records, that is why I am performing a join in the end.
Now I realized that my output will be wrong.
Thank you,
Anu
Anu
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
You don't need sequential mode; just make sure that your data are partitioned on your first sorting/grouping key.
The rest is managed with a Sort stage (generating a Key Change column) and a Transformer stage (using stage variables). You not only have to detect the change, you also have to detect whether the current row is first after the change (use more stage variables for this).
The remainder follows pretty much as you'd planned.
The rest is managed with a Sort stage (generating a Key Change column) and a Transformer stage (using stage variables). You not only have to detect the change, you also have to detect whether the current row is first after the change (use more stage variables for this).
The remainder follows pretty much as you'd planned.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Re: Getting Highest, second highest and so on
Thank you, snegi.
I have partiotined Based on a and sorted on both a (asc) and on b (desc) in the input partitioning tab in transformer.
svcnt -> if i/p.a = svcolA then svcntOld+1 else 1
svcolA -> i/p.a
svcntOld -> svcnt
This helped me to generate sequence numbers in the order of values in col b, grouped by col a.....
I have partiotined Based on a and sorted on both a (asc) and on b (desc) in the input partitioning tab in transformer.
svcnt -> if i/p.a = svcolA then svcntOld+1 else 1
svcolA -> i/p.a
svcntOld -> svcnt
This helped me to generate sequence numbers in the order of values in col b, grouped by col a.....
Thank you,
Anu
Anu