Page 1 of 1

Sorting in Datastage

Posted: Wed Feb 26, 2014 6:16 pm
by anu123
I have a sequential file with the below data.

column names: a,b,c,d,e,f
101,5,A,20,30,X
101,4,S,10,26,Z
101,1,C,15,24,R
101,3,B,20,44,S
102,7,W,20,20,D
102,2,F,40,53,T
102,1,Z,75,34,A
102,3,V,10,34,S
103,4,V,20,31,X
103,11,W,90,22,D
103,2,C,35,32,D
103,9,S,30,45,S

I am trying to get highest and second highest values for each group. To do that I am using the following steps.

After reading from Seq_File, passing to Sort Stage (key column a, assigning key change column) , filtering 1's and 0's (2 out put links) based of value in key change column and passing records with value 0 in key change column to another sort stage (key column a, assigning key change column) , filtering out 1's and passing joining the records based on key.

I think there should be a better approach than this. Can any one please suggest me the best approach ?

Expected Output:
101,5,A,20,30,X,4,S,10,26,Z
102,7,W,20,20,D,3,V,10,34,S
103,4,V,20,31,X,9,S,30,45,S

Thanks,
Anu

Posted: Wed Feb 26, 2014 6:29 pm
by ray.wurlod
First off, highest and second highest based on what? For example, if it's column a and column b, then your expected output is incorrect.

Second, why does your expected output not include six rows (highest and second highest from each of three groups)?

Perhaps you need to give more thought to exactly what you are trying to accomplish. Document the transformation rules and this will form a specification for your ETL design.

Getting Highest, second highest and so on

Posted: Wed Feb 26, 2014 6:42 pm
by ssnegi
Hash Partition colA and sort ascending. Dont partition, only Sort colB,C... descending. Then use stage variable to keep count. So for key 101 the first record would have count 1, second 2 and so on. The stage variable should get reset when 101 becomes 102. Then put constraint in transformer where stage variable is 1 or 2. This will ensure that only the first two records are output.
svcnt --> if InputColumnA = svcolAOld then svcntold+1 else 1
svcolAold --> svcolA
svcntold --> svcnt
Constraint1 --> svcnt = 1
Constraint 2 --> svcnt = 2
Dummy Output Column --> svcnt
Then join based on the keyA to get the joined desired output.
Input--> transformer --> join--> output
Else you can also get individual output :
Constraint : svcnt = 1 or svcnt = 2
This will output single records for highest and second highest.
You can use this solution to get any number of highest. Only the constraints condition would change accordingly.

Posted: Wed Feb 26, 2014 7:01 pm
by anu123
Ray, I am trying to get first and second highest records for each group, where a is key column and highest is decided based on column b.

Ans to your second question, I am trying to get the both the records as 1 records, that is why I am performing a join in the end.

Now I realized that my output will be wrong.

Posted: Thu Feb 27, 2014 12:33 am
by ray.wurlod
You don't need sequential mode; just make sure that your data are partitioned on your first sorting/grouping key.

The rest is managed with a Sort stage (generating a Key Change column) and a Transformer stage (using stage variables). You not only have to detect the change, you also have to detect whether the current row is first after the change (use more stage variables for this).

The remainder follows pretty much as you'd planned.

Posted: Thu Feb 27, 2014 11:41 am
by anu123
Ray, I did understood what you said. Can you please explain it clrealy.

Thanks

Re: Getting Highest, second highest and so on

Posted: Thu Feb 27, 2014 1:40 pm
by anu123
Thank you, snegi.

I have partiotined Based on a and sorted on both a (asc) and on b (desc) in the input partitioning tab in transformer.

svcnt -> if i/p.a = svcolA then svcntOld+1 else 1
svcolA -> i/p.a
svcntOld -> svcnt

This helped me to generate sequence numbers in the order of values in col b, grouped by col a.....