Page 1 of 1

Sort with remove Dups

Posted: Wed Feb 18, 2015 11:12 am
by edward_m
To get the latest row based on the key columns i am using sort with remove duplicates stage, however its working for some records but not all, also seeing dups.

Input--->Sort--->remove Dups--->Output

Here is the data..

COL1 COL2 COL3 COL4 COL5
1 02/01/2015 02/28/2015 02/18/2015 01
1 03/01/2015 12/31/9999 02/18/2015 02
2 01/01/2015 02/28/2015 02/18/2015 01
3 01/01/2015 01/30/2015 02/18/2015 01
3 02/01/2015 12/31/9999 02/18/2015 02
3 03/01/2015 12/31/2015 02/18/2015 03

expected Output..
1 03/01/2015 12/31/9999 02/18/2015 02
2 01/01/2015 02/28/2015 02/18/2015 01
3 03/01/2015 12/31/2015 02/18/2015 03

In sort stage..partitiong set to Auto
Sorting Keys
key=COL1
Sort Key Mode=sort
Sort Order=Ascending
key=COL4
Sort Key Mode=sort
Sort Order=Ascending
key=COL2
Sort Key Mode=sort
Sort Order=Descending
key=COL4
Sort Key Mode=sort
Sort Order=Descending

In Remove Dups stage..partitiong set to Auto

Keys that define duplicates
Key=COl1
Duplicates to retain=First

I am not sure where i was wrong..please point me in right direction.

Thanks in advance.

Posted: Wed Feb 18, 2015 2:29 pm
by rkashyap
What is "SQL Type" of COL1 thru COL5?

Posted: Wed Feb 18, 2015 2:35 pm
by ray.wurlod
Seems to me that you want Duplicates to Retain=Last.

Posted: Fri Feb 20, 2015 2:18 pm
by edward_m
its working for some records not all, do i need to change anyhting with partitioning property as job runs with 4 node config file..

Thanks.

Posted: Sat Feb 21, 2015 1:20 am
by mobashshar
You have to make sure that you are selecting partitioning and sorting for COL1 and only sorting for the remaining columns.

Posted: Sat Feb 21, 2015 1:23 am
by mobashshar
and use only Input--->Remove Dups--->Output.
Use the input column sorting/partitioning in Remove Duplicate stage. No need to use Sort Stage for your requirement.