Facing issue with removing duplicates

SPuneet · Post by **SPuneet** » Fri Sep 07, 2012 12:40 am

I have a job where the data is as follows:

ID Seq col1 col2 col3
___________________
10 1 Y Null Null
10 2 N Null O
10 3 N Null Null
10 4 Y Null O
11 1 N Null Null
11 2 N A Null
11 3 Y B O

I have the data sorted by ID and Seq, both ascending, using a Sort operator. Now I need to retain the row with the last Seq number for each ID, i.e. the output should be:

ID Seq col1 col2 col3
___________________
10 4 Y Null O
11 3 Y B O

I am using a Sort operator (where I sort by ID and Seq ascending) followed by a Remove Duplicates operator. There I specify the key as ID and the duplicate to retain as 'Last'.

But I am not getting the desired result. It picks some row for each ID, but not the one with the last sequence number.

I need help finding where I am going wrong.

Regards,
SPuneet

ray.wurlod · Post by **ray.wurlod** » Fri Sep 07, 2012 1:13 am

Specify how the data are partitioned.

SPuneet · Post by **SPuneet** » Fri Sep 07, 2012 1:32 am

I am using Auto partitioning throughout.

ArndW · Post by **ArndW** » Fri Sep 07, 2012 2:33 am

Explicitly hash partition on "ID" as early in the job as possible and see if the result changes.
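
To illustrate the idea outside DataStage, here is a minimal Python sketch of what hash partitioning on "ID" is meant to guarantee: every row for a given ID lands in the same partition, so sorting by ID and Seq within that partition and keeping the last row per ID gives the expected answer. This is only a simulation using the sample rows from the first post; the function names (`hash_partition`, `keep_last_per_id`, `dedupe`) and the partition count are made up for illustration and are not DataStage APIs.

```python
from collections import defaultdict

# Sample rows from the original post: (ID, Seq, col1, col2, col3)
ROWS = [
    (10, 1, "Y", None, None),
    (10, 2, "N", None, "O"),
    (10, 3, "N", None, None),
    (10, 4, "Y", None, "O"),
    (11, 1, "N", None, None),
    (11, 2, "N", "A", None),
    (11, 3, "Y", "B", "O"),
]


def hash_partition(rows, num_partitions):
    """Send every row with the same ID to the same partition (hypothetical helper)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[0]) % num_partitions].append(row)
    return list(partitions.values())


def keep_last_per_id(partition):
    """Sort one partition by (ID, Seq), then keep the last row seen for each ID."""
    last = {}
    for row in sorted(partition, key=lambda r: (r[0], r[1])):
        last[row[0]] = row  # a later Seq overwrites an earlier one
    return list(last.values())


def dedupe(rows, num_partitions=4):
    """Partition on ID, de-duplicate each partition, and collect the results."""
    result = []
    for part in hash_partition(rows, num_partitions):
        result.extend(keep_last_per_id(part))
    return sorted(result, key=lambda r: r[0])


if __name__ == "__main__":
    for row in dedupe(ROWS):
        print(row)
```

If the rows for one ID were instead spread across partitions, each partition would keep its own "last" row, and the job would return more than one row per ID or a row that is not the true last one. That may be what is happening with Auto partitioning here, though the thread does not confirm it.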

Sagnik Mukherjee · Post by **Sagnik Mukherjee** » Fri Sep 07, 2012 5:58 am

Hi,
Is it OK if you get your desired output using only a Transformer?
Please let me know.

chulett · Post by **chulett** » Fri Sep 07, 2012 6:51 am

We already have topics on that from SPuneet.

They seem to be experimenting as we keep seeing basically the same set of incoming data with different requirements and techniques posted. We've already done aggregator and transformer solutions, guess it's now time for sorting and Remove Duplicates.