How to remove duplicates using LastRowInGroup?

deepa_shenoy · Post by **deepa_shenoy** » Thu Nov 10, 2011 12:25 am

Hi,

How to identify duplicates using LastRowInGroup() in the Transformer?

My input data is

ID EFF_D NAME COMPANY
1 2001-01-01 ABCD TET
1 2011-01-01 ABCD TET
2 2001-01-01 XYZ TS
3 1999-01-01 PQR WRO

My output data should be

ID EFF_D NAME COMPANY duplicate
1 2001-01-01 ABCD TET N
1 2011-01-01 ABCD TET Y
2 2001-01-01 XYZ TS N
3 1999-01-01 PQR WRO N

Thanks.

chulett · Post by **chulett** » Thu Nov 10, 2011 7:56 am

What have you tried?

And LastRowInGroup() doesn't identify duplicates; it simply lets you know whether you are looking at the last row in any given "group".
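
To illustrate the point, here is a minimal sketch in Python (a stand-in, not DataStage Transformer syntax) of what a "last row in group" flag reports for the sample data. The helper name `mark_last_in_group` is hypothetical, and the rows are assumed to be already sorted on the key.

```python
def mark_last_in_group(rows, key):
    """Pair each row with True if it is the last row of its key group.

    Assumes `rows` is already sorted on `key`. Hypothetical helper that
    only mimics the behaviour described in the post above.
    """
    rows = list(rows)
    result = []
    for i, row in enumerate(rows):
        # A row closes its group when it is the final row overall
        # or the next row carries a different key value.
        is_last = i == len(rows) - 1 or rows[i + 1][key] != row[key]
        result.append((row, is_last))
    return result


sample = [
    {"ID": 1, "EFF_D": "2001-01-01"},
    {"ID": 1, "EFF_D": "2011-01-01"},
    {"ID": 2, "EFF_D": "2001-01-01"},
    {"ID": 3, "EFF_D": "1999-01-01"},
]

flags = [is_last for _, is_last in mark_last_in_group(sample, "ID")]
print(flags)
```

The flag is also true for IDs 2 and 3, which have only one row each, so on its own it cannot separate duplicates from unique rows.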

ray.wurlod · Post by **ray.wurlod** » Thu Nov 10, 2011 3:21 pm

If you're keeping a count of the records in the group you can identify that the last record in the group is a duplicate of the first (and every other), but that does not give you the capacity to remove the duplicate(s). Why not use a Remove Duplicates stage?

chulett · Post by **chulett** » Thu Nov 10, 2011 4:52 pm

Looks like they really need to identify duplicates rather than remove them.

ray.wurlod · Post by **ray.wurlod** » Thu Nov 10, 2011 5:31 pm

Then all you need to do is to use stage variables to compare the current row with the previous row (assuming that they're appropriately sorted and partitioned) and set an indicator column value on the output.
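
The compare-with-previous-row idea can be sketched as follows. This is Python rather than Transformer derivations, so treat it as an illustration of the logic only: `prev_key` plays the role of a stage variable, `flag_duplicates` is a hypothetical name, and the sketch assumes that ID alone is the key and that a single sorted stream stands in for properly sorted and partitioned input.

```python
_NO_PREVIOUS = object()  # sentinel: no row has been seen yet


def flag_duplicates(rows, key="ID"):
    """Yield each row with an added 'duplicate' column ('Y' or 'N').

    Assumes `rows` is sorted on `key`, so equal keys are adjacent.
    """
    prev_key = _NO_PREVIOUS  # stand-in for a stage variable
    for row in rows:
        # Decide the flag first, using the key remembered from the prior row.
        is_dup = row[key] == prev_key
        yield {**row, "duplicate": "Y" if is_dup else "N"}
        # Only then remember this row's key for the next comparison.
        prev_key = row[key]


rows = [
    {"ID": "1", "EFF_D": "2011-01-01", "NAME": "ABCD", "COMPANY": "TET"},
    {"ID": "3", "EFF_D": "1999-01-01", "NAME": "PQR", "COMPANY": "WRO"},
    {"ID": "1", "EFF_D": "2001-01-01", "NAME": "ABCD", "COMPANY": "TET"},
    {"ID": "2", "EFF_D": "2001-01-01", "NAME": "XYZ", "COMPANY": "TS"},
]

# Sort on the key (and date) so duplicates sit next to each other.
rows_sorted = sorted(rows, key=lambda r: (r["ID"], r["EFF_D"]))
flagged = list(flag_duplicates(rows_sorted))

for r in flagged:
    print(r["ID"], r["EFF_D"], r["NAME"], r["COMPANY"], r["duplicate"])
```

The order of the two steps inside the loop matters: the comparison has to run before the "previous" value is overwritten, which is the same reason the comparing stage variable would sit above the one that stores the current key.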