How to delete all occurrences of duplicates from seq file

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
kmsekhar
Premium Member
Posts: 58
Joined: Fri Apr 16, 2010 12:58 pm
Location: Chn

How to delete all occurrences of duplicates from seq file

Post by kmsekhar »

Hi All,
I need to delete all occurrences of duplicates from a Sequential File input...

Input.txt
RNO|NAME|GRP|AGE
1|S|10|52
2|X|10|52
3|Y|20|52
1|Z|10|52
4|A|30|52

My Desired Output should be:

RNO|NAME|GRP|AGE
2|X|10|52
3|Y|20|52
4|A|30|52

I achieved this using a shell script, and also by loading into a DB and using HAVING COUNT(1) > 1..
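For reference, the shell-script route can be done in one `awk` command that reads the file twice (a sketch, assuming the pipe-delimited `Input.txt` shown above):

```shell
# Recreate the sample input from the post.
cat > Input.txt <<'EOF'
RNO|NAME|GRP|AGE
1|S|10|52
2|X|10|52
3|Y|20|52
1|Z|10|52
4|A|30|52
EOF

# Pass 1 (NR == FNR): count each RNO, skipping the header line.
# Pass 2: print the header plus only the rows whose RNO occurs exactly once.
awk -F'|' '
  NR == FNR { if (FNR > 1) cnt[$1]++; next }
  FNR == 1 || cnt[$1] == 1
' Input.txt Input.txt
```

Because RNO 1 appears twice, both of its rows are dropped, which matches the desired output above.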

Still, is there a better way to achieve this using any of the parallel stages?

Thanks in Advance

Regards,
Sekhar
vinothkumar
Participant
Posts: 342
Joined: Tue Nov 04, 2008 10:38 am
Location: Chennai, India

Post by vinothkumar »

Use the Remove Duplicates stage with GRP and AGE as key fields.
anbu
Premium Member
Posts: 596
Joined: Sat Feb 18, 2006 2:25 am
Location: india

Post by anbu »

Use an Aggregator stage to find the count per RNO, and join this result with your input to remove the rows having a count greater than 1.
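Outside DataStage, the same aggregate-then-join design can be mimicked with standard UNIX tools (a sketch, assuming the `Input.txt` layout from the original post; `uniq -u` plays the Aggregator/filter role and `join` plays the Join stage):

```shell
# Recreate the sample input from the post.
cat > Input.txt <<'EOF'
RNO|NAME|GRP|AGE
1|S|10|52
2|X|10|52
3|Y|20|52
1|Z|10|52
4|A|30|52
EOF

# Fork 1: sort the data rows (header removed) on the RNO key.
tail -n +2 Input.txt | sort -t'|' -k1,1 > sorted.txt
# Fork 2: keep only RNO values that occur exactly once (count == 1).
cut -d'|' -f1 sorted.txt | uniq -u > singles.txt
# Join: inner-join the unique keys back to the full rows.
head -n 1 Input.txt > Output.txt
join -t'|' singles.txt sorted.txt >> Output.txt
cat Output.txt
```

Both `join` inputs are sorted on the key, as `join` requires; every row whose RNO appears more than once is absent from `singles.txt` and therefore dropped.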
You are the creator of your destiny - Swami Vivekananda
siauchun84
Participant
Posts: 63
Joined: Mon Oct 20, 2008 12:01 am
Location: Malaysia

Re: How to delete all occurrences of duplicates from seq file

Post by siauchun84 »

I assume that you take the highest RNO as the surviving record if duplication occurs in GRP and AGE.

In this case, I would sort the records by GRP, AGE and then RNO. After that, use the Remove Duplicates stage with GRP and AGE as keys and retain first/last (depending on whether you sorted RNO ascending or descending).
kmsekhar
Premium Member
Posts: 58
Joined: Fri Apr 16, 2010 12:58 pm
Location: Chn

Re: How to delete all occurrences of duplicates from seq file

Post by kmsekhar »

I want to delete all occurrences, not keep one survivor....
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

This is a classic fork-join design, as others have indicated. Downstream of the Join stage, use a Filter stage or Transformer stage to pass only those records that have a count of 1.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply