How to delete all occurrences of duplicates from seq file

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
kmsekhar
Premium Member
Posts: 58
Joined: Fri Apr 16, 2010 12:58 pm
Location: Chn

How to delete all occurrences of duplicates from seq file

Post by kmsekhar »

Hi All,
I need to delete all occurrences of duplicates from a Sequential File input...

Input.txt
RNO|NAME|GRP|AGE
1|S|10|52
2|X|10|52
3|Y|20|52
1|Z|10|52
4|A|30|52

My Desired Output should be:

RNO|NAME|GRP|AGE
2|X|10|52
3|Y|20|52
4|A|30|52

I achieved this using a shell script, and also by loading into a DB and using HAVING COUNT(1) > 1..
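For reference, the shell-script route can be done in one `awk` command that reads the file twice (a sketch, assuming the pipe-delimited `Input.txt` shown above):

```shell
# Recreate the sample input from the post.
cat > Input.txt <<'EOF'
RNO|NAME|GRP|AGE
1|S|10|52
2|X|10|52
3|Y|20|52
1|Z|10|52
4|A|30|52
EOF

# Pass 1 (NR == FNR): count each RNO, skipping the header line.
# Pass 2: print the header plus only the rows whose RNO occurs exactly once.
awk -F'|' '
  NR == FNR { if (FNR > 1) cnt[$1]++; next }
  FNR == 1 || cnt[$1] == 1
' Input.txt Input.txt
```

Because RNO 1 appears twice, both of its rows are dropped, which matches the desired output above.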

Still, is there a better way to achieve this using any of the parallel stages?

Thanks in Advance

Regards,
Sekhar
vinothkumar
Participant
Posts: 342
Joined: Tue Nov 04, 2008 10:38 am
Location: Chennai, India

Post by vinothkumar »

Use the Remove Duplicates stage with GRP and AGE as key fields.
anbu
Premium Member
Posts: 596
Joined: Sat Feb 18, 2006 2:25 am
Location: india

Post by anbu »

Use an Aggregator stage to find the count per RNO, and join this result with your input to remove the rows having a count greater than 1.
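Outside DataStage, the same aggregate-then-join design can be mimicked with standard UNIX tools (a sketch, assuming the `Input.txt` layout from the original post; `uniq -u` plays the Aggregator/filter role and `join` plays the Join stage):

```shell
# Recreate the sample input from the post.
cat > Input.txt <<'EOF'
RNO|NAME|GRP|AGE
1|S|10|52
2|X|10|52
3|Y|20|52
1|Z|10|52
4|A|30|52
EOF

# Fork 1: sort the data rows (header removed) on the RNO key.
tail -n +2 Input.txt | sort -t'|' -k1,1 > sorted.txt
# Fork 2: keep only RNO values that occur exactly once (count == 1).
cut -d'|' -f1 sorted.txt | uniq -u > singles.txt
# Join: inner-join the unique keys back to the full rows.
head -n 1 Input.txt > Output.txt
join -t'|' singles.txt sorted.txt >> Output.txt
cat Output.txt
```

Both `join` inputs are sorted on the key, as `join` requires; every row whose RNO appears more than once is absent from `singles.txt` and therefore dropped.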
You are the creator of your destiny - Swami Vivekananda
siauchun84
Participant
Posts: 63
Joined: Mon Oct 20, 2008 12:01 am
Location: Malaysia

Re: How to delete all occurrences of duplicates from seq file

Post by siauchun84 »

I assume that you take the highest RNO as the surviving record if duplication occurs in GRP and AGE.

In this case, I would sort the records by GRP, AGE and then RNO. After that, use the Remove Duplicates stage with GRP and AGE as keys and retain first/last (depending on whether you sorted RNO ascending or descending).
kmsekhar
Premium Member
Posts: 58
Joined: Fri Apr 16, 2010 12:58 pm
Location: Chn

Re: How to delete all occurrences of duplicates from seq file

Post by kmsekhar »

I want to delete all occurrences, not keep one survivor....
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

This is a classic fork-join design, as others have indicated. Downstream of the Join stage, use a Filter stage or Transformer stage to pass only those records that have a count of 1.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply