How to Remove Duplicates from Flatfile?

nraj
Participant
Posts: 15
Joined: Tue Feb 22, 2005 9:22 am


Post by nraj »

Hi,

Could you please let us know how we can remove duplicates in DataStage jobs, and which UNIX command does the same job?


Thanks
Nraj
loveojha2
Participant
Posts: 362
Joined: Thu May 26, 2005 12:59 am

Post by loveojha2 »

There could be many solutions.
One would be: read each whole row of the sequential file as a single column, sort the rows, and use a Transformer with a stage variable that holds the previous row. Write each incoming row to another sequential file only if it is not equal to the previous row.
Then copy the new file over the source file (with the overwrite option) through an after-job subroutine.
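
Outside DataStage, the same sort-then-compare-with-previous logic can be sketched in a few lines of Python. This is only an illustration of the idea, not DataStage code: the function name `dedupe_sorted_rows` and the sample rows are made up, and the `previous` variable stands in for the stage variable.

```python
def dedupe_sorted_rows(rows):
    """Sort whole rows, then keep a row only if it differs from the previous one."""
    previous = None  # plays the role of the Transformer stage variable
    output = []
    for row in sorted(rows):
        # After sorting, identical rows are adjacent, so one comparison is enough.
        if row != previous:
            output.append(row)
        previous = row
    return output


if __name__ == "__main__":
    # Hypothetical sample data: each string is one whole row of the flat file.
    rows = ["b,2", "a,1", "b,2", "c,3", "a,1"]
    print(dedupe_sorted_rows(rows))  # ['a,1', 'b,2', 'c,3']
```

Note that this treats a duplicate as "the entire row is identical", and the output comes back in sorted order rather than the original file order.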
Last edited by loveojha2 on Thu Dec 01, 2005 2:23 am, edited 1 time in total.
Success consists of getting up just one more time than you fall.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You need to define "duplicate". This answer uses "has same key column(s)" as the definition.

The UNIX command is sort -u (plus any other command line options needed).

Most server job developers use a hashed file to remove duplicates, relying on the fact that any write to a hashed file with the same key is a destructive overwrite.
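
To show why the destructive overwrite removes duplicates, here is a rough Python analogy in which a dict plays the role of the hashed file. It is a sketch under assumptions, not how a hashed file is implemented: the function name `dedupe_by_key`, the comma delimiter, and the choice of the first column as the key are all hypothetical.

```python
def dedupe_by_key(rows, key_index=0, delimiter=","):
    """Keep one row per key; a later row with the same key replaces the earlier one."""
    latest = {}
    for row in rows:
        key = row.split(delimiter)[key_index]
        # Writing to an existing key overwrites the stored row,
        # which mirrors the destructive overwrite described above.
        latest[key] = row
    return list(latest.values())


if __name__ == "__main__":
    # Hypothetical sample data: the first column is the key.
    rows = ["1,alice", "2,bob", "1,alicia"]
    print(dedupe_by_key(rows))  # ['1,alicia', '2,bob']
```

With this approach the last row written for a key is the one that survives. A Python dict happens to keep insertion order; that ordering should not be assumed of a hashed file.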
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.