
How to Remove Duplicates from Flatfile?

Posted: Thu Dec 01, 2005 1:44 am
by nraj
Hi,

Could you please let us know how to remove duplicates in DataStage jobs, and which UNIX command does the same job?


Thanks
Nraj

Posted: Thu Dec 01, 2005 2:22 am
by loveojha2
There could be many solutions.
One would be to read each whole row of the sequential file as a single column, sort the rows, and use a Transformer with a stage variable that holds the previous row. Write each incoming row to another sequential file only if it is not equal to the previous row.
Then copy the new file over the source file, with the overwrite option, through an after-job subroutine.
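
Outside DataStage, the same compare-with-previous-row logic can be sketched in a shell pipeline. This is only an illustration of the idea, not the job design itself: the function name dedupe_rows is made up for this example, and awk's prev variable plays the role of the stage variable.

```shell
# Hypothetical helper (name is illustrative): drop exact duplicate rows.
# Sorting puts identical rows next to each other, so each row only needs
# to be compared with the one before it.
dedupe_rows() {
    sort | awk 'NR == 1 || $0 != prev { print } { prev = $0 }'
}

# Sample input: two of the five rows are exact repeats.
printf 'b,2\na,1\nb,2\na,1\nc,3\n' | dedupe_rows
```

To mimic the overwrite step, the output would be redirected to a new file and that file moved over the source only after the pipeline has finished.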

Posted: Thu Dec 01, 2005 2:23 am
by ray.wurlod
You need to define "duplicate". This answer uses "has same key column(s)" as the definition.

The UNIX command is sort -u (plus any other command line options needed).
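
As a rough example of the "other command line options", the sketch below assumes a comma-delimited file whose key is the first field. The delimiter, the key position, and the function name dedupe_by_key are assumptions for illustration only.

```shell
# Keep one row per key, where the key is assumed to be the first
# comma-separated field. Which of the duplicate rows survives is not
# guaranteed across sort implementations.
dedupe_by_key() {
    sort -t, -k1,1 -u
}

# Sample input: key 2 appears with two different values, key 1 twice.
printf '2,bob\n1,alice\n2,robert\n1,alice\n' | dedupe_by_key
```

With no -t or -k options, sort -u treats the whole line as the key, so only rows that are identical in every column are collapsed.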

Most server job developers use a hashed file to remove duplicates, relying on the fact that any write to a hashed file with the same key is a destructive overwrite.
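
The overwrite behaviour can be imitated outside DataStage with an awk array, which is only an analogy for a hashed file and not how the stage is implemented. The sketch assumes a comma-delimited file keyed on the first field; the function name last_row_per_key is made up for this example.

```shell
# Emulate a destructive overwrite: each row replaces any earlier row
# stored under the same key, so the last row written per key wins.
# awk's "for (k in row)" order is unspecified, hence the final sort.
last_row_per_key() {
    awk -F, '{ row[$1] = $0 } END { for (k in row) print row[k] }' | sort
}

# Sample input: key 2 is written twice; the later row replaces the first.
printf '2,bob\n1,alice\n2,robert\n' | last_row_per_key
```

Note the difference from the sort-and-compare approach: here the last row for each key is the one that remains.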