How to Remove Duplicates from Flatfile?

nraj · Post by **nraj** » Thu Dec 01, 2005 1:44 am

Hi,

Could you please let us know how we can remove duplicates in DataStage jobs, and which UNIX command does the same job?

Thanks
Nraj

loveojha2 · Post by **loveojha2** » Thu Dec 01, 2005 2:22 am

There could be many solutions.
One would be to read each whole row of the sequential file as a single column, sort the rows, and use a Transformer with a stage variable that holds the previous row. Write the incoming row to another sequential file only if it is not equal to the previous row.
Then copy the new file over the source file, with the overwrite option, through an after-job subroutine.
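
Outside DataStage, the same sort-then-compare-with-previous-row idea can be sketched on the command line. This is only a rough analogy, not the job design itself: the file names `input.txt` and `deduped.txt` and the sample rows are assumptions made up for illustration, and `awk` stands in for the Transformer with its stage variable.

```shell
# Hypothetical sample file; the name and contents are illustrative only.
printf 'b,2\na,1\nb,2\na,1\nc,3\n' > input.txt

# Sort so that identical rows become adjacent, then print a row only when
# it differs from the previous one (prev plays the role of the stage variable).
# Appending "" forces a string comparison even if a row looks numeric.
sort input.txt | awk 'NR == 1 || ($0 "") != (prev "") { print } { prev = $0 }' > deduped.txt

# Overwrite the source with the de-duplicated copy (the after-job step).
mv deduped.txt input.txt
```

After this runs, `input.txt` holds each distinct row once, in sorted order.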

ray.wurlod · Post by **ray.wurlod** » Thu Dec 01, 2005 2:23 am

You need to define "duplicate". This answer uses "has same key column(s)" as the definition.

The UNIX command is sort -u (plus any other command line options needed).
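
As a small illustration, assuming a comma-delimited file whose first field is the key (the file name `customers.txt` and its rows are made up for this example), `sort -u` can be applied either to whole rows or to the key field only:

```shell
# Hypothetical sample data: key in field 1, comma-delimited.
printf '101,alice\n102,bob\n101,alicia\n103,carol\n102,bob\n' > customers.txt

# Whole-row duplicates: only byte-identical rows are collapsed.
sort -u customers.txt > unique_rows.txt

# "Same key column" duplicates: compare on field 1 only, keep one row per key.
# Which row of an equal-key group is kept is not something to rely on.
sort -t, -k1,1 -u customers.txt > unique_keys.txt
```

Here `-t,` sets the field delimiter and `-k1,1` restricts the comparison to the first field, so the second command leaves one row for each of the three keys, while the first leaves four rows because `101,alice` and `101,alicia` are different rows.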

Most server job developers use a hashed file to remove duplicates, relying on the fact that any write to a hashed file with the same key is a destructive overwrite.
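
The destructive-overwrite behaviour can be imitated on the command line with an `awk` associative array. This is a loose analogy under stated assumptions, not how a hashed file is actually implemented: the file `customers.txt`, its rows, and the choice of the first comma-delimited field as the key are all hypothetical.

```shell
# Hypothetical sample data: key in field 1, comma-delimited.
printf '101,alice\n102,bob\n101,alicia\n103,carol\n102,bob\n' > customers.txt

# Each assignment to row[key] replaces whatever was stored for that key,
# so the last row written for a key is the one that survives.
# The for-in order is unspecified in awk, so the output is sorted afterwards.
awk -F, '{ row[$1] = $0 } END { for (k in row) print row[k] }' customers.txt \
  | sort > last_per_key.txt
```

With this input, key 101 ends up as `101,alicia` because that row was written after `101,alice`.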