Remove duplicate in server jobs

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


ketanshah123
Participant
Posts: 88
Joined: Wed Apr 05, 2006 1:04 am

Remove duplicate in server jobs

Post by ketanshah123 »

Hi,

We want to remove duplicate records from a sequential file in a server job. Our server runs Windows.

Please suggest how to do it.
rumu
Participant
Posts: 286
Joined: Mon Jun 06, 2005 4:07 am

Post by rumu »

Use a Hashed File stage to remove duplicates.

i.e. your job would be:

Seq File Stage ----> Hashed File Stage ----> Target
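A hashed file de-duplicates because it holds exactly one row per key: writing a row whose key already exists overwrites the earlier row, so the last duplicate wins. A rough Python sketch of that behaviour (the field names are made up for illustration):

```python
# Sketch of how a hashed file de-duplicates: one row per key,
# later writes overwrite earlier ones, so the LAST duplicate survives.
# Field names (cust_id, name) are hypothetical.
rows = [
    {"cust_id": 1, "name": "old value"},
    {"cust_id": 2, "name": "kept"},
    {"cust_id": 1, "name": "new value"},  # same key: overwrites the first row
]

hashed_file = {}            # key -> row, like a hashed file keyed on cust_id
for row in rows:
    hashed_file[row["cust_id"]] = row

deduped = list(hashed_file.values())
print(deduped)              # two rows; cust_id 1 carries "new value"
```

Note that if you need the *first* occurrence of each key rather than the last, a plain write to the hashed file gives you the wrong row.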
kris
Participant
Posts: 160
Joined: Tue Dec 09, 2003 2:45 pm
Location: virginia, usa

Re: Remove duplicate in server jobs

Post by kris »

ketanshah123 wrote: we want to remove duplicate records from seq. file in server job. our server is windows os server.
A hashed file is one way, but not always the best solution for de-duplication.

The approach you need will depend on a few other things: 1. what the logical key is that you want to dedupe on, and 2. how big the file is.

If the file is too big (>2GB), writing everything to a hashed file is not a solution unless your server is 64-bit and you have turned on the 64-bit hashed file option.

One solution is to write only the key fields to the hashed file on one output link, while reading (looking up) the same hashed file on a reference link to find out whether the key is already there. That way you are not writing the whole file to the hashed file and exceeding the size limit.
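The key-only approach can be pictured like this: remember only the keys already seen, and pass a row through only when its key is not there yet, so the first occurrence of each key survives. A Python sketch, with made-up field names:

```python
# Sketch of the key-only lookup approach: a set stands in for the
# keys-only hashed file; the FIRST occurrence of each key survives.
# Field names (cust_id, name) are hypothetical.
rows = [
    {"cust_id": 1, "name": "first"},
    {"cust_id": 2, "name": "only"},
    {"cust_id": 1, "name": "duplicate"},
]

seen_keys = set()
output = []
for row in rows:
    if row["cust_id"] not in seen_keys:  # reference lookup: key already there?
        seen_keys.add(row["cust_id"])    # output link: write just the key
        output.append(row)               # pass the row to the target

print(output)  # cust_id 1 keeps "first"; the later duplicate is dropped
```

Storing only keys keeps the hashed file small, which is the whole point when the source file approaches the 2GB limit.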

The other solution would be to sort the file on the logical key, feed it to a Transformer, and figure out which rows are duplicates with a few stage variables or a 'PreviousRowCompare' routine.

You have to choose what is best in your case.

Best regards,
~Kris
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

I agree that de-duplicating via hashed files is not the most efficient approach. Sort the incoming data on the fields you need for detecting duplicates, then use two stage variables in a Transformer stage (one to hold the result of comparing the current record with the last, the other to store the last record).
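The two-stage-variable pattern can be sketched in Python: `is_dup` plays the first stage variable (compare the current row's key with the previous one) and `last_key` the second (remember the previous key). Field names are illustrative only:

```python
# Sketch of sort + compare-previous de-duplication, mimicking a
# Transformer with two stage variables. Field names are hypothetical.
rows = [
    {"cust_id": 3, "name": "c"},
    {"cust_id": 1, "name": "a"},
    {"cust_id": 1, "name": "a-dup"},
    {"cust_id": 2, "name": "b"},
]

# Step 1: sort on the logical key (the Sort stage or a link sort).
rows.sort(key=lambda r: r["cust_id"])

# Step 2: the Transformer's two "stage variables":
#   is_dup   - result of comparing this row's key with the previous one
#   last_key - the previous row's key, updated after the comparison
last_key = None
output = []
for row in rows:
    is_dup = (row["cust_id"] == last_key)
    last_key = row["cust_id"]
    if not is_dup:            # output link constraint: drop duplicates
        output.append(row)

print([r["cust_id"] for r in output])  # [1, 2, 3]
```

Because duplicates sit next to each other after the sort, this needs no lookup structure at all, so file size is no longer a constraint.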
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

If you have MKS Toolkit, CygWin, or some other UNIX emulator, you could use a UNIX command such as uniq or sort -u to de-duplicate your data. If you have version 7.5x2, MKS Toolkit is installed with it.
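For the command-line route, `sort -u` sorts and de-duplicates on the whole line in one step, while `uniq` removes only *adjacent* duplicates and therefore needs a `sort` first. A minimal sketch (the file names are made up):

```shell
# Build a small sample file with duplicate lines (names are illustrative).
printf 'b\na\nb\na\n' > input.txt

# One step: sort and keep a single copy of each distinct line.
sort -u input.txt > deduped.txt

# Equivalent two-step form: uniq only drops ADJACENT duplicates,
# so the data must be sorted first.
sort input.txt | uniq > deduped2.txt

cat deduped.txt
```

Both forms leave just the lines `a` and `b`. To dedupe on a key field rather than the whole line, `sort` also takes `-t` (delimiter) and `-k` (key field) options.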
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.