Remove duplicate in server jobs

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


ketanshah123
Participant
Posts: 88
Joined: Wed Apr 05, 2006 1:04 am

Remove duplicate in server jobs

Post by ketanshah123 »

Hi,

We want to remove duplicate records from a sequential file in a server job. Our server runs Windows.

Please suggest how to do it.
rumu
Participant
Posts: 286
Joined: Mon Jun 06, 2005 4:07 am

Post by rumu »

Use a Hashed File stage to remove duplicates.

i.e. your job would be:

Seq File Stage ----> Hashed File Stage ----> Target
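A hashed file de-duplicates because it holds exactly one row per key: writing a row whose key already exists overwrites the earlier row, so the last duplicate wins. A rough Python sketch of that behaviour (the field names are made up for illustration):

```python
# Sketch of how a hashed file de-duplicates: one row per key,
# later writes overwrite earlier ones, so the LAST duplicate survives.
# Field names (cust_id, name) are hypothetical.
rows = [
    {"cust_id": 1, "name": "old value"},
    {"cust_id": 2, "name": "kept"},
    {"cust_id": 1, "name": "new value"},  # same key: overwrites the first row
]

hashed_file = {}            # key -> row, like a hashed file keyed on cust_id
for row in rows:
    hashed_file[row["cust_id"]] = row

deduped = list(hashed_file.values())
print(deduped)              # two rows; cust_id 1 carries "new value"
```

Note that if you need the *first* occurrence of each key rather than the last, a plain write to the hashed file gives you the wrong row.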
kris
Participant
Posts: 160
Joined: Tue Dec 09, 2003 2:45 pm
Location: virginia, usa

Re: Remove duplicate in server jobs

Post by kris »

ketanshah123 wrote: we want to remove duplicate records from seq. file in server job. our server is windows os server.
A hashed file is one way, but not always the best solution for de-duplication.

The approach you need will depend on a few other things: 1. what the logical key is that you want to dedupe on, and 2. how big the file is.

If the file is too big (>2GB), writing everything to a hashed file is not a solution unless your server is 64-bit and you have turned on the 64-bit hashed file option.

One solution is to write only the key fields to the hashed file on one output link, while reading (looking up) the same hashed file on a reference link to find out whether the key is already there. That way you are not writing the whole file to the hashed file and exceeding the size limit.
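The key-only approach can be pictured like this: remember only the keys already seen, and pass a row through only when its key is not there yet, so the first occurrence of each key survives. A Python sketch, with made-up field names:

```python
# Sketch of the key-only lookup approach: a set stands in for the
# keys-only hashed file; the FIRST occurrence of each key survives.
# Field names (cust_id, name) are hypothetical.
rows = [
    {"cust_id": 1, "name": "first"},
    {"cust_id": 2, "name": "only"},
    {"cust_id": 1, "name": "duplicate"},
]

seen_keys = set()
output = []
for row in rows:
    if row["cust_id"] not in seen_keys:  # reference lookup: key already there?
        seen_keys.add(row["cust_id"])    # output link: write just the key
        output.append(row)               # pass the row to the target

print(output)  # cust_id 1 keeps "first"; the later duplicate is dropped
```

Storing only keys keeps the hashed file small, which is the whole point when the source file approaches the 2GB limit.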

The other solution would be to sort the file on the logical key, feed it to a Transformer, and figure out which rows are duplicates with a few stage variables or a 'PreviousRowCompare' routine.

You have to choose what is best in your case.

Best regards,
~Kris
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

I agree that de-duplicating via hashed files is not the most efficient approach. Sort the incoming data on the fields you need for detecting duplicates, then use two stage variables in a Transformer stage (one to hold the result of comparing the current record with the last, the other to store the last record).
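The two-stage-variable pattern can be sketched in Python: `is_dup` plays the first stage variable (compare the current row's key with the previous one) and `last_key` the second (remember the previous key). Field names are illustrative only:

```python
# Sketch of sort + compare-previous de-duplication, mimicking a
# Transformer with two stage variables. Field names are hypothetical.
rows = [
    {"cust_id": 3, "name": "c"},
    {"cust_id": 1, "name": "a"},
    {"cust_id": 1, "name": "a-dup"},
    {"cust_id": 2, "name": "b"},
]

# Step 1: sort on the logical key (the Sort stage or a link sort).
rows.sort(key=lambda r: r["cust_id"])

# Step 2: the Transformer's two "stage variables":
#   is_dup   - result of comparing this row's key with the previous one
#   last_key - the previous row's key, updated after the comparison
last_key = None
output = []
for row in rows:
    is_dup = (row["cust_id"] == last_key)
    last_key = row["cust_id"]
    if not is_dup:            # output link constraint: drop duplicates
        output.append(row)

print([r["cust_id"] for r in output])  # [1, 2, 3]
```

Because duplicates sit next to each other after the sort, this needs no lookup structure at all, so file size is no longer a constraint.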
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

If you have MKS Toolkit, CygWin, or some other UNIX emulator, you could use a UNIX command such as uniq or sort -u to de-duplicate your data. If you have version 7.5x2, MKS Toolkit is installed with it.
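For the command-line route, `sort -u` sorts and de-duplicates on the whole line in one step, while `uniq` removes only *adjacent* duplicates and therefore needs a `sort` first. A minimal sketch (the file names are made up):

```shell
# Build a small sample file with duplicate lines (names are illustrative).
printf 'b\na\nb\na\n' > input.txt

# One step: sort and keep a single copy of each distinct line.
sort -u input.txt > deduped.txt

# Equivalent two-step form: uniq only drops ADJACENT duplicates,
# so the data must be sorted first.
sort input.txt | uniq > deduped2.txt

cat deduped.txt
```

Both forms leave just the lines `a` and `b`. To dedupe on a key field rather than the whole line, `sort` also takes `-t` (delimiter) and `-k` (key field) options.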
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.