
How to remove duplicates

Posted: Fri Dec 04, 2009 10:03 pm
by vemisr
Hi Experts,

DS 7.5

How do I remove duplicate values based on only one column, AcctID?

I have 3 sequential input files. I need to merge them and remove the duplicate values, but the Remove Duplicates stage is not available in Server jobs.

But I have to do this in Server jobs only.

The total number of records is more than 1B.


thx
Vemisr

Posted: Fri Dec 04, 2009 10:22 pm
by chulett
The way you do everything in a Server job - in a Transformer. Sort the data and then use stage variables and a constraint to allow only the first row to pass out from each 'duplicate group'. That or leverage the destructive overwrite functionality of a Hashed file with the proper fields set as Keys, last one in wins, no duplicates allowed.
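For what it's worth, here's a rough sketch of what that stage variable trick boils down to, in plain Python rather than DataStage syntax - the file names and the comma-delimited layout are just assumptions for illustration, and it presumes the merged input is already sorted on AcctID:

Code:

# Sketch only: assumes the merged input is already sorted on AcctID,
# that AcctID is the first comma-delimited field, and that the file
# names below are placeholders.

def dedup_first_per_key(in_path, out_path):
    prev_key = None                  # plays the part of a 'previous AcctID' stage variable
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            acct_id = line.split(",", 1)[0]
            if acct_id != prev_key:  # the constraint: only the first row of each group passes
                dst.write(line)
            prev_key = acct_id       # updated after the comparison, same as stage variable ordering

dedup_first_per_key("sorted_input.txt", "deduped_output.txt")

The Hashed file route is the same idea turned around: write every row keyed on AcctID and each later duplicate simply overwrites the earlier one.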

Posted: Fri Dec 04, 2009 10:58 pm
by vemisr
But the total volume of the input data is near 80M records, and sometimes it's more than 1B records.
What about performance? Performance is so critical.

But there is no other option; I have to do this in Server jobs only.

thx
Vemisr

Posted: Fri Dec 04, 2009 11:25 pm
by chulett
That little fact should probably have been in the original post, don'tcha think? :?

Regardless, the only issue with that volume is the sorting. Probably best to attempt that outside of a Server job, say via a command line high speed sort utility. I would also consider bulk loading all three files into a work table in your database of choice and then letting it do the de-duplication.
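If the de-duplication ends up in the database, the work table route is nothing fancy - roughly this, sketched with Python and sqlite3 purely as a stand-in (in real life it would be your actual DBMS and its bulk loader, and the file names and two-column layout are made up):

Code:

import sqlite3

def rows(path):
    # Assumes each record looks like "AcctID,rest of the row".
    with open(path) as src:
        for line in src:
            acct_id, _, rest = line.rstrip("\n").partition(",")
            yield acct_id, rest

conn = sqlite3.connect("work.db")
conn.execute("CREATE TABLE IF NOT EXISTS work_acct (acct_id TEXT, rest TEXT)")

# Load all three files (placeholder names) into the one work table.
for path in ("file1.txt", "file2.txt", "file3.txt"):
    conn.executemany("INSERT INTO work_acct VALUES (?, ?)", rows(path))
conn.commit()

# Let the database do the de-duplication: one row per AcctID.
# Which duplicate survives is up to you; MIN() here is just an example.
for acct_id, rest in conn.execute(
        "SELECT acct_id, MIN(rest) FROM work_acct GROUP BY acct_id"):
    print(acct_id, rest)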

Raw processing speed will be dependent on your hardware, including the disk subsystem. For whatever it is worth as a metric, I'm parsing 10M Apache log records with a Server job chock full o' Field functions on a not exactly large Linux system in less than 4 minutes.

Posted: Fri Dec 04, 2009 11:33 pm
by chulett
Nice that you went back and added the volume in the original post after the fact. Classy. And if speed is oh so critical then they should be willing to invest in the tools needed to make that happen, whether it's the Enterprise Edition or adequate hardware or a high-speed command line sort utility.

We need you to chop up at least a cord of firewood, up to perhaps 10 cords on a good day and performance is so critical. Here's your butter knife. Good luck son.

Disclaimer: In no way am I implying that Server is the butter knife of the ETL world. :wink:

Posted: Thu Dec 17, 2009 1:44 am
by biju.chiramel
Maybe


We can have a Link Collector with 3 input links for the sequential files with the same metadata, then a Sort stage on AcctID, then an Aggregator stage with group by on AcctID and the "First" function on the other fields.

Thanks

Posted: Thu Dec 17, 2009 7:27 am
by chulett
Unless the data is sorted on the grouping keys and properly 'asserted' in the stage, the Server aggregator stage will fall over dead if you push 80M records into it.