How to remove duplicates

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

vemisr
Participant
Posts: 72
Joined: Thu Sep 11, 2008 1:31 pm

How to remove duplicates

Post by vemisr »

Hi Experts,

DS 7.5

How do I remove duplicate values based on only one column, AcctID?

I have 3 sequential input files. I need to merge them and then remove the duplicate values, but the Remove Duplicates stage is not available in Server jobs.

But I have to do this in Server jobs only.

The total number of records is more than 1 billion.


thx
Vemisr
Last edited by vemisr on Fri Dec 04, 2009 11:24 pm, edited 1 time in total.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

The way you do everything in a Server job - in a Transformer. Sort the data and then use stage variables and a constraint to allow only the first row to pass out from each 'duplicate group'. That or leverage the destructive overwrite functionality of a Hashed file with the proper fields set as Keys, last one in wins, no duplicates allowed.
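
As a rough sketch of those derivations (the link and stage variable names here are only placeholders, and it assumes the input arrives already sorted by AcctID, with svPrevKey given an initial value that can never occur as a real AcctID):

    Stage variables, in this order (they are evaluated top to bottom for each row):

        svIsFirst : If DSLink.AcctID <> svPrevKey Then @TRUE Else @FALSE
        svPrevKey : DSLink.AcctID

    Output link constraint, so only the first row of each AcctID group passes:

        svIsFirst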
-craig

"You can never have too many knives" -- Logan Nine Fingers
vemisr
Participant
Posts: 72
Joined: Thu Sep 11, 2008 1:31 pm

Post by vemisr »

But the total volume of the input data is around 80 million records, and at times it's more than 1 billion records.
What about the performance? Performance is so critical.

But there is no other option; I have to do this in Server jobs only.

thx
Vemisr
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

That little fact should probably have been in the original post, don'tcha think? :?

Regardless, the only issue with that volume is the sorting. Probably best to attempt that outside of a Server job, say via a command line high speed sort utility. I would also consider bulk loading all three files into a work table in your database of choice and then letting it do the de-duplication.

Raw processing speed will be dependent on your hardware, including the disk subsystem. For whatever it is worth as a metric, I'm parsing 10M apache log records with a Server job chock full o' Field functions on a not exactly large Linux system in less than 4 minutes.
-craig

"You can never have too many knives" -- Logan Nine Fingers
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Nice that you went back and added the volume in the original post after the fact. Classy. And if speed is oh so critical then they should be willing to invest in the tools needed to make that happen, whether it's the Enterprise Edition or adequate hardware or a high-speed command line sort utility.

We need you to chop up at least a cord of firewood, up to perhaps 10 cords on a good day, and performance is so critical. Here's your butter knife. Good luck, son.

Disclaimer: In no way am I implying that Server is the butter knife of the ETL world. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
biju.chiramel
Participant
Posts: 5
Joined: Mon Oct 29, 2007 9:55 pm
Location: Mumbai

Post by biju.chiramel »

Maybe:

We can have a Link Collector with 3 input links for the sequential files with the same metadata, then a Sort stage on AcctID, then an Aggregator stage with group by on AcctID and the "First" function on the other fields.

Thanks
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Unless the data is sorted on the grouping keys and properly 'asserted' in the stage, the Server aggregator stage will fall over dead if you push 80M records into it.
-craig

"You can never have too many knives" -- Logan Nine Fingers