How to remove duplicates

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

vemisr
Participant
Posts: 72
Joined: Thu Sep 11, 2008 1:31 pm

How to remove duplicates

Post by vemisr »

Hi Experts,

DS 7.5

How do I remove duplicate values based on only one column, AcctID?

I have 3 sequential input files. I need to merge them and then remove the duplicate values, but the Remove Duplicates stage is not available in Server jobs.

But I have to do this in Server jobs only.

The total number of records is more than 1 billion.


thx
Vemisr
Last edited by vemisr on Fri Dec 04, 2009 11:24 pm, edited 1 time in total.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

The way you do everything in a Server job - in a Transformer. Sort the data and then use stage variables and a constraint to allow only the first row to pass out from each 'duplicate group'. That or leverage the destructive overwrite functionality of a Hashed file with the proper fields set as Keys, last one in wins, no duplicates allowed.
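
As a rough sketch of those derivations (the link and stage variable names here are only placeholders, and it assumes the input arrives already sorted by AcctID, with svPrevKey given an initial value that can never occur as a real AcctID):

    Stage variables, in this order (they are evaluated top to bottom for each row):

        svIsFirst : If DSLink.AcctID <> svPrevKey Then @TRUE Else @FALSE
        svPrevKey : DSLink.AcctID

    Output link constraint, so only the first row of each AcctID group passes:

        svIsFirst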
-craig

"You can never have too many knives" -- Logan Nine Fingers
vemisr
Participant
Posts: 72
Joined: Thu Sep 11, 2008 1:31 pm

Post by vemisr »

But the total volume of the input data is around 80 million records, and at times it's more than 1 billion records.
What about the performance? Performance is so critical.

But there is no other option; I have to do this in Server jobs only.

thx
Vemisr
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

That little fact should probably have been in the original post, don'tcha think? :?

Regardless, the only issue with that volume is the sorting. Probably best to attempt that outside of a Server job, say via a command line high speed sort utility. I would also consider bulk loading all three files into a work table in your database of choice and then letting it do the de-duplication.

Raw processing speed will be dependent on your hardware, including the disk subsystem. For whatever it is worth as a metric, I'm parsing 10M apache log records with a Server job chock full o' Field functions on a not exactly large Linux system in less than 4 minutes.
-craig

"You can never have too many knives" -- Logan Nine Fingers
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Nice that you went back and added the volume in the original post after the fact. Classy. And if speed is oh so critical then they should be willing to invest in the tools needed to make that happen, whether it's the Enterprise Edition or adequate hardware or a high-speed command line sort utility.

We need you to chop up at least a cord of firewood, up to perhaps 10 cords on a good day, and performance is so critical. Here's your butter knife. Good luck, son.

Disclaimer: In no way am I implying that Server is the butter knife of the ETL world. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
biju.chiramel
Participant
Posts: 5
Joined: Mon Oct 29, 2007 9:55 pm
Location: Mumbai

Post by biju.chiramel »

Maybe:

We can have a Link Collector with 3 input links for the sequential files with the same metadata, then a Sort stage on AcctID, then an Aggregator stage with group by on AcctID and the "First" function on the other fields.

Thanks
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Unless the data is sorted on the grouping keys and properly 'asserted' in the stage, the Server aggregator stage will fall over dead if you push 80M records into it.
-craig

"You can never have too many knives" -- Logan Nine Fingers