Hi Experts,
DS 7.5.
How do I remove duplicate values based on only one column, AcctID? I have 3 sequential input files; I need to merge them and then remove the duplicate values, but the Remove Duplicates stage does not exist in Server jobs, and I have to do this in Server jobs only.
The total number of records is more than 1 billion.
thx
Vemisr
How to remove duplicates
Last edited by vemisr on Fri Dec 04, 2009 11:24 pm, edited 1 time in total.
The way you do everything in a Server job - in a Transformer. Sort the data and then use stage variables and a constraint to allow only the first row to pass out from each 'duplicate group'. That or leverage the destructive overwrite functionality of a Hashed file with the proper fields set as Keys, last one in wins, no duplicates allowed.
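For illustration only, the two options above can be sketched in Python (a minimal sketch; the function names and row layout are mine, not DataStage constructs — the key column stands in for AcctID):

```python
def dedup_sorted_first_wins(rows, key_index=0):
    """Transformer approach: input must already be sorted on the key column.
    A 'previous key' holder (the stage variable) plus a constraint lets only
    the first row of each duplicate group pass."""
    prev_key = object()          # sentinel that can never equal a real key
    for row in rows:
        if row[key_index] != prev_key:
            yield row            # first row of the group passes the constraint
        prev_key = row[key_index]

def dedup_hashed_last_wins(rows, key_index=0):
    """Hashed file approach: writing to a keyed store is a destructive
    overwrite, so the last row seen for each key wins."""
    store = {}
    for row in rows:
        store[row[key_index]] = row   # overwrite on key, like a Hashed file
    return list(store.values())
```

Note the two approaches differ in which duplicate survives: the Transformer sketch keeps the first row per AcctID, the Hashed-file sketch keeps the last.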
-craig
"You can never have too many knives" -- Logan Nine Fingers
That little fact should probably have been in the original post, don'tcha think?
Regardless, the only issue with that volume is the sorting. Probably best to attempt that outside of a Server job, say via a command line high speed sort utility. I would also consider bulk loading all three files into a work table in your database of choice and then letting it do the de-duplication.
Raw processing speed will be dependent on your hardware, including the disk subsystem. For whatever it is worth as a metric, I'm parsing 10M Apache log records with a Server job chock full o' Field functions on a not exactly large Linux system in less than 4 minutes.
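Assuming the three files have already been sorted on AcctID by an external high-speed sort, a minimal Python sketch of the merge-and-dedup step could look like this (the file paths and the comma-separated, key-first layout are assumptions, not anything from the job design):

```python
import heapq

def merge_dedup_sorted_files(paths, out_path):
    """Merge text files already sorted on the first comma-separated field
    (AcctID) and keep only the first line seen for each AcctID."""
    def key(line):
        return line.split(",", 1)[0]

    files = [open(p) for p in paths]
    prev = None
    try:
        with open(out_path, "w") as out:
            # heapq.merge streams the inputs, so 1B+ rows never need to
            # fit in memory -- only the sort itself must be done externally.
            for line in heapq.merge(*files, key=key):
                k = key(line)
                if k != prev:
                    out.write(line)   # first line for this AcctID wins
                prev = k
    finally:
        for f in files:
            f.close()
```

The same first-wins rule applies here as in the Transformer approach; bulk loading into a work table and letting the database de-duplicate is the other route mentioned above.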
-craig
"You can never have too many knives" -- Logan Nine Fingers
Nice that you went back and added the volume in the original post after the fact. Classy. And if speed is oh so critical then they should be willing to invest in the tools needed to make that happen, whether it's the Enterprise Edition or adequate hardware or a high-speed command line sort utility.
We need you to chop up at least a cord of firewood, up to perhaps 10 cords on a good day and performance is so critical. Here's your butter knife. Good luck son.
Disclaimer: In no way am I implying that Server is the butter knife of the ETL world. :wink:
-craig
"You can never have too many knives" -- Logan Nine Fingers