
Removing duplicates from 20 million records

Posted: Tue Jul 18, 2006 12:27 am
by m_keerthi2005
Hi all,

We are facing a problem with removing duplicates. We have two files, each with 10 million records. When we remove duplicates using an Aggregator stage on 3 key columns, we hit the Aggregator's memory limitation: the job aborts once memory usage reaches 2 GB, i.e. after about 15 lakh (1.5 million) records.

Could you please suggest an approach to resolve this issue?

Thanks in advance.

Re: Removing duplicates from 20 million records

Posted: Tue Jul 18, 2006 12:44 am
by kris007
Are you using server or parallel jobs?
If server jobs:
One way to bypass the Aggregator limitation is to sort the file externally (the internal sort also consumes space) and then check for duplicates within the Transformer using stage variables.
If you are using parallel jobs, you can use the Remove Duplicates stage (in conjunction with the Sort stage).
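
A minimal sketch of the stage-variable approach, assuming the input link is already sorted on the key columns (the link name In and the column names Key1, Key2, Key3 are hypothetical). Stage variables are evaluated top to bottom, so the previous-key assignment must come last:

  svCurrKey = In.Key1 : "|" : In.Key2 : "|" : In.Key3
  svIsDup   = (svCurrKey = svPrevKey)
  svPrevKey = svCurrKey

On the output link, use the constraint NOT(svIsDup) so that only the first row of each key group is written. This only works because the rows arrive sorted, which is why the external sort is needed first.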

Posted: Tue Jul 18, 2006 12:55 am
by m_keerthi2005
Thanks for the fast reply.
We are using server jobs. Could you please tell us how exactly we need to use stage variables to get this done?

Re: Removing duplicates from 20 million records

Posted: Tue Jul 18, 2006 12:58 am
by rumu
kris007 wrote:Are you using server or parallel jobs?
If server jobs:
One way to bypass the Aggregator limitation is to sort the file externally (the internal sort also consumes space) and then check for duplicates within the Transformer using stage variables.
If you are using parallel jobs, you can use the Remove Duplicates stage (in conjunction with the Sort stage).
Hi,
If your input is a sequential file, then check the Filter option and use the /bin/sort command to sort your files.
If you are on a UNIX server, you can sort the files with a script instead.
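
For example, assuming a pipe-delimited file whose first three fields are the key columns (both assumptions), the Filter command on the Sequential File stage might be something like:

  /bin/sort -t '|' -k1,1 -k2,2 -k3,3

The stage then reads the output of that command, so the rows reach the Transformer already grouped by key.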

Posted: Tue Jul 18, 2006 1:05 am
by balajisr
You can also preprocess the file using the UNIX sort -u command.
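
A rough sketch, again assuming a pipe delimiter, the keys in the first three fields and hypothetical file names:

  sort -t '|' -k1,1 -k2,2 -k3,3 -u -o deduped.txt file1.txt file2.txt

With -u, sort keeps only the first line of each group that compares equal on the given keys, so the data is already de-duplicated before DataStage reads it.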

Posted: Tue Jul 18, 2006 1:12 am
by m_keerthi2005
We can sort the data before doing the aggregation. Would this work for 10 million records?

Any thoughts on this?

Posted: Tue Jul 18, 2006 1:14 am
by ArndW
Sorting a mere 10 million records should be no problem. You might need to tell the sort command to use a different temporary directory if your /tmp is dimensioned rather small. Using sorted data on the aggregator stage will speed up the processing and also have the effect that the stage will use very little memory.
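
For instance, -T points sort at a scratch directory with more free space (the path below is hypothetical, and the delimiter/key assumptions are as before):

  sort -T /data/sorttmp -t '|' -k1,1 -k2,2 -k3,3 -u -o sorted.txt input.txt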

Posted: Tue Jul 18, 2006 1:32 am
by DSguru2B
Actually, if you are using UNIX commands then you might as well use the UNIX uniq command to remove duplicates.
Another question: why are you using the Aggregator to remove duplicates? Why not just pass the input through a hashed file?
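
A sketch of that pipeline, under the same delimiter/key assumptions as above:

  sort -t '|' -k1,1 -k2,2 -k3,3 input.txt | uniq > deduped.txt

Note that uniq compares whole lines, so it only collapses records that are identical in every column; if duplicates are defined by the three keys alone, sort -u on those keys is the closer match.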

Posted: Tue Jul 18, 2006 1:36 am
by m_keerthi2005
We have also tried with a hashed file, but that was taking a lot more time than the DB query. We are using only Windows and server jobs to resolve this issue; we are not using any UNIX commands.

Posted: Tue Jul 18, 2006 1:42 am
by ray.wurlod
QualityStage can perform single file unduplication (even using fuzzy matching criteria) as well as two-file matches and removal of duplicates therefrom using various strategies.

Posted: Tue Jul 18, 2006 2:08 am
by m_keerthi2005
We do not have QualityStage; we have to resolve this using server jobs.

Could you please throw some more light on how to resolve this issue?

Posted: Tue Jul 18, 2006 2:21 am
by ArndW
M_keerthi2005,

you have now been given some possible solutions. What more do you need? You can pre-sort your data in UNIX and then feed it to DataStage for deduplication, or you can sort and remove duplicates directly in the UNIX sort.

Posted: Tue Jul 18, 2006 2:40 am
by kumar_s
UNIX sort will also fail with that many records. I have tried it on AIX.

Posted: Tue Jul 18, 2006 3:08 am
by ArndW
Kumar,

on AIX I have sorted many more records than that. In fact, I called a UNIX sort from a DataStage job earlier today on about 48 million records.

Posted: Tue Jul 18, 2006 3:20 am
by kumar_s
ArndW wrote:Kumar,

on AIX I have sorted many more records than that. In fact, I called a UNIX sort from a DataStage job earlier today on about 48 million records.
Perhaps I need to check whether any parameter change is required. :?