
Removing duplicates from 20 million records

Posted: Tue Jul 18, 2006 12:27 am
by m_keerthi2005
Hi all,

We are facing a problem with removing duplicates. We have two files, each with 10 million records. When we remove duplicates using an Aggregator stage on 3 key columns, we hit the Aggregator's memory limitation: the job aborts once memory usage reaches 2 GB, i.e. after about 15 lakh (1.5 million) records.

Could you please suggest an approach to resolve this issue?

Thanks in advance.

Re: Removing duplicates from 20 million records

Posted: Tue Jul 18, 2006 12:44 am
by kris007
Are you using server or parallel jobs?
If server jobs:
One way to bypass the Aggregator limitation is to sort the file externally (the internal sort also consumes space) and then check for duplicates within the Transformer using stage variables.
If you are using parallel jobs, you can use the Remove Duplicates stage (in conjunction with the Sort stage).
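
A minimal sketch of the stage-variable approach, assuming the input link is already sorted on the key columns (the link name In and the column names Key1, Key2, Key3 are hypothetical). Stage variables are evaluated top to bottom, so the previous-key assignment must come last:

  svCurrKey = In.Key1 : "|" : In.Key2 : "|" : In.Key3
  svIsDup   = (svCurrKey = svPrevKey)
  svPrevKey = svCurrKey

On the output link, use the constraint NOT(svIsDup) so that only the first row of each key group is written. This only works because the rows arrive sorted, which is why the external sort is needed first.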

Posted: Tue Jul 18, 2006 12:55 am
by m_keerthi2005
Thanks for the fast reply.
We are using server jobs. Could you please tell us how exactly we need to use stage variables to get this done?

Re: Removing duplicates from 20 million records

Posted: Tue Jul 18, 2006 12:58 am
by rumu
kris007 wrote:Are you using server or parallel jobs?
If server jobs:
One way to bypass the Aggregator limitation is to sort the file externally (the internal sort also consumes space) and then check for duplicates within the Transformer using stage variables.
If you are using parallel jobs, you can use the Remove Duplicates stage (in conjunction with the Sort stage).
Hi,
If your input is a sequential file, then check the Filter option and use the /bin/sort command to sort your files.
If you are on a UNIX server, you can sort the files with a script instead.
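
For example, assuming a pipe-delimited file whose first three fields are the key columns (both assumptions), the Filter command on the Sequential File stage might be something like:

  /bin/sort -t '|' -k1,1 -k2,2 -k3,3

The stage then reads the output of that command, so the rows reach the Transformer already grouped by key.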

Posted: Tue Jul 18, 2006 1:05 am
by balajisr
You can also preprocess the file using the UNIX sort -u command.
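
A rough sketch, again assuming a pipe delimiter, the keys in the first three fields and hypothetical file names:

  sort -t '|' -k1,1 -k2,2 -k3,3 -u -o deduped.txt file1.txt file2.txt

With -u, sort keeps only the first line of each group that compares equal on the given keys, so the data is already de-duplicated before DataStage reads it.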

Posted: Tue Jul 18, 2006 1:12 am
by m_keerthi2005
We can sort the data before doing the aggregation. Would this work for 10 million records?

Any thoughts on this?

Posted: Tue Jul 18, 2006 1:14 am
by ArndW
Sorting a mere 10 million records should be no problem. You might need to tell the sort command to use a different temporary directory if your /tmp is dimensioned rather small. Using sorted data on the aggregator stage will speed up the processing and also have the effect that the stage will use very little memory.
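
For instance, -T points sort at a scratch directory with more free space (the path below is hypothetical, and the delimiter/key assumptions are as before):

  sort -T /data/sorttmp -t '|' -k1,1 -k2,2 -k3,3 -u -o sorted.txt input.txt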

Posted: Tue Jul 18, 2006 1:32 am
by DSguru2B
Actually, if you are using UNIX commands then you might as well use the UNIX uniq command to remove duplicates.
Another question: why are you using the Aggregator to remove duplicates? Why not just pass the input through a hashed file?
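
A sketch of that pipeline, under the same delimiter/key assumptions as above:

  sort -t '|' -k1,1 -k2,2 -k3,3 input.txt | uniq > deduped.txt

Note that uniq compares whole lines, so it only collapses records that are identical in every column; if duplicates are defined by the three keys alone, sort -u on those keys is the closer match.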

Posted: Tue Jul 18, 2006 1:36 am
by m_keerthi2005
We have also tried with a hashed file, but that was taking a lot more time than the DB query. We are using only Windows and server jobs to resolve this issue; we are not using any UNIX commands.

Posted: Tue Jul 18, 2006 1:42 am
by ray.wurlod
QualityStage can perform single file unduplication (even using fuzzy matching criteria) as well as two-file matches and removal of duplicates therefrom using various strategies.

Posted: Tue Jul 18, 2006 2:08 am
by m_keerthi2005
We do not have QualityStage; we have to resolve this using server jobs.

Could you please throw some more light on how to resolve this issue?

Posted: Tue Jul 18, 2006 2:21 am
by ArndW
M_keerthi2005,

you have now been given some possible solutions. What more do you need? You can pre-sort your data in UNIX and then feed it to DataStage for deduplication, or you can sort and remove duplicates directly in the UNIX sort.

Posted: Tue Jul 18, 2006 2:40 am
by kumar_s
UNIX sort will also fail with that many records. I have tried it on AIX.

Posted: Tue Jul 18, 2006 3:08 am
by ArndW
Kumar,

on AIX I have sorted many more records than that. In fact, I called a UNIX sort from a DataStage job earlier today on about 48 million records.

Posted: Tue Jul 18, 2006 3:20 am
by kumar_s
ArndW wrote:Kumar,

on AIX I have sorted many more records than that. In fact, I called a UNIX sort from a DataStage job earlier today on about 48 million records.
Perhaps I need to check whether any parameter change is required. :?