Delete the duplicates from Database
Moderators: chulett, rschirm, roy
Hi All,
I have to create a one time job that has to identfy duplicate records based on 20 fields and delete the latest record (can be identified based on DATE1). There are millions of duplicate records in the table.
Please suggest the approach to resolve the issue.
Does it have to be a job? You can do this in SQL... and if the number of records to save is smaller than the number to get rid of, I'd suggest you save the unique records in a new table and then rename it (if possible) rather than running all of those deletes.
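A minimal sketch of that save-and-rename approach, here using SQLite via Python just to illustrate. The table name, the two key columns (stand-ins for the real 20 comparison fields), and the data are all hypothetical; keeping MIN(DATE1) per group drops the later duplicates, as the original poster wants.

```python
import sqlite3

# Hypothetical table: k1/k2 stand in for the 20 duplicate-defining fields.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE src (k1 TEXT, k2 TEXT, date1 TEXT);
INSERT INTO src VALUES
  ('a', 'x', '2020-01-01'),
  ('a', 'x', '2020-06-01'),  -- later duplicate: should be dropped
  ('b', 'y', '2020-02-01');

-- Keep only the earliest DATE1 per duplicate group, then swap tables.
CREATE TABLE src_clean AS
  SELECT k1, k2, MIN(date1) AS date1
  FROM src
  GROUP BY k1, k2;

DROP TABLE src;
ALTER TABLE src_clean RENAME TO src;
""")
print(con.execute("SELECT * FROM src ORDER BY k1").fetchall())
# → [('a', 'x', '2020-01-01'), ('b', 'y', '2020-02-01')]
```

On a real database you'd run just the SQL portion; the rename step is the part that may not be possible if other objects (views, grants, foreign keys) depend on the table name.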
-craig
"You can never have too many knives" -- Logan Nine Fingers
However, if it must be done in a job, I think that I remember doing something similar and the fastest way to do it was something like this:
1) Create one big string containing a concatenation of all of the 20 variables.
2) Sort the data by "Big String", then by descending date.
3) Feed into a transformer with a Stage Variable that tracks the previous "Big String" value and compares it to the current value. Use a constraint to only output the row when they are NOT identical.
I seem to remember that it was a lot faster than doing an extended "If Then Else" construct to compare 20 values.
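The three steps above can be sketched in pure Python (the field names are illustrative stand-ins for the real 20 columns). One thing worth noting: sorting the date descending keeps the *latest* copy of each group; sort ascending instead if, as in the original request, the earliest copy should survive.

```python
rows = [
    {"f1": "a", "f2": "x", "DATE1": "2020-01-01"},
    {"f1": "a", "f2": "x", "DATE1": "2020-06-01"},
    {"f1": "b", "f2": "y", "DATE1": "2020-02-01"},
]
key_fields = ["f1", "f2"]  # stand-ins for the 20 comparison fields

def big_string(row):
    # Step 1: one concatenated comparison key.
    return "|".join(row[f] for f in key_fields)

# Step 2: sort by "Big String", then by descending date
# (two stable sorts; ISO-format dates sort correctly as strings).
rows.sort(key=lambda r: r["DATE1"], reverse=True)
rows.sort(key=big_string)

# Step 3: emulate the stage variable holding the previous key;
# output a row only when its key differs from the previous one.
kept, prev = [], None
for row in rows:
    if big_string(row) != prev:
        kept.append(row)
    prev = big_string(row)

print([r["DATE1"] for r in kept])
# → ['2020-06-01', '2020-02-01']
```

The single string comparison per row is why this beats a 20-way If-Then-Else: one compare instead of twenty, evaluated once per row.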
Select the data in a job, sort it in DataStage, run it through a Remove Duplicates stage, and use the dropped entries from that stage to delete the matching records in the original table. Not fast, but it works.
rafidwh wrote: The records need to be deleted, that's the requirement, and we don't want to have a table exclusively to remove the dups.

The technique I mentioned is pretty basic stuff, and it is used when the number of records to be deleted is greater than the number of records to be saved. Is that the case here? It does 'delete records', but it does so by only saving the ones you want to keep, not by removing the ones you want to get rid of:
a) Write good records to new table.
b) Drop old table.
c) Rename new table to old table.
In other words, why delete 90% of the data when it can be orders of magnitude faster to save the 10%? I don't know your ratio, so I put this option on the table and walked through it because you seemed to misunderstand what I meant.
And you still don't need a job to simply delete duplicates from a database, but if you want to build something that will be (as noted) your slowest option, be my guest.
-craig
"You can never have too many knives" -- Logan Nine Fingers
chulett wrote: a) Write good records to new table.
b) Drop old table.
c) Rename new table to old table.

That's a good solution if you don't have constraints like foreign keys pointing at this table.
But I agree with you, a job is not needed for this kind of task.
rafidwh, why don't you use SQL? It's much faster for this kind of task than DataStage. And if you really want to launch the delete process from DataStage, you can still call a stored procedure from a DataStage job.
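For the in-place route, a hedged sketch of the plain-SQL delete: remove every row for which a duplicate with an earlier DATE1 exists. Shown here against SQLite via Python just so it runs; the table and column names are hypothetical, and in the real table the correlated subquery would compare all 20 fields.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE src (k1 TEXT, k2 TEXT, date1 TEXT);
INSERT INTO src VALUES
  ('a', 'x', '2020-01-01'),
  ('a', 'x', '2020-06-01'),
  ('b', 'y', '2020-02-01');

-- Delete any row that has a duplicate with an earlier DATE1,
-- i.e. keep the earliest copy of each group.
DELETE FROM src
WHERE EXISTS (
  SELECT 1 FROM src AS older
  WHERE older.k1 = src.k1 AND older.k2 = src.k2
    AND older.date1 < src.date1
);
""")
print(con.execute("SELECT * FROM src ORDER BY k1").fetchall())
# → [('a', 'x', '2020-01-01'), ('b', 'y', '2020-02-01')]
```

With millions of duplicates, an index on the comparison columns matters, and a correlated delete like this can still lose badly to the save-and-rename approach discussed above, which is exactly chulett's point.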