Delete the duplicates from the database

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

rafidwh
Participant
Posts: 179
Joined: Mon Oct 10, 2005 11:30 pm

Delete the duplicates from the database

Post by rafidwh »

Hi All,

I have to create a one-time job that has to identify duplicate records based on 20 fields and delete the latest record (which can be identified by DATE1). There are millions of duplicate records in the table.

Please suggest an approach to resolve the issue.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Does it have to be a job? You can do this in SQL... and I'd suggest you save the unique records in a new table and then rename it (if possible), rather than doing all of those deletes, if the number of records to save is smaller than the number to get rid of.
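
For illustration only, a minimal sketch of that approach in Oracle-flavoured SQL. The names MY_TABLE, MY_TABLE_DEDUP, COL1 .. COL20 and DATE1 are stand-ins for your real table and columns, the exact syntax depends on your database, and this version keeps one row (the earliest) per duplicate group, which matches "delete the latest record":

CREATE TABLE my_table_dedup AS
SELECT col1, col2, /* ... list all of the table's columns here ... */ col20, date1
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY col1, col2, /* ... the 20 key columns ... */ col20
                              ORDER BY date1) AS rn
    FROM my_table t
)
WHERE rn = 1;   -- keep only the earliest row of each duplicate group

-- then swap the tables, after verifying the new one
DROP TABLE my_table;
ALTER TABLE my_table_dedup RENAME TO my_table;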
-craig

"You can never have too many knives" -- Logan Nine Fingers
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

However, if it must be done in a job, I remember doing something similar, and the fastest way to do it was something like this:

1) Create one big string containing a concatenation of all 20 fields.
2) Sort the data by "Big String", then by descending date.
3) Feed into a transformer with a Stage Variable that tracks the previous "Big String" value and compares it to the current value. Use a constraint to only output the row when they are NOT identical.

I seem to remember that it was a lot faster than doing an extended "If Then Else" construct to compare 20 values.
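
As a rough textual sketch of those derivations (the input link name In and the stage variable names svBigString, svIsNewKey and svPrevBigString are made up; the stage variables must be listed in this order so the comparison sees the previous row's value before it is overwritten):

Stage variable     Derivation
svBigString        In.Col1 : '|' : In.Col2 : '|' : ... : In.Col20
svIsNewKey         svBigString <> svPrevBigString
svPrevBigString    svBigString

Output link constraint: svIsNewKey

The '|' delimiter just guards against two different column combinations concatenating to the same string. Also note that the surviving row is the first one in each key group, so the date sort direction decides which copy you keep: to keep the oldest copy and drop the latest (as the original post asks), sort DATE1 ascending rather than descending.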
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
rafidwh
Participant
Posts: 179
Joined: Mon Oct 10, 2005 11:30 pm

Post by rafidwh »

Hi Craig,

The records need to be deleted, that's the requirement, and we don't want to create a table exclusively to remove the dups.

Any other options for doing this task would help greatly.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Select the data in a job, sort it in DataStage, run it through a remove-duplicates stage, and use the "deleted" entries from that stage to delete the records in the original table. Not fast, but it works.
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

Send the columns to a sort stage and enable the 'change' flag.

Write into a sequential file only when the flag is on.

Write another job to read from the seq file and write into the DataSet.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

rafidwh wrote:The records need to be deleted, that's the requirement, and we don't want to create a table exclusively to remove the dups.
The technique I mentioned is pretty basic stuff, and it is used when the number of records to be deleted is greater than the number of records to be saved. Is that the case here? It does "delete records", but it does so by saving only the ones you want to keep, not by removing the ones you want to get rid of:

a) Write good records to new table.
b) Drop old table.
c) Rename new table to old table.

In other words, why delete 90% of the data when it can be orders of magnitude faster to save the 10%? I don't know your ratio, so I put this on the table as an option and walked through it again since you seemed to misunderstand what I meant.

And you still don't need a job to simply delete duplicates from a database, but if you want to build something that will be (as noted) your slowest option, be my guest.
-craig

"You can never have too many knives" -- Logan Nine Fingers
aoriano
Participant
Posts: 15
Joined: Fri Apr 25, 2008 8:00 am

Post by aoriano »

chulett wrote: a) Write good records to new table.
b) Drop old table.
c) Rename new table to old table.
That's a good solution if you don't have constraints like foreign keys pointing at this table.

But I agree with you, a job is not needed for this kind of task.
rafidwh, why don't you use SQL? It's much faster for this kind of task than DataStage. And if you really want to launch the duplicate-deletion process from DataStage, you can still call a stored procedure from a DataStage job.
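
For example, a minimal sketch of an in-place delete in Oracle-flavoured SQL (MY_TABLE, COL1 .. COL20 and DATE1 are hypothetical stand-ins, and ROWID is Oracle-specific, so other databases need a different row identifier):

DELETE FROM my_table
WHERE rowid IN (
    SELECT rid
    FROM (
        SELECT rowid AS rid,
               ROW_NUMBER() OVER (PARTITION BY col1, col2, /* ... the 20 key columns ... */ col20
                                  ORDER BY date1) AS rn
        FROM my_table
    )
    WHERE rn > 1   -- every copy after the earliest in each duplicate group
);
COMMIT;

If the delete really has to be triggered from DataStage, this is the kind of statement that could sit inside a stored procedure and be called from a job.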
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

True, constraints do make that a little... trickier. :wink:

And another reason I suggested SQL was the fact that this was a 'one time' thing.
-craig

"You can never have too many knives" -- Logan Nine Fingers