Delete the duplicates from Database
Moderators: chulett, rschirm, roy
Hi All,
I have to create a one time job that has to identfy duplicate records based on 20 fields and delete the latest record (can be identified based on DATE1). There are millions of duplicate records in the table.
Please suggest the approach to resolve the issue.
Does it have to be a job? You can do this in SQL... and if the number of records to save is smaller than the number to get rid of, I'd suggest you save the unique records in a new table and then rename it (if possible) rather than running all of those deletes.
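A minimal sketch of that save-and-rename approach, here using SQLite via Python just to illustrate. The table name, the two key columns (stand-ins for the real 20 comparison fields), and the data are all hypothetical; keeping MIN(DATE1) per group drops the later duplicates, as the original poster wants.

```python
import sqlite3

# Hypothetical table: k1/k2 stand in for the 20 duplicate-defining fields.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE src (k1 TEXT, k2 TEXT, date1 TEXT);
INSERT INTO src VALUES
  ('a', 'x', '2020-01-01'),
  ('a', 'x', '2020-06-01'),  -- later duplicate: should be dropped
  ('b', 'y', '2020-02-01');

-- Keep only the earliest DATE1 per duplicate group, then swap tables.
CREATE TABLE src_clean AS
  SELECT k1, k2, MIN(date1) AS date1
  FROM src
  GROUP BY k1, k2;

DROP TABLE src;
ALTER TABLE src_clean RENAME TO src;
""")
print(con.execute("SELECT * FROM src ORDER BY k1").fetchall())
# → [('a', 'x', '2020-01-01'), ('b', 'y', '2020-02-01')]
```

On a real database you'd run just the SQL portion; the rename step is the part that may not be possible if other objects (views, grants, foreign keys) depend on the table name.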
-craig
"You can never have too many knives" -- Logan Nine Fingers
However, if it must be done in a job, I think that I remember doing something similar and the fastest way to do it was something like this:
1) Create one big string containing a concatenation of all of the 20 variables.
2) Sort the data by "Big String", then by descending date.
3) Feed into a transformer with a Stage Variable that tracks the previous "Big String" value and compares it to the current value. Use a constraint to only output the row when they are NOT identical.
I seem to remember that it was a lot faster than doing an extended "If Then Else" construct to compare 20 values.
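The three steps above can be sketched in pure Python (the field names are illustrative stand-ins for the real 20 columns). One thing worth noting: sorting the date descending keeps the *latest* copy of each group; sort ascending instead if, as in the original request, the earliest copy should survive.

```python
rows = [
    {"f1": "a", "f2": "x", "DATE1": "2020-01-01"},
    {"f1": "a", "f2": "x", "DATE1": "2020-06-01"},
    {"f1": "b", "f2": "y", "DATE1": "2020-02-01"},
]
key_fields = ["f1", "f2"]  # stand-ins for the 20 comparison fields

def big_string(row):
    # Step 1: one concatenated comparison key.
    return "|".join(row[f] for f in key_fields)

# Step 2: sort by "Big String", then by descending date
# (two stable sorts; ISO-format dates sort correctly as strings).
rows.sort(key=lambda r: r["DATE1"], reverse=True)
rows.sort(key=big_string)

# Step 3: emulate the stage variable holding the previous key;
# output a row only when its key differs from the previous one.
kept, prev = [], None
for row in rows:
    if big_string(row) != prev:
        kept.append(row)
    prev = big_string(row)

print([r["DATE1"] for r in kept])
# → ['2020-06-01', '2020-02-01']
```

The single string comparison per row is why this beats a 20-way If-Then-Else: one compare instead of twenty, evaluated once per row.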
Select the data in a job, sort it in DataStage, run it through a Remove Duplicates stage, and use the dropped entries from that stage to delete the matching records in the original table. Not fast, but it works.
rafidwh wrote: The records need to be deleted, that's the requirement, and we don't want to have a table exclusively to remove the dups.

The technique I mentioned is pretty basic stuff, and it is used when the number of records to be deleted is greater than the number of records to be saved. Is that the case here? It does 'delete records', but it does so by only saving the ones you want to keep, not by removing the ones you want to get rid of:
a) Write good records to new table.
b) Drop old table.
c) Rename new table to old table.
In other words, why delete 90% of the data when it can be orders of magnitude faster to save the 10%? I don't know your ratio, so I put this option on the table and walked through it because you seemed to misunderstand what I meant.
And you still don't need a job to simply delete duplicates from a database, but if you want to build something that will be (as noted) your slowest option, be my guest.
-craig
"You can never have too many knives" -- Logan Nine Fingers
chulett wrote: a) Write good records to new table.
b) Drop old table.
c) Rename new table to old table.

That's a good solution if you don't have constraints like foreign keys pointing at this table.
But I agree with you, a job is not needed for this kind of task.
rafidwh, why don't you use SQL? It's much faster for this kind of task than DataStage. And if you really want to launch the delete process from DataStage, you can still call a stored procedure from a DataStage job.
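For the in-place route, a hedged sketch of the plain-SQL delete: remove every row for which a duplicate with an earlier DATE1 exists. Shown here against SQLite via Python just so it runs; the table and column names are hypothetical, and in the real table the correlated subquery would compare all 20 fields.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE src (k1 TEXT, k2 TEXT, date1 TEXT);
INSERT INTO src VALUES
  ('a', 'x', '2020-01-01'),
  ('a', 'x', '2020-06-01'),
  ('b', 'y', '2020-02-01');

-- Delete any row that has a duplicate with an earlier DATE1,
-- i.e. keep the earliest copy of each group.
DELETE FROM src
WHERE EXISTS (
  SELECT 1 FROM src AS older
  WHERE older.k1 = src.k1 AND older.k2 = src.k2
    AND older.date1 < src.date1
);
""")
print(con.execute("SELECT * FROM src ORDER BY k1").fetchall())
# → [('a', 'x', '2020-01-01'), ('b', 'y', '2020-02-01')]
```

With millions of duplicates, an index on the comparison columns matters, and a correlated delete like this can still lose badly to the save-and-rename approach discussed above, which is exactly chulett's point.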