to get duplicate records
I have 2 databases: the first includes the duplicate data and the second one is the data after removing duplicates.
E.g. the first one contains 100 records and the second one 90 records,
and 10 are the duplicate entries. I want to find these 10 entries; how do I do that?
Assuming that 'databases' means 'tables' in your statement "i hav 2 databases...":
You can find the duplicate records without actually relating (e.g. joining) the two tables, by firing the appropriate SQL statement against the table that has the duplicate rows.
Nitin Jain | India
If everything seems to be going well, you have obviously overlooked something.
Take the first dataset with 100 records and use the Sort stage to sort on the key columns.
Set Allow Duplicates to true in the Sort stage.
We can capture the duplicate records using the Create Key Change Column option under the Stage -> Properties tab.
This new key column has the value 1 for the first record of each key group and 0 for the duplicate records. Use a Filter stage next and keep the records which are 0's; those are your duplicates. (The same idea is sketched in SQL below.)
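As a cross-check, here is a minimal SQL sketch of the same key-change idea, assuming a database that supports window functions and a hypothetical table tab_name with key columns col1, col2 and col3 (the same names used later in this thread). ROW_NUMBER plays the role of the key change column: 1 marks the first record of each key group, anything greater marks a duplicate occurrence.
Code: Select all
-- Sketch only: rows with rn > 1 correspond to key change = 0 in the Sort stage,
-- i.e. the duplicate occurrences we want to keep
SELECT col1, col2, col3
FROM (
    SELECT col1, col2, col3,
           ROW_NUMBER() OVER (PARTITION BY col1, col2, col3
                              ORDER BY col1) AS rn
    FROM tab_name
) numbered
WHERE rn > 1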
koolnitz wrote: Assuming that 'databases' means 'tables' in your statement "i hav 2 databases...".
You can find the duplicate records without actually relating (e.g. joining) the two tables, by firing the appropriate SQL statement against the table that has the duplicate rows.
Please post that SQL statement that should work in DataStage...
I'm assuming that your table has 3 columns named col1, col2 and col3.
Here is a generic query that meets your requirement...
Code: Select all
SELECT col1, col2, col3, COUNT(*) AS Count
FROM tab_name
GROUP BY col1, col2, col3
HAVING COUNT(*) > 1
In the Database stage in your DS job, define four columns: col1, col2, col3 and Count.
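A hypothetical walk-through, with made-up sample data, of what that query returns:
Code: Select all
-- Made-up example: tab_name holds 5 rows, one combination repeated three times
CREATE TABLE tab_name (col1 INT, col2 VARCHAR(10), col3 INT)
INSERT INTO tab_name VALUES (1, 'a', 10)
INSERT INTO tab_name VALUES (1, 'a', 10)
INSERT INTO tab_name VALUES (1, 'a', 10)
INSERT INTO tab_name VALUES (2, 'b', 20)
INSERT INTO tab_name VALUES (3, 'c', 30)
-- The HAVING query then returns one row per duplicated combination:
--   col1 = 1, col2 = 'a', col3 = 10, Count = 3
Note that each duplicated combination appears only once, with its count; that point matters for the discussion further down.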
Nitin Jain | India
If everything seems to be going well, you have obviously overlooked something.
koolnitz wrote: I'm assuming that your table has 3 columns named col1, col2 and col3.
Here is a generic query that meets your requirement...
SELECT col1, col2, col3, COUNT(*) AS Count
FROM tab_name
GROUP BY col1, col2, col3
HAVING COUNT(*) > 1
In the Database stage in your DS job, define four columns: col1, col2, col3 and Count.
This will work fine, but it will show the duplicate data only once, even if more than 2 records are there.
HAVING COUNT(*) = 1 gives you the 90 records and HAVING COUNT(*) > 1 gives you the other 10. Isn't that what you wanted?
You can do this within DataStage, effectively emulating the HAVING clause. Use a "fork join" with an Aggregator stage to get the count for each group, then run one copy of these results through a Filter stage based on the count being greater than 1. These (tgt2) are your duplicate rows. The other copy you run through a Remove Duplicates stage to get your unique rows including one copy of each duplicate row.
Make very sure that your partitioning is key-based on the grouping key(s).
Code: Select all
-----> Copy ---------------> Join ----> Copy ----> Filter ----> tgt2
         |                    ^          |
         |                    |          |
         +----> Aggregator ---+          +----> RemDups ----> tgt1
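For comparison, here is a rough SQL analogue of this fork join, sketched against the same hypothetical tab_name table from earlier in the thread: the derived table plays the Aggregator, the join back to the base rows is the fork join, and the WHERE clause plays the Filter stage. Unlike the plain GROUP BY ... HAVING query, this returns every occurrence of each duplicate row.
Code: Select all
-- Fork join in SQL: join each base row back to its group count
SELECT t.col1, t.col2, t.col3, g.cnt
FROM tab_name t
INNER JOIN (
    SELECT col1, col2, col3, COUNT(*) AS cnt
    FROM tab_name
    GROUP BY col1, col2, col3
) g
ON  t.col1 = g.col1
AND t.col2 = g.col2
AND t.col3 = g.col3
WHERE g.cnt > 1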
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod wrote: HAVING COUNT(*) = 1 gives you the 90 records and HAVING COUNT(*) > 1 gives you the other 10. Isn't that what you wanted?
You can do this within DataStage, effectively emulating the HAVING clause ...
No, it will give the 10 records only when each of those 10 records has occurred only 2 times. Suppose the same record has occurred eleven times: then it will give only one record, since all the other 9 are the same, but I want all the 10 records...