Page 1 of 2

to get duplicate records

Posted: Tue Nov 27, 2007 3:57 am
by abhay10
i hav 2 databases first includes duplicates data and second one after removing duplicate data
eg first one contain 100 records and second one 90 records
and 10 are the duplicate entries, i want to find these 10 entries, how to do that?

Posted: Tue Nov 27, 2007 4:17 am
by koolnitz
Assuming that 'databases' mean 'tables' from your statement "i hav 2 databases...".
You can find the duplicate records without actually relating (e.g. join) the two tables, by firing the appropriate SQL statement against the table that has duplicate rows.

Posted: Tue Nov 27, 2007 4:17 am
by Karine
It would be easier to run a query against your first database with a query like:

SELECT COL1 FROM DBASE GROUP BY COL1 HAVING COUNT(*) > 1

assuming col1 is your unique key

Posted: Tue Nov 27, 2007 4:23 am
by devidotcom
Take the first dataset with 100 records and use the sort stage to sort and remove duplicate records.

Set Allow duplicates to true in the sort stage
We can capture the duplicate records using the create change key column option under the stage-> properties tab
This new key would have the value of 1 for the first record and the duplicate records as 0's. Use a filter stage next and filter the records which are 0's

Posted: Tue Nov 27, 2007 4:43 am
by abhay10
Karine wrote:It would be easier to run a query against your first database with a query like:

SELECT COL1 FROM DBASE GROUP BY COL1 HAVING COUNT(*) > 1

assuming col1 is your unique key

COUNT DOESNT WORK IN DATASTAGE...

Posted: Tue Nov 27, 2007 4:46 am
by abhay10
koolnitz wrote:Assuming that 'databases' mean 'tables' from your statement "i hav 2 databases...".
You can find the duplicate records without actually relating (e.g. join) the two tables, by firing the appropriate SQL statement against the table that has duplicate rows.

pls post that sql statement that should work in datastage...

Posted: Tue Nov 27, 2007 4:47 am
by koolnitz
Pls post what is your Database.
Btw, you could also google for the command :idea:

Posted: Tue Nov 27, 2007 4:55 am
by koolnitz
I'm assuming that your table has 3 columns with name col1, col2 and col3.
Here is a generic query that meets your requirement...

SELECT col1, col2, col3, count(*) as Count
FROM tab_name
GROUP BY col1, col2, col3
HAVING COUNT(*)>1

In the Database stage in your DS job, define four columns - col1, col2, col3 and Count.

Posted: Tue Nov 27, 2007 5:48 am
by ray.wurlod
abhay10 wrote:COUNT DOESNT WORK IN DATASTAGE...
Yes it does, when it's done right.

Posted: Tue Nov 27, 2007 6:32 am
by abhay10
koolnitz wrote:I'm assuming that your table has 3 columns with name col1, col2 and col3.
Here is a generic query that meets your requirement...

SELECT col1, col2, col3, count(*) as Count
FROM tab_name
GROUP BY col1, col2, col3
HAVING COUNT(*)>1

In the Database stage in your DS job, define four columns - col1, col2, col3 and Count.
this will work fine but will show duplicate data only once even if more than 2 records are there

Posted: Tue Nov 27, 2007 6:35 am
by abhay10
ray.wurlod wrote:
abhay10 wrote:COUNT DOESNT WORK IN DATASTAGE...
Yes it does, when it's done right.
ya, sry for that...

Posted: Tue Nov 27, 2007 5:05 pm
by ray.wurlod
HAVING COUNT(*) = 1 gives you the 90 records and HAVING COUNT(*) > 1 gives you the other 10. Isn't that what you wanted?

You can do this within DataStage, effectively emulating the HAVING clause. Use a "fork join" with an Aggregator stage to get the count for each group, then run one copy of these results through a Filter stage based on the count being greater than 1. These (tgt2) are your duplicate rows. The other copy you run through a Remove Duplicates stage to get your unique rows including one copy of each duplicate row.

Code: Select all

 ----->  Copy  --------------->  Join  ---->  Copy  ---->  Filter  ---->  tgt2
          |                       ^             |
          |                       |             |
          +---->  Aggregator  ----+             +---->  RemDups  ---->  tgt1
Make very sure that your partitioning is key-based on the grouping key(s).

Posted: Tue Nov 27, 2007 9:19 pm
by abhay10
ray.wurlod wrote:HAVING COUNT(*) = 1 gives you the 90 records and HAVING COUNT(*) > 1 gives you the other 10. Isn't that what you wanted?

You can do this within DataStage, effectively emulating the HAVING claus ...
no it will give 10 records only when each of those 10 records have occured only 2 times, suppose the same record has occured eleven times than it will give only one record since all other 9 are same, but i want all the 10 records...

Posted: Tue Nov 27, 2007 9:42 pm
by ray.wurlod
The job design I posted will give you what you want.

Pre-emptive post

Posted: Tue Nov 27, 2007 9:45 pm
by ray.wurlod
For less than 30c per day (approx Rs12) you can purchase premium membership that will allow you to read the entire posts of the five premium posters. Revenue from premium memberships is 100% allocated to the hosting and bandwidth costs of DSXchange; like all other posters the premium posters are unpaid.