Catch Duplicate rows

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

amitendra_amit
Participant
Posts: 27
Joined: Fri Jul 28, 2006 2:11 pm

Catch Duplicate rows

Post by amitendra_amit »

Hi All,

I am new to DataStage and stuck on one problem. I am unable to find any solution in the help provided by DataStage.

I am using an OCI stage as source. I have to catch the duplicate records and put those duplicate records in a sequential file.

The records are in this format:

Col1 Col2 Col3
x y z
x y z
a b c
x y a

Now I have to catch Row 1, Row 2 and Row 4 and insert them into a sequential file.

Thanks in advance.
Sunshine
meena
Participant
Posts: 430
Joined: Tue Sep 13, 2005 12:17 pm

Post by meena »

Hi Amit,
Do a lookup. Load the table data into a hash file. Use this hash file as a lookup against the original table. Send one output to your table or sequential file, and the other output, the rejected rows, to a sequential file. You can get more information here (DSXchange) if you do a search.

Thank you
narasimha
Charter Member
Posts: 1236
Joined: Fri Oct 22, 2004 8:59 am
Location: Staten Island, NY

Post by narasimha »

Welcome to DSXchange

Write a custom SQL to find the duplicate rows, assuming you have an OCI stage as your source stage.
Push the results to the sequential file.
Narasimha Kade

Finding answers is simple, all you need to do is come up with the correct questions.
dsdesigner
Participant
Posts: 34
Joined: Thu Jul 29, 2004 1:03 pm

Post by dsdesigner »

Step 1.
Identify the duplicate rows using the following SQL:
select col1, col2, count(1)
from table
group by col1, col2
having count(1) > 1
Load these into a hash file, specifying col1 and col2 as keys.

Step 2.
Read the source again. Use the hash file created above as a lookup. For each row, see if you find the row in the lookup. If you do, then it is a duplicate.
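Outside DataStage, the two steps can be sketched with SQLite in Python; the table name `src` and the sample rows are just placeholders taken from the example above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE src (col1 TEXT, col2 TEXT, col3 TEXT)")
cur.executemany("INSERT INTO src VALUES (?, ?, ?)",
                [("x", "y", "z"), ("x", "y", "z"), ("a", "b", "c"), ("x", "y", "a")])

# Step 1: find the (col1, col2) pairs that occur more than once
cur.execute("""
    SELECT col1, col2, COUNT(1)
    FROM src
    GROUP BY col1, col2
    HAVING COUNT(1) > 1
""")
dup_keys = {(c1, c2) for c1, c2, _ in cur.fetchall()}   # plays the role of the hash file

# Step 2: re-read the source; rows whose key is found in the "hash file" are duplicates
cur.execute("SELECT col1, col2, col3 FROM src")
duplicates = [row for row in cur.fetchall() if (row[0], row[1]) in dup_keys]
print(duplicates)   # rows 1, 2 and 4 from the example
```

The set lookup in step 2 mirrors the hash-file lookup in the job: every source row is checked against the key set, and matches are routed to the duplicates file.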

Thanks,
Shekar
amitendra_amit
Participant
Posts: 27
Joined: Fri Jul 28, 2006 2:11 pm

Re: Catch Duplicate rows

Post by amitendra_amit »

The group by solution is not working.....

My problem is I have to catch duplicates depending upon the values of two columns, and if 2 rows match then I have to catch both rows, not just one.

Like
a,b
a,b
x,y
y,x
c,d
c,d

in this case

I have to catch
a,b
a,b
c,d
c,d

Please help me in solving this.

Thanks
Sunshine
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Re: Catch Duplicate rows

Post by kumar_s »

If you want to have two different files, add another column as an indicator from the source, with the above query modified.
I don't have database access now to test the query, but it should look something like below.

Code: Select all

Select col1, col2, (select Decode(sign(count(*)-1),0,'N','Y') from table a where  a.col1 = b.col1 and a.col2=b.col2 group by a.col1, a.col2)  AS INDICATOR
from table b
If the INDICATOR = 'Y', use a constraint in the transformer to send the row to a duplicate file.
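DECODE is Oracle-specific, so as a rough sketch the same indicator logic can be tried with SQLite in Python, rewriting DECODE(SIGN(COUNT(*)-1), 0, 'N', 'Y') as a CASE expression (the table name t and the sample rows are placeholders):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (col1 TEXT, col2 TEXT)")
cur.executemany("INSERT INTO t VALUES (?, ?)",
                [("a", "b"), ("a", "b"), ("x", "y"), ("y", "x"), ("c", "d"), ("c", "d")])

# Correlated subquery counts how often each row's key occurs;
# CASE plays the role of Oracle's DECODE(SIGN(COUNT(*) - 1), 0, 'N', 'Y')
rows = cur.execute("""
    SELECT b.col1, b.col2,
           (SELECT CASE WHEN COUNT(*) > 1 THEN 'Y' ELSE 'N' END
            FROM t a
            WHERE a.col1 = b.col1 AND a.col2 = b.col2) AS indicator
    FROM t b
""").fetchall()

for row in rows:
    print(row)
```

Each source row comes out tagged 'Y' or 'N', which is exactly what the transformer constraint then tests.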
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
dsdesigner
Participant
Posts: 34
Joined: Thu Jul 29, 2004 1:03 pm

Post by dsdesigner »

Did you try both steps?? Step 1 identifies the values of col1 and col2 that are duplicates.

select col1, col2, count(1)
from table
group by col1, col2
having count(1) > 1
Load these into a hash file by specifying col1 and col2 as keys.

For the example you have given
a,b
a,b
x,y
y,x
c,d
c,d

this will yield
a, b, 2
c, d, 2

This you will load into a hash file, specifying col1, col2 as the keys.

In step 2
you will read the entire source. Look up against the hash file. So when you get the first a,b it will find a match in the hash file. Load it to seq2. When the second a,b comes from the source, it will again find a match in the hash file. It will also go to seq2. When x,y comes next it will not find a match in the hash file. Load it to seq1. Same for the c,d rows.

Code: Select all

             
                      a, b, 2
                      c, d, 2
                          
a,b                      |
a,b                      |
x,y                      |
y,x--------------- --- xfm ------------------seq1
c,d                         \ 
c,d                           \
                                 \
                                seq2

Thanks,
Shekar
chowmunyee
Participant
Posts: 32
Joined: Wed Dec 28, 2005 9:02 pm

Post by chowmunyee »

I'm a newbie as well.....

This is similar to my case, where my source is an ASCII file, but the size of the file is more than a million rows. I have to check for the duplicate records. Let's take the same scenario,

a,b
a,b
x,y
y,x
c,d
c,d

with this example,
a,b - 2
c,d - 2

Target - I have to generate one ASCII file which contains all the data, but with a remark on a,b - duplicate records...
column A, column B, remark
a,b duplicate records
a,b duplicate records
x,y
y,x
c,d duplicate records
c,d duplicate records

Then I have to load one row of data a,b into the target db instead of 2 rows of a,b.

Can anyone give me some advice on this? I'm concerned about performance.

thank you
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

then i have to load one row data a,b into target db instead 2 rows a,b.
The solution was given to you earlier. You need to tweak the given code.
If you had understood the given query and the job design, you might have done it by yourself.

Code: Select all

Select col1, col2, (select Decode(sign(count(*)-1),0,'','Duplicate Records') from table a where  a.col1 = b.col1 and a.col2=b.col2 group by a.col1, a.col2)  AS Remark 
from table b
The above will give you the required ascii file.
Use an aggregator and fetch the first row of each group. Or you can use stage variables to pick the first/last row of each group.
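The stage-variable technique can be sketched like this: with the input sorted on the key columns, compare each row's key to the previous one and output only the first row of each group (the row data below is illustrative):

```python
# Sketch of the "stage variables" dedup: assumes input sorted on (col1, col2);
# only the first row of each key group is kept
rows = [("a", "b"), ("a", "b"), ("c", "d"), ("c", "d"), ("x", "y"), ("y", "x")]

prev_key = None          # plays the role of a stage variable
first_per_group = []
for col1, col2 in rows:
    key = (col1, col2)
    if key != prev_key:  # key changed -> first row of a new group
        first_per_group.append((col1, col2))
    prev_key = key

print(first_per_group)
```

In the job, the same comparison lives in a stage-variable derivation and the "key changed" condition becomes the transformer constraint on the output link.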
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
chowmunyee
Participant
Posts: 32
Joined: Wed Dec 28, 2005 9:02 pm

Post by chowmunyee »

kumar_s wrote:
then i have to load one row data a,b into target db instead 2 rows a,b.
The solution was given to you earlier. You need to tweak the given code.
If you had understood the given query and the job design, you might have done it by yourself.

Code: Select all

Select col1, col2, (select Decode(sign(count(*)-1),0,'','Duplicate Records') from table a where  a.col1 = b.col1 and a.col2=b.col2 group by a.col1, a.col2)  AS Remark 
from table b
The above will give you the required ascii file.
Use an aggregator and fetch the first row of each group. Or you can use stage variables to pick the first/last row of each group.
I need to double-confirm on table a and table b... do you use an inner join, or how do you join the 2 tables?

Thank you very much kumar.... this is the way to solve my case.... I think I know how I'm going to solve it.... thank you.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Any join on the same table without any filter will give the same result, because all the records will find a match.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
jdmiceli
Premium Member
Posts: 309
Joined: Wed Feb 22, 2006 10:03 am
Location: Urbandale, IA

Possibly Oracle ignorant here...

Post by jdmiceli »

Hi all,

Doesn't Oracle have a 'DISTINCT' keyword or equivalent?

If your SQL went something like this (borrowing some of the existing stuff from previous posts):

select DISTINCT col1, col2
from table
group by col1, col2

You would then just use 'User Defined SQL' in your source stage, and the sequential file will then have what you want. Maybe I'm just missing something, so I apologize if I'm off.

Bestest!

John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services


"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"
meena
Participant
Posts: 430
Joined: Tue Sep 13, 2005 12:17 pm

Re: Possibly Oracle ignorant here...

Post by meena »

Hi jdmiceli...
We can use the DISTINCT keyword to extract rows with no duplicates. But here Amit wants to capture the rejected rows... So we do a lookup and send the rejected rows to a sequential file....
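The difference shows up in a small SQLite sketch (the table name t and sample rows are placeholders): DISTINCT returns one row per key and throws the duplicate information away, while what was asked for is every row whose key occurs more than once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (col1 TEXT, col2 TEXT)")
cur.executemany("INSERT INTO t VALUES (?, ?)",
                [("a", "b"), ("a", "b"), ("x", "y"), ("y", "x"), ("c", "d"), ("c", "d")])

# DISTINCT collapses the duplicates: one row per key, duplicates discarded
distinct_rows = cur.execute("SELECT DISTINCT col1, col2 FROM t").fetchall()

# Catching duplicates is the opposite: keep every row of a repeated key
dup_rows = cur.execute("""
    SELECT t.col1, t.col2
    FROM t
    JOIN (SELECT col1, col2 FROM t
          GROUP BY col1, col2
          HAVING COUNT(1) > 1) d
      ON t.col1 = d.col1 AND t.col2 = d.col2
""").fetchall()
```

Here distinct_rows has 4 rows (one per key) while dup_rows has the 4 duplicate rows a,b a,b c,d c,d.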
jdmiceli wrote:Hi all,

Doesn't Oracle have a 'DISTINCT' keyword or equivalent?

If your SQL went something like this (borrowing some of the existing stuff from previous posts):

select DISTINCT col1, col2
from table
group by col1, col2

You would then just use 'User Defined SQL' in your source stage, and the sequential file will then have what you want. Maybe I'm just missing something, so I apologize if I'm off.

Bestest!