Catch Duplicate rows

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

amitendra_amit
Participant
Posts: 27
Joined: Fri Jul 28, 2006 2:11 pm

Catch Duplicate rows

Post by amitendra_amit »

Hi All,

I am new to DataStage and stuck on one problem. I am unable to find any solution in the help provided by DataStage.

I am using an OCI stage as source. I have to catch the duplicate records and put those duplicate records in a sequential file.

The records are in this format:

Col1 Col2 Col3
x y z
x y z
a b c
x y a

Now I have to catch Row 1, Row 2 and Row 4 and insert them into a sequential file.

Thanks in advance.
Sunshine
meena
Participant
Posts: 430
Joined: Tue Sep 13, 2005 12:17 pm

Post by meena »

Hi Amit,
Do a lookup. Load the table data into a hash file. Use this hash file as a lookup against the original table. Send one output to your table or sequential file, and the other output, the rejected rows, to a sequential file. You can get more information here (DSXchange) if you do a search.

Thank you
narasimha
Charter Member
Posts: 1236
Joined: Fri Oct 22, 2004 8:59 am
Location: Staten Island, NY

Post by narasimha »

Welcome to DSXchange

Write a custom SQL to find the duplicate rows, assuming you have an OCI stage as your source stage.
Push the results to the sequential file.
Narasimha Kade

Finding answers is simple, all you need to do is come up with the correct questions.
dsdesigner
Participant
Posts: 34
Joined: Thu Jul 29, 2004 1:03 pm

Post by dsdesigner »

Step 1.
Identify the duplicate rows using the following SQL:
select col1, col2, count(1)
from table
group by col1, col2
having count(1) > 1
Load these into a hash file, specifying col1 and col2 as keys.

Step 2.
Read the source again. Use the hash file created above as a lookup. For each row, see if you find the row in the lookup. If you do, then it is a duplicate.
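Outside DataStage, the two steps can be sketched with SQLite in Python; the table name `src` and the sample rows are just placeholders taken from the example above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE src (col1 TEXT, col2 TEXT, col3 TEXT)")
cur.executemany("INSERT INTO src VALUES (?, ?, ?)",
                [("x", "y", "z"), ("x", "y", "z"), ("a", "b", "c"), ("x", "y", "a")])

# Step 1: find the (col1, col2) pairs that occur more than once
cur.execute("""
    SELECT col1, col2, COUNT(1)
    FROM src
    GROUP BY col1, col2
    HAVING COUNT(1) > 1
""")
dup_keys = {(c1, c2) for c1, c2, _ in cur.fetchall()}   # plays the role of the hash file

# Step 2: re-read the source; rows whose key is found in the "hash file" are duplicates
cur.execute("SELECT col1, col2, col3 FROM src")
duplicates = [row for row in cur.fetchall() if (row[0], row[1]) in dup_keys]
print(duplicates)   # rows 1, 2 and 4 from the example
```

The set lookup in step 2 mirrors the hash-file lookup in the job: every source row is checked against the key set, and matches are routed to the duplicates file.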

Thanks,
Shekar
amitendra_amit
Participant
Posts: 27
Joined: Fri Jul 28, 2006 2:11 pm

Re: Catch Duplicate rows

Post by amitendra_amit »

The group by solution is not working.....

My problem is I have to catch duplicates depending upon the values of two columns, and if 2 rows match then I have to catch both rows, not just one.

Like
a,b
a,b
x,y
y,x
c,d
c,d

in this case

I have to catch
a,b
a,b
c,d
c,d

Please help me in solving this.

Thanks
Sunshine
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Re: Catch Duplicate rows

Post by kumar_s »

If you want to have two different files, add another column as an indicator from the source, with the above query modified.
I don't have database access now to test the query, but it should look something like below.

Code: Select all

Select col1, col2, (select Decode(sign(count(*)-1),0,'N','Y') from table a where  a.col1 = b.col1 and a.col2=b.col2 group by a.col1, a.col2)  AS INDICATOR
from table b
If the INDICATOR = 'Y', use a constraint in the transformer to send the row to a duplicate file.
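DECODE is Oracle-specific, so as a rough sketch the same indicator logic can be tried with SQLite in Python, rewriting DECODE(SIGN(COUNT(*)-1), 0, 'N', 'Y') as a CASE expression (the table name t and the sample rows are placeholders):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (col1 TEXT, col2 TEXT)")
cur.executemany("INSERT INTO t VALUES (?, ?)",
                [("a", "b"), ("a", "b"), ("x", "y"), ("y", "x"), ("c", "d"), ("c", "d")])

# Correlated subquery counts how often each row's key occurs;
# CASE plays the role of Oracle's DECODE(SIGN(COUNT(*) - 1), 0, 'N', 'Y')
rows = cur.execute("""
    SELECT b.col1, b.col2,
           (SELECT CASE WHEN COUNT(*) > 1 THEN 'Y' ELSE 'N' END
            FROM t a
            WHERE a.col1 = b.col1 AND a.col2 = b.col2) AS indicator
    FROM t b
""").fetchall()

for row in rows:
    print(row)
```

Each source row comes out tagged 'Y' or 'N', which is exactly what the transformer constraint then tests.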
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
dsdesigner
Participant
Posts: 34
Joined: Thu Jul 29, 2004 1:03 pm

Post by dsdesigner »

Did you try both steps?? Step 1 identifies the values of col1 and col2 that are duplicates.

select col1, col2, count(1)
from table
group by col1, col2
having count(1) > 1
Load these into a hash file by specifying col1 and col2 as keys.

For the example you have given
a,b
a,b
x,y
y,x
c,d
c,d

this will yield
a, b, 2
c, d, 2

This you will load into a hash file, specifying col1, col2 as the keys.

In step 2
you will read the entire source. Look up against the hash file. So when you get the first a,b it will find a match in the hash file. Load it to seq2. When the second a,b comes from the source, it will again find a match in the hash file. It will also go to seq2. When x,y comes next it will not find a match in the hash file. Load it to seq1. Same for the c,d rows.

Code: Select all

             
                      a, b, 2
                      c, d, 2
                          
a,b                      |
a,b                      |
x,y                      |
y,x--------------- --- xfm ------------------seq1
c,d                         \ 
c,d                           \
                                 \
                                seq2

Thanks,
Shekar
chowmunyee
Participant
Posts: 32
Joined: Wed Dec 28, 2005 9:02 pm

Post by chowmunyee »

I'm a newbie as well.....

This is similar to my case, where my source is an ASCII file, but the size of the file is more than a million rows. I have to check for the duplicate records. Let's take the same scenario,

a,b
a,b
x,y
y,x
c,d
c,d

with this example,
a,b - 2
c,d - 2

Target - I have to generate one ASCII file which contains all the data, but with a remark on a,b - duplicate records...
column A, column B, remark
a,b duplicate records
a,b duplicate records
x,y
y,x
c,d duplicate records
c,d duplicate records

Then I have to load one row of data a,b into the target db instead of 2 rows of a,b.

Can anyone give me some advice on this? I'm concerned about performance.

thank you
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

then i have to load one row data a,b into target db instead 2 rows a,b.
The solution was given to you earlier. You need to tweak the given code.
If you had understood the given query and the job design, you might have done it by yourself.

Code: Select all

Select col1, col2, (select Decode(sign(count(*)-1),0,'','Duplicate Records') from table a where  a.col1 = b.col1 and a.col2=b.col2 group by a.col1, a.col2)  AS Remark 
from table b
The above will give you the required ascii file.
Use an aggregator and fetch the first row of each group. Or you can use stage variables to pick the first/last row of each group.
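The stage-variable technique can be sketched like this: with the input sorted on the key columns, compare each row's key to the previous one and output only the first row of each group (the row data below is illustrative):

```python
# Sketch of the "stage variables" dedup: assumes input sorted on (col1, col2);
# only the first row of each key group is kept
rows = [("a", "b"), ("a", "b"), ("c", "d"), ("c", "d"), ("x", "y"), ("y", "x")]

prev_key = None          # plays the role of a stage variable
first_per_group = []
for col1, col2 in rows:
    key = (col1, col2)
    if key != prev_key:  # key changed -> first row of a new group
        first_per_group.append((col1, col2))
    prev_key = key

print(first_per_group)
```

In the job, the same comparison lives in a stage-variable derivation and the "key changed" condition becomes the transformer constraint on the output link.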
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
chowmunyee
Participant
Posts: 32
Joined: Wed Dec 28, 2005 9:02 pm

Post by chowmunyee »

kumar_s wrote:
then i have to load one row data a,b into target db instead 2 rows a,b.
The solution was given to you earlier. You need to tweak the given code.
If you had understood the given query and the job design, you might have done it by yourself.

Code: Select all

Select col1, col2, (select Decode(sign(count(*)-1),0,'','Duplicate Records') from table a where  a.col1 = b.col1 and a.col2=b.col2 group by a.col1, a.col2)  AS Remark 
from table b
The above will give you the required ascii file.
Use an aggregator and fetch the first row of each group. Or you can use stage variables to pick the first/last row of each group.
I need to double-confirm on table a and table b... do you use an inner join, or how do you join the 2 tables?

Thank you very much kumar.... this is the way to solve my case.... I think I know how I'm going to solve it.... thank you.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Any join on the same table without any filter will give the same result, because all the records will find a match.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
jdmiceli
Premium Member
Posts: 309
Joined: Wed Feb 22, 2006 10:03 am
Location: Urbandale, IA

Possibly Oracle ignorant here...

Post by jdmiceli »

Hi all,

Doesn't Oracle have a 'DISTINCT' keyword or equivalent?

If your SQL went something like this (borrowing some of the existing stuff from previous posts):

select DISTINCT col1, col2
from table
group by col1, col2

You would then just use 'User Defined SQL' in your source stage, and the sequential file will then have what you want. Maybe I'm just missing something, so I apologize if I'm off.

Bestest!

John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services


"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"
meena
Participant
Posts: 430
Joined: Tue Sep 13, 2005 12:17 pm

Re: Possibly Oracle ignorant here...

Post by meena »

Hi jdmiceli...
We can use the DISTINCT keyword to extract rows with no duplicates. But here Amit wants to capture the rejected rows... So we do a lookup and send the rejected rows to a sequential file....
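The difference shows up in a small SQLite sketch (the table name t and sample rows are placeholders): DISTINCT returns one row per key and throws the duplicate information away, while what was asked for is every row whose key occurs more than once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (col1 TEXT, col2 TEXT)")
cur.executemany("INSERT INTO t VALUES (?, ?)",
                [("a", "b"), ("a", "b"), ("x", "y"), ("y", "x"), ("c", "d"), ("c", "d")])

# DISTINCT collapses the duplicates: one row per key, duplicates discarded
distinct_rows = cur.execute("SELECT DISTINCT col1, col2 FROM t").fetchall()

# Catching duplicates is the opposite: keep every row of a repeated key
dup_rows = cur.execute("""
    SELECT t.col1, t.col2
    FROM t
    JOIN (SELECT col1, col2 FROM t
          GROUP BY col1, col2
          HAVING COUNT(1) > 1) d
      ON t.col1 = d.col1 AND t.col2 = d.col2
""").fetchall()
```

Here distinct_rows has 4 rows (one per key) while dup_rows has the 4 duplicate rows a,b a,b c,d c,d.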
jdmiceli wrote:Hi all,

Doesn't Oracle have a 'DISTINCT' keyword or equivalent?

If your SQL went something like this (borrowing some of the existing stuff from previous posts):

select DISTINCT col1, col2
from table
group by col1, col2

You would then just use 'User Defined SQL' in your source stage, and the sequential file will then have what you want. Maybe I'm just missing something, so I apologize if I'm off.

Bestest!