Duplicate Records

chulett · Post by **chulett** » Fri Jul 28, 2006 6:50 am

Check this post for a downloadable Tech Tip that will explain how to do this. It includes pictures of the job and stage variables design, which can really help bring the methodology across. Picture. 1000 words. You know.

sb_akarmarkar · Post by **sb_akarmarkar** » Fri Jul 28, 2006 7:04 am

Hi,
Use aggregator stage add one more column count (Duplicate records) , in transformer stage if count >1 then transfer record to error table as well as target table....

I think this will help you..

Thnaks,
Anupam

praburaj · Post by **praburaj** » Fri Jul 28, 2006 7:34 am

Ur Code is working fine. it has come some data. But that is not duplicate

records. Once again I explain my condition once again.

Column1 ---------------------------- Column 2

Input Data: Dont considered -----.

Suresh ---- Intel

Raj ---- Oracle

Suresh ---- Intel

Kamal ---- Veltech

Raj ----- Oracle

Acct ----- 35

Moort ----- job

Acct ----- 35

Output:

Suresh ----- Intel

Raj ----- Oracle.

Acct ----- 35

That means duplicate records send only in to Error Table. Others

move to Good Records Table. Can u give me some other logic

.

praburaj · Post by **praburaj** » Fri Jul 28, 2006 7:48 am

Ur code is working fine. But it is did not take any duplicate Records. It

has come some rows. Once again i explain my condition.

Input Data : ( Don't Considered ------)

COLUMN1 ------------------- COLUMN 2

RAJ ------------------- INTEL

KAMAL ------------------- ORACLE

PRABU -------------------- VELTECH

RAJ -------------------- INTEL

KARTI -------------------- CTS

PRABU ------------------- VELTECH

OUTPUT DATA IN ERROR TABLE:

RAJ -------------- INTEL

PRABU -------------- VELTECH

This is the Condition. Duplicate records send in to Error Records Table

Only. Other will send to Good Records. Plz help me Some other Logic

ArndW · Post by **ArndW** » Fri Jul 28, 2006 8:06 am

You've just said "the code doesn't work" without telling anyone what happened. You also stated that you've put in a sort but the data you posted isn't sorted so none of the example or suggestions can work.
Try doing the sort first to a sequential file to ensure that this is working. Then use the transform stage variables. If it doesn't work, add the stage variables as extra columns to the output and make sure that their values are what you expect. You need to take a simple step-by-step approach to this problem.

praburaj · Post by **praburaj** » Fri Jul 28, 2006 12:13 pm

I sorted the data , using sort stage. Some data has come in to the Error

Target Stage. But it has not come all duplicate records. That data has

come like that.

ACCT CODE --------------- DEPT CODE

2345 ----------------- 450000

3456 ---------------- 450000

2345 --------------- 436789

I need output like that:

ACCT CODE ----------------- DEPT CODE

2345 -------------------- 450000

2345 ------------------- 450000

3456 ------------------- 450000

3567 ------------------- 566789

3567 ------------------ 566789.

That means If one record contain one or more duplicate records, that

all duplicate records send in to Error Target table. If one record contain

only one duplicate record, that also go to the Target table. I tried what

balajsir told. I got only a partial output. I need appropriate output.

Plz help me. How can i write condition.

Regards,

prabu

dsdesigner · Post by **dsdesigner** » Fri Jul 28, 2006 5:40 pm

This might help

http://dsxchange.com/viewtopic.php?t=10 ... b6e1c09bc8

Thanks,
Shekar

praburaj · Post by **praburaj** » Sat Jul 29, 2006 12:44 am

Your link did not show anything. It has show only usernamd and

password. can u give me the correct link.

kumar_s · Post by **kumar_s** » Sat Jul 29, 2006 12:53 am

http://dsxchange.com/viewtopic.php?t=10 ... b6e1c09bc8

Perhaps dsdesigner need to work with his sid stuff

chulett · Post by **chulett** » Sat Jul 29, 2006 6:36 am

I was wrong about the 'sid' stuff. As Kim has noted several times, it's because of a lack of the 'www' in front of the domain name in the link. Copy the shortcut and paste it in your address bar, add the www and then go.

Or click here. It's just one of the plethora of posts on the subject of Duplicates that seem to have shown up lately. And I'm sure you've already searched the forum and seen all of them already... right?

DSguru2B · Post by **DSguru2B** » Sat Jul 29, 2006 12:43 pm

To capture duplicates you need to work with stage variables and constraints. Sort the incoming data, detect the change via use of two stage variables and flag the record thats identical. Then in the constraint of the link where you plan to capture duplicates, use the flag value.

chulett · Post by **chulett** » Sat Jul 29, 2006 2:08 pm

And whaddaya know, a link to a Tech Tip at the top of the page that explains exactly how to do that!

dsdesigner · Post by **dsdesigner** » Sat Jul 29, 2006 3:01 pm

For example you have

a,b
a,b
x,y
y,x
c,d
c,d

select col, col2, count(1)
from table
group by col1, col2
having count(1) > 1.
Load these into a hash file by specifying col1 and col2 as keys.

thsi will yield
a, b, 2
c, d, 2

This you will load in a hash file specifying col1, col2 as he keys

In step 2
you will read te entire source. Look up against the hash file. So when you get the first a,b it iwll find a match in the hash file. Load it to seq2. When the second a, b comes form the source, it will agian find a match in the hash file. It will also go to seq2. When x, y comes next it will not find a match in the hash file. Load it to seq1. Same as a,b for c,d rows.

Code: Select all

              
                      a, b, 2 
                      c, d, 2 
                          
a,b                      | 
a,b                      | 
x,y                      | 
y,x--------------- --- xfm ------------------seq1 
c,d                         \ 
c,d                           \ 
                                 \ 
                                seq2

Thanks,
Shekar

dsdesigner · Post by **dsdesigner** » Sat Jul 29, 2006 11:45 pm

Alternatively you can do it in a single job using the sql below. SQL is written for oracle. Should also work for db2. Not sure of other databses.

Code: Select all

src-------xfm------seq1
              |
              |
             seq2

with duplicate_rows as
(select cola, colb, count(1) as cnt
from srcTable
group by cola, colb
having count(1)>1)
select src.cola, src.colb, coalesce(lkp.cnt,1) as num_rows
from srcTable src,
duplicate_rows lkp
where src.cola = lkp.cola(+)
and src.colb = lkp.colb(+)

For the data you have shown above this sql will give you the following o/p
colA, colB, num_rows
2345, 450000, 2
3456, 450000, 1
3567, 566789, 2

Pass this through a transformer, add a constriant to allow rows where num_rows=1. Put a reject link and capture all the other rows.

Thanks,
Shekar

DSguru2B · Post by **DSguru2B** » Sun Jul 30, 2006 12:15 am

The OP should get back to us with more updates on where exactly he/she is stuck. He has more than few different ways of going about doing it.