Duplicate Records
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 232
- Joined: Fri Sep 30, 2005 4:52 am
- Contact:
Ur Code is working fine. it has come some data. But that is not duplicate
records. Once again I explain my condition once again.
Column1 ---------------------------- Column 2
Input Data: Dont considered -----.
Suresh ---- Intel
Raj ---- Oracle
Suresh ---- Intel
Kamal ---- Veltech
Raj ----- Oracle
Acct ----- 35
Moort ----- job
Acct ----- 35
Output:
Suresh ----- Intel
Raj ----- Oracle.
Acct ----- 35
That means duplicate records send only in to Error Table. Others
move to Good Records Table. Can u give me some other logic .
records. Once again I explain my condition once again.
Column1 ---------------------------- Column 2
Input Data: Dont considered -----.
Suresh ---- Intel
Raj ---- Oracle
Suresh ---- Intel
Kamal ---- Veltech
Raj ----- Oracle
Acct ----- 35
Moort ----- job
Acct ----- 35
Output:
Suresh ----- Intel
Raj ----- Oracle.
Acct ----- 35
That means duplicate records send only in to Error Table. Others
move to Good Records Table. Can u give me some other logic .
Ur code is working fine. But it is did not take any duplicate Records. It
has come some rows. Once again i explain my condition.
Input Data : ( Don't Considered ------)
COLUMN1 ------------------- COLUMN 2
RAJ ------------------- INTEL
KAMAL ------------------- ORACLE
PRABU -------------------- VELTECH
RAJ -------------------- INTEL
KARTI -------------------- CTS
PRABU ------------------- VELTECH
OUTPUT DATA IN ERROR TABLE:
RAJ -------------- INTEL
PRABU -------------- VELTECH
This is the Condition. Duplicate records send in to Error Records Table
Only. Other will send to Good Records. Plz help me Some other Logic
has come some rows. Once again i explain my condition.
Input Data : ( Don't Considered ------)
COLUMN1 ------------------- COLUMN 2
RAJ ------------------- INTEL
KAMAL ------------------- ORACLE
PRABU -------------------- VELTECH
RAJ -------------------- INTEL
KARTI -------------------- CTS
PRABU ------------------- VELTECH
OUTPUT DATA IN ERROR TABLE:
RAJ -------------- INTEL
PRABU -------------- VELTECH
This is the Condition. Duplicate records send in to Error Records Table
Only. Other will send to Good Records. Plz help me Some other Logic
You've just said "the code doesn't work" without telling anyone what happened. You also stated that you've put in a sort but the data you posted isn't sorted so none of the example or suggestions can work.
Try doing the sort first to a sequential file to ensure that this is working. Then use the transform stage variables. If it doesn't work, add the stage variables as extra columns to the output and make sure that their values are what you expect. You need to take a simple step-by-step approach to this problem.
Try doing the sort first to a sequential file to ensure that this is working. Then use the transform stage variables. If it doesn't work, add the stage variables as extra columns to the output and make sure that their values are what you expect. You need to take a simple step-by-step approach to this problem.
<a href=http://www.worldcommunitygrid.org/team/ ... TZ9H4CGVP1 target="WCGWin">
</a>
</a>
I sorted the data , using sort stage. Some data has come in to the Error
Target Stage. But it has not come all duplicate records. That data has
come like that.
ACCT CODE --------------- DEPT CODE
2345 ----------------- 450000
3456 ---------------- 450000
2345 --------------- 436789
I need output like that:
ACCT CODE ----------------- DEPT CODE
2345 -------------------- 450000
2345 ------------------- 450000
3456 ------------------- 450000
3567 ------------------- 566789
3567 ------------------ 566789.
That means If one record contain one or more duplicate records, that
all duplicate records send in to Error Target table. If one record contain
only one duplicate record, that also go to the Target table. I tried what
balajsir told. I got only a partial output. I need appropriate output.
Plz help me. How can i write condition.
Regards,
prabu
Target Stage. But it has not come all duplicate records. That data has
come like that.
ACCT CODE --------------- DEPT CODE
2345 ----------------- 450000
3456 ---------------- 450000
2345 --------------- 436789
I need output like that:
ACCT CODE ----------------- DEPT CODE
2345 -------------------- 450000
2345 ------------------- 450000
3456 ------------------- 450000
3567 ------------------- 566789
3567 ------------------ 566789.
That means If one record contain one or more duplicate records, that
all duplicate records send in to Error Target table. If one record contain
only one duplicate record, that also go to the Target table. I tried what
balajsir told. I got only a partial output. I need appropriate output.
Plz help me. How can i write condition.
Regards,
prabu
-
- Participant
- Posts: 34
- Joined: Thu Jul 29, 2004 1:03 pm
I was wrong about the 'sid' stuff. As Kim has noted several times, it's because of a lack of the 'www' in front of the domain name in the link. Copy the shortcut and paste it in your address bar, add the www and then go.
Or click here. It's just one of the plethora of posts on the subject of Duplicates that seem to have shown up lately. And I'm sure you've already searched the forum and seen all of them already... right?
Or click here. It's just one of the plethora of posts on the subject of Duplicates that seem to have shown up lately. And I'm sure you've already searched the forum and seen all of them already... right?
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
To capture duplicates you need to work with stage variables and constraints. Sort the incoming data, detect the change via use of two stage variables and flag the record thats identical. Then in the constraint of the link where you plan to capture duplicates, use the flag value.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
-
- Participant
- Posts: 34
- Joined: Thu Jul 29, 2004 1:03 pm
For example you have
a,b
a,b
x,y
y,x
c,d
c,d
select col, col2, count(1)
from table
group by col1, col2
having count(1) > 1.
Load these into a hash file by specifying col1 and col2 as keys.
thsi will yield
a, b, 2
c, d, 2
This you will load in a hash file specifying col1, col2 as he keys
In step 2
you will read te entire source. Look up against the hash file. So when you get the first a,b it iwll find a match in the hash file. Load it to seq2. When the second a, b comes form the source, it will agian find a match in the hash file. It will also go to seq2. When x, y comes next it will not find a match in the hash file. Load it to seq1. Same as a,b for c,d rows.
Thanks,
Shekar
a,b
a,b
x,y
y,x
c,d
c,d
select col, col2, count(1)
from table
group by col1, col2
having count(1) > 1.
Load these into a hash file by specifying col1 and col2 as keys.
thsi will yield
a, b, 2
c, d, 2
This you will load in a hash file specifying col1, col2 as he keys
In step 2
you will read te entire source. Look up against the hash file. So when you get the first a,b it iwll find a match in the hash file. Load it to seq2. When the second a, b comes form the source, it will agian find a match in the hash file. It will also go to seq2. When x, y comes next it will not find a match in the hash file. Load it to seq1. Same as a,b for c,d rows.
Code: Select all
a, b, 2
c, d, 2
a,b |
a,b |
x,y |
y,x--------------- --- xfm ------------------seq1
c,d \
c,d \
\
seq2
Shekar
-
- Participant
- Posts: 34
- Joined: Thu Jul 29, 2004 1:03 pm
Alternatively you can do it in a single job using the sql below. SQL is written for oracle. Should also work for db2. Not sure of other databses.
with duplicate_rows as
(select cola, colb, count(1) as cnt
from srcTable
group by cola, colb
having count(1)>1)
select src.cola, src.colb, coalesce(lkp.cnt,1) as num_rows
from srcTable src,
duplicate_rows lkp
where src.cola = lkp.cola(+)
and src.colb = lkp.colb(+)
For the data you have shown above this sql will give you the following o/p
colA, colB, num_rows
2345, 450000, 2
3456, 450000, 1
3567, 566789, 2
Pass this through a transformer, add a constriant to allow rows where num_rows=1. Put a reject link and capture all the other rows.
Thanks,
Shekar
Code: Select all
src-------xfm------seq1
|
|
seq2
(select cola, colb, count(1) as cnt
from srcTable
group by cola, colb
having count(1)>1)
select src.cola, src.colb, coalesce(lkp.cnt,1) as num_rows
from srcTable src,
duplicate_rows lkp
where src.cola = lkp.cola(+)
and src.colb = lkp.colb(+)
For the data you have shown above this sql will give you the following o/p
colA, colB, num_rows
2345, 450000, 2
3456, 450000, 1
3567, 566789, 2
Pass this through a transformer, add a constriant to allow rows where num_rows=1. Put a reject link and capture all the other rows.
Thanks,
Shekar