Dealing with duplicates
Hello there,
I am trying to remove duplicates from a table with millions of rows. I am using a hash stage to remove the duplicates and collect the unique rows. But, I also want to capture the rows which actually are duplicated in the original table.
Can anyone explain to me how to collect the duplicate rows before writing the unique rows into the hash file?
Thanx in advance.
-Yamini
                 HASH
                   |
                   V
Input Stage ---> XFR ---> HASH
                   |
                   +---> SEQ
Not(LookupLinkName.NOTFOUND)
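Presumably (a reading of the diagram, not stated explicitly in the post) this constraint goes on the link to SEQ, which catches the duplicates, while the link writing first occurrences back to the hashed file gets the inverse test:

* Link to HASH - first occurrence of each key:
LookupLinkName.NOTFOUND

* Link to SEQ - key already seen (duplicate):
Not(LookupLinkName.NOTFOUND)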
Chuck Smith
www.anotheritco.com
How about 2 jobs, one for dupes and the other uniques?
Job_Get_Dupes:
OCI --> XFM --> SEQ
Write SQL like
select col_a, col_b, col_c, col_d, col_e, col_f, unique_identifier_column
from your_table
where unique_identifier_column in
(select unique_identifier_column
from your_table
group by unique_identifier_column
having COUNT(unique_identifier_column) > 1
)
Job_Get_Uniques:
OCI --> XFM --> SEQ
Write SQL like
select col_a, col_b, col_c, col_d, col_e, col_f, unique_identifier_column
from your_table
where unique_identifier_column in
(select unique_identifier_column
from your_table
group by unique_identifier_column
having COUNT(unique_identifier_column) = 1
)
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Hi Chuck,
I tried developing such a job. When I ran it, there were 2201 rows in the source, and they were all written to the output hash stage and also to the sequential file.
I had created the hashed file for the lookup, populating only the key columns' data. I used the same hashed file that was used for the lookup as the output hash file.
Do you think I should check the reject rows box for the output hashed file stage in the constraint of the transformer?
Thanks,
Dj
Hi Bland,
There are 8 key columns here. Do you think I should use those 8 for col_a, col_b, and so on?
And what do you think the unique_identifier_column should be?
Thanks,
Dj
I didn't use the word KEY because KEY implies uniqueness. If you have duplicates, how could you have duplicate KEY values? I assumed you are using a different column to uniquely identify rows, thus you could have duplicates.
Since you have 8 columns that uniquely identify a row, your exercise is to count the occurrences for each row and act accordingly.
Your original email talks about 3.3 million rows, so you have to deal with performance. If I were you, I'd switch gears and do this:
Build a job OCI --> XFM --> HASH with the following SQL:
SELECT key_1, key_2, key_3, key_4, key_5, key_6, key_7, key_8, COUNT(1)
FROM yourtable
GROUP BY key_1, key_2, key_3, key_4, key_5, key_6, key_7, key_8
Then build a job OCI --> XFM/reference to above HASH --> SEQ
Select the whole source table and do a reference lookup against the HASH, matching on your key columns. If the COUNT column = 1 the row is unique; if it's > 1 it is not, and you can act accordingly.
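For instance (a sketch with placeholder names, not from the original post), if the reference link is called HashLkp and the count is loaded into the hashed file as a column named ROW_COUNT, the transformer constraints could be:

* Link for unique rows:
HashLkp.ROW_COUNT = 1

* Link for duplicated rows:
HashLkp.ROW_COUNT > 1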
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
You could add this as the constraint on your output link to the hash file:

LookupLinkName.NOTFOUND
Chuck Smith
www.anotheritco.com
Hello DJ,
I used the RowProcCompareWithPreviousValue routine to capture the duplicates. The only condition to meet here is that the input file must be sorted on the key column.
Input File --> SEQ1 and SEQ2, where SEQ2 is the reject file.
In the constraint of the SEQ1 file, use:
RowProcCompareWithPreviousValue(Linkname.KeyColumnName) = 1
I hope this will help you.
-Yamini
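As an aside (an assumption, not from Yamini's post): since the routine takes a single value, a multi-column key could be handled by concatenating the columns into one argument, with a delimiter so adjacent values can't run together. Key1, Key2, and Key3 are placeholder column names:

* Placeholder sketch for a 3-column key; extend the same way for 8 columns:
RowProcCompareWithPreviousValue(Linkname.Key1 : '|' : Linkname.Key2 : '|' : Linkname.Key3) = 1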
Hi
I have found another solution.
I created the lookup hashed file by first loading just one row with the keys into it.
Then I used the same lookup file in the job as follows:
         Lkp    UpdHash (copy of Lkp)
            \  /
SEQ(Src) --> TRN --> SEQ (Target)
              |
              v
        SEQ (log for dups)
(Both the Lkp and UpdHash links connect to the transformer; I didn't know how to adjust the drawing.)
See to it that the lookup and the hash file used in the job have the same file name. Check only the 'Create File' and 'Clear before writing' boxes on the UpdHash stage.
When you run the job, it performs a lookup and writes the rows that are not already in the hash file to UpdHash, and the ones that match (duplicates) to SEQ (log for dups).
I have done this for 3-4 source files to check for dups, and it works.
Thanks to everyone for your suggestions,
-DJ
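A sketch of the transformer constraints this design implies; Lkp is the reference link name from the diagram, and the expressions are an assumption rather than something DJ posted:

* Link to UpdHash and SEQ (Target) - first occurrence of each key:
Lkp.NOTFOUND

* Link to SEQ (log for dups) - key already seen:
Not(Lkp.NOTFOUND)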