Dealing with duplicates

Post questions here relating to DataStage Server Edition, covering areas such as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

yaminids
Premium Member
Posts: 387
Joined: Mon Oct 18, 2004 1:04 pm

Dealing with duplicates

Post by yaminids »

Hello there,

I am trying to remove duplicates from a table with millions of rows. I am using a hash stage to remove the duplicates and collect the unique rows, but I also want to capture the rows which are duplicated in the original table.
Can anyone explain how to collect the duplicate rows before writing the unique rows into the hash file?

Thanx in advance.
-Yamini
djdinjan
Participant
Posts: 36
Joined: Fri Jul 23, 2004 9:35 am

Post by djdinjan »

I am dealing with the same issue too.
I ran a job with 3.3M rows and found that there are 500+ duplicates.
Is there a way to design a job that captures the duplicate rows in a sequential file?

Thanks,
DJ
chucksmith
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA
Contact:

Post by chucksmith »

Code:

                 HASH
                  |
                  V
Input Stage ---> XFR ---> HASH
                  |
                  +---> SEQ
Both hash stages are the same file. Put a constraint of

Code:

Not(LookupLinkName.NOTFOUND)
on the output link to the sequential stage, and disable write cache on the output hash file stage. :wink:
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

How about 2 jobs, one for dupes and the other for uniques?

Job_Get_Dupes:

OCI --> XFM --> SEQ

Write SQL like

Code:

select col_a, col_b, col_c, col_d, col_e, col_f, unique_identifier_column
from your_table
where unique_identifier_column in
   (select unique_identifier_column
    from your_table
    group by unique_identifier_column
    having COUNT(unique_identifier_column) > 1
   )
Job_Get_Uniques:

OCI --> XFM --> SEQ

Write SQL like

Code:

select col_a, col_b, col_c, col_d, col_e, col_f, unique_identifier_column
from your_table
where unique_identifier_column in
   (select unique_identifier_column
    from your_table
    group by unique_identifier_column
    having COUNT(unique_identifier_column) = 1
   )
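To see what each query does, here is a throwaway example (demo_table and its values are made up):

Code:

-- 101 appears once, 102 appears twice
create table demo_table (unique_identifier_column number, col_a varchar2(10));
insert into demo_table values (101, 'a');
insert into demo_table values (102, 'b');
insert into demo_table values (102, 'c');

-- the "dupes" query returns both 102 rows (every copy of a duplicated value)
select * from demo_table
 where unique_identifier_column in
       (select unique_identifier_column from demo_table
         group by unique_identifier_column
        having count(unique_identifier_column) > 1);

-- the "uniques" query returns only the 101 row
select * from demo_table
 where unique_identifier_column in
       (select unique_identifier_column from demo_table
         group by unique_identifier_column
        having count(unique_identifier_column) = 1);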
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
djdinjan
Participant
Posts: 36
Joined: Fri Jul 23, 2004 9:35 am

Post by djdinjan »

Hi Chuck,

I tried developing such a job. I ran it with 2201 rows in the source, and they were all written to the output hash stage and also to the sequential file.

I created the hash file for the lookup, populating only the key columns, and used the same hash file that was used for the lookup as the output hash file.

Do you think I should check the reject row box for the output hash file link in the transformer constraints?

Thanks,
Dj
djdinjan
Participant
Posts: 36
Joined: Fri Jul 23, 2004 9:35 am

Post by djdinjan »

Hi Bland,

There are 8 key columns here. Do you think I should use those 8 for col_a, col_b, etc.?
And what do you think the unique_identifier_column should be?

Thanks,
Dj
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

I didn't use the word KEY because KEY implies uniqueness. If you have duplicates, how could you have duplicate KEY values? I assumed you were using a different column to uniquely identify rows, which is how you could have duplicates.

Since you have 8 columns that together identify a row, your exercise is to count the occurrences of each combination and act accordingly.

Your original post mentions 3.3 million rows, so you have to deal with performance. If I were you, I'd switch gears and do this:

Build a job OCI --> XFM --> HASH with the following SQL:

Code:

SELECT key_1, key_2, key_3, key_4, key_5, key_6, key_7, key_8, count(1)
from yourtable
group by key_1, key_2, key_3, key_4, key_5, key_6, key_7, key_8

Then build a job OCI --> XFM/reference to above HASH --> SEQ

Select the whole source table and do a reference lookup against the HASH, matching on your key columns. If the COUNT column is 1 the row is unique; if it is > 1 the row is a duplicate, and you can act accordingly.
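For the constraints in that second job, something like this should work, where CntLkp is whatever you name the reference link and DUP_COUNT is whatever you call the count(1) column in the hash file metadata (both names are just placeholders). On the output link for unique rows:

Code:

CntLkp.DUP_COUNT = 1

and on the output link for duplicate rows:

Code:

CntLkp.DUP_COUNT > 1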
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
djdinjan
Participant
Posts: 36
Joined: Fri Jul 23, 2004 9:35 am

Post by djdinjan »

Thank you Bland.
I shall try that.

-Dj
chucksmith
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA
Contact:

Post by chucksmith »

You could add

Code:

LookupLinkName.NOTFOUND
as the constraint on your output link to the hash file.
PilotBaha
Premium Member
Posts: 202
Joined: Mon Jan 12, 2004 8:05 pm

Post by PilotBaha »

How about using the good ole QualityStage/Integrity? :D
yaminids
Premium Member
Posts: 387
Joined: Mon Oct 18, 2004 1:04 pm

Dealing with duplicates

Post by yaminids »

Hello DJ,

I used the RowProcCompareWithPreviousValue routine to capture the duplicates. The only condition to meet here is that the input file must be sorted on the key column.

Input File -> SEQ1 and SEQ2, where SEQ2 is the reject file.

In the constraint of the SEQ1 link, use
RowProcCompareWithPreviousValue(Linkname.KeyColumnName) = 1
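If your source is a database stage, the simplest way to guarantee the sort is in the SELECT itself, something like this (your_table and KeyColumnName are placeholders):

Code:

select *
from your_table
order by KeyColumnName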

I hope this will help you.
-Yamini
djdinjan
Participant
Posts: 36
Joined: Fri Jul 23, 2004 9:35 am

Post by djdinjan »

Hi

I have found another solution.

I created the lookup hash file by first running only 1 row of the keys into it.

Used the same lookup file in the job as follows:

         Lkp     UpdHash (copy of Lkp)
            \   /
SEQ(Src) --> TRN --> SEQ (Target)
              |
              v
       SEQ(log for dups)

(Lkp and UpdHash both connect to the transformer.)

See to it that the lookup and the hash file used in the job have the same file name. Check only the 'Create File' and 'Clear before writing' boxes in the UpdHash stage.

When you run the job, it performs a lookup and writes the rows that are not found to UpdHash and the target, and the ones that match the lookup to SEQ (log for dups).
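For reference, the constraints that give this behaviour could be written as follows, assuming the reference link into TRN is named Lkp. On the links to SEQ (Target) and UpdHash:

Code:

Lkp.NOTFOUND

and on the link to SEQ (log for dups):

Code:

Not(Lkp.NOTFOUND)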

I have done this for 3-4 source files to check for dups and it works.

Thanks to everyone for your suggestions,
-DJ