Dealing with duplicates

Post questions here relating to DataStage Server Edition, covering areas such as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

yaminids
Premium Member
Posts: 387
Joined: Mon Oct 18, 2004 1:04 pm

Dealing with duplicates

Post by yaminids »

Hello there,

I am trying to remove duplicates from a table with millions of rows. I am using a hash stage to remove the duplicates and collect the unique rows, but I also want to capture the rows which are duplicated in the original table.
Can anyone explain how to collect the duplicate rows before writing the unique rows into the hash file?

Thanx in advance.
-Yamini
djdinjan
Participant
Posts: 36
Joined: Fri Jul 23, 2004 9:35 am

Post by djdinjan »

I am dealing with the same issue too.
I ran a job with 3.3M rows and found that there are 500+ duplicates.
Is there a way to design a job that captures the duplicate rows in a sequential file?

Thanks,
DJ
chucksmith
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA
Contact:

Post by chucksmith »

Code:

                 HASH
                  |
                  V
Input Stage ---> XFR ---> HASH
                  |
                  +---> SEQ
Both hash stages are the same file. Put a constraint of

Code:

Not(LookupLinkName.NOTFOUND)
on the output link to the sequential stage, and disable write cache on the output hash file stage. :wink:
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

How about 2 jobs, one for dupes and the other for uniques?

Job_Get_Dupes:

OCI --> XFM --> SEQ

Write SQL like

Code:

select col_a, col_b, col_c, col_d, col_e, col_f, unique_identifier_column
from your_table
where unique_identifier_column in
   (select unique_identifier_column
    from your_table
    group by unique_identifier_column
    having COUNT(unique_identifier_column) > 1
   )
Job_Get_Uniques:

OCI --> XFM --> SEQ

Write SQL like

Code:

select col_a, col_b, col_c, col_d, col_e, col_f, unique_identifier_column
from your_table
where unique_identifier_column in
   (select unique_identifier_column
    from your_table
    group by unique_identifier_column
    having COUNT(unique_identifier_column) = 1
   )
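To see what each query does, here is a throwaway example (demo_table and its values are made up):

Code:

-- 101 appears once, 102 appears twice
create table demo_table (unique_identifier_column number, col_a varchar2(10));
insert into demo_table values (101, 'a');
insert into demo_table values (102, 'b');
insert into demo_table values (102, 'c');

-- the "dupes" query returns both 102 rows (every copy of a duplicated value)
select * from demo_table
 where unique_identifier_column in
       (select unique_identifier_column from demo_table
         group by unique_identifier_column
        having count(unique_identifier_column) > 1);

-- the "uniques" query returns only the 101 row
select * from demo_table
 where unique_identifier_column in
       (select unique_identifier_column from demo_table
         group by unique_identifier_column
        having count(unique_identifier_column) = 1);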
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
djdinjan
Participant
Posts: 36
Joined: Fri Jul 23, 2004 9:35 am

Post by djdinjan »

Hi Chuck,

I tried developing such a job. I ran it with 2201 rows in the source, and they were all written to the output hash stage and also to the sequential file.

I created the hash file for the lookup, populating only the key columns, and used the same hash file that was used for the lookup as the output hash file.

Do you think I should check the reject row box for the output hash file link in the transformer constraints?

Thanks,
Dj
djdinjan
Participant
Posts: 36
Joined: Fri Jul 23, 2004 9:35 am

Post by djdinjan »

Hi Bland,

There are 8 key columns here. Do you think I should use those 8 for col_a, col_b, etc.?
And what do you think the unique_identifier_column should be?

Thanks,
Dj
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

I didn't use the word KEY because KEY implies uniqueness. If you have duplicates, how could you have duplicate KEY values? I assumed you were using a different column to uniquely identify rows, which is how you could have duplicates.

Since you have 8 columns that together identify a row, your exercise is to count the occurrences of each combination and act accordingly.

Your original post mentions 3.3 million rows, so you have to deal with performance. If I were you, I'd switch gears and do this:

Build a job OCI --> XFM --> HASH with the following SQL:

Code:

SELECT key_1, key_2, key_3, key_4, key_5, key_6, key_7, key_8, count(1)
from yourtable
group by key_1, key_2, key_3, key_4, key_5, key_6, key_7, key_8

Then build a job OCI --> XFM/reference to above HASH --> SEQ

Select the whole source table and do a reference lookup against the HASH, matching on your key columns. If the COUNT column is 1 the row is unique; if it is > 1 the row is a duplicate, and you can act accordingly.
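For the constraints in that second job, something like this should work, where CntLkp is whatever you name the reference link and DUP_COUNT is whatever you call the count(1) column in the hash file metadata (both names are just placeholders). On the output link for unique rows:

Code:

CntLkp.DUP_COUNT = 1

and on the output link for duplicate rows:

Code:

CntLkp.DUP_COUNT > 1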
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
djdinjan
Participant
Posts: 36
Joined: Fri Jul 23, 2004 9:35 am

Post by djdinjan »

Thank you Bland.
I shall try that.

-Dj
chucksmith
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA
Contact:

Post by chucksmith »

You could add

Code:

LookupLinkName.NOTFOUND
as the constraint on your output link to the hash file.
PilotBaha
Premium Member
Posts: 202
Joined: Mon Jan 12, 2004 8:05 pm

Post by PilotBaha »

How about using the good ole QualityStage/Integrity? :D
yaminids
Premium Member
Posts: 387
Joined: Mon Oct 18, 2004 1:04 pm

Dealing with duplicates

Post by yaminids »

Hello DJ,

I used the RowProcCompareWithPreviousValue routine to capture the duplicates. The only condition to meet here is that the input file must be sorted on the key column.

Input File -> SEQ1 and SEQ2, where SEQ2 is the reject file.

In the constraint of the SEQ1 link, use
RowProcCompareWithPreviousValue(Linkname.KeyColumnName) = 1
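If your source is a database stage, the simplest way to guarantee the sort is in the SELECT itself, something like this (your_table and KeyColumnName are placeholders):

Code:

select *
from your_table
order by KeyColumnName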

I hope this will help you.
-Yamini
djdinjan
Participant
Posts: 36
Joined: Fri Jul 23, 2004 9:35 am

Post by djdinjan »

Hi

I have found another solution.

I created the lookup hash file by first running only 1 row of the keys into it.

Used the same lookup file in the job as follows:

         Lkp     UpdHash (copy of Lkp)
            \   /
SEQ(Src) --> TRN --> SEQ (Target)
              |
              v
       SEQ(log for dups)

(Lkp and UpdHash both connect to the transformer.)

See to it that the lookup and the hash file used in the job have the same file name. Check only the 'Create File' and 'Clear before writing' boxes in the UpdHash stage.

When you run the job, it performs a lookup and writes the rows that are not found to UpdHash and the target, and the ones that match the lookup to SEQ (log for dups).
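For reference, the constraints that give this behaviour could be written as follows, assuming the reference link into TRN is named Lkp. On the links to SEQ (Target) and UpdHash:

Code:

Lkp.NOTFOUND

and on the link to SEQ (log for dups):

Code:

Not(Lkp.NOTFOUND)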

I have done this for 3-4 source files to check for dups and it works.

Thanks to everyone for your suggestions,
-DJ