Counting Duplicates

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
martin
Participant
Posts: 67
Joined: Fri Jul 30, 2004 7:19 am
Location: NewJersy

Counting Duplicates

Post by martin »

I am using the RowProcCompareWithPreviousValue function to count duplicates on a key field, and the job is running slow.

Is there any other logic to count duplicates?

Thanks
Martin
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Do a search. There have been several discussions on using stage variables.
Mamu Kim
martin
Participant
Posts: 67
Joined: Fri Jul 30, 2004 7:19 am
Location: NewJersy

Post by martin »

Hi Kduke,
I tried, but in vain... I couldn't find any. If you have any, please forward them.
Thanks
Martin
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Describe what you are trying to do, then maybe we can suggest something. Duplicate rows, duplicate primary keys, what is your problem?
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Mamu Kim
martin
Participant
Posts: 67
Joined: Fri Jul 30, 2004 7:19 am
Location: NewJersy

Post by martin »

I am trying to count duplicate rows based on composite primary keys.

Thanks
Martin
martin
Participant
Posts: 67
Joined: Fri Jul 30, 2004 7:19 am
Location: NewJersy

Post by martin »

svRowCount = If RowProcCompareWithPreviousValue(Col1 : Col2 : Col3 : Col4) = 0 Then "Y" Else "N"

In the next Transformer stage I am collecting all the "N"s and counting them.
This is working fine, but the job is processing 30 rows per second.

Without this, I am able to process 300 rows per second.

Would someone suggest alternate logic for this?

Thanks
Martin
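[A Python sketch, not DataStage code, of the compare-with-previous approach described above: over key-sorted rows, a row counts as a duplicate when its composite key equals the previous row's key. The field names are illustrative only.]

```python
def count_duplicates(rows, key_fields):
    """Count rows whose composite key repeats the previous row's key.

    Assumes `rows` is already sorted on `key_fields`, as in the job.
    """
    prev_key = None
    duplicates = 0
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key == prev_key:          # same role as the "N" flag in the post
            duplicates += 1
        prev_key = key
    return duplicates

rows = [
    {"col1": "A", "col2": 1},
    {"col1": "A", "col2": 1},   # duplicate of the previous composite key
    {"col1": "B", "col2": 2},
]
print(count_duplicates(rows, ["col1", "col2"]))  # 1
```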
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Yes, but is the data in a text file, a hash file, or a table?

You state duplicate composite primary keys, which by definition can't be duplicated in a table. In a hash file, you would only have one occurrence, the last occurrence written to the hash file under that primary key.

So, I'm guessing that your data is in a text file. But, you haven't made this clear.

So, proceeding with this assumption, I move on to the next assumption that your data is unsorted. This means that the duplicates do not run back to back.

So, stage variables are only useful if the data is sorted.

Now, I don't know your volume, so I can give one solution for low row counts and another for high row counts. Let's start with low row count. Run your source file into an aggregator generating a count for each primary key.

We can cover "high" row counts after a little more information.
Kenneth Bland

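[A sketch of the Aggregator idea in Python, not the DataStage stage itself: group the key-sorted input and emit a count per composite primary key; any key whose count is greater than 1 is duplicated. The field names are made up for illustration.]

```python
from itertools import groupby

def key_counts(rows, key_fields):
    """Return {composite_key: row_count}; assumes rows are sorted on key_fields,
    which is also what groupby (and the sorted-input Aggregator) requires."""
    keyfn = lambda row: tuple(row[f] for f in key_fields)
    return {key: sum(1 for _ in group) for key, group in groupby(rows, key=keyfn)}

rows = [
    {"acct": "A", "seq": 1},
    {"acct": "A", "seq": 1},
    {"acct": "B", "seq": 2},
]
counts = key_counts(rows, ["acct", "seq"])
print(counts)                               # {('A', 1): 2, ('B', 2): 1}
print(sum(c - 1 for c in counts.values()))  # 1 duplicate row in total
```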
martin
Participant
Posts: 67
Joined: Fri Jul 30, 2004 7:19 am
Location: NewJersy

Post by martin »

My sources are from a CFF (Complex Flat File) stage.
martin
Participant
Posts: 67
Joined: Fri Jul 30, 2004 7:19 am
Location: NewJersy

Post by martin »

Yes, I am sorting the incoming data on the key fields. Then I have a hashed file lookup, an ODBC lookup in the next Transformer stage, then in the next Transformer stage I am counting duplicate rows, and finally loading SQL Server.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

What are you doing with the duplicates, are you just taking the last one? Or are you "rolling" them into a final row (meaning insert then update, update, update)? What is your volume relative to your environment (small, medium, large, extreme)?
Kenneth Bland

martin
Participant
Posts: 67
Joined: Fri Jul 30, 2004 7:19 am
Location: NewJersy

Post by martin »

I am counting duplicates to build a control report.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Why not use the Aggregator stage? If your data is sorted then the Aggregator is optimized for that activity.
Kenneth Bland

vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

Counting the duplicates from a stage variable is very easy, you don't need the RowProcCompareWithPreviousValue function, you can do it with just a couple of stage variables as long as your input data is sorted. The hard part is putting this count into a control report. You have no way of knowing when you output the count, a DataStage transformer doesn't have a "last row" flag so you cannot easily write a final stage variable counter out to a report.

I think you are better off writing your duplicate rows to a duplicate file, either the entire row or just the primary key of the record, and then using a standard documentation routine to do link counts on the job. This will tell you how many input rows there were, how many went down the duplicate link and how many were successfully written to SQL Server.

To identify a duplicate, remember that stage variables are derived from top to bottom, so the ordering of stage variables lets you set up simple comparison logic. The input data needs to be sorted by the key fields for this to work:

svNewKey : input.f1:input.f2:input.f3:input.f4
svIsDuplicate : svNewKey = svOldKey
svOldKey : svNewKey

In your constraint you can send duplicates to your duplicate file and non-duplicates to your SQL Server link. This will run much faster than what you have now, as you are not writing as many rows to SQL Server.
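[A Python sketch, not DS Basic, of the three stage variables above: walk the key-sorted rows, flag a row as duplicate when its key equals the previous key, and route it to a duplicates stream instead of the SQL Server stream. The resulting stream lengths are the link counts for the control report.]

```python
def split_duplicates(rows, key_fields):
    """Split key-sorted rows into (clean, duplicates), mirroring the
    constraint that routes rows to the duplicate file vs. the SQL Server link."""
    old_key = None                                   # plays the role of svOldKey
    clean, dups = [], []
    for row in rows:
        new_key = tuple(row[f] for f in key_fields)  # svNewKey
        if new_key == old_key:                       # svIsDuplicate
            dups.append(row)
        else:
            clean.append(row)
        old_key = new_key                            # carry key to the next row
    return clean, dups

rows = [{"k": "A"}, {"k": "A"}, {"k": "B"}]
clean, dups = split_duplicates(rows, ["k"])
print(len(clean), len(dups))  # 2 1
```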
martin
Participant
Posts: 67
Joined: Fri Jul 30, 2004 7:19 am
Location: NewJersy

Post by martin »

Hi vmcburney, thanks for the response.

svNewKey : input.f1:input.f2:input.f3:input.f4
svIsDuplicate : svNewKey = svOldKey
svOldKey : svNewKey

I want to keep a count in svIsDuplicate:

svIsDuplicate : If svNewKey = svOldKey Then +svIsDuplicate Else svIsDuplicate

Will this work?
Thanks
Martin
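[A hedged Python sketch of the running-count idea in the question above: one counter that increments whenever the key repeats and is carried forward otherwise. In DS Basic terms the increment would normally be written `svIsDuplicate + 1` rather than `+svIsDuplicate`, with the stage variable given an initial value of 0.]

```python
def running_duplicate_count(keys):
    """Total duplicates over key-sorted input, as a single running counter."""
    old_key = None
    count = 0                                        # initial value, as svIsDuplicate = 0
    for key in keys:
        count = count + 1 if key == old_key else count
        old_key = key
    return count

print(running_duplicate_count(["A", "A", "A", "B", "B", "C"]))  # 3
```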
Post Reply