Counting Duplicates
I am using the RowProcCompareWithPreviousValue function to count duplicates on a key field, and the job is running slowly.
Is there any other logic to count duplicates?
Thanks
Martin
Describe what you are trying to do, then maybe we can suggest something. Duplicate rows, duplicate primary keys, what is your problem?
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
svRowCount = If RowProcCompareWithPreviousValue(Col1 : Col2 : Col3 : Col4) = 0 Then "Y" Else "N"
In the next Transformer stage I am collecting all the "N"s and counting them.
This is working fine, but the job is processing 30 rows per second.
Without this logic I am able to process 300 rows per second.
Would someone suggest alternate logic for this?
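Roughly, in Python terms, this is the logic (purely illustrative: the real column names are stood in by col1..col4, and the function is assumed to return 0 when the key matches the previous row):

def flag_rows(rows):
    # Stage 1: flag "Y" when the key repeats the previous row, else "N".
    prev_key = None
    for row in rows:
        key = (row["col1"], row["col2"], row["col3"], row["col4"])
        yield row, ("Y" if key == prev_key else "N")
        prev_key = key

def count_flags(flagged):
    # Stage 2: collect the flags and count each value.
    counts = {"Y": 0, "N": 0}
    for _, flag in flagged:
        counts[flag] += 1
    return counts

rows = [{"col1": 1, "col2": "A", "col3": "x", "col4": 10},
        {"col1": 1, "col2": "A", "col3": "x", "col4": 10},
        {"col1": 2, "col2": "B", "col3": "y", "col4": 20}]
print(count_flags(flag_rows(rows)))   # {'Y': 1, 'N': 2}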
Thanks
Martin
Yes, but is the data in a text file, a hash file, or a table?
You state duplicate composite primary keys, which by definition can't be duplicated in a table. In a hash file, you would only have one occurrence, the last occurrence written to the hash file under that primary key.
So I'm guessing that your data is in a text file, but you haven't made this clear.
Proceeding with that assumption, I move on to the next assumption: that your data is unsorted. This means that the duplicates do not run back to back.
So stage variables are only useful if the data is sorted.
Now, I don't know your volume, so I can give one solution for low row counts and another for high row counts. Let's start with the low row count case: run your source file into an Aggregator, generating a count for each primary key.
We can cover "high" row counts after a little more information.
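In Python terms the Aggregator idea is just a group-and-count. A minimal sketch, assuming an unsorted pipe-delimited text file (shown here as an in-memory list) whose first three fields form the composite primary key; the layout and the key positions are assumptions:

from collections import Counter

source_lines = [
    "1|A|x|100",
    "2|B|y|200",
    "1|A|x|150",   # same composite key as the first line
]

# "Aggregator": count occurrences per composite key; no sort is required.
counts = Counter(tuple(line.split("|")[:3]) for line in source_lines)

duplicate_keys = {key: n for key, n in counts.items() if n > 1}
print(duplicate_keys)                                  # {('1', 'A', 'x'): 2}
print(sum(n - 1 for n in duplicate_keys.values()))     # duplicate rows: 1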
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
What are you doing with the duplicates? Are you just taking the last one, or are you "rolling" them into a final row (meaning insert, then update, update, update)? What is your volume relative to your environment (small, medium, large, extreme)?
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Why not use the Aggregator stage? If your data is sorted then the Aggregator is optimized for that activity.
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Counting the duplicates with stage variables is very easy; you don't need the RowProcCompareWithPreviousValue function, and you can do it with just a couple of stage variables as long as your input data is sorted. The hard part is putting this count into a control report. You have no way of knowing when to output the count: a DataStage transformer doesn't have a "last row" flag, so you cannot easily write a final stage variable counter out to a report.
I think you are better off writing your duplicate rows to a duplicate file, either the entire row or just the primary key of the record, and then using a standard documentation routine to do link counts on the job. This will tell you how many input rows there were, how many went down the duplicate link, and how many were successfully written to SQL Server.
To identify a duplicate, remember that stage variables are derived from top to bottom, so the ordering of the stage variables lets you set up simple comparison logic. The input data needs to be sorted by the key fields for this to work:
svNewKey : input.f1:input.f2:input.f3:input.f4
svIsDuplicate : svNewKey = svOldKey
svOldKey : svNewKey
In your constraint you can send duplicates to your duplicate file and non-duplicates to your SQL Server link. This will run much faster than what you have now, as you are not writing as many rows to SQL Server.
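For what it's worth, here is the same top-to-bottom logic as a small Python sketch (illustrative only: the field names f1..f4 carry over from above, the link names are made up, and the input must already be sorted by the key fields):

def split_duplicates(sorted_rows):
    # Mirrors svNewKey / svIsDuplicate / svOldKey, evaluated top to bottom.
    sv_old_key = None
    for row in sorted_rows:
        sv_new_key = (row["f1"], row["f2"], row["f3"], row["f4"])
        sv_is_duplicate = (sv_new_key == sv_old_key)
        sv_old_key = sv_new_key
        yield row, sv_is_duplicate

sorted_rows = [{"f1": 1, "f2": "A", "f3": "x", "f4": 9},
               {"f1": 1, "f2": "A", "f3": "x", "f4": 9},
               {"f1": 2, "f2": "B", "f3": "y", "f4": 7}]

duplicate_file, sql_server_link = [], []
for row, is_dup in split_duplicates(sorted_rows):
    (duplicate_file if is_dup else sql_server_link).append(row)

print(len(duplicate_file), "duplicates,", len(sql_server_link), "rows to SQL Server")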
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn