Counting Duplicates
I am using the RowProcCompareWithPreviousValue function to count duplicates on a key field, and the job is running slowly.
Is there any other logic to count duplicates?
Thanks
Martin
Describe what you are trying to do, then maybe we can suggest something. Duplicate rows, duplicate primary keys, what is your problem?
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
svRowCount = If RowProcCompareWithPreviousValue(Col1 : Col2 : Col3 : Col4) = 0 Then "Y" Else "N"
In the next Transformer stage I am collecting all the "N"s and counting them.
This is working fine, but the job is processing 30 rows per second.
Without this logic I am able to process 300 rows per second.
Would someone suggest alternate logic for this?
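Roughly, in Python terms, this is the logic (purely illustrative: the real column names are stood in by col1..col4, and the function is assumed to return 0 when the key matches the previous row):

def flag_rows(rows):
    # Stage 1: flag "Y" when the key repeats the previous row, else "N".
    prev_key = None
    for row in rows:
        key = (row["col1"], row["col2"], row["col3"], row["col4"])
        yield row, ("Y" if key == prev_key else "N")
        prev_key = key

def count_flags(flagged):
    # Stage 2: collect the flags and count each value.
    counts = {"Y": 0, "N": 0}
    for _, flag in flagged:
        counts[flag] += 1
    return counts

rows = [{"col1": 1, "col2": "A", "col3": "x", "col4": 10},
        {"col1": 1, "col2": "A", "col3": "x", "col4": 10},
        {"col1": 2, "col2": "B", "col3": "y", "col4": 20}]
print(count_flags(flag_rows(rows)))   # {'Y': 1, 'N': 2}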
Thanks
Martin
Yes, but is the data in a text file, a hash file, or a table?
You state duplicate composite primary keys, which by definition can't be duplicated in a table. In a hash file, you would only have one occurrence, the last occurrence written to the hash file under that primary key.
So I'm guessing that your data is in a text file, but you haven't made this clear.
Proceeding with that assumption, I move on to the next assumption: that your data is unsorted. This means that the duplicates do not run back to back.
So stage variables are only useful if the data is sorted.
Now, I don't know your volume, so I can give one solution for low row counts and another for high row counts. Let's start with the low row count case: run your source file into an Aggregator, generating a count for each primary key.
We can cover "high" row counts after a little more information.
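In Python terms the Aggregator idea is just a group-and-count. A minimal sketch, assuming an unsorted pipe-delimited text file (shown here as an in-memory list) whose first three fields form the composite primary key; the layout and the key positions are assumptions:

from collections import Counter

source_lines = [
    "1|A|x|100",
    "2|B|y|200",
    "1|A|x|150",   # same composite key as the first line
]

# "Aggregator": count occurrences per composite key; no sort is required.
counts = Counter(tuple(line.split("|")[:3]) for line in source_lines)

duplicate_keys = {key: n for key, n in counts.items() if n > 1}
print(duplicate_keys)                                  # {('1', 'A', 'x'): 2}
print(sum(n - 1 for n in duplicate_keys.values()))     # duplicate rows: 1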
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
What are you doing with the duplicates? Are you just taking the last one, or are you "rolling" them into a final row (meaning insert, then update, update, update)? What is your volume relative to your environment (small, medium, large, extreme)?
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Why not use the Aggregator stage? If your data is sorted then the Aggregator is optimized for that activity.
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Counting the duplicates with stage variables is very easy; you don't need the RowProcCompareWithPreviousValue function, and you can do it with just a couple of stage variables as long as your input data is sorted. The hard part is putting this count into a control report. You have no way of knowing when to output the count: a DataStage transformer doesn't have a "last row" flag, so you cannot easily write a final stage variable counter out to a report.
I think you are better off writing your duplicate rows to a duplicate file, either the entire row or just the primary key of the record, and then using a standard documentation routine to do link counts on the job. This will tell you how many input rows there were, how many went down the duplicate link, and how many were successfully written to SQL Server.
To identify a duplicate, remember that stage variables are derived from top to bottom, so the ordering of the stage variables lets you set up simple comparison logic. The input data needs to be sorted by the key fields for this to work:
svNewKey : input.f1:input.f2:input.f3:input.f4
svIsDuplicate : svNewKey = svOldKey
svOldKey : svNewKey
In your constraint you can send duplicates to your duplicate file and non-duplicates to your SQL Server link. This will run much faster than what you have now, as you are not writing as many rows to SQL Server.
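For what it's worth, here is the same top-to-bottom logic as a small Python sketch (illustrative only: the field names f1..f4 carry over from above, the link names are made up, and the input must already be sorted by the key fields):

def split_duplicates(sorted_rows):
    # Mirrors svNewKey / svIsDuplicate / svOldKey, evaluated top to bottom.
    sv_old_key = None
    for row in sorted_rows:
        sv_new_key = (row["f1"], row["f2"], row["f3"], row["f4"])
        sv_is_duplicate = (sv_new_key == sv_old_key)
        sv_old_key = sv_new_key
        yield row, sv_is_duplicate

sorted_rows = [{"f1": 1, "f2": "A", "f3": "x", "f4": 9},
               {"f1": 1, "f2": "A", "f3": "x", "f4": 9},
               {"f1": 2, "f2": "B", "f3": "y", "f4": 7}]

duplicate_file, sql_server_link = [], []
for row, is_dup in split_duplicates(sorted_rows):
    (duplicate_file if is_dup else sql_server_link).append(row)

print(len(duplicate_file), "duplicates,", len(sql_server_link), "rows to SQL Server")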
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn