Remove Duplicates but keep first 5
Moderators: chulett, rschirm, roy
Remove Duplicates but keep first 5
I was hoping I could use 'Remove Duplicates' to remove dups but I need to keep 5 of them. I didn't see any additional properties besides keeping the first or last dup. Any idea on how I would do this?
Thanks,
Nicole
Thanks,
Nicole
So, sort them and keep a counter - when the counter gets to X - stop processing by adding constraint (ie. counter <= 5) Am I understanding you correctly?
chulett wrote:Hmmm... needing to keep X duplicates would probably mean via stage variables in a Transformer, assuming they're sorted properly first. ...
Hmm...I will try this if its the only way - will I have to create a counter for every group? I have a column named 'conditions' and there are several of these conditions - I need 5 of each of these - I'm assuming yes...
chulett wrote:Yup, pretty standard 'group change detection' and you want to make sure you pass no more than 5 records per 'group'. ...
Ok, next question - is using the create cluster key change property in sort = true enough to keep track of when to add 1 to the counter? or else how would I create a copy of the previous row's group in order to increment the counter?
chulett wrote:No, one counter should be fine, alone with a copy of the 'previous' group. Set the counter to one every time the group changes, else increment it. Constrain the output to when the counter is <= 5. ...
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Yes, that's why the filter operator is <= downstream of the Join stage. Upon re-examination you not use an Aggregator to count - you would use a Transformer to generate a counter per group. Which is close to the original suggestions.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 251
- Joined: Mon Jun 09, 2008 5:52 am
![Exclamation :!:](./images/smilies/icon_exclaim.gif)
Three doesn't hurt but you only really need two variables and I would number things differently to make it a little more obvious what the end result needs to be.
Code: Select all
svCounter -> If link.column = svPrevious then svCounter + 1 Else 1
svPrevious -> link.column
![Idea :idea:](./images/smilies/icon_idea.gif)
Code: Select all
svCurrent -> Link.ColA : '|' : Link.ColB : '|' : Link.ColC
svCounter -> If svCurrent = svPrevious then svCounter + 1 Else 1
svPrevious -> svCurrent
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
This is the solution I decided to use -
Using the sort stage - sorted by the 2 columns that I need to keep 5 duplicates. I set 'create key change column = true' and in the transformer stage - COUNTER stage variable derivation "If (DSLink20.keyChange = "1") Then COUNTER = 1 Else COUNTER + 1"
So I reset the counter to 1 each time the columns changed and added the constraint COUNTER <= 5
This seems to work and thanks to the group for helping me figure this out!
Using the sort stage - sorted by the 2 columns that I need to keep 5 duplicates. I set 'create key change column = true' and in the transformer stage - COUNTER stage variable derivation "If (DSLink20.keyChange = "1") Then COUNTER = 1 Else COUNTER + 1"
So I reset the counter to 1 each time the columns changed and added the constraint COUNTER <= 5
This seems to work and thanks to the group for helping me figure this out!