Page 1 of 1

Remove Duplicates but keep first 5

Posted: Mon Jan 25, 2010 1:24 pm
by Nicole
I was hoping I could use 'Remove Duplicates' to remove dups but I need to keep 5 of them. I didn't see any additional properties besides keeping the first or last dup. Any idea on how I would do this?

Thanks,

Nicole

Posted: Mon Jan 25, 2010 1:32 pm
by chulett
Hmmm... needing to keep X duplicates would probably mean via stage variables in a Transformer, assuming they're sorted properly first.

Posted: Mon Jan 25, 2010 1:37 pm
by Nicole
So, sort them and keep a counter - when the counter gets to X - stop processing by adding constraint (ie. counter <= 5) Am I understanding you correctly?
chulett wrote:Hmmm... needing to keep X duplicates would probably mean via stage variables in a Transformer, assuming they're sorted properly first. ...

Posted: Mon Jan 25, 2010 1:49 pm
by chulett
Yup, pretty standard 'group change detection' and you want to make sure you pass no more than 5 records per 'group'.

Posted: Mon Jan 25, 2010 2:17 pm
by Nicole
Hmm...I will try this if its the only way - will I have to create a counter for every group? I have a column named 'conditions' and there are several of these conditions - I need 5 of each of these - I'm assuming yes...
chulett wrote:Yup, pretty standard 'group change detection' and you want to make sure you pass no more than 5 records per 'group'. ...

Posted: Mon Jan 25, 2010 2:51 pm
by chulett
No, one counter should be fine, alone with a copy of the 'previous' group. Set the counter to one every time the group changes, else increment it. Constrain the output to when the counter is <= 5.

Posted: Mon Jan 25, 2010 4:19 pm
by Nicole
Ok, next question - is using the create cluster key change property in sort = true enough to keep track of when to add 1 to the counter? or else how would I create a copy of the previous row's group in order to increment the counter?
chulett wrote:No, one counter should be fine, alone with a copy of the 'previous' group. Set the counter to one every time the group changes, else increment it. Constrain the output to when the counter is <= 5. ...

Posted: Mon Jan 25, 2010 5:06 pm
by chulett
That's one way, sure: 1 = new group, 0 = not. Or a simple stage variable holding the concatenated values from all fields that make up the 'group'. First time checking a 'previous' value against a current value?

Posted: Mon Jan 25, 2010 7:29 pm
by ray.wurlod
Easier approach. Use a standard fork-join design, with an Aggregator to calculate the count per group. After the join filter on count <= 5.

Posted: Mon Jan 25, 2010 9:26 pm
by chulett
Hmmm... they want the first five per group, not just groups that have that many.

Posted: Tue Jan 26, 2010 5:14 am
by ray.wurlod
Yes, that's why the filter operator is <= downstream of the Join stage. Upon re-examination you not use an Aggregator to count - you would use a Transformer to generate a counter per group. Which is close to the original suggestions.

Posted: Tue Jan 26, 2010 6:31 am
by chulett
Ah... counter per group, gotcha. Saying 'count per group' made me misunderestimate you. :wink:

Posted: Wed Jan 27, 2010 6:26 am
by srinivas.g
Just use stage varaibles:

Prev-->curr
Curr-->input coloumn
If(prev=curr) then c+1 esle 0 -->C

In Constraint use C<5

Posted: Wed Jan 27, 2010 8:48 am
by chulett
:!: In that order? If you are going to suggest a stage variable solution that relies on them being in the correct order to function properly, please make sure you post them in the correct order.

Three doesn't hurt but you only really need two variables and I would number things differently to make it a little more obvious what the end result needs to be.

Code: Select all

svCounter ->  If link.column = svPrevious then svCounter + 1 Else 1
svPrevious -> link.column
In the output link, use a constraint of svCounter <= 5.

:idea: Me, I would probably use a third stage variable named svCurrent if the 'current group' consisted of several fields concatenated together. This to do the concatenation only once per record, especially if the comparison needs to be made more than once and (again) to keep things a little easier to read.

Code: Select all

svCurrent ->  Link.ColA : '|' : Link.ColB : '|' : Link.ColC
svCounter ->  If svCurrent = svPrevious then svCounter + 1 Else 1
svPrevious -> svCurrent

Posted: Tue Feb 02, 2010 8:52 am
by Nicole
This is the solution I decided to use -

Using the sort stage - sorted by the 2 columns that I need to keep 5 duplicates. I set 'create key change column = true' and in the transformer stage - COUNTER stage variable derivation "If (DSLink20.keyChange = "1") Then COUNTER = 1 Else COUNTER + 1"

So I reset the counter to 1 each time the columns changed and added the constraint COUNTER <= 5

This seems to work and thanks to the group for helping me figure this out!