Remove Duplicates but keep first 5

Nicole · Post by **Nicole** » Mon Jan 25, 2010 1:24 pm

I was hoping I could use 'Remove Duplicates' to remove dups but I need to keep 5 of them. I didn't see any additional properties besides keeping the first or last dup. Any idea on how I would do this?

Thanks,

Nicole

chulett · Post by **chulett** » Mon Jan 25, 2010 1:32 pm

Hmmm... needing to keep X duplicates would probably mean via stage variables in a Transformer, assuming they're sorted properly first.

Nicole · Post by **Nicole** » Mon Jan 25, 2010 1:37 pm

So, sort them and keep a counter - when the counter gets to X - stop processing by adding constraint (ie. counter <= 5) Am I understanding you correctly?

chulett wrote:Hmmm... needing to keep X duplicates would probably mean via stage variables in a Transformer, assuming they're sorted properly first. ...

chulett · Post by **chulett** » Mon Jan 25, 2010 1:49 pm

Yup, pretty standard 'group change detection' and you want to make sure you pass no more than 5 records per 'group'.

Nicole · Post by **Nicole** » Mon Jan 25, 2010 2:17 pm

Hmm...I will try this if its the only way - will I have to create a counter for every group? I have a column named 'conditions' and there are several of these conditions - I need 5 of each of these - I'm assuming yes...

chulett wrote:Yup, pretty standard 'group change detection' and you want to make sure you pass no more than 5 records per 'group'. ...

chulett · Post by **chulett** » Mon Jan 25, 2010 2:51 pm

No, one counter should be fine, alone with a copy of the 'previous' group. Set the counter to one every time the group changes, else increment it. Constrain the output to when the counter is <= 5.

Nicole · Post by **Nicole** » Mon Jan 25, 2010 4:19 pm

Ok, next question - is using the create cluster key change property in sort = true enough to keep track of when to add 1 to the counter? or else how would I create a copy of the previous row's group in order to increment the counter?

chulett wrote:No, one counter should be fine, alone with a copy of the 'previous' group. Set the counter to one every time the group changes, else increment it. Constrain the output to when the counter is <= 5. ...

chulett · Post by **chulett** » Mon Jan 25, 2010 5:06 pm

That's one way, sure: 1 = new group, 0 = not. Or a simple stage variable holding the concatenated values from all fields that make up the 'group'. First time checking a 'previous' value against a current value?

ray.wurlod · Post by **ray.wurlod** » Mon Jan 25, 2010 7:29 pm

Easier approach. Use a standard fork-join design, with an Aggregator to calculate the count per group. After the join filter on count <= 5.

chulett · Post by **chulett** » Mon Jan 25, 2010 9:26 pm

Hmmm... they want the first five per group, not just groups that have that many.

ray.wurlod · Post by **ray.wurlod** » Tue Jan 26, 2010 5:14 am

Yes, that's why the filter operator is <= downstream of the Join stage. Upon re-examination you not use an Aggregator to count - you would use a Transformer to generate a counter per group. Which is close to the original suggestions.

chulett · Post by **chulett** » Tue Jan 26, 2010 6:31 am

Ah... counter per group, gotcha. Saying 'count per group' made me misunderestimate you.

srinivas.g · Post by **srinivas.g** » Wed Jan 27, 2010 6:26 am

Just use stage varaibles:

Prev-->curr
Curr-->input coloumn
If(prev=curr) then c+1 esle 0 -->C

In Constraint use C<5

chulett · Post by **chulett** » Wed Jan 27, 2010 8:48 am

In that order? If you are going to suggest a stage variable solution that relies on them being in the correct order to function properly, please make sure you post them in the correct order.

Three doesn't hurt but you only really need two variables and I would number things differently to make it a little more obvious what the end result needs to be.

Code: Select all

svCounter ->  If link.column = svPrevious then svCounter + 1 Else 1
svPrevious -> link.column

In the output link, use a constraint of svCounter <= 5.

Me, I would probably use a third stage variable named svCurrent if the 'current group' consisted of several fields concatenated together. This to do the concatenation only once per record, especially if the comparison needs to be made more than once and (again) to keep things a little easier to read.

Code: Select all

svCurrent ->  Link.ColA : '|' : Link.ColB : '|' : Link.ColC
svCounter ->  If svCurrent = svPrevious then svCounter + 1 Else 1
svPrevious -> svCurrent

Nicole · Post by **Nicole** » Tue Feb 02, 2010 8:52 am

This is the solution I decided to use -

Using the sort stage - sorted by the 2 columns that I need to keep 5 duplicates. I set 'create key change column = true' and in the transformer stage - COUNTER stage variable derivation "If (DSLink20.keyChange = "1") Then COUNTER = 1 Else COUNTER + 1"

So I reset the counter to 1 each time the columns changed and added the constraint COUNTER <= 5

This seems to work and thanks to the group for helping me figure this out!