Remove Duplicates but keep first 5

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
Nicole
Premium Member
Premium Member
Posts: 93
Joined: Mon Oct 22, 2007 12:55 pm
Location: Albany
Contact:

Remove Duplicates but keep first 5

Post by Nicole »

I was hoping I could use 'Remove Duplicates' to remove dups but I need to keep 5 of them. I didn't see any additional properties besides keeping the first or last dup. Any idea on how I would do this?

Thanks,

Nicole
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Hmmm... needing to keep X duplicates would probably mean via stage variables in a Transformer, assuming they're sorted properly first.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Nicole
Premium Member
Premium Member
Posts: 93
Joined: Mon Oct 22, 2007 12:55 pm
Location: Albany
Contact:

Post by Nicole »

So, sort them and keep a counter - when the counter gets to X - stop processing by adding constraint (ie. counter <= 5) Am I understanding you correctly?
chulett wrote:Hmmm... needing to keep X duplicates would probably mean via stage variables in a Transformer, assuming they're sorted properly first. ...
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Yup, pretty standard 'group change detection' and you want to make sure you pass no more than 5 records per 'group'.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Nicole
Premium Member
Premium Member
Posts: 93
Joined: Mon Oct 22, 2007 12:55 pm
Location: Albany
Contact:

Post by Nicole »

Hmm...I will try this if its the only way - will I have to create a counter for every group? I have a column named 'conditions' and there are several of these conditions - I need 5 of each of these - I'm assuming yes...
chulett wrote:Yup, pretty standard 'group change detection' and you want to make sure you pass no more than 5 records per 'group'. ...
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

No, one counter should be fine, alone with a copy of the 'previous' group. Set the counter to one every time the group changes, else increment it. Constrain the output to when the counter is <= 5.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Nicole
Premium Member
Premium Member
Posts: 93
Joined: Mon Oct 22, 2007 12:55 pm
Location: Albany
Contact:

Post by Nicole »

Ok, next question - is using the create cluster key change property in sort = true enough to keep track of when to add 1 to the counter? or else how would I create a copy of the previous row's group in order to increment the counter?
chulett wrote:No, one counter should be fine, alone with a copy of the 'previous' group. Set the counter to one every time the group changes, else increment it. Constrain the output to when the counter is <= 5. ...
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

That's one way, sure: 1 = new group, 0 = not. Or a simple stage variable holding the concatenated values from all fields that make up the 'group'. First time checking a 'previous' value against a current value?
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Easier approach. Use a standard fork-join design, with an Aggregator to calculate the count per group. After the join filter on count <= 5.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Hmmm... they want the first five per group, not just groups that have that many.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Yes, that's why the filter operator is <= downstream of the Join stage. Upon re-examination you not use an Aggregator to count - you would use a Transformer to generate a counter per group. Which is close to the original suggestions.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Ah... counter per group, gotcha. Saying 'count per group' made me misunderestimate you. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
srinivas.g
Participant
Posts: 251
Joined: Mon Jun 09, 2008 5:52 am

Post by srinivas.g »

Just use stage varaibles:

Prev-->curr
Curr-->input coloumn
If(prev=curr) then c+1 esle 0 -->C

In Constraint use C<5
Srinu Gadipudi
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

:!: In that order? If you are going to suggest a stage variable solution that relies on them being in the correct order to function properly, please make sure you post them in the correct order.

Three doesn't hurt but you only really need two variables and I would number things differently to make it a little more obvious what the end result needs to be.

Code: Select all

svCounter ->  If link.column = svPrevious then svCounter + 1 Else 1
svPrevious -> link.column
In the output link, use a constraint of svCounter <= 5.

:idea: Me, I would probably use a third stage variable named svCurrent if the 'current group' consisted of several fields concatenated together. This to do the concatenation only once per record, especially if the comparison needs to be made more than once and (again) to keep things a little easier to read.

Code: Select all

svCurrent ->  Link.ColA : '|' : Link.ColB : '|' : Link.ColC
svCounter ->  If svCurrent = svPrevious then svCounter + 1 Else 1
svPrevious -> svCurrent
-craig

"You can never have too many knives" -- Logan Nine Fingers
Nicole
Premium Member
Premium Member
Posts: 93
Joined: Mon Oct 22, 2007 12:55 pm
Location: Albany
Contact:

Post by Nicole »

This is the solution I decided to use -

Using the sort stage - sorted by the 2 columns that I need to keep 5 duplicates. I set 'create key change column = true' and in the transformer stage - COUNTER stage variable derivation "If (DSLink20.keyChange = "1") Then COUNTER = 1 Else COUNTER + 1"

So I reset the counter to 1 each time the columns changed and added the constraint COUNTER <= 5

This seems to work and thanks to the group for helping me figure this out!
Post Reply