derive max occurrence

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
nishantrk
Premium Member
Premium Member
Posts: 23
Joined: Fri May 27, 2011 11:43 am

derive max occurrence

Post by nishantrk »

Hi,
Have a requirement to find max occurrence value in a column for a group.
Have given the sample data below.
One approach that have used is to group by key1 key2 and value1 get the count(*) ,then pass it through remove duplicate stage to keep the max count row alone.
But since I have to find the max occurrence in 3(could be more) columns here
I would have to repeat it for all 3 and join the data back.

Looking for a simpler approach here , to be done with say some aggregation function???

INPUT
KEY1 KEY2 VALUE1 VALUE2 VALUE3
A Z 10 50 10
A Z 0 0 10
A Z 10 0 10
A Z 10 0 20
B Y 100 0 1
B Y 15 5 1
B Y 15 5 1

output
KEY1 KEY2 VALUE1 VALUE2 VALUE3
A Z 10 0 10
B Y 15 5 1
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

I'm a bit confused by your result, shouldn't it be "A Z 10 50 10"?

Also, which is bigger?

A Z 10 50 10
A Z 10 99 99
A Z 11 0 0

(if you are using 3 columns for the comparison, are they added to together, sub-sorted by column from left to right or some other logic)
?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Shouldn't the result be

Code: Select all

KEY1 KEY2 VALUE1 VALUE2 VALUE3
  A    Z    10     50     20
  B    Y   100      5      1 
You should be able to use an Aggregator stage for this. Group by KEY1 and KEY2 (not forgetting to partition by KEY1) and calculate MAX for each of VALUE1, VALUE2 and VALUE3. To keep them as integers, assert Preserve Type property.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Ah, we seem to have a nomenclature problem here. You mean "most frequently occurring set of values" rather than "maximum".

Probably the easiest approach is to form a single string made up of all the source values, possible with delimiters. Then you could use an fork-join or stage variables to record the number of occurrences of each and decompose downstream of that.

Take note of Arnd's question: which is larger?

Also what do you do in the event of a tie? That is, where more than one set of values ties for most frequently occurring.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
nishantrk
Premium Member
Premium Member
Posts: 23
Joined: Fri May 27, 2011 11:43 am

Post by nishantrk »

Sorry for the confusion ..What I meant was "most frequently occurring set of values" as Ray correctly pointed out.
Thanks for ur replies
Post Reply