derive max occurrence

nishantrk · Post by **nishantrk** » Sat Jun 01, 2013 1:18 am

Hi,
Have a requirement to find max occurrence value in a column for a group.
Have given the sample data below.
One approach that have used is to group by key1 key2 and value1 get the count(*) ,then pass it through remove duplicate stage to keep the max count row alone.
But since I have to find the max occurrence in 3(could be more) columns here
I would have to repeat it for all 3 and join the data back.

Looking for a simpler approach here , to be done with say some aggregation function???

INPUT
KEY1 KEY2 VALUE1 VALUE2 VALUE3
A Z 10 50 10
A Z 0 0 10
A Z 10 0 10
A Z 10 0 20
B Y 100 0 1
B Y 15 5 1
B Y 15 5 1

output
KEY1 KEY2 VALUE1 VALUE2 VALUE3
A Z 10 0 10
B Y 15 5 1

ArndW · Post by **ArndW** » Sat Jun 01, 2013 2:21 am

I'm a bit confused by your result, shouldn't it be "A Z 10 50 10"?

Also, which is bigger?

A Z 10 50 10
A Z 10 99 99
A Z 11 0 0

(if you are using 3 columns for the comparison, are they added to together, sub-sorted by column from left to right or some other logic)
?

ray.wurlod · Post by **ray.wurlod** » Sat Jun 01, 2013 2:22 am

Shouldn't the result be

Code: Select all

KEY1 KEY2 VALUE1 VALUE2 VALUE3
  A    Z    10     50     20
  B    Y   100      5      1

You should be able to use an Aggregator stage for this. Group by KEY1 and KEY2 (not forgetting to partition by KEY1) and calculate MAX for each of VALUE1, VALUE2 and VALUE3. To keep them as integers, assert Preserve Type property.

ray.wurlod · Post by **ray.wurlod** » Sat Jun 01, 2013 2:32 am

Ah, we seem to have a nomenclature problem here. You mean "most frequently occurring set of values" rather than "maximum".

Probably the easiest approach is to form a single string made up of all the source values, possible with delimiters. Then you could use an fork-join or stage variables to record the number of occurrences of each and decompose downstream of that.

Take note of Arnd's question: which is larger?

Also what do you do in the event of a tie? That is, where more than one set of values ties for most frequently occurring.

nishantrk · Post by **nishantrk** » Sat Jun 01, 2013 5:58 am

Sorry for the confusion ..What I meant was "most frequently occurring set of values" as Ray correctly pointed out.
Thanks for ur replies