Hi,
Have a requirement to find max occurrence value in a column for a group.
Have given the sample data below.
One approach that have used is to group by key1 key2 and value1 get the count(*) ,then pass it through remove duplicate stage to keep the max count row alone.
But since I have to find the max occurrence in 3(could be more) columns here
I would have to repeat it for all 3 and join the data back.
Looking for a simpler approach here , to be done with say some aggregation function???
INPUT
KEY1 KEY2 VALUE1 VALUE2 VALUE3
A Z 10 50 10
A Z 0 0 10
A Z 10 0 10
A Z 10 0 20
B Y 100 0 1
B Y 15 5 1
B Y 15 5 1
output
KEY1 KEY2 VALUE1 VALUE2 VALUE3
A Z 10 0 10
B Y 15 5 1
derive max occurrence
Moderators: chulett, rschirm, roy
I'm a bit confused by your result, shouldn't it be "A Z 10 50 10"?
Also, which is bigger?
A Z 10 50 10
A Z 10 99 99
A Z 11 0 0
(if you are using 3 columns for the comparison, are they added to together, sub-sorted by column from left to right or some other logic)
?
Also, which is bigger?
A Z 10 50 10
A Z 10 99 99
A Z 11 0 0
(if you are using 3 columns for the comparison, are they added to together, sub-sorted by column from left to right or some other logic)
?
<a href=http://www.worldcommunitygrid.org/team/ ... TZ9H4CGVP1 target="WCGWin">
</a>
</a>
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Shouldn't the result be
You should be able to use an Aggregator stage for this. Group by KEY1 and KEY2 (not forgetting to partition by KEY1) and calculate MAX for each of VALUE1, VALUE2 and VALUE3. To keep them as integers, assert Preserve Type property.
Code: Select all
KEY1 KEY2 VALUE1 VALUE2 VALUE3
A Z 10 50 20
B Y 100 5 1
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Ah, we seem to have a nomenclature problem here. You mean "most frequently occurring set of values" rather than "maximum".
Probably the easiest approach is to form a single string made up of all the source values, possible with delimiters. Then you could use an fork-join or stage variables to record the number of occurrences of each and decompose downstream of that.
Take note of Arnd's question: which is larger?
Also what do you do in the event of a tie? That is, where more than one set of values ties for most frequently occurring.
Probably the easiest approach is to form a single string made up of all the source values, possible with delimiters. Then you could use an fork-join or stage variables to record the number of occurrences of each and decompose downstream of that.
Take note of Arnd's question: which is larger?
Also what do you do in the event of a tie? That is, where more than one set of values ties for most frequently occurring.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.