Grouping data

devsonali · Post by **devsonali** » Wed Jan 09, 2013 1:54 pm

Hello All

I am pretty sure somebody might have asked this before, but really not sure what phrase to use to search this situation
I have data with 2 columns say Col 1 and Col 2 and rows record 1, record2 etc

I have input file in the format
Rows col1 col2
record 1 1 0
record 2 0 1
record 1 0 1
record 4 1 0
record 4 0 1
so on and so forth

I am looking at output that groups records as follows
rows col1 col 2
record 1 1 1
record 2 0 1
record 4 1 1

Can you please provide me pointers on how to approach this ?

Appreciate you replying with your inputs.
Thanks

jwiles · Post by **jwiles** » Wed Jan 09, 2013 2:21 pm

Partition/Sort on your grouping key column(s), then use a transformer to
Use key change detection to know when a new key group is present (either keyChange from a sort stage or compare to previous row's values)
Use stage variables to save the column values you wish to output for a key group and place them into the output row

In 8.5 or later, use LastRowInGroup() to output the desired row for each key group. In earlier versions, output them all and use a RemDup to keep the last.

Regards,

devsonali · Post by **devsonali** » Wed Jan 09, 2013 4:12 pm

jwiles
you mentioned that
"In 8.5 or later, use LastRowInGroup() to output the desired row for each key group"

I think i didn't quite get it so if I have say 4 rows with same value for Name and different column values e .g

Name col1 col 2

a 1 0
a 0 1
b 1 0
b 0 1

If I take last row in the group which will be value a and b corresponding to '1' in column 2
How do I get a single row for a and b with both columns holding values 1 .

a 1 1
b 1 1

Can you please elaborate ?

jwiles · Post by **jwiles** » Wed Jan 09, 2013 4:40 pm

For the first record you encounter in a key group, copy the column values into stage variables. For subsequent records, copy only the 1s into stage variables.

In the constraint of your output link, use the LastRowInGroup function as the constraint. Populate the output columns with the values stored in the stage variables. LastRowInGroup is documented in the Information Server Information Center. http://publib.boulder.ibm.com/infocente ... eaks2.html

BTW...one of the terms used to describe what you are attempting to accomplish is "column survivorship". It's also a form of data aggregation, but a little different than is provided by the Aggregator stage.

Regards,

devsonali · Post by **devsonali** » Wed Jan 09, 2013 8:36 pm

James

I went through the documentation and tested a few jobs with last row in the group .
However I am still stuck at the part where you say

"For the first record you encounter in a key group, copy the column values into stage variables. For subsequent records, copy only the 1s into stage variables."

How do I achieve this ? This function pretty much identifies weather a record is last in the group or not , so if I say for all rows of the key column that is not last in the group , capture the value of all column values into stage variables . (if thats what you mean above , then this can be done ) however how do I capture only 1 s into stage variable for all the remaining rows within the group? Because for my remaining rows 1 s can be in any column or no columns?

I hope I am clear in explaining my question here ?

devsonali · Post by **devsonali** » Wed Jan 09, 2013 9:47 pm

Any suggestions/inputs from the members ?

ray.wurlod · Post by **ray.wurlod** » Wed Jan 09, 2013 10:00 pm

devsonali wrote:Any suggestions/inputs from the members ?

Learn patience. Don't push. You waited barely over an hour, at a time when many members are sleeping.

DSXchange is an all-volunteer site whose members post if, when and as they can. "Pushing" is considered offensive.

devsonali · Post by **devsonali** » Wed Jan 09, 2013 10:10 pm

Point noted . Though the intention of asking was never to offend.

jwiles · Post by **jwiles** » Wed Jan 09, 2013 10:44 pm

I mentioned how you could detect the key change in my first post.

Use key change detection to know when a new key group is present (either keyChange from a sort stage or compare to previous row's values)

How to generate the keyChange column in Sort is documented.

Stage variables (plural). You need a stage variable for each column you're wishing to capture the values for. As for the stage variable derivation logic, it's very straightforward if you think about it a little bit. Here's the basic rules:

If a Key Change is detected, Capture the Input Column Value
Else if Input Column Value = your desired value (1), Capture the Input Column Value
Else keep the Stage Variable Value unchanged

Just translate that into the proper derivation logic.

Regards,

thanush9sep · Post by **thanush9sep** » Wed Jan 09, 2013 11:32 pm

I am not sure whether I understood the question correctly ....

However, seems simple work of Aggregator stage.

sarathcr · Post by **sarathcr** » Thu Jan 10, 2013 12:17 am

Hi LakshmiNarayanan,

Could you please explain us how to achieve the end result using Aggregator stage?

Thank You,

chulett · Post by **chulett** » Thu Jan 10, 2013 7:44 am

Group on the first field, take MAX() of the other two.

devsonali · Post by **devsonali** » Thu Jan 10, 2013 7:48 am

James ,
That is correct but I am curious if we have to do looping within the transformer to achieve this (that was my initial concern ) but from your response it looks like just to check key change would take care of that .
Its pretty confusing to me and I will go ahead and test as you mentioned below .

Please let me know if there is looping to be done (which I am guessing the answer is 'no' from your suggestions below.

Thanks again

chulett · Post by **chulett** » Thu Jan 10, 2013 8:13 am

How accurate are the data samples you posted? From what you posted, I don't see a need for anything other than an aggregator.

jwiles · Post by **jwiles** » Thu Jan 10, 2013 9:41 am

Good point. My initial read on it was keeping a "best value", but based on the samples, the requirement could be a simple summation for which an aggregator is the better approach.

To answer the looping question: Looping can be used, but is not necessary for the simple best value approach I was discussing. You could definitely use it as an exercise to learn looping if you wish...the aggregation example in the documentation is a good starting point.

Regards,