Grouping data
Moderators: chulett, rschirm, roy
Grouping data
Hello All
I am pretty sure somebody might have asked this before, but really not sure what phrase to use to search this situation
I have data with 2 columns say Col 1 and Col 2 and rows record 1, record2 etc
I have input file in the format
Rows col1 col2
record 1 1 0
record 2 0 1
record 1 0 1
record 4 1 0
record 4 0 1
so on and so forth
I am looking at output that groups records as follows
rows col1 col 2
record 1 1 1
record 2 0 1
record 4 1 1
Can you please provide me pointers on how to approach this ?
Appreciate you replying with your inputs.
Thanks
I am pretty sure somebody might have asked this before, but really not sure what phrase to use to search this situation
I have data with 2 columns say Col 1 and Col 2 and rows record 1, record2 etc
I have input file in the format
Rows col1 col2
record 1 1 0
record 2 0 1
record 1 0 1
record 4 1 0
record 4 0 1
so on and so forth
I am looking at output that groups records as follows
rows col1 col 2
record 1 1 1
record 2 0 1
record 4 1 1
Can you please provide me pointers on how to approach this ?
Appreciate you replying with your inputs.
Thanks
Partition/Sort on your grouping key column(s), then use a transformer to
Use key change detection to know when a new key group is present (either keyChange from a sort stage or compare to previous row's values)
Use stage variables to save the column values you wish to output for a key group and place them into the output row
In 8.5 or later, use LastRowInGroup() to output the desired row for each key group. In earlier versions, output them all and use a RemDup to keep the last.
Regards,
Use key change detection to know when a new key group is present (either keyChange from a sort stage or compare to previous row's values)
Use stage variables to save the column values you wish to output for a key group and place them into the output row
In 8.5 or later, use LastRowInGroup() to output the desired row for each key group. In earlier versions, output them all and use a RemDup to keep the last.
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.
All generalizations are false, including this one - Mark Twain.
jwiles
you mentioned that
"In 8.5 or later, use LastRowInGroup() to output the desired row for each key group"
I think i didn't quite get it so if I have say 4 rows with same value for Name and different column values e .g
Name col1 col 2
a 1 0
a 0 1
b 1 0
b 0 1
If I take last row in the group which will be value a and b corresponding to '1' in column 2
How do I get a single row for a and b with both columns holding values 1 .
a 1 1
b 1 1
Can you please elaborate ?
you mentioned that
"In 8.5 or later, use LastRowInGroup() to output the desired row for each key group"
I think i didn't quite get it so if I have say 4 rows with same value for Name and different column values e .g
Name col1 col 2
a 1 0
a 0 1
b 1 0
b 0 1
If I take last row in the group which will be value a and b corresponding to '1' in column 2
How do I get a single row for a and b with both columns holding values 1 .
a 1 1
b 1 1
Can you please elaborate ?
For the first record you encounter in a key group, copy the column values into stage variables. For subsequent records, copy only the 1s into stage variables.
In the constraint of your output link, use the LastRowInGroup function as the constraint. Populate the output columns with the values stored in the stage variables. LastRowInGroup is documented in the Information Server Information Center. http://publib.boulder.ibm.com/infocente ... eaks2.html
BTW...one of the terms used to describe what you are attempting to accomplish is "column survivorship". It's also a form of data aggregation, but a little different than is provided by the Aggregator stage.
Regards,
In the constraint of your output link, use the LastRowInGroup function as the constraint. Populate the output columns with the values stored in the stage variables. LastRowInGroup is documented in the Information Server Information Center. http://publib.boulder.ibm.com/infocente ... eaks2.html
BTW...one of the terms used to describe what you are attempting to accomplish is "column survivorship". It's also a form of data aggregation, but a little different than is provided by the Aggregator stage.
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.
All generalizations are false, including this one - Mark Twain.
James
I went through the documentation and tested a few jobs with last row in the group .
However I am still stuck at the part where you say
"For the first record you encounter in a key group, copy the column values into stage variables. For subsequent records, copy only the 1s into stage variables."
How do I achieve this ? This function pretty much identifies weather a record is last in the group or not , so if I say for all rows of the key column that is not last in the group , capture the value of all column values into stage variables . (if thats what you mean above , then this can be done ) however how do I capture only 1 s into stage variable for all the remaining rows within the group? Because for my remaining rows 1 s can be in any column or no columns?
I hope I am clear in explaining my question here ?
I went through the documentation and tested a few jobs with last row in the group .
However I am still stuck at the part where you say
"For the first record you encounter in a key group, copy the column values into stage variables. For subsequent records, copy only the 1s into stage variables."
How do I achieve this ? This function pretty much identifies weather a record is last in the group or not , so if I say for all rows of the key column that is not last in the group , capture the value of all column values into stage variables . (if thats what you mean above , then this can be done ) however how do I capture only 1 s into stage variable for all the remaining rows within the group? Because for my remaining rows 1 s can be in any column or no columns?
I hope I am clear in explaining my question here ?
Last edited by devsonali on Wed Jan 09, 2013 10:15 pm, edited 1 time in total.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Learn patience. Don't push. You waited barely over an hour, at a time when many members are sleeping.devsonali wrote:Any suggestions/inputs from the members ?
DSXchange is an all-volunteer site whose members post if, when and as they can. "Pushing" is considered offensive.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
I mentioned how you could detect the key change in my first post.
Stage variables (plural). You need a stage variable for each column you're wishing to capture the values for. As for the stage variable derivation logic, it's very straightforward if you think about it a little bit. Here's the basic rules:
If a Key Change is detected, Capture the Input Column Value
Else if Input Column Value = your desired value (1), Capture the Input Column Value
Else keep the Stage Variable Value unchanged
Just translate that into the proper derivation logic.
Regards,
How to generate the keyChange column in Sort is documented.Use key change detection to know when a new key group is present (either keyChange from a sort stage or compare to previous row's values)
Stage variables (plural). You need a stage variable for each column you're wishing to capture the values for. As for the stage variable derivation logic, it's very straightforward if you think about it a little bit. Here's the basic rules:
If a Key Change is detected, Capture the Input Column Value
Else if Input Column Value = your desired value (1), Capture the Input Column Value
Else keep the Stage Variable Value unchanged
Just translate that into the proper derivation logic.
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.
All generalizations are false, including this one - Mark Twain.
-
- Premium Member
- Posts: 54
- Joined: Thu Oct 18, 2007 4:20 am
- Location: Chennai
James ,
That is correct but I am curious if we have to do looping within the transformer to achieve this (that was my initial concern ) but from your response it looks like just to check key change would take care of that .
Its pretty confusing to me and I will go ahead and test as you mentioned below .
Please let me know if there is looping to be done (which I am guessing the answer is 'no' from your suggestions below.
Thanks again
That is correct but I am curious if we have to do looping within the transformer to achieve this (that was my initial concern ) but from your response it looks like just to check key change would take care of that .
Its pretty confusing to me and I will go ahead and test as you mentioned below .
Please let me know if there is looping to be done (which I am guessing the answer is 'no' from your suggestions below.
Thanks again
Good point. My initial read on it was keeping a "best value", but based on the samples, the requirement could be a simple summation for which an aggregator is the better approach.
To answer the looping question: Looping can be used, but is not necessary for the simple best value approach I was discussing. You could definitely use it as an exercise to learn looping if you wish...the aggregation example in the documentation is a good starting point.
Regards,
To answer the looping question: Looping can be used, but is not necessary for the simple best value approach I was discussing. You could definitely use it as an exercise to learn looping if you wish...the aggregation example in the documentation is a good starting point.
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.
All generalizations are false, including this one - Mark Twain.