Grouping data

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

devsonali
Premium Member
Premium Member
Posts: 155
Joined: Sun Mar 18, 2007 2:31 pm

Grouping data

Post by devsonali »

Hello All

I am pretty sure somebody might have asked this before, but really not sure what phrase to use to search this situation
I have data with 2 columns say Col 1 and Col 2 and rows record 1, record2 etc

I have input file in the format
Rows col1 col2
record 1 1 0
record 2 0 1
record 1 0 1
record 4 1 0
record 4 0 1
so on and so forth

I am looking at output that groups records as follows
rows col1 col 2
record 1 1 1
record 2 0 1
record 4 1 1

Can you please provide me pointers on how to approach this ?

Appreciate you replying with your inputs.
Thanks
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Partition/Sort on your grouping key column(s), then use a transformer to
Use key change detection to know when a new key group is present (either keyChange from a sort stage or compare to previous row's values)
Use stage variables to save the column values you wish to output for a key group and place them into the output row

In 8.5 or later, use LastRowInGroup() to output the desired row for each key group. In earlier versions, output them all and use a RemDup to keep the last.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
devsonali
Premium Member
Premium Member
Posts: 155
Joined: Sun Mar 18, 2007 2:31 pm

Post by devsonali »

jwiles
you mentioned that
"In 8.5 or later, use LastRowInGroup() to output the desired row for each key group"

I think i didn't quite get it so if I have say 4 rows with same value for Name and different column values e .g


Name col1 col 2

a 1 0
a 0 1
b 1 0
b 0 1

If I take last row in the group which will be value a and b corresponding to '1' in column 2
How do I get a single row for a and b with both columns holding values 1 .

a 1 1
b 1 1

Can you please elaborate ?
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

For the first record you encounter in a key group, copy the column values into stage variables. For subsequent records, copy only the 1s into stage variables.

In the constraint of your output link, use the LastRowInGroup function as the constraint. Populate the output columns with the values stored in the stage variables. LastRowInGroup is documented in the Information Server Information Center. http://publib.boulder.ibm.com/infocente ... eaks2.html

BTW...one of the terms used to describe what you are attempting to accomplish is "column survivorship". It's also a form of data aggregation, but a little different than is provided by the Aggregator stage.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
devsonali
Premium Member
Premium Member
Posts: 155
Joined: Sun Mar 18, 2007 2:31 pm

Post by devsonali »

James

I went through the documentation and tested a few jobs with last row in the group .
However I am still stuck at the part where you say

"For the first record you encounter in a key group, copy the column values into stage variables. For subsequent records, copy only the 1s into stage variables."

How do I achieve this ? This function pretty much identifies weather a record is last in the group or not , so if I say for all rows of the key column that is not last in the group , capture the value of all column values into stage variables . (if thats what you mean above , then this can be done ) however how do I capture only 1 s into stage variable for all the remaining rows within the group? Because for my remaining rows 1 s can be in any column or no columns?

I hope I am clear in explaining my question here ?
Last edited by devsonali on Wed Jan 09, 2013 10:15 pm, edited 1 time in total.
devsonali
Premium Member
Premium Member
Posts: 155
Joined: Sun Mar 18, 2007 2:31 pm

Post by devsonali »

Any suggestions/inputs from the members ?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

devsonali wrote:Any suggestions/inputs from the members ?
Learn patience. Don't push. You waited barely over an hour, at a time when many members are sleeping.

DSXchange is an all-volunteer site whose members post if, when and as they can. "Pushing" is considered offensive.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
devsonali
Premium Member
Premium Member
Posts: 155
Joined: Sun Mar 18, 2007 2:31 pm

Post by devsonali »

Point noted . Though the intention of asking was never to offend.
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

I mentioned how you could detect the key change in my first post.
Use key change detection to know when a new key group is present (either keyChange from a sort stage or compare to previous row's values)
How to generate the keyChange column in Sort is documented.

Stage variables (plural). You need a stage variable for each column you're wishing to capture the values for. As for the stage variable derivation logic, it's very straightforward if you think about it a little bit. Here's the basic rules:

If a Key Change is detected, Capture the Input Column Value
Else if Input Column Value = your desired value (1), Capture the Input Column Value
Else keep the Stage Variable Value unchanged

Just translate that into the proper derivation logic.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
thanush9sep
Premium Member
Premium Member
Posts: 54
Joined: Thu Oct 18, 2007 4:20 am
Location: Chennai

Post by thanush9sep »

I am not sure whether I understood the question correctly ....

However, seems simple work of Aggregator stage.
Regards
LakshmiNarayanan
sarathcr
Premium Member
Premium Member
Posts: 3
Joined: Thu May 29, 2008 11:23 pm

Post by sarathcr »

Hi LakshmiNarayanan,

Could you please explain us how to achieve the end result using Aggregator stage?

Thank You,
With Regards
Sarath
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Group on the first field, take MAX() of the other two.
-craig

"You can never have too many knives" -- Logan Nine Fingers
devsonali
Premium Member
Premium Member
Posts: 155
Joined: Sun Mar 18, 2007 2:31 pm

Post by devsonali »

James ,
That is correct but I am curious if we have to do looping within the transformer to achieve this (that was my initial concern ) but from your response it looks like just to check key change would take care of that .
Its pretty confusing to me and I will go ahead and test as you mentioned below .

Please let me know if there is looping to be done (which I am guessing the answer is 'no' from your suggestions below.

Thanks again
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

How accurate are the data samples you posted? From what you posted, I don't see a need for anything other than an aggregator. :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Good point. My initial read on it was keeping a "best value", but based on the samples, the requirement could be a simple summation for which an aggregator is the better approach.

To answer the looping question: Looping can be used, but is not necessary for the simple best value approach I was discussing. You could definitely use it as an exercise to learn looping if you wish...the aggregation example in the documentation is a good starting point.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
Post Reply