group by issue

Svengiyil · Post by **Svengiyil** » Thu Sep 03, 2009 4:44 am

Hi

I have to do a group by on the key column in a DS job. I tried to use a Sort stage but realized that the Sort stage in a parallel job does not have a group by option.
I tried to use an Aggregator, but although the Aggregator has a group by option,
I cannot pass the rest of the column values (other than the key column) as they are into the target sequential file without specifying some kind of calculation. How do I proceed?

Thanks,
Svengiyil

dxk9 · Post by **dxk9** » Thu Sep 03, 2009 5:07 am

Can you explain the requirement more clearly? It's kind of confusing to me.

Regards,
Divya

Sainath.Srinivasan · Post by **Sainath.Srinivasan** » Thu Sep 03, 2009 5:26 am

You can split them into two links, with one link going to an Aggregator and coming back via a Join stage.

The simplest approach would be to use a Transformer to do the calculation.

chulett · Post by **chulett** » Thu Sep 03, 2009 5:45 am

So, what is it that you want to do, sort or group? You can't "group" records and keep all of the detail. The end result will be fewer records, so some sort of "calculation" (sum, max, last, etc.) needs to be done on the non-grouped fields.

Seems like you may just want to sort your output data.
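
To make the distinction concrete, here is a minimal Python sketch (not DataStage code). The sample rows and the choice of "last" as the collapsing rule are assumptions for illustration only. Sorting keeps every record, while grouping returns one record per key and therefore needs some rule for reducing col2.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows: (col1, col2)
rows = [("K2", "b"), ("K1", "a"), ("K2", "d"), ("K1", "c")]

# Sorting by the key keeps every detail row (sorted() is stable).
sorted_rows = sorted(rows, key=itemgetter(0))

# Grouping by the key yields one row per key, so col2 has to be
# reduced somehow; here we keep the last col2 seen in each group.
grouped_rows = [
    (key, list(group)[-1][1])
    for key, group in groupby(sorted_rows, key=itemgetter(0))
]

print(len(sorted_rows))  # 4 rows: nothing lost
print(grouped_rows)      # one row per key
```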

Svengiyil · Post by **Svengiyil** » Thu Sep 03, 2009 6:35 am

OK, let me explain... I have an input file in which characters 1 to 10 are the key columns. I'm splitting this file into two parts, characters 1 to 10 as col1 and the rest as col2. Now I need to group the output by col1 and store it in a target file (the target file contains both col1 and col2). How do I achieve this?

chulett · Post by **chulett** » Thu Sep 03, 2009 7:39 am

Hmmm... still clear as mud. Can you provide examples of what it looks like before, then split and then the desired output? Hopefully that will help light the

ray.wurlod · Post by **ray.wurlod** » Thu Sep 03, 2009 3:13 pm

The "fork join" model suggested by Sainath seems to me to be the way to go.

RAJARP · Post by **RAJARP** » Thu Sep 03, 2009 5:07 pm

Hi,
If I've understood correctly, then:

i/p file---->Transformer(T)--------------->Join Stage(J)------->Target

(i/p link of (A) from Transformer(T)) (o/p link of (A) connects to Join Stage(J))
-------------------------------> Aggregator(A)---------->

1. From the Transformer to the Join stage, pass all the columns.
2. From the Transformer to the Aggregator, pass the columns you want to do a 'group by' operation on.
3. In the Aggregator, do the 'group by' operation.
4. In the Join stage, do the join using col1 as the key, get the group by value, and pass it to the target.

So you would have col1, col2 and the column you have done the 'group by' on in the target.
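
The steps above can be sketched in plain Python to show the data flow. This is only an illustration, not DataStage code: the sample rows and the use of a row count as the aggregated value are assumptions.

```python
from collections import Counter

# Hypothetical detail rows as they leave the Transformer: (col1, col2)
detail = [("K1", "a"), ("K2", "b"), ("K1", "c")]

# Aggregator link: group by col1 and compute one value per key
# (a row count here, chosen only for illustration).
counts = Counter(col1 for col1, _ in detail)

# Join link: attach the grouped value back to every detail row on col1.
joined = [(col1, col2, counts[col1]) for col1, col2 in detail]

for row in joined:
    print(row)
```

Note that the joined output has as many rows as the input; the aggregate is simply repeated on each detail row of its group.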

Regards,
Raja R P

dxk9 · Post by **dxk9** » Thu Sep 03, 2009 9:53 pm

Svengiyil,

As per your requirement, you have 2 columns, want to group by Col1, and also want both columns in the output... which is similar to sorting by Col1.

Please let me know if my understanding of the requirement is wrong.

Regards,
Divya

Svengiyil · Post by **Svengiyil** » Mon Sep 07, 2009 11:59 pm

Hi Raja, I tried your method and it works fine. Thanks a lot.

Svengiyil · Post by **Svengiyil** » Tue Sep 08, 2009 12:01 am

Hi Divya,

Your understanding is correct, and I tried the method suggested by Raja; it works.

dxk9 · Post by **dxk9** » Tue Sep 08, 2009 3:00 am

Svengiyil,

Good to hear that you got a good solution!!!

But just curious to know whether you get the same output when you sort by the key column? As per my understanding, you should.

Regards,
Divya

Svengiyil · Post by **Svengiyil** » Tue Sep 08, 2009 4:40 am

I may have to group by more than one column in future, in which case a sort alone would not serve the purpose.

Kryt0n · Post by **Kryt0n** » Tue Sep 08, 2009 4:20 pm

Amazed that this came to a resolution... or that there even was an issue...

If you do Xfm->Join (non-group cols) and Xfm->Agg->Join (group cols), your non-group columns will just explode the group back to the starting number of rows... so what is achieved? Your non-group cols would need to go Xfm->Dedup->Join.
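
A rough Python sketch of that point (not DataStage code; the sample rows, the count aggregate, and the "keep the first row per key" dedup rule are all assumptions): joining the aggregate to the undeduplicated detail returns the original row count, while deduplicating the detail first gives one row per key.

```python
# Hypothetical detail rows: (col1, col2)
detail = [("K1", "a"), ("K2", "b"), ("K1", "c")]

# Agg link: one count per key.
counts = {}
for col1, _ in detail:
    counts[col1] = counts.get(col1, 0) + 1

# Xfm->Join without dedup: the group "explodes" back to every input row.
joined_all = [(col1, col2, counts[col1]) for col1, col2 in detail]

# Xfm->Dedup->Join: keep only the first detail row seen for each key.
first_per_key = {}
for col1, col2 in detail:
    first_per_key.setdefault(col1, col2)
joined_dedup = [(col1, col2, counts[col1]) for col1, col2 in first_per_key.items()]

print(len(joined_all), len(joined_dedup))
```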

Alternatively, use the Min/Max option of the aggregator and select the preserve source column type... (or something like that).
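
As a loose illustration of the Min/Max idea in Python (this does not model the Aggregator's actual options; the sample rows are assumed), col2 can be carried through as a string by taking its minimum and maximum per key, giving one row per group with no join needed:

```python
# Hypothetical detail rows: (col1, col2)
rows = [("K1", "a"), ("K2", "b"), ("K1", "c")]

# Track (min, max) of col2 for each key; col2 stays a string throughout.
min_max = {}
for col1, col2 in rows:
    if col1 not in min_max:
        min_max[col1] = (col2, col2)
    else:
        lo, hi = min_max[col1]
        min_max[col1] = (min(lo, col2), max(hi, col2))

# One output row per key: (col1, min_col2, max_col2)
result = [(col1, lo, hi) for col1, (lo, hi) in sorted(min_max.items())]
print(result)
```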