Finding max value for a column using aggregator.

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
videsh77
Premium Member
Premium Member
Posts: 97
Joined: Thu Dec 02, 2004 10:43 am
Contact:

Finding max value for a column using aggregator.

Post by videsh77 »

Hi

I have sort stage followed by aggregator stage. In aggregation I want to know row for max of particular column.

After execution, I am getting an abort with an error description as - "main_program: APT_StatReducer: field Name is already present in the output interface schema.".

Anyone has any idea, regarding which direction should I adopt this error?
Thanks with regards,
videsh.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Are you using sort or hash method in the Aggregator?

Are there fields on the input of the Aggregator that are not used? If so, get rid of them.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
videsh77
Premium Member
Premium Member
Posts: 97
Joined: Thu Dec 02, 2004 10:43 am
Contact:

Post by videsh77 »

I am not using sort / hash method in the aggregator.

Partitioning type I used is 'Same'. Previous stage employed 'Entire' partitioning.

As an input to Aggregator stage I have 2 columns as - Name / Year.

In aggregator -

Grouping is done as per - Name
In aggregations - type is 'Calculation'.
Column for calculation - 'Year'
Maximum value output column - 'Name' ... So I can get name of someone whose Year col has max value.

I am getting the same error as I mentioned already.

Thanks.
Thanks with regards,
videsh.
balajisr
Charter Member
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

Change Maximum value output column name to something other than 'Name'. You already have an input column named 'Name' and your generated column cannot have the same name as input column.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

videsh77 wrote:I am not using sort / hash method in the aggregator.
Yes you are. These are the only choices.

Without knowing which method you're using, it is fatuous to offer any suggestions.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
videsh77
Premium Member
Premium Member
Posts: 97
Joined: Thu Dec 02, 2004 10:43 am
Contact:

Post by videsh77 »

Hi Ray

I overlooked Options category in aggregator properties.

Method = Hash.

Input of the aggregator I am having only 2 fields, which are used in the aggregation process.

Also, I have changed the column name at the o/p of aggregator stage to the other than what I have in i/p.

Thanks.
Thanks with regards,
videsh.
thebird
Participant
Posts: 254
Joined: Thu Jan 06, 2005 12:11 am
Location: India
Contact:

Post by thebird »

Since the data to the Aggregator stage is sorted - the "Sort" method is preferred over the "Hash" - so that DataStage can expect sorted data and doesnt use a hash table for sorting it again.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The Hash aggregation method is appropriate when there aren't too many distinct grouping values - up to, say, 1000 per partition. This method does not require sorted input, so you could lose the Sort stage with no detriment.

If you are sorting for some other reason, change the aggregation method to Sort, but don't forget to assert in the Aggregator stage that the data are sorted on these particular keys.

The calculation of MAX then simply involves choosing the output column for calculation, and specifying the Max() function to apply to it.

Unfortunately, to get the value (or values - there may be more than one) from the ungrouped Name column associated with the particular row in which the maximum value of Year occurs is not readily achieved in an Aggregator stage.

Think about how you'd achieve it in SQL - you need a subquery, right? So in DataStage you're going to need some downstream processing once you've found the maximum Year value.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
videsh77
Premium Member
Premium Member
Posts: 97
Joined: Thu Dec 02, 2004 10:43 am
Contact:

Post by videsh77 »

Column name of an aggregated column has been changed to other than what comes in input. Sort method has kept to Hash.

Partitioning set to Auto.

Column is grouped by 'Name' & calculation is done as per 'Year'.

It worked as expected.
Thanks with regards,
videsh.
Post Reply