Page 1 of 1

Finding max value for a column using aggregator.

Posted: Mon Dec 11, 2006 5:37 am
by videsh77
Hi

I have sort stage followed by aggregator stage. In aggregation I want to know row for max of particular column.

After execution, I am getting an abort with an error description as - "main_program: APT_StatReducer: field Name is already present in the output interface schema.".

Anyone has any idea, regarding which direction should I adopt this error?

Posted: Mon Dec 11, 2006 2:03 pm
by ray.wurlod
Are you using sort or hash method in the Aggregator?

Are there fields on the input of the Aggregator that are not used? If so, get rid of them.

Posted: Wed Dec 13, 2006 11:30 pm
by videsh77
I am not using sort / hash method in the aggregator.

Partitioning type I used is 'Same'. Previous stage employed 'Entire' partitioning.

As an input to Aggregator stage I have 2 columns as - Name / Year.

In aggregator -

Grouping is done as per - Name
In aggregations - type is 'Calculation'.
Column for calculation - 'Year'
Maximum value output column - 'Name' ... So I can get name of someone whose Year col has max value.

I am getting the same error as I mentioned already.

Thanks.

Posted: Thu Dec 14, 2006 12:57 am
by balajisr
Change Maximum value output column name to something other than 'Name'. You already have an input column named 'Name' and your generated column cannot have the same name as input column.

Posted: Thu Dec 14, 2006 1:24 am
by ray.wurlod
videsh77 wrote:I am not using sort / hash method in the aggregator.
Yes you are. These are the only choices.

Without knowing which method you're using, it is fatuous to offer any suggestions.

Posted: Thu Dec 14, 2006 1:35 am
by videsh77
Hi Ray

I overlooked Options category in aggregator properties.

Method = Hash.

Input of the aggregator I am having only 2 fields, which are used in the aggregation process.

Also, I have changed the column name at the o/p of aggregator stage to the other than what I have in i/p.

Thanks.

Posted: Thu Dec 14, 2006 8:54 am
by thebird
Since the data to the Aggregator stage is sorted - the "Sort" method is preferred over the "Hash" - so that DataStage can expect sorted data and doesnt use a hash table for sorting it again.

Posted: Thu Dec 14, 2006 2:39 pm
by ray.wurlod
The Hash aggregation method is appropriate when there aren't too many distinct grouping values - up to, say, 1000 per partition. This method does not require sorted input, so you could lose the Sort stage with no detriment.

If you are sorting for some other reason, change the aggregation method to Sort, but don't forget to assert in the Aggregator stage that the data are sorted on these particular keys.

The calculation of MAX then simply involves choosing the output column for calculation, and specifying the Max() function to apply to it.

Unfortunately, to get the value (or values - there may be more than one) from the ungrouped Name column associated with the particular row in which the maximum value of Year occurs is not readily achieved in an Aggregator stage.

Think about how you'd achieve it in SQL - you need a subquery, right? So in DataStage you're going to need some downstream processing once you've found the maximum Year value.

Posted: Fri Dec 15, 2006 12:35 am
by videsh77
Column name of an aggregated column has been changed to other than what comes in input. Sort method has kept to Hash.

Partitioning set to Auto.

Column is grouped by 'Name' & calculation is done as per 'Year'.

It worked as expected.