Finding max value for a column using aggregator.

videsh77 · Post by **videsh77** » Mon Dec 11, 2006 5:37 am

Hi

I have sort stage followed by aggregator stage. In aggregation I want to know row for max of particular column.

After execution, I am getting an abort with an error description as - "main_program: APT_StatReducer: field Name is already present in the output interface schema.".

Anyone has any idea, regarding which direction should I adopt this error?

ray.wurlod · Post by **ray.wurlod** » Mon Dec 11, 2006 2:03 pm

Are you using sort or hash method in the Aggregator?

Are there fields on the input of the Aggregator that are not used? If so, get rid of them.

videsh77 · Post by **videsh77** » Wed Dec 13, 2006 11:30 pm

I am not using sort / hash method in the aggregator.

Partitioning type I used is 'Same'. Previous stage employed 'Entire' partitioning.

As an input to Aggregator stage I have 2 columns as - Name / Year.

In aggregator -

Grouping is done as per - Name
In aggregations - type is 'Calculation'.
Column for calculation - 'Year'
Maximum value output column - 'Name' ... So I can get name of someone whose Year col has max value.

I am getting the same error as I mentioned already.

Thanks.

balajisr · Post by **balajisr** » Thu Dec 14, 2006 12:57 am

Change Maximum value output column name to something other than 'Name'. You already have an input column named 'Name' and your generated column cannot have the same name as input column.

ray.wurlod · Post by **ray.wurlod** » Thu Dec 14, 2006 1:24 am

videsh77 wrote:I am not using sort / hash method in the aggregator.

Yes you are. These are the only choices.

Without knowing which method you're using, it is fatuous to offer any suggestions.

videsh77 · Post by **videsh77** » Thu Dec 14, 2006 1:35 am

Hi Ray

I overlooked Options category in aggregator properties.

Method = Hash.

Input of the aggregator I am having only 2 fields, which are used in the aggregation process.

Also, I have changed the column name at the o/p of aggregator stage to the other than what I have in i/p.

Thanks.

thebird · Post by **thebird** » Thu Dec 14, 2006 8:54 am

Since the data to the Aggregator stage is sorted - the "Sort" method is preferred over the "Hash" - so that DataStage can expect sorted data and doesnt use a hash table for sorting it again.

ray.wurlod · Post by **ray.wurlod** » Thu Dec 14, 2006 2:39 pm

The Hash aggregation method is appropriate when there aren't too many distinct grouping values - up to, say, 1000 per partition. This method does not require sorted input, so you could lose the Sort stage with no detriment.

If you are sorting for some other reason, change the aggregation method to Sort, but don't forget to assert in the Aggregator stage that the data are sorted on these particular keys.

The calculation of MAX then simply involves choosing the output column for calculation, and specifying the Max() function to apply to it.

Unfortunately, to get the value (or values - there may be more than one) from the ungrouped Name column associated with the particular row in which the maximum value of Year occurs is not readily achieved in an Aggregator stage.

Think about how you'd achieve it in SQL - you need a subquery, right? So in DataStage you're going to need some downstream processing once you've found the maximum Year value.

videsh77 · Post by **videsh77** » Fri Dec 15, 2006 12:35 am

Column name of an aggregated column has been changed to other than what comes in input. Sort method has kept to Hash.

Partitioning set to Auto.

Column is grouped by 'Name' & calculation is done as per 'Year'.

It worked as expected.