Count of distinct keys - Aggregator stage

Ragunathan Gunasekaran · Sat Nov 21, 2009 1:06 pm

Hi

I have a job with a design like Copy -----> Aggregator -----> Sequential File.

The data volume in the job is not large, so I have set all the stages to run sequentially.

I am processing just two columns in the above stream.
Copy:
====
1) Retrieve the key column and the measure calculation column from upstream processing.

Aggregate:
=======
1) Group by key (Hash option) and count(distinct Measure_column)

Seq file
======
1) Write the key and the count to the sequential file.

My problem is that I cannot get a distinct count: the Aggregator stage returns count(*) instead. I have tried putting a Remove Duplicates stage before the Aggregator to remove the duplicates in Measure_column, but I am still not getting the correct result.
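To make the difference concrete, here is a minimal sketch in plain Python (not DataStage syntax) of the two results being compared. The sample rows and the variable names are invented for illustration only.

```python
from collections import defaultdict

# Hypothetical sample rows: (key, measure) pairs, invented for illustration.
rows = [("A", 10), ("A", 10), ("A", 20), ("B", 30), ("B", 30)]

row_counts = defaultdict(int)        # count(*) per key
distinct_values = defaultdict(set)   # distinct measure values seen per key

for key, measure in rows:
    row_counts[key] += 1
    distinct_values[key].add(measure)

# count(distinct measure) per key is the size of each set.
distinct_counts = {key: len(values) for key, values in distinct_values.items()}

print(dict(row_counts))   # {'A': 3, 'B': 2}
print(distinct_counts)    # {'A': 2, 'B': 1}
```

Key A has three rows but only two distinct measure values, which is the gap between what the job returns and what is wanted.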

Any guidance on this please...

anbu · Sat Nov 21, 2009 3:10 pm

Make the column Measure_column a key.

PhilHibbs · Wed May 09, 2012 3:48 am

anbu wrote: Make the column Measure_column a key.

That isn't going to work - not on its own anyway. That will give you one output row per value of Measure_column. You could then pass that through a Transformer that sets a counter value to 1 and then aggregate again on your real key, and the sum of those counter values will be your distinct count.
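The two-pass approach described above can be sketched in plain Python as follows. This is only an illustration of the logic, not DataStage code; the sample rows and names are hypothetical.

```python
from collections import Counter

# Hypothetical sample rows: (key, measure) pairs, invented for illustration.
rows = [("A", 10), ("A", 10), ("A", 20), ("B", 30), ("B", 30)]

# Pass 1: group by (key, measure), giving one output row per distinct pair.
first_pass = sorted(set(rows))

# Transformer step: attach a counter value of 1 to every surviving row.
with_counter = [(key, measure, 1) for key, measure in first_pass]

# Pass 2: group by the real key and sum the counter values.
distinct_counts = Counter()
for key, _measure, counter in with_counter:
    distinct_counts[key] += counter

print(dict(distinct_counts))  # {'A': 2, 'B': 1}
```

The sum of the counters per key equals the number of distinct measure values for that key, because the first pass has already collapsed the duplicates.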

Can anyone suggest a way of doing this without aggregating twice?

ray.wurlod · Wed May 09, 2012 3:55 pm

Generate a copy of Measure_Column. Group by key and by Copy_of_Measure_Column.

PhilHibbs · Fri May 11, 2012 8:02 am

I don't see how grouping by a copy of the column is any different to grouping by the column itself - you will still end up with one output record per value in measure_column which you will then have to aggregate a second time to get the right cardinality.
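A small plain-Python check of this point, assuming the copy holds exactly the same values as the original column (the sample rows and names are hypothetical):

```python
# Hypothetical sample rows: (key, measure) pairs, invented for illustration.
rows = [("A", 10), ("A", 10), ("A", 20), ("B", 30), ("B", 30)]

# Add a copy of the measure column to every row.
with_copy = [(key, measure, measure) for key, measure in rows]

# Groups produced when grouping by key and the original column.
groups_by_column = {(key, measure) for key, measure, _copy in with_copy}

# Groups produced when grouping by key and the copied column.
groups_by_copy = {(key, copy) for key, _measure, copy in with_copy}

print(groups_by_column == groups_by_copy)   # True
print(len(groups_by_copy))                  # 3
print(len({key for key, _m, _c in with_copy}))  # 2
```

Either grouping yields three output rows for two keys, so a second aggregation on the key alone is still needed to arrive at one row per key.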