Count of distinct keys - Aggregator stage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
Ragunathan Gunasekaran
Participant
Posts: 247
Joined: Mon Jan 22, 2007 11:33 pm

Count of distinct keys - Aggregator stage

Post by Ragunathan Gunasekaran »

Hi

I am having a stream with a design like Copy -----> Aggregate ---->seq file.

The data flow in the job is not enormous and hence i have made all the stages to work sequential.

I am processing just two columns in the above stream.
Copy:
====
1) Retrieve key and measure calculation column from upstream processing.

Aggregate:
=======
1) Group by Key ( Hash option) and count(distinct Measure_ columm)

Seq file
======
1) write the key and the count to the sequential file.


The issue for me is .. i am not able to perform a distinct count rather the aggregate stage returns count(*). I have tried using remove duplicate stage before aggregator to remove the duplicate in measure_col but not getting proper result.


Any guidance on this please...
Regards
Ragu
anbu
Premium Member
Premium Member
Posts: 596
Joined: Sat Feb 18, 2006 2:25 am
Location: india

Post by anbu »

Make column Measure_ columm as key.
PhilHibbs
Premium Member
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

anbu wrote:Make column Measure_ columm as key.
That isn't going to work - not on its own anyway. That will give you one output row per value of Measure_column. You could then pass that through a Transformer that sets a counter value to 1 and then aggregate again on your real key, and the sum of those counter values will be your distinct count.

Can anyone suggest a way of doing this without aggregating twice?
Phil Hibbs | Capgemini
Technical Consultant
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Generate a copy of Measure_Column. Group by key and by Copy_of_Measure_Column
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
PhilHibbs
Premium Member
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

I don't see how grouping by a copy of the column is any different to grouping by the column itself - you will still end up with one output record per value in measure_column which you will then have to aggregate a second time to get the right cardinality.
Phil Hibbs | Capgemini
Technical Consultant
Post Reply