Hi
I have a stream with a design like: Copy -----> Aggregate ----> Seq File.
The data volume in the job is not large, so I have set all the stages to run in sequential mode.
I am processing just two columns in this stream.
Copy:
====
1) Retrieve the key and the measure calculation column from upstream processing.
Aggregate:
=======
1) Group by Key (Hash option) and count(distinct Measure_column)
Seq file
======
1) Write the key and the count to the sequential file.
The issue for me is that I am not able to get a distinct count; the Aggregate stage returns count(*) instead. I have tried using a Remove Duplicates stage before the Aggregator to remove duplicates in Measure_column, but I am not getting the proper result.
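To make the requirement concrete, here is a small Python sketch (with made-up sample data; the column names Key and Measure follow the job above) showing the difference between the row count the Aggregator is returning and the distinct count that is wanted:

```python
from collections import defaultdict

# Hypothetical sample rows: (key, measure) pairs with duplicate measures per key.
rows = [("A", 1), ("A", 1), ("A", 2), ("B", 3), ("B", 3)]

plain_count = defaultdict(int)    # behaves like count(*) per group
distinct_vals = defaultdict(set)  # tracks distinct measure values per group

for key, measure in rows:
    plain_count[key] += 1
    distinct_vals[key].add(measure)

distinct_count = {k: len(v) for k, v in distinct_vals.items()}

print(dict(plain_count))  # {'A': 3, 'B': 2} -- rows per key
print(distinct_count)     # {'A': 2, 'B': 1} -- distinct Measure values per key
```

The second dictionary is the output the job should write to the sequential file.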
Any guidance on this please...
Count of distinct keys - Aggregator stage
Regards
Ragu
anbu wrote: Make column Measure_column as key.

That isn't going to work - not on its own anyway. That will give you one output row per value of Measure_column. You could then pass that through a Transformer that sets a counter value to 1, and then aggregate again on your real key; the sum of those counter values will be your distinct count.
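The two-pass approach described above can be sketched in Python (purely illustrative; the stage names in the comments refer to the DataStage stages, and I am assuming the first Aggregator groups on both Key and Measure_column so the result stays per-key):

```python
from collections import defaultdict

# Hypothetical input rows: (Key, Measure_column).
rows = [("A", 1), ("A", 1), ("A", 2), ("B", 3)]

# Aggregator pass 1: group on (Key, Measure_column)
# -> one output row per distinct (key, measure) pair.
pass1 = sorted(set(rows))

# Transformer: add a counter column hard-coded to 1 on every row.
with_counter = [(key, measure, 1) for key, measure in pass1]

# Aggregator pass 2: group on the real key, summing the counter
# -> the sum is the distinct count of Measure_column per key.
result = defaultdict(int)
for key, _measure, counter in with_counter:
    result[key] += counter

print(dict(result))  # {'A': 2, 'B': 1}
```

The same effect could be had by replacing pass 1 with a Remove Duplicates stage keyed on both columns; the essential point is that the first pass must key on the measure as well, not instead of, the real key.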
Can anyone suggest a way of doing this without aggregating twice?
Phil Hibbs | Capgemini
Technical Consultant