How do I count distinct rows in the aggregator?

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

How do I count distinct rows in the aggregator?

Post by vmcburney »

I'm struggling to find a good distinct count function within the aggregation stage. It's something I use a lot in SQL group bys.

The aggregator has a Count Rows aggregation type but this is quite inflexible, it counts total rows and cannot count distinct instances of particular columns.

When I switch to Calculation mode and choose a column to count, such as customer ID, I don't see any count or distinct count aggregation options. I see fancy smancy functions such as percent coefficient and variance but not the plain old distinct count.

Anyone know how to do this?
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

You can achieve this by using 2 agg stages.

First one - group by your key
Second one - count your key and group by some dummy column (say constant 1) or use max(@OUTROWNUM)
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

I am astonished that you cannot do a simple count aggregation at the same time as other aggregations such as max and min and that you cannot do a distinct count at all. It seems like someone built a car and left the wheels out.

I will play around with sorts and transformers and maybe custom stages to find some way around this.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Then you'll want to be able to do more than one! :roll:

How many databases allow multiple SELECT DISTINCT in the one query? For example

Code: Select all

SELECT DISTINCT COLUMN_01, DISTINCT COLUMN_02 FROM TABLE;
or

Code: Select all

SELECT COUNT(DISTINCT COLUMN_01), COUNT(DISTINCT COLUMN_02) FROM TABLENAME;
Not many. (Red Brick 6.20 and later can, I don't know any other.)
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
dsxuserrio
Participant
Posts: 82
Joined: Thu Dec 02, 2004 10:27 pm
Location: INDIA

Post by dsxuserrio »

vmcburney
As you were mentioning
"I will play around with sorts and transformers and maybe custom stages to find some way around this."

Just for count distinct you can use sort with stable unique option.
Of course in the main aggregator stage you want all your records. So just for doing a count you need one more stage.
Thanks
dsxuserrio

Kannan.N
Bangalore,INDIA
jasper
Participant
Posts: 111
Joined: Mon May 06, 2002 1:25 am
Location: Belgium

Post by jasper »

Code:
SELECT COUNT(DISTINCT COLUMN_01), COUNT(DISTINCT COLUMN_02) FROM TABLENAME
Even oracle can do this, so it can't be hard.

I'm also wondering why this is so hard, for me it's very standard in an aggregator to have something like
select count(A),count(distinct A),count(B),count(distinct B)
If this is so hard to code in DS, they should rethink a lot about the aggregator stage.
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
You could order by this column (as well as other key columns) and make all reapearing instances have NULL in a transformer prior to the aggregator stage, this should make your count equal to a count distinct on that column.

IHTH,
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
Image
Post Reply