How to Count Distinct within aggregation

Gazelle · Post by **Gazelle** » Mon Dec 10, 2012 4:55 pm

Requirement:
We are building a summary table with pre-aggregated data.
The table has 10 "key" columns, and we aggregate on 20 different combinations of these (e.g. on keys 1/2/3/4 and keys 1/2/5/6).
The trick is that we need to count the number of distinct items that were "rolled up" during the aggregation.

Here's a simplified version of what we want, using DB2 SQL :

Code: Select all

with temp (Round,Company,Contract,Category,Section,SubSect,Actual,Max,RegionActual,RegionMax)
         as (values
             (25,'Acme','BBB','Z','Groceries','Dairy',800,1000,null,null),
             (25,'Acme','GGG','Z','Groceries','Snacks',300,900,null,null),
             (25,'Acme','BBB','Z','Groceries','Fruit & Veg',null,null,550,1000),
             (25,'Acme','CCC','Y','Groceries','Fruit & Veg',null,null,450,1000),
             (25,'Acme','SSS','Y','Clothing','Mens',null,null,750,1000),
             (25,'Acme','DDD','X','Clothing','Mens',null,null,850,1000),
             (25,'Blade','AAA','W','Groceries','Dairy',900,1000,null,null),
             (25,'Blade','AAA','W','Groceries','Snacks',400,900,null,null),
             (25,'Blade','AAA','W','Groceries','Fruit & Veg',null,null,350,1000),
             (25,'Blade','CCC','Y','Groceries','Fruit & Veg',null,null,250,1000),
             (26,'Acme','BBB','Z','Groceries','Dairy',700,1000,null,null),
             (26,'Acme','GGG','Z','Groceries','Schedule',100,800,null,null),
             (26,'Acme','BBB','Z','Groceries','Fruit & Veg',null,null,650,1000),
             (26,'Blade','AAA','W','Groceries','Dairy',600,1000,null,null),
             (26,'Blade','AAA','W','Groceries','Snacks',200,800,null,null),
             (26,'Blade','AAA','W','Groceries','Fruit & Veg',null,null,150,1000)
             )
select 
  Round,Company,Section,
  sum(Actual) as SumAct,sum(Max) as SumMax,sum(RegionActual) as SumRegAct,sum(RegionMax) as SumRegMax,
count(distinct Contract) as CountContract,count(distinct Category) as CountCat
from temp
group by Round,Company,Section;

Output:

Code: Select all

 ROUND COMPANY SECTION   SUMACT SUMMAX SUMREGACT SUMREGMAX COUNTCONTRACT COUNTCAT
 ----- ------- --------- ------ ------ --------- --------- ------------- --------
    25 Acme    Clothing    NULL   NULL      1600      2000             2        2
    25 Acme    Groceries   1100   1900      1000      2000             3        2
    25 Blade   Groceries   1300   1900       600      2000             2        2
    26 Acme    Groceries    800   1800       650      1000             2        1
    26 Blade   Groceries    800   1800       150      1000             1        1

I am actually struggling to think of a way to do this via DataStage, so am open to ideas.

Note that data volumes are pretty low, which might open up some more options.

A search finds some old posts such as this one: viewtopic.php?t=91232

I was wondering whether there was anything included in v8.5 that made this a bit easier.

Or should I give up trying to do this in DataStage, and just do the aggregations in DB2?

jwiles · Post by **jwiles** » Mon Dec 10, 2012 7:45 pm

A couple of sorts which generate keychange columns followed by an aggravator

stage can do it. The trick is in using some of the more advanced options of the sorts and in proper partitioning of the data. Not too difficult if you understand what's being done and why.

DB2 is probably the more efficient choice (and easier to code it in SQL)

Regards,

Gazelle · Post by **Gazelle** » Mon Dec 10, 2012 11:00 pm

Heehee. "Aggravator" - a good name for it!

I think the use of KeyChange might be a bit unwieldy in our case, because we have 20 aggregations.
For each type of aggregation, we would require:
- Sort by keys and KeyChange on Contract
- Sort by keys and KeyChange on Category
- Aggregate by keys and sum the KeyChange columns

I'm not sure whether DB2 would be more efficient, but I think SQL would be much easier to understand than a job with 60+ stages.

I just remembered an example in the documentation, using loop variables and LastRowInGroup() within a Transformer: "Loop example: aggregating data".
I'm sure this method could be tweaked to count distinct Contract and Category values.
But again, I think it is harder to understand than SQL, especially for the poor Support staff who have to look at it in 5 years time.

jwiles · Post by **jwiles** » Tue Dec 11, 2012 12:08 am

With the example you gave where all of the aggregations are grouped on the same set of keys (Region, Company and Section), you only need X+3 stages: one aggregator, x sorts (x=number of distinct counts) and of course source and target stages. Partition on the grouping keys only (first sort) and same partitioning elsewhere, and make use of the "Don't Sort, Already Sorted" option for the grouping keys in all sorts but the first. In the aggregator, all the aggregations are sums. When you have multiple qroups (grouping by different keys), then it becomes some form of 3+Y*(X+1), Y being the number of groups and X the average number of distinct counts per group. It gets ugly pretty quick!

It's more tedious than anything else to setup the job this way for more than a few distinct counts. If I were to go for a looping solution w/in DS instead, I would probably write a BuildOp or Custom Operator rather than use the transformer. That also brings up the same issue of long-term maintenance--how many of your client's DataStage support team know C++? Maybe there's some job/contract stability/sales op in that?

Efficiency in this case is relative: SQL is designed to code and perform this type of work easily and most databases will do this fairly efficiently as long as the database is well-designed and the SQL developer is knowledgeable. DataStage, when the job is designed well, can also be efficient.. If I had multiple groups on different keys, I would go with SQL/DB2 in a heartbeat (and have for some pretty complicated aggregations).

Regards,