Aggregate Performance vs. DB2 vs. other options



dcguuenther
Participant
Posts: 13
Joined: Wed Feb 08, 2006 3:06 pm
Location: Chicago

Aggregate Performance vs. DB2 vs. other options

Post by dcguuenther »

Hey All,

I am just curious about your general perceptions of aggregation performance in DataStage.

If I had comparable CPU/memory on a DB2 database and on a DataStage server, which should win? (Keep in mind you have to get the data off of DB2 to start with anyway.)

I am currently aggregating 232,164,602 rows.

I am using the following process in my DataStage attempt (though I will probably break this into smaller steps); a rough SQL equivalent of what I'm comparing against is sketched after the list. I have turned auto-sort and auto-partition off in the DataStage job.

1. Table read to bring in the 232 million records
2. Hash partition immediately on 1 key with no sort
3. 2 lookups against ~100k rows each
4. 2 lookups against 25 rows each
5. 2 lookups against 2,500 rows
6. Transform (multiplications of source values with some of the values returned by the lookups)
7. Sort on 4 keys (no re-partitioning)
8. Sort aggregator
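
For reference, the pushdown version I'm comparing against would do all of this inside DB2. Table and column names below are made up, but it is roughly:

    SELECT f.key1, f.key2, f.key3, f.key4,
           SUM(f.amount * r.rate * c.factor) AS total_amount
    FROM   fact_detail f
    JOIN   rate_lookup r ON r.rate_key = f.rate_key   -- the ~100k-row lookups
    JOIN   code_lookup c ON c.code_key = f.code_key   -- the small lookups
    GROUP BY f.key1, f.key2, f.key3, f.key4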

With the DB2/DataStage combination it currently takes between 23 and 40 minutes.

On DataStage alone, I killed it after 1:15, as it was only 30 million rows through the aggregator, the box's CPU and memory were pegged, and others need to share it right now. I have turned off auto-sort and auto-partition. I am doing parallel table reads to get the data in as fast as I can (though with this many records it still takes a while, at ~100k rows/sec), with one partition step with no sort and then one sort without re-partitioning.

My statistics: on an 8-node config, it took 44 minutes just to read in all the data on the DataStage server. The lookups for the most part did not add any time to the read.
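
As a rough sanity check, 232,164,602 rows at ~100,000 rows/sec works out to about 2,322 seconds, or just under 39 minutes of pure read time, so the 44 minutes observed is in line with that throughput.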

Should I really only be using DataStage for simple aggregations on a small number of rows? That is how I am leaning now. What would be a good benchmark to compare my DataStage jobs against? What other things should one look at when they have extract-and-aggregate requirements (user-defined SQL to bring data in sorted, etc.)?
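
(By user-defined SQL to bring data in sorted, I mean something like the following, again with made-up names, so the rows arrive already ordered on the aggregation keys and the job can skip its own sort stage:)

    SELECT key1, key2, key3, key4, amount, rate_key, code_key
    FROM   fact_detail
    ORDER BY key1, key2, key3, key4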

Just curious what your opinions are. This is mostly academic, as I will beat my needed SLA using a combined DB2/DataStage approach, though I'd like to crush it.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

If you are talking data warehouses or any type of reporting database I think of DataStage as primarily being responsible for delivering transformed data at its lowest level of granularity. What you do with it next can be any number of tools.

If you are on Netezza, the recent buzz suggests you don't need any aggregation tables at all; it is more efficient to aggregate on the fly. Various BI tools can do their own aggregation on extract, Business Objects Universes for example, and OLAP tools can do specific aggregations for reporting performance.

The ELT approach is also viable, where ETL does the heavy lifting and ELT does the aggregation. These ELT jobs can be controlled from DataStage via dummy jobs with user-defined SQL or via stored procedure calls. This makes the ELT somewhat more transparent to overall job control and metadata reporting.
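
For example, the dummy job's user-defined SQL might be nothing more than an INSERT ... SELECT that pushes the aggregation into the database, or a stored procedure call (all names here are illustrative):

    INSERT INTO agg_sales (key1, key2, key3, key4, total_amount)
    SELECT key1, key2, key3, key4, SUM(amount)
    FROM   detail_sales
    GROUP BY key1, key2, key3, key4

    -- or simply:
    CALL build_aggregates()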

Informatica has combined ETL and ELT into a single tool, and I expect IBM will follow within the next year or two, making scenarios such as yours easier to code.