Aggregate Performance vs. DB2 vs. other options
Posted: Tue Oct 10, 2006 7:38 am
Hey all,
I am just curious about your general perceptions of Aggregator performance in DataStage.
If I had comparable CPU/memory on a DB2 database and on a DataStage server, which should win (keeping in mind that you have to get the data off DB2 to start with anyway)?
I am currently aggregating 232,164,602 records into
I am using the following process in my DataStage attempt (though I will probably break this into smaller steps). I have turned auto-sort and auto-partition off in the DataStage job.
1. Table read in the 232mill records
2. hash partition immediately on 1 key with no sort
3. 2 lookups to 100k rows each
4. 2 lookups to 25 rows each
5. 2 lookups to 2500 rows
6. Transform (multiplications of source values with some of the values returned by the lookups)
7. Sort on 4 keys (no re-partitioning)
8. Sort aggregator
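For illustration, the logic of steps 1–8 can be sketched in miniature. This is plain Python on toy data, not DataStage internals; the `hash_partition` helper, the lookup table, and all field names are made up for the example:

```python
from collections import defaultdict

def hash_partition(rows, key, nparts):
    """Step 2: route each row to a partition by hashing one key."""
    parts = [[] for _ in range(nparts)]
    for row in rows:
        parts[hash(row[key]) % nparts].append(row)
    return parts

# Steps 3-5: small reference tables held in memory, as lookup stages would be.
rate_lookup = {"A": 1.10, "B": 0.95}

rows = [
    {"acct": 1, "grp": "A", "amt": 100.0},
    {"acct": 1, "grp": "A", "amt": 50.0},
    {"acct": 2, "grp": "B", "amt": 200.0},
]

totals = defaultdict(float)
for part in hash_partition(rows, "acct", nparts=2):
    # Step 6: transform -- multiply each source value by a looked-up rate.
    enriched = [(r["acct"], r["amt"] * rate_lookup[r["grp"]]) for r in part]
    # Step 7: sort on the grouping key within the partition (no re-partitioning).
    enriched.sort(key=lambda t: t[0])
    # Step 8: sort aggregator -- sum over each contiguous run of equal keys.
    for acct, value in enriched:
        totals[acct] += value

print(dict(sorted(totals.items())))
```

The point of hash-partitioning on the grouping key first is that every row for a given key lands in the same partition, so each partition can sort and aggregate independently with no cross-partition merge.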
With the DB2/DataStage combination it currently takes between 23 and 40 minutes.
On DataStage alone, I killed the job after 1:15 when only 30 million rows had gone through the aggregator; the box's CPU and memory were pegged and others need to share it right now. I have turned off auto sort and partition. I am doing parallel table reads to get the data in as fast as I can (though with this many records it still takes a while, at ~100k rows/sec), and am doing one partition with no sort, then one sort without repartitioning.
My statistics: on an 8-node configuration, it took 44 minutes just to read in all the data on the DataStage server. The lookups, for the most part, did not add any time on top of the read.
Should I really only be using DataStage for simple aggregations on a small number of rows? That is how I am leaning now. What would be a good benchmark to compare my DataStage jobs against? What else should one look at when they have extract-and-aggregate requirements (user-defined SQL to bring the data in already sorted, etc.)?
Just curious what your opinions are --- this is mostly academic, as I will beat my needed SLA using a combined DB2/DataStage approach --- though I'd like to crush it.
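On the "bring data in sorted" idea: if the rows arrive from the database already ordered on the grouping keys (e.g. via an `ORDER BY` in user-defined SQL), the aggregator can emit each group as soon as the key changes instead of buffering or re-sorting the full 232M rows. A minimal sketch of that streaming behaviour, in plain Python with invented field names (not a DataStage API):

```python
from itertools import groupby

def stream_aggregate(sorted_rows, key_fields):
    """Sum 'amt' per key, assuming input is already sorted on key_fields.
    Holds only one group in memory at a time, regardless of total row count."""
    keyfn = lambda r: tuple(r[k] for k in key_fields)
    for key, group in groupby(sorted_rows, key=keyfn):
        yield key, sum(r["amt"] for r in group)

# Rows as they might arrive from a SELECT ... ORDER BY region, product.
rows = [
    {"region": "E", "product": "X", "amt": 10.0},
    {"region": "E", "product": "X", "amt": 5.0},
    {"region": "E", "product": "Y", "amt": 7.0},
    {"region": "W", "product": "X", "amt": 3.0},
]

result = list(stream_aggregate(rows, ["region", "product"]))
print(result)
# → [(('E', 'X'), 15.0), (('E', 'Y'), 7.0), (('W', 'X'), 3.0)]
```

The trade-off, of course, is that the sort work doesn't disappear; it just moves into the database, which may or may not be the cheaper place for it.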