Aggregation

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

bart12872
Participant
Posts: 82
Joined: Fri Jan 19, 2007 5:38 pm

Aggregation

Post by bart12872 »

Hi,

I have a problem with an aggregation. I need to aggregate 400 billion rows down to 2 billion rows and compute 20 indicators.
Performance is poor in DataStage because I must use the sort mode (sorting 400 billion rows, hmm!).
So I decided to extract and transform my data, load the rows into DB2, and let DB2 do the aggregation. It's better, but still too slow.

Additional info: each indicator is defined at row level, so I can't pre-aggregate the data in the DB2 extraction. I need to extract all the rows, compute the indicators, and then aggregate with sum functions.

Has anyone faced this problem? Do you have any ideas for improving performance in this situation?

Could the restructure stages help me? I mean, if I create vectors of indicators and use Combine Records, for example?

Thanks,
Martin.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

I would pre-sort the data before the Aggregator stage in DataStage and tell the stage that the data is already sorted; the job will then just fly through the data. If the same data is to be used more than once, sort it into a dataset and use that.
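
To illustrate why sorted input helps, here is a minimal Python sketch (not DataStage code) of aggregating a stream that is already sorted on the grouping key. The row layout `(key, indicators)`, the function name `aggregate_sorted`, and the sample values are assumptions made up for the example. Because all rows for a key arrive together, only one group's running totals need to be held in memory at a time.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_sorted(rows):
    """Sum indicator columns per key for rows already sorted by key.

    `rows` is an iterable of (key, indicators) pairs, where `indicators`
    is a tuple of numbers (hypothetical layout for this sketch).
    """
    for key, group in groupby(rows, key=itemgetter(0)):
        totals = None
        for _, indicators in group:
            if totals is None:
                # First row of the group initialises the running totals.
                totals = list(indicators)
            else:
                for i, value in enumerate(indicators):
                    totals[i] += value
        # The group is complete as soon as the key changes, so emit it now.
        yield key, tuple(totals)


# Hypothetical sample data, already sorted on the key.
rows = [("A", (1, 10)), ("A", (2, 20)), ("B", (5, 50))]
result = list(aggregate_sorted(rows))
```

Note that `groupby` only groups adjacent rows, so the sketch gives wrong totals on unsorted input, which is the same reason the stage must be told truthfully that the data is sorted.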

The more nodes you sort into and use at runtime, the more throughput you will get (this, of course, depends upon your hardware).
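
As a rough sketch of the idea of spreading the sort over several nodes, the Python below hash-partitions rows on the grouping key so that each partition can be sorted and aggregated independently. This is only an illustration of the general technique, not how the PX engine is implemented; the names `partition_by_key`, `sort_and_aggregate`, and `run`, the use of CRC32 as the hash, and the sample data are all assumptions. The partitions are processed one after another here, whereas on real hardware each would run on its own node.

```python
from itertools import groupby
from operator import itemgetter
from zlib import crc32


def partition_by_key(rows, num_nodes):
    """Route each (key, indicators) row to a partition chosen by key hash."""
    partitions = [[] for _ in range(num_nodes)]
    for key, indicators in rows:
        node = crc32(key.encode("utf-8")) % num_nodes
        partitions[node].append((key, indicators))
    return partitions


def sort_and_aggregate(partition):
    """Sort one partition on the key, then sum indicators group by group."""
    result = {}
    ordered = sorted(partition, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        # Transpose the group's indicator tuples into columns and sum each.
        columns = zip(*(indicators for _, indicators in group))
        result[key] = tuple(sum(column) for column in columns)
    return result


def run(rows, num_nodes):
    """Aggregate every partition and combine the per-partition results."""
    combined = {}
    for partition in partition_by_key(rows, num_nodes):
        # A key lives in exactly one partition, so results never collide.
        combined.update(sort_and_aggregate(partition))
    return combined


# Hypothetical unsorted sample data.
sample = [("A", (1, 10)), ("B", (5, 50)), ("A", (2, 20)), ("C", (7, 70))]
totals = run(sample, num_nodes=3)
```

Partitioning on the same key that is used for grouping is what makes the per-partition results final: no second aggregation pass across partitions is needed.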

If the execution time is really important, look into a product such as SyncSort. I would be surprised if a database could do this relatively straightforward aggregation faster than a PX job.