Dataset Read Performance

nishadkapadia · Post by **nishadkapadia** » Thu Sep 07, 2006 7:01 am

Hi,

We have a simple job design.

Dataset (30 million) --> Small Lkp (393) --> Small Lkp (392) --> Datasets

The dataset is not sorted. More than half of my CPUs are over 55% idle (according to 'topas'). However, the throughput when reading this dataset is no more than 8000 rows/sec. Am I missing something here, such as tunables?

There is ample space (in GB): 47% is free during execution, in both the dataset and scratch disk space.

Thanks for your continued help.

ray.wurlod · Post by **ray.wurlod** » Thu Sep 07, 2006 7:06 am

Are you using the same configuration as the one used to create the source Data Set? Set APT_DUMP_SCORE to True and examine the score that is logged, to verify what degree of parallelism is being used.

Rows/sec is not a reliable metric, not least because row sizes vary. Prefer MB/minute.
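To illustrate why the row rate alone says little, here is a minimal sketch that converts rows/sec into MB/minute for a few row widths. The row widths are hypothetical values chosen for illustration, and 1 MB is assumed to be 1,000,000 bytes; only the 8000 rows/sec figure comes from the thread.

```python
def mb_per_minute(rows_per_sec: float, avg_row_bytes: float) -> float:
    """Convert a row rate into MB/minute (assumes 1 MB = 1,000,000 bytes)."""
    return rows_per_sec * avg_row_bytes * 60 / 1_000_000


if __name__ == "__main__":
    # Same row rate, very different data volumes.
    # The row widths below are hypothetical, not measured values.
    for width in (50, 250, 2000):
        rate = mb_per_minute(8000, width)
        print(f"{width:>5} bytes/row -> {rate:.1f} MB/min")
```

The same 8000 rows/sec corresponds to anything from tens to hundreds of MB/minute depending on row width, which is why a volume-based figure is easier to compare across jobs.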

nishadkapadia · Post by **nishadkapadia** » Thu Sep 07, 2006 7:52 am

I will be careful in analysing metrics. Thanks.
The revised design is:

Dataset (30 million) --> Lkp1 (363) --> Lkp2 (392) --> Lkp3 (991343) --> Lkp4 (35116) --> Datasets

I understand that Lkp3 could have been a join instead; however, with the input stream being this large, sorting and partitioning it would have an impact of its own.

The throughput is around 2 MB/sec (8000 rows/sec).
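As a rough sanity check on these figures, the sketch below derives the implied average row width and the time needed to read all 30 million rows at the observed rate. It assumes 1 MB = 1,000,000 bytes and a constant rate for the whole run; the function names are illustrative only.

```python
ROWS_TOTAL = 30_000_000        # rows in the source Data Set (from the thread)
ROWS_PER_SEC = 8_000           # observed row rate (from the thread)
BYTES_PER_SEC = 2 * 1_000_000  # observed ~2 MB/sec, assuming 1 MB = 10**6 bytes


def implied_row_bytes(bytes_per_sec: float, rows_per_sec: float) -> float:
    """Average row width implied by a byte rate and a row rate."""
    return bytes_per_sec / rows_per_sec


def elapsed_minutes(total_rows: float, rows_per_sec: float) -> float:
    """Minutes needed to process total_rows at a constant row rate."""
    return total_rows / rows_per_sec / 60


if __name__ == "__main__":
    print(implied_row_bytes(BYTES_PER_SEC, ROWS_PER_SEC))  # 250.0 bytes per row
    print(elapsed_minutes(ROWS_TOTAL, ROWS_PER_SEC))       # 62.5 minutes
```

So the reported numbers imply rows of roughly 250 bytes and a run of a little over an hour for the full Data Set.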

ray.wurlod · Post by **ray.wurlod** » Thu Sep 07, 2006 3:39 pm

The Lookup stage is a composite operator; it creates two operators: one to load the reference (virtual) Data Set, the other to perform the actual lookup operation. You might benefit from increasing the buffer sizes on the inputs to Lkp3 and possibly Lkp4.
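To get a feel for what a larger buffer buys, the following sketch estimates how many rows fit in a link buffer of a given size. It rests on assumptions that should be checked against your own installation: a default maximum memory buffer size of 3145728 bytes (3 MB) per buffer, and an average row width of about 250 bytes, derived from the 2 MB/sec and 8000 rows/sec quoted earlier in the thread. The candidate sizes are arbitrary illustrations, not recommendations.

```python
ASSUMED_DEFAULT_BUFFER_BYTES = 3_145_728  # assumption: 3 MB default per buffer
ASSUMED_ROW_BYTES = 250                   # assumption: ~2 MB/sec over 8000 rows/sec
ROWS_PER_SEC = 8_000                      # observed row rate (from the thread)


def rows_in_buffer(buffer_bytes: int, row_bytes: int) -> int:
    """Whole rows that fit in a buffer, ignoring any per-row overhead."""
    return buffer_bytes // row_bytes


def seconds_of_data(buffer_bytes: int, row_bytes: int, rows_per_sec: float) -> float:
    """How long the buffer can absorb input at the given rate before filling."""
    return rows_in_buffer(buffer_bytes, row_bytes) / rows_per_sec


if __name__ == "__main__":
    # Multipliers are hypothetical candidates to try, not tuned values.
    for factor in (1, 4, 16):
        size = ASSUMED_DEFAULT_BUFFER_BYTES * factor
        rows = rows_in_buffer(size, ASSUMED_ROW_BYTES)
        secs = seconds_of_data(size, ASSUMED_ROW_BYTES, ROWS_PER_SEC)
        print(f"{size:>10} bytes -> {rows:>7} rows, about {secs:.1f} s of input")
```

Under these assumptions a single default-sized buffer holds under two seconds of input, so a downstream stage that stalls briefly can fill it quickly.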

nishadkapadia · Post by **nishadkapadia** » Wed Sep 13, 2006 5:32 am

I currently do not have access to DataStage; however, I will report back to the forum once I have tried it out.

Thanks again for your continued help.