Dataset Read Performance

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

nishadkapadia
Charter Member
Posts: 47
Joined: Fri Mar 18, 2005 5:59 am

Post by nishadkapadia »

Hi,

We have a simple job design.

Dataset (30 million rows) --> Small Lkp (393) --> Small Lkp (392) --> Datasets

The dataset is not sorted. More than half of my CPUs are over 55% idle (according to 'topas'), yet the throughput when reading this dataset is no more than 8000 rows/sec. Am I missing something here, such as tunables?

There is ample disk space (in the GB range): 47% is free during execution on both the dataset and scratch disks.

Thanks for your continued help
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Are you using the same configuration as the one used to create the source Data Set? Set APT_DUMP_SCORE to True and examine the score that is logged, to verify what degree of parallelism is being used.

Rows/sec is not a reliable metric, not least because row sizes vary. Prefer MB/minute.
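To illustrate why the two metrics can disagree, here is a minimal Python sketch of the conversion. The function name and both row widths are hypothetical, chosen only for illustration; they are not measurements from this job.

```python
def mb_per_minute(rows_per_sec: float, avg_row_bytes: float) -> float:
    """Convert a row rate to MB/minute, taking 1 MB = 1,000,000 bytes."""
    return rows_per_sec * avg_row_bytes * 60 / 1_000_000


# Same row rate, very different data volume.
# Both row widths below are assumed values, not taken from the job.
narrow = mb_per_minute(8000, 50)    # 50-byte rows
wide = mb_per_minute(8000, 500)     # 500-byte rows

print(f"narrow rows: {narrow:.0f} MB/min, wide rows: {wide:.0f} MB/min")
```

Two jobs reporting the same 8000 rows/sec can therefore differ tenfold in the amount of data actually moved.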
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
nishadkapadia
Charter Member
Posts: 47
Joined: Fri Mar 18, 2005 5:59 am

Post by nishadkapadia »

I will be more careful when analysing metrics, thanks.
The revised design is:

Dataset (30 million) --> Lkp1 (363) --> Lkp2 (392) --> Lkp3 (991343) --> Lkp4 (35116) --> Datasets

I understand that Lkp3 could have been a join instead; however, since the input stream is fairly large, sorting and partitioning it would have an impact of its own.

Throughput is around 2 MB/sec (8000 rows/sec).
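As a rough sanity check on those figures, the arithmetic can be sketched in Python as below. It assumes 1 MB = 1,000,000 bytes and that the rate stays constant for the whole run; the variable names are only illustrative.

```python
ROWS_TOTAL = 30_000_000     # size of the source dataset
ROWS_PER_SEC = 8000         # observed row rate
BYTES_PER_SEC = 2_000_000   # "around 2 MB/sec", assuming 1 MB = 10**6 bytes

# Average row width implied by the two observed rates.
bytes_per_row = BYTES_PER_SEC / ROWS_PER_SEC

# Time to read the full dataset if the rate stays constant (an assumption).
minutes_total = ROWS_TOTAL / ROWS_PER_SEC / 60

# The same throughput expressed in MB/minute.
mb_per_min = BYTES_PER_SEC * 60 / 1_000_000

print(f"{bytes_per_row:.0f} bytes/row, {minutes_total:.1f} min, {mb_per_min:.0f} MB/min")
```

Under these assumptions the rows average about 250 bytes each, and the full 30 million rows take a little over an hour at roughly 120 MB/minute.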
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The Lookup stage is a composite operator; it creates two operators: one to load the reference (virtual) Data Set, the other to perform the actual lookup operation. You might benefit from increasing the buffer sizes on the inputs to Lkp3 and possibly Lkp4.
nishadkapadia
Charter Member
Posts: 47
Joined: Fri Mar 18, 2005 5:59 am

Post by nishadkapadia »

I currently do not have access to DataStage; however, I will try this out and report back to the forum.

Thanks again for your continued help.