How accurate are the rows/sec counts in DataStage?

A forum for discussing DataStage® basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

How accurate are the rows/sec counts in DataStage?

Post by SURA »

Hi there

I am loading a massive volume of records into a table. The source is a dataset and the target is SQL Server 2008, loaded through the ODBC Connector using an upsert (insert then update).

When I tested against a sample table I managed to get the load up to 23K rows/sec, whereas the real table starts at 10K and drops to 2,200 rows/sec once the volume crosses 100 million!! I suspect it could be due to the partitions and indexes.

But my concern is that the Designer screen shows one rows/sec number while the Director Monitor shows a different number.

If the rows/sec is 2,200, I would expect the row count on the screen to keep changing, but for a while it does not.

How accurate are the rows/sec counts in DataStage?

How can I find the real load counts?
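One thing I can do is cross-check the real count directly on the SQL Server side with a small script, something like the rough sketch below (the driver, connection details and table name are just placeholders for my environment):

Code:

# Rough sketch only - verify the real loaded count straight from the target table.
# Driver, server, database and table names below are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my_server;DATABASE=my_db;Trusted_Connection=yes;"
)
cur = conn.cursor()
# COUNT_BIG returns a bigint, so it stays safe well past 100 million rows
cur.execute("SELECT COUNT_BIG(*) FROM dbo.TARGET_TABLE")
print("Rows actually in target:", cur.fetchone()[0])
conn.close()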

Any idea?
Thanks
Ram
----------------------------------
Revealing your ignorance is fine, because you get a chance to learn.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Sampling is at 10 second intervals, unless you've changed this via Options. This is pretty coarse sampling, particularly for small volumes. Further, rows/sec can include wait time for the generating (extraction) query to start delivering rows, which can be substantial (for example if the query includes an ORDER BY clause). Various other overheads are included in the denominator of this calculation too. I tend not to regard rows/sec as a particularly useful metric except when comparing apples with apples (usually different runs of the same job with the same data).
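As a trivial illustration (made-up numbers, Python only to show the arithmetic), suppose the extraction query spends 60 seconds on its ORDER BY before delivering the first row, then delivers 25,000 rows per second:

Code:

# Illustration only - invented numbers showing how up-front wait time
# dilutes the cumulative rows/sec figure that gets reported.
wait_seconds = 60        # time before the sorted query returns its first row
delivery_rate = 25000    # steady delivery rate once rows start flowing

for elapsed in (70, 100, 160):
    rows_so_far = (elapsed - wait_seconds) * delivery_rate
    print(f"{elapsed} s elapsed: {rows_so_far // elapsed} rows/sec reported, "
          f"{delivery_rate} rows/sec actually being delivered")

The reported figure creeps up towards the true delivery rate but never reaches it, which is another reason to compare like with like rather than treat the number as absolute.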
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

The RPS can be misleading, especially on long-running front-end queries, since it starts timing from the second the query is sent. But overall, taking that into account it is moderately accurate.

DataStage keeps reducing the "report" rate as the number of rows grows, which is why you see it start reporting in 100k's or millions instead of exact row counts. However, the end number reported on a completed job (in the Designer as well as the Director log) is accurate.

There are numerous reasons a job would run slower with millions of rows instead of a small test set. You can usually use the RPS to isolate where the slowdown is occurring (note the drop in RPS after a stage) to get an idea where to start.
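For example (using the figures from your own post, purely as an illustration): if the links upstream of the ODBC Connector still show 10K+ rows/sec while the Connector's input link has dropped to around 2,200, the write to the database (indexes, partitions, the update half of the upsert) is the first place to look.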
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

It's important to note (in my mind) that it is not the RPS at any particular point in time but rather the average RPS up to that point. And as noted, since it includes wait time, that's why during periods with 'no activity' the only thing changing is the RPS. Going down.
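A quick worked example (made-up numbers): 1,000,000 rows after 100 seconds reports as 10,000 rows/sec; if nothing moves for the next 25 seconds, the same 1,000,000 rows over 125 seconds reports as 8,000 rows/sec - the row count hasn't changed, only the average has fallen.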
-craig

"You can never have too many knives" -- Logan Nine Fingers
kvshrao
Participant
Posts: 6
Joined: Fri Aug 03, 2012 12:25 am
Location: Chennai

Post by kvshrao »

Rows per second is moderately accurate - it depends on various factors such as the source, the target, and the transformations in between.


Thanks
Rao Kuntamukkala
www.networkzindia.com
Thanks
Ravi