
How accurate are the rows/sec counts in DataStage?

Posted: Tue May 06, 2014 5:39 pm
by SURA
Hi there

I am loading a massive volume of records into a table. The source is a Dataset and the target is SQL Server 2008, loaded through the ODBC Connector using UPSERT (insert then update).

When I tested against a sample table I managed to get the load up to 23K rows/sec, whereas the real table starts at 10K and drops to 2,200 rows/sec once the volume crosses 100 million!! I suspect this is due to the partitions and indexes.

But my concern is that the Designer screen shows one rows/sec figure while the Director Monitor shows a different number.

If the rows/sec really is 2,200, I would expect the row count on the screen to keep changing, but for quite a while it does not.

How accurate are the rows/sec counts in DataStage?

How can I find the real load counts?

Any idea?

Posted: Tue May 06, 2014 8:02 pm
by ray.wurlod
Sampling is at 10 second intervals, unless you've changed this via Options. This is pretty coarse sampling, particularly for small volumes. Further, rows/sec can include wait time for the generating (extraction) query to start delivering rows, which can be substantial (for example if the query includes an ORDER BY clause). Various other overheads are included in the denominator of this calculation too. I tend not to regard rows/sec as a particularly useful metric except when comparing apples with apples (usually different runs of the same job with the same data).
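To make the wait-time effect concrete, here is a minimal Python sketch (illustration only, not DataStage code; the 120-second wait and the 10,000 rows/sec steady rate are assumed figures, not from this thread). Because the reported rows/sec is total rows divided by total elapsed time, any start-up wait keeps the displayed figure below the true throughput for a long while.

# Illustration only: how a cumulative rows/sec figure is diluted by
# start-up wait time. All numbers below are hypothetical.

def cumulative_rps(rows_delivered, elapsed_seconds):
    """Rows/sec as a monitor would report it: total rows over total elapsed time."""
    return rows_delivered / elapsed_seconds if elapsed_seconds else 0.0

# Suppose the source query spends 120 s sorting (ORDER BY) before the
# first row arrives, then delivers 10,000 rows/sec steadily.
true_rate = 10_000
wait = 120            # seconds before the first row is returned

for elapsed in (130, 300, 600, 1200):
    rows = true_rate * max(0, elapsed - wait)
    print(f"{elapsed:>5}s elapsed: reported {cumulative_rps(rows, elapsed):,.0f} rows/sec "
          f"(true throughput {true_rate:,} rows/sec)")

Run it and the reported figure climbs from roughly 770 towards 10,000 rows/sec even though the actual delivery rate never changes, which is exactly why the metric is only useful when comparing like with like.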

Posted: Wed May 07, 2014 11:40 am
by asorrell
The RPS can be misleading, especially on long-running front-end queries, since it starts timing from the moment the query is sent. Taking that into account, though, it is moderately accurate overall.

DataStage keeps reducing the "report" rate as the number of rows grows, which is why you see it start reporting in hundreds of thousands or millions instead of exact row counts. However, the final number reported for a completed job (in Designer as well as in the Director log) is accurate.

There are numerous reasons a job would run slower with millions of rows than with a small test set. You can usually use the RPS to isolate where the slowdown is occurring (note the drop in RPS after a stage) to get an idea of where to start.

Posted: Wed May 07, 2014 11:54 am
by chulett
It's important to note (in my mind) that it is not the RPS at any particular point in time but rather the average RPS up to that point. And as noted, since it includes wait time, that's why during periods with 'no activity' the only thing changing is the RPS: going down.
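A similarly minimal Python sketch of that point (again illustration only, with made-up numbers): once rows stop moving, the numerator of the running average is frozen while elapsed time keeps growing, so the displayed rows/sec can only fall.

# Illustration only (not DataStage code): why the displayed rows/sec keeps
# falling during a stall. The figure is a running average, so when no rows
# move the row total is frozen while the elapsed time keeps growing.

rows_so_far = 1_000_000   # rows already processed when the stall begins
active_time = 100         # seconds of real work so far

for stall in (0, 50, 100, 200):
    elapsed = active_time + stall
    print(f"stalled {stall:>3}s: monitor shows {rows_so_far / elapsed:,.0f} rows/sec")

In this made-up run the display drops from 10,000 to about 3,300 rows/sec even though not a single extra row has been written.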

Posted: Wed May 07, 2014 1:44 pm
by kvshrao
Rows per second is moderately accurate; it depends on various factors such as the source, the target, and the transformations involved.


Thanks
Rao Kuntamukkala
www.networkzindia.com