How accurate are the rows/sec counts in DataStage?

A forum for discussing DataStage® basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

How accurate are the rows/sec counts in DataStage?

Post by SURA »

Hi there

I am loading a massive volume of records into a table. The source is a dataset and the target is SQL Server 2008, loaded through the ODBC Connector using an upsert (insert then update).

When I tested against a sample table I managed to get the load up to 23K rows/sec, whereas the real table starts at 10K and drops to 2,200 rows/sec once the volume crosses 100 million!! I suspect it could be due to the partitions and indexes.

But my concern is that the Designer screen shows one rows/sec number while the Director Monitor shows a different number.

If the rows/sec is 2,200, I would expect the row count on the screen to keep changing, but for a while it does not.

How accurate are the rows/sec counts in DataStage?

How can I find the real load counts?
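One thing I can do is cross-check the real count directly on the SQL Server side with a small script, something like the rough sketch below (the driver, connection details and table name are just placeholders for my environment):

Code:

# Rough sketch only - verify the real loaded count straight from the target table.
# Driver, server, database and table names below are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my_server;DATABASE=my_db;Trusted_Connection=yes;"
)
cur = conn.cursor()
# COUNT_BIG returns a bigint, so it stays safe well past 100 million rows
cur.execute("SELECT COUNT_BIG(*) FROM dbo.TARGET_TABLE")
print("Rows actually in target:", cur.fetchone()[0])
conn.close()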

Any idea?
Thanks
Ram
----------------------------------
Revealing your ignorance is fine, because you get a chance to learn.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Sampling is at 10 second intervals, unless you've changed this via Options. This is pretty coarse sampling, particularly for small volumes. Further, rows/sec can include wait time for the generating (extraction) query to start delivering rows, which can be substantial (for example if the query includes an ORDER BY clause). Various other overheads are included in the denominator of this calculation too. I tend not to regard rows/sec as a particularly useful metric except when comparing apples with apples (usually different runs of the same job with the same data).
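As a trivial illustration (made-up numbers, Python only to show the arithmetic), suppose the extraction query spends 60 seconds on its ORDER BY before delivering the first row, then delivers 25,000 rows per second:

Code:

# Illustration only - invented numbers showing how up-front wait time
# dilutes the cumulative rows/sec figure that gets reported.
wait_seconds = 60        # time before the sorted query returns its first row
delivery_rate = 25000    # steady delivery rate once rows start flowing

for elapsed in (70, 100, 160):
    rows_so_far = (elapsed - wait_seconds) * delivery_rate
    print(f"{elapsed} s elapsed: {rows_so_far // elapsed} rows/sec reported, "
          f"{delivery_rate} rows/sec actually being delivered")

The reported figure creeps up towards the true delivery rate but never reaches it, which is another reason to compare like with like rather than treat the number as absolute.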
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

The RPS can be misleading, especially on long-running front-end queries, since it starts timing from the second the query is sent. But overall, taking that into account it is moderately accurate.

DataStage keeps reducing the "report" rate as the number of rows grows, which is why you see it start reporting in 100k's or millions instead of exact row counts. However, the end number reported on a completed job (in the Designer as well as the Director log) is accurate.

There are numerous reasons a job would run slower with millions of rows instead of a small test set. You can usually use the RPS to isolate where the slowdown is occurring (note the drop in RPS after a stage) to get an idea where to start.
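For example (using the figures from your own post, purely as an illustration): if the links upstream of the ODBC Connector still show 10K+ rows/sec while the Connector's input link has dropped to around 2,200, the write to the database (indexes, partitions, the update half of the upsert) is the first place to look.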
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

It's important to note (in my mind) that it is not the RPS at any particular point in time but rather the average RPS up to that point. And as noted, since it includes wait time, that's why during periods with 'no activity' the only thing changing is the RPS. Going down.
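A quick worked example (made-up numbers): 1,000,000 rows after 100 seconds reports as 10,000 rows/sec; if nothing moves for the next 25 seconds, the same 1,000,000 rows over 125 seconds reports as 8,000 rows/sec - the row count hasn't changed, only the average has fallen.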
-craig

"You can never have too many knives" -- Logan Nine Fingers
kvshrao
Participant
Posts: 6
Joined: Fri Aug 03, 2012 12:25 am
Location: Chennai

Post by kvshrao »

Rows per second is moderately accurate - it depends on various factors such as the source, the target, and the transformations in between.


Thanks
Rao Kuntamukkala
www.networkzindia.com
Thanks
Ravi