These are easy to create.
To test the speed of reading from a sequential file, use this design:
SeqFile -----> Transformer -----> SeqFile
The output link constraint is @FALSE, so that no rows are written.
This is the fastest data flow you will ever get in DataStage. Every additional piece of functionality will slow it down.
To test the speed of writing, use the same design but remove the output link constraint. You already know the cost of reading.
To test the speed of database extraction, replace the first Sequential File stage with a database stage, reinstate the output link constraint, and ensure that you select rows of the same size as in the sequential file case.
And so on.
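The same "measure a baseline, then measure baseline plus one extra step" idea applies outside DataStage too. As a minimal illustration (plain Python, not DataStage; the file paths and functions here are hypothetical), the read-only baseline corresponds to the @FALSE constraint, and the difference between the two timings isolates the cost of writing:

```python
import time

def time_read(path):
    """Baseline: read every row, write nothing (analogue of the @FALSE constraint)."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        for _ in f:
            pass  # rows are read and discarded
    return time.perf_counter() - start

def time_read_write(src, dst):
    """Same flow with the constraint removed: read every row and write it out."""
    start = time.perf_counter()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for row in fin:
            fout.write(row)
    return time.perf_counter() - start

# Cost of writing alone = (read + write time) - (read-only baseline time)
```

Each subsequent test swaps in exactly one new component against the same data, so the delta attributes cost to that component alone.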
What Is Performance?
Performance in the ETL world is the ability to process a given volume of data within a given time window.
Rows per second is a particularly useless rate, especially with small volumes of data, for a number of reasons. These include:
- it's a simple total rows / total time calculation
- total time includes all wait time (for example, waiting for before/after subroutines to complete)
- time is measured in whole seconds, sampled coarsely (for example, you can process 1000 rows in "0 seconds", at 0 rows per second, which actually means that the job finished in under 0.5 seconds)
- you need to guarantee that rows are identically constructed, so that you're comparing "apples with apples"; this is almost impossible in practice other than by using test data that you carry around with you
- you can process far more one-byte rows per second than you can thousand-byte rows
- you must cater for cache effects, usually by doing something else between test runs that invalidates the cache, or by setting up the cache in exactly the same way before each run
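The whole-second sampling problem is easy to demonstrate. A short sketch (the reporting behaviour is an assumption for illustration, not the actual DataStage monitor code):

```python
def naive_rows_per_sec(rows, elapsed_seconds):
    """Rows/sec as a monitor sampling in whole seconds might report it."""
    whole_seconds = int(elapsed_seconds)   # 0.4 s is sampled as 0 s
    if whole_seconds == 0:
        return 0                           # 1000 rows in "0 seconds" reports 0 rows/sec
    return rows / whole_seconds

print(naive_rows_per_sec(1000, 0.4))   # 0 - yet the true rate was 2500 rows/sec
print(naive_rows_per_sec(1000, 2.0))   # 500.0
```

At small volumes the rounding error dominates the figure entirely, which is exactly why the number is useless there.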
Make sure that you measure using a sufficient volume of data to yield statistically significant results. I'd advocate at least 100MB, to minimize the effect of opening and closing connections and other wait times.
Measure performance in terms of MB/minute (or GB/hour) rather than rows/second.
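The conversion is trivial but worth writing down once; a small helper (hypothetical, using 1 MB = 1,000,000 bytes as the assumed unit):

```python
def mb_per_minute(total_bytes, elapsed_seconds):
    """Throughput in MB/minute: independent of row width, unlike rows/second."""
    return (total_bytes / 1_000_000) / (elapsed_seconds / 60)

# 100 MB processed in 50 seconds -> 120 MB/minute,
# whether that was a million 100-byte rows or a hundred thousand 1000-byte rows.
print(mb_per_minute(100_000_000, 50))   # 120.0
```

Because the measure is based on bytes moved, the "apples with apples" row-construction problem above largely disappears.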