Performance Testing

Post by ewartpm »

I want to create a DS project with jobs/templates that allow one to test performance. Things like:
  • how fast does the database spool the data to DataStage?
  • how fast does DS read sequential files?
  • how fast does DS write to sequential files?
  • etc.

All of the above is really to help developers get a baseline for their environment. They can then design their jobs accordingly. (It may be a waste of time trying to optimise a job to the nth degree if the database is really slow and is not going to change.)

I believe this will be a real help to consultants who go to many sites. They can quickly see where problems exist.

Any input in this regard will be greatly appreciated.

Post by ray.wurlod »

These are easy to create.
To test the speed of reading from a sequential file, use this design.

Code: Select all

SeqFile  ----->  Transformer  ----->  SeqFile
Output link constraint is @FALSE so that no rows are written.
This is the fastest data flow you will ever get in DataStage. Every additional piece of functionality will slow it down.

To test the speed of writing, use the same design but remove the output link constraint. You already know the cost of reading.
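
A minimal sketch of the write test, using the same placeholder stage names as above; the only change is that the output link now has no constraint:

Code: Select all

SeqFile  ----->  Transformer  ----->  SeqFile
Every row read is also written, so the difference between this run and the read-only run above is the cost of writing.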

To test the speed of database extraction, replace the first sequential file stage with a database stage, reinstate the output link constraint, and ensure that you select rows of the same size as in the sequential file case.
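
A sketch of the extraction test; the ODBC/OCI stage named here is an assumption - use whichever database stage type matches your source:

Code: Select all

ODBC (or OCI, etc.)  ----->  Transformer  ----->  SeqFile
Output link constraint is @FALSE again, so only the cost of extracting the rows from the database is measured.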

And so on.


What Is Performance?

Performance in the ETL world is the ability to process a given volume of data within a given time window.
Rows per second is a particularly useless rate, especially with small volumes of data, for a number of reasons. These include:
  • it's a simple total rows / total time calculation
  • total time includes all wait time (e.g. for before/after subroutines to complete)
  • time is measured in whole seconds, sampled coarsely (for example, you can process 1000 rows in 0 seconds, at 0 rows per second - which actually means that the job finished in under 0.5 seconds)
  • you need to guarantee that rows are identically constructed so that you're comparing "apples with apples" - almost impossible to do in practice other than by using test data that you carry around with you; you can process far more one-byte rows per second than you can thousand-byte rows
  • you must cater for cache effects, usually by doing something else between test runs that will invalidate the cache, or by setting up the cache in exactly the same way before each test run
Make sure that you measure using a sufficient volume of data to yield statistically significant results. I'd advocate at least 100MB, to minimize the effect of opening and closing connections and other wait times.

Measure performance in terms of MB/minute (or GB/hour) rather than rows/second.
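
For example, 300 MB processed in 120 seconds is 150 MB/minute (roughly 9 GB/hour), regardless of whether those were one-byte or thousand-byte rows. If you want to standardise the calculation, a server routine along these lines could do it - a sketch only, with a hypothetical name and arguments (total bytes processed and elapsed seconds) that are mine, not from the post:

Code: Select all

* ThroughputMBPerMin(TotalBytes, ElapsedSecs) - hypothetical routine name and arguments
* Returns throughput in MB per minute; returns 0 if the elapsed time is unusable.
If ElapsedSecs > 0 Then
   Ans = (TotalBytes / 1048576) / (ElapsedSecs / 60)
End Else
   Ans = 0
End
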
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.