These are easy to create.
To test the speed of reading from a sequential file, use this design:
SeqFile -----> Transformer -----> SeqFile
The output link constraint is @FALSE, so that no rows are written.
This is the fastest data flow you will ever get in DataStage. Every additional piece of functionality will slow it down.
To test the speed of writing, use the same design but remove the output link constraint. You already know the cost of reading.
To test the speed of database extraction, replace the first Sequential File stage with a database stage, reinstate the output link constraint, and ensure that you select rows of the same size as in the sequential file case.
And so on.
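The same "measure a baseline, then measure baseline plus one extra step" idea applies outside DataStage too. As a minimal illustration (plain Python, not DataStage; the file paths and functions here are hypothetical), the read-only baseline corresponds to the @FALSE constraint, and the difference between the two timings isolates the cost of writing:

```python
import time

def time_read(path):
    """Baseline: read every row, write nothing (analogue of the @FALSE constraint)."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        for _ in f:
            pass  # rows are read and discarded
    return time.perf_counter() - start

def time_read_write(src, dst):
    """Same flow with the constraint removed: read every row and write it out."""
    start = time.perf_counter()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for row in fin:
            fout.write(row)
    return time.perf_counter() - start

# Cost of writing alone = (read + write time) - (read-only baseline time)
```

Each subsequent test swaps in exactly one new component against the same data, so the delta attributes cost to that component alone.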
What Is Performance?
Performance in the ETL world is the ability to process a given volume of data within a given time window.
Rows per second is a particularly useless rate, especially with small volumes of data, for a number of reasons. These include:
- it's a simple total rows / total time calculation
- total time includes all wait time (for example, waiting for before/after subroutines to complete)
- time is measured in whole seconds, sampled coarsely (for example, you can process 1000 rows in "0 seconds", at 0 rows per second, which actually means that the job finished in under 0.5 seconds)
- you need to guarantee that rows are identically constructed, so that you're comparing "apples with apples"; this is almost impossible in practice other than by using test data that you carry around with you
- you can process far more one-byte rows per second than you can thousand-byte rows
- you must cater for cache effects, usually by doing something else between test runs that invalidates the cache, or by setting up the cache in exactly the same way before each run
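The whole-second sampling problem is easy to demonstrate. A short sketch (the reporting behaviour is an assumption for illustration, not the actual DataStage monitor code):

```python
def naive_rows_per_sec(rows, elapsed_seconds):
    """Rows/sec as a monitor sampling in whole seconds might report it."""
    whole_seconds = int(elapsed_seconds)   # 0.4 s is sampled as 0 s
    if whole_seconds == 0:
        return 0                           # 1000 rows in "0 seconds" reports 0 rows/sec
    return rows / whole_seconds

print(naive_rows_per_sec(1000, 0.4))   # 0 - yet the true rate was 2500 rows/sec
print(naive_rows_per_sec(1000, 2.0))   # 500.0
```

At small volumes the rounding error dominates the figure entirely, which is exactly why the number is useless there.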
Make sure that you measure using a sufficient volume of data to yield statistically significant results. I'd advocate at least 100MB, to minimize the effect of opening and closing connections and other wait times.
Measure performance in terms of MB/minute (or GB/hour) rather than rows/second.
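The conversion is trivial but worth writing down once; a small helper (hypothetical, using 1 MB = 1,000,000 bytes as the assumed unit):

```python
def mb_per_minute(total_bytes, elapsed_seconds):
    """Throughput in MB/minute: independent of row width, unlike rows/second."""
    return (total_bytes / 1_000_000) / (elapsed_seconds / 60)

# 100 MB processed in 50 seconds -> 120 MB/minute,
# whether that was a million 100-byte rows or a hundred thousand 1000-byte rows.
print(mb_per_minute(100_000_000, 50))   # 120.0
```

Because the measure is based on bytes moved, the "apples with apples" row-construction problem above largely disappears.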