Performance issue

Post questions here relating to DataStage Server Edition, for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


yiminghu
Participant
Posts: 91
Joined: Fri Nov 14, 2003 10:09 am

Performance issue

Post by yiminghu »

Hi Everybody,

I have a performance issue with my job. In our source system, we have a transactional table that receives around half a million records every day. Our daily ETL is required to read this table, do some transformation/aggregation, and then generate results in another table. But this process seems to take forever to run.

I am wondering how DataStage handles transformation internally. Does it read row by row, or chunk by chunk? Is there any way to speed up the process?

By the way, our database server and DataStage server are on two different machines, so we also have to take the network overhead into consideration.

Thanks,

yiming
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

If you have a job that reads a source table, does OCI/ODBC-based references, and writes to a target table, then you have the simplest job design but also the slowest one possible.

See this:
viewtopic.php?t=85569

Here are some ETL truths:

The fastest way to load data into a data warehouse is via a bulk loader. To use the fastest methods, you will have to differentiate inserts from updates.

The fastest way to extract data from a source table is to use multiple jobs (using instantiation) that select different subsets of the source table and collate them on the back side (write to separate sequential files, then concatenate them together).

The fastest way to transform data is to go from flat file to flat (or hash) file, and to use local reference constructs (arrays, hash files, etc.) that have no network or database latency. Use job instantiation to partition the source data into distinct sets, and instantiate to the degree of parallelism your server can handle. A sketch of these last two ideas follows below.
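
To make those last two points concrete outside DataStage, here is a rough Python sketch. It assumes a pyodbc DSN named "SRC"; the table, column, and file names are invented purely for illustration, and MOD() syntax varies by database. In a real Server job the partitions would be instantiated job instances and the dict would be a hash file.

Code:

# Illustration only: partitioned extraction to flat files plus a local
# in-memory reference, i.e. rough analogues of job instantiation and
# hash file lookups.
import pyodbc

PARTITIONS = 4  # degree of parallelism your server can handle

def extract_partition(p):
    """Extract one subset of the source table to its own flat file."""
    conn = pyodbc.connect("DSN=SRC")  # hypothetical DSN
    cur = conn.cursor()
    cur.execute(
        "SELECT txn_id, cust_id, amount "
        "FROM daily_txn WHERE MOD(txn_id, ?) = ?",
        PARTITIONS, p)
    with open(f"txn_{p}.dat", "w") as out:
        for txn_id, cust_id, amount in cur:
            out.write(f"{txn_id}|{cust_id}|{amount}\n")
    conn.close()

# In DataStage this would be N instances of the same extract job running
# in parallel; here the partitions simply run one after another.
for p in range(PARTITIONS):
    extract_partition(p)

# Collate on the back side: concatenate the partition files.
with open("txn_all.dat", "w") as out:
    for p in range(PARTITIONS):
        with open(f"txn_{p}.dat") as part:
            out.write(part.read())

# Local reference construct: load the lookup once into memory (a dict
# here, a hash file in a Server job) so no row pays a database round trip.
conn = pyodbc.connect("DSN=SRC")
cust_name = dict(conn.cursor().execute("SELECT cust_id, name FROM customer"))
conn.close()

with open("txn_all.dat") as src, open("txn_enriched.dat", "w") as out:
    for line in src:
        txn_id, cust_id, amount = line.rstrip("\n").split("|")
        name = cust_name.get(int(cust_id), "UNKNOWN")
        out.write(f"{txn_id}|{cust_id}|{name}|{amount}\n")
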
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

There are some simple test jobs you can create to determine the maximum capacity of your system.

Job 1 tests read performance from a sequential file. This is the fastest possible job; anything else will run more slowly. The Transformer stage has an output constraint of @FALSE, so the job discards everything it reads.

Code:

SeqFile  ---->  Transformer  ----->  /dev/null

Job 2 tests how fast you can select from your database. The stage type is whatever you want to use (for example, ODBC). Again, the job discards everything it reads.

Code:

ODBC  ---->  Transformer  ----->  /dev/null

Job 3 tests the speed of writing to a sequential file. You already know how fast you can read from a sequential file; any difference must be attributed to writing. The output link from the Transformer stage is unconstrained.

Code:

SeqFile  ---->  Transformer  ----->  SeqFile

You can develop further tests. The more processing you add, the less throughput you'll get.
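
If you want to try the same ceiling measurements outside DataStage, here is a rough Python sketch of the three tests. It assumes a pyodbc DSN named "SRC" and any large delimited flat file ("txn_all.dat" here); the table and file names are invented.

Code:

# Illustration only: crude rows-per-second analogues of the three baseline jobs.
import time
import pyodbc

def job1_read_and_discard(path):
    """Job 1: sequential file -> Transformer -> /dev/null (read ceiling)."""
    start, rows = time.time(), 0
    with open(path) as f:
        for _ in f:
            rows += 1
    return rows / (time.time() - start)

def job2_select_and_discard(sql):
    """Job 2: database SELECT -> discard (extract ceiling)."""
    conn = pyodbc.connect("DSN=SRC")  # hypothetical DSN
    cur = conn.cursor()
    cur.execute(sql)
    start, rows = time.time(), 0
    while True:
        batch = cur.fetchmany(10000)
        if not batch:
            break
        rows += len(batch)
    conn.close()
    return rows / (time.time() - start)

def job3_read_and_rewrite(src, dst):
    """Job 3: sequential file -> sequential file (Job 3 vs Job 1 = write cost)."""
    start, rows = time.time(), 0
    with open(src) as f, open(dst, "w") as out:
        for line in f:
            out.write(line)
            rows += 1
    return rows / (time.time() - start)

print("Job 1 rows/sec:", job1_read_and_discard("txn_all.dat"))
print("Job 2 rows/sec:", job2_select_and_discard("SELECT * FROM daily_txn"))
print("Job 3 rows/sec:", job3_read_and_rewrite("txn_all.dat", "txn_copy.dat"))
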
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
yiminghu
Participant
Posts: 91
Joined: Fri Nov 14, 2003 10:09 am

Post by yiminghu »

[quote="kcbland"]If you have a job that reads a source table, does OCI/ODBC based references, and writes to a target table, then you have the simplest job design but also the slowest one possible.

Hi Ken,

We do have a staging area in our DW, but we use database tables instead of sequential files. Do you mean that using sequential files as the staging medium will expedite the process?

Thanks,

Yiming
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Follow Ray's advice and build those jobs. It's soooo enlightening to see what is really going on.

If you write a job that selects from a table and spools it to a sequential file, you are essentially measuring the spool rate of the database plus the network overhead. Writing to a sequential file is negligible. Pulling data out of tables is dictated by the bandwidth of the connection and the load on the source database.

Loading data into tables is SLOW!!!! You have referential integrity, data typing, extent allocation, etc. going on. This is all overhead.
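
As a rough illustration of both points, the Python sketch below times a spool to a flat file against row-by-row INSERTs of the same data. It assumes a pyodbc DSN named "SRC", an invented source table daily_txn, and an invented scratch table txn_copy; a real load would use the database's bulk loader instead.

Code:

# Illustration only: spool-out versus row-by-row load-back timings.
import time
import pyodbc

conn = pyodbc.connect("DSN=SRC")  # hypothetical DSN
cur = conn.cursor()

# Spool out: mostly network bandwidth plus source database load.
start = time.time()
cur.execute("SELECT txn_id, cust_id, amount FROM daily_txn")
with open("spool.dat", "w") as out:
    for row in cur:
        out.write("|".join(str(col) for col in row) + "\n")
print("spool seconds:", time.time() - start)

# Load back in: every row pays for constraints, indexes, logging and so on.
start = time.time()
with open("spool.dat") as src:
    for line in src:
        txn_id, cust_id, amount = line.rstrip("\n").split("|")
        cur.execute(
            "INSERT INTO txn_copy (txn_id, cust_id, amount) VALUES (?, ?, ?)",
            int(txn_id), int(cust_id), float(amount))
conn.commit()
print("row-by-row insert seconds:", time.time() - start)
conn.close()
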

OPINION:

Most users of DataStage blame performance issues on the tool rather than on how database technology is applied in their job design.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ariear
Participant
Posts: 237
Joined: Thu Dec 26, 2002 2:19 pm

Post by ariear »

[quote="yiminghu] Do you mean using sequential file as media of staging table will expediate the process?

Thanks,

Yiming[/quote]

The nature of a staging area is transient, so why use a persistent medium like an RDBMS, with all the overhead that Ken described? And what about its resources, such as the permanent buffer cache, which could instead be used for caching hash files? Usually there is no need for a local RDBMS on the ETL tier (only in cases of massive "between" lookups that UniVerse is too slow for, and some other rare cases).
Everything an RDBMS can do, DataStage can do, and faster! (Talking ETL, of course.) I'm an Ascential partner and have beaten INFA 17 times in a row! As Ray said: full table scan ---> Seq files/Hash files ---> Transformations ---> Bulk load.
They're still doing transformations in SQL!
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

ariear wrote: Full table scan ---> Seq files/Hash files ---> Transformations ---> Bulk load.

Don't forget that a partial table scan may be even better. Only SELECT the rows and columns that you need to process. Restricting the rows will be assisted by indexes (of course); restricting the columns is simply a matter of specifying only the columns with which you want to work.

:idea: The fastest way to process data is not to do any unnecessary processing!
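
A minimal sketch of the difference, with invented table and column names (the date predicate assumes an indexed load_date column, and exact syntax varies by database):

Code:

# Illustration only: a full scan versus a restricted extract.
full_scan = "SELECT * FROM daily_txn"

restricted = (
    "SELECT txn_id, cust_id, amount "   # only the columns you transform
    "FROM daily_txn "
    "WHERE load_date = CURRENT_DATE"    # only today's rows, via the index
)
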
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

In the architecture models for data warehouses, there are 3-4 "database" instances that come into play:

1. Persistent staging database (PSD). This is the landing zone and induction site for source data. The data is not manipulated in any way; it is simply stored in raw form so that the original, unadulterated data is accessible. There is no normalization or integration; the point is that the data is not touched, just stored. Karolyn Duncan of TDWI suggests that this layer be normalized, but the problem is that you then end up having to scrub, suspend, and correct data here, and you have no option to go back, change the rules, and reload the data - it's gone from the sources.
2. Operational Data Store (ODS) - Optional. This is a highly normalized and integrated model to provide a 3NF corporate data model.
3. Enterprise Data Warehouse (EDW). This is the typical data warehouse. It can be a Star or Atomic model, depending on your flavor.
4. Data Marts. These are typically denormalized and/or Star models.

In a hub and spoke architecture, you typically do high-speed bulk loading of raw data into the PSD. You do the heavy lifting (cleansing, surrogate key assignment, and aggregation) into the EDW, sourcing from the PSD. The marts are fed by simple SQL processes from the EDW, as the marts are deployments of dimensions, aggregates, and further aggregates.

So, the "staging database" as I talk about is the "sandbox". During ETL between the PSD and the EDW, you will use sequential staging files BECAUSE ITS PART OF A LARGER PHILOSOPHY I HAVE ALREADY WAXED ELOQUENT ABOUT. The hash files become a working model of the target so that you can achieve the goals laid down by Kimball for proper loading techniques (READ KIMBALL).
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle