Performance issue

Post questions here relating to DataStage Server Edition, for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


yiminghu
Participant
Posts: 91
Joined: Fri Nov 14, 2003 10:09 am

Performance issue

Post by yiminghu »

Hi Everybody,

I have a performance issue with my job. In our source system, we have a transactional table that receives around half a million records every day. Our daily ETL is required to read this table, do some transformation/aggregation, and then generate results in another table. But this process seems to take forever to run.

I am wondering how DataStage handles transformation internally. Does it read row by row, or chunk by chunk? Is there any way to speed up the process?

By the way, our database server and DataStage server are on two different machines, so we also have to take the network overhead into consideration.

Thanks,

yiming
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

If you have a job that reads a source table, does OCI/ODBC-based references, and writes to a target table, then you have the simplest job design but also the slowest one possible.

See this:
viewtopic.php?t=85569

Here are some ETL truths:

The fastest way to load data into a data warehouse is via a bulk loader. To use the fastest methods, you will have to differentiate inserts from updates.

The fastest way to extract data from a source table is to use multiple jobs (using instantiation) that select different subsets of the source table and collate them on the back side (write to separate sequential files, then concatenate them together).

The fastest way to transform data is to go from flat file to flat (or hash) file, and to use local reference constructs (arrays, hash files, etc.) that have no network or database latency. Use job instantiation to partition the source data into distinct sets, and instantiate to the degree of parallelism your server can handle. A sketch of these last two ideas follows below.
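
To make those last two points concrete outside DataStage, here is a rough Python sketch. It assumes a pyodbc DSN named "SRC"; the table, column, and file names are invented purely for illustration, and MOD() syntax varies by database. In a real Server job the partitions would be instantiated job instances and the dict would be a hash file.

Code:

# Illustration only: partitioned extraction to flat files plus a local
# in-memory reference, i.e. rough analogues of job instantiation and
# hash file lookups.
import pyodbc

PARTITIONS = 4  # degree of parallelism your server can handle

def extract_partition(p):
    """Extract one subset of the source table to its own flat file."""
    conn = pyodbc.connect("DSN=SRC")  # hypothetical DSN
    cur = conn.cursor()
    cur.execute(
        "SELECT txn_id, cust_id, amount "
        "FROM daily_txn WHERE MOD(txn_id, ?) = ?",
        PARTITIONS, p)
    with open(f"txn_{p}.dat", "w") as out:
        for txn_id, cust_id, amount in cur:
            out.write(f"{txn_id}|{cust_id}|{amount}\n")
    conn.close()

# In DataStage this would be N instances of the same extract job running
# in parallel; here the partitions simply run one after another.
for p in range(PARTITIONS):
    extract_partition(p)

# Collate on the back side: concatenate the partition files.
with open("txn_all.dat", "w") as out:
    for p in range(PARTITIONS):
        with open(f"txn_{p}.dat") as part:
            out.write(part.read())

# Local reference construct: load the lookup once into memory (a dict
# here, a hash file in a Server job) so no row pays a database round trip.
conn = pyodbc.connect("DSN=SRC")
cust_name = dict(conn.cursor().execute("SELECT cust_id, name FROM customer"))
conn.close()

with open("txn_all.dat") as src, open("txn_enriched.dat", "w") as out:
    for line in src:
        txn_id, cust_id, amount = line.rstrip("\n").split("|")
        name = cust_name.get(int(cust_id), "UNKNOWN")
        out.write(f"{txn_id}|{cust_id}|{name}|{amount}\n")
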
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

There are some simple test jobs you can create to determine the maximum capacity of your system.

Job 1 tests read performance from a sequential file. This is the fastest possible job; anything else will run more slowly. The Transformer stage has an output constraint of @FALSE, so the job discards everything it reads.

Code:

SeqFile  ---->  Transformer  ----->  /dev/null

Job 2 tests how fast you can select from your database. The stage type is whatever you want to use (for example, ODBC). Again, the job discards everything it reads.

Code:

ODBC  ---->  Transformer  ----->  /dev/null

Job 3 tests the speed of writing to a sequential file. You already know how fast you can read from a sequential file; any difference must be attributed to writing. The output link from the Transformer stage is unconstrained.

Code:

SeqFile  ---->  Transformer  ----->  SeqFile

You can develop further tests. The more processing you add, the less throughput you'll get.
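
If you want to try the same ceiling measurements outside DataStage, here is a rough Python sketch of the three tests. It assumes a pyodbc DSN named "SRC" and any large delimited flat file ("txn_all.dat" here); the table and file names are invented.

Code:

# Illustration only: crude rows-per-second analogues of the three baseline jobs.
import time
import pyodbc

def job1_read_and_discard(path):
    """Job 1: sequential file -> Transformer -> /dev/null (read ceiling)."""
    start, rows = time.time(), 0
    with open(path) as f:
        for _ in f:
            rows += 1
    return rows / (time.time() - start)

def job2_select_and_discard(sql):
    """Job 2: database SELECT -> discard (extract ceiling)."""
    conn = pyodbc.connect("DSN=SRC")  # hypothetical DSN
    cur = conn.cursor()
    cur.execute(sql)
    start, rows = time.time(), 0
    while True:
        batch = cur.fetchmany(10000)
        if not batch:
            break
        rows += len(batch)
    conn.close()
    return rows / (time.time() - start)

def job3_read_and_rewrite(src, dst):
    """Job 3: sequential file -> sequential file (Job 3 vs Job 1 = write cost)."""
    start, rows = time.time(), 0
    with open(src) as f, open(dst, "w") as out:
        for line in f:
            out.write(line)
            rows += 1
    return rows / (time.time() - start)

print("Job 1 rows/sec:", job1_read_and_discard("txn_all.dat"))
print("Job 2 rows/sec:", job2_select_and_discard("SELECT * FROM daily_txn"))
print("Job 3 rows/sec:", job3_read_and_rewrite("txn_all.dat", "txn_copy.dat"))
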
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
yiminghu
Participant
Posts: 91
Joined: Fri Nov 14, 2003 10:09 am

Post by yiminghu »

[quote="kcbland"]If you have a job that reads a source table, does OCI/ODBC based references, and writes to a target table, then you have the simplest job design but also the slowest one possible.

Hi Ken,

We do have a staging area in our DW, but we use database tables instead of sequential files. Do you mean that using sequential files as the staging medium will expedite the process?

Thanks,

Yiming
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Follow Ray's advice and build those jobs. It's soooo enlightening to see what is really going on.

If you write a job that selects from a table and spools it to a sequential file, you are essentially measuring the spool rate of the database plus the network overhead. Writing to a sequential file is negligible. Pulling data out of tables is dictated by the bandwidth of the connection and the load on the source database.

Loading data into tables is SLOW!!!! You have referential integrity, data typing, extent allocation, etc. going on. This is all overhead.
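
As a rough illustration of both points, the Python sketch below times a spool to a flat file against row-by-row INSERTs of the same data. It assumes a pyodbc DSN named "SRC", an invented source table daily_txn, and an invented scratch table txn_copy; a real load would use the database's bulk loader instead.

Code:

# Illustration only: spool-out versus row-by-row load-back timings.
import time
import pyodbc

conn = pyodbc.connect("DSN=SRC")  # hypothetical DSN
cur = conn.cursor()

# Spool out: mostly network bandwidth plus source database load.
start = time.time()
cur.execute("SELECT txn_id, cust_id, amount FROM daily_txn")
with open("spool.dat", "w") as out:
    for row in cur:
        out.write("|".join(str(col) for col in row) + "\n")
print("spool seconds:", time.time() - start)

# Load back in: every row pays for constraints, indexes, logging and so on.
start = time.time()
with open("spool.dat") as src:
    for line in src:
        txn_id, cust_id, amount = line.rstrip("\n").split("|")
        cur.execute(
            "INSERT INTO txn_copy (txn_id, cust_id, amount) VALUES (?, ?, ?)",
            int(txn_id), int(cust_id), float(amount))
conn.commit()
print("row-by-row insert seconds:", time.time() - start)
conn.close()
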

OPINION:

Most users of DataStage blame performance issues on the tool rather than on how database technology is applied in their job design.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ariear
Participant
Posts: 237
Joined: Thu Dec 26, 2002 2:19 pm

Post by ariear »

[quote="yiminghu] Do you mean using sequential file as media of staging table will expediate the process?

Thanks,

Yiming[/quote]

The nature of a staging area is transient, so why use a persistent medium like an RDBMS, with all the overhead that Ken described? And what about its resources, such as the permanent buffer cache, which could instead be used for caching hash files? Usually there is no need for a local RDBMS on the ETL tier (only in cases of massive "between" lookups that UniVerse is too slow for, and some other rare cases).
Everything an RDBMS can do, DataStage can do, and faster! (Talking ETL, of course.) I'm an Ascential partner and have beaten INFA 17 times in a row! As Ray said: full table scan ---> Seq files/Hash files ---> Transformations ---> Bulk load.
They're still doing transformations in SQL!
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

ariear wrote: Full table scan ---> Seq files/Hash files ---> Transformations ---> Bulk load.

Don't forget that a partial table scan may be even better. Only SELECT the rows and columns that you need to process. Restricting the rows will be assisted by indexes (of course); restricting the columns is simply a matter of specifying only the columns with which you want to work.

:idea: The fastest way to process data is not to do any unnecessary processing!
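
A minimal sketch of the difference, with invented table and column names (the date predicate assumes an indexed load_date column, and exact syntax varies by database):

Code:

# Illustration only: a full scan versus a restricted extract.
full_scan = "SELECT * FROM daily_txn"

restricted = (
    "SELECT txn_id, cust_id, amount "   # only the columns you transform
    "FROM daily_txn "
    "WHERE load_date = CURRENT_DATE"    # only today's rows, via the index
)
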
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

In the architecture models for data warehouses, there are 3-4 "database" instances that come into play:

1. Persistent staging database (PSD). This is the landing zone and induction site for source data. The data is not manipulated in any way; it is simply stored in raw form so that the original, unadulterated data is accessible. There is no normalization or integration; the point is that the data is not touched, just stored. Karolyn Duncan of TDWI suggests that this layer be normalized, but the problem is that you then end up having to scrub, suspend, and correct data here, and you have no option to go back, change the rules, and reload the data - it's gone from the sources.
2. Operational Data Store (ODS) - Optional. This is a highly normalized and integrated model to provide a 3NF corporate data model.
3. Enterprise Data Warehouse (EDW). This is the typical data warehouse. It can be a Star or Atomic model, depending on your flavor.
4. Data Marts. These are typically denormalized and/or Star models.

In a hub and spoke architecture, you typically do high-speed bulk loading of raw data into the PSD. You do the heavy lifting (cleansing, surrogate key assignment, and aggregation) into the EDW, sourcing from the PSD. The marts are fed by simple SQL processes from the EDW, as the marts are deployments of dimensions, aggregates, and further aggregates.

So, the "staging database" as I talk about is the "sandbox". During ETL between the PSD and the EDW, you will use sequential staging files BECAUSE ITS PART OF A LARGER PHILOSOPHY I HAVE ALREADY WAXED ELOQUENT ABOUT. The hash files become a working model of the target so that you can achieve the goals laid down by Kimball for proper loading techniques (READ KIMBALL).
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle