Performance Issue in Datastage Server Edition

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

avi21st
Charter Member
Posts: 135
Joined: Thu May 26, 2005 10:21 am
Location: USA

Performance Issue in Datastage Server Edition

Post by avi21st »

Hi

We have a problem on our hands here.

We will be loading certain tables in an Oracle database from files coming from the mainframe. Our job structure looks like this:

Job 1:

Code: Select all

Infile -----Datastage-----> Staging Tables 
                  !
                  !
                  V
                Error Table
Job 2:

Code: Select all

Staging table -----Datastage-----> Tmp Tables 
                     !
                     !
                     V
                Error Table
Job 3:

Code: Select all

Write an Oracle MERGE to load the final table, wrapped in a Unix shell script
Note: we run the jobs via a DataStage sequence.
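
Roughly, the MERGE in Job 3 looks like the sketch below (the table, key and column names here are only placeholders, not our real objects):

Code: Select all

-- Illustrative sketch only: FINAL_TABLE, TMP_TABLE and the columns are placeholders.
MERGE INTO final_table f
USING tmp_table t
   ON (f.business_key = t.business_key)
WHEN MATCHED THEN
  UPDATE SET f.col_1 = t.col_1,
             f.col_2 = t.col_2
WHEN NOT MATCHED THEN
  INSERT (business_key, col_1, col_2)
  VALUES (t.business_key, t.col_1, t.col_2);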


The error conditions we are trapping are:
invalid dates, NULLs in NOT NULL fields, invalid numbers, and non-numeric characters in numeric fields.

So the design of my Datastage job is:

Code: Select all

Infile -----Transformer-----> Staging Tables 
                  !
                  !
                  V
                Error Table
We do all the field validation and transformation in the Transformer: we apply our own routines and pre-defined transforms, load all of the input fields into stage variables, and assign an error code depending on the field. If the error code is null we load the main table; otherwise we load the error table.

SO OUR PROBLEMS ARE:

1. We have tables with columns ranging from 20 to 200. Loading all of them into staging and then sequentially checking each row with the routines and transforms would be very slow.

2. The loads will range from 5 to 30 million rows. Can DataStage Server handle such a volume?

3. It would be nice if we could drop the invalid records before loading the staging table, and then work on that data with the routines to check and transform it to our required values (such as null to a default value, or invalid dates replaced by the system date). We have seen that out of a 20 million row load perhaps only 10,000 records are invalid.

Please suggest how to tackle this problem, so that we can do all the field validation in DataStage itself without processing so many records and losing valuable time in the Transformer's stage variables.

Is there any other way? It would be great if someone could help.
Avishek Mukherjee
Data Integration Architect
Chicago, IL, USA.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Try using staging files rather than staging tables. You'll be amazed how much difference it makes.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
avi21st
Charter Member
Posts: 135
Joined: Thu May 26, 2005 10:21 am
Location: USA

Post by avi21st »

ray.wurlod wrote:Try using staging files rather than staging tables. You'll be amazed how much difference it makes.

Hi Ray

Could you please explain the process?

Do you mean working in the file system, i.e. manipulating the data as files?

I basically need the idea so that I can put together a proof of concept on this.

Please help.
Avishek Mukherjee
Data Integration Architect
Chicago, IL, USA.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Not much to explain. Simply replace your staging tables with sequential files: swap your OCI stages for Sequential File stages. Landing the data in flat files instead of loading it into staging tables cuts out all the overhead of the database, which it seems you don't really need.

Now, if the 'Staging' data is joined to other database tables as part of its processing on the way to the 'Tmp' tables, that's another ballgame. However, if all you are doing is putting the data there just to turn around and select it right back out again, what's the point? :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
avi21st
Charter Member
Posts: 135
Joined: Thu May 26, 2005 10:21 am
Location: USA

Post by avi21st »

chulett wrote:Not much to explain. Simply replace your staging tables with sequential files: swap your OCI stages for Sequential File stages. Landing the data in flat files instead of loading it into staging tables cuts out all the overhead of the database, which it seems you don't really need.

Now, if the 'Staging' data is joined to other database tables as part of its processing on the way to the 'Tmp' tables, that's another ballgame. However, if all you are doing is putting the data there just to turn around and select it right back out again, what's the point? :wink:
But we have to go through those transformations in the stage variables in the Transformer. Say we have 60 columns and 40 million rows: can DataStage Server Edition jobs handle such a volume?

We are basically putting the data into staging for QA audits and also to join with other dimension tables to pick up the foreign keys. Another reason I used staging tables was to get rid of hashed files: there is always a scalability concern with hashed files, and the tables we look up against are monster dimensions, so we can't do without it.

My main question is: what volume of data and processing logic can server jobs handle?
And in my case, how do I improve the performance?



Thanks for the help in advance.
Avishek Mukherjee
Data Integration Architect
Chicago, IL, USA.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Appending a line to a text file is WAY faster than upserting a row into a database table, and WAY WAY faster if that table has constraints to be checked and/or indices to be maintained. Hence my suggestion to use files for staging. (You can also use these as data files for bulk loading.)
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
avi21st
Charter Member
Posts: 135
Joined: Thu May 26, 2005 10:21 am
Location: USA

Post by avi21st »

ray.wurlod wrote:Appending a line to a text file is WAY faster than upserting a row into a database table, and WAY WAY faster if that table has constraints to be checked and/or indices to be maintained. Hence my suggestion to use files for staging. (You can also use these as data files for bulk loading.)
But Ray, I am basically using the staging table to join with other dimensions. For example, I have Account_Num in my staging table and I join with the Account table to get the corresponding Account_ID, and then I load the final table.
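
In SQL terms, the join I mean is roughly the sketch below (the final table name and the non-key columns are just illustrative):

Code: Select all

-- Illustrative sketch only: FINAL_FACT and the non-key columns are placeholders.
INSERT INTO final_fact (account_id, txn_amount, txn_date)
SELECT a.account_id,
       s.txn_amount,
       s.txn_date
  FROM staging_table s
  JOIN account a
    ON a.account_num = s.account_num;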


So I cannot do without staging tables.

Please suggest some alternatives if possible.
Thanks in advance.
Avishek Mukherjee
Data Integration Architect
Chicago, IL, USA.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Rubbish. You can perform a lookup using the text file as stream and the table (or a hashed file containing appropriate columns from it) to feed the reference input. You don't have to join.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
avi21st
Charter Member
Posts: 135
Joined: Thu May 26, 2005 10:21 am
Location: USA

Post by avi21st »

ray.wurlod wrote:Rubbish. You can perform a lookup using the text file as stream and the table (or a hashed file containing appropriate columns from it) to feed the reference input. You don't have to join.
OK... I felt that a dimension like Account would be too big to look up through a hashed file.

Also, those tables have incremental loads, so the data volume keeps increasing.

So I decided on staging tables.

We will try both approaches as a POC if we get production data soon.

Thanks
Avishek Mukherjee
Data Integration Architect
Chicago, IL, USA.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The hashed file needs only two columns (Account_Num and Account_ID), so it's not going to be that large even with a few million rows.
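
In other words, it can be built from a reference extract as simple as the query below (assuming the source dimension table is called ACCOUNT with those two columns):

Code: Select all

-- Illustrative extract for populating the reference hashed file.
SELECT account_num, account_id
  FROM account;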
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

avi21st wrote:There is always a scalability concern with hashed files, and the tables we look up against are monster dimensions, so we can't do without it.
As a DataStage Consultant, you need to have more faith in hashed files. Ken Bland has a great writeup on them in one of the recent Newsletters, well worth the read.
-craig

"You can never have too many knives" -- Logan Nine Fingers
avi21st
Charter Member
Posts: 135
Joined: Thu May 26, 2005 10:21 am
Location: USA

Post by avi21st »

ray.wurlod wrote:The hashed file needs only two columns (Account_Num and Account_ID), so it's not going to be that large even with a few million rows.
Yes, I have faith in hashed files for a single-key lookup like (Account_Num, Account_ID).

But most of our jobs have three or four lookups against dimensions before loading the target table.

So should we use three hashed-file lookups in one job, or combine them into a single hashed file and look up on different keys?

Please suggest. Ray, I think you remember my previous post: this project uses DataStage sequencers to schedule the jobs.

So you understand it is tough for anyone to make suggestions to the technical architects; they have done two projects like this before.

But I am still trying to prove the concept to them.

Thanks for your support.
Avishek Mukherjee
Data Integration Architect
Chicago, IL, USA.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Three hashed files.

The design canvas looks cleaner, and makes it more obvious what's happening. And you don't have to stuff around with artificial key columns and computing same.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia

Post by rleishman »

At the risk of being lynched, I'll throw in my 2c:

First, I agree entirely with Ray's suggested approach. It is proven and scalable - it most certainly CAN be done with Hashed files.

However, the OP's approach is also valid and scalable. I know because that is the way I built my current DW. But it's not as straightforward to get it right.

You can use a hybrid approach with Externally Organised Tables: all the speed of a flat file with the convenience of a table (no extra calories and not an ounce of fat :) ).
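
For example, something along these lines (the directory object, file name, delimiter and columns are all just illustrative; the DATE column assumes the file's date format matches your NLS_DATE_FORMAT):

Code: Select all

-- Illustrative sketch only: names, delimiter and columns are placeholders.
-- Assumes: CREATE DIRECTORY stage_dir AS '/data/stage';
CREATE TABLE stg_accounts_ext (
  account_num VARCHAR2(20),
  txn_amount  NUMBER,
  txn_date    DATE
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY stage_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY '|'
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('stg_accounts.dat')
)
REJECT LIMIT UNLIMITED;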

If you are going to use table joins on Oracle instead of Hashed Files for dimension lookups, then you must be careful of BIG dimensions (say, >50,000 keys). When you join a big Staging table to many small dimensions, Oracle can hash the dimensions internally and perform the joins in much the same way that DS would.

BUT big dimensions are a problem. Oracle cannot hash them into memory, so it must partition both the Staging table and the big dimension into temp space (big disk IO).

The solution is to store your big dimension tables in Oracle Hash Clusters. You may have to supply hints in the SQL to force them to be used (/*+ HASH(tab) USE_NL(tab) */). The hash cluster essentially pre-hashes the dimension table ready for the join. You can join it to a staging table of ANY size and it will start returning rows immediately without the performance overhead of using indexes.
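
A sketch of the shape of it (names and sizing are illustrative; in real life you would derive HASHKEYS and SIZE from the actual dimension):

Code: Select all

-- Illustrative sketch only: names and sizing are placeholders.
CREATE CLUSTER account_hash_clu (account_num VARCHAR2(20))
  SIZE 100 SINGLE TABLE HASHKEYS 4000000;

CREATE TABLE account_dim (
  account_num  VARCHAR2(20),
  account_id   NUMBER,
  account_name VARCHAR2(100)
)
CLUSTER account_hash_clu (account_num);

-- Hint the lookup join if the optimiser won't use the cluster on its own:
SELECT /*+ HASH(a) USE_NL(a) */ a.account_id, s.*
  FROM staging_table s, account_dim a
 WHERE a.account_num = s.account_num;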

This must only be used for dimensions. Never cluster a Fact table. Never join a staging table to a Fact table (if you can possibly avoid it).
avi21st wrote:Write an Oracle MERGE to load the final table, wrapped in a Unix shell script
Be careful of MERGE:
- There are bugs associated with MERGE /*+APPEND*/ into base tables of fast-refresh MVs
- MERGE /*+ APPEND*/ does not perform direct path load if you have Foreign Keys enabled.
- MERGE /*+ APPEND */ is not scalable, as index changes are written to the rollback segments. You have to make the indexes unusable first, but that means they must be rebuilt entirely after the load.

I prefer to transform to a flat file, use Externally Organised Tables to DELETE the rows that already exist from the target, then use SQL*Loader to load the entire file in Direct Path mode. It bypasses all of these problems.
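
In outline, carrying on with the illustrative names from the external-table sketch above (the sqlldr call shows only the shape of the command, not a tested invocation):

Code: Select all

-- Illustrative sketch only: table, column and file names are placeholders.
-- 1. Delete the target rows that are about to be reloaded, driving off the external table.
DELETE FROM final_fact f
 WHERE EXISTS (SELECT 1
                 FROM stg_accounts_ext e
                WHERE e.account_num = f.account_num
                  AND e.txn_date    = f.txn_date);

-- 2. Then bulk load the whole flat file with SQL*Loader in direct path mode, e.g.
--    sqlldr userid=... control=stg_accounts.ctl data=stg_accounts.dat direct=true
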
avi21st wrote:We do all the field validation and transformation in the Transformer: we apply our own routines and pre-defined transforms
My only other advice is to ditch the validation routines in favour of transforms (and possibly many stage variables to perform long-winded logic). I had all my validation done in routines; I changed them all to transforms and stage variables and performance went from 100 rows/sec to 500 rows/sec (on very crap hardware).

Using all these techniques on our Prod boxes - nasty little 2-proc Linux boxes (i.e. souped-up PCs) - we now get 5000+ rows/sec.
Ross Leishman