Performance Improvement Needed

nagarjuna · Post by **nagarjuna** » Sun Jun 13, 2010 8:36 pm

Hi Everyone ,

I have an issue with loading of one of the table . Here are the details

a) Incoming file needs to be loaded into staging

b) From staging into Warehouse

c) Daily load : 20 million records ( delta )

d) Needs to do type1 updates .

c)Unloading the data from warehouse based on the records present in staging .This unload is needed to identify if a record is update or insert .I am separating inserts and updates .

d) Unload query :

select col1 , col2 , col3 , col4 from warehouse table where
( key1 , key2 , key3 ) in ( select col1 , col2 , col3 from staging table )

e) Have unique index on key1 ,key2 , key3

d) After unloading , the unload dataset will be joined with staging table data and insert file , update file will be produced .

e) Requirement is such a way that we CAN ignore updates .

The Unload part of the above process is taking lot of time ( 45-60 min) . I have tried sparse lookup also but its taking more time . Are there any other options to acheive this results in a better time .

Currentli there are 300 million records in warehouse .Incoming data is 20 million per day .

Thanks in advance .

chulett · Post by **chulett** » Sun Jun 13, 2010 9:21 pm

Join the two tables directly in the 'unload' query, that giant 'in' list is about the worst way to handle volumes of that size.

nagarjuna · Post by **nagarjuna** » Sun Jun 13, 2010 9:27 pm

Craig ,

Thanks for the response . i will try that option but I dont have indexes on that table in staging though i have in Warehouse table . Will it not impact performance ?

Thanks
Nag

chulett · Post by **chulett** » Sun Jun 13, 2010 10:18 pm

Try, let us know. What database?

nagarjuna · Post by **nagarjuna** » Mon Jun 14, 2010 8:05 am

Oracle Database .

Sainath.Srinivasan · Post by **Sainath.Srinivasan** » Mon Jun 14, 2010 8:16 am

Matching in Database will be best approach than extracting to DataStage.

Main reason is that
1.) it contains a unique index to prevent a full table scan
2.) avoids unwanted network to pull data into DS server
3.) avoids any sorting prior to join

Did you gather the statistics recently for your warehouse table ?

Did you try including hints in the SQL ?

What is the propotion of data you expect to go in the update stream ? If that is small, you can set the update as a dummy SQL and let that be handled in the target.

ds_teg · Post by **ds_teg** » Mon Jun 14, 2010 8:22 am

Sainath.Srinivasan wrote:Matching in Database will be best approach than extracting to DataStage.

Main reason is that
1.) it contains a unique index to prevent a full table scan
2.) avoids unwanted network to pull data into DS server
3.) avoids any sorting prior to join

Did you gather the statistics recently for your warehouse table ?

Did you try including hints in the SQL ?

What is the propotion of data you expect to go in the update stream ? If that is small, you can set the update as a dummy SQL and let that be handled in the target.

Yes , I have gathered the table stats after every load ( every day )
I have not tried hints.The portion of updates that are going is nearly 7-10% of the incoming records .
Are you suggesting to use upsert and reject the updates in the reject file ?
Could you please be more elaborate ?

chulett · Post by **chulett** » Mon Jun 14, 2010 8:50 am

You have yourself a plain old ordinary sql query optimization task which really doesn't have much to do with DataStage until it starts returning records, at which time your Array Size setting with take affect. Other than that, it's all about explain plans, indexing, hash joins versus nested loops, sql changes and perhaps hints to make the select as efficient as possible. Work with your DBA if you haven't worked all that much with explain plans before.