
Performance Improvement Needed

Posted: Sun Jun 13, 2010 8:36 pm
by nagarjuna
Hi everyone,

I have an issue with the load of one of our tables. Here are the details:

a) Incoming file needs to be loaded into staging.

b) From staging into the warehouse.

c) Daily load: 20 million records (delta).

d) Needs to do Type 1 updates.

e) Unloading the data from the warehouse based on the records present in staging. This unload is needed to identify whether a record is an update or an insert. I am separating inserts and updates.

f) Unload query:

select col1, col2, col3, col4 from warehouse table where
( key1, key2, key3 ) in ( select col1, col2, col3 from staging table )

g) Have a unique index on key1, key2, key3.

h) After unloading, the unload dataset will be joined with the staging table data, and an insert file and an update file will be produced.

i) The requirement is such that we CAN ignore updates.

The unload part of the above process is taking a lot of time (45-60 min). I have also tried a sparse lookup, but it takes even more time. Are there any other options to achieve this result in a better time?

Currently there are 300 million records in the warehouse. Incoming data is 20 million per day.

Thanks in advance.

Posted: Sun Jun 13, 2010 9:21 pm
by chulett
Join the two tables directly in the 'unload' query; that giant 'in' subquery is about the worst way to handle volumes of that size.
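
For illustration, a direct-join version of the unload might look like this (a sketch only — the table and column names follow the original post's query, and the mapping of staging col1-col3 to warehouse key1-key3 is taken from that query; adjust to the real schema):

```sql
-- Hypothetical rewrite of the unload as a direct join
-- instead of an IN subquery
select w.col1, w.col2, w.col3, w.col4
from   warehouse_table w
       inner join staging_table s
       on  w.key1 = s.col1
       and w.key2 = s.col2
       and w.key3 = s.col3;
```

With fresh statistics, this form makes it easy for the optimizer to pick a hash join against the 20-million-row staging set rather than probing row by row.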

Posted: Sun Jun 13, 2010 9:27 pm
by nagarjuna
Craig,

Thanks for the response. I will try that option, but I don't have indexes on the staging table, though I have them on the warehouse table. Will that not impact performance?

Thanks
Nag

Posted: Sun Jun 13, 2010 10:18 pm
by chulett
Try, let us know. What database?

Posted: Mon Jun 14, 2010 8:05 am
by nagarjuna
Oracle Database.

Posted: Mon Jun 14, 2010 8:16 am
by Sainath.Srinivasan
Matching in the database will be a better approach than extracting to DataStage.

The main reasons are that it
1.) uses the unique index to prevent a full table scan
2.) avoids unwanted network traffic to pull data into the DS server
3.) avoids any sorting prior to the join

Did you gather the statistics recently for your warehouse table?

Did you try including hints in the SQL?

What proportion of the data do you expect to go into the update stream? If that is small, you can set the update as a dummy SQL and let that be handled in the target.
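
As an illustration of the hint suggestion above, a hinted join might look like the following. LEADING and USE_HASH are standard Oracle optimizer hints; the table names and aliases are assumptions carried over from the earlier query:

```sql
-- Hypothetical hinted version: ask the optimizer to drive from the
-- staging set and hash-join into the warehouse table
select /*+ LEADING(s) USE_HASH(w) */
       w.col1, w.col2, w.col3, w.col4
from   staging_table s
       inner join warehouse_table w
       on  w.key1 = s.col1
       and w.key2 = s.col2
       and w.key3 = s.col3;
```

Whether a hint helps at all depends on the plan the optimizer currently chooses, so check the explain plan before and after.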

Posted: Mon Jun 14, 2010 8:22 am
by ds_teg
Sainath.Srinivasan wrote:Did you gather the statistics recently for your warehouse table? Did you try including hints in the SQL? What is the proportion of data you expect to go in the update stream?

Yes, I gather the table stats after every load (every day).
I have not tried hints. The portion of updates is nearly 7-10% of the incoming records.
Are you suggesting to use an upsert and reject the updates into a reject file?
Could you please elaborate?
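
Since the original requirement says updates CAN be ignored, one database-side option is an insert-only MERGE that does nothing on a key match. This is a sketch under the same hypothetical table and column names used earlier in the thread; it requires an Oracle version (10g or later) where the WHEN MATCHED clause may be omitted:

```sql
-- Hypothetical insert-only MERGE: new keys are inserted,
-- existing keys are left untouched (updates skipped, per the
-- "we CAN ignore updates" requirement)
merge into warehouse_table w
using staging_table s
on (    w.key1 = s.col1
    and w.key2 = s.col2
    and w.key3 = s.col3 )
when not matched then
  insert (key1, key2, key3, col4)
  values (s.col1, s.col2, s.col3, s.col4);
```

This removes the separate unload/compare step entirely, at the cost of doing the insert/update split inside the database instead of in DataStage.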

Posted: Mon Jun 14, 2010 8:50 am
by chulett
You have yourself a plain old ordinary SQL query optimization task which really doesn't have much to do with DataStage until it starts returning records, at which time your Array Size setting will take effect. Other than that, it's all about explain plans, indexing, hash joins versus nested loops, SQL changes and perhaps hints to make the select as efficient as possible. Work with your DBA if you haven't worked all that much with explain plans before.
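
A minimal way to see the plan for the unload query is Oracle's EXPLAIN PLAN plus DBMS_XPLAN (standard Oracle syntax; the query text below is the assumed unload select from earlier in the thread):

```sql
-- Generate the execution plan for the unload query...
explain plan for
select col1, col2, col3, col4 from warehouse_table
where (key1, key2, key3) in
      (select col1, col2, col3 from staging_table);

-- ...then display it
select * from table(dbms_xplan.display);
```

Look for whether the plan shows a hash join / semi-join against staging or a nested-loop probe per staging row; the latter is the usual culprit at these volumes.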