efficient way to load millions of records in PX

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

rajpatel
Participant
Posts: 14
Joined: Fri Jul 01, 2005 8:31 am

efficient way to load millions of records in PX

Post by rajpatel »

Hi,
I have hundreds of millions of records in table "A", and daily maintenance for this table arrives as a text file. The text file also contains millions of records every day. These records are for insert/update, and the upsert data can fall anywhere in the table's history.

What is the most efficient way to load this text file using a DataStage PX job?

thanks,
RP
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Separate the inserts and updates with a Change Capture stage and process the inserts and updates separately.
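Very roughly, something like this if you drive it from osh instead of the Designer canvas. The dataset names and the key column CUST_ID are just placeholders, and the changecapture/filter options are quoted from memory, so check them against the Parallel Job Advanced Developer's Guide for your release:

#!/bin/sh
# Sketch only: compare yesterday's extract of table A ("before") with the daily
# file ("after") and land inserts and edits in separate datasets. Both inputs are
# assumed to be hash-partitioned and sorted on the key (add hash/tsort upstream
# if they are not). Default change codes: 0=copy, 1=insert, 2=delete, 3=edit.
osh "
changecapture -key CUST_ID -allvalues
  < before_extract.ds
  < daily_after.ds
  > changes.v;
filter -where 'change_code = 1'
       -where 'change_code = 3'
  < changes.v
  > inserts.ds
  > updates.ds
"

inserts.ds then goes to a bulk load target and updates.ds to an update/upsert target.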
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
rajpatel
Participant
Posts: 14
Joined: Fri Jul 01, 2005 8:31 am

Post by rajpatel »

Is there any other method? CDC is not that fast.
Andet
Charter Member
Posts: 63
Joined: Mon Nov 01, 2004 9:40 am
Location: Clayton, MO

Post by Andet »

If it's UDB, you can use LOAD for the inserts.
Otherwise, your best bet is a custom operator, either through DataStage or via command line/scripted osh.
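If you do go the command line/scripted osh route, the runtime wrapper is only a couple of lines of shell. The paths below are made up, and the -f option should be checked against your osh version:

#!/bin/sh
# Run a hand-written osh script under a specific parallel configuration file.
export APT_CONFIG_FILE=/opt/ds/configs/node8.apt    # assumed config file location
osh -f /opt/ds/scripts/split_daily_upsert.osh       # assumed osh script location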

Ande
sud
Premium Member
Posts: 366
Joined: Fri Dec 02, 2005 5:00 am
Location: Here I Am

Post by sud »

rajpatel wrote: Is there any other method? CDC is not that fast.
That's true, but look at the trade-off: not all 100 million records will have changed. If you do a blind update, it will take far more time than the Change Capture stage takes to prune out the unchanged records. And as DSguru suggested, use the bulk loader for the insert records while using update for the changed records. :)
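To illustrate the trade-off: once change capture has pruned the file down to genuine edits, the update leg can be a single set-based statement against a staging table instead of an upsert of every incoming row. Everything below (connection, table and column names) is invented, and in a PX job the Oracle Enterprise stage would normally issue the update for you:

#!/bin/sh
# Sketch: apply only the changed rows, landed in staging table A_UPDATES, to table A.
sqlplus -s scott/tiger@orcl <<'SQL'
UPDATE a tgt
SET (amount, status, updated_date) =
    (SELECT src.amount, src.status, src.updated_date
     FROM   a_updates src
     WHERE  src.record_id = tgt.record_id)
WHERE EXISTS (SELECT 1
              FROM   a_updates src
              WHERE  src.record_id = tgt.record_id);
COMMIT;
EXIT
SQL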
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

In any case, an incremental load against 100 million records is going to hit performance problems if there are not enough processors available.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The nice thing about PX is that you can simply throw more processing nodes at it. Unless there are specific order issues in the text file, you can use multiple readers (each reading a portion of the file) or pre-process the file to split it into multiple files. Research the Sequential File stage properties.
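If record order doesn't matter, a pre-split is just one command; the file name and chunk size below are arbitrary, and the in-stage alternative is the Number Of Readers Per Node property:

#!/bin/sh
# Split the daily maintenance file into line-aligned chunks that separate
# readers (or separate job invocations) can process in parallel.
split -l 5000000 daily_maintenance.txt daily_part_
ls daily_part_*     # daily_part_aa, daily_part_ab, ...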
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
yakiku
Premium Member
Posts: 23
Joined: Thu May 13, 2004 7:14 am

Post by yakiku »

What is the database you are loading into?
rajpatel
Participant
Posts: 14
Joined: Fri Jul 01, 2005 8:31 am

Post by rajpatel »

Oracle database
thompsonp
Premium Member
Posts: 205
Joined: Tue Mar 01, 2005 8:41 am

Post by thompsonp »

Is there anything in the text file that enables you to determine if it relates to an insert or an update? Perhaps a record id from a sequence for example.

I agree that you probably need to separate the inserts from the updates, but only if there are a significant number of inserts. You should be able to load inserts faster than updates by using the Load Append option (i.e. SQL*Loader) in the Oracle Enterprise stage. If there aren't many inserts this won't help much, as the updates will slow you down.
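Outside the Designer, the Load Append path boils down to a direct-path SQL*Loader run along these lines; the delimiter, table and column names are invented for the sketch:

#!/bin/sh
# Sketch: direct-path append of the insert stream. skip_index_maintenance leaves the
# touched index partitions UNUSABLE, to be rebuilt afterwards (see further down).
cat > inserts.ctl <<'EOF'
LOAD DATA
INFILE 'inserts.txt'
APPEND
INTO TABLE a
FIELDS TERMINATED BY '|'
(record_id, amount, status, updated_date DATE 'YYYY-MM-DD')
EOF

sqlldr userid=scott/tiger@orcl control=inserts.ctl log=inserts.log \
      direct=true skip_index_maintenance=true errors=0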

Is the table you are updating indexed on just the primary key or is it heavily indexed on other fields as well?

If you end up splitting the data and loading the inserts you will have to decide how to maintain the indexes. The load append will require the indexes to be rebuilt. If you are doing this you will want to minimise the number of partitions you rebuild indexes on by determining which ones have been loaded (I assume your Oracle table with 100's of millions of rows is partitioned).

Of course until the indexes are valid you can't apply the updates with an update statement.
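A script for that index step might look like the sketch below, assuming the direct load left the affected partitions UNUSABLE (credentials are invented; check the generated statements before running them):

#!/bin/sh
# Generate and run rebuilds for only the index partitions the load invalidated,
# so that the update statement afterwards has valid indexes to use.
sqlplus -s scott/tiger@orcl <<'SQL'
SET PAGESIZE 0 HEADING OFF FEEDBACK OFF LINESIZE 200 TRIMSPOOL ON
SPOOL rebuild_indexes.sql
SELECT 'ALTER INDEX ' || index_name || ' REBUILD PARTITION ' || partition_name || ';'
FROM   user_ind_partitions
WHERE  status = 'UNUSABLE';
SPOOL OFF
@rebuild_indexes.sql
EXIT
SQL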

Can you be a little more precise with regard to volumes of data in the table and the number of inserts / updates?