efficient way to load millions of records in PX

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

rajpatel
Participant
Posts: 14
Joined: Fri Jul 01, 2005 8:31 am

efficient way to load millions of records in PX

Post by rajpatel »

Hi,
I have hundreds of millions of records in table "A", and daily maintenance for this table arrives as a text file. The text file also contains millions of records every day. These records are for insert/update, and the upsert data can fall anywhere in the table's history.

What is the most efficient way to load this text file using a DataStage PX job?

thanks,
RP
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Separate the inserts and updates with a Change Capture stage and process the inserts and updates separately.
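Very roughly, something like this if you drive it from osh instead of the Designer canvas. The dataset names and the key column CUST_ID are just placeholders, and the changecapture/filter options are quoted from memory, so check them against the Parallel Job Advanced Developer's Guide for your release:

#!/bin/sh
# Sketch only: compare yesterday's extract of table A ("before") with the daily
# file ("after") and land inserts and edits in separate datasets. Both inputs are
# assumed to be hash-partitioned and sorted on the key (add hash/tsort upstream
# if they are not). Default change codes: 0=copy, 1=insert, 2=delete, 3=edit.
osh "
changecapture -key CUST_ID -allvalues
  < before_extract.ds
  < daily_after.ds
  > changes.v;
filter -where 'change_code = 1'
       -where 'change_code = 3'
  < changes.v
  > inserts.ds
  > updates.ds
"

inserts.ds then goes to a bulk load target and updates.ds to an update/upsert target.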
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
rajpatel
Participant
Posts: 14
Joined: Fri Jul 01, 2005 8:31 am

Post by rajpatel »

Is there any other method? CDC is not that fast.
Andet
Charter Member
Posts: 63
Joined: Mon Nov 01, 2004 9:40 am
Location: Clayton, MO

Post by Andet »

If it's UDB, you can use LOAD for the inserts.
Otherwise, your best bet is a custom operator, either through DataStage or via command line/scripted osh.
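If you do go the command line/scripted osh route, the runtime wrapper is only a couple of lines of shell. The paths below are made up, and the -f option should be checked against your osh version:

#!/bin/sh
# Run a hand-written osh script under a specific parallel configuration file.
export APT_CONFIG_FILE=/opt/ds/configs/node8.apt    # assumed config file location
osh -f /opt/ds/scripts/split_daily_upsert.osh       # assumed osh script location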

Ande
sud
Premium Member
Posts: 366
Joined: Fri Dec 02, 2005 5:00 am
Location: Here I Am

Post by sud »

rajpatel wrote: Is there any other method? CDC is not that fast.
That's true, but look at the trade-off: not all 100 million records will have changed. If you do a blind update, it will take far more time than the Change Capture stage takes to prune out the unchanged records. And as DSguru suggested, use the bulk loader for the insert records while using update for the changed records. :)
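To illustrate the trade-off: once change capture has pruned the file down to genuine edits, the update leg can be a single set-based statement against a staging table instead of an upsert of every incoming row. Everything below (connection, table and column names) is invented, and in a PX job the Oracle Enterprise stage would normally issue the update for you:

#!/bin/sh
# Sketch: apply only the changed rows, landed in staging table A_UPDATES, to table A.
sqlplus -s scott/tiger@orcl <<'SQL'
UPDATE a tgt
SET (amount, status, updated_date) =
    (SELECT src.amount, src.status, src.updated_date
     FROM   a_updates src
     WHERE  src.record_id = tgt.record_id)
WHERE EXISTS (SELECT 1
              FROM   a_updates src
              WHERE  src.record_id = tgt.record_id);
COMMIT;
EXIT
SQL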
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

In any case, an incremental load against 100 million records is going to hit performance problems if there are not enough processors available.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The nice thing about PX is that you can simply throw more processing nodes at it. Unless there are specific order issues in the text file, you can use multiple readers (each reading a portion of the file) or pre-process the file to split it into multiple files. Research the Sequential File stage properties.
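If record order doesn't matter, a pre-split is just one command; the file name and chunk size below are arbitrary, and the in-stage alternative is the Number Of Readers Per Node property:

#!/bin/sh
# Split the daily maintenance file into line-aligned chunks that separate
# readers (or separate job invocations) can process in parallel.
split -l 5000000 daily_maintenance.txt daily_part_
ls daily_part_*     # daily_part_aa, daily_part_ab, ...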
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
yakiku
Premium Member
Posts: 23
Joined: Thu May 13, 2004 7:14 am

Post by yakiku »

What is the database you are loading into?
rajpatel
Participant
Posts: 14
Joined: Fri Jul 01, 2005 8:31 am

Post by rajpatel »

Oracle database
thompsonp
Premium Member
Posts: 205
Joined: Tue Mar 01, 2005 8:41 am

Post by thompsonp »

Is there anything in the text file that enables you to determine if it relates to an insert or an update? Perhaps a record id from a sequence for example.

I agree that you probably need to separate the inserts from the updates, but only if there are a significant number of inserts. You should be able to load inserts faster than updates by using the Load Append option (i.e. SQL*Loader) in the Oracle Enterprise stage. If there aren't many inserts this won't help much, as the updates will slow you down.
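Outside the Designer, the Load Append path boils down to a direct-path SQL*Loader run along these lines; the delimiter, table and column names are invented for the sketch:

#!/bin/sh
# Sketch: direct-path append of the insert stream. skip_index_maintenance leaves the
# touched index partitions UNUSABLE, to be rebuilt afterwards (see further down).
cat > inserts.ctl <<'EOF'
LOAD DATA
INFILE 'inserts.txt'
APPEND
INTO TABLE a
FIELDS TERMINATED BY '|'
(record_id, amount, status, updated_date DATE 'YYYY-MM-DD')
EOF

sqlldr userid=scott/tiger@orcl control=inserts.ctl log=inserts.log \
      direct=true skip_index_maintenance=true errors=0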

Is the table you are updating indexed on just the primary key or is it heavily indexed on other fields as well?

If you end up splitting the data and loading the inserts you will have to decide how to maintain the indexes. The load append will require the indexes to be rebuilt. If you are doing this you will want to minimise the number of partitions you rebuild indexes on by determining which ones have been loaded (I assume your Oracle table with 100's of millions of rows is partitioned).

Of course until the indexes are valid you can't apply the updates with an update statement.
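A script for that index step might look like the sketch below, assuming the direct load left the affected partitions UNUSABLE (credentials are invented; check the generated statements before running them):

#!/bin/sh
# Generate and run rebuilds for only the index partitions the load invalidated,
# so that the update statement afterwards has valid indexes to use.
sqlplus -s scott/tiger@orcl <<'SQL'
SET PAGESIZE 0 HEADING OFF FEEDBACK OFF LINESIZE 200 TRIMSPOOL ON
SPOOL rebuild_indexes.sql
SELECT 'ALTER INDEX ' || index_name || ' REBUILD PARTITION ' || partition_name || ';'
FROM   user_ind_partitions
WHERE  status = 'UNUSABLE';
SPOOL OFF
@rebuild_indexes.sql
EXIT
SQL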

Can you be a little more precise with regard to volumes of data in the table and the number of inserts / updates?