Posted: Fri Jan 12, 2007 7:31 pm
You're moving data from one table to another within the same instance. The fastest way to do this is to avoid using DataStage entirely and let a parallel DML SQL statement do the data transfer.
DataStage generates a program which pulls the data out of the database and transfers it to its external process, which in turn opens a cursor and sends the data through the OCI connection. In short, the data leaves the database, passes through an outside process, and then re-enters the database. There is no parallelism whatsoever, unless your SELECT query has to build an elaborate result that may use parallel queries; but ultimately the data is collected into a single result set in a temp storage area for single-threaded spooling through the single-threaded OCI connection process.
The fastest method for moving data out of a database is parallel extraction processes, each taking a portion of the rows and streaming it out. By "partitioning", or separating the required data into subsets, and then streaming each subset out independently to a receiving process, you have done what is called "partitioned parallelism".
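To make the idea concrete, here is a minimal sketch of partitioned parallelism in Python. An in-memory list stands in for the source table, and a thread pool stands in for the independent extraction processes; the names (`rows`, `extract_partition`) are illustrative, not DataStage APIs.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the source table: 100 rows with an integer key
rows = [{"id": i, "val": f"row{i}"} for i in range(100)]
N_PARTITIONS = 7

def extract_partition(n):
    # Each "extractor" pulls only its own subset: MOD(id, 7) == n
    return [r for r in rows if r["id"] % N_PARTITIONS == n]

# Run the 7 extractions concurrently; each streams out its subset
# independently of the others.
with ThreadPoolExecutor(max_workers=N_PARTITIONS) as pool:
    subsets = list(pool.map(extract_partition, range(N_PARTITIONS)))

# The subsets are disjoint and together cover every source row.
extracted = [r for subset in subsets for r in subset]
```

The key property is that no coordination is needed between extractors: the MOD predicate guarantees every row lands in exactly one subset.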
The fastest method for moving data into a database is bulk loading. For Oracle, sqlldr with the DIRECT path option is your choice.
So, design a job that looks like this: OCI --> XFM --> SEQ (the filename has job parameter "N" in it). In the OCI stage, pick an integer column that you can use to somewhat evenly divide the data into subsets. Now, decide how many parallel occurrences of this job can execute simultaneously without overburdening your database, say 7. Place the expression MOD(yourcolumn, 7) = N - 1 in the WHERE clause in the OCI stage. Now run 7 instances of this job simultaneously (a Sequence would be good here, with 7 job activity stage icons) and set parameter N to 1 through 7.
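As a sanity check on the design above, this short sketch generates the per-instance WHERE clause and verifies that every possible key satisfies exactly one of the 7 predicates ("yourcolumn" is the placeholder from the post, not a real column name).

```python
def where_clause(n, instances=7, column="yourcolumn"):
    # Instance N (1..instances) gets rows where MOD(column, instances) = N - 1
    return f"MOD({column}, {instances}) = {n - 1}"

# One WHERE clause per job instance, N = 1 through 7
clauses = [where_clause(n) for n in range(1, 8)]

# Every integer key matches exactly one of the 7 predicates, so the
# subsets are disjoint and complete.
for key in range(1000):
    matches = sum(1 for n in range(1, 8) if key % 7 == n - 1)
    assert matches == 1
```

Because each instance's predicate is evaluated inside the database, each of the 7 cursors only ever sees its own seventh of the table.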
When you fire off the Sequence job, you'll see 7 parallel copies of the same job, each extracting a subset of the data. When all 7 instances complete, concatenate the 7 files into a single file and then bulk load it into the target table.
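The wrap-up step can be sketched as follows, assuming the 7 instances wrote files named extract_1.dat through extract_7.dat. The connect string and control file name in the sqlldr command line are placeholders; the command is built but deliberately not executed here.

```python
import os
import shlex
import tempfile

workdir = tempfile.mkdtemp()

# Stand-ins for the 7 per-instance extract files
for n in range(1, 8):
    with open(os.path.join(workdir, f"extract_{n}.dat"), "w") as f:
        f.write(f"rows from instance {n}\n")

# Concatenate the 7 extracts into a single load file
load_file = os.path.join(workdir, "load.dat")
with open(load_file, "w") as out:
    for n in range(1, 8):
        with open(os.path.join(workdir, f"extract_{n}.dat")) as part:
            out.write(part.read())

# Bulk load with the direct path option (direct=true); the userid and
# control file are hypothetical placeholders.
sqlldr_cmd = (
    f"sqlldr userid=scott/tiger control=load.ctl "
    f"data={shlex.quote(load_file)} direct=true"
)
```

In a real run you would hand sqlldr_cmd to the shell once all 7 instances have finished, since the direct path load wants a single pass over the data file.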
Of course, intra-instance DML will do all of this within the database for you. But this technique teaches you how to get massive volumes out in incremental pieces, dividing the labor and spreading I/O across multiple CPUs. You'll actually see your DataStage server resource usage climb, as the 7 processes use more resources in total.
By the way, this same technique is extremely powerful for jobs like this: SEQ --> XFM w/hash references --> XFM w/hash references --> XFM w/hash references --> SEQ. Your single-threaded job is now multi-processing (parallel transformation pipelines). By designing this way, we achieve massive parallelism.
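Pipeline parallelism like that can be sketched with queues standing in for the links between stages: each transformer runs in its own thread, so all stages work on different rows at the same time. Two transformer stages are shown here for brevity instead of the three in the job design above.

```python
import queue
import threading

def stage(transform, inq, outq):
    # One transformer stage: read rows from the inbound link, transform,
    # write to the outbound link. None is the end-of-data marker.
    while True:
        row = inq.get()
        if row is None:
            outq.put(None)
            return
        outq.put(transform(row))

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
t1 = threading.Thread(target=stage, args=(lambda r: r + 1, q1, q2))
t2 = threading.Thread(target=stage, args=(lambda r: r * 2, q2, q3))
t1.start()
t2.start()

for row in range(5):        # the source SEQ stage
    q1.put(row)
q1.put(None)

results = []
while (row := q3.get()) is not None:   # the target SEQ stage
    results.append(row)
t1.join()
t2.join()
# results == [2, 4, 6, 8, 10]
```

While row 3 is still in the first transformer, row 2 is already being doubled by the second: that overlap is exactly what the multi-process DataStage job buys you.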
In the PX world, a lot of this parallelism is done automagically for you, without using multiple job instances.