Loading 50 million records from Oracle9i to Oracle9i

peppy_perky
Charter Member
Posts: 6
Joined: Wed Jan 05, 2005 2:50 pm

Loading 50 million records from Oracle9i to Oracle9i

Post by peppy_perky »

Hi all,

I have about 50 million records in a source database on the same server where DataStage resides, and I need to migrate them to a target database on a different server. I would like to find out the fastest way to migrate this data. We are currently using the Oracle OCI stage, which is VERY slow.

There is a similar post here
http://dsxchange.com/viewtopic.php?t=86 ... 2edf245c69

but I can't figure out how to go about it.

Please give the best solution.

Thanks
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Write a multi-instance job that selects your data and spools to a text file. Use a parameter as part of the output text file name, and use a partitioning WHERE clause that will return a subset of your data. Run as many copies of the job as allowed to extract as much data in parallel as possible. Concatenate the output files together into a single large file.
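
For instance, a minimal sketch of the partitioned SELECT each instance could run (the job parameters and the numeric key column here are illustrative, not from the post; DataStage job parameters appear as #PARAM# in stage SQL):

Code:
-- each of the N instances runs the same query with its own instance number,
-- spooling its rows to a file such as extract_#INSTANCE_NUM#.txt
SELECT *
FROM   source_table
WHERE  MOD(numeric_key_col, #NUM_INSTANCES#) = #INSTANCE_NUM#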

Is your target table partitioned? You will now want to divide the single large file into as many files as you have partitions, each containing only that partition's data.

Do you have global indexes? Disable or drop them.

Use multiple sqlldr executions, each naming the partition to load, with the DIRECT path option.
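
For example, one such per-partition load might look roughly like this (the table, partition, file, and column names are made up):

Code:
-- load_p01.ctl: SQL*Loader control file for one partition
LOAD DATA
INFILE 'target_table_p01.dat'
APPEND
INTO TABLE target_table PARTITION (p01)
FIELDS TERMINATED BY '|'
(col1, col2, col3)

Each control file is then run in its own session, e.g. sqlldr userid=... control=load_p01.ctl direct=true, so the partitions load concurrently.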

Do you have global indexes? Once all the data is loaded, you can recreate/re-enable and rebuild them.
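
If global indexes are the blocker, the usual pattern is along these lines (the index name is hypothetical):

Code:
-- before the parallel partition loads
ALTER INDEX target_table_gix UNUSABLE;

-- after all partitions are loaded
ALTER INDEX target_table_gix REBUILD PARALLEL NOLOGGING;

Depending on your setup you may also need skip_unusable_indexes=true on the sqlldr command line while the index is unusable.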

If your target table is not partitioned, take the single large file and kick off sqlldr with the DIRECT path option.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
peppy_perky
Charter Member
Posts: 6
Joined: Wed Jan 05, 2005 2:50 pm

Loading 50 million rows

Post by peppy_perky »

Hey Ken
This was your reply to a previous question. Could you please look into this and explain a bit more about the 3rd option and insert into target (select * from source) ?

Thanks

kcbland wrote:
Well, DataStage Server jobs are single-threaded. This means that the SELECT query is prepared however the database does it, then a process spools the data to DataStage, which is handled in a single-threaded fashion. DS spools the data to a file, then initiates a sqlldr session to load it.

Compare that to a sql statement like:

Code:
insert into target (select * from source)


This statement, if you have parallel dml enabled, will use multiple parallel query slaves not only to select the data, but also to write the data. Since Oracle 9i uses DIRECT path by default on these types of statements, the data will automatically partition itself along the target table partitions, and optimize for maximum throughput.

So, to answer your question, when moving data within the same instance, the fastest means is to stay within the database. Involving any external process will involve significant overhead as well as logic.
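
In other words, something along these lines (the table names and degree of parallelism are illustrative), assuming parallel DML is allowed on the instance:

Code:
ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND PARALLEL(t, 8) */ INTO target t
SELECT /*+ PARALLEL(s, 8) */ *
FROM   source s;

COMMIT;

The final COMMIT matters: after a direct-path insert, the session cannot read or modify the table again until the transaction is committed.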


No - just DataStage and the ETL Oracle instance are on the same machine. The other systems we are getting the data from are located on other machines with a high-speed connection to the ETL server.

The first approach took 40 minutes for 1 million rows:
ORAOCI8 stage as input -> Transformer -> ORAOCI8 stage as output

The 2nd approach was not really faster: 40 minutes for 1 million rows
ORAOCI8 stage as input -> Transformer -> Bulk Loader stage as output, then loading the output file.

In fact the data file was loaded quickly, but it took just as long to
create the file ....

3rd approach (as I come from the Oracle world):
a dummy ORAOCI8 stage with an output link.
The query used to compile this piece was just a "select 1 from dummy".
Into the before-statement tab I put the SQL query:
"insert /*+ append */ into tablename
select x from tablename@source"
which was really fast - the same speed as when run from SQL*Plus:
just 2 minutes for 1 million rows with append (direct path, if NOLOGGING is set on the table) and
4 minutes without append (conventional)
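
Spelled out a bit more fully, that before-SQL amounts to something like the following (the table and database-link names are made up; the commit matters because a direct-path insert must be committed before the session can touch the table again):

Code:
insert /*+ append */ into target_table
select * from source_table@source_link;
commit;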

So it looks like DataStage does not process it as a bulk operation.
But isn't it possible to achieve better performance?
How do others implement a distributed select without loading files?
seanc217
Premium Member
Posts: 188
Joined: Thu Sep 15, 2005 9:22 am

Post by seanc217 »

I know this is a DataStage forum, but the fastest way to do a one-time migration is to use Oracle's export/import (exp/imp) utilities.
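
If that route fits, a rough 9i-era sketch (the connect strings, file name, and table name are placeholders):

Code:
exp scott/tiger@sourcedb tables=BIG_TABLE file=big_table.dmp direct=y
imp scott/tiger@targetdb tables=BIG_TABLE file=big_table.dmp ignore=y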
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

If you're going to copy data across a network using DBLINKs, I hope your volume is low. INSERT INTO TABLE (SELECT * FROM TABLE@remotedatabase) should be avoided on large tables. Your network traffic, restartability, parallelism, and performance will suffer greatly.


Your solution is easy; it's just convincing folks about the hard parts. You need to separate getting the data out of the source from loading it into the target. Two steps are all you need.

1. Get the data out of the source table as fast as possible.
2. Put the data into the target table as fast as possible.

The fastest way to get the data out is to use multiple reading activities, each with its own writer. Range querying the table, round-robin selection (check into using MOD): choose your method. Just get as many processes reading and writing out the data as your hardware allows.
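
For instance, a range-based split across four readers might look something like this (the key column and boundaries are invented; a MOD-based round-robin split works the same way, with MOD(key_col, 4) = 0 through 3):

Code:
-- reader 1
SELECT * FROM source_table WHERE key_col <  10000000;
-- reader 2
SELECT * FROM source_table WHERE key_col >= 10000000 AND key_col < 20000000;
-- reader 3
SELECT * FROM source_table WHERE key_col >= 20000000 AND key_col < 30000000;
-- reader 4
SELECT * FROM source_table WHERE key_col >= 30000000;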

The fastest way to load the data is to align it with the partitioning of the table. If there is no partitioning, forget about these steps. Otherwise, take all of your staged data, separate it into one file per partition, and use DIRECT path, named-partition sqlldr loads. If you have GLOBAL indexes, you can't load separate partitions simultaneously, because each load will want to update the global indexes, so you need to disable or drop them first.

No partitioning means just blast away with sqlldr DIRECT path.

As for database imports and exports, you pick up a lot of baggage and requirements doing that. It will work, but you're looking at identical structures on both sides, partitioning, etc. If that's the case, your problem is one for a DBA, not a programmer. I'm giving you the programmer's solution.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
aditya
Charter Member
Posts: 41
Joined: Sat May 28, 2005 7:32 am

Post by aditya »

Did you try using the IPC Stage?

-- Aditya.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Putting in an IPC stage (or several of them) will only help if a DataStage job is CPU-bound on the DataStage server. In this case the bottleneck will not be CPU but database I/O, so an IPC stage won't make any difference to throughput.
kris_daredevil
Participant
Posts: 20
Joined: Mon Dec 05, 2005 8:07 pm
Contact:

Post by kris_daredevil »

I had the same problem. It was a one-time load (52 million rows), so I dropped the indexes on the target DB, loaded the target with the data, and recreated the indexes after it had loaded. As you know, we don't have rights in test and prod, so I asked our DBA to do the dropping and recreating of the indexes in test and prod.
KRIS