OCI Vs Dynamic RDBMS

Post questions here relating to DataStage Server Edition, covering areas such as Server job design, DS Basic, Routines, Job Sequences, etc.


shrey3a
Premium Member
Posts: 234
Joined: Sun Nov 21, 2004 10:41 pm

OCI Vs Dynamic RDBMS

Post by shrey3a »

Hi Gurus,

I'm in the process of modifying some DataStage jobs to address performance issues.

In these jobs the Dynamic RDBMS stage is being used to connect to an Oracle 9i database. I was wondering whether it would make a difference to replace them with Oracle 9i (OCI) stages, i.e. whether that would improve the performance of the jobs.

Thanks in advance

Regards,
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

No, look for your improvements in the design.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Why did Ken say no? Because, irrespective of the stage type, what happens is that SQL statements are sent to Oracle and the results of those are retrieved. It is (or should be) the same SQL in either case.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Since Halloween took up most of my day, I didn't get a chance to expand on Ray's comments.

Swapping out the Dynamic RDBMS stage for OCI is worthless, as I believe they both use SQL*Net anyway, so there's no performance difference. I could be wrong, but even if someone suggested swapping ODBC for OCI for performance reasons I'd say whoopee.

The problem is, if your job design is ODBC --> XFM --> ODBC and you think switching ODBC to OCI is going to make your job 2X faster, think again. Dynamic RDBMS to OCI isn't going to do anything different either. You need to be using every processor cycle you can in order to move data quicker.

A re-thinking of the design would be something like:

ODBC --> SEQ (run multiple instances of this with a ranging query to fast-unload the source data)

ODBC --> HASH (run multiple instances of this with a ranging query to fast-unload the target data that may be updated)

SEQ --> XFM w/ HASH reference --> SEQ (updates)
                              --> SEQ (inserts)
(multi-instance this job to XFM more data simultaneously in less time)

SEQ --> bulk load to target (inserts)
SEQ --> OCI (update-only SQL action)

Not only will I get the data out of the source database faster, I reduce the chance of holding a query snapshot too long, using too much rollback segment, etc. I'll transform faster using more CPUs, and by separating inserts from updates I ensure that final loading uses the fastest database methods possible. Also, the entire process has restartable segments with milestones along the way.
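To sketch the ranging-query idea from the unload steps above (the table, columns, and #LowKey#/#HighKey# job parameters are all made up for illustration):

-- Instance-specific key range: each multi-instance unload job takes its own
-- slice of the keys, so several extracts run in parallel to sequential files.
SELECT order_id, customer_id, order_date, order_amount
  FROM orders
 WHERE order_id >= #LowKey#     -- lower bound for this instance
   AND order_id <  #HighKey#;   -- upper bound for this instance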

These types of structured design techniques yield multiples of improvements.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

kcbland wrote: These types of structured design techniques yield multiples of improvements.
And run counter to a 'Best Practices' document I was just sent from IBM. I was flummoxed to read that landing data in Server jobs was frowned upon and specifically mentioned as a 'Thou Shalt Not'. :roll:
-craig

"You can never have too many knives" -- Logan Nine Fingers
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Yeah, they've got to mindwipe everyone into a PX frame of mind. You'll hear the same argument from Oracle about writing queries and using work tables.

Here's a document that's really illuminating. It shows how landing some data in Oracle helps you get your end results faster.

http://oraclesvca2.oracle.com/openworld ... W12681.doc

I worked with these guys on this project.

As for not landing data, why the hell do you have hash files then? How do you effectively build restartable milestones into long-running processes? How do you NOT pull all of the target tables into Merge/Lookup stages if you don't first scan your source data and use that information to shrink the targets/references down to only the necessary datasets?

Real client example. Every load cycle they processed an average of 20K rows of data. The target table held 100M rows. To decide between insert and update, they merged all 100M rows against the 20K. Their 128 CPUs were hammered during this PX job activity.

So, I had them take the 20K rows, scan them for keys, and load those keys into a global work table in the database. Then we inner-joined the work table's 20K keys with the 100M rows and threw the result into the hash file. Now the job just had to deal with 20K rows referencing fewer than 20K rows in a hash file. Any thoughts on which took less time to run? Our smart Server solution on a single 4-CPU node took less time than the PX solution with 128 CPUs slammed. Now, we could rewrite the Server job as a pure PX solution and be even faster. But for this customer, one pilot comparing pure Server against PX had the "smart" Server design accomplish in 28 minutes what the PX solution required 120 minutes for, and that was a 4-CPU node versus 16 nodes totalling 128 CPUs.
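To make the work-table trick concrete (the names here are invented for the example: STG_INCOMING holds the 20K arriving rows, GTT_KEYS is the global work table, TGT_FACT is the 100M-row target):

-- 1. Land the 20K incoming keys in the global work table.
INSERT INTO gtt_keys (business_key)
SELECT DISTINCT business_key FROM stg_incoming;

-- 2. Inner-join the work table to the big target so the reference extract
--    returns at most 20K rows; that small result set feeds the hash file.
SELECT t.business_key, t.surrogate_key, t.last_updated
  FROM tgt_fact t
  JOIN gtt_keys k ON k.business_key = t.business_key;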

The problem IMO, and I know they're reading this, is that they are deathly afraid of anyone espousing that it takes intelligent design to do ETL. That runs counter to the sales pitch. You might even think that not using the Server tool effectively contributes to the "you need to upgrade to PX for performance reasons" argument.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
PhilipDuPlessis
Participant
Posts: 23
Joined: Thu Oct 27, 2005 12:51 am

Post by PhilipDuPlessis »

Ken,

Dude, I am really new to DSExchange and DataStage in general so please be patient... Referring back to where you mentioned that performance would be roughly the same using an Oracle ODBC stage, would the same be said in the case of SQL (would it be faster to use a SQL ODBC stage than a generic ODBC stage, and if so, what would the performance difference be)?

On the subject of the work tables, once again, they give you the opportunity to index them, making the updates in particular run dramatically faster, as we discovered during one of our multi-million-row jobs...
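For example (names invented for illustration), an index on the work table's join key is what lets the update pass probe it instead of scanning it:

-- Index the work table on the key used by the updates.
CREATE INDEX wrk_orders_key_ix ON wrk_orders (order_id);

-- The update then does indexed lookups against the work table.
UPDATE target_orders t
   SET t.order_amount = (SELECT w.order_amount
                           FROM wrk_orders w
                          WHERE w.order_id = t.order_id)
 WHERE EXISTS (SELECT 1 FROM wrk_orders w WHERE w.order_id = t.order_id);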
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Oracle OCI, not Oracle ODBC. ODBC is Bill Gates' revenge on the Unix world.

Oracle, UDB, Informix, and Sybase all have their own connectivity software for their databases. OCI, CLI, whatever the moniker. ODBC is a PC thing foisted on the Unix world.

In theory, database vendors know their own databases better and can provide faster interfaces. But they're not 2X faster, more like 20-30% faster. They have more features specific to their database than the ODBC standards allow, so in some cases they are faster because of a particular feature (tuning/connection options, TRUNCATE vs. deletes, etc.).
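For example, the TRUNCATE-versus-delete difference looks like this in Oracle (STG_ORDERS is an invented staging table name):

-- DELETE is DML: it generates undo for every row and can be rolled back,
-- which is slow and rollback-segment hungry on a large staging table.
DELETE FROM stg_orders;

-- TRUNCATE is DDL: it deallocates the table's segments with minimal undo
-- and cannot be rolled back, so emptying a big table is near-instant.
TRUNCATE TABLE stg_orders;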

I believe the MS-SQL stage, which is new, uses the same ODBC connection defined on the server. Oracle OCI versus ODBC is a different configuration: Oracle OCI requires you to install the Oracle client software (highly preferred anyway) on your PC, whereas ODBC requires you to set up a system DSN under Control Panel.

As for the work tables, I'll paraphrase myself again. The best tool can be defeated by the worst user. The best user can make the worst tool shine. I can't beat Michael Jordan by putting him in high heels and myself in Nikes.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
PhilipDuPlessis
Participant
Posts: 23
Joined: Thu Oct 27, 2005 12:51 am

Post by PhilipDuPlessis »

Thanks Ken. Man, this is gonna be a bumpy ride with loads of learning...
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

The data integration business is not kiddie play. There's no magic fairy dust you can sprinkle on people to make them move billions of bytes of data better. The ETL Genie does not grant wishes for moving data faster just by checking a box or tweaking an array size.

Someone who attempts to load 2 billion rows of data in one pass needs to learn the lesson that's coming. Any company that puts a SQL-ignorant person in front of an ETL tool client and says "move my data" deserves the quality of result it gets. The Unix goblin has to be defeated; you can't just assume the ETL tool elevates you away from the operating system, and you can't be "OS-unaware".
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle