How is update performance in DataStage?

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

kollurianu
Premium Member
Posts: 614
Joined: Fri Feb 06, 2004 3:59 pm

How is update performance in DataStage?

Post by kollurianu »

How do updates perform in version 7.5 against an Oracle table, for 300 thousand rows? Is DataStage a good choice, or is it better to use Oracle PL/SQL?

We are leaning towards using DataStage. What do you say?

Can anyone help with suggestions?
poorna_76
Charter Member
Posts: 190
Joined: Thu Jul 08, 2004 10:42 am

Re: How is update performance in DataStage?

Post by poorna_76 »

kollurianu wrote:How do updates perform in version 7.5 against an Oracle table, for 300 thousand rows? Is DataStage a good choice, or is it better to use Oracle PL/SQL?

We are leaning towards using DataStage. What do you say?

Can anyone help with suggestions?
It's better to use DataStage.
DataStage seems to perform better.

Thanks
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

There's a huge difference. You must consider how SQL and relational databases work. DataStage has NO SECRET TRICKS. It simply uses DML statements just like everyone else. Read the SQL statement in the stage and you see INSERT/UPDATE syntax.
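
For illustration (the table and column names here are made up), the generated statements are just ordinary parameterized DML, something like:

Code:

-- Made-up table and columns; a stage issues parameterized DML like
-- this through OCI, one bind variable per column.
INSERT INTO customer_dim (customer_id, name, city)
VALUES (:1, :2, :3);

UPDATE customer_dim
SET    name = :2,
       city = :3
WHERE  customer_id = :1;   -- keyed on the primary key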

How you do updates does matter. If your SQL is using wildcard updates (an update statement whose WHERE clause does not specify the primary key), then performance is unpredictable, because one row coming out of DataStage can cause 0, 1, or many rows to be updated in the database. The performance depends on the ability of the database to find the rows that require updating.
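
A quick sketch of the difference, on a made-up orders table:

Code:

-- Keyed update: the WHERE clause supplies the full primary key, so
-- each incoming row touches exactly one table row via the PK index.
UPDATE orders SET status = :2 WHERE order_id = :1;

-- Wildcard update: no primary key in the WHERE clause, so each
-- incoming row may update 0, 1, or many rows, and the database may
-- have to scan to find them.
UPDATE orders SET status = :2 WHERE region = :1;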

Next, if you are using the primary key to update existing rows, you are operating at the speed at which DataStage can deliver a row to the database, plus the time the database takes to find the row and update it. You get into all kinds of nasty things, like chained rows, where the replacement row is bigger than the row already in the table: the database has to split the data into separate pieces and store them in different places in the table. You also have to deal with large partitioned tables having global indexes that require updating, and with potentially not supplying the partitioning column, in which case the update is less efficient because you are not doing partition elimination.
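
To illustrate the partition elimination point, assuming a made-up sales table partitioned by sale_date:

Code:

-- Partition key supplied: Oracle can prune the update to one partition.
UPDATE sales
SET    amount    = :3
WHERE  sale_id   = :1
AND    sale_date = :2;

-- Partition key not supplied: every partition (and any global index)
-- has to be considered.
UPDATE sales
SET    amount  = :3
WHERE  sale_id = :1;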

There's a lot more to say about updates, but in general, updates are VERY SLOW and you need to seriously understand how a database works with them. There are ways to expedite them, but they are not tool based.
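
One common database-side (not tool-based) technique, sketched here with made-up table names, is to bulk-insert the changed rows into a staging table and then apply them in one set-based statement, instead of hundreds of thousands of single-row updates:

Code:

-- Staging table was bulk-loaded first (inserts are cheap); one MERGE
-- then applies all the changes in a single set-based operation.
MERGE INTO customer_dim t
USING customer_stage s
ON (t.customer_id = s.customer_id)
WHEN MATCHED THEN
  UPDATE SET t.name = s.name, t.city = s.city;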
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
kollurianu
Premium Member
Posts: 614
Joined: Fri Feb 06, 2004 3:59 pm

Post by kollurianu »

Hi Bland,

Thank you all for the answers. But Bland, do you mean to say that updates are normally slow regardless of the tool, and that it only depends on the database?

Can you clarify this for me?

Thank you
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

The database controls what data is allowed to load, not the tool. Please understand this point fully: THERE ARE NO TOOLS THAT EXIST THAT CAN PUT DATA INTO A DATABASE. You either use an ODBC driver, a database-supplied interface library (OCI/CLI, etc.), or a command-line data loader (sqlldr, TMU, isql, dbaccess, etc.) supplied by the database maker. Ascential, Ab Initio, Informatica, Sunopsis, MS-DTS, OWB, it doesn't matter: they all use the same methods to load data. The database does all of the work; the tools use the supplied interfaces.

You can have arguments that native OCI/CLI connections are faster than ODBC and probably be correct. This is because ODBC is a generic driver, whereas the OCI/CLI interface is a database maker supplied library tuned for that database.

Why are updates slower than inserts? Because that is the nature of databases. When a database updates a row, it traditionally does more work than when inserting a row. It's easier to insert than update, and that's why most bulk loaders only support inserting rows (Red Brick's TMU being the best exception).
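
A rough sketch of the extra work, on a made-up indexed table:

Code:

CREATE TABLE t (id NUMBER PRIMARY KEY, payload VARCHAR2(100));

-- An insert writes a new row and adds one index entry.
INSERT INTO t VALUES (1, 'abc');

-- An update must first find the existing row (index lookup), write
-- undo for the old version, rewrite the row (possibly migrating or
-- chaining it if the new value is larger), and maintain any indexes
-- on the changed columns.
UPDATE t SET payload = 'a much longer replacement value' WHERE id = 1;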

I wholeheartedly recommend visiting http://asktom.oracle.com/ and also suggest you get your hands on Expert One-on-One Oracle by Tom Kyte. An excellent understanding of databases and how they work will make a person a better ETL developer. Using a tool does not automatically make things faster, usually just easier.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

kcbland wrote:Ascential, Ab Initio, Informatica, Sunopsis, MS-DTS, OWB, it doesn't matter, they all use the same methods to load data. The database does all of the work, the tools use the supplied interfaces.
Let me be very clear here. Even PX will partition the data to align with the physical database partitions, by querying database catalogs and such. But it does not physically write to the underlying data files in the database; that is still handled by the database. In UDB the coordinator node is bypassed and the data is written directly to the physical partitions, because UDB makes this information available, but UDB still manages these activities. This is true for insert-only processing; updating is still managed using DML.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
kollurianu
Premium Member
Posts: 614
Joined: Fri Feb 06, 2004 3:59 pm

Post by kollurianu »

Thank you all for the answers,

But can anyone clarify for me: what would be the difference between the Oracle bulk loader and normal inserts through the OCI stage?


Any answers would be appreciated.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

kollurianu wrote:What would be the difference between the Oracle bulk loader and normal inserts through the OCI stage?
The sqlldr CONVENTIONAL method is exactly the same as INSERT SQL statements; no difference.

The sqlldr DIRECT method bypasses INDEXES, CONSTRAINTS, and TRIGGERS to write directly to the data files. As part of post-load processing, indexes and constraints are updated and re-enabled. Direct path loading is clocked at hundreds of times faster, but you have to make sure that your data is clean and consists of absolutely new rows. Direct path is extremely fast, but you have to be extremely careful using it. It was built for data warehousing, because the type of ETL processing you do in data warehousing usually performs the checks required to use this method safely. Failure to do those steps means you cannot take advantage of direct path and must use the slower CONVENTIONAL method. The ETL steps are: insert versus update row determination; surrogate key generation within the ETL processing, not using database lookups and writes; data quality verification for data types; and constraint/foreign key validation.
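
The SQL-level analogue, as a rough sketch (the APPEND hint requests a direct-path insert from within SQL, much as DIRECT=TRUE does on the sqlldr command line; table names are made up):

Code:

-- Conventional path: ordinary row-by-row insert processing; indexes,
-- constraints, and triggers are all maintained as each row arrives.
INSERT INTO fact_sales SELECT * FROM stage_sales;

-- Direct path: formatted blocks are written above the table's
-- high-water mark, bypassing much of the conventional machinery.
INSERT /*+ APPEND */ INTO fact_sales SELECT * FROM stage_sales;
COMMIT;  -- required before the same session can query the table again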
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

One big difference is that the Oracle bulk loader does not do updates. Only inserts.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kollurianu
Premium Member
Posts: 614
Joined: Fri Feb 06, 2004 3:59 pm

Post by kollurianu »

I know the Oracle bulk loader just does inserts. What I want to know is whether there is any performance difference when we insert data through normal OCI versus through the bulk loader, and if so, why?

Thank you very much for all your answers.
diamondabhi
Premium Member
Posts: 108
Joined: Sat Feb 05, 2005 6:52 pm
Location: US

Post by diamondabhi »

Hi,
The bulk loader loads into only one table at a time, so it is comparatively fast. When you have to do bulk loading with no updates, it's better to use the bulk loader.

Thanks.
kollurianu
Premium Member
Posts: 614
Joined: Fri Feb 06, 2004 3:59 pm

Post by kollurianu »

Hi Abhi,

I did not get your point. You say the bulk loader is faster as it loads to one table; what does that mean?

And also, can anyone please shed some light on this: is the bulk loader stage better, or is sqlldr better? Does one perform better than the other, and if so, what is each one good for?

Thank you
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
IMHO, Ken's postings here are FAQ worthy.
Regarding the bulk loaders, I think a search will find previous topics on this, and conclude that using the command-line utility supplied with the DB client software is preferred over using the bulk loader stage. (The stage is normally used to generate the data and control files for the command-line bulk loader, though it does have loading capabilities of its own.)

IHTH,
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org