BEST UPSERT TECHNIQUE

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

ZEEVS1970
Participant
Posts: 6
Joined: Wed Mar 08, 2006 5:57 am

BEST UPSERT TECHNIQUE

Post by ZEEVS1970 »

WHAT IS THE BEST WAY TO PERFORM AN UPSERT TO A TABLE?

I TRIED DOING A DIRECT UPSERT VIA "UPDATE OR INSERT"
AND DISCOVERED IT TO BE MUCH SLOWER THAN PERFORMING
A LOOKUP USING A HASHED FILE.

HOWEVER, I CANNOT SAY FOR SURE WHETHER LIMITING THE HASHED FILE
TO NEWER ROWS MAKES THE PROCESS MORE EFFICIENT.

I WOULD APPRECIATE ADVICE ON WHICH TECHNIQUE IS BEST
AND RUNS THE FASTEST, ASSUMING WE ARE DEALING WITH A RELATIVELY LARGE DATABASE OF MILLIONS OF RECORDS.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Zeevs1970,

first off, welcome to DSXchange.

Could you explain your problem a bit more, since writing to a hashed file doesn't directly relate to doing an upsert into a table - unless you are doing it to an ODBC data source that links to a hashed file.

Generally it is much faster to split your updates and inserts and use your database's bulk load tool to load the inserts quickly.

P.S. Using all UPPERCASE is considered to be shouting and is generally frowned upon.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

During transformation, split your data into rows for insertion and rows for updating. Bulk load your inserts, then perform the updates by either bulk loading into a work table and merging the data, or by simply using update existing rows only. This method will be the fastest.
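For the work-table route, the SQL ends up looking roughly like this (table and column names are invented purely for illustration; treat it as a sketch, not a drop-in statement):

-- Bulk load the update candidates into a work table first,
-- then fold them into the target with one set-based statement.
MERGE INTO target_tbl t
USING update_work_tbl w
   ON (t.cust_id = w.cust_id)          -- the natural key
WHEN MATCHED THEN
  UPDATE SET t.cust_name  = w.cust_name,
             t.updated_dt = w.updated_dt;
-- (On Oracle 9i, where both MERGE branches are mandatory, a correlated
--  UPDATE from the work table does the same job.)
-- The rows you identified as pure inserts never touch this statement;
-- they go straight into the target via the bulk loader.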
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ZEEVS1970
Participant
Posts: 6
Joined: Wed Mar 08, 2006 5:57 am

That's what I thought also

Post by ZEEVS1970 »

However, I do need a lookup hashed file to determine
which rows to update and which rows to insert.

I created a job with 1 million input rows that need to be upserted
into table A, which contains 20 million rows.

To do this I also created a hashed lookup file containing all records
already existing in table A.

Then, of course, I'm splitting the transformer output into insert and update links
according to the lookup results.

I also tried limiting the hashed file size by retrieving only the newest
rows from table A into the hashed file, but it doesn't seem to shorten the process at all.

So am I to understand that the best way remains a classic insert/update
based on a hashed file lookup containing all records?
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

If you take the natural keys in your source data, load them into a work table in your target database, then inner join those keys to your target table and spool those rows to the hashed lookup file, your hashed file will only contain those rows found in the database. During transformation, reference this hash file and divide inserts and updates.

It seems like more jobs and more processing, but each step is very fast. Pure inserts of natural keys into a work table can be bulk loaded. Inner join spooling is faster than full table spooling, especially if your not-found percentage approaches 100%. Putting fewer rows into a hash file is faster. Referencing a smaller hash file is faster. Bulk loading more rows as dedicated inserts is faster.
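To make that concrete, the key-spooling step is roughly this shape (all names here are examples only):

-- 1. Bulk load just the natural keys from the source feed into a work table,
--    say src_key_work(cust_id).
-- 2. Spool only the keys that already exist in the target; this result set
--    feeds the hashed lookup file.
SELECT t.cust_id
  FROM target_tbl t
       INNER JOIN src_key_work k ON k.cust_id = t.cust_id;
-- Any source row whose key is not found in the hashed file is, by definition, an insert.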
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Your hashed file only needs to contain 20 million KEYS, which is a much smaller hashed file than 20 million rows, if all you're doing is checking for existence.
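If it helps to see the difference, the hashed file build query can be as narrow as this (table and column names are only examples):

-- Full-row spool: 20 million wide rows into the hashed file
-- SELECT * FROM target_tbl;
-- Key-only spool: the same existence check from a far smaller hashed file
SELECT cust_id FROM target_tbl;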
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

One million keys, max. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ZEEVS1970
Participant
Posts: 6
Joined: Wed Mar 08, 2006 5:57 am

maybe the % of input rows to be updated should be considered

Post by ZEEVS1970 »

I did try the method you suggested but strangely enough got quite
poor results.

I'm assuming it may be because too many rows need updating.

For this test I created a source table with 1 million rows derived from
the 20 million row target table and made the following changes:
1. For half of them I changed an ordinary field.
2. For the other half I changed a primary key so the row would be
taken as new.

When I ran my job using all 20 million target rows loaded into a lookup hashed
file, the process took 58 minutes.

However, when I used the method you suggested involving the inner join,
the process took 1 hour and 45 minutes.

I'm still searching for a method that brings the process down to less than
1 hour.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Which part of the process took that long? If it is the inner join, look to optimizing your hashed file build query...
-craig

"You can never have too many knives" -- Logan Nine Fingers
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Did you write the inserts to a file? Did you write the updates to a file? I bet you wrote directly to the database. If your hash file only contains the necessary rows for processing, you'll find the hash file is more performant. Trust Craig and me on this.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia
Contact:

Post by rleishman »

I took a slightly different approach. Rather than join to the target in transform, I transformed as if everything was an insert, even though some of these rows would eventually fully replace existing rows. The target of the transformation was a flat file.

The objective was then to MERGE that flat file with the target table. In Oracle, this can be done with the MERGE statement. This was my first approach, and it worked OK except that there is a bug with MERGE /*+ APPEND */ in 10g if the target table has a Materialized View Log. /*+ APPEND */ is also inferior to SQL*Loader because it does not work if there are enabled foreign keys.
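The MERGE itself is nothing exotic; roughly this, with placeholder names and trimmed down to a couple of columns:

-- staging_tbl is the transformed flat file, loaded or exposed as an external table
MERGE /*+ APPEND */ INTO target_tbl t
USING staging_tbl s
   ON (t.cust_id = s.cust_id)
WHEN MATCHED THEN
  UPDATE SET t.cust_name = s.cust_name
WHEN NOT MATCHED THEN
  INSERT (t.cust_id, t.cust_name)
  VALUES (s.cust_id, s.cust_name);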

Instead I decided to DELETE the rows that existed, and SQL*Load the entire file knowing that all of them would load correctly. This worked great if there were not many DELETEs (it was at least as good as the MERGE), but like the MERGE, it was nasty if there were lots of DELETEs.

To counter this, I wrote a generic re-usable PL/SQL routine that does the following:
- Joins the flat file (as an EOT) to the target, loading the ROWIDs of matching rows into a GLOBAL TEMPORARY TABLE. These are the rows to be deleted.
- For each partition in the target table, if the number of rows in that partition to be deleted is less than x% (I use 10%) of the total rows, then just delete them.
- If more than 10% of the partition is to be deleted, then it is done by Partition Exchange. ie. CREATE TABLE pex AS SELECT * FROM target PARTITION (part) WHERE rowid NOT IN (SELECT row_id FROM gtt). Create the same index and foreign keys as the target table, and then exchange it with the partition in the target.

The whole thing dynamically generates the DML and DDL required from the data dictionary and from minimal information passed by parameters. The code could be easily adapted to any other 10g site. Send me a PM if you are interested, and I will post it on the web.
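In outline, heavily simplified and with an invented partition name, the partition-exchange branch amounts to:

-- Rebuild the partition minus the doomed rows...
CREATE TABLE pex AS
  SELECT *
    FROM target_tbl PARTITION (p_2006_03)
   WHERE ROWID NOT IN (SELECT row_id FROM gtt);

-- ...recreate the partition's indexes and foreign keys on PEX here...

-- ...then swap it straight into the target.
ALTER TABLE target_tbl
  EXCHANGE PARTITION p_2006_03 WITH TABLE pex
  INCLUDING INDEXES WITHOUT VALIDATION;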
Ross Leishman
RodBarnes
Charter Member
Posts: 182
Joined: Fri Mar 18, 2005 2:10 pm

Spooling to a hash file

Post by RodBarnes »

kcbland wrote:If you take the natural keys in your source data, load them into a work table in your target database, then inner join those keys to your target table and spool those rows to the hashed lookup file, your hashed file will only contain those rows found in the database...

Inner join spooling is faster than full table spooling, especially if your not found % reaches near 100%...
I am very intrigued by your description here, as it fits a current situation where I am trying to improve performance. The whole bulk-load, inner-join method you describe makes sense to me, but could you elaborate on "spool those rows to the hashed lookup file"? Maybe I am just not familiar with the term spooling in this context.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Spooling, old slang for extracting the rows from a data source and writing to a target. "Spool to tape, spool to disk, spool to file." In this case, I'm talking about writing out to either a sequential or hashed file from a database. One would never spool to a database, however.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

A sourcing, extracting, unloading, exporting, or spooling job would be something like OCI-->XFM-->SEQ.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
RodBarnes
Charter Member
Posts: 182
Joined: Fri Mar 18, 2005 2:10 pm

Post by RodBarnes »

kcbland wrote:Spooling, old slang for extracting the rows from a data source and writing to a target...
So, it would be something like this:
1) Insert extracted keys into temp table in DB.
2) Inner join to get list of existing keys and output from DB to hash file.
3) Process extracted rows using hash file to update existing rows in DB.
4) Process extracted rows using hash file to insert new rows in DB.

Correct?