BEST UPSERT TECHNIQUE

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

ZEEVS1970
Participant
Posts: 6
Joined: Wed Mar 08, 2006 5:57 am

BEST UPSERT TECHNIQUE

Post by ZEEVS1970 »

WHAT IS THE BEST WAY TO PERFORM AN UPSERT TO A TABLE?

I TRIED DOING A DIRECT UPSERT VIA "UPDATE OR INSERT"
AND DISCOVERED IT TO BE MUCH SLOWER THAN PERFORMING
A LOOKUP USING A HASHED FILE.

HOWEVER, I CANNOT SAY FOR SURE WHETHER LIMITING THE HASHED FILE
TO NEWER ROWS MAKES THE PROCESS MORE EFFICIENT.

I WOULD APPRECIATE ADVICE ON WHICH TECHNIQUE IS BEST
AND RUNS THE FASTEST, ASSUMING WE ARE DEALING WITH A RELATIVELY LARGE DATABASE OF MILLIONS OF RECORDS.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Zeevs1970,

first off, welcome to DSXchange.

Could you explain your problem a bit more, since writing to a hashed file doesn't directly relate to doing an upsert into a table - unless you are doing it to an ODBC data source that links to a hashed file.

Generally it is much faster to split your updates and inserts and use your database's bulk load tool to load the inserts quickly.

P.S. Using all UPPERCASE is considered to be shouting and is generally frowned upon.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

During transformation, split your data into rows for insertion and rows for updating. Bulk load your inserts, then perform the updates by either bulk loading into a work table and merging the data, or by simply using update existing rows only. This method will be the fastest.
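For the work-table route, the SQL ends up looking roughly like this (table and column names are invented purely for illustration; treat it as a sketch, not a drop-in statement):

-- Bulk load the update candidates into a work table first,
-- then fold them into the target with one set-based statement.
MERGE INTO target_tbl t
USING update_work_tbl w
   ON (t.cust_id = w.cust_id)          -- the natural key
WHEN MATCHED THEN
  UPDATE SET t.cust_name  = w.cust_name,
             t.updated_dt = w.updated_dt;
-- (On Oracle 9i, where both MERGE branches are mandatory, a correlated
--  UPDATE from the work table does the same job.)
-- The rows you identified as pure inserts never touch this statement;
-- they go straight into the target via the bulk loader.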
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ZEEVS1970
Participant
Posts: 6
Joined: Wed Mar 08, 2006 5:57 am

That's what I thought also

Post by ZEEVS1970 »

However, I do need a lookup hashed file to determine
which rows to update and which rows to insert.

I created a job with 1 million input rows that need to be upserted
into table A, which contains 20 million rows.

To do this I also created a hashed lookup file containing all records
already existing in table A.

Then, of course, I'm splitting the transformer output into insert and update links
according to the lookup results.

I also tried limiting the hashed file size by retrieving only the newest
rows from table A into the hashed file, but it doesn't seem to shorten the process at all.

So am I to understand that the best way remains a classic insert/update
based on a hashed file lookup containing all records?
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

If you take the natural keys in your source data, load them into a work table in your target database, then inner join those keys to your target table and spool those rows to the hashed lookup file, your hashed file will only contain those rows found in the database. During transformation, reference this hash file and divide inserts and updates.

It seems like more jobs and more processing, but each step is very fast. Pure inserts of natural keys into a work table can be bulk loaded. Inner join spooling is faster than full table spooling, especially if your not-found percentage approaches 100%. Putting fewer rows into a hash file is faster. Referencing a smaller hash file is faster. Bulk loading more rows as dedicated inserts is faster.
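To make that concrete, the key-spooling step is roughly this shape (all names here are examples only):

-- 1. Bulk load just the natural keys from the source feed into a work table,
--    say src_key_work(cust_id).
-- 2. Spool only the keys that already exist in the target; this result set
--    feeds the hashed lookup file.
SELECT t.cust_id
  FROM target_tbl t
       INNER JOIN src_key_work k ON k.cust_id = t.cust_id;
-- Any source row whose key is not found in the hashed file is, by definition, an insert.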
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Your hashed file only needs to contain 20 million KEYS, which is a much smaller hashed file than 20 million rows, if all you're doing is checking for existence.
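If it helps to see the difference, the hashed file build query can be as narrow as this (table and column names are only examples):

-- Full-row spool: 20 million wide rows into the hashed file
-- SELECT * FROM target_tbl;
-- Key-only spool: the same existence check from a far smaller hashed file
SELECT cust_id FROM target_tbl;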
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

One million keys, max. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ZEEVS1970
Participant
Posts: 6
Joined: Wed Mar 08, 2006 5:57 am

maybe the % of input rows to be updated should be considered

Post by ZEEVS1970 »

I did try the method you suggested but strangely enough got quite
poor results.

I'm assuming it may be because too many rows need updating.

For this test I created a source table with 1 million rows derived from
the 20 million row target table and made the following changes:
1. For half of them I changed an ordinary field.
2. For the other half I changed a primary key so the row would be
taken as new.

When I ran my job using all 20 million target rows loaded into a lookup hashed
file, the process took 58 minutes.

However, when I used the method you suggested involving the inner join,
the process took 1 hour and 45 minutes.

I'm still searching for a method that brings the process down to less than
1 hour.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Which part of the process took that long? If it is the inner join, look to optimizing your hashed file build query...
-craig

"You can never have too many knives" -- Logan Nine Fingers
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Did you write the inserts to a file? Did you write the updates to a file? I bet you wrote directly to the database. If your hash file only contains the necessary rows for processing, you'll find the hash file is more performant. Trust Craig and me on this.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia
Contact:

Post by rleishman »

I took a slightly different approach. Rather than join to the target in transform, I transformed as if everything was an insert, even though some of these rows would eventually fully replace existing rows. The target of the transformation was a flat file.

The objective was then to MERGE that flat file with the target table. In Oracle, this can be done with the MERGE statement. This was my first approach, and it worked OK except that there is a bug with MERGE /*+ APPEND */ in 10g if the target table has a Materialized View Log. /*+ APPEND */ is also inferior to SQL*Loader because it does not work if there are enabled foreign keys.
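The MERGE itself is nothing exotic; roughly this, with placeholder names and trimmed down to a couple of columns:

-- staging_tbl is the transformed flat file, loaded or exposed as an external table
MERGE /*+ APPEND */ INTO target_tbl t
USING staging_tbl s
   ON (t.cust_id = s.cust_id)
WHEN MATCHED THEN
  UPDATE SET t.cust_name = s.cust_name
WHEN NOT MATCHED THEN
  INSERT (t.cust_id, t.cust_name)
  VALUES (s.cust_id, s.cust_name);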

Instead I decided to DELETE the rows that existed, and SQL*Load the entire file knowing that all of them would load correctly. This worked great if there were not many DELETEs (it was at least as good as the MERGE), but like the MERGE, it was nasty if there were lots of DELETEs.

To counter this, I wrote a generic re-usable PL/SQL routine that does the following:
- Joins the flat file (as an EOT) to the target, loading the ROWIDs of matching rows into a GLOBAL TEMPORARY TABLE. These are the rows to be deleted.
- For each partition in the target table, if the number of rows in that partition to be deleted is less than x% (I use 10%) of the total rows, then just delete them.
- If more than 10% of the partition is to be deleted, then it is done by Partition Exchange. ie. CREATE TABLE pex AS SELECT * FROM target PARTITION (part) WHERE rowid NOT IN (SELECT row_id FROM gtt). Create the same index and foreign keys as the target table, and then exchange it with the partition in the target.

The whole thing dynamically generates the DML and DDL required from the data dictionary and from minimal information passed by parameters. The code could be easily adapted to any other 10g site. Send me a PM if you are interested, and I will post it on the web.
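In outline, heavily simplified and with an invented partition name, the partition-exchange branch amounts to:

-- Rebuild the partition minus the doomed rows...
CREATE TABLE pex AS
  SELECT *
    FROM target_tbl PARTITION (p_2006_03)
   WHERE ROWID NOT IN (SELECT row_id FROM gtt);

-- ...recreate the partition's indexes and foreign keys on PEX here...

-- ...then swap it straight into the target.
ALTER TABLE target_tbl
  EXCHANGE PARTITION p_2006_03 WITH TABLE pex
  INCLUDING INDEXES WITHOUT VALIDATION;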
Ross Leishman
RodBarnes
Charter Member
Posts: 182
Joined: Fri Mar 18, 2005 2:10 pm

Spooling to a hash file

Post by RodBarnes »

kcbland wrote:If you take the natural keys in your source data, load them into a work table in your target database, then inner join those keys to your target table and spool those rows to the hashed lookup file, your hashed file will only contain those rows found in the database...

Inner join spooling is faster than full table spooling, especially if your not found % reaches near 100%...
I am very intrigued by your description here, as it fits a current situation where I am trying to improve performance. The whole bulk-load, inner-join method you describe makes sense to me, but could you elaborate on "spool those rows to the hashed lookup file"? Maybe I am just not familiar with the term spooling in this context.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Spooling, old slang for extracting the rows from a data source and writing to a target. "Spool to tape, spool to disk, spool to file." In this case, I'm talking about writing out to either a sequential or hashed file from a database. One would never spool to a database, however.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

A sourcing, extracting, unloading, exporting, or spooling job would be something like OCI-->XFM-->SEQ.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
RodBarnes
Charter Member
Posts: 182
Joined: Fri Mar 18, 2005 2:10 pm

Post by RodBarnes »

kcbland wrote:Spooling, old slang for extracting the rows from a data source and writing to a target...
So, it would be something like this:
1) Insert extracted keys into temp table in DB.
2) Inner join to get list of existing keys and output from DB to hash file.
3) Process extracted rows using hash file to update existing rows in DB.
4) Process extracted rows using hash file to insert new rows in DB.

Correct?