Insert New/Update Existing vs. Update Existing/Insert New





Post by sbass1 »

Hi,

I have a job that:

* Extracts data from Sybase
* Performs two lookups: one against the target table for target SK lookup or creation (KeyMgtGetNextValueConcurrent), and one against a calendar hashed file for date SK lookup
* Writes the result to a target table (DRS stage, SQL Server table).
* Sends rejects to a sequential file (rejects should be very rare if at all).

The job is performing poorly, so I'm looking for things to tweak. Areas I'm looking at:

* Source Stage: Sybase Stage, Packet Size: change from blank to 1024 (still experimenting to find the best value)

* Target Table Lookup Stage (for SK lookup): DRS Stage, Array Size: change from 1 to 10000 (again, experimenting to find the best value). Note I don't want to create a hashed file for this lookup unless this lookup against a large target table is the key bottleneck.

* Target Stage: DRS Stage:

--> Array Size: I've left this at 1, due to this help text:
During input link processing, rows of data may be rejected by the database for various reasons, such as unique constraint violations or data type mismatches.

The DRS stage writes the offending row to the log for the DataStage job. For the database message detail, you must use the error messages returned by the database.

DataStage provides additional reject row handling. To use this capability:

Set Array Size to 1.

Use a Transformer stage to redirect the rejected rows.

You can then design your job by choosing an appropriate target for the rejected rows, such as a Sequential stage. Reuse this target as an input source once you resolve the issues with the offending row values.
--> Transaction Size: I've left this at 0, due to this help text:
Specifies the number of rows written before a commit is executed in the database. A value of 0 causes all rows in the job to be written as a single transaction.
--> Update Action: This is where I need clarification of the help text. It says:
* Update existing rows or insert new rows. Updates the existing data rows before adding new rows. It is faster to update first when you have a large number of records.

* Insert new rows or update existing rows. Inserts the new rows before updating existing rows. It is faster to insert first if you have only a few records.
But which "large (few) number of records" is it referring to: the source, or the target? I assume the target, since it's the update action, but I just want to confirm.

Two scenarios:

1) The initial load of the full source extract into the empty target table is taking a LONG time. So, in this scenario, Source = "large", Target = "few (none)".

2) The "normal" load will be a small delta file into the large target. So, in this scenario, Source = "small", Target = "large".

I'd like both scenarios to have optimum performance, but priority goes to scenario #2.

Thanks,
Scott
Some people are the Michael Jordans of DataStage. I'm more like Muggsy Bogues :-)

Post by chulett »

Combo actions are something I personally avoid the vast majority of the time except for very small data volumes - i.e. one row. I prefer qualifying my actions and using dedicated insert and update links. It also means that - if desired - I can land both to files, bulk load the inserts and then do the updates after that. Doing that also means you don't get into the situation you seem to be in - your ideal choice for delta runs causes poor performance during the initial run... or vice versa. That being said...

I can't really comment on their choice of words in the help text. The biggest thing you have to understand is that the first action must fail in order for the second action to be triggered. So you want to use the option that will get you the most successful actions on the first try. If the 'majority' of your work will be inserting new records, make sure you pick the 'Insert then Update' action. Contrary-wise, if you're going to be doing mostly updates, then choose the other action. Getting it wrong is what kills performance because you've now changed what should have been one database operation into two. Also make sure you have the unique index you need in order for 'Insert else Update' to work properly if you go that route.
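In other words, per row the stage is effectively doing something like this (pseudo T-SQL just to illustrate the ordering - the table and column names are made up, and the real statements go through the driver with parameter markers):

-- dbo.TargetFact and its columns are invented for this example
-- 'Update existing rows or insert new rows'
UPDATE dbo.TargetFact
SET    date_sk = ?, amount = ?
WHERE  target_sk = ?

IF @@ROWCOUNT = 0
    INSERT INTO dbo.TargetFact (target_sk, date_sk, amount)
    VALUES (?, ?, ?)

-- 'Insert new rows or update existing rows' is the mirror image:
-- attempt the INSERT first and only fall back to the UPDATE when
-- the insert fails on the unique key.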

Hope that helps.
-craig

"You can never have too many knives" -- Logan Nine Fingers

Post by sbass1 »

Hi Craig,

Thanks for the ideas. Follow up questions:

1) DS is on Unix, SQL Server on ... drum roll please ... Windows ;) So if I did create a sequential file for bulk insert, I assume it would have to be ftp'd to Windows or otherwise accessible from the SQL Server machine.

A quick read of SQL Server's bulk load capability turned up this example:
BULK INSERT AdventureWorks.Sales.SalesOrderDetail
FROM 'f:\orders\lineitem.tbl'
WITH
(
    FIELDTERMINATOR = ' |',
    ROWTERMINATOR = ' |\n'
)
So I'm assuming bulk insert of my inserts falls into the "too hard" bucket, unless there's some other bulk insert capability for SQL Server that I'm unaware of.
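If I did pursue it, I gather BULK INSERT can also read from a UNC path, as long as the account SQL Server runs under can see the share. So landing the insert file on (or copying it to) a Windows share the database server can reach might be enough - something like this, where the share, file and table names are just placeholders I've made up:

-- share, file and table names below are made-up placeholders
BULK INSERT dbo.TargetFact
FROM '\\etlhost\landing\target_fact_inserts.dat'
WITH
(
    FIELDTERMINATOR = '|',
    ROWTERMINATOR = '\n',
    TABLOCK
)

But that still means sorting out the share and the permissions, so for now I'll treat it as a fallback.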

2) I'll investigate splitting the stream, but I'm worried about the deadlocks I often hit with this approach: when I have two concurrent database connections, one inserting and one updating, I frequently run into deadlocks.

I'll play with #2, but if you (or anyone) knows how to get DS to delay one stream (link) until another one (from the same transformer) has finished, please let me know (I don't think it's possible).

I'll definitely keep in mind the general best practice to avoid combo actions.
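If I do end up splitting the stream, my understanding is that each link then carries exactly one single-purpose statement per row, with no fallback - roughly like this (again, table and column names invented just for illustration):

-- table and column names are invented for this sketch
-- insert link: rows the target-table lookup flagged as new
INSERT INTO dbo.TargetFact (target_sk, date_sk, amount)
VALUES (?, ?, ?)

-- update link: rows the lookup found an existing SK for
UPDATE dbo.TargetFact
SET    date_sk = ?, amount = ?
WHERE  target_sk = ?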

Thanks,
Scott
Some people are the Michael Jordans of DataStage. I'm more like Muggsy Bogues :-)

Post by chulett »

1) Ah, yes... SQL Server. I'm not really all that familiar with it, but from what I've seen here you would have to deliver the file to the Windows server in order to bulk load it. I don't have that issue with other databases, so I tend to forget that little qualifier. So someone else more SQL Server savvy would need to help out with that specific aspect.

2) Hmmm, perhaps again something specific to SQL Server? Using a single OCI stage with two links writing to the same table is very common and does not create any sort of deadlock situation in Oracle. You may run afoul of that if you use two separate stages, but perhaps that distinction is lost on SQL Server... I honestly don't know. :?

The only way to achieve the 'delay' you mention is to introduce a process break of some kind in one link - say, land the second link, then pick it back up and send it to the database. That would allow one to complete before the other started. Otherwise, no, it is not possible. I've also done it for small volumes by sending the second link through a Sort stage that does a completely unnecessary sort - well, unnecessary except for the delay it introduces in that link. That was more for RI issues than deadlocks, though, where I wanted to be certain the parents arrived before the children when writing to related tables in one stage.
-craig

"You can never have too many knives" -- Logan Nine Fingers