I have a job that:
* Extracts data from Sybase
* Performs two lookups: one against the target table for target-SK lookup or creation (KeyMgtGetNextValueConcurrent), one against a calendar hashed file for date-SK lookup.
* Writes the result to a target table (DRS stage, SQL Server table).
* Sends rejects to a sequential file (rejects should be very rare if at all).
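For context, here's the job's logic in rough Python pseudo-form (pyodbc assumed; every DSN, table, and column name below is a placeholder, not the real job):

    # Rough Python analogue of the job (pyodbc assumed; every DSN, table,
    # and column name is a placeholder, not the real job).
    import pyodbc

    src = pyodbc.connect("DSN=sybase_src")      # Sybase source (placeholder)
    tgt = pyodbc.connect("DSN=sqlserver_tgt")   # SQL Server target (placeholder)

    # The calendar lookup is small and static, so it is preloaded once;
    # this is what the hashed-file lookup buys: a memory probe per row.
    cal = {d: sk for d, sk in tgt.cursor().execute(
        "SELECT cal_date, date_sk FROM dim_calendar")}

    out = tgt.cursor()
    with open("rejects.txt", "w") as rejects:
        for natural_key, cal_date, amount in src.cursor().execute(
                "SELECT natural_key, cal_date, amount FROM src_fact"):
            date_sk = cal.get(cal_date)
            if date_sk is None:                 # should be very rare
                rejects.write(f"{natural_key}|{cal_date}|{amount}\n")
                continue
            # Target-SK lookup-or-create (KeyMgtGetNextValueConcurrent's
            # role) happens here per row; omitted for brevity.
            out.execute("INSERT INTO tgt_fact (natural_key, date_sk, amount) "
                        "VALUES (?, ?, ?)", natural_key, date_sk, amount)
    tgt.commit()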
The job is performing poorly, so I'm looking for areas to tweak. Here's what I'm looking at:
* Source Stage: Sybase Stage, Packet Size: change from blank to 1024 (??? - experimenting with the best value; see the timing harness sketched after this list).
* Target Table Lookup Stage (for SK lookup): DRS Stage, Array Size: change from 1 to 10000 (??? - experimenting with the best value; see the fetch sketch after this list). Note that I don't want to create a hashed file for this lookup unless the lookup against the large target table turns out to be the key bottleneck.
* Target Stage: DRS Stage:
--> Array Size: I've left this at 1, due to this help text: "During input link processing, rows of data may be rejected by the database for various reasons, such as unique constraint violations or data type mismatches. The DRS stage writes the offending row to the log for the DataStage job. For the database message detail, you must use the error messages returned by the database. DataStage provides additional reject row handling. To use this capability: Set Array Size to 1. Use a Transformer stage to redirect the rejected rows. You can then design your job by choosing an appropriate target for the rejected rows, such as a Sequential stage. Reuse this target as an input source once you resolve the issues with the offending row values."
--> Transaction Size: I've left this at 0, due to this help text: "Specifies the number of rows written before a commit is executed in the database. A value of 0 causes all rows in the job to be written as a single transaction." (See the commit-interval sketch after this list.)
--> Update Action: This is where I need clarification of the help text. It says:
* Update existing rows or insert new rows. Updates the existing data rows before adding new rows. It is faster to update first when you have a large number of records.
* Insert new rows or update existing rows. Inserts the new rows before updating existing rows. It is faster to insert first if you have only a few records.
But which "large (few) number of records" does it mean - the source, or the target? I assume the target, since it's the update action, but I want to confirm. (I've sketched the two orderings after the scenarios below.)
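On the "experimenting with the best value" items above, I'm timing runs with a throwaway harness along these lines (dsjob from the DataStage CLI; the project and job names are placeholders, and this assumes PacketSize is exposed as a job parameter):

    # Crude timing harness: run the job once per candidate setting and
    # time it. Project/job names are placeholders; PacketSize is assumed
    # to be exposed as a job parameter.
    import subprocess, time

    for size in (512, 1024, 2048, 4096, 8192):
        start = time.monotonic()
        # -jobstatus makes dsjob wait until the job finishes
        subprocess.run(["dsjob", "-run", "-param", f"PacketSize={size}",
                        "-jobstatus", "MyProject", "MyJob"])
        print(f"PacketSize={size}: {time.monotonic() - start:.1f}s")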
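On the lookup Array Size: here's my understanding of what going from 1 to 10000 changes, in generic Python DB-API terms (a sketch of the idea, not the DRS stage's actual internals):

    # What Array Size changes: how many rows travel per fetch call.
    def fetch_one_by_one(conn, sql):
        cur = conn.cursor()
        cur.execute(sql)
        while True:
            row = cur.fetchone()    # up to one round trip per row,
            if row is None:         # depending on driver buffering
                break
            yield row

    def fetch_in_arrays(conn, sql, array_size=10000):
        cur = conn.cursor()
        cur.arraysize = array_size  # rows pulled per round trip
        cur.execute(sql)
        while True:
            rows = cur.fetchmany()  # honors cur.arraysize
            if not rows:
                break
            yield from rows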
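And on Transaction Size, the commit interval as I read the help text, sketched the same way (table and column names are placeholders):

    # Transaction Size as a commit interval: 0 = the whole run is one
    # transaction; N = commit every N rows. Generic DB-API sketch.
    def load(conn, rows, txn_size=0):
        cur = conn.cursor()
        pending = 0
        for row in rows:
            cur.execute("INSERT INTO tgt_fact (natural_key, date_sk, amount) "
                        "VALUES (?, ?, ?)", row)
            pending += 1
            if txn_size and pending >= txn_size:
                conn.commit()       # commit every txn_size rows
                pending = 0
        conn.commit()               # final batch, or everything if txn_size == 0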
Two scenarios:
1) The initial load of the full source extract into the empty target table is taking a LONG time. So, in this scenario, Source = "large", Target = "few (none)".
2) The "normal" load will be a small delta file into the large target. So, in this scenario, Source = "small", Target = "large".
I'd like both scenarios to have optimum performance, but priority goes to scenario #2.
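To make sure I'm reading the Update Action help right, here's how I understand the two orderings (generic sketch; the SQL and the duplicate-key handling are placeholders for whatever the DRS stage actually executes):

    # The two Update Action orderings as I understand them.
    UPDATE_SQL = ("UPDATE tgt_fact SET amount = ? "
                  "WHERE natural_key = ? AND date_sk = ?")
    INSERT_SQL = ("INSERT INTO tgt_fact (amount, natural_key, date_sk) "
                  "VALUES (?, ?, ?)")

    def update_then_insert(cur, params):
        # One statement per row when the row already exists, two when it
        # doesn't - so cheap when most incoming rows match the target
        # (my scenario #2: small delta into a large table).
        cur.execute(UPDATE_SQL, params)
        if cur.rowcount == 0:               # no match: row is new
            cur.execute(INSERT_SQL, params)

    def insert_then_update(cur, params):
        # One statement per row when the row is new - so cheap when most
        # incoming rows are new (my scenario #1: initial load into an
        # empty table, where every insert succeeds).
        try:
            cur.execute(INSERT_SQL, params)
        except Exception:                   # duplicate key, in driver terms
            cur.execute(UPDATE_SQL, params)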
Thanks,
Scott