
Posted: Fri Apr 28, 2006 11:45 am
by kcbland
RodBarnes wrote: Correct?
Yes, for a minimum of effort. I'm in the camp of spooling out the entire row (all columns) and staging it into a sequential file first. This gives you a "before" image of any row that may potentially be updated.

Then, another job loads that sequential file into two hashed files: one is a straight copy of the row keyed by the surrogate primary key; the other is the natural-key-to-surrogate-primary-key cross reference, where the natural key is the hashed file's primary key. This allows a high-performance surrogate key lookup during foreign surrogate key substitution in child table processing. Also, by staging the entire row in its unadulterated form in a staging hashed file, you can update the existing contents of the hashed file with newer data during transformation processing.
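
Not DataStage job code, but here's a rough Python sketch of those two structures, with plain dicts standing in for the hashed files. The column names (CUST_NK, CUST_SK) and the file name are invented just for illustration:

    import csv

    staging = {}   # surrogate key -> whole row image (the "straight copy" hashed file)
    xref = {}      # natural key   -> surrogate key   (the cross-reference hashed file)

    # Read the staged sequential file and load both structures.
    with open("customers_staged.txt") as f:
        for row in csv.DictReader(f, delimiter="|"):
            sk = int(row["CUST_SK"])
            staging[sk] = row          # entire, unadulterated row keyed by surrogate PK
            xref[row["CUST_NK"]] = sk  # natural key -> surrogate key for fast lookups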

During transformation processing, you can look up against the natural-to-surrogate key cross-reference hashed file for insert-or-update determination and stage into the staging hashed files appropriately. During update processing, you then reference the staging copy of the row, compare the newer row against the existing row, and decide whether an update is worthwhile. In the case of a slowly changing dimension, you could opt to generate the newer variant and potentially "retire" the older variant.
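
Roughly, the insert/no-change/new-variant decision looks like this. Again, just a Python sketch continuing from the dicts above, with made-up column names and a toy surrogate key generator, not the actual job logic:

    from datetime import date

    staging = {}   # surrogate key -> row image (populated as in the previous sketch)
    xref = {}      # natural key  -> surrogate key
    next_sk = max(staging, default=0) + 1   # toy surrogate key generator

    def process(incoming):
        """Decide insert / no-change / new SCD variant for one incoming row."""
        global next_sk
        nk = incoming["CUST_NK"]
        sk = xref.get(nk)
        if sk is None:                            # natural key never seen before: insert
            incoming["CUST_SK"] = next_sk
            xref[nk] = next_sk
            staging[next_sk] = incoming
            next_sk += 1
            return "insert"
        existing = staging[sk]
        tracked = ("ADDRESS", "SEGMENT")          # columns whose change forces a new variant
        if all(incoming.get(c) == existing.get(c) for c in tracked):
            return "no change"                    # update not worthwhile
        # Slowly changing dimension: retire the old variant, create the newer one.
        existing["EXPIRY_DATE"] = date.today().isoformat()
        incoming["CUST_SK"] = next_sk
        incoming["EFFECTIVE_DATE"] = date.today().isoformat()
        xref[nk] = next_sk                        # cross-reference now points at the current variant
        staging[next_sk] = incoming
        next_sk += 1
        return "new variant"

    print(process({"CUST_NK": "A100", "ADDRESS": "12 Elm St", "SEGMENT": "Retail"}))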

Using the hashed files as a staging mechanism lets you handle the most difficult transformation exercises in data warehousing: SCDs, surrogate key assignment, foreign surrogate key substitution, repeating natural keys within a dataset, and milestoned/recoverable processing.
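
For the foreign surrogate key substitution piece in particular, the child-table pass is just a lookup against the same cross-reference. One more hypothetical sketch:

    def substitute_fk(order_row, cust_xref, rejects):
        """Swap an order's natural customer key for the parent's surrogate key."""
        sk = cust_xref.get(order_row["CUST_NK"])
        if sk is None:
            rejects.append(order_row)   # orphan child row: parent has no surrogate key yet
            return None
        order_row["CUST_SK"] = sk
        del order_row["CUST_NK"]
        return order_row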

There are a few documents on my website that discuss this in detail. Take a peek.