
Surrogate key generation - best practices

Posted: Thu May 26, 2005 1:44 pm
by satish_valavala
Hi All,
In data warehouse practice we usually generate surrogate keys.
As most of us know, surrogate keys can be generated in two ways: either through DataStage (e.g. the KeyMgt routines, or the Surrogate Key Generator stage in PX), or
at the database level (e.g. a sequence generator in Oracle, an identity column in DB2 UDB, etc.).
I would like to know which is best practice to follow and why?
What are the advantages and disadvantages with each method?
I would appreciate it if anyone could respond in detail.

Regards

Posted: Thu May 26, 2005 4:59 pm
by ray.wurlod
My main consideration is whether any other person/application is likely to generate keys for these tables. In that case, I would definitely leave creation of the surrogate keys to the database (sequence or identity column). That way, DataStage is not generating key values that someone else may already have used. The others, too, need to be disciplined enough always to use the same mechanism.

If it can be guaranteed that no other process will create keys in the table I prefer to keep them in DataStage, because the reverse-lookup hashed files can be built/maintained as the surrogate keys are being generated. These are subsequently used to map natural keys to surrogate keys when loading the fact tables.
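To make the idea concrete, here is a minimal sketch in plain Python (not DataStage BASIC) of what Ray describes: the ETL layer hands out surrogate keys and, in the same pass, maintains the reverse lookup from natural key to surrogate key, playing the role of the hashed file. All names here are illustrative, not actual DataStage routines.

```python
def assign_surrogate_keys(rows, lookup, next_key):
    """rows: iterable of dicts with a 'natural_key' field.
    lookup: dict mapping natural key -> surrogate key (the "hashed file").
    next_key: first unused surrogate key value."""
    out = []
    for row in rows:
        nk = row["natural_key"]
        if nk not in lookup:          # new dimension member
            lookup[nk] = next_key     # maintain the reverse lookup as we go
            next_key += 1
        out.append({**row, "sk": lookup[nk]})
    return out, lookup, next_key

lookup = {}
rows = [{"natural_key": "CUST-01"}, {"natural_key": "CUST-02"},
        {"natural_key": "CUST-01"}]
loaded, lookup, next_key = assign_surrogate_keys(rows, lookup, 1)
# The same 'lookup' mapping can then resolve natural keys to surrogate
# keys when the fact tables are loaded.
```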

Posted: Thu May 26, 2005 6:17 pm
by vmcburney
Agree with all Ray says there. We are currently generating them in the RDBMS target, but I prefer generating them in DataStage. As well as the reverse hashed file lookups, it also helps for data augmentation, where a parent record or lookup is missing and the job needs to add it on the fly. It also helps for jobs that split a single source table into target parent/child tables: the parent stream can generate the key and pass it to the child rows without needing to look it up.

We have found that the lookup stage that accepts condition statements has helped us use RDBMS-generated surrogate keys, as it lets us look up keys only where the key field is blank (due to augmentation or missing parents).
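A rough sketch of that conditional-lookup flow, in illustrative Python (names and flow are assumptions, not a real DataStage stage): only rows whose key field is still blank go to the key lookup, and missing parents are added on the fly (augmentation).

```python
def resolve_keys(rows, parent_keys, next_key):
    """rows: dicts with 'natural_key' and possibly-blank 'sk'.
    parent_keys: dict natural key -> surrogate key from the RDBMS."""
    for row in rows:
        if row.get("sk") is None:                 # condition: key is blank
            sk = parent_keys.get(row["natural_key"])
            if sk is None:                        # missing parent: augment
                sk = next_key
                parent_keys[row["natural_key"]] = sk
                next_key += 1
            row["sk"] = sk
    return rows, next_key

rows = [{"natural_key": "P1", "sk": 7},     # already keyed: skipped
        {"natural_key": "P2", "sk": None},  # blank: resolved by lookup
        {"natural_key": "P3", "sk": None}]  # blank + missing: augmented
rows, next_key = resolve_keys(rows, {"P2": 2}, 10)
```

Rows that already carry a key bypass the lookup entirely, which is exactly what the condition on the lookup stage buys you.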

Posted: Thu May 26, 2005 7:10 pm
by satish_valavala
Thank you, Ray and Vmc.
Regards

Posted: Sun Jun 05, 2005 6:54 am
by elavenil
We are using a different method to generate keys in the dimensional model. There is a physical table (usually in the staging DB) with one record per dimension holding the maximum surrogate id, and this value is incremented via a system variable in DataStage. The rerun procedure can be handled easily if this table is designed appropriately. There will not be any gaps in the surrogate ids of the physical table if this method is used to generate surrogate keys.
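The control-table approach above can be sketched roughly as follows (plain Python with hypothetical table and column names): one row per dimension holds the highest surrogate id allocated so far; a load reads it, hands out contiguous ids, and writes the new maximum back, so no ids are skipped.

```python
# Control table stand-in: dimension name -> max surrogate id allocated.
key_table = {"DIM_CUSTOMER": 100, "DIM_PRODUCT": 55}

def allocate(dim, n, key_table):
    """Reserve n contiguous surrogate ids for dimension dim and
    advance the stored maximum."""
    start = key_table[dim] + 1
    key_table[dim] += n   # in practice, persist this back to the table
    return list(range(start, start + n))

ids = allocate("DIM_CUSTOMER", 3, key_table)
# ids == [101, 102, 103]; key_table["DIM_CUSTOMER"] == 103
```

Because ids are handed out contiguously from the stored maximum, a successful load leaves no gaps; the rerun question is then when and how the maximum is written back, which is what the follow-up post asks about.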

Regards
Saravanan

Posted: Sun Jun 05, 2005 7:08 am
by SriKara
But in this method, when do you update the physical table in the staging DB that holds the surrogate ids? Do you update it after loading each dimension table? Are you assuming that no third application is inserting records into the DW?

I find this method very interesting :) Can you elaborate a bit more, please?

Cheers.

Posted: Sun Jun 05, 2005 7:12 am
by elavenil
Hi Kara,

It would be simple to explain in detail how we maintain this table and generate the surrogate ids, but sometimes we have to be guarded about revealing things.

If you think it through and analyse it, you will find a way to handle this.

Regards
Saravanan

Posted: Sun Oct 30, 2005 5:40 am
by maffan76
We have used this approach in a few of our projects. There is one table that contains the table name and the maximum value so far of the key column; after every load we add the count of loaded rows to that metadata table, and when we want to add new rows to the target table we get the latest maximum value from this metadata table.

However, this approach has some issues, like keeping both tables in sync, and gaps in the keys in case multiple apps update the target tables, etc.
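A short sketch of this variant (hypothetical names, illustrative Python): the maximum is advanced only after the load finishes, by the count of rows loaded. If a second application writes to the target without touching the metadata table, the two drift apart, which is the sync issue mentioned above.

```python
# Metadata table stand-in: target table name -> max key value so far.
metadata = {"FACT_SALES": 5000}

def load_batch(table, rows, metadata):
    """Key a batch of rows from the stored maximum, then advance it
    by the count of rows loaded."""
    start = metadata[table] + 1
    keyed = [(start + i, r) for i, r in enumerate(rows)]
    # ... load keyed rows into the target table here ...
    metadata[table] += len(rows)   # update only after a successful load
    return keyed

batch = load_batch("FACT_SALES", ["a", "b"], metadata)
# batch == [(5001, "a"), (5002, "b")]; metadata["FACT_SALES"] == 5002
```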

HTH