Surrogate key generation - best practices

Post questions here relevant to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
satish_valavala
Participant
Posts: 123
Joined: Wed May 18, 2005 7:41 am
Location: USA

Surrogate key generation - best practices

Post by satish_valavala »

Hi All,
In data warehouse practice we usually generate surrogate keys.
As most of us know, a surrogate key can be generated in two ways: either through DataStage (e.g. KeyMgt routines or the Surrogate Key Generator stage in PX) or at the database level (e.g. a sequence in Oracle, an identity column in DB2 UDB, etc.).
I would like to know which is the best practice to follow, and why?
What are the advantages and disadvantages of each method?
I would appreciate it if anyone could respond in detail.

Regards
VS
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

My main consideration is whether any other person/application is likely to generate keys for these tables. In that case, I would definitely leave creation of the surrogate keys to the database (sequence or identity column). That way, DataStage is not generating key values that someone else may already have used. The others, too, need to be disciplined enough always to use the same mechanism.

If it can be guaranteed that no other process will create keys in the table I prefer to keep them in DataStage, because the reverse-lookup hashed files can be built/maintained as the surrogate keys are being generated. These are subsequently used to map natural keys to surrogate keys when loading the fact tables.
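The idea above can be sketched in Python (this is purely illustrative, not DataStage code; the class and names are made up): the ETL layer hands out surrogate keys and, as it does so, maintains the natural-key-to-surrogate-key mapping that the hashed file provides, ready for the fact-table loads.

```python
# Minimal sketch: generate surrogate keys in the ETL layer while
# maintaining a reverse-lookup structure (the role the hashed file
# plays) that maps natural keys to surrogate keys.

class SurrogateKeySource:
    """Hands out sequential surrogate keys and records the
    natural-key -> surrogate-key mapping as it does so."""

    def __init__(self, start=1):
        self.next_key = start
        self.lookup = {}          # natural key -> surrogate key

    def key_for(self, natural_key):
        # Reuse the existing surrogate if this natural key was seen before.
        if natural_key not in self.lookup:
            self.lookup[natural_key] = self.next_key
            self.next_key += 1
        return self.lookup[natural_key]

# Dimension load: assign surrogates and build the reverse lookup.
source = SurrogateKeySource()
dim_rows = [source.key_for(nk) for nk in ["CUST-A", "CUST-B", "CUST-A"]]

# Fact load: map natural keys to surrogates via the same lookup.
fact_sk = source.lookup["CUST-B"]
```

The point of the sketch is that the mapping is built for free while the dimension is keyed, so the fact load never has to query the target database for it.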
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

I agree with all Ray says there. We are currently generating them in the target RDBMS, but I prefer generating them in DataStage. As well as enabling the reverse hashed-file lookups, it also helps with data augmentation: when a parent record or lookup entry is missing, the job can add it on the fly. It also helps in jobs that split a single source table into parent and child target tables; the parent stream can generate the key and pass it to the child rows without needing to look it up.

We have found that the Lookup stage's support for conditions has helped us use RDBMS-generated surrogate keys, since it lets us look up keys only where the key field is blank (due to augmentation or missing parents).
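The conditional-lookup pattern just described can be sketched in Python (names here are illustrative, not DataStage APIs): rows whose surrogate key is already populated pass straight through, and only rows with a blank key are sent to the lookup holding the RDBMS-assigned keys.

```python
# Sketch of a conditional lookup: only resolve keys for rows whose
# surrogate key ('sk') is blank; already-keyed rows pass through untouched.

def resolve_keys(rows, rdbms_lookup):
    """rows: list of dicts with 'natural_key' and a possibly-blank 'sk'.
    rdbms_lookup: mapping of natural key -> RDBMS-generated surrogate."""
    out = []
    for row in rows:
        if not row.get("sk"):                      # condition: key is blank
            row = {**row, "sk": rdbms_lookup[row["natural_key"]]}
        out.append(row)
    return out

rows = [
    {"natural_key": "P1", "sk": 101},   # already keyed upstream
    {"natural_key": "P2", "sk": None},  # missing parent, needs the lookup
]
resolved = resolve_keys(rows, {"P2": 102})
```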
satish_valavala
Participant
Posts: 123
Joined: Wed May 18, 2005 7:41 am
Location: USA

Post by satish_valavala »

Thank you Ray and Vmc..
Regards
VS
elavenil
Premium Member
Posts: 467
Joined: Thu Jan 31, 2002 10:20 pm
Location: Singapore

Post by elavenil »

We are using a different method to generate keys in our dimensional modelling. There is a physical table (usually in the staging DB) with one record per dimension holding the maximum surrogate id, and this value is incremented via a system variable in DataStage. Rerun procedures can be handled easily if this table is designed appropriately, and there will be no gaps in the surrogate ids if this method is used to generate them.
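The key-table approach described above can be sketched roughly as follows (SQLite is used purely for illustration, and the table and column names are made up for this example): one row per dimension holds the current maximum surrogate id, and each load reads it and reserves a block of ids.

```python
# Sketch of a key-source table: one row per dimension holding the maximum
# surrogate id; each load reserves a contiguous, gap-free block of ids.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE key_source (dimension TEXT PRIMARY KEY, max_id INTEGER)")
con.execute("INSERT INTO key_source VALUES ('DIM_CUSTOMER', 0)")

def next_keys(con, dimension, how_many):
    """Reserve a contiguous block of surrogate ids for one load."""
    (start,) = con.execute(
        "SELECT max_id FROM key_source WHERE dimension = ?", (dimension,)
    ).fetchone()
    con.execute(
        "UPDATE key_source SET max_id = max_id + ? WHERE dimension = ?",
        (how_many, dimension),
    )
    return list(range(start + 1, start + how_many + 1))

batch1 = next_keys(con, "DIM_CUSTOMER", 3)   # first load takes ids 1..3
batch2 = next_keys(con, "DIM_CUSTOMER", 2)   # next load continues at 4
```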

Regards
Saravanan
SriKara
Premium Member
Premium Member
Posts: 30
Joined: Wed Jun 01, 2005 8:40 am
Location: UK

Post by SriKara »

But in this method, when do you update the physical table in the staging DB which holds the surrogate ids? Do you update it after loading each dimension table? Are you assuming that no third application is inserting records into the DW?

I find this method very interesting :) Could you elaborate a bit more, please?

Cheers.
elavenil
Premium Member
Posts: 467
Joined: Thu Jan 31, 2002 10:20 pm
Location: Singapore

Post by elavenil »

Hi Kara,

It would be simple to explain in detail how we maintain this table and generate the surrogate ids, but sometimes we have to be guarded about revealing things.

If you think it through and analyse it, you will work out a way to handle this.

Regards
Saravanan
maffan76
Participant
Posts: 110
Joined: Tue Aug 23, 2005 5:27 am

Post by maffan76 »

We have used this approach in a few of our projects. There is one table which contains each table name and the maximum value so far of its key column; after every load we add the count of rows just loaded to that metadata table, and when we want to add new rows to a target table we get the latest maximum value from this metadata table.

However, this approach has some issues, such as keeping both tables in sync, and gaps in the keys in case multiple applications update the target tables, etc.
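The in-sync issue mentioned above can be sketched in Python (SQLite for illustration only; all names are made up): if the metadata table is updated outside the load's transaction, a failed load leaves the two tables out of step, so one way to avoid that is to commit the loaded rows and the metadata update together.

```python
# Sketch: commit the loaded rows and the metadata-table update in one
# transaction, so the max-key record can never drift from the target table.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE target (sk INTEGER PRIMARY KEY, nk TEXT)")
con.execute("CREATE TABLE key_meta (tbl TEXT PRIMARY KEY, max_sk INTEGER)")
con.execute("INSERT INTO key_meta VALUES ('target', 0)")

def load(con, naturals):
    (start,) = con.execute(
        "SELECT max_sk FROM key_meta WHERE tbl = 'target'").fetchone()
    with con:  # one transaction: rows and metadata commit or fail together
        for i, nk in enumerate(naturals, start + 1):
            con.execute("INSERT INTO target VALUES (?, ?)", (i, nk))
        con.execute(
            "UPDATE key_meta SET max_sk = max_sk + ? WHERE tbl = 'target'",
            (len(naturals),))

load(con, ["A", "B"])
(max_sk,) = con.execute("SELECT max_sk FROM key_meta").fetchone()
```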

HTH
Regards,
Affan
"Questioning is Half Knowledge"
Post Reply