Surrogate key generation - best practices

Post questions here relevant to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
satish_valavala
Participant
Posts: 123
Joined: Wed May 18, 2005 7:41 am
Location: USA

Surrogate key generation - best practices

Post by satish_valavala »

Hi All,
In data warehouse practice we usually generate surrogate keys.
As most of us know, a surrogate key can be generated in two ways: either through DataStage (e.g. KeyMgt routines or the Surrogate Key Generator stage in PX) or at the database level (e.g. a sequence in Oracle, an identity column in DB2 UDB, etc.).
I would like to know which is the best practice to follow, and why?
What are the advantages and disadvantages of each method?
I would appreciate it if anyone could respond in detail.

Regards
VS
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

My main consideration is whether any other person/application is likely to generate keys for these tables. In that case, I would definitely leave creation of the surrogate keys to the database (sequence or identity column). That way, DataStage is not generating key values that someone else may already have used. The others, too, need to be disciplined enough always to use the same mechanism.

If it can be guaranteed that no other process will create keys in the table I prefer to keep them in DataStage, because the reverse-lookup hashed files can be built/maintained as the surrogate keys are being generated. These are subsequently used to map natural keys to surrogate keys when loading the fact tables.
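The idea above can be sketched in Python (this is purely illustrative, not DataStage code; the class and names are made up): the ETL layer hands out surrogate keys and, as it does so, maintains the natural-key-to-surrogate-key mapping that the hashed file provides, ready for the fact-table loads.

```python
# Minimal sketch: generate surrogate keys in the ETL layer while
# maintaining a reverse-lookup structure (the role the hashed file
# plays) that maps natural keys to surrogate keys.

class SurrogateKeySource:
    """Hands out sequential surrogate keys and records the
    natural-key -> surrogate-key mapping as it does so."""

    def __init__(self, start=1):
        self.next_key = start
        self.lookup = {}          # natural key -> surrogate key

    def key_for(self, natural_key):
        # Reuse the existing surrogate if this natural key was seen before.
        if natural_key not in self.lookup:
            self.lookup[natural_key] = self.next_key
            self.next_key += 1
        return self.lookup[natural_key]

# Dimension load: assign surrogates and build the reverse lookup.
source = SurrogateKeySource()
dim_rows = [source.key_for(nk) for nk in ["CUST-A", "CUST-B", "CUST-A"]]

# Fact load: map natural keys to surrogates via the same lookup.
fact_sk = source.lookup["CUST-B"]
```

The point of the sketch is that the mapping is built for free while the dimension is keyed, so the fact load never has to query the target database for it.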
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

I agree with all Ray says there. We are currently generating them in the target RDBMS, but I prefer generating them in DataStage. As well as enabling the reverse hashed-file lookups, it also helps with data augmentation: when a parent record or lookup entry is missing, the job can add it on the fly. It also helps in jobs that split a single source table into parent and child target tables; the parent stream can generate the key and pass it to the child rows without needing to look it up.

We have found that the Lookup stage's support for conditions has helped us use RDBMS-generated surrogate keys, since it lets us look up keys only where the key field is blank (due to augmentation or missing parents).
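The conditional-lookup pattern just described can be sketched in Python (names here are illustrative, not DataStage APIs): rows whose surrogate key is already populated pass straight through, and only rows with a blank key are sent to the lookup holding the RDBMS-assigned keys.

```python
# Sketch of a conditional lookup: only resolve keys for rows whose
# surrogate key ('sk') is blank; already-keyed rows pass through untouched.

def resolve_keys(rows, rdbms_lookup):
    """rows: list of dicts with 'natural_key' and a possibly-blank 'sk'.
    rdbms_lookup: mapping of natural key -> RDBMS-generated surrogate."""
    out = []
    for row in rows:
        if not row.get("sk"):                      # condition: key is blank
            row = {**row, "sk": rdbms_lookup[row["natural_key"]]}
        out.append(row)
    return out

rows = [
    {"natural_key": "P1", "sk": 101},   # already keyed upstream
    {"natural_key": "P2", "sk": None},  # missing parent, needs the lookup
]
resolved = resolve_keys(rows, {"P2": 102})
```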
satish_valavala
Participant
Posts: 123
Joined: Wed May 18, 2005 7:41 am
Location: USA

Post by satish_valavala »

Thank you Ray and Vmc..
Regards
VS
elavenil
Premium Member
Posts: 467
Joined: Thu Jan 31, 2002 10:20 pm
Location: Singapore

Post by elavenil »

We are using a different method to generate keys in our dimensional modelling. There is a physical table (usually in the staging DB) with one record per dimension holding the maximum surrogate id, and this value is incremented via a system variable in DataStage. Rerun procedures can be handled easily if this table is designed appropriately, and there will be no gaps in the surrogate ids if this method is used to generate them.
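The key-table approach described above can be sketched roughly as follows (SQLite is used purely for illustration, and the table and column names are made up for this example): one row per dimension holds the current maximum surrogate id, and each load reads it and reserves a block of ids.

```python
# Sketch of a key-source table: one row per dimension holding the maximum
# surrogate id; each load reserves a contiguous, gap-free block of ids.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE key_source (dimension TEXT PRIMARY KEY, max_id INTEGER)")
con.execute("INSERT INTO key_source VALUES ('DIM_CUSTOMER', 0)")

def next_keys(con, dimension, how_many):
    """Reserve a contiguous block of surrogate ids for one load."""
    (start,) = con.execute(
        "SELECT max_id FROM key_source WHERE dimension = ?", (dimension,)
    ).fetchone()
    con.execute(
        "UPDATE key_source SET max_id = max_id + ? WHERE dimension = ?",
        (how_many, dimension),
    )
    return list(range(start + 1, start + how_many + 1))

batch1 = next_keys(con, "DIM_CUSTOMER", 3)   # first load takes ids 1..3
batch2 = next_keys(con, "DIM_CUSTOMER", 2)   # next load continues at 4
```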

Regards
Saravanan
SriKara
Premium Member
Premium Member
Posts: 30
Joined: Wed Jun 01, 2005 8:40 am
Location: UK

Post by SriKara »

But in this method, when do you update the physical table in the staging DB which holds the surrogate ids? Do you update it after loading each dimension table? Are you assuming that no third application is inserting records into the DW?

I find this method very interesting :) Could you elaborate a bit more, please?

Cheers.
elavenil
Premium Member
Posts: 467
Joined: Thu Jan 31, 2002 10:20 pm
Location: Singapore

Post by elavenil »

Hi Kara,

It would be simple to explain in detail how we maintain this table and generate the surrogate ids, but sometimes we have to be guarded about revealing things.

If you think it through and analyse it, you will work out a way to handle this.

Regards
Saravanan
maffan76
Participant
Posts: 110
Joined: Tue Aug 23, 2005 5:27 am

Post by maffan76 »

We have used this approach in a few of our projects. There is one table which contains each table name and the maximum value so far of its key column; after every load we add the count of rows just loaded to that metadata table, and when we want to add new rows to a target table we get the latest maximum value from this metadata table.

However, this approach has some issues, such as keeping both tables in sync, and gaps in the keys in case multiple applications update the target tables, etc.
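The in-sync issue mentioned above can be sketched in Python (SQLite for illustration only; all names are made up): if the metadata table is updated outside the load's transaction, a failed load leaves the two tables out of step, so one way to avoid that is to commit the loaded rows and the metadata update together.

```python
# Sketch: commit the loaded rows and the metadata-table update in one
# transaction, so the max-key record can never drift from the target table.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE target (sk INTEGER PRIMARY KEY, nk TEXT)")
con.execute("CREATE TABLE key_meta (tbl TEXT PRIMARY KEY, max_sk INTEGER)")
con.execute("INSERT INTO key_meta VALUES ('target', 0)")

def load(con, naturals):
    (start,) = con.execute(
        "SELECT max_sk FROM key_meta WHERE tbl = 'target'").fetchone()
    with con:  # one transaction: rows and metadata commit or fail together
        for i, nk in enumerate(naturals, start + 1):
            con.execute("INSERT INTO target VALUES (?, ?)", (i, nk))
        con.execute(
            "UPDATE key_meta SET max_sk = max_sk + ? WHERE tbl = 'target'",
            (len(naturals),))

load(con, ["A", "B"])
(max_sk,) = con.execute("SELECT max_sk FROM key_meta").fetchone()
```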

HTH
Regards,
Affan
"Questioning is Half Knowledge"
Post Reply