Hi all,
In data warehouse practice we usually generate surrogate keys.
As most of us know, a surrogate key can be generated in two ways: either in DataStage (e.g. the KeyMgt routines, or the Surrogate Key Generator stage in PX) or
at the database level (e.g. a sequence in Oracle, or an identity column in DB2 UDB).
I would like to know which is the best practice to follow, and why.
What are the advantages and disadvantages of each method?
I would appreciate a detailed response.
Regards,
VS

Surrogate key generation - best practices
My main consideration is whether any other person/application is likely to generate keys for these tables. In that case, I would definitely leave creation of the surrogate keys to the database (sequence or identity column). That way, DataStage is not generating key values that someone else may already have used. The others, too, need to be disciplined enough always to use the same mechanism.
If it can be guaranteed that no other process will create keys in the table I prefer to keep them in DataStage, because the reverse-lookup hashed files can be built/maintained as the surrogate keys are being generated. These are subsequently used to map natural keys to surrogate keys when loading the fact tables.
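A minimal sketch of the reverse-lookup idea described above, using a plain Python dict in place of a hashed file. The names (`load_dimension`, `dim_lookup`, `next_key`) are illustrative, not DataStage objects: the lookup is populated while dimension keys are generated, then reused to resolve natural keys during the fact load.

```python
# Sketch: a dict plays the role of the reverse-lookup hashed file
# (natural key -> surrogate key), built during the dimension load.

def load_dimension(rows, dim_lookup, next_key):
    """Assign a surrogate key to each new natural key; reuse existing ones."""
    out = []
    for natural_key, attrs in rows:
        if natural_key not in dim_lookup:
            dim_lookup[natural_key] = next_key
            next_key += 1
        out.append((dim_lookup[natural_key], natural_key, attrs))
    return out, next_key

dim_lookup = {}
dim_rows, next_key = load_dimension(
    [("CUST-01", "Alice"), ("CUST-02", "Bob")], dim_lookup, 1)

# Fact load: map natural keys to surrogate keys via the same lookup,
# with no round trip to the database.
facts = [("CUST-02", 250.0), ("CUST-01", 99.5)]
fact_rows = [(dim_lookup[nk], amount) for nk, amount in facts]
print(fact_rows)  # [(2, 250.0), (1, 99.5)]
```

Because the same process generated the keys, the fact load never has to query the target table to find them.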
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Agree with all Ray says there. We are currently generating them in the target RDBMS, but I prefer generating them in DataStage. As well as the reverse hashed-file lookups, it also helps with data augmentation, where a parent record or lookup entry is missing and the job needs to add it on the fly. It also helps in jobs that split a single source table into parent and child target tables: the parent stream can generate the key and pass it to the child rows without needing to look it up.
I have found that the Lookup stage that accepts condition statements has helped us work with RDBMS-generated surrogate keys, because it lets us look up keys only where the key field is blank (due to augmentation or missing parents).
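The parent/child split above can be sketched as follows. This is illustrative Python, not DataStage: `itertools.count` stands in for the key source, and all names (`split_source_row`, `order_key`) are made up for the example. The key is generated once in the parent stream and passed straight to the child rows, so no reverse lookup is needed.

```python
# Sketch: one source row becomes a parent row plus child rows,
# with the surrogate key generated once and shared.
from itertools import count

key_source = count(start=1)  # stand-in for the surrogate key generator

def split_source_row(row):
    order_key = next(key_source)          # generated in the parent stream
    parent = (order_key, row["order_no"])
    children = [(order_key, line) for line in row["lines"]]
    return parent, children

parent, children = split_source_row(
    {"order_no": "SO-1001", "lines": ["widget", "gadget"]})
print(parent)    # (1, 'SO-1001')
print(children)  # [(1, 'widget'), (1, 'gadget')]
```

With a database-generated key, the parent row would have to be inserted first and the key read back before the child rows could be written.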
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn: Vincent McBurney
We use a different method to generate keys in our dimensional model. There is a physical table (usually in the staging database) with one record per dimension holding the maximum surrogate id, and this value is incremented via a system variable in DataStage. Rerun procedures can be handled easily if this table is designed appropriately. With this method there are no gaps in the surrogate ids of the physical table.
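A small sketch of this key-table method, using SQLite to stand in for the staging database. The table and column names (`key_registry`, `max_id`) are illustrative. A load reads the stored maximum, assigns a contiguous block of ids, and writes the new maximum back, which is why the sequence stays gap-free.

```python
# Sketch: one row per dimension holds the current maximum surrogate id.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE key_registry (dimension TEXT PRIMARY KEY, max_id INTEGER)")
conn.execute("INSERT INTO key_registry VALUES ('customer_dim', 100)")

def allocate_keys(conn, dimension, n):
    """Reserve n contiguous surrogate ids for one dimension load."""
    start = conn.execute(
        "SELECT max_id FROM key_registry WHERE dimension = ?",
        (dimension,)).fetchone()[0]
    conn.execute(
        "UPDATE key_registry SET max_id = ? WHERE dimension = ?",
        (start + n, dimension))
    conn.commit()
    return range(start + 1, start + n + 1)

keys = list(allocate_keys(conn, "customer_dim", 3))
print(keys)  # [101, 102, 103]
```

On a rerun, resetting `max_id` to its pre-load value makes the same block of ids available again.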
Regards
Saravanan
But in this method, when do you update the physical table in the staging database that holds the surrogate ids? Do you update it after loading every dimension table? Are you assuming that no third application is inserting records into the warehouse?
I find this method very interesting. Could you elaborate a bit more, please?
Cheers.
We have used this approach in a few of our projects. There is one table that contains the table name and the maximum value so far of the key column; after every load we add the count of loaded rows to that metadata table, and when we want to add new rows to the target table we fetch the latest maximum value from it.
However, this approach has some issues, such as keeping the two tables in sync, and gaps in the keys in case multiple applications update the target tables.
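One possible way to reduce the out-of-sync risk, sketched in Python with SQLite: commit the target rows and the updated maximum in the same transaction, so the stored max can never drift from the loaded data. The names (`key_metadata`, `target_dim`, `load_rows`) are illustrative, and concurrent writers would still need locking or serialization on the metadata row.

```python
# Sketch: update the metadata table and the target table atomically.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE key_metadata (table_name TEXT PRIMARY KEY, max_key INTEGER);
    CREATE TABLE target_dim (sk INTEGER PRIMARY KEY, natural_key TEXT);
    INSERT INTO key_metadata VALUES ('target_dim', 0);
""")

def load_rows(conn, natural_keys):
    with conn:  # one transaction: target rows and new max commit together
        start = conn.execute(
            "SELECT max_key FROM key_metadata WHERE table_name = 'target_dim'"
        ).fetchone()[0]
        rows = [(start + i + 1, nk) for i, nk in enumerate(natural_keys)]
        conn.executemany("INSERT INTO target_dim VALUES (?, ?)", rows)
        conn.execute(
            "UPDATE key_metadata SET max_key = ? "
            "WHERE table_name = 'target_dim'",
            (start + len(rows),))
    return rows

rows = load_rows(conn, ["A", "B"])
print(rows)  # [(1, 'A'), (2, 'B')]
```

If the load fails partway, both the rows and the max update roll back together, which keeps the two tables consistent; it does not help with gaps caused by other applications writing to the target directly.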
HTH
Regards,
Affan
"Questioning is Half Knowledge"