Loading a DWH using Surrogate Key Generator / SCD stage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

algfr
Participant
Posts: 106
Joined: Fri Sep 09, 2005 7:42 am

Loading a DWH using Surrogate Key Generator / SCD stage

Post by algfr »

Hello guys,

First I want to say that I have read the documentation and downloaded the redbook and the example from IBM. There are just a few things I cannot manage to clearly understand, and I would like to know the general strategy. I use surrogate keys (SK) everywhere.

I have a DWH to load with dimensions (90% are Type 1, 10% are Type 2) and 2 fact tables.

So I was thinking of doing it this way:

1) Create a job for each Type 1 dimension to load it into Oracle with the values retrieved from the operational system.

I have inserted a Surrogate Key Generator stage between my AS/400 operational source and the Oracle target, which creates the ID for each business key. I created the state file with a standalone Surrogate Key Generator job first.

=> It works great (I understand why there are gaps since I use PX, and I don't care), but I want to make sure that the next time I run the job I can keep the same IDs. Apparently the SK Generator stage doesn't match the surrogate key with the business key (I think that is actually normal, since I cannot link a BK to an SK in the Surrogate Key Generator).

So my question is: to keep the same IDs and add the new records, do I have to use an SCD stage?

2) In the IBM example the primary source is fact data. I would have assumed that the primary link contains the current values and that you run them against the old ones to detect changes. Why is fact data (sales amounts) involved? That seems weird to me.

Usually, I would expect to match facts with dimensions only once all the dimensions are loaded, and only then, at the very end, convert the BKs into SKs in the fact table.

So I would like to know: what is your classic DWH loading strategy when using surrogate keys as primary keys with SCDs?

Thanks a lot.
algfr
Participant
Posts: 106
Joined: Fri Sep 09, 2005 7:42 am

Post by algfr »

Does anyone here have an idea?
jcthornton
Premium Member
Posts: 79
Joined: Thu Mar 22, 2007 4:58 pm
Location: USA

Post by jcthornton »

There are plenty of very knowledgeable and experienced individuals who frequently participate on this forum. Most of the time, they have plenty of good ideas.

I don't speak for anybody else, so I will not hazard a guess as to why nobody answered the initial post.
algfr wrote:but I want to make sure that the next time I run the job I can keep the same IDs
You cannot do this if you are just using an SK Generator stage. All that stage will do is provide you with an unused value from the sequence it is set up with. Every time you run, the values produced will be different.

If you want to perform an update and keep the same SK, you will need to get the SK from the existing record.
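
Roughly, the logic is lookup-before-generate: check the dimension for the BK first, and only draw a new value when the BK is unknown. Here is a minimal Python sketch of that idea (purely illustrative and outside DataStage: the dictionary stands in for the dimension table and the counter for the SK Generator's state file):

Code: Select all

from itertools import count

# Pretend dimension table keyed by business key (BK -> SK).
dimension = {"CUST001": 1, "CUST002": 2}

# Stand-in for the SK Generator's state file: the next unused value.
next_sk = count(start=max(dimension.values()) + 1)

def resolve_sk(business_key):
    """Return the existing SK for a known BK, or allocate a new one."""
    if business_key in dimension:
        return dimension[business_key]   # update path: keep the same SK
    sk = next(next_sk)                   # insert path: draw a fresh SK
    dimension[business_key] = sk
    return sk

for bk in ["CUST001", "CUST003", "CUST001"]:
    print(bk, "->", resolve_sk(bk))      # CUST001 keeps SK 1 on every run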

If you want to perform a Type 2 update with the same SK, the SK will not be your primary key, which eliminates some of the benefits of using an SK in the first place.
algfr wrote:So my question is: to keep the same IDs and add the new records, do I have to use an SCD stage?
No - you do not have to use an SCD stage. The same task can be performed using a combination of other stages. At the same time, why reinvent the wheel?


For part 2 of your question, I cannot answer since I have not looked at the IBM example you are trying to figure out.
Jack Thornton
----------------
Spectacular achievement is always preceded by spectacular preparation - Robert H. Schuller
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

The Slowly Changing Dimension stage assumes that you maintain dimensions as you load a fact table. That is a rigid definition that is not always applicable: what about when you have conformed dimensions? How about referential integrity? The way we use the SCD stage is to load a single dimension in a single DataStage job, discarding the "fact" output via a Copy stage and delivering the dimension output to the dimension table.

Make sure you tweak the surrogate key retrieve setting to fetch 1000 or more values at a time rather than 1 at a time, or your job will run very slowly.
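
The reason the block size matters can be pictured like this (a hedged sketch, not DataStage internals: the hypothetical BlockKeySource treats each trip to the key source as the expensive part, so a bigger block means far fewer trips):

Code: Select all

class BlockKeySource:
    """Hands out surrogate keys, fetching them in blocks from the source."""
    def __init__(self, block_size=1000):
        self.block_size = block_size
        self.next_value = 1        # stand-in for the state file / sequence
        self.pool = iter(())       # keys remaining in the current block
        self.round_trips = 0

    def _fetch_block(self):
        self.round_trips += 1      # the expensive operation we want to amortise
        start = self.next_value
        self.next_value += self.block_size
        self.pool = iter(range(start, self.next_value))

    def next_key(self):
        try:
            return next(self.pool)
        except StopIteration:      # block exhausted: fetch the next one
            self._fetch_block()
            return next(self.pool)

src = BlockKeySource(block_size=1000)
keys = [src.next_key() for _ in range(5000)]
print(src.round_trips)             # 5 trips for 5000 rows; 5000 with block size 1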

I don't care about burning (losing) surrogate key values - they are not meant to be used as a counter or an audit trail.
algfr
Participant
Posts: 106
Joined: Fri Sep 09, 2005 7:42 am

Post by algfr »

vmcburney wrote:The Slowly Changing Dimension stage assumes that you maintain dimensions as you load a fact table. That is a rigid definition that is not always applicable: what about when you have conformed dimensions? How about referential integrity? The way we use the SCD stage is to load a single dimension in a single DataStage job, discarding the "fact" output via a Copy stage and delivering the dimension output to the dimension table.

Make sure you tweak the surrogate key retrieve setting to fetch 1000 or more values at a time rather than 1 at a time, or your job will run very slowly.

I don't care about burning (losing) surrogate key values - they are not meant to be used as a counter or an audit trail.
You're right.

What I finally did was use the SCD stage to perform the updates and maintain the SKs, without using the fact output link (it goes to a Copy stage).

When I load the fact table, I perform a lookup using the BK + date to retrieve the correct SK and use it in the final fact table.
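
For the Type 2 rows, that BK + date lookup amounts to finding the dimension row whose validity window covers the transaction date. A rough Python sketch (illustrative only: the column names effective_from / effective_to and the sample data are assumptions, not what the SCD stage necessarily produces):

Code: Select all

from datetime import date

# Type 2 dimension rows: (BK, SK, effective_from, effective_to)
customer_dim = [
    ("CUST001", 1, date(2000, 1, 1), date(2006, 6, 1)),    # old version
    ("CUST001", 7, date(2006, 6, 1), date(9999, 12, 31)),  # current version
]

def lookup_sk(business_key, txn_date):
    """Return the SK whose validity window covers the transaction date."""
    for bk, sk, eff_from, eff_to in customer_dim:
        if bk == business_key and eff_from <= txn_date < eff_to:
            return sk
    raise KeyError(f"no dimension row for {business_key} on {txn_date}")

print(lookup_sk("CUST001", date(2005, 3, 15)))  # sale before the change -> 1
print(lookup_sk("CUST001", date(2007, 3, 15)))  # sale after the change  -> 7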

Though I have an issue with Type 1 SCDs: when the BK is a Varchar, it becomes blank in the update. I read that it is a bug, but I can't find the patch that fixes it on the support site.

Does anyone else have the same strategy / issue?