Page 1 of 1

SCD type at column level

Posted: Tue Aug 09, 2005 7:05 am
by Luk
Hello!

Some time ago there was a discussion about multiple SCD types in one dimension. Original post is here:
viewtopic.php?t=93674&highlight=scd

The main conclusion was that Type 1 or 2 is for whole dimension, not for single field in dimension.
Recently I have read Kimbal's book DW ETL Toolkit and there is another method called Hybrid SCD. Ralph says that SCD type at field level is possible. You may have 3 kinds of SCDs in one dimension (i.e. column1 is type 1, column2 is type3 and column 3 and 4 is type2). When you process dimension you first manage types 2 and 3 and then make update on all fields with correct natural (source) keys.

How do you comment this issue? Do you think this is good idea? Is there some limitations of this solution (for example performance)?

Thanks for any comments!

Posted: Tue Aug 09, 2005 12:37 pm
by wnogalski
Hybrid SCDs are common - sometimes there's no need to track history for some fields - f.e. consumer name or sex :wink: Actually it's the business people who decide what do they want to track - so they decide if it's a good idea or not.
As in the link You gave, there is one approach that may help You with Hybrid SCDs - simply divide the dimension into two - one will be Type 1 and the second Type 2 - see Ralph's "Data Warehouse Lifecycle Toolkit" page 184 :)

Posted: Tue Aug 09, 2005 3:11 pm
by ray.wurlod
You might also like to investigate the concept of rapidly changing dimensions, so that you have the complete picture.

Posted: Tue Aug 09, 2005 5:26 pm
by kduke
I still think this is ugly. I think on a certain level they want to know what is an elogant solution and not ugly. I think the trade off is mutiple records versus a single record. It is easier to compare previous values if it is stored all in one record. How often is this a requirement in your reporting environment that you know the previous value? This may determine SCD Hybrid or type 2 or 3.

I think type 2 is easy to follow and implement. I think it garuntees reporting accuracy at the dimension level. I do not care as to how rapid this dimension is changing. There is a cost as to the loading time of facts. You need to find the correct dimension key to insert into the fact based on the date of the fact.

Lets say customer 111 has several facts. We want to report credit card trasnactions or facts based on the customer zip code. Therefore we decide to use SCD type 2 dimensions. So maybe the first time we load customer 111 it gets surrogate key 2222 with zip code 12345 in Maryland. Next time it changes on 1/15/2004 to zip code 77777 in Alaska with surrogate key 3211. We need to know the date of the transaction to pick either 3211 or 2222 for this customer.

It appears to me that these other SCDs can tell you how many people moved from Maryland to Alaska within one record but what other advantage does this have. Now within that one record we move fields around from ZIP1 to ZIP2.

I think the question was more about this type of impression. I would like to hear others opinions about advatages and disadvantages.

Posted: Wed Aug 10, 2005 1:35 am
by Luk
I think the question was more about this type of impression. I would like to hear others opinions about advatages and disadvantages.
correct :)

Posted: Thu Aug 11, 2005 12:33 pm
by mhester
I think going the hybrid route is a bad idea and I'll give you an example. Let's say you have a table named DIM_STORE. You decide that some columns in the table are type 1 and some are type 2. To avoid anomolies, each time any of the type 1 attributes change you must go back and update all rows based on the natural key.

Example

Code: Select all


natural key  surrogate key          type1_col   type2_col

1                 1                   ABC        DEF                   Expired
1                 2                   ABC        GHI                   Expired
1                 3                   ZZZ        XXX                   Active

In the above example where a type 1 column changed in the third record you must go back and update all of the records with that natural key.

Code: Select all


natural key  surrogate key          type1_col type2_col

1                 1                   ZZZ        DEF                   Expired
1                 2                   ZZZ        GHI                   Expired
1                 3                   ZZZ        XXX                   Active

This adds a level of complexity that is not necessary. You may choose to not update the other two rows, but then reports would be wrong since the real value for type1_col should be ZZZ, but will be reported as ABC.

I know that Ralph and others write about doing hybrids, but I don't think it's a really solid solution.

Just my opinion.