SCD stage functionality
Posted: Thu Feb 25, 2010 11:41 am
Hello everyone,
Long time listener... first time caller ...
I've spent the last couple of days trying to master
the behaviour of the SCD stage in order to implement
its usage on our spanking new project.
I followed the recipe from "IBM InfoSphere DataStage Data Flow and Job Design"
http://www.redbooks.ibm.com/redbooks/pdfs/sg247576.pdf
Section 2.14 Slowly Changing Dimension (pages 113 to 126).
Here's what I have so far:
I got everything working and here's an example of the results:
Results of the change:
My concern is mostly with the results. I need the SCD1 column complete history to be updated like this:
COLOR_NAME (type 1) column has to have the same value ("TRUE WHITE") for ALL records,
as opposed to having only the last record value updated.
I understand that the update that comes out of the SCD stage
is also used to expire the record, hence it is based on the surrogate key.
We could debate on the nature of the SCD1 column update (and believe me I have)
but you can find an explanation of what I need to do (by Ralph Kimball) here:
http://www.kimballgroup.com/html/design ... ingSCD.pdf
**************************************************
I need to be able to do an update based on the business key
to the entire history of the column when a change is detected.
**************************************************
Here are my 3 possible solutions using the SCD stage:
SOLUTION 1: Use the OUTPUT to detect the changes
By comparing the CURRENT value of the SCD type 1 column with the NEW value
I could create a new update record based on the business key.
PROBLEM: The output map doesn't return the previous value of the dimension (prior to the change).
It seems to return the value after the change (from memory).
This is what I see (same value for NEW and CURRENT):
This is what I would like to see:
Question 1: Is this a bug, or is the OUTPUT MAP supposed to work this way?
_____________________________________________________________________________________________________
SOLUTION 2: Use a "LOOKUP FILE SET" or "DATA SET" for the reference lookup (CURRENT)
By using a "LOOKUP FILE SET" or "DATA SET" for the CURRENT,
we could reuse it in the comparison of the DimUpd records from the SCD stage.
This would allow me to construct a second update file based on the business key.
PROBLEM: I cannot get the LOOKUP FILE SET to work.
Question 2: What stages work as a "reference lookup" for the SCD stage?
_____________________________________________________________________________________________________
SOLUTION 3: Reuse de UpdDim records to do a blind update on the SCD type 1 columns based on the business key
This is probably the simplest to code. All records that make it to the Upsert
are reused to update the SCD Type 1 columns without validating if a change took place on the SCD Type 1 column.
PROBLEM: I would be doing unnecessary updates when the values for the SCD type 1 column haven't changed,
which is probably most of the time
Question 3: Any other ideas on how to do this with the SCD stage?
_____________________________________________________________________________________________________
One final thought; I am aware that there methods of doing this wihout an SCD stage ...
but it would defeat the purpose of having an SCD stage if we don't use it
Thank you for your time!
Long time listener... first time caller ...
I've spent the last couple of days trying to master
the behaviour of the SCD stage in order to implement
its usage on our spanking new project.
I followed the recipe from "IBM InfoSphere DataStage Data Flow and Job Design"
http://www.redbooks.ibm.com/redbooks/pdfs/sg247576.pdf
Section 2.14 Slowly Changing Dimension (pages 113 to 126).
Here's what I have so far:
Code: Select all
Current Current : Dimension reference records
| New : New "potential" dimension records
| SCD : SCD stage (contains the surrogate key)
New--------SCD-------Dimension Dimension: Dimension upsert records (update and insert)
| Output : Combination of data from "New" and "Current"
|
Output
Code: Select all
*************
* NEW *
*************
business_key type_1 type_2 effective_date
COLOR_CODE COLOR_NAME COLOR_GROUP EFFECT_DATE
------------ ---------- ----------- --------------
FFFFFF TRUE WHITE LIGHT 2001-01-01
*************
* CURRENT *
*************
surrogate_key business_key type_1 type_2 effective_date expiry_date current_indicator
COLOR_ID COLOR_CODE COLOR_NAME COLOR_GROUP EFFECT_DATE EXPIR_DATE CURRENT_IND
------------- ------------ ---------- ----------- -------------- ----------- --------------
000000001 FFFFFF WHITE CLEAR 2000-01-01 2999-12-31 Y
Results of the change:
Code: Select all
*************************
* DIMENSION RESULT *
*************************
surrogate_key business_key type_1 type_2 effective_date expiry_date current_indicator
COLOR_ID COLOR_CODE COLOR_NAME COLOR_GROUP EFFECT_DATE EXPIR_DATE CURRENT_IND
------------- ------------ ---------- ----------- -------------- ----------- --------------
000000001 FFFFFF WHITE CLEAR 2000-01-01 2001-01-01 N
000000002 FFFFFF TRUE WHITE LIGHT 2001-01-01 2999-12-31 Y
Code: Select all
*************************
* DESIRED RESULT *
*************************
surrogate_key business_key type_1 type_2 effective_date expiry_date current_indicator
COLOR_ID COLOR_CODE COLOR_NAME COLOR_GROUP EFFECT_DATE EXPIR_DATE CURRENT_IND
------------- ------------ ---------- ----------- -------------- ----------- --------------
000000001 FFFFFF TRUE WHITE CLEAR 2000-01-01 2001-01-01 N
000000002 FFFFFF TRUE WHITE LIGHT 2001-01-01 2999-12-31 Y
as opposed to having only the last record value updated.
I understand that the update that comes out of the SCD stage
is also used to expire the record, hence it is based on the surrogate key.
We could debate on the nature of the SCD1 column update (and believe me I have)
but you can find an explanation of what I need to do (by Ralph Kimball) here:
http://www.kimballgroup.com/html/design ... ingSCD.pdf
**************************************************
I need to be able to do an update based on the business key
to the entire history of the column when a change is detected.
**************************************************
Here are my 3 possible solutions using the SCD stage:
SOLUTION 1: Use the OUTPUT to detect the changes
By comparing the CURRENT value of the SCD type 1 column with the NEW value
I could create a new update record based on the business key.
PROBLEM: The output map doesn't return the previous value of the dimension (prior to the change).
It seems to return the value after the change (from memory).
This is what I see (same value for NEW and CURRENT):
Code: Select all
*************************
* OUTPUT *
*************************
NEW NEW CURRENT CURRENT
business_key type_1 business_key type_1
COLOR_CODE COLOR_NAME COLOR_CODE COLOR_NAME
------------ ---------- ------------ ----------
FFFFFF TRUE WHITE FFFFFF TRUE WHITE
Code: Select all
*************************
* DESIRED OUTPUT *
*************************
NEW NEW CURRENT CURRENT
business_key type_1 business_key type_1
COLOR_CODE COLOR_NAME COLOR_CODE COLOR_NAME
------------ ---------- ------------ ----------
FFFFFF TRUE WHITE FFFFFF WHITE
_____________________________________________________________________________________________________
SOLUTION 2: Use a "LOOKUP FILE SET" or "DATA SET" for the reference lookup (CURRENT)
By using a "LOOKUP FILE SET" or "DATA SET" for the CURRENT,
we could reuse it in the comparison of the DimUpd records from the SCD stage.
This would allow me to construct a second update file based on the business key.
PROBLEM: I cannot get the LOOKUP FILE SET to work.
Question 2: What stages work as a "reference lookup" for the SCD stage?
_____________________________________________________________________________________________________
SOLUTION 3: Reuse de UpdDim records to do a blind update on the SCD type 1 columns based on the business key
This is probably the simplest to code. All records that make it to the Upsert
are reused to update the SCD Type 1 columns without validating if a change took place on the SCD Type 1 column.
PROBLEM: I would be doing unnecessary updates when the values for the SCD type 1 column haven't changed,
which is probably most of the time
Question 3: Any other ideas on how to do this with the SCD stage?
_____________________________________________________________________________________________________
One final thought; I am aware that there methods of doing this wihout an SCD stage ...
but it would defeat the purpose of having an SCD stage if we don't use it
Thank you for your time!