Implementing Type 2 SCD with DS
I'm really new to datastage.
We're building a DWH from scratch, and most of our dimension tables need to be Type 2 SCD.
I need to see what has changed (I can't trust the audit columns), and insert a new row when something has changed, updating the relevant Effective/Termination date fields.
I'd like to know the best way of implementing this without writing too much code, since I'd like to keep the jobs as maintainable as possible.
Can anyone send me a few .dsx files as examples of how this problem was overcome with DS?
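For readers who want a concrete picture of the Type 2 logic being asked about, here is a minimal sketch outside of any ETL tool. The in-memory `dimension` dict, the column names, and the `eff_date`/`term_date` fields are illustrative assumptions, not part of the thread:

```python
# Hedged sketch of core Type 2 SCD behaviour: expire the current row
# and insert a new version when any attribute changes.
from datetime import date

def apply_scd2(dimension, incoming, business_key, today=None):
    """dimension: dict mapping business key -> list of row dicts,
    each row carrying 'eff_date', 'term_date' plus attribute columns.
    incoming: dict of attribute values for one source record."""
    today = today or date.today()
    history = dimension.setdefault(business_key, [])
    # The "current" version is the one with an open termination date.
    current = next((r for r in history if r["term_date"] is None), None)

    if current is not None:
        unchanged = all(current.get(k) == v for k, v in incoming.items())
        if unchanged:
            return "no change"
        current["term_date"] = today          # close out the old version
    history.append(dict(incoming, eff_date=today, term_date=None))
    return "inserted" if current is None else "versioned"
```

In a real warehouse the `dimension` structure would be the target table (or a staging copy of it); the sketch only shows the expire-and-insert decision.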
I'm doing a lot of those right now, using DataStage 7.0.
My preferred technique is to use a stage variable for each pair of column comparisons. This allows me to Trim, pad, do whatever I need to do (for example "nulls compare equal") before determining whether the columns match or not. These Boolean stage variables are then assembled with Or operators to yield a "change detected" true/false value.
For this to work you need to have captured a before-image of current ("live") rows from the target table (into a hashed file is good) and have a mechanism for reverse lookup from natural key to surrogate key (another hashed file is an excellent choice). The hashed files must be maintained in real time so that existence and change can be detected as at the row just processed, even though the job itself may not be effecting changes directly into the target table (for example into a staging area instead, from which bulk loader can populate the table).
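The stage-variable idea above can be sketched in Python, purely for illustration: one normalized boolean per column pair, OR-ed into a single "change detected" flag. The column names and the trim/"nulls compare equal" rules are assumptions:

```python
# Hedged sketch of per-column change detection, analogous to one
# stage variable per column pair in a Transformer.
def norm(value):
    """Trim and treat NULL as empty so 'nulls compare equal'."""
    return "" if value is None else str(value).strip()

def change_detected(before, after, columns):
    # One boolean per column (the stage-variable analogue) ...
    diffs = {c: norm(before.get(c)) != norm(after.get(c)) for c in columns}
    # ... OR-ed together into a single change flag.
    return any(diffs.values()), diffs
```

Keeping one flag per column (rather than only the combined flag) is what later makes it possible to record *which* columns changed.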
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Add an extra column to the target table to hold a CRC value. Prior to processing a new file, create a hashed file containing CRC values of the latest records.
Assuming your source is a csv file, read the whole line in and discard records whose CRC already exists in the hashed file. Records whose CRC does not exist in the hashed file can be split using a Row Splitter stage and written to the target as well as the hashed file. If your source is a database table you will have to use a Row Merger stage first.
The advantage is that you do not have to do column by column comparisons nor will you have to maintain reverse lookups for keys.
The disadvantage is that there is a one chance in a googol (I don't know the exact number) that two different strings may evaluate to the same CRC value.
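The whole-record CRC shortcut described above can be sketched with Python's `zlib.crc32` standing in for DataStage's CRC routine; the delimiter and record layout are assumptions for illustration:

```python
# Hedged sketch: CRC over the whole record, so any column change
# alters the value and no column-by-column comparison is needed.
import zlib

def row_crc(fields, sep="|"):
    """One CRC per record, computed over all columns joined together."""
    line = sep.join("" if f is None else str(f) for f in fields)
    return zlib.crc32(line.encode("utf-8"))

def filter_changed(records, known_crcs):
    """Yield only records whose CRC is not already in the lookup set."""
    for rec in records:
        crc = row_crc(rec)
        if crc not in known_crcs:
            known_crcs.add(crc)
            yield rec
```

The `known_crcs` set plays the role of the hashed file of latest-record CRCs.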
Jim Paradies
There is a small, calculable probability that a CRC value will not be unique. It's the reciprocal of the number of possible different values. For example, a 32-bit CRC has a 1 in 2**32 chance of a false positive.
My present technique - separate comparisons - is needed because there is a requirement to record which columns have changed. There are only seven tables, and the largest has only 53 columns, so it's not that great a task.
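The arithmetic behind the quoted collision odds is easy to check; the 32-bit figure is the one given above, and this applies to a single pairwise comparison:

```python
# Odds of a CRC false positive for one pairwise comparison:
# 1 chance in 2**n possible values for an n-bit CRC (not "one in a googol").
def crc_possible_values(bits):
    return 2 ** bits

print(crc_possible_values(32))  # 4294967296, i.e. 1 in ~4.3 billion
```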
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
On the subject of implementing SCD jobs, has anyone played with the Intelligent Assistants in 7.5.1A? Or know where / how you use them?
There is quite a collection of xml 'templates' under Ascential/DataStage/Assistants/Generation/templates, hence the question. 83 in all, actually, including SCD1, SCD2 and SCD3 templates for jobs and Sequencers. Curious where I can read more about them.
-craig
"You can never have too many knives" -- Logan Nine Fingers
Intelligent.
This is not the first attempt they've made at Intelligent Assistants. The first attempt was only (iirc) for 7.1 on Windows, and consisted of wizards that guided you through job construction (source table? target table? voila! a job that moves data).
The latest lot seem to be XML exports of parallel jobs. Import DataStage Components (XML) from these. You need DataStage EE with parallel engine licensed.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
I once had a version of DataStage that had working SCD type 1, 2 and 3 intelligent assistants. Created sequence jobs and parallel jobs. Unfortunately I was only using that version for a couple weeks and I have not found any intelligent assistants on my subsequent versions.
On my current project we had to implement Type 2 slowly changing dimensions in parallel jobs. I discovered the intelligent assistant template folder and imported all the jobs there; they are parallel job templates for SCD types 1, 2 and 3. They were not really usable: they were built without transformer stages, they do not have many column names in them, and they are severely lacking in helpful annotations and descriptions. We found it easier to build our own SCD design.
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn