Implementing Type 2 SCD with DS

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
jcampbell
Participant
Posts: 1
Joined: Thu Sep 15, 2005 2:04 am

Implementing Type 2 SCD with DS

Post by jcampbell »

I'm really new to datastage.
We're building a DWH from scratch, and most of our dimension tables need to be Type 2 SCD.

I need to see what has changed (I can't trust the audit columns), and insert a new row when something has changed updating the relevant Effective/Termination date fields.

I'd like to know the best way of implimenting this without writing too much code - since I'd like to keep the jobs as maintainable as possible.

Can anyone send me a few .dsx files as examples of how this problem was overcome with DS?
Sreenivasulu
Premium Member
Premium Member
Posts: 892
Joined: Thu Oct 16, 2003 5:18 am

Post by Sreenivasulu »

HI

Sorry. Sending Job designs would violate client confidentiality.

Morever you are new to datastage. Please join a training course

Regards
Sreeni
wnogalski
Charter Member
Charter Member
Posts: 54
Joined: Thu Jan 06, 2005 10:49 am
Location: Warsaw

Post by wnogalski »

A lot has been written on this topic recently with specific and detailed design approaches so try using the fine :wink: search engine.
Regards,
Wojciech Nogalski
balajisr
Charter Member
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

Hi

Please use the search mechanism which is on top of this page. You will get a good idea about SCD.

--Balaji S.R
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I'm doing a lot of those right now, using DataStage 7.0.

My preferred technique is to use a stage variable for each pair of column comparisons. This allows me to Trim, pad, do whatever I need to do (for example "nulls compare equal") before determining whether the columns match or not. These Boolean stage variables are then assembled with Or operators to yield a "change detected" true/false value.

For this to work you need to have captured a before-image of current ("live") rows from the target table (into a hashed file is good) and have a mechanism for reverse lookup from natural key to surrogate key (another hashed file is an excellent choice). The hashed files must be maintained in real time so that existence and change can be detected as at the row just processed, even though the job itself may not be effecting changes directly into the target table (for example into a staging area instead, from which bulk loader can populate the table).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jzparad
Charter Member
Charter Member
Posts: 151
Joined: Thu Apr 01, 2004 9:37 pm

Post by jzparad »

Add an extra column to the target table to hold a CRC value. Prior to processing a new file, create a hashED file containing CRC values of the latest records.

Assuming your source is a csv file, read the whole line in and discard records whose CRC already exist in the hashED file. Records whose CRC do not exist in the hashED file can be split using a Row Splitter stage and written to the target as well as the hashED file. If your source is a database table you will have to use a Row Merger stage first.

The advantage is that you do not have to do column by column comparisons nor will you have to maintain reverse lookups for keys.

The disadvantage is that there is a one chance in a googol (I don't know the exact number) that two different strings may evaluate to the same CRC value.
Jim Paradies
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

The disadvantage is that there is a one chance in a googol (I don't know the exact number) that two different strings may evaluate to the same CRC value
Hi Jim,

Do you mean to say that, CRC may not be unique. :shock: I was under the impression that it will be always unique. :cry:

-Kumar
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

There is a small, calculable probability that a CRC value will not be unique. It's the reciprocal of the number of possible different values. For example, a 32-bit CRC has a 1 in 2**32 chance of a false positive.

My present technique - separate comparisons - is needed because there is a requirement to record which columns have changed. There are only seven tables, and the largest has only 53 columns, so it's not that great a task.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

On the subject of implementing SCD jobs, has anyone played with the Intelligent Assistants in 7.5.1A? Or know where / how you use them?

There is quite a collection of xml 'templates' under Ascential/DataStage/Assistants/Generation/templates, hence the question. 83 in all, actually, including SCD1, SCD2 and SCD3 templates for jobs and Sequencers. Curious where I can read more about them. :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Yes. They aren't.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Aren't what... intelligent? usable?
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Intelligent.

This is not the first attempt they've made at Intelligent Assistants. The first attempt was only (iirc) for 7.1 on Windows, and were wizards that guided you through job construction (source table? target table? voila! a job that moves data).

The latest lot seem to be XML exports of parallel jobs. Import DataStage Components (XML) from these. You need DataStage EE with parallel engine licensed.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

So, no love for us Server slubs, eh? Not that I need these, just thought it might be cool to put Mr Wizard through his paces and see what popped out. :cry:
-craig

"You can never have too many knives" -- Logan Nine Fingers
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

I once had a version of DataStage that had working SCD type 1, 2 and 3 intelligent assistants. Created sequence jobs and parallel jobs. Unfortunately I was only using that version for a couple weeks and I have not found any intelligent assistants on my subsequent versions.

On my current project we had to implement slowly changing dimensions of type 2 in parallel jobs. I discovered the intelligent assistant template folder and imported all the jobs there, they are parallel job templates for SCD of types 1, 2 and 3. They were not really usable. They were built without the use of transformer stages and they do not have many column names in them, they are severely lacking in helpful annotations and descriptions. We found it easier to build our own SCD design.
Post Reply