Implementing Type 2 SCD with DS
I'm really new to datastage.
We're building a DWH from scratch, and most of our dimension tables need to be Type 2 SCD.
I need to see what has changed (I can't trust the audit columns), and insert a new row when something has changed, updating the relevant Effective/Termination date fields.
I'd like to know the best way of implementing this without writing too much code, since I'd like to keep the jobs as maintainable as possible.
Can anyone send me a few .dsx files as examples of how this problem was overcome with DS?
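For readers who want a concrete picture of the Type 2 logic being asked about, here is a minimal sketch outside of any ETL tool. The in-memory `dimension` dict, the column names, and the `eff_date`/`term_date` fields are illustrative assumptions, not part of the thread:

```python
# Hedged sketch of core Type 2 SCD behaviour: expire the current row
# and insert a new version when any attribute changes.
from datetime import date

def apply_scd2(dimension, incoming, business_key, today=None):
    """dimension: dict mapping business key -> list of row dicts,
    each row carrying 'eff_date', 'term_date' plus attribute columns.
    incoming: dict of attribute values for one source record."""
    today = today or date.today()
    history = dimension.setdefault(business_key, [])
    # The "current" version is the one with an open termination date.
    current = next((r for r in history if r["term_date"] is None), None)

    if current is not None:
        unchanged = all(current.get(k) == v for k, v in incoming.items())
        if unchanged:
            return "no change"
        current["term_date"] = today          # close out the old version
    history.append(dict(incoming, eff_date=today, term_date=None))
    return "inserted" if current is None else "versioned"
```

In a real warehouse the `dimension` structure would be the target table (or a staging copy of it); the sketch only shows the expire-and-insert decision.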
I'm doing a lot of those right now, using DataStage 7.0.
My preferred technique is to use a stage variable for each pair of column comparisons. This allows me to Trim, pad, do whatever I need to do (for example "nulls compare equal") before determining whether the columns match or not. These Boolean stage variables are then assembled with Or operators to yield a "change detected" true/false value.
For this to work you need to have captured a before-image of current ("live") rows from the target table (into a hashed file is good) and have a mechanism for reverse lookup from natural key to surrogate key (another hashed file is an excellent choice). The hashed files must be maintained in real time so that existence and change can be detected as at the row just processed, even though the job itself may not be effecting changes directly into the target table (for example into a staging area instead, from which bulk loader can populate the table).
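The stage-variable idea above can be sketched in Python, purely for illustration: one normalized boolean per column pair, OR-ed into a single "change detected" flag. The column names and the trim/"nulls compare equal" rules are assumptions:

```python
# Hedged sketch of per-column change detection, analogous to one
# stage variable per column pair in a Transformer.
def norm(value):
    """Trim and treat NULL as empty so 'nulls compare equal'."""
    return "" if value is None else str(value).strip()

def change_detected(before, after, columns):
    # One boolean per column (the stage-variable analogue) ...
    diffs = {c: norm(before.get(c)) != norm(after.get(c)) for c in columns}
    # ... OR-ed together into a single change flag.
    return any(diffs.values()), diffs
```

Keeping one flag per column (rather than only the combined flag) is what later makes it possible to record *which* columns changed.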
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Add an extra column to the target table to hold a CRC value. Prior to processing a new file, create a hashed file containing CRC values of the latest records.
Assuming your source is a csv file, read the whole line in and discard records whose CRC already exists in the hashed file. Records whose CRC does not exist in the hashed file can be split using a Row Splitter stage and written to the target as well as the hashed file. If your source is a database table you will have to use a Row Merger stage first.
The advantage is that you do not have to do column by column comparisons nor will you have to maintain reverse lookups for keys.
The disadvantage is that there is a one chance in a googol (I don't know the exact number) that two different strings may evaluate to the same CRC value.
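The whole-record CRC shortcut described above can be sketched with Python's `zlib.crc32` standing in for DataStage's CRC routine; the delimiter and record layout are assumptions for illustration:

```python
# Hedged sketch: CRC over the whole record, so any column change
# alters the value and no column-by-column comparison is needed.
import zlib

def row_crc(fields, sep="|"):
    """One CRC per record, computed over all columns joined together."""
    line = sep.join("" if f is None else str(f) for f in fields)
    return zlib.crc32(line.encode("utf-8"))

def filter_changed(records, known_crcs):
    """Yield only records whose CRC is not already in the lookup set."""
    for rec in records:
        crc = row_crc(rec)
        if crc not in known_crcs:
            known_crcs.add(crc)
            yield rec
```

The `known_crcs` set plays the role of the hashed file of latest-record CRCs.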
Jim Paradies
There is a small, calculable probability that a CRC value will not be unique. It's the reciprocal of the number of possible different values. For example, a 32-bit CRC has a 1 in 2**32 chance of a false positive.
My present technique - separate comparisons - is needed because there is a requirement to record which columns have changed. There are only seven tables, and the largest has only 53 columns, so it's not that great a task.
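The arithmetic behind the quoted collision odds is easy to check; the 32-bit figure is the one given above, and this applies to a single pairwise comparison:

```python
# Odds of a CRC false positive for one pairwise comparison:
# 1 chance in 2**n possible values for an n-bit CRC (not "one in a googol").
def crc_possible_values(bits):
    return 2 ** bits

print(crc_possible_values(32))  # 4294967296, i.e. 1 in ~4.3 billion
```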
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
On the subject of implementing SCD jobs, has anyone played with the Intelligent Assistants in 7.5.1A? Or know where / how you use them?
There is quite a collection of xml 'templates' under Ascential/DataStage/Assistants/Generation/templates, hence the question. 83 in all, actually, including SCD1, SCD2 and SCD3 templates for jobs and Sequencers. Curious where I can read more about them.
-craig
"You can never have too many knives" -- Logan Nine Fingers
Intelligent.
This is not the first attempt they've made at Intelligent Assistants. The first attempt was only (iirc) for 7.1 on Windows, and consisted of wizards that guided you through job construction (source table? target table? voila! a job that moves data).
The latest lot seem to be XML exports of parallel jobs. Import DataStage Components (XML) from these. You need DataStage EE with parallel engine licensed.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
I once had a version of DataStage that had working SCD type 1, 2 and 3 intelligent assistants. Created sequence jobs and parallel jobs. Unfortunately I was only using that version for a couple weeks and I have not found any intelligent assistants on my subsequent versions.
On my current project we had to implement Type 2 slowly changing dimensions in parallel jobs. I discovered the intelligent assistant template folder and imported all the jobs there; they are parallel job templates for SCD types 1, 2 and 3. They were not really usable: they were built without transformer stages, they do not have many column names in them, and they are severely lacking in helpful annotations and descriptions. We found it easier to build our own SCD design.
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn