Does anyone have a PX job template for SCD type 2?

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
mfavero
Premium Member
Premium Member
Posts: 45
Joined: Thu Jan 16, 2003 1:20 pm
Location: Minneapolis

Does anyone have a PX job template for SCD type 2?

Post by mfavero »

I have templates for Server jobs thanks to Ken Bland. And if I am still to believe what he wrote almost 2 years ago, then I should try to stick with server jobs for all SCD processing.

See post 'SCD In Parallel Jobs' from shivakumar on Mon Jul 31, 2006 11:58 pm

Anyone want to give another opinion? :?:

Here's what ken had to say then (Hi Ken!)

The difficulty in SCD in parallel is the complicated nature of a type 1, 2, and hybrid 1-2. You need to transform a new row and compare it to the current row. You may compare just a subset of all columns. This means that if you have 40 columns, maybe only 35 pertain to looking for a difference. You may even use CRC32 to facilitate this comparison but computing CRC32 values for the column groups in current and new and then do an equivalency test.

In addition, the hybrd 1-2 type means a smaller subset of columns cause an update, and another subset of columns cause an insert of the next variant. In addition, in type 1 handling you potentially update the current row with an effective end date and optionally remove a current indicator flag.

If there is any chance that you may have time ordered data and need to detect a repeating natural key, then you would potentially have a whole family of SCD rows to generate for the same natural key in the correct order. This would require some device to stage/store the rows as you assemble them. PX has no storage mechanism for holding rows and updating them. .ds datasets are write once, you cannot reference a row from it and modify it. About the only method is a custom buildop in this case.

The combination of stages required to handle SCD type logic, as well as deal with all of the various potential inserts and update streams of rows make SCD in a parallel job a daunting task. If you have to deal with repeating natural keys of time ordered data, you're now outside the tool and writing a buildop. For this reason, unless performance requires it, you should consider using a Server job and hashed file stages. You can easily store the current row in a hashed file, transform the new row, reference and compare to the current row, and optionally overwrite the current row with a newer row or insert a new row as well as retire/update the current row. Then, extract from your staging hashed file your inserts and updates and then load to your target table.

_________________
Kenneth Bland


Mike :shock: :roll: :o
Michael Favero

2852 Humboldt Ave So
Minneapolis, MN 55408
rameshrr3
Premium Member
Premium Member
Posts: 609
Joined: Mon May 10, 2004 3:32 am
Location: BRENTWOOD, TN

Post by rameshrr3 »

A look at how PX jobs in Ver 8.x can handle SCD type 2
http://www.leverageinformation.info/myA ... sdatastage
John Smith
Charter Member
Charter Member
Posts: 193
Joined: Tue Sep 05, 2006 8:01 pm
Location: Australia

Post by John Smith »

If you're on v8 then use the SCD stage. If you're on 7.5 you can use CDC stage. Not hard to template them up if you want to. I would use RCP if your environment can allow that.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

The SCD stage doesn't address most of my points. You only can use it according to its limitations, you must model and have data that conform to its design specifications.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
mfavero
Premium Member
Premium Member
Posts: 45
Joined: Thu Jan 16, 2003 1:20 pm
Location: Minneapolis

Post by mfavero »

I have read through the SCD stage documentation and it is not clear what those limitations are and just how I would identify if my design does or does not fit the design of the SCD. The first thing I notice is that the SCD stage seems to think that a given source stream provides an update/insert to some dimension as well as an insert to a fact. This is not intuitive to me. I happened to see another post which asked if the output was mandatory and Ray siad yes but you could direct it to seq file /tmp/null or whatever to throw it away. Are we taking a 2 dollar one size fits all solution to dress a pig for the Sport's Illustrated Swimsuit competition?

I am probably going to have to try it out, but I was hoping that someone has a lot of experience with it already and can provide templates/prototypes which satisfy some of the common applications of SCD.
Michael Favero

2852 Humboldt Ave So
Minneapolis, MN 55408
rameshrr3
Premium Member
Premium Member
Posts: 609
Joined: Mon May 10, 2004 3:32 am
Location: BRENTWOOD, TN

Post by rameshrr3 »

I succesfully implemented a Hybrid Type1/2 scenario using a combination of PX Transformer and PX Lookup stages, with multiple output links.

1. End Dating an Old record and setting its Current indicator to NO.

2. Inserting new records or new versions of the record ( with Current indicator to YES and setting a start date with open end date) with fresh surrogate key.

3. Updating a record where there are only Type 1 column changes.

Out of 110 columns, 18 are being tracked for Type2 Changes, One is a key column(surrogate), another business key( code column), 3 are for start and enda dates as well as current indicator and the remaining 87 are Type 1 columns.

I know that a combination using Difference OR Change capture stages can also facilitate this, however is there any SIGNIFICANT performance impact expected in using aTransformer/Lookup combination to do the same? The dimension has areound 250000 records and may have 400 inserts/updates every month. The job runs on a daily basis.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The storage mechanism, which the OP stated does not exist, in fact does exist. It's a "state file" created (if needed), initialized and used by the Surrogate Key Generator stage and the GetSurrogateKey() function.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

rameshrr3 wrote:A look at how PX jobs in Ver 8.x can handle SCD type 2
http://www.leverageinformation.info/myA ... sdatastage
WARNING
This website (leverageinformation.info, not DSXchange!) is for sale and, apparently to try to make a buck, every page you touch opens an extra advertising window (even if you have pop-ups blocked). That's OK if you want stupid animated emoticons or weird background images but, personally, I don't.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

In the credit crisis IBM have ditched the websites LeverageInformation (heavily promoted at IOD 2007) and InfoSphere.info (heavily promoted at IOD 2008). In fact if you see IBM promote a website at a conference it's pretty much the kiss of death for that website. Opportunists have bought the LeverageInformation URL off the open market and turned it into a spam sites while InfoSphere.info redirects to the IOD conference home page.
Post Reply