Does anyone have a PX job template for SCD type 2?
I have templates for Server jobs, thanks to Ken Bland. And if I am still to believe what he wrote almost two years ago, then I should stick with Server jobs for all SCD processing.
See post 'SCD In Parallel Jobs' from shivakumar on Mon Jul 31, 2006 11:58 pm
Anyone want to give another opinion?
Here's what Ken had to say then (Hi Ken!):
The difficulty with SCD in parallel is the complicated nature of type 1, type 2, and hybrid 1-2 processing. You need to transform a new row and compare it to the current row. You may compare just a subset of all columns: if you have 40 columns, maybe only 35 pertain to detecting a difference. You may even use CRC32 to facilitate this comparison, by computing CRC32 values for the column groups in the current and new rows and then doing an equivalency test.
In addition, the hybrid 1-2 type means a smaller subset of columns causes an update, while another subset causes an insert of the next variant. There is also the retirement work: you potentially update the current row with an effective end date and optionally clear its current indicator flag.
If there is any chance that you may have time-ordered data and need to detect a repeating natural key, then you would potentially have a whole family of SCD rows to generate for the same natural key in the correct order. This would require some device to stage/store the rows as you assemble them. PX has no storage mechanism for holding rows and updating them: .ds datasets are write-once, so you cannot reference a row from one and modify it. About the only method in this case is a custom buildop.
The combination of stages required to handle SCD-type logic, as well as to deal with all of the various potential insert and update streams of rows, makes SCD in a parallel job a daunting task. If you have to deal with repeating natural keys in time-ordered data, you're now outside the tool and writing a buildop. For this reason, unless performance requires otherwise, you should consider using a Server job and hashed file stages. You can easily store the current row in a hashed file, transform the new row, reference and compare it to the current row, and optionally overwrite the current row with a newer row or insert a new row as well as retire/update the current row. Then extract your inserts and updates from your staging hashed file and load them to your target table.
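The CRC-based change detection Ken describes can be sketched in a few lines. This is an illustrative Python sketch, not DataStage code; the column names are invented for the example, and in a real job the tracked-column list would come from your dimension design:

```python
import zlib

# Hypothetical tracked columns: only this subset participates in
# change detection, as described above (e.g. 35 of 40 columns).
TRACKED_COLUMNS = ["name", "address", "segment"]

def crc_of(row, columns):
    """Compute a CRC32 over the concatenated values of the tracked columns."""
    payload = "|".join(str(row.get(c, "")) for c in columns)
    return zlib.crc32(payload.encode("utf-8"))

def rows_differ(current_row, new_row):
    """Equivalency test on the CRCs of the tracked column group."""
    return crc_of(current_row, TRACKED_COLUMNS) != crc_of(new_row, TRACKED_COLUMNS)

current  = {"name": "Acme", "address": "1 Main St", "segment": "A", "phone": "555"}
incoming = {"name": "Acme", "address": "2 Oak Ave", "segment": "A", "phone": "555"}
print(rows_differ(current, incoming))  # tracked 'address' column changed
```

Note that a change to an untracked column (here, `phone`) produces equal CRCs and is ignored, which is exactly why the column-group trick works for detecting only the differences you care about.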
_________________
Kenneth Bland
Mike
Michael Favero
2852 Humboldt Ave So
Minneapolis, MN 55408
A look at how PX jobs in Ver 8.x can handle SCD type 2
http://www.leverageinformation.info/myA ... sdatastage
The SCD stage doesn't address most of my points. You can only use it within its limitations; your model and your data must conform to its design specifications.
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
I have read through the SCD stage documentation and it is not clear what those limitations are, or how I would identify whether my design fits the design of the SCD stage. The first thing I notice is that the SCD stage seems to assume that a given source stream provides an update/insert to some dimension as well as an insert to a fact. This is not intuitive to me. I happened to see another post which asked if that output was mandatory, and Ray said yes, but you could direct it to a sequential file like /tmp/null to throw it away. Are we taking a two-dollar, one-size-fits-all solution to dress a pig for the Sports Illustrated swimsuit competition?
I am probably going to have to try it out, but I was hoping that someone has a lot of experience with it already and can provide templates/prototypes which satisfy some of the common applications of SCD.
Michael Favero
2852 Humboldt Ave So
Minneapolis, MN 55408
I successfully implemented a hybrid Type 1/2 scenario using a combination of PX Transformer and PX Lookup stages with multiple output links:
1. End-dating an old record and setting its Current indicator to NO.
2. Inserting new records or new versions of a record (with the Current indicator set to YES, a start date, and an open end date) with a fresh surrogate key.
3. Updating a record in place where there are only Type 1 column changes.
Out of 110 columns, 18 are tracked for Type 2 changes; one is the surrogate key column, another is the business key (code column), 3 are for the start and end dates and the current indicator, and the remaining 87 are Type 1 columns.
I know that a combination using the Difference or Change Capture stages can also facilitate this; however, is there any SIGNIFICANT performance impact expected in using a Transformer/Lookup combination to do the same? The dimension has around 250,000 records and may have 400 inserts/updates every month. The job runs on a daily basis.
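The three-link routing described above can be modelled in a short sketch. This is a hypothetical Python illustration of the routing decision only (the real job does this in Transformer/Lookup stage logic); the column names and link names are invented for the example:

```python
from datetime import date

# Hypothetical column groups for the sketch.
TYPE2_COLS = ["code_desc", "region"]   # changes create a new version
TYPE1_COLS = ["phone", "email"]        # changes overwrite in place

def route(incoming, current, next_key):
    """Return (link_name, row) actions mimicking the three output links."""
    actions = []
    if current is None or any(incoming[c] != current[c] for c in TYPE2_COLS):
        if current is not None:
            # Link 1: end-date the old version and clear its current flag.
            retired = {**current, "end_date": date.today(), "is_current": "N"}
            actions.append(("end_date_update", retired))
        # Link 2: insert the new version with a fresh surrogate key
        # and an open end date.
        new_row = {**incoming, "sk": next_key, "start_date": date.today(),
                   "end_date": None, "is_current": "Y"}
        actions.append(("insert", new_row))
    elif any(incoming[c] != current[c] for c in TYPE1_COLS):
        # Link 3: Type 1 overwrite; key, dates, and current flag unchanged.
        updated = {**current, **{c: incoming[c] for c in TYPE1_COLS}}
        actions.append(("type1_update", updated))
    return actions
```

A Type 2 change yields two actions (retire the old row, insert the new version); a Type 1-only change yields a single in-place update; an unseen business key yields only an insert.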
The storage mechanism, which the OP stated does not exist, in fact does exist. It's a "state file" created (if needed), initialized and used by the Surrogate Key Generator stage and the GetSurrogateKey() function.
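The idea of a persisted key counter can be illustrated with a minimal sketch. This is an analogy only, assuming a plain text counter file: the file name and format here are invented, and DataStage's actual state file format is internal to the Surrogate Key Generator stage:

```python
import os

def next_surrogate_key(state_file):
    """Read the last issued key from the state file, increment it,
    persist the new value, and return it. Creates the file if needed."""
    last = 0
    if os.path.exists(state_file):
        with open(state_file) as f:
            last = int(f.read().strip() or 0)
    new = last + 1
    with open(state_file, "w") as f:
        f.write(str(new))
    return new
```

Because the counter survives between runs, each job execution continues the key sequence where the previous one left off, which is the property the state file provides.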
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
rameshrr3 wrote: A look at how PX jobs in Ver 8.x can handle SCD type 2
http://www.leverageinformation.info/myA ... sdatastage
This website (leverageinformation.info, not DSXchange!) is for sale and, apparently to try to make a buck, every page you touch opens an extra advertising window (even if you have pop-ups blocked). That's OK if you want stupid animated emoticons or weird background images but, personally, I don't.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
In the credit crisis IBM have ditched the websites LeverageInformation (heavily promoted at IOD 2007) and InfoSphere.info (heavily promoted at IOD 2008). In fact, if you see IBM promote a website at a conference, it's pretty much the kiss of death for that website. Opportunists have bought the LeverageInformation URL on the open market and turned it into a spam site, while InfoSphere.info redirects to the IOD conference home page.
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn