Change Data Detection - Approach

fridge
Premium Member
Posts: 136
Joined: Sat Jan 10, 2004 8:51 am

Change Data Detection - Approach

Post by fridge »

Hi, I am VERY new to DS but would appreciate views on how to handle changed data detection using DataStage. I have prototyped a job that uses CRC32 on a concatenation of the non-key fields, checked against a hash file derived from the target table, but I need a way of generalising this so developers don't have to re-invent the wheel every time they approach a new ETL job.

To clarify

1) The source system(s) send deltas only.
2) The target EDW model could spread source data over several tables.
3) We want a generic module/sc to handle CDD; I realise this may require drilling down into the repository metadata, but I understand that it is not a declared (documented) structure and is hence risky from a maintenance point of view.

Any pointers gratefully received, and apologies if the subject is a bit basic.
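For illustration only, here is a minimal Python sketch of the approach described above (not a DataStage routine; the column names, key and delimiter are made up). It concatenates the non-key columns, takes a CRC32, and compares it against the CRC stored for that key from the previous load, which is the role the hash file plays in the prototype.

import zlib

def row_crc(row, non_key_cols, delim="|"):
    """CRC32 over the concatenated non-key columns (column order must be stable)."""
    concat = delim.join(str(row[c]) for c in non_key_cols)
    return zlib.crc32(concat.encode("utf-8"))

def detect_change(row, key_col, non_key_cols, previous):
    """Classify an incoming delta row as an insert, an update or unchanged."""
    key = row[key_col]
    if key not in previous:
        return "insert"
    return "unchanged" if previous[key] == row_crc(row, non_key_cols) else "update"

# 'previous' stands in for the hash file built from the target table:
# natural key -> CRC32 of the concatenated non-key columns as last loaded.
previous = {"CUST001": row_crc({"NAME": "Smith", "CITY": "London"}, ["NAME", "CITY"])}

incoming = {"CUST_ID": "CUST001", "NAME": "Smith", "CITY": "Leeds"}
print(detect_change(incoming, "CUST_ID", ["NAME", "CITY"], previous))   # prints "update"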

fridge
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Welcome aboard!

This topic has been covered in the past, so why not begin by undertaking a search of the forum (see Search in the menu bar). Likely useful topics include CDC, "change data detection", "detection" and CRC32.

You might be surprised to find that others have already been down this path! :lol:

Then, if you have particular questions, either reply to those posts, or extend this one.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Apologies if I'm missing something obvious here, but if your source systems will be sending deltas only - why do you need to worry about CDD? :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

As Ray pointed out, there are many posts that discuss CDC and CDD so please review these. If you are a member of the Ascential Developer Network (ADN) you can download a DS job that I created a couple of years ago, which is an example of how to use CRC32.

Regards,

Michael Hester

In addition, Craig has a valid point: if you are already receiving the deltas, why would you want to perform CDD on the data?
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Deltas in an operational system are IRRELEVANT in a warehouse. For example, a source has 52 columns, of which only 26 are loaded into the warehouse. The remaining columns are operational in nature and have no business value. Therefore, any activity on those columns is technically a change, yet the data, when mapped into the warehouse, is identical. If you are creating slowly changing dimensions based on upstream change information, you are erring and creating superfluous, identical time-variant rows. You MUST always fully transform the source row and verify that the ultimately transformed row is different from the most current time-variant row in the warehouse.

Furthermore, I carefully used the words fully transformed to indicate that only after transformation is the row for the warehouse constructed. You may have the situation where the source driving table did not have a change, but a referenced value from another table did. In that situation, you must fully merge/reference all data into the ultimately transformed row and then do your analysis (CRC32, column-by-column checking) to see if the row is different.
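A small illustration of that point, in Python rather than DataStage (the row, the lookup and the column names are invented): the driving source row is unchanged, but a value referenced from another table during transformation has changed, so the fully transformed row differs and must be treated as a change.

import zlib

def transform(source_row, region_lookup):
    """Build the fully transformed warehouse row: driving columns plus referenced data."""
    return {
        "CUST_ID": source_row["CUST_ID"],
        "NAME": source_row["NAME"],
        # Value merged in from a referenced table during transformation.
        "REGION": region_lookup[source_row["COUNTRY"]],
    }

def crc_of(row, cols):
    """CRC32 of the named columns of the fully transformed row."""
    return zlib.crc32("|".join(str(row[c]) for c in cols).encode("utf-8"))

source_row = {"CUST_ID": "CUST001", "NAME": "Smith", "COUNTRY": "AU"}

# The driving row has not changed, but the referenced table has.
current_in_warehouse = transform(source_row, {"AU": "Asia Pacific (old name)"})
candidate            = transform(source_row, {"AU": "Asia Pacific"})

compare_cols = ["NAME", "REGION"]
if crc_of(candidate, compare_cols) != crc_of(current_in_warehouse, compare_cols):
    print("row differs after full transformation - load a new time-variant version")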

I also want to remind people about SCD types 1, 2, 3 and the unnamed type (some call it 1 1/2). In a type 1 you perform an update if any columns are different; in this situation it's less critical if you perform a superfluous update. In a type 2, you create a whole new version (time variant) of the row with a starting effective date; superfluous "updates" (really inserts) gum up your table and should be avoided. Type 3 says a column is "rolled" through a series of history columns on the table (xyz_current, xyz_hist1, xyz_hist2, xyz_hist3), so each "update" requires moving 2 to 3, 1 to 2, current to 1, and the new value into current. Again, superfluous updates roll off your history, when you should only update when the current value doesn't match the incoming one. Lastly, a type 1 1/2 means that changes to some columns allow you to just update the most current row, like a type 1, while changes to other columns require you to insert a new version (time variant).
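As a hedged sketch of that routing logic (Python, with invented column names): route a fully transformed row to a type 1 in-place update or a type 2 new version depending on which columns changed, and roll a type 3 history chain only on a genuine change.

# Columns driving each behaviour; the split is what makes this the "type 1 1/2" case.
TYPE1_COLS = ["PHONE"]      # a change here just overwrites the current row
TYPE2_COLS = ["ADDRESS"]    # a change here requires a new time-variant version

def classify_change(current, candidate):
    """Route the fully transformed row: no action, type 1 update, or type 2 insert."""
    if any(current[c] != candidate[c] for c in TYPE2_COLS):
        return "type 2: insert new version"
    if any(current[c] != candidate[c] for c in TYPE1_COLS):
        return "type 1: update in place"
    return "no change"

def roll_type3(row, new_value):
    """Type 3: roll the fixed history columns, but only on a genuine change."""
    if row["xyz_current"] == new_value:
        return row                        # a superfluous update would roll off real history
    row["xyz_hist3"] = row["xyz_hist2"]
    row["xyz_hist2"] = row["xyz_hist1"]
    row["xyz_hist1"] = row["xyz_current"]
    row["xyz_current"] = new_value
    return row

current   = {"PHONE": "555-0100", "ADDRESS": "1 Main St"}
candidate = {"PHONE": "555-0199", "ADDRESS": "1 Main St"}
print(classify_change(current, candidate))    # -> type 1: update in place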

So, to reiterate, you must always determine which situation you are in and apply the correct solution. You should never rely on another system to tell your warehouse what is an insert and what is an update. You will have data rejected at the door because of quality concerns; then a subsequent row instructing you to insert it may actually need to be an update. Couple that with SCD/time-variant warehouse logic and you quickly realize the truth of this point.
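A tiny sketch of that last point (Python, hypothetical names): derive the insert/update decision from what is actually in the warehouse rather than from the action flag supplied by the source system.

warehouse_keys = {"CUST001"}      # natural keys already present in the target table

def resolve_action(delta_row):
    """Derive the real action from the warehouse itself, not the source system's flag."""
    source_flag = delta_row.get("ACTION")                     # advisory at best
    actual = "update" if delta_row["CUST_ID"] in warehouse_keys else "insert"
    if source_flag and source_flag.lower() != actual:
        print(f"source said '{source_flag}', the warehouse says '{actual}'")
    return actual

# A delta flagged as an insert whose key already exists must really be an update.
print(resolve_action({"CUST_ID": "CUST001", "ACTION": "insert"}))   # -> update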
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Point taken. :)
-craig

"You can never have too many knives" -- Logan Nine Fingers
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

I should have been clearer in my reply. My perspective was from source to staging (I kind of thought the poster was talking about that portion of the process), in which case you can certainly rely on the source system deltas or do your own CDD to avoid processing unnecessary data. It is much more efficient to use CRC32 against the entire row than it is to do a column-by-column comparison. I understand your point, Ken, and agree 100%. Once you begin populating the warehouse you will need to follow a method such as the one you have outlined.

I was thinking more along the lines of improved source to staging processing.

Regards,

Michael Hester
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

And I agree with Mike 100%. I also sometimes forget to differentiate la-la-land from the real world. In la-la-land, we get to build the types of ultimate solutions which include persistent staging databases/induction layers. In the real world we often get architectural direction from management which bypasses said PSD due to budgetary constraints (pay me now or pay me 4X as much later) or lack of vision.

Thanks Mike for clarifying. If the original poster is receiving deltas only, and that delta information is a reference set as well as a driving set, it behooves you in the strongest sense to be using some sort of PSD or induction database. For example, if you're only getting customer deltas, at some point you need a baseline, and you will probably need to re-establish that baseline periodically. Another example is fact-based data such as orders: you probably can't sustain a full baseline, but for audit trail or post-load error correction you will want that data staged somewhere non-relational, yet queryable, to facilitate this.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

Ken,

I think populating the database for the first time is one of the most difficult (not complex) tasks we have when building a DW. Usually the client will ask for the last 24 months of data as a starting point and want you to load it yesterday! That makes all of us get really creative, so I always look for ways to reduce the number of rows processed. This philosophy fits right in with your "divide and conquer" methodology too.

You are correct though that a refresh of certain tables on a regular basis is necessary.

Regards,

Michael Hester
fridge
Premium Member
Posts: 136
Joined: Sat Jan 10, 2004 8:51 am

Thanks

Post by fridge »

Basically, thanks for the replies; good to know that I can crib ideas off more experienced people - keeps me in a job :) .

I downloaded the sample job from Ascential, which mirrors what I was trying to do but raises another question.

I want to make this a generic module, and feel I can do this by accessing the metadata that underlies DS with a bit of thought, but I am concerned that if Ascential change the underlying DB structure w.r.t. mappings, key fields and so forth, any generic module will become redundant, especially as Ascential don't document the repository structure.

I come from an ETI*Extract background, where this is possible, if somewhat complicated from a programming p.o.v.

Anyway, thanks for all the posts, and watch out as I will no doubt have a multitude of similar inane posts over the next decade.