Change Data Detection - Approach

fridge
Premium Member
Posts: 136
Joined: Sat Jan 10, 2004 8:51 am

Change Data Detection - Approach

Post by fridge »

Hi, I am VERY new to DS but would appreciate views on how to handle changed data detection using DataStage. I have prototyped a job that uses CRC32 on a concatenation of the non-key fields, checked against a hash file derived from the target table, but I need a way of generalising this so developers don't have to re-invent the wheel every time they approach a new ETL job.

To clarify

1) The source system(s) send deltas only.
2) The target EDW model could spread source data over several tables.
3) We want a generic module/sc to handle CDD; I realise this may require drilling down into the repository metadata, but I understand that it is not a declared (documented) structure and is hence risky from a maintenance point of view.

Any pointers gratefully received, and apologies if the subject is a bit basic.
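For illustration only, here is a minimal Python sketch of the approach described above (not a DataStage routine; the column names, key and delimiter are made up). It concatenates the non-key columns, takes a CRC32, and compares it against the CRC stored for that key from the previous load, which is the role the hash file plays in the prototype.

import zlib

def row_crc(row, non_key_cols, delim="|"):
    """CRC32 over the concatenated non-key columns (column order must be stable)."""
    concat = delim.join(str(row[c]) for c in non_key_cols)
    return zlib.crc32(concat.encode("utf-8"))

def detect_change(row, key_col, non_key_cols, previous):
    """Classify an incoming delta row as an insert, an update or unchanged."""
    key = row[key_col]
    if key not in previous:
        return "insert"
    return "unchanged" if previous[key] == row_crc(row, non_key_cols) else "update"

# 'previous' stands in for the hash file built from the target table:
# natural key -> CRC32 of the concatenated non-key columns as last loaded.
previous = {"CUST001": row_crc({"NAME": "Smith", "CITY": "London"}, ["NAME", "CITY"])}

incoming = {"CUST_ID": "CUST001", "NAME": "Smith", "CITY": "Leeds"}
print(detect_change(incoming, "CUST_ID", ["NAME", "CITY"], previous))   # prints "update"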

fridge
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Welcome aboard!

This topic has been covered in the past, so why not begin by undertaking a search of the forum (see Search in the menu bar). Likely useful topics include CDC, "change data detection", "detection" and CRC32.

You might be surprised to find that others have already been down this path! :lol:

Then, if you have particular questions, either reply to those posts, or extend this one.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Apologies if I'm missing something obvious here, but if your source systems will be sending deltas only - why do you need to worry about CDD? :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

As Ray pointed out, there are many posts that discuss CDC and CDD so please review these. If you are a member of the Ascential Developer Network (ADN) you can download a DS job that I created a couple of years ago, which is an example of how to use CRC32.

Regards,

Michael Hester

In addition, Craig has a valid point: if you are already receiving the deltas, why would you want to perform CDD on the data?
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Deltas in an operational system are IRRELEVANT in a warehouse. For example, a source has 52 columns, of which only 26 are loaded into the warehouse. The remaining columns are operational in nature and have no business value. Therefore, any activity on those columns is technically a change, yet the data, when mapped into the warehouse, is identical. If you are creating slowly changing dimensions based on upstream change information, you are erring and creating superfluous, identical time-variant rows. You MUST always fully transform the source row and verify that the ultimately transformed row is different from the most current time-variant row in the warehouse.

Furthermore, I carefully used the words fully transformed to indicate that only after transformation is the row for the warehouse constructed. You may have the situation where the source driving table did not have a change, but a referenced value from another table did. In that situation, you must fully merge/reference all data into the ultimately transformed row and then do your analysis (CRC32, column-by-column checking) to see if the row is different.
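A small illustration of that point, in Python rather than DataStage (the row, the lookup and the column names are invented): the driving source row is unchanged, but a value referenced from another table during transformation has changed, so the fully transformed row differs and must be treated as a change.

import zlib

def transform(source_row, region_lookup):
    """Build the fully transformed warehouse row: driving columns plus referenced data."""
    return {
        "CUST_ID": source_row["CUST_ID"],
        "NAME": source_row["NAME"],
        # Value merged in from a referenced table during transformation.
        "REGION": region_lookup[source_row["COUNTRY"]],
    }

def crc_of(row, cols):
    """CRC32 of the named columns of the fully transformed row."""
    return zlib.crc32("|".join(str(row[c]) for c in cols).encode("utf-8"))

source_row = {"CUST_ID": "CUST001", "NAME": "Smith", "COUNTRY": "AU"}

# The driving row has not changed, but the referenced table has.
current_in_warehouse = transform(source_row, {"AU": "Asia Pacific (old name)"})
candidate            = transform(source_row, {"AU": "Asia Pacific"})

compare_cols = ["NAME", "REGION"]
if crc_of(candidate, compare_cols) != crc_of(current_in_warehouse, compare_cols):
    print("row differs after full transformation - load a new time-variant version")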

I also want to remind people about SCD types 1, 2, 3 and the unnamed type (some call it 1 1/2). In a type 1 you perform an update if any columns are different; in this situation it's less critical if you perform a superfluous update. In a type 2, you create a whole new version (time variant) of the row with a starting effective date; superfluous "updates" (really inserts) gum up your table and should be avoided. Type 3 says a column is "rolled" through a series of history columns on the table (xyz_current, xyz_hist1, xyz_hist2, xyz_hist3), so each "update" requires moving 2 to 3, 1 to 2, current to 1, and the new value into current. Again, superfluous updates roll off your history, when you should only update when the current value doesn't match the incoming one. Lastly, a type 1 1/2 means that changes to some columns allow you to just update the most current row, like a type 1, while changes to other columns require you to insert a new version (time variant).
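As a hedged sketch of that routing logic (Python, with invented column names): route a fully transformed row to a type 1 in-place update or a type 2 new version depending on which columns changed, and roll a type 3 history chain only on a genuine change.

# Columns driving each behaviour; the split is what makes this the "type 1 1/2" case.
TYPE1_COLS = ["PHONE"]      # a change here just overwrites the current row
TYPE2_COLS = ["ADDRESS"]    # a change here requires a new time-variant version

def classify_change(current, candidate):
    """Route the fully transformed row: no action, type 1 update, or type 2 insert."""
    if any(current[c] != candidate[c] for c in TYPE2_COLS):
        return "type 2: insert new version"
    if any(current[c] != candidate[c] for c in TYPE1_COLS):
        return "type 1: update in place"
    return "no change"

def roll_type3(row, new_value):
    """Type 3: roll the fixed history columns, but only on a genuine change."""
    if row["xyz_current"] == new_value:
        return row                        # a superfluous update would roll off real history
    row["xyz_hist3"] = row["xyz_hist2"]
    row["xyz_hist2"] = row["xyz_hist1"]
    row["xyz_hist1"] = row["xyz_current"]
    row["xyz_current"] = new_value
    return row

current   = {"PHONE": "555-0100", "ADDRESS": "1 Main St"}
candidate = {"PHONE": "555-0199", "ADDRESS": "1 Main St"}
print(classify_change(current, candidate))    # -> type 1: update in place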

So, to reiterate, you must always determine which situation you are in and apply the correct solution. You should never rely on another system to tell your warehouse what is an insert and what is an update. You will have data rejected at the door because of quality concerns; then a subsequent row instructing you to insert it may actually need to be an update. Couple that with SCD/time-variant warehouse logic and you quickly realize the truth of this point.
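A tiny sketch of that last point (Python, hypothetical names): derive the insert/update decision from what is actually in the warehouse rather than from the action flag supplied by the source system.

warehouse_keys = {"CUST001"}      # natural keys already present in the target table

def resolve_action(delta_row):
    """Derive the real action from the warehouse itself, not the source system's flag."""
    source_flag = delta_row.get("ACTION")                     # advisory at best
    actual = "update" if delta_row["CUST_ID"] in warehouse_keys else "insert"
    if source_flag and source_flag.lower() != actual:
        print(f"source said '{source_flag}', the warehouse says '{actual}'")
    return actual

# A delta flagged as an insert whose key already exists must really be an update.
print(resolve_action({"CUST_ID": "CUST001", "ACTION": "insert"}))   # -> update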
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Point taken. :)
-craig

"You can never have too many knives" -- Logan Nine Fingers
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

I should have been clearer in my reply. My perspective was from source to staging (I kind of thought the poster was talking about that portion of the process), in which case you can certainly rely on the source system deltas or do your own CDD to avoid processing unnecessary data. It is much more efficient to use CRC32 against the entire row than it is to do a column-by-column comparison. I understand your point, Ken, and agree 100%. Once you begin populating the warehouse you will need to follow a method such as the one you have outlined.

I was thinking more along the lines of improved source to staging processing.

Regards,

Michael Hester
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

And I agree with Mike 100%. I also sometimes forget to differentiate la-la-land from the real world. In la-la-land, we get to build the types of ultimate solutions which include persistent staging databases/induction layers. In the real world we often get architectural direction from management which bypasses said PSD due to budgetary constraints (pay me now or pay me 4X as much later) or lack of vision.

Thanks Mike for clarifying. If the original poster is receiving deltas only, and that delta information is a reference set as well as a driving set, it behooves you in the strongest sense to be using some sort of PSD or induction database. For example, if you're only getting customer deltas, at some point you need a baseline, and you will probably need to re-establish that baseline periodically. Another example is fact-based data such as orders: you probably can't sustain a full baseline, but for audit trail or post-load error correction you will want that data staged somewhere non-relational, yet queryable, to facilitate this.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

Ken,

I think populating the database for the first time is one of the most difficult (not complex) tasks we have when building a DW. Usually the client will ask for the last 24 months of data as a starting point and want you to load it yesterday! That makes all of us get really creative, so I always look for ways to reduce the number of rows processed. This philosophy fits right in with your "divide and conquer" methodology too.

You are correct though that a refresh of certain tables on a regular basis is necessary.

Regards,

Michael Hester
fridge
Premium Member
Posts: 136
Joined: Sat Jan 10, 2004 8:51 am

Thanks

Post by fridge »

Basically, thanks for the replies; good to know that I can crib ideas off more experienced people - keeps me in a job :) .

I downloaded the sample job from Ascential, which mirrors what I was trying to do but raises another question.

I want to make this a generic module, and feel I can do this by accessing the metadata that underlies DS with a bit of thought, but I am concerned that if Ascential change the underlying DB structure w.r.t. mappings, key fields and so forth, any generic module will become redundant, especially as Ascential don't document the repository structure.

I come from an ETI*Extract background, where this is possible, if somewhat complicated from a programming p.o.v.

Anyway, thanks for all the posts, and watch out as I will no doubt have a multitude of similar inane posts over the next decade.