Using CRC32() to detect changed data in SCD2 implementation

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

sbass1
Premium Member
Posts: 211
Joined: Wed Jan 28, 2009 9:00 pm
Location: Sydney, Australia

Using CRC32() to detect changed data in SCD2 implementation

Post by sbass1 »

Hi,

I've searched the site on "SCD", "MD5", and "message digest". A summary of my research:

1) MD5 and CRC64 are not supported OOTB in DS
2) Others have gotten MD5 support as an external module:
viewtopic.php?t=112259&highlight=MD5, but this post is light on the details of how it worked.
3) Unfortunately I don't know the details for finding, compiling, and implementing either MD5 or CRC64 as an external routine, nor do I understand the performance implications of external routines.
4) CRC32 is available OOTB, but might not be unique:
viewtopic.php?t=98883&highlight=MD5
viewtopic.php?t=112188&highlight=CRC64
viewtopic.php?t=110300&highlight=CRC64
5) I'll use the approach documented here to implement our SCD Type 2 requirements:
http://etl-tools.info/en/datastage-tuto ... astage.htm

Another ETL tool I've worked with used an MD5 hash of the "variables of interest" to detect changes. Given the above links, CRC32 is clearly being used in DS for the same purpose.

Using the CRC32 routine is IMO really just "syntactic sugar", to avoid code like:

If (Var1 <> Var1_New) Or (Var2 <> Var2_New) Or ... Or (VarN <> VarN_New) Then "Changed" Else "Not Changed"

Instead, the code would be something like:

If CRC32(Old.Var1:Old.Var2:Old.VarN) <> CRC32(New.Var1:New.Var2:New.VarN) Then "Changed"
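
In a server job Transformer, that check might live in a stage variable and the output link constraints, along these lines (link and column names are purely illustrative):

svChanged: If Old.NOTFOUND Then @FALSE Else CRC32(Old.Var1 : Old.Var2 : Old.VarN) <> CRC32(New.Var1 : New.Var2 : New.VarN)

Constraint on the "new rows" link: Old.NOTFOUND
Constraint on the "changed rows" link: svChanged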

Now onto my questions:

1) Is this a good approach, i.e. use of CRC32?
2) Would this incur a large performance hit vs. the "ugly code" above?
3) Will "pre-calculating" the CRC for existing data and maintaining in a hash help mitigate any performance hit? See Craig's approach here, search for his posts:
viewtopic.php?t=112188&highlight=CRC
4) If I implement Craig's approach above, can you explain how this would change the approach in http://etl-tools.info/en/datastage-tuto ... astage.htm?
5) Could someone (Craig?) post a (URL to a) screenshot of an SCD2 implementation using a hashed file to detect change records? If you don't have your own public web server to upload a screenshot to I find Google Docs to be useful http://docs.google.com.

A lot to ask, I know... partial replies are still most welcome! :wink:

Thanks,
Scott
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Well, well... let's see if I can help. :D

On the MD5 front, we implemented that at the O/S level using Java. Basically, we landed a flat file with all of the data to be 'munged' using a Sequential File stage, then used the Filter option on the reading side to pass the contents of the landed file to the Java code. It munged everything and wrote the output to standard out, which then fed into the rest of the job. All in one job. Worked just fine for us.

On the CRC32 side, I found it acceptable as a mechanism to do CDD when CDC wasn't available. So...

1) Yes, IMHO.
2) Define 'large'. :wink: I don't believe so, and it certainly makes both the job and the logic easier to implement properly and to digest.
3) Yes, definitely. There's a one-time hit to CRC32 everything and build a 'persistent' store, or you could redo it each time without persisting if your volumes are low.
4) It really wouldn't change all that much. The lookup would be to the CRC32 hashed file and the 'Check Discrepancies' logic would be much simpler. Mmmm... sugar.
5) I don't have any DataStage access at the moment, so cannot. :(

To expand a little on (4) above, one change (if you didn't have "everything" already CRC32'd) would be to change that hashed file population link to add the CRC32 function call. And then I would land any changed CRC32 values with their key(s) to a flat file and update the hashed file only after successful completion of the database loads if you were persisting that information.
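
To sketch what the Transformer side of that could look like (every name below, i.e. Src, HashCRC, CRC_VALUE and the columns, is made up for illustration): Src is the incoming stream link and HashCRC is a reference link to the hashed file, keyed on the natural key, with one non-key column holding the stored CRC.

svNewCRC : CRC32(Src.ColA : "|" : Src.ColB : "|" : Src.ColC)
svIsNew : HashCRC.NOTFOUND
svChanged : Not(svIsNew) And (svNewCRC <> HashCRC.CRC_VALUE)

Constraint on the insert link: svIsNew
Constraint on the Type 2 change link: svChanged
Constraint on the link landing the key(s) and svNewCRC to the flat file: svIsNew Or svChanged

That flat file is what gets applied back to the CRC hashed file once the database loads complete successfully.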

I'd also suggest not simply concatenating values together, especially if any of them can be null. The NullToEmpty transform comes in handy there, and I'd take the "extra" step of putting separators between the fields so you don't get anything silly like this happening:

Field A = 12, Field B = 3
Field A changes to 1, B to 23

Without a field delimiter, it would appear to not have changed: 123 = 123, but 12|3 <> 1|23. Silly, yes, but it can bite you in the ass if you let it. :wink:
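
So the value fed to CRC32 ends up being built something like this (column names purely illustrative):

CRC32(NullToEmpty(In.FirstName) : "|" : NullToEmpty(In.LastName) : "|" : NullToEmpty(In.City))

Any delimiter will do, as long as it can never appear in the data itself.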

Hope this helps.
-craig

"You can never have too many knives" -- Logan Nine Fingers
sbass1
Premium Member
Posts: 211
Joined: Wed Jan 28, 2009 9:00 pm
Location: Sydney, Australia

Post by sbass1 »

I have a follow up question regarding the approach in http://etl-tools.info/en/datastage-tuto ... astage.htm.

The T001_Lookups transformer stage http://etl-tools.info/images/scd2-trans ... lookup.jpg has the constraint:

not(Lkp_Cust.NOTFOUND)

IOW, do that processing only if the row was found, i.e. it already exists.

But, wouldn't that filter out any NEW rows coming from the source file? Is this tutorial only meant to illustrate the CDC process, and not the complete SCD2 solution?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Apparently so, yes. :?

Check out the older but (from what I recall) more inclusive Ascential Tech Tip we posted on Kim Duke's website on his Tips page - it's at the bottom in the 'TwoTechTips' zip file.
-craig

"You can never have too many knives" -- Logan Nine Fingers
sbass1
Premium Member
Posts: 211
Joined: Wed Jan 28, 2009 9:00 pm
Location: Sydney, Australia

Post by sbass1 »

Thanks for pointing me to Kim Duke's website.

I gotta say after further review I'm not too impressed with the http://etl-tools.info/en/datastage-tuto ... astage.htm tutorial. I think it also has the source and lookup tables reversed.

If my source (new data, sequential file) is say 1000 records, and my target (existing data, dimension table) is 50M records, wouldn't I want my source to be the stream and the dimension to be the reference (lookup) link?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

No.

The new data MUST be the driver.

One useful technique is called "bucket table". In this technique you identify all the keys in source and load these into a table in your database. You can then perform a join of this "bucket table" with the existing dimension table to retrieve only the current records that you're going to need into the job (probably into a hashed file in a server job).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

That is one of the best questions ever asked in terms of researching the question first - I think we should turn this thread into a FAQ on change data capture for server jobs.

Don't concatenate your fields together manually. Read your dimension table, dump it to a delimited text file, and read that file back in as a single field; then use Transformer functions to split it into a key field and a remainder field, and run CRC32 on the remainder to turn it into a value. You can do this with the server Merge stage. This retains the delimiters, which makes the CRC32 comparison more reliable.
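
For example, assuming the dump is pipe-delimited, is read in as a single column Rec on a link called In, and the key is the first field (all of those names and assumptions are illustrative), the Transformer derivations could be something like:

Key column derivation: Field(In.Rec, "|", 1)
CRC column derivation: CRC32(Field(In.Rec, "|", 2, 999))

The four-argument form of Field() returns that many delimited fields (delimiters included) starting at the given field, so 999 just has to exceed the number of columns in the dump.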

Consider building and maintaining a hash file that holds the key and CRC32 value and maintain it over time. For example you can incrementally update it each day and then fully refresh it once a month.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

... or even store the CRC32 value in the table itself.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sbass1
Premium Member
Posts: 211
Joined: Wed Jan 28, 2009 9:00 pm
Location: Sydney, Australia

Post by sbass1 »

ray.wurlod wrote:No.

The new data MUST be the driver.

One useful technique is called "bucket table". In this technique you identify all the keys in source and load these into a table in your database. You can then perform a join of this "bucket table" with the existing dimension table to retrieve only the current records that you're going to need into the job (probably into a hashed file in a server job).
Hi Ray,

Let's see if I understand you properly...

1) "The new data MUST be the driver." Isn't that what I was saying? The tutorial has the source as the dimension table, and the new data as the lookup. I was saying that the new data would be the source (the "delta" file) and the lookups would be against the target (dimension table).

Can you define what you mean by "driver"? I assume it means the stream input?

2) And for better lookup performance, you're saying write the source file's keys to the database, then do an inner join with the (much larger) dimension table and retrieve those records for the lookups. Yep, I can definitely see the performance gains from doing that. Much better to upload 1000 records "across the wire" and filter at the database than to download 50M records "across the wire" just to see if there's a hit.

However, my current approach will be to maintain a hashed file of NKs --> CRC of the satellite variables, so I won't even be using the target table for lookups. I can also filter the hashed file to current records only to limit its size.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

That's only part of the problem, but the other part could be resolved by including the SKs as a non-key column in your hashed file. Of course, you then have to maintain these when a Type 2 change occurs.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.