Storing CRC values

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

RodBarnes
Charter Member
Posts: 182
Joined: Fri Mar 18, 2005 2:10 pm

Storing CRC values

Post by RodBarnes »

I'd like some input from the community.

Background:
I've been working with DataStage for about six months now and feel pretty competent with it. I am using a CRC-check model in which a change in a record is detected by a change in the CRC generated over the pertinent fields. The newly generated value is compared to the stored value from the latest insert and, if different, the new record is inserted. (We're using a type 2 update model where all changes are inserts of a new record.)
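In rough Python terms, the logic looks something like the sketch below. (This is just an illustration, not the actual job: zlib's crc32 stands in for the DataStage CRC32 routine, and all the names are invented.)

[code]
import zlib

def crc_for_row(row, fields):
    # Concatenate the pertinent fields with a delimiter and CRC the result.
    payload = "|".join(str(row[f]) for f in fields)
    return zlib.crc32(payload.encode("utf-8"))

def changed_rows(rows, key_field, fields, stored_crcs):
    # Yield rows whose CRC differs from the stored value for the same key.
    # Type 2 model: every change becomes an insert of a new record.
    for row in rows:
        new_crc = crc_for_row(row, fields)
        if stored_crcs.get(row[key_field]) != new_crc:
            stored_crcs[row[key_field]] = new_crc  # remember the latest CRC
            yield row

# usage sketch: stored_crcs plays the role of the persistent hashed file
stored_crcs = {}
rows = [{"id": 1, "name": "Ann", "city": "Perth"}]
to_insert = list(changed_rows(rows, "id", ["name", "city"], stored_crcs))
[/code]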

My question:
Currently, I am just storing the CRC values in a hash file that is maintained across ETL runs. Someone shared a concern that this seems prone to issues since the hash files are separate from the target database. I don't necessarily agree, but wondered what the "best practice" might be in this case. Specifically, should I be writing these CRCs out to a database table somewhere? That would mean reading them back in -- or regenerating the current values each time -- both of which would seem to increase the processing time over simply reading them from an existing hash file.

I don't see any issue with a hash file having a life outside of a single ETL run. As long as the hash file and the target tables are being backed up so they can be restored together it shouldn't be an issue, right? But I was a PICK programmer in a former life so I admit to being very comfortable with hash files -- unlike some who are familiar only with the relational world. :-)

Thanks in advance for your thoughts.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The vendor will argue that hashed files were never intended for persistent storage, and now that UniVerse and DataStage are separate products that argument carries more weight. If they ever decide to remove hashed file support from the product, you'll have a problem.
A suitable replacement already exists in the parallel job environment, so don't dismiss this possibility. The safest long-term strategy is to use a form of persistent storage that you know will not vanish in future, such as a database table or - if possible for your design - a text file used as a persistent data set.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
baabi_26
Participant
Posts: 14
Joined: Mon Jan 24, 2005 5:31 pm

Post by baabi_26 »

We have a scenario where we have to capture change-record information. This may or may not help you much, but this is how we are doing it.

Every week we get a source file which has old as well as changed record information. We compare the old file with the new file and identify which records are inserts, updates, deletes, or unchanged. I have created a DataStage job to handle this. (We don't have the parallel canvas.) We retain the old file (sequential) to do the comparison. Everything is handled in a single job. Let me know if you are interested in how I am implementing it; I can post a screen shot of it here and explain.
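In rough Python terms the idea is just this (a sketch only, not the actual DataStage job; the paths and field names are invented):

[code]
import csv

def classify(old_path, new_path, key_field):
    # Load last week's and this week's files keyed on the business key.
    with open(old_path, newline="") as f:
        old = {r[key_field]: r for r in csv.DictReader(f)}
    with open(new_path, newline="") as f:
        new = {r[key_field]: r for r in csv.DictReader(f)}

    inserts   = [new[k] for k in new.keys() - old.keys()]
    deletes   = [old[k] for k in old.keys() - new.keys()]
    common    = new.keys() & old.keys()
    updates   = [new[k] for k in common if new[k] != old[k]]
    unchanged = [new[k] for k in common if new[k] == old[k]]
    return inserts, updates, deletes, unchanged
[/code]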
RodBarnes
Charter Member
Posts: 182
Joined: Fri Mar 18, 2005 2:10 pm

Post by RodBarnes »

Yes, I would be interested in learning more about the method you are using. It sounds like, during each ETL run, it makes a pass through the source data and generates a list of which records are inserts, updates, or deletes. Then this list is used in later processing to control the handling of each record.

Do I understand correctly?
baabi_26 wrote:We have a scenario where we have to capture change-record information. This may or may not help you much, but this is how we are doing it.

Every week we get a source file which has old as well as changed record information. We compare the old file with the new file and identify which records are inserts, updates, deletes, or unchanged. I have created a DataStage job to handle this. (We don't have the parallel canvas.) We retain the old file (sequential) to do the comparison. Everything is handled in a single job. Let me know if you are interested in how I am implementing it; I can post a screen shot of it here and explain.
alanwms
Charter Member
Posts: 28
Joined: Wed Feb 26, 2003 2:51 pm
Location: Atlanta/UK

Post by alanwms »

Rod,

Having worked with the CRC32 routine before for change data capture, I understand your dilemma. I have addressed the persistent hash file issue by unloading the hash files to flat files at the conclusion of any update process. This gives me a backup copy of the complete hash file contents that can be reloaded in the event the file becomes damaged or needs to be cleared (just load it from an empty flat file). You can also archive the flat file (add a timestamp to its name) so you can go back and reload the CRC values from a known point in time.
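In outline, the unload/reload cycle amounts to this (a Python-flavoured sketch of the idea, not the actual DataStage steps; the paths and names are invented):

[code]
import time

def unload_crcs(crc_map, archive_dir):
    # Dump the key -> CRC mapping to a timestamped flat file at end of run.
    stamp = time.strftime("%Y%m%d_%H%M%S")
    path = f"{archive_dir}/crc_snapshot_{stamp}.txt"
    with open(path, "w") as f:
        for key, crc in crc_map.items():
            f.write(f"{key}\t{crc}\n")
    return path

def reload_crcs(path):
    # Rebuild the lookup from a snapshot; an empty file yields an
    # empty lookup, which is equivalent to clearing the hash file.
    crcs = {}
    with open(path) as f:
        for line in f:
            key, crc = line.rstrip("\n").split("\t")
            crcs[key] = int(crc)
    return crcs
[/code]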

You can always set up a separate process to just generate the CRC32 values from the database, store them in a flat file, and then reload the hash file lookup from the flat file. If you use flat files to keep snapshots of your hash files, you can easily migrate to lookup data sets and Lookup stages in the Parallel framework (EE) when the time comes. I've had to modify my approach to hash files within DataStage away from persistent storage toward temporary storage, even though I am an old, old PICK programmer (from '79). Although hash files are very flexible and powerful, the knowledge base on them seems to be dwindling, and more robust solutions that provide the same functionality are available in the newest releases of DataStage.

Alan
baabi_26
Participant
Posts: 14
Joined: Mon Jan 24, 2005 5:31 pm

Post by baabi_26 »

Hey, how do I copy/paste my job design here? Any help please.
RodBarnes
Charter Member
Posts: 182
Joined: Fri Mar 18, 2005 2:10 pm

Post by RodBarnes »

Thanks for the input, Alan. Much appreciated. I will look into your idea of exporting the CRC data to a flat file and then reimporting on each run. I'll compare the performance of this with just regenerating the CRC from the database. I expect the load-from-flat-file will be much faster.
alanwms wrote:Having worked with the CRC32 routine before for doing change data capture, I understand your dilemma...
baabi_26
Participant
Posts: 14
Joined: Mon Jan 24, 2005 5:31 pm

Post by baabi_26 »

I am trying to show my job design for handling the change capture. How can I paste/attach my job design in the forum/message?
shawn_ramsey
Participant
Posts: 145
Joined: Fri May 02, 2003 9:59 am
Location: Seattle, Washington. USA

Post by shawn_ramsey »

Rod,

With much work we finally convinced the modeler to allow us to add the CRC to the data warehouse model as a column in the destination table. To keep the users from getting confused, we filter the column out via the reporting tool's metadata. This worked well for us since the relational database persists and is backed up regularly, whereas the ETL environment would be recreated from certification if required.
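Roughly, the layout looks like this (a sketch only: SQLite stands in for our actual database, and the table and column names are invented):

[code]
import sqlite3, zlib

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE dim_customer (
    id INTEGER, name TEXT, city TEXT,
    row_crc INTEGER,        -- hidden from users via the reporting tool's metadata
    effective_ts TEXT)""")

def upsert_type2(con, row):
    # Insert a new version only when the stored CRC for the key differs.
    new_crc = zlib.crc32(f"{row['name']}|{row['city']}".encode("utf-8"))
    hit = con.execute(
        "SELECT row_crc FROM dim_customer WHERE id = ? "
        "ORDER BY effective_ts DESC LIMIT 1", (row["id"],)).fetchone()
    if hit is None or hit[0] != new_crc:
        con.execute("INSERT INTO dim_customer VALUES (?,?,?,?,datetime('now'))",
                    (row["id"], row["name"], row["city"], new_crc))

upsert_type2(con, {"id": 1, "name": "Ann", "city": "Perth"})
[/code]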
Shawn Ramsey

"It is a mistake to think you can solve any major problems just with potatoes."
-- Douglas Adams
clshore
Charter Member
Posts: 115
Joined: Tue Oct 21, 2003 11:45 am

Post by clshore »

I assume that you are using the CRC for performance reasons.
You should always be able to regenerate the set of current CRC values from the data stored in your safe, persistent target.
In fact, for a production application, I consider a dedicated, tested, and versioned process to do that an absolutely required deliverable.
As long as this is true, I feel there is nothing wrong with keeping the CRCs in a hash file or other 'less persistent' store.
It may take a long time to regenerate, but it's there, just like any other caching store.
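Something along these lines (a sketch only; SQLite stands in for the real target, and the schema is invented):

[code]
import sqlite3, zlib

def rebuild_crc_cache(db_path):
    # Recompute the CRC for the current (latest) version of every key,
    # straight from the persistent target; the cache is then disposable.
    crcs = {}
    con = sqlite3.connect(db_path)
    query = ("SELECT id, name, city FROM dim_customer t "
             "WHERE effective_ts = (SELECT MAX(effective_ts) "
             "FROM dim_customer d WHERE d.id = t.id)")
    for key, name, city in con.execute(query):
        crcs[key] = zlib.crc32(f"{name}|{city}".encode("utf-8"))
    con.close()
    return crcs
[/code]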

Slightly off topic, how does your process handle the issue of CRC collisions?

Carter
alanwms
Charter Member
Posts: 28
Joined: Wed Feb 26, 2003 2:51 pm
Location: Atlanta/UK

Post by alanwms »

Shawn's approach (adding the CRC value to the target table) would work great, especially with a large target table. I'm sure he and Mike Hester talked at length about how to implement the CRC value for CDC. As for collisions (as long as you use the CRC32 routine), it is highly unlikely that two different strings of significant length for the same natural key would return the exact same CRC value. Remember that the CRC value is based on a string, and a longer string decreases the likelihood of the CRC routine returning identical values. If the incoming string is delimited and you want to compute the CRC value on the whole string, you'd need to set the delimiter to none in your passive (Sequential File) stage.

Alan
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

What Alan and Shawn recommend is probably the best practice regarding the usage of CRC32 in SCD processing. You need not be concerned about the likelihood of "collisions", for the reasons that Alan pointed out plus some fairly complex math. The likelihood of a failure to generate the correct CRC for a given row, thereby incorrectly updating or not updating a particular row, is 1 in 4,294,967,296. This does not mean that if you process 4,294,967,296 rows of data in a given run a failure will happen (and in all likelihood you will not see a failure in a recordset of this size).
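For the curious, the arithmetic is easy to check (a quick sketch; the row count is only an example):

[code]
# A missed change requires the old and new strings for the SAME key
# to produce the same 32-bit CRC: 1 chance in 2**32 per comparison.
per_comparison = 1 / 2**32          # 1 in 4,294,967,296
rows_per_run = 10_000_000
expected_misses = rows_per_run * per_comparison
print(f"{per_comparison:.2e} per row; ~{expected_misses:.4f} "
      f"expected misses across {rows_per_run:,} rows")
[/code]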
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

baabi_26 wrote:I am trying to show my job design to handle the change capture. How can I paste/attach my job design in the forum/message.
This forum does not support attachments of any kind.

The recommended approach is to show your job design inside a set of Code tags. Use Preview iteratively until the layout is correct before submitting.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Well... not attachments of any kind, but it does support images as long as they are hosted somewhere accessible. So it would be possible (and I seem to recall at least one person actually doing this here) to take a screen print of a job, save it as a .jpg or .bmp (etc.), put it on an image hosting site, and then use the Img tags to show it here.

Here are some well-regarded sites to consider, if you don't own your own:

http://www.imageshack.us/
http://www.photobucket.com/
http://www.photodump.com/

Let me see if it works... brb...

[Image: thumbnail of the example screen shot, hosted on ImageShack]

Click on the thumbnail for the Big Picture. :wink:

This is done from ImageShack and you don't even need to register. The black bar on the bottom of the thumbnail is (apparently) optional, and they provided the forum code complete with tags; I just copied and pasted it into the post. 8)
-craig

"You can never have too many knives" -- Logan Nine Fingers
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

Craig,

Very, very cool! Now we can share images with others - I like it!
Post Reply