Storing CRC values

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

RodBarnes
Charter Member
Posts: 182
Joined: Fri Mar 18, 2005 2:10 pm

Storing CRC values

Post by RodBarnes »

I'd like some input from the community.

Background:
I've been working with DataStage for about six months now and feel pretty competent with it. I am using a CRC-check model in which a change in a record is detected by a change in the CRC generated over the pertinent fields. The newly generated value is compared to the stored value from the latest insert and, if different, the new record is inserted. (We're using a type 2 update model where all changes are inserts of a new record.)
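In rough Python terms, the logic looks something like the sketch below. (This is just an illustration, not the actual job: zlib's crc32 stands in for the DataStage CRC32 routine, and all the names are invented.)

[code]
import zlib

def crc_for_row(row, fields):
    # Concatenate the pertinent fields with a delimiter and CRC the result.
    payload = "|".join(str(row[f]) for f in fields)
    return zlib.crc32(payload.encode("utf-8"))

def changed_rows(rows, key_field, fields, stored_crcs):
    # Yield rows whose CRC differs from the stored value for the same key.
    # Type 2 model: every change becomes an insert of a new record.
    for row in rows:
        new_crc = crc_for_row(row, fields)
        if stored_crcs.get(row[key_field]) != new_crc:
            stored_crcs[row[key_field]] = new_crc  # remember the latest CRC
            yield row

# usage sketch: stored_crcs plays the role of the persistent hashed file
stored_crcs = {}
rows = [{"id": 1, "name": "Ann", "city": "Perth"}]
to_insert = list(changed_rows(rows, "id", ["name", "city"], stored_crcs))
[/code]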

My question:
Currently, I am just storing the CRC values in a hash file that is maintained across ETL runs. Someone shared a concern that this seems prone to issues since the hash files are separate from the target database. I don't necessarily agree, but wondered what the "best practice" might be in this case. Specifically, should I be writing these CRCs out to a database table somewhere? That would mean reading them back in -- or regenerating the current values each time -- both of which would seem to increase the processing time over simply reading them from an existing hash file.

I don't see any issue with a hash file having a life outside of a single ETL run. As long as the hash file and the target tables are being backed up so they can be restored together it shouldn't be an issue, right? But I was a PICK programmer in a former life so I admit to being very comfortable with hash files -- unlike some who are familiar only with the relational world. :-)

Thanks in advance for your thoughts.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The vendor will argue that hashed files were never intended for persistent storage, and now that UniVerse and DataStage are separate products that argument carries more weight. If they ever decide to remove hashed file support from the product, you'll have a problem.
A suitable replacement already exists in the parallel job environment, so don't dismiss this possibility. The safest long-term strategy is to use a form of persistent storage that you know will not vanish in future, such as a database table or - if possible for your design - a text file used as a persistent data set.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
baabi_26
Participant
Posts: 14
Joined: Mon Jan 24, 2005 5:31 pm

Post by baabi_26 »

We have a scenario where we have to capture change-record information. This may or may not help you much, but this is how we are doing it.

Every week we get a source file which has old as well as changed record information. We compare the old file with the new file and identify which records are inserts, updates, deletes, or unchanged. I have created a DataStage job to handle this. (We don't have the parallel canvas.) We retain the old file (sequential) to do the comparison. Everything is handled in a single job. Let me know if you are interested in how I am implementing it; I can post a screen shot of it here and explain.
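In rough Python terms the idea is just this (a sketch only, not the actual DataStage job; the paths and field names are invented):

[code]
import csv

def classify(old_path, new_path, key_field):
    # Load last week's and this week's files keyed on the business key.
    with open(old_path, newline="") as f:
        old = {r[key_field]: r for r in csv.DictReader(f)}
    with open(new_path, newline="") as f:
        new = {r[key_field]: r for r in csv.DictReader(f)}

    inserts   = [new[k] for k in new.keys() - old.keys()]
    deletes   = [old[k] for k in old.keys() - new.keys()]
    common    = new.keys() & old.keys()
    updates   = [new[k] for k in common if new[k] != old[k]]
    unchanged = [new[k] for k in common if new[k] == old[k]]
    return inserts, updates, deletes, unchanged
[/code]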
RodBarnes
Charter Member
Posts: 182
Joined: Fri Mar 18, 2005 2:10 pm

Post by RodBarnes »

Yes, I would be interested in learning more about the method you are using. It sounds like, during each ETL run, it makes a pass through the source data and generates a list of which records are inserts, updates, or deletes. Then this list is used in later processing to control the handling of each record.

Do I understand correctly?
baabi_26 wrote:We have a scenario where we have to capture change-record information. This may or may not help you much, but this is how we are doing it.

Every week we get a source file which has old as well as changed record information. We compare the old file with the new file and identify which records are inserts, updates, deletes, or unchanged. I have created a DataStage job to handle this. (We don't have the parallel canvas.) We retain the old file (sequential) to do the comparison. Everything is handled in a single job. Let me know if you are interested in how I am implementing it; I can post a screen shot of it here and explain.
alanwms
Charter Member
Posts: 28
Joined: Wed Feb 26, 2003 2:51 pm
Location: Atlanta/UK

Post by alanwms »

Rod,

Having worked with the CRC32 routine before for change data capture, I understand your dilemma. I have addressed the persistent hash file issue by unloading the hash files to flat files at the conclusion of any update process. This gives me a backup copy of the complete hash file contents that can be reloaded in the event the file becomes damaged or needs to be cleared (just load it from an empty flat file). You can also archive the flat file (add a timestamp to its name) so you can go back and reload the CRC values from a known point in time.
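In outline, the unload/reload cycle amounts to this (a Python-flavoured sketch of the idea, not the actual DataStage steps; the paths and names are invented):

[code]
import time

def unload_crcs(crc_map, archive_dir):
    # Dump the key -> CRC mapping to a timestamped flat file at end of run.
    stamp = time.strftime("%Y%m%d_%H%M%S")
    path = f"{archive_dir}/crc_snapshot_{stamp}.txt"
    with open(path, "w") as f:
        for key, crc in crc_map.items():
            f.write(f"{key}\t{crc}\n")
    return path

def reload_crcs(path):
    # Rebuild the lookup from a snapshot; an empty file yields an
    # empty lookup, which is equivalent to clearing the hash file.
    crcs = {}
    with open(path) as f:
        for line in f:
            key, crc = line.rstrip("\n").split("\t")
            crcs[key] = int(crc)
    return crcs
[/code]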

You can always set up a separate process to just generate the CRC32 values from the database, store them in a flat file, and then reload the hash file lookup from the flat file. If you use flat files to keep snapshots of your hash files, you can easily migrate to lookup data sets and Lookup stages in the Parallel framework (EE) when the time comes. I've had to modify my approach to hash files within DataStage away from persistent storage toward temporary storage, even though I am an old, old PICK programmer (from '79). Although hash files are very flexible and powerful, the knowledge base on them seems to be dwindling, and more robust solutions that provide the same functionality are available in the newest releases of DataStage.

Alan
baabi_26
Participant
Posts: 14
Joined: Mon Jan 24, 2005 5:31 pm

Post by baabi_26 »

Hey, how do I copy/paste my job design here? Any help please.
RodBarnes
Charter Member
Posts: 182
Joined: Fri Mar 18, 2005 2:10 pm

Post by RodBarnes »

Thanks for the input, Alan. Much appreciated. I will look into your idea of exporting the CRC data to a flat file and then reimporting on each run. I'll compare the performance of this with just regenerating the CRC from the database. I expect the load-from-flat-file will be much faster.
alanwms wrote:Having worked with the CRC32 routine before for doing change data capture, I understand your dilemma...
baabi_26
Participant
Posts: 14
Joined: Mon Jan 24, 2005 5:31 pm

Post by baabi_26 »

I am trying to show my job design for handling the change capture. How can I paste/attach my job design in the forum/message?
shawn_ramsey
Participant
Posts: 145
Joined: Fri May 02, 2003 9:59 am
Location: Seattle, Washington. USA

Post by shawn_ramsey »

Rod,

With much work we finally convinced the modeler to allow us to add the CRC to the data warehouse model as a column in the destination table. To keep the users from getting confused, we filter the column out via the reporting tool's metadata. This worked well for us since the relational database persists and is backed up regularly, whereas the ETL environment would be recreated from certification if required.
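Roughly, the layout looks like this (a sketch only: SQLite stands in for our actual database, and the table and column names are invented):

[code]
import sqlite3, zlib

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE dim_customer (
    id INTEGER, name TEXT, city TEXT,
    row_crc INTEGER,        -- hidden from users via the reporting tool's metadata
    effective_ts TEXT)""")

def upsert_type2(con, row):
    # Insert a new version only when the stored CRC for the key differs.
    new_crc = zlib.crc32(f"{row['name']}|{row['city']}".encode("utf-8"))
    hit = con.execute(
        "SELECT row_crc FROM dim_customer WHERE id = ? "
        "ORDER BY effective_ts DESC LIMIT 1", (row["id"],)).fetchone()
    if hit is None or hit[0] != new_crc:
        con.execute("INSERT INTO dim_customer VALUES (?,?,?,?,datetime('now'))",
                    (row["id"], row["name"], row["city"], new_crc))

upsert_type2(con, {"id": 1, "name": "Ann", "city": "Perth"})
[/code]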
Shawn Ramsey

"It is a mistake to think you can solve any major problems just with potatoes."
-- Douglas Adams
clshore
Charter Member
Posts: 115
Joined: Tue Oct 21, 2003 11:45 am

Post by clshore »

I assume that you are using the CRC for performance reasons.
You should always be able to regenerate the set of current CRC values from the data stored in your safe, persistent target.
In fact, for a production application, I consider a dedicated, tested, and versioned process to do that an absolutely required deliverable.
As long as this is true, I feel there is nothing wrong with keeping the CRCs in a hash file or other 'less persistent' store.
It may take a long time to regenerate, but it's there, just like any other caching store.
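Something along these lines (a sketch only; SQLite stands in for the real target, and the schema is invented):

[code]
import sqlite3, zlib

def rebuild_crc_cache(db_path):
    # Recompute the CRC for the current (latest) version of every key,
    # straight from the persistent target; the cache is then disposable.
    crcs = {}
    con = sqlite3.connect(db_path)
    query = ("SELECT id, name, city FROM dim_customer t "
             "WHERE effective_ts = (SELECT MAX(effective_ts) "
             "FROM dim_customer d WHERE d.id = t.id)")
    for key, name, city in con.execute(query):
        crcs[key] = zlib.crc32(f"{name}|{city}".encode("utf-8"))
    con.close()
    return crcs
[/code]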

Slightly off topic, how does your process handle the issue of CRC collisions?

Carter
alanwms
Charter Member
Posts: 28
Joined: Wed Feb 26, 2003 2:51 pm
Location: Atlanta/UK

Post by alanwms »

Shawn's approach (adding the CRC value to the target table) would work great, especially with a large target table. I'm sure he and Mike Hester talked at length about how to implement the CRC value for CDC. As for collisions (as long as you use the CRC32 routine), it is highly unlikely that two different strings of significant length for the same natural key would return the exact same CRC value. Remember that the CRC value is based on a string, and a longer string decreases the likelihood of the CRC routine returning identical values. If the incoming string is delimited and you want to compute the CRC value on the whole string, you'd need to set the delimiter to none in your passive (Sequential File) stage.

Alan
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

What Alan and Shawn recommend is probably the best practice regarding the usage of CRC32 in SCD processing. You need not be concerned about the likelihood of "collisions", for the reasons that Alan pointed out plus some fairly complex math. The likelihood of a failure to generate the correct CRC for a given row, thereby incorrectly updating or not updating a particular row, is 1 in 4,294,967,296. This does not mean that if you process 4,294,967,296 rows of data in a given run a failure will happen (and in all likelihood you will not see a failure in a recordset of this size).
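For the curious, the arithmetic is easy to check (a quick sketch; the row count is only an example):

[code]
# A missed change requires the old and new strings for the SAME key
# to produce the same 32-bit CRC: 1 chance in 2**32 per comparison.
per_comparison = 1 / 2**32          # 1 in 4,294,967,296
rows_per_run = 10_000_000
expected_misses = rows_per_run * per_comparison
print(f"{per_comparison:.2e} per row; ~{expected_misses:.4f} "
      f"expected misses across {rows_per_run:,} rows")
[/code]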
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

baabi_26 wrote:I am trying to show my job design to handle the change capture. How can I paste/attach my job design in the forum/message.
This forum does not support attachments of any kind.

The recommended approach is to show your job design inside a set of Code tags. Use Preview iteratively until the layout is correct before submitting.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Well... not attachments of any kind, but it does support images as long as they are hosted somewhere accessible. So it would be possible (and I seem to recall at least one person actually doing this here) to take a screen print of a job, save it as a .jpg or .bmp (etc.), put it on an image hosting site, and then use the Img tags to show it here.

Here are some well-regarded sites to consider, if you don't own your own:

http://www.imageshack.us/
http://www.photobucket.com/
http://www.photodump.com/

Let me see if it works... brb...

[Image: thumbnail of the example screen shot, hosted on ImageShack]

Click on the thumbnail for the Big Picture. :wink:

This is done from ImageShack and you don't even need to register. The black bar on the bottom of the thumbnail is (apparently) optional, and they provided the forum code complete with tags; I just copied and pasted it into the post. 8)
-craig

"You can never have too many knives" -- Logan Nine Fingers
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

Craig,

Very, very cool! Now we can share images with others - I like it!
Post Reply