Should I use a checksum?

jiegao
Premium Member
Posts: 46
Joined: Fri Sep 22, 2006 6:12 pm

Should I use a checksum?

Post by jiegao »

I need to compare a table row with a row in a file to see if any field has changed. I am thinking of concatenating the required fields and doing the comparison, but I am not sure if I should instead calculate a checksum and compare that. Which one is more efficient? Thank you in advance.
Regards
Jie
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

If you plan on using a checksum for change management, then typically one would include a checksum field in the record with the value for the current row. Then you can generate one for the incoming row and compare the old v. new to see if anything has changed. Store the new checksum with the new / updated record. Note that this works just fine as long as you don't care which field changed. :wink:
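
Roughly, the flow looks like this - a minimal sketch in Python rather than DataStage syntax, with made-up field values and MD5 chosen purely for illustration:

Code: Select all

import hashlib

def row_checksum(fields, delimiter="|"):
    # Null-safe, delimited concatenation of the tracked fields
    joined = delimiter.join("" if f is None else str(f) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

stored_checksum = row_checksum(["12", "345"])   # saved with the current row
new_checksum = row_checksum(["123", "45"])      # computed for the incoming row

if new_checksum != stored_checksum:
    print("changed - update the record and store the new checksum")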

I'd also suggest putting a standard delimiter between each field, converting anything that is not a string to a string, and standardizing things like dates so that "6/8/12" and "06/08/2012" aren't considered different values. The delimiter also stops silly things like this happening:

Code: Select all

      Col1    Col2
Old:  12      345
New:  123     45
Simply cat'd together, "12345" = "12345" so no change. A delimiter like a pipe would show that "12|345" != "123|45".
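
In illustrative Python terms (the values are made up):

Code: Select all

old = "12" + "345"               # "12345"
new = "123" + "45"               # "12345" as well
print(old == new)                # True - a false "no change"

old = "|".join(["12", "345"])    # "12|345"
new = "|".join(["123", "45"])    # "123|45"
print(old == new)                # False - the delimiter exposes the shift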
-craig

"You can never have too many knives" -- Logan Nine Fingers
jiegao
Premium Member
Posts: 46
Joined: Fri Sep 22, 2006 6:12 pm

Post by jiegao »

Thanks Craig. I am not planning to store the checksum on the table record. Does it make sense to calculate both checksums and compare them? I am more concerned about performance. I plan to just compare the concatenated values. What do you think?
Regards
Jie
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

While checksum algorithms are pretty efficient, they still take CPU cycles. I would put the CRC value in the data with each record; that will at least halve the number of computations, since each record is then only compared one time.
Also remember that each field you concatenate needs to be checked for NULLs, which need to be replaced with something. Appending a NULL string to any other string results in a NULL string.
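For example, in illustrative Python terms (in the job itself you'd use a transformer null-handling function such as NullToValue):

Code: Select all

import zlib

fields = ["12", None, "345"]

# Replace NULLs before concatenating, or the whole key becomes NULL
safe = "|".join("<NULL>" if f is None else f for f in fields)
print(safe)   # 12|<NULL>|345

# CRC32 over the concatenated string (conceptually what the CRC32 stage does)
print(zlib.crc32(safe.encode("utf-8")))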
I have used the Generic Stage and the CRC32 stage to dynamically generate a checksum; it uses a list of column names in schema format as input to determine the concatenated string for the CRC32 computation.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

If you are not planning on storing it, then I don't see the point of computing it, especially twice - simply compare the two concatenated strings. And I forgot to mention the null issue, my bad. My only excuse is working with Informatica for the last two years. It doesn't seem to care about concatenating null fields together and automagically does some sort of 'NullToEmpty' equivalent behind the scenes, so we just paste'm together. :wink:
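
In other words, just build and compare the delimited keys directly - Python again purely for illustration:

Code: Select all

def row_key(fields, delimiter="|"):
    # NullToEmpty equivalent: replace NULLs, then concatenate with a delimiter
    return delimiter.join("" if f is None else str(f) for f in fields)

changed = row_key(["12", None, "345"]) != row_key(["12", "x", "345"])
print(changed)   # True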
-craig

"You can never have too many knives" -- Logan Nine Fingers
jiegao
Premium Member
Posts: 46
Joined: Fri Sep 22, 2006 6:12 pm

Post by jiegao »

Thanks Craig
Regards
Jie