Should I use a checksum?

jiegao
Premium Member
Posts: 46
Joined: Fri Sep 22, 2006 6:12 pm

Should I use a checksum?

Post by jiegao »

I need to compare a table row with a row in a file to see if any field has changed. I am thinking of concatenating the required fields and doing the comparison, but I am not sure if I should instead calculate a checksum and compare that. Which one is more efficient? Thank you in advance.
Regards
Jie
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

If you plan on using a checksum for change management, then typically one would include a checksum field in the record with the value for the current row. Then you can generate one for the incoming row and compare the old v. new to see if anything has changed. Store the new checksum with the new / updated record. Note that this works just fine as long as you don't care which field changed. :wink:
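
Roughly, the flow looks like this - a minimal sketch in Python rather than DataStage syntax, with made-up field values and MD5 chosen purely for illustration:

Code: Select all

import hashlib

def row_checksum(fields, delimiter="|"):
    # Null-safe, delimited concatenation of the tracked fields
    joined = delimiter.join("" if f is None else str(f) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

stored_checksum = row_checksum(["12", "345"])   # saved with the current row
new_checksum = row_checksum(["123", "45"])      # computed for the incoming row

if new_checksum != stored_checksum:
    print("changed - update the record and store the new checksum")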

I'd also suggest putting a standard delimiter between each field, converting anything that is not a string to a string, and standardizing things like dates so that "6/8/12" and "06/08/2012" aren't considered different values. The delimiter also stops silly things like this happening:

Code: Select all

      Col1    Col2
Old:  12      345
New:  123     45
Simply cat'd together, "12345" = "12345" so no change. A delimiter like a pipe would show that "12|345" != "123|45".
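
In illustrative Python terms (the values are made up):

Code: Select all

old = "12" + "345"               # "12345"
new = "123" + "45"               # "12345" as well
print(old == new)                # True - a false "no change"

old = "|".join(["12", "345"])    # "12|345"
new = "|".join(["123", "45"])    # "123|45"
print(old == new)                # False - the delimiter exposes the shift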
-craig

"You can never have too many knives" -- Logan Nine Fingers
jiegao
Premium Member
Posts: 46
Joined: Fri Sep 22, 2006 6:12 pm

Post by jiegao »

Thanks Craig. I am not planning to store the checksum on the table record. Does it make sense to calculate both checksums and compare them? I am more concerned about performance. I plan to just compare the concatenated values. What do you think?
Regards
Jie
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

While checksum algorithms are pretty efficient, they still take CPU cycles. I would put the CRC value in the data with each record; that will at least halve the number of computations, since each record is then only compared one time.
Also remember that each field you concatenate needs to be checked for NULLs, which need to be replaced with something. Appending a NULL string to any other string results in a NULL string.
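For example, in illustrative Python terms (in the job itself you'd use a transformer null-handling function such as NullToValue):

Code: Select all

import zlib

fields = ["12", None, "345"]

# Replace NULLs before concatenating, or the whole key becomes NULL
safe = "|".join("<NULL>" if f is None else f for f in fields)
print(safe)   # 12|<NULL>|345

# CRC32 over the concatenated string (conceptually what the CRC32 stage does)
print(zlib.crc32(safe.encode("utf-8")))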
I have used the Generic Stage and the CRC32 stage to dynamically generate a checksum; it uses a list of column names in schema format as input to determine the concatenated string for the CRC32 computation.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

If you are not planning on storing it, then I don't see the point of computing it, especially twice - simply compare the two concatenated strings. And I forgot to mention the null issue, my bad. My only excuse is working with Informatica for the last two years. It doesn't seem to care about concatenating null fields together and automagically does some sort of 'NullToEmpty' equivalent behind the scenes, so we just paste'm together. :wink:
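
In other words, just build and compare the delimited keys directly - Python again purely for illustration:

Code: Select all

def row_key(fields, delimiter="|"):
    # NullToEmpty equivalent: replace NULLs, then concatenate with a delimiter
    return delimiter.join("" if f is None else str(f) for f in fields)

changed = row_key(["12", None, "345"]) != row_key(["12", "x", "345"])
print(changed)   # True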
-craig

"You can never have too many knives" -- Logan Nine Fingers
jiegao
Premium Member
Posts: 46
Joined: Fri Sep 22, 2006 6:12 pm

Post by jiegao »

Thanks Craig
Regards
Jie