
Credibility of Checksum and Alternatives

Posted: Wed May 22, 2013 3:00 am
by jerome_rajan
We're trying to determine whether the incremental data provided by our client's partners is indeed reliable and, if so, to what degree. To achieve this, we are performing the exercise below:

1. Day 1 - Take a full extract of the source from the partner for Day 1
2. Day 2 - Take a full extract of the source from the partner for Day 2 and also accept the incremental feed that they normally provide (i.e. Change data between day 1 and day 2)
3. Do a column-by-column comparison to check for any changes between the two full extracts we took on Day 1 and Day 2 and compare the output with the Incremental feed provided by the partner.

Now, the tables in question have volumes anywhere between 5 million and 60 million rows, and each table can have around 40-50 columns. A column-by-column comparison does not seem like a very good idea, so I suggested using the checksum strategy.
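
Roughly what I have in mind, sketched in Python outside of DataStage just to illustrate the idea (the file names, the delimiter and the single-column key are assumptions on my part):

Code:
# Illustration only: build a primary-key -> CRC32 map for each full extract,
# then derive our own "delta" to compare against the partner's incremental feed.
# File names, delimiter and key position are placeholders.
import csv
import zlib

def row_checksums(path, key_cols=1):
    """Map primary key -> CRC32 of all non-key columns."""
    sums = {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            key = tuple(row[:key_cols])
            payload = "|".join(row[key_cols:]).encode("utf-8")
            sums[key] = zlib.crc32(payload)
    return sums

day1 = row_checksums("full_extract_day1.csv")
day2 = row_checksums("full_extract_day2.csv")

changed  = {k for k in day1.keys() & day2.keys() if day1[k] != day2[k]}
inserted = day2.keys() - day1.keys()
deleted  = day1.keys() - day2.keys()
print(len(changed), len(inserted), len(deleted))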

My question: Is CRC32 reliable for the above-mentioned range of volumes? Is there a better alternative to achieve what we're trying to do here?

Posted: Wed May 22, 2013 5:32 am
by BI-RMA
Hi Jerome,

The important question is: what do you want to achieve by comparing the CRC32 of two data rows instead of comparing the column values themselves?

You'd still have to read all rows and all columns from both extracts and generate the checksum for both, so you do not reduce the amount of data that has to be imported in the job. I do not doubt that CRC32 is sufficiently reliable when you keep the primary key of your data and generate the checksum over all other columns. But I doubt your job is going to run much faster.

DataStage is easily capable of handling 5 to 60 million data rows in Change Capture when you're running it on an adequately sized Unix machine.
You say that you want to verify the reliability of your business partners. But you don't want to do that in a daily batch job, do you?

Posted: Wed May 22, 2013 6:10 am
by jerome_rajan
I thought that comparing a single checksum column would be more efficient than comparing each of the 40-50 columns. That was my idea behind using the checksum.

To answer your second question: this is not going to be a daily batch job that goes live. It will be run every day for the next month to monitor the feeds and gather a good enough sample to arrive at a conclusion.

Posted: Wed May 22, 2013 6:26 am
by BI-RMA
Hi Jerome,

Yes, sure it is. I just doubt your performance gains will be all that significant in a job you are only going to use during a limited test period.

Posted: Wed May 22, 2013 6:32 am
by jerome_rajan
Assuming the performance gains are significant, is CRC32 reliable enough for the data volumes mentioned? Are there better alternatives to this approach?

Posted: Wed May 22, 2013 6:53 am
by chulett
I wouldn't assume anything (especially anything 'significant') but rather do a quick test to see what kind of speed you get from the CRC32 function plus a single comparison vs. comparing all data columns. We're doing this using MD5 on sometimes a couple hundred columns (in another tool), mostly because it is easier than coding all of the checks and we don't have to worry about whether it takes 'a little longer' to do it this way. And your data volumes have nothing to do with the 'reliability' of this approach... just the speed. And the only way to find out what that speed is, is to code up some large-volume tests.
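
Something as crude as this is enough to get a feel for it (plain Python, synthetic data; the 45 columns and 100,000 rows are arbitrary choices):

Code:
# Crude timing harness: compare all columns vs. compare CRC32s computed on the fly.
# Synthetic data; column and row counts are arbitrary.
import random
import string
import time
import zlib

COLS, ROWS = 45, 100_000
new = [["".join(random.choices(string.ascii_letters, k=12)) for _ in range(COLS)]
       for _ in range(ROWS)]
old = [list(r) for r in new]   # pretend this is yesterday's version of each row

t0 = time.perf_counter()
diffs_by_column = sum(1 for a, b in zip(new, old) if a != b)
t1 = time.perf_counter()
diffs_by_crc = sum(1 for a, b in zip(new, old)
                   if zlib.crc32("|".join(a).encode()) != zlib.crc32("|".join(b).encode()))
t2 = time.perf_counter()
print(f"column compare: {t1 - t0:.3f}s   crc compare: {t2 - t1:.3f}s")

If the checksums have to be recomputed on both sides every run, don't be surprised if the plain column compare wins.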

If you do this for real, make sure you store the checksum in the target. That way you only ever have to compute it once per occurrence.
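
In Python terms, that amounts to something like this (the column names and the stored value are made up, not a real schema):

Code:
# "Store the checksum in the target" idea: compute the hash once when a row is
# written, then later change detection is a single column comparison.
import hashlib

NON_KEY_COLUMNS = ["first_name", "last_name", "status", "balance"]  # example list

def row_md5(row):
    """32-character hex MD5 over the concatenated non-key columns."""
    payload = "|".join(str(row[c]) for c in NON_KEY_COLUMNS)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

incoming = {"first_name": "Ann", "last_name": "Lee", "status": "A", "balance": 10.5}
stored_checksum = "d41d8cd98f00b204e9800998ecf8427e"  # would come from the target table

if row_md5(incoming) != stored_checksum:
    print("row changed - apply the update and store the new checksum")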

Posted: Thu May 23, 2013 12:16 am
by jerome_rajan
chulett wrote:...And your data volumes have nothing to do with the 'reliability' of this approach... just the speed....
This was the doubt I had. Since it's a 32-bit code, it would produce 2^32 - 1 values, right? Or is that understanding wrong?

Posted: Thu May 23, 2013 1:11 am
by ray.wurlod
Worse than that. It will produce at most 2^31-1 values, but only if your data provide that many distinct combinations. (You can improve this to 2^32 values using uint32, but the same caveat applies.)

Posted: Thu May 23, 2013 1:23 am
by jerome_rajan
Thank you Ray, Craig and Roland. The information you've provided is sufficient for me to proceed with a POC and make a decision.

Posted: Thu May 23, 2013 6:18 am
by chulett
8)

On the subject of "at most 2^32-1 values", I just wanted to chime in with the thought that (in my mind at least) it's not a concern for the application we're discussing. If you were trying to generate "unique" values like I've seen some people do in a misguided attempt to establish some kind of "intelligent" surrogate, then I'd be concerned. But you are basically just comparing X columns in one version of a single row with a new set of X columns for that same row to see if anything changed. I don't know what the odds are but I'd wager the odds of you losing a change because the new version of the row happened to produce the exact same checksum as the old version of that same row have to be... pretty darn small.
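
To put a rough number on "pretty darn small", assuming the checksum of a modified row is effectively uniform over the 32-bit range (an approximation, not a guarantee):

Code:
# Back-of-the-envelope: chance that a changed row slips through because its new
# CRC32 happens to equal the old one, assuming a roughly uniform 32-bit hash.
p_miss_per_changed_row = 1 / 2**32            # about 2.3e-10
for changed_rows_per_day in (100_000, 1_000_000, 10_000_000):
    expected_misses = changed_rows_per_day * p_miss_per_changed_row
    print(f"{changed_rows_per_day:>10,} changed rows/day -> "
          f"~{expected_misses:.6f} expected undetected changes per day")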

Posted: Thu May 23, 2013 4:27 pm
by ray.wurlod
Possibly, but with the Change Capture, Difference or Compare stages there's no scope at all. Which is better odds than "pretty darn small" imho.

Posted: Thu May 23, 2013 9:18 pm
by chulett
Sure but that wasn't the topic of discussion. Besides, by pretty darn small I basically meant zero. :wink:

Posted: Fri May 24, 2013 12:16 am
by chandra.shekhar@tcs.com
@Jerome
If you are using the Checksum stage, then you have the luxury of dealing with huge data volumes.
The Checksum stage produces a 32-byte code (datatype Char(32)).
So the number of possible combinations is 16^32 ≈ 3.4028e+38.
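
Quick sanity check of that figure (nothing DataStage-specific, just arithmetic):

Code:
# 32 hex characters = 32 * 4 = 128 bits, so 16^32 = 2^128 possible values.
print(16**32)              # 340282366920938463463374607431768211456
print(16**32 == 2**128)    # True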

Posted: Fri May 24, 2013 2:04 am
by jerome_rajan
Wow! For real?! Is it because the checksum generated is hexadecimal in nature?

Posted: Fri May 24, 2013 2:26 am
by chandra.shekhar@tcs.com
Yup, you are right.
Check the below post as well:

viewtopic.php?t=145383