Generating Surrogate Keys via hash functions

kumar_s · Post by **kumar_s** » Tue Mar 13, 2007 5:56 pm

If you make a search on CRC32 you ll find the reason. There were lot many discussion went prior to this. You can find may of my own thread as well.
I guess its mhester, who gave the good valid point on why not to use.

Yuan_Edward · Post by **Yuan_Edward** » Tue Mar 13, 2007 5:59 pm

Craig, I am still looking for that suitable hashing algorithm and haven't decided to use CRC32 or MD5.

I know the traditional surrogate key is a good choice, but just want to know what is the drawback of the hashing approach? Thank you.

chulett wrote:I know no-one has asked for it, but thought I would throw in my two cents. I've said this before and I'll say it again.

I would not consider any kind of 'hashing' algorithm suitable for surrogate keys. Period. If forced to choose between your two choices, I'd go MD5 over CRC32 or would push for one of the 'newer' ones whose names escape me at this hour of the morning.

Plenty of references to this inappropriateness out in the wild. This is one that took all of two seconds to find, I'm sure there are 'better' ones.

IMHO - stick with a 'traditional' surrogate key, a big incremented integer. Assigned or derive absolutely no intelligence to or from it. And depending on how you handle the generation, you may not even need to make the trip back to the database to look it up you seem to want to avoid. At worst case, it's one trip to 'seed' the correct starting point at the beginning and then possibly one at the end to record the last one used.

kumar_s · Post by **kumar_s** » Tue Mar 13, 2007 6:02 pm

Check if this link can give you any help.

Yuan_Edward · Post by **Yuan_Edward** » Tue Mar 13, 2007 6:03 pm

Hi Craig, thanks for the reply. I am still looking for the suitable function and haven't decided to use CRC32 or MD5 yet.

I know the traditional surrogate key is a good choice but just want to get some comments on what will be the drawback of the hashing approach.

chulett wrote:I know no-one has asked for it, but thought I would throw in my two cents. I've said this before and I'll say it again.

I would not consider any kind of 'hashing' algorithm suitable for surrogate keys. Period. If forced to choose between your two choices, I'd go MD5 over CRC32 or would push for one of the 'newer' ones whose names escape me at this hour of the morning.

Plenty of references to this inappropriateness out in the wild. This is one that took all of two seconds to find, I'm sure there are 'better' ones.

IMHO - stick with a 'traditional' surrogate key, a big incremented integer. Assigned or derive absolutely no intelligence to or from it. And depending on how you handle the generation, you may not even need to make the trip back to the database to look it up you seem to want to avoid. At worst case, it's one trip to 'seed' the correct starting point at the beginning and then possibly one at the end to record the last one used.

Yuan_Edward · Post by **Yuan_Edward** » Tue Mar 13, 2007 6:37 pm

I wish there was a perfect hashing algorithm to guarantee uniqueness. That is an issue then. Thanks, Ray.

ray.wurlod wrote:There's one major drawback. It does not guarantee uniqueness. For example CRC32 has about a 1 in 4000000 chance of generating a duplicate. Four million is not that many rows.

The entire raison d ...

kumar_s · Post by **kumar_s** » Wed Mar 14, 2007 1:35 am

The number of byte you increase, the less the chance of getting wrong value. But could try to generated the Surrogate key using the sequence numbers, and use the same for DB1 and DB2.
Perhaps you can try to assign SK for all the rows irrespective of data issue. This way you can easily follow your business rule.