Generating Surrogate Keys via hash functions

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

If you make a search on CRC32 you ll find the reason. There were lot many discussion went prior to this. You can find may of my own thread as well.
I guess its mhester, who gave the good valid point on why not to use.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Yuan_Edward
Participant
Posts: 73
Joined: Tue May 10, 2005 6:21 pm
Location: Sydney

Post by Yuan_Edward »

Craig, I am still looking for that suitable hashing algorithm and haven't decided to use CRC32 or MD5.

I know the traditional surrogate key is a good choice, but just want to know what is the drawback of the hashing approach? Thank you.
chulett wrote:I know no-one has asked for it, but thought I would throw in my two cents. I've said this before and I'll say it again. :wink:

I would not consider any kind of 'hashing' algorithm suitable for surrogate keys. Period. If forced to choose between your two choices, I'd go MD5 over CRC32 or would push for one of the 'newer' ones whose names escape me at this hour of the morning. :?

Plenty of references to this inappropriateness out in the wild. This is one that took all of two seconds to find, I'm sure there are 'better' ones.

IMHO - stick with a 'traditional' surrogate key, a big incremented integer. Assigned or derive absolutely no intelligence to or from it. And depending on how you handle the generation, you may not even need to make the trip back to the database to look it up you seem to want to avoid. At worst case, it's one trip to 'seed' the correct starting point at the beginning and then possibly one at the end to record the last one used.
Edward Yuan
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Check if this link can give you any help.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Yuan_Edward
Participant
Posts: 73
Joined: Tue May 10, 2005 6:21 pm
Location: Sydney

Post by Yuan_Edward »

Hi Craig, thanks for the reply. I am still looking for the suitable function and haven't decided to use CRC32 or MD5 yet.

I know the traditional surrogate key is a good choice but just want to get some comments on what will be the drawback of the hashing approach.
chulett wrote:I know no-one has asked for it, but thought I would throw in my two cents. I've said this before and I'll say it again. :wink:

I would not consider any kind of 'hashing' algorithm suitable for surrogate keys. Period. If forced to choose between your two choices, I'd go MD5 over CRC32 or would push for one of the 'newer' ones whose names escape me at this hour of the morning. :?

Plenty of references to this inappropriateness out in the wild. This is one that took all of two seconds to find, I'm sure there are 'better' ones.

IMHO - stick with a 'traditional' surrogate key, a big incremented integer. Assigned or derive absolutely no intelligence to or from it. And depending on how you handle the generation, you may not even need to make the trip back to the database to look it up you seem to want to avoid. At worst case, it's one trip to 'seed' the correct starting point at the beginning and then possibly one at the end to record the last one used.
Edward Yuan
Yuan_Edward
Participant
Posts: 73
Joined: Tue May 10, 2005 6:21 pm
Location: Sydney

Post by Yuan_Edward »

:cry: I wish there was a perfect hashing algorithm to guarantee uniqueness. That is an issue then. Thanks, Ray.
ray.wurlod wrote:There's one major drawback. It does not guarantee uniqueness. For example CRC32 has about a 1 in 4000000 chance of generating a duplicate. Four million is not that many rows.

The entire raison d ...
Edward Yuan
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

The number of byte you increase, the less the chance of getting wrong value. But could try to generated the Surrogate key using the sequence numbers, and use the same for DB1 and DB2.
Perhaps you can try to assign SK for all the rows irrespective of data issue. This way you can easily follow your business rule.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Post Reply