MD5 CheckSum key generator

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

sharma
Premium Member
Posts: 46
Joined: Mon Dec 24, 2007 2:16 pm

MD5 CheckSum key generator

Post by sharma »

Hi ,

I want to use the built-in Checksum stage in DataStage 8.1 to generate a unique key.
I have a couple of questions regarding this.
1. It generates the checksum field in alphanumeric form. Can it generate the checksum as an integer/BigInt instead?

2. We are getting 5 billion records a day (maybe more in the future) and we want to generate a unique key for all of the records. Does this stage have the capability to generate consistent, unique values for 5-6 billion records? Or is it just built to handle a small number of records?

I just need to know the accuracy and performance of this stage for this volume of data (approx. 6 billion records).

Please reply ASAP, as I need to decide whether to use the built-in DataStage Checksum stage or write my own C++ code for the checksum.

Regards
~Nirmal
Nirmal Sharma
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

There's nothing about any "checksum" that is meant to be used to generate unique / surrogate keys, so you are completely off track there. What's wrong with a normal incrementing surrogate? :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
sharma
Premium Member
Posts: 46
Joined: Mon Dec 24, 2007 2:16 pm

Post by sharma »

Our table has 30 columns, and we need to compare all 30 columns of one record with another to determine duplicates; we also do almost the same thing to determine parent-child relationships.

So it is very cumbersome to use all 30 columns every time. Instead, we generate a unique key out of these 30 columns and then use that key everywhere else in the code, which makes things simpler and more maintainable.
So we need a checksum and not a surrogate key (which is very different from a checksum).
We are currently using C++ code to generate the checksum keys, but I want to use the built-in DataStage stage.
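
For illustration, a minimal C++ sketch of the idea: concatenate the column values with a delimiter and hash the result. This assumes OpenSSL's MD5() API is available (link with -lcrypto); the column values shown are made up, not from any real table.

#include <openssl/md5.h>
#include <cstdio>
#include <string>
#include <vector>

// Build one MD5 "key" from several column values: join them with a
// separator byte that cannot occur in the data, then hash the whole
// string and render the 16-byte digest as a 32-character hex string.
std::string row_checksum(const std::vector<std::string>& columns) {
    std::string buf;
    for (const auto& col : columns) {
        buf += col;
        buf += '\x01';  // field separator
    }
    unsigned char digest[MD5_DIGEST_LENGTH];
    MD5(reinterpret_cast<const unsigned char*>(buf.data()), buf.size(), digest);

    char hex[2 * MD5_DIGEST_LENGTH + 1];
    for (int i = 0; i < MD5_DIGEST_LENGTH; ++i)
        std::snprintf(hex + 2 * i, 3, "%02x", digest[i]);
    return std::string(hex, 2 * MD5_DIGEST_LENGTH);
}

int main() {
    // Hypothetical record: just a few of the 30 columns.
    std::printf("%s\n", row_checksum({"NIRMAL", "2008-01-15", "42"}).c_str());
}

The separator byte matters: without it, the column lists ("ab","c") and ("a","bc") would hash to the same key.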

Can you please now answer my questions above (in my first post)?

Regards
~Nirmal
Nirmal Sharma
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

That's fine, and a pretty common use of that function; what I was commenting on was your first sentence: "I want to use the inbuilt checkSum stage in DS8.1 to generate a unique key". It doesn't generate a unique key, although we've had people attempt to use it for that purpose, hence the confusion. It does, however, generate a checksum for those field values that can be compared to future checksums for that same record to know if any values have changed, for CDD.

An MD5 checksum is alphanumeric, period.
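
As an aside on the integer question: the digest is 128 bits, which is wider than any 64-bit BigInt column, hence the hex text. If an integer form were truly required, the digest bytes could be reinterpreted as two 64-bit integers outside the stage. This is purely an illustrative sketch (again assuming OpenSSL), not a stage option:

#include <openssl/md5.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    // Hypothetical pre-concatenated record, fields joined by \x01.
    const char row[] = "NIRMAL\x01" "2008-01-15\x01" "42";
    unsigned char digest[MD5_DIGEST_LENGTH];
    MD5(reinterpret_cast<const unsigned char*>(row), sizeof row - 1, digest);

    // Reinterpret the 16 digest bytes as two 64-bit integers.
    std::uint64_t hi, lo;
    std::memcpy(&hi, digest, 8);
    std::memcpy(&lo, digest + 8, 8);
    std::printf("hi=%llu lo=%llu\n",
                (unsigned long long)hi, (unsigned long long)lo);
    // Keeping only one 64-bit half weakens collision resistance
    // from 2^128 down to 2^64 possible values.
}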
-craig

"You can never have too many knives" -- Logan Nine Fingers
sharma
Premium Member
Posts: 46
Joined: Mon Dec 24, 2007 2:16 pm

Post by sharma »

OK, so can you please answer my 2nd question regarding the performance and accuracy?

Will it work correctly for billions of records?

Regards
~Nirmal
Nirmal Sharma
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I did, mostly. It will "work correctly": the number of records really has no bearing on this except for performance, and I can't address that for billions of records; I've done it with millions but not billions. There's obviously some overhead to generating any hash, so you'll need to run some tests and see if it is "fast enough" in your environment for those volumes to be acceptable.
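
A rough throughput test along these lines could bound the raw hashing cost before testing the stage itself. This is a hypothetical standalone harness (again assuming OpenSSL), not DataStage:

#include <openssl/md5.h>
#include <chrono>
#include <cstdio>

int main() {
    const long N = 10000000;  // sample size; extrapolate to the real volume
    char row[128];
    unsigned char digest[MD5_DIGEST_LENGTH];

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < N; ++i) {
        // Synthetic "record"; a real test would use rows of realistic width.
        int len = std::snprintf(row, sizeof row, "colA\x01" "colB\x01" "%ld", i);
        MD5(reinterpret_cast<const unsigned char*>(row), len, digest);
    }
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%.0f hashes/sec; ~%.1f hours for 6 billion rows on one CPU\n",
                N / secs, 6e9 * secs / N / 3600.0);
}

Hashing is embarrassingly parallel, so in a parallel job the per-node figure divides across the configured nodes.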
-craig

"You can never have too many knives" -- Logan Nine Fingers
sharma
Premium Member
Posts: 46
Joined: Mon Dec 24, 2007 2:16 pm

Post by sharma »

Thanks Craig,

I am more concerned about the accuracy than the performance.

We can handle the performance if it gives correct results as the volume of records increases.

Regards
~Nirmal
Nirmal Sharma
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Any hash or checksum algorithm will have its limitations. For example, CRC32 is necessarily limited by its 32-bit result to about 4.3 billion distinct values: the chance of two given records sharing a value is only about one in four billion, and with more records than values, collisions are certain. The greater the number of potentially unique values, the "bigger" the algorithm you need. There are variations on MD5 - search the internet for details - but you may need to "roll your own".
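
To put rough numbers on this, here is a back-of-the-envelope birthday-bound sketch; the 6-billion figure comes from the earlier posts:

#include <cmath>
#include <cstdio>

// Birthday bound: P(at least one collision among n random b-bit hashes)
// is approximately 1 - exp(-n^2 / 2^(b+1)). expm1 keeps tiny values exact.
double collision_prob(double n, double bits) {
    double x = n * n / std::pow(2.0, bits + 1.0);
    return -std::expm1(-x);  // == 1 - exp(-x)
}

int main() {
    double n = 6e9;  // ~6 billion records/day, per the question above
    std::printf("CRC32 (32 bits):  p ~ %g (n exceeds 2^32, so certain)\n",
                collision_prob(n, 32));
    std::printf("MD5   (128 bits): p ~ %g\n",
                collision_prob(n, 128));
}

So CRC32 is guaranteed to collide at that volume, while the chance of even one MD5 collision among a day's 6 billion rows is on the order of 10^-20.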
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Still, why the worry about uniqueness in this application? All you care about is whether the MD5 hash for any given record changes over time, and who cares if the hash it carries is the same as any other record's? Or are you thinking that the field values could change to some odd combination of values that equates to the exact same value as the previous hash, thus losing a change? I would think the odds of that are pretty darn remote (on the order of one in 2^128 for any single change).
-craig

"You can never have too many knives" -- Logan Nine Fingers