MD5 CheckSum key generator

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

sharma
Premium Member
Posts: 46
Joined: Mon Dec 24, 2007 2:16 pm

MD5 CheckSum key generator

Post by sharma »

Hi ,

I want to use the built-in Checksum stage in DataStage 8.1 to generate a unique key.
I have a couple of questions regarding this.
1. It generates the checksum field in alphanumeric form. Can it generate the checksum as an integer/BigInt instead?

2. We are getting 5 billion records a day (maybe more in the future) and we want to generate a unique key for all of the records. Does this stage have the capability to generate consistent, unique values for 5-6 billion records? Or is it just built to handle a small number of records?

I just need to know the accuracy and performance of this stage for this volume of data (approx. 6 billion records).

Please reply ASAP, as I need to decide whether to use the built-in DataStage Checksum stage or write my own C++ code for the checksum.

Regards
~Nirmal
Nirmal Sharma
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

There's nothing about any "checksum" that is meant to be used to generate unique / surrogate keys, so you are completely off track there. What's wrong with a normal incrementing surrogate? :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
sharma
Premium Member
Posts: 46
Joined: Mon Dec 24, 2007 2:16 pm

Post by sharma »

Our table has 30 columns, and we need to compare all 30 columns of one record with another to determine duplicates; we also do almost the same thing to determine parent-child relationships.

So it is very cumbersome to use all 30 columns every time. Instead, we generate a unique key out of these 30 columns and then use that key everywhere else in the code, which makes things simpler and more maintainable.
So we need a checksum and not a surrogate key (which is very different from a checksum).
We are currently using C++ code to generate the checksum keys, but I want to use the built-in DataStage stage.
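
For illustration, a minimal C++ sketch of the idea: concatenate the column values with a delimiter and hash the result. This assumes OpenSSL's MD5() API is available (link with -lcrypto); the column values shown are made up, not from any real table.

#include <openssl/md5.h>
#include <cstdio>
#include <string>
#include <vector>

// Build one MD5 "key" from several column values: join them with a
// separator byte that cannot occur in the data, then hash the whole
// string and render the 16-byte digest as a 32-character hex string.
std::string row_checksum(const std::vector<std::string>& columns) {
    std::string buf;
    for (const auto& col : columns) {
        buf += col;
        buf += '\x01';  // field separator
    }
    unsigned char digest[MD5_DIGEST_LENGTH];
    MD5(reinterpret_cast<const unsigned char*>(buf.data()), buf.size(), digest);

    char hex[2 * MD5_DIGEST_LENGTH + 1];
    for (int i = 0; i < MD5_DIGEST_LENGTH; ++i)
        std::snprintf(hex + 2 * i, 3, "%02x", digest[i]);
    return std::string(hex, 2 * MD5_DIGEST_LENGTH);
}

int main() {
    // Hypothetical record: just a few of the 30 columns.
    std::printf("%s\n", row_checksum({"NIRMAL", "2008-01-15", "42"}).c_str());
}

The separator byte matters: without it, the column lists ("ab","c") and ("a","bc") would hash to the same key.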

Can you please now answer my questions above (in my first post)?

Regards
~Nirmal
Nirmal Sharma
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

That's fine, and a pretty common use of that function; what I was commenting on was your first sentence: "I want to use the inbuilt checkSum stage in DS8.1 to generate a unique key". It doesn't generate a unique key, although we've had people attempt to use it for that purpose, hence the confusion. It does, however, generate a checksum for those field values that can be compared to future checksums for that same record to know if any values have changed, for CDD.

An MD5 checksum is alphanumeric, period.
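
As an aside on the integer question: the digest is 128 bits, which is wider than any 64-bit BigInt column, hence the hex text. If an integer form were truly required, the digest bytes could be reinterpreted as two 64-bit integers outside the stage. This is purely an illustrative sketch (again assuming OpenSSL), not a stage option:

#include <openssl/md5.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    // Hypothetical pre-concatenated record, fields joined by \x01.
    const char row[] = "NIRMAL\x01" "2008-01-15\x01" "42";
    unsigned char digest[MD5_DIGEST_LENGTH];
    MD5(reinterpret_cast<const unsigned char*>(row), sizeof row - 1, digest);

    // Reinterpret the 16 digest bytes as two 64-bit integers.
    std::uint64_t hi, lo;
    std::memcpy(&hi, digest, 8);
    std::memcpy(&lo, digest + 8, 8);
    std::printf("hi=%llu lo=%llu\n",
                (unsigned long long)hi, (unsigned long long)lo);
    // Keeping only one 64-bit half weakens collision resistance
    // from 2^128 down to 2^64 possible values.
}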
-craig

"You can never have too many knives" -- Logan Nine Fingers
sharma
Premium Member
Posts: 46
Joined: Mon Dec 24, 2007 2:16 pm

Post by sharma »

OK, so can you please answer my 2nd question regarding the performance and accuracy?

Will it work correctly for billions of records?

Regards
~Nirmal
Nirmal Sharma
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I did, mostly. It will "work correctly": the number of records really has no bearing on this except for performance, and I can't address that for billions of records; I've done it with millions but not billions. There's obviously some overhead to generating any hash, so you'll need to run some tests and see if it is "fast enough" in your environment for those volumes to be acceptable.
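
A rough throughput test along these lines could bound the raw hashing cost before testing the stage itself. This is a hypothetical standalone harness (again assuming OpenSSL), not DataStage:

#include <openssl/md5.h>
#include <chrono>
#include <cstdio>

int main() {
    const long N = 10000000;  // sample size; extrapolate to the real volume
    char row[128];
    unsigned char digest[MD5_DIGEST_LENGTH];

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < N; ++i) {
        // Synthetic "record"; a real test would use rows of realistic width.
        int len = std::snprintf(row, sizeof row, "colA\x01" "colB\x01" "%ld", i);
        MD5(reinterpret_cast<const unsigned char*>(row), len, digest);
    }
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%.0f hashes/sec; ~%.1f hours for 6 billion rows on one CPU\n",
                N / secs, 6e9 * secs / N / 3600.0);
}

Hashing is embarrassingly parallel, so in a parallel job the per-node figure divides across the configured nodes.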
-craig

"You can never have too many knives" -- Logan Nine Fingers
sharma
Premium Member
Posts: 46
Joined: Mon Dec 24, 2007 2:16 pm

Post by sharma »

Thanks Craig,

I am more concerned about the accuracy than the performance.

We can handle the performance if it gives correct results as the volume of records increases.

Regards
~Nirmal
Nirmal Sharma
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Any hash or checksum algorithm will have its limitations. For example, CRC32 is necessarily limited by its 32-bit result to about 4.3 billion distinct values: the chance of two given records sharing a value is only about one in four billion, and with more records than values, collisions are certain. The greater the number of potentially unique values, the "bigger" the algorithm you need. There are variations on MD5 - search the internet for details - but you may need to "roll your own".
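
To put rough numbers on this, here is a back-of-the-envelope birthday-bound sketch; the 6-billion figure comes from the earlier posts:

#include <cmath>
#include <cstdio>

// Birthday bound: P(at least one collision among n random b-bit hashes)
// is approximately 1 - exp(-n^2 / 2^(b+1)). expm1 keeps tiny values exact.
double collision_prob(double n, double bits) {
    double x = n * n / std::pow(2.0, bits + 1.0);
    return -std::expm1(-x);  // == 1 - exp(-x)
}

int main() {
    double n = 6e9;  // ~6 billion records/day, per the question above
    std::printf("CRC32 (32 bits):  p ~ %g (n exceeds 2^32, so certain)\n",
                collision_prob(n, 32));
    std::printf("MD5   (128 bits): p ~ %g\n",
                collision_prob(n, 128));
}

So CRC32 is guaranteed to collide at that volume, while the chance of even one MD5 collision among a day's 6 billion rows is on the order of 10^-20.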
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Still, why the worry about uniqueness in this application? All you care about is whether the MD5 hash for any given record changes over time, and who cares if the hash it carries is the same as any other record's? Or are you thinking that the field values could change to some odd combination of values that equates to the exact same value as the previous hash, thus losing a change? I would think the odds of that are pretty darn remote (on the order of one in 2^128 for any single change).
-craig

"You can never have too many knives" -- Logan Nine Fingers