MD5 CheckSum key generator
Moderators: chulett, rschirm, roy
Hi,
I want to use the inbuilt Checksum stage in DS 8.1 to generate a unique key.
I have a couple of questions regarding this.
1. It generates the checksum field in alphanumeric form. Can it generate the checksum field as an integer/BigInt?
2. We are getting 5 billion records a day (maybe more in future) and we want to generate a unique key for all of the records. Does this stage have the capability to generate consistent unique values for 5-6 billion records? Or is it only built to handle a small number of records?
I just need to know the accuracy and performance of this stage for this volume of data (approx. 6 billion).
Please reply ASAP as I need to decide whether to use the inbuilt DataStage Checksum stage or write my own C++ code for the checksum.
Regards
~Nirmal
Nirmal Sharma
In our table we have 30 columns, and we need to compare all 30 columns of one record with another to determine duplicates; we are also doing almost the same thing to determine the parent-child relationship.
It is very cumbersome to use all 30 columns every time, so we generate a unique key out of these 30 columns and then use that key everywhere in the code, which makes things simpler and more maintainable.
So we need a checksum and not a surrogate key (which is very different from a checksum).
We are currently using C++ code to generate checksum keys, but I want to use the inbuilt DataStage stage.
Can you please now give the answers to my questions asked above (in my first post)?
Regards
~Nirmal
Nirmal Sharma
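As a sketch of the approach described above (the column names and delimiter here are hypothetical stand-ins for the 30 real columns, and this is plain Python rather than the DataStage stage or the existing C++ code):

```python
import hashlib

def row_key(record, columns):
    """Concatenate the chosen column values with a delimiter that
    cannot appear in the data, then MD5 the result. The delimiter
    prevents ("ab", "c") and ("a", "bc") from producing the same key."""
    joined = "\x1f".join(str(record[c]) for c in columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# Hypothetical two-column example standing in for the 30 real columns:
rec = {"cust_id": "1001", "name": "Smith"}
key = row_key(rec, ["cust_id", "name"])
print(key)  # a 32-character hexadecimal string
```

The same key function can then be reused for both the duplicate check and the parent-child comparison, which is the maintainability win described above.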
That's fine and a pretty common use of that function, what I was commenting on was your first sentence: "I want to use the inbuilt checkSum stage in DS8.1 to generate a unique key". It doesn't generate a unique key although we've had people attempt to use it for that purpose, hence the confusion. It does, however, generate a checksum for those field values that can be compared to future checksums for that same record to know if any values have changed for CDD.
An MD5 checksum is alphanumeric, period.
-craig
"You can never have too many knives" -- Logan Nine Fingers
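On the integer question: the 32-character hex string is just a rendering of a 128-bit number, so it can be converted outside the stage if a numeric column is required. Note that a BIGINT holds at most 63-64 bits, so fitting the hash into one means throwing away half of it and increasing the collision risk. A minimal sketch (the input bytes are a hypothetical placeholder for the concatenated column values):

```python
import hashlib

# Hypothetical record value; in the real job this would be the
# concatenated column values.
digest = hashlib.md5(b"example record").hexdigest()

full = int(digest, 16)   # the full 128-bit value; too wide for a BIGINT
truncated = full >> 65   # keep the top 63 bits so it fits a signed BIGINT
print(digest, truncated)
```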
I did, mostly. It will "work correctly" as the number of records really has no bearing on this, except for any performance discussion, and I can't address that for billions of records. I've done it with millions but not billions. There's obviously some overhead to generating any hash; you'll need to run some tests and see if it is "fast enough" in your environment for those volumes to be acceptable.
-craig
"You can never have too many knives" -- Logan Nine Fingers
Any hash or checksum algorithm will have its limitations. For example, CRC32 is necessarily limited by its 32-bit result to only about four billion distinct values. The greater the number of potentially unique values, the "bigger" the algorithm you need. There are variations on MD5 - search the internet for details - but you may need to "roll your own".
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
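To put rough numbers on those limitations: the birthday approximation p ≈ n²/2^(b+1) estimates the chance of at least one collision among n random b-bit hashes. A sketch for CRC32 versus MD5 at the 6-billion-record volume from the question:

```python
# Birthday-bound collision estimate: p ~= n^2 / 2^(b+1)
n = 6_000_000_000  # records per day, from the question above

for name, bits in [("CRC32", 32), ("MD5", 128)]:
    p = min(n * n / 2 ** (bits + 1), 1.0)
    print(name, p)
# CRC32 is effectively certain to collide at this volume, while
# MD5's estimated probability is on the order of 5e-20.
```

This is only the random-collision estimate; it says nothing about adversarial inputs, where MD5 is known to be breakable.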
Still, why the worry about uniqueness in this application? All you care about is whether the MD5 hash for any given record changes over time; who cares if the hash it carries is the same as any other record's? Or are you thinking that the field values could change to some odd combination that produces exactly the same value as the previous hash, thus losing a change? I would think the odds of that are pretty darn remote.
-craig
"You can never have too many knives" -- Logan Nine Fingers