chulett wrote: Pad them (each field) to a consistent size.
Great! Superb idea! You've got to try this one and let us know.
Save! CRC32.
Naveen.
Anything that won't sell, I don't want to invent. Its sale is proof of utility, and utility is success.
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
Exactly what I was thinking from the start.
Don't we just concatenate the columns? Why would we use delimiters? I mean CRC32(Col1:Col2:............).
Am I missing something here?
(BTW, Mr. Clapp, are you still in Vegas?)
Thanks,
We may want to break this into another thread, because I have been burned when not using column separators. Padding is something I have never been forced to do...but there is always something new.
Yes, I am still in Las Vegas...if I don't get this fixed I may not be here for much longer.
Flying home now so I will pick this up next week. Thanks for all the suggestions...I will work on the padding suggestion and let you know.
Just wondering if you had a chance to try the padding and the concatenating suggestions.
Thanks,
Naveen.
I've also run into CRC32 collisions on large data sets. There are only 2^32 possible CRC32 values, so any dataset with more records than that is guaranteed to yield collisions. But the likelihood of at least one collision in a uniformly distributed data set reaches 50% with far smaller datasets, at roughly 77,000 records.
(Look up 'Birthday Paradox' on the web to help see why).
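To put rough numbers on that, here is a minimal Python sketch of the standard birthday approximation. It assumes the CRC32 values are uniformly distributed over 2^32 possibilities, which real data only approximates; the function names are mine, for illustration only.

```python
import math

CRC32_SPACE = 2 ** 32  # number of distinct CRC32 values


def collision_probability(n_records: int, n_values: int = CRC32_SPACE) -> float:
    """Approximate chance that at least two of n_records share a value.

    Uses the birthday approximation p ~= 1 - exp(-n(n-1) / (2N)),
    assuming values are uniformly distributed.
    """
    if n_records > n_values:
        return 1.0  # pigeonhole: a collision is certain
    exponent = n_records * (n_records - 1) / (2.0 * n_values)
    return -math.expm1(-exponent)


def records_for_probability(p: float, n_values: int = CRC32_SPACE) -> float:
    """Approximate record count at which the collision chance reaches p."""
    return math.sqrt(2.0 * n_values * math.log(1.0 / (1.0 - p)))


if __name__ == "__main__":
    for n in (10_000, 77_163, 200_000, 1_000_000):
        print(f"{n:>9} records -> collision chance {collision_probability(n):.4f}")
    print(f"50% point: about {records_for_probability(0.5):,.0f} records")
```

The takeaway is that collisions become likely long before the row count gets anywhere near 2^32.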
The likelihood of collisions also becomes greater if your individual data items are not uniformly random (i.e., you have domains of items that are mostly digits, or spaces, or alpha, or strings).
Remember that CRC32 was created to detect transmission errors in blocks of data sent over noisy analog lines. The blocks were usually 512 bytes, so chances of a collision were very small.
The reason for using a delimiter between data items is to prevent cases like this, where different records when concatenated give the same result:
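A minimal Python sketch of that case, using zlib.crc32 as a stand-in for the DataStage CRC32 routine (an assumption; the two may not produce identical values). The rows, the "|" separator, and the pad width are made up for illustration, and the separator must be a character that cannot occur in the data.

```python
import zlib


def crc_concat(fields) -> int:
    """CRC32 of the fields joined with nothing in between."""
    return zlib.crc32("".join(fields).encode("utf-8"))


def crc_delimited(fields, sep: str = "|") -> int:
    """CRC32 of the fields joined with a separator (assumed absent from the data)."""
    return zlib.crc32(sep.join(fields).encode("utf-8"))


def pad_row(fields, width: int = 4) -> str:
    """Pad each field to a fixed width, as suggested earlier in the thread."""
    return "".join(f.ljust(width) for f in fields)


row_a = ("AB", "C")
row_b = ("A", "BC")

# Both rows collapse to the same string "ABC", so the CRCs must match.
print(crc_concat(row_a) == crc_concat(row_b))

# With a separator the inputs are "AB|C" and "A|BC", which are different strings.
print(crc_delimited(row_a) == crc_delimited(row_b))

# Padding also keeps the column boundaries visible in the hashed string.
print(repr(pad_row(row_a)), repr(pad_row(row_b)))
```

Note that this only removes collisions caused by losing the column boundaries. Two genuinely different delimited strings can still share a CRC32.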
Thanks for all the suggestions...none of them worked. We finally wrote an ActiveX plug-in using the MD5 algorithm, which resulted in no more duplicates. It was reasonably fast, and after our initial load the deltas were very small.
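For anyone who wants to try the same idea outside DataStage, here is a rough Python sketch of hashing a row with MD5. This is not the ActiveX plug-in described above (its code is not in this thread); the function name, the sample rows, and the choice of the ASCII unit separator as delimiter are my own assumptions.

```python
import hashlib


def row_md5(fields, sep: str = "\x1f") -> str:
    """Return a 128-bit MD5 digest (as hex) of a delimited row.

    The unit separator "\x1f" is assumed never to appear in the data.
    """
    joined = sep.join(str(f) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


old_row = ("1001", "Smith", "Las Vegas")
new_row = ("1001", "Smith", "Reno")

# Compare the stored digest against the digest of the incoming row.
if row_md5(old_row) != row_md5(new_row):
    print("row changed")
```

A 128-bit digest makes accidental collisions vastly less likely than with a 32-bit CRC, at the cost of a wider column to store.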
Again, the 'official' word from DS support is that CRC32 will produce duplicates for wide inputs. As was stated in this thread, it is important to put some type of delimiter between columns before running them through CRC32.
Again thanks for the suggestions from everyone. good dsssing....
There is clearly a misunderstanding of what a CRC is and how a CRC is generated. It is quite possible that two totally different rows of data will generate the same CRC value, and that is OK (it is mathematically possible and happens quite frequently for a given data volume). What the CRC will identify is whether the row has changed. If you take your first row, change one simple character, and then generate a CRC, the two CRCs for the same row should be different. (Keep in mind that on wide rows a checksum will likely fail to detect a small change.)
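A small Python sketch of that change-detection use, with zlib.crc32 standing in for the DataStage routine (an assumption) and made-up rows and helper names:

```python
import zlib


def row_crc(fields, sep: str = "|") -> int:
    """CRC32 of a delimited row; sep is assumed absent from the data."""
    return zlib.crc32(sep.join(fields).encode("utf-8"))


def has_changed(stored_crc: int, incoming_fields) -> bool:
    """Compare the CRC stored for a key against the incoming version of that row."""
    return row_crc(incoming_fields) != stored_crc


original = ("1001", "Smith", "Las Vegas")
updated = ("1001", "Smyth", "Las Vegas")  # one character changed

stored = row_crc(original)
print(has_changed(stored, original))  # same row, same CRC
print(has_changed(stored, updated))   # one-character edit changes the CRC
```

The CRC is only ever compared against the stored CRC for the same business key, never across different rows, which is why cross-row collisions do not matter here.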
This is why a CRC would... no, SHOULD NEVER be used as a unique key generator, which is what some on this forum have tried to do, with disastrous results.
A CRC is different from a checksum in that a checksum (DataStage) is an 8-bit or maybe a 16-bit value and is additive (as are all checksums). A CRC is not additive; it uses polynomial math and a table to generate the CRC value.
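To illustrate the difference, here is a Python sketch comparing a generic 8-bit additive checksum with CRC32. The additive function is a textbook example of my own, not the DataStage checksum routine, and zlib.crc32 is used as a stand-in for the CRC.

```python
import zlib


def additive_checksum8(data: bytes) -> int:
    """Sum of all bytes modulo 256: a simple additive checksum."""
    return sum(data) % 256


a = b"AB"
b = b"BA"  # same bytes, swapped order

# Addition ignores byte order, so the additive checksum cannot tell these apart.
print(additive_checksum8(a) == additive_checksum8(b))

# CRC32 is position-sensitive, so swapping the two bytes changes the value.
print(zlib.crc32(a) == zlib.crc32(b))
```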
If what you are describing was truly a problem, then HTTP, FTP, etc. would not work, since the same family of algorithms is used in the network layers beneath these communications protocols (Ethernet frames, for example, carry a CRC-32).
So you see... the world is not collapsing and the sky is not falling. It is simply a matter of understanding; without an understanding of the technology, the belief will of course be that it does not work, and that would be incorrect.