Posted: Fri Mar 03, 2006 2:18 pm
by I_Server_Whale
chulett wrote: Pad them (each field) to a consistent size.
Great! Superb idea! :idea: You've got to try this one and let us know.

Save! CRC32.


Naveen.

Posted: Fri Mar 03, 2006 3:27 pm
by logic
Exactly what I was thinking from the start.
Don't we just concatenate the columns? Why would we use delimiters? I mean CRC32(Col1:Col2:............).
Am I missing something here?
(BTW, Mr. Clapp, are you still in Vegas?)
Thanks,

Posted: Fri Mar 03, 2006 4:16 pm
by lclapp
We may want to break this into another thread, because I have been burned when not using column separators. Padding is something I have never been forced to do... but there is always something new.

Yes, I am still in Las Vegas...if I don't get this fixed I may not be here for much longer.

Flying home now so I will pick this up next week. Thanks for all the suggestions...I will work on the padding suggestion and let you know.

leslie

Posted: Tue Mar 07, 2006 8:15 am
by I_Server_Whale
Hello Leslie,

Just wondering if you had a chance to try the padding and the concatenating suggestions. :wink:

Thanks,
Naveen.

Posted: Wed Apr 12, 2006 8:46 am
by clshore
I've also run into CRC32 collisions on large data sets. There are only 2^32 possible CRC32 values, so any data set with more records than that is guaranteed to yield collisions. But the likelihood of at least one collision in a uniformly distributed data set reaches 50% with far smaller data sets (roughly 77,000 records).
(Look up 'Birthday Paradox' on the web to help see why.)
The likelihood of collisions also becomes greater if your individual data items are not uniformly random (i.e., you have domains of items that are mostly digits, or spaces, or alpha, or strings).

Remember that CRC32 was created to detect transmission errors in blocks of data sent over noisy analog lines. The blocks were usually 512 bytes, so chances of a collision were very small.

The reason for using a delimiter between data items is to prevent cases like this, where different records when catted give the same result:

         'csv record'       'cat, no pipe'   'cat, with pipe'
         ----------------   --------------   ----------------
rec1 =   123, 456, 789      123456789        123|456|789
rec2 =   12, 3456, 789      123456789        12|3456|789
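
The same two records can be checked with a short Python sketch. It uses the standard library's `zlib.crc32` as a stand-in for the DataStage CRC32 routine (an assumption; the two may not produce identical values), and `crc_of_row` is a hypothetical helper name.

```python
import zlib


def crc_of_row(fields, delimiter=""):
    """CRC32 of the fields joined with the delimiter (empty string = plain concatenation)."""
    return zlib.crc32(delimiter.join(fields).encode("utf-8"))


rec1 = ["123", "456", "789"]
rec2 = ["12", "3456", "789"]

# Without a delimiter both rows collapse to "123456789", so the CRCs must match.
print(crc_of_row(rec1) == crc_of_row(rec2))            # True

# With a pipe the joined strings differ, and so do the CRCs for this pair.
print(crc_of_row(rec1, "|") == crc_of_row(rec2, "|"))  # False
```

The delimiter only helps if it cannot appear inside the data itself; otherwise the same ambiguity comes back.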


Carter Shore

Posted: Wed Apr 12, 2006 9:46 am
by lclapp
Thanks for all the suggestions... none of them worked. We finally wrote an ActiveX plug-in using the MD5 algorithm, which resulted in no more duplicates. It was reasonably fast, and after our initial load the deltas were very small.
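
For readers who want to try the MD5 approach, here is a minimal Python sketch. It is not the ActiveX plug-in described above (that code is not shown in this thread); it just uses the standard `hashlib.md5`, and `row_digest` is a hypothetical helper name.

```python
import hashlib


def row_digest(fields, delimiter="|"):
    """128-bit MD5 digest (as 32 hex characters) of the delimited row."""
    joined = delimiter.join(fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


rec1 = ["123", "456", "789"]
rec2 = ["12", "3456", "789"]

print(row_digest(rec1))
print(row_digest(rec2))

# A wider hash does not remove the need for a delimiter:
# without one, both rows are still the same input string.
print(row_digest(rec1, "") == row_digest(rec2, ""))  # True
```

A 128-bit digest makes accidental collisions vastly less likely than with 32 bits, but it is still a hash, not a guaranteed unique key.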

Again, the 'official' word from DS support is that CRC32 will produce duplicates for wide inputs. As was stated in this thread, it is important to put some type of delimiter between columns before running them through CRC32.

Again thanks for the suggestions from everyone. good dsssing....

leslie

Posted: Tue Jul 11, 2006 3:46 pm
by mhester
To all on this post,

There is clearly a misunderstanding of what a CRC is and how a CRC is generated. It is quite possible that two totally different rows of data will generate the same CRC value, and that is OK (it is mathematically possible and happens quite frequently for a given data volume). What the CRC will identify is whether a row has changed. If you take your first row, change one single character, and then generate a CRC, the two CRCs for that row should be different. (Keep in mind that on wide rows a checksum will likely fail to detect a small change.)
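
The change-detection use described here can be sketched in a few lines of Python. The sample rows and the `row_changed` helper are invented for illustration, and `zlib.crc32` stands in for whatever CRC32 routine the job actually uses.

```python
import zlib


def row_changed(old_row, new_row):
    """Compare two versions of the SAME row by their CRC32 values."""
    return zlib.crc32(old_row.encode("utf-8")) != zlib.crc32(new_row.encode("utf-8"))


original = "1001|Smith|Las Vegas|2006-03-03"
changed = "1001|Smyth|Las Vegas|2006-03-03"  # one character differs

print(row_changed(original, original))  # False
print(row_changed(original, changed))   # True
```

Note the comparison is always between two versions of one row, looked up by its real key. The CRC is never used to decide which row is which.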

This is why CRC would..... NO, SHOULD NEVER be used as a unique key generator, which is what some on this forum have tried to do with disastrous results.

CRC is different from checksum in that a checksum (DataStage) is an 8-bit or maybe a 16-bit checksum and is additive (as are all checksums). CRC is not additive; it uses polynomial math and a table to generate the CRC value.
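
To make the additive-versus-polynomial distinction concrete, here is a small Python sketch. The 8-bit sum below is a generic additive checksum written for this example, not the DataStage checksum function, and `zlib.crc32` is used as the CRC.

```python
import zlib


def additive_checksum8(data):
    """Generic 8-bit additive checksum: sum of all bytes, modulo 256."""
    return sum(data) % 256


a = b"1234"
b = b"4321"  # same bytes in a different order

# An additive checksum ignores byte order, so the reordering goes unnoticed.
print(additive_checksum8(a) == additive_checksum8(b))  # True

# CRC32 depends on byte position, so the same reordering is detected.
print(zlib.crc32(a) == zlib.crc32(b))                  # False
```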

If what you are describing were truly a problem, then HTTP, FTP, etc. would not work, since this is the same family of algorithm used in these communications protocols.

So you see..... the world is not collapsing and the sky is not falling. It is simply a matter of understanding: without an understanding of the technology, the belief will, of course, be that it does not work, and that would be incorrect.

Regards,

Posted: Tue Jul 11, 2006 3:50 pm
by chulett
Ah... nothing like a CRC post to bring Michael out of the woodwork. :wink:

Posted: Tue Jul 11, 2006 4:02 pm
by mhester
Craig,

Just saw this one and it got my blood boiling!

:x

Posted: Tue Jul 11, 2006 4:34 pm
by shawn_ramsey
The question I have for the group is:

Isn't CRC really designed to detect differences in what you assume to be the same, not sameness in what you assume to be different?

Posted: Tue Jul 11, 2006 5:30 pm
by kcbland
whooaa dude, that is far out

But absolutely true.

Posted: Tue Jul 11, 2006 5:31 pm
by ray.wurlod
... for some particular value of truth, anyway.