CRC32 does have limitations (severe in my opinion)



I_Server_Whale
Premium Member
Posts: 1255
Joined: Wed Feb 02, 2005 11:54 am
Location: United States of America

Post by I_Server_Whale »

chulett wrote: Pad them (each field) to a consistent size.
Great! Superb idea! :idea: You've got to try this one and let us know.

Save! CRC32.


Naveen.
Anything that won't sell, I don't want to invent. Its sale is proof of utility, and utility is success.
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
logic
Participant
Posts: 115
Joined: Thu Feb 24, 2005 10:48 am

Post by logic »

Exactly what I was thinking from the start.
Do we not concatenate columns? Why would we use delimiters? I mean CRC32(Col1:Col2:............).
Am I missing something here?
(BTW, Mr. Clapp, are you still in Vegas?)
Thanks,
lclapp
Premium Member
Posts: 21
Joined: Wed May 19, 2004 2:43 pm

Post by lclapp »

We may want to break this into another thread, because I have been burned when not using column separators. Padding is something I have never been forced to do...but there is always something new.

Yes, I am still in Las Vegas...if I don't get this fixed I may not be here for much longer.

Flying home now so I will pick this up next week. Thanks for all the suggestions...I will work on the padding suggestion and let you know.

leslie
I_Server_Whale
Premium Member
Posts: 1255
Joined: Wed Feb 02, 2005 11:54 am
Location: United States of America

Post by I_Server_Whale »

Hello Leslie,

Just wondering if you had a chance to try the padding and the concatenating suggestions. :wink:

Thanks,
Naveen.
clshore
Charter Member
Posts: 115
Joined: Tue Oct 21, 2003 11:45 am

Post by clshore »

I've also run into CRC32 collisions on large data sets. There are only 2^32 possible CRC32 values, so any data set with more than 2^32 records is guaranteed to yield collisions. But the likelihood of at least one collision in a uniformly distributed data set reaches 50% with much smaller data sets, at roughly 77,000 records.
(Look up 'Birthday Paradox' on the web to help see why.)
The likelihood of collisions also becomes greater if your individual data items are not uniformly random (i.e., you have domains of items that are mostly digits, or spaces, or alpha, or strings).
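
To put numbers on the Birthday Paradox point, here is a minimal sketch in Python. It assumes the CRC values are uniformly distributed over 2^32 outcomes, which real data will not satisfy exactly, and it uses the standard birthday approximation rather than an exact count. The function names are my own, made up for illustration.

```python
import math

def collision_probability(n_records: int, bits: int = 32) -> float:
    """Approximate chance that at least two of n_records share a value,
    assuming values are uniform over 2**bits outcomes (an assumption)."""
    space = 2.0 ** bits
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2N))
    return -math.expm1(-n_records * (n_records - 1) / (2.0 * space))

def records_for_probability(p: float, bits: int = 32) -> int:
    """Approximate record count at which the collision chance reaches p."""
    space = 2.0 ** bits
    return math.ceil(math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p))))

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9} records: {collision_probability(n):.4f}")
print("records needed for a 50% chance:", records_for_probability(0.5))
```

Under those assumptions, a million-row load is all but certain to contain at least one pair of rows with the same CRC32.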

Remember that CRC32 was created to detect transmission errors in blocks of data sent over noisy analog lines. The blocks were usually 512 bytes, so chances of a collision were very small.

The reason for using a delimiter between data items is to prevent cases like this, where different records give the same result when concatenated:

        'csv record'      'cat, no pipe'   'cat, with pipe'
        ---------------   --------------   ----------------
rec1 =  123, 456, 789     123456789        123|456|789
rec2 =  12, 3456, 789     123456789        12|3456|789
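
The same two records can be run through a CRC-32 to see the effect. This is a sketch using Python's zlib.crc32, which is a standard CRC-32; I am not claiming it returns the same numbers as the DataStage CRC32 function. The helper name row_crc is hypothetical.

```python
import zlib

def row_crc(fields, delimiter=""):
    """CRC-32 of the fields joined with an optional delimiter."""
    return zlib.crc32(delimiter.join(fields).encode("utf-8"))

rec1 = ["123", "456", "789"]
rec2 = ["12", "3456", "789"]

# Without a delimiter both rows collapse to the string "123456789",
# so the CRC is fed identical input and must return identical output.
print(row_crc(rec1) == row_crc(rec2))

# With a pipe the two strings differ, and so do their CRCs.
print(row_crc(rec1, "|") == row_crc(rec2, "|"))
```

Note that this only helps if the delimiter character cannot occur inside the data itself.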


Carter Shore
lclapp
Premium Member
Posts: 21
Joined: Wed May 19, 2004 2:43 pm

Post by lclapp »

Thanks for all the suggestions...none of them worked. We finally wrote an ActiveX plug-in using the MD5 algorithm, which resulted in no more duplicates. It was reasonably fast, and after our initial load the deltas were very small.

Again, the 'official' word from DS support is that CRC32 will produce duplicates for wide inputs. As was stated in this thread, it is important to put some type of delimiter between columns before running them through CRC32.
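
The ActiveX plug-in itself is not shown in this thread, so the following is only a rough Python sketch of the general idea (hash the delimited columns with MD5 instead of CRC32), not the actual implementation. The function name row_md5 and the choice of a pipe delimiter are assumptions for illustration.

```python
import hashlib

def row_md5(fields, delimiter="|"):
    """MD5 hex digest of the fields joined with a delimiter.

    None is treated as an empty string here; that is an assumption,
    and a real job may want to distinguish NULL from empty.
    """
    joined = delimiter.join("" if f is None else str(f) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

print(row_md5(["123", "456", "789"]))
print(row_md5(["12", "3456", "789"]))
```

A 128-bit digest makes accidental duplicates far less likely than a 32-bit CRC, at the cost of a wider value to store and compare.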

Again thanks for the suggestions from everyone. good dsssing....

leslie
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ

Post by mhester »

To all on this post,

There is clearly a misunderstanding of what a CRC is and how a CRC is generated. It is quite possible that two totally different rows of data will generate the same CRC value, and that is OK (it is mathematically possible and happens quite frequently for a given data volume). What the CRC will identify is whether the row has changed. If you take your first row, change one simple character and then generate a CRC, the CRCs of the original and the changed row should be different. (Keep in mind that on wide rows a checksum will likely fail to detect a small change.)
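
As a sketch of that change-detection use, again in Python with zlib.crc32 standing in for the DataStage function (an assumption), the stored CRC of a row is compared with the CRC of the incoming version of the same row. The helper names and the sample row are made up for illustration.

```python
import zlib

def row_crc(fields, delimiter="|"):
    """CRC-32 of the delimited row."""
    return zlib.crc32(delimiter.join(fields).encode("utf-8"))

def row_changed(stored_crc, current_fields, delimiter="|"):
    """True when the current version of a row no longer matches the stored CRC."""
    return row_crc(current_fields, delimiter) != stored_crc

original = ["1001", "Smith", "Denver"]
updated = ["1001", "Smyth", "Denver"]   # one character changed

stored = row_crc(original)
print(row_changed(stored, original))    # same row, same CRC
print(row_changed(stored, updated))     # changed row, different CRC
```

The comparison is always between two versions of the row with the same key, never between unrelated rows.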

This is why CRC would..... NO, SHOULD NEVER be used as a unique key generator, which is what some on this forum have tried to do, with disastrous results.

CRC is different from checksum in that a checksum (DataStage) is an 8-bit or maybe a 16-bit checksum and is additive (as are all checksums). CRC is not additive and uses polynomial math and a table to generate the CRC value.
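
To illustrate the difference, here is a small Python sketch. The 8-bit additive checksum below is a generic one written for this example; how the DataStage checksum function is actually computed is not shown in this thread, so treat it as an assumption. The CRC comes from zlib.crc32.

```python
import zlib

def additive_checksum8(data: bytes) -> int:
    """Generic 8-bit additive checksum: sum of all bytes modulo 256."""
    return sum(data) % 256

a = b"12|3456|789"
b = b"21|3456|789"   # first two characters transposed

# Addition ignores byte order, so the additive checksum cannot see the swap.
print(additive_checksum8(a) == additive_checksum8(b))

# The CRC depends on where each byte sits, so the swap changes the value.
print(zlib.crc32(a) == zlib.crc32(b))
```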

If what you are describing was truly a problem, then HTTP, FTP, etc. would not work, since this is the same family of algorithms used in these communications protocols.

So you see..... the world is not collapsing and the sky is not falling. It is simply a matter of understanding: without an understanding of the technology, the belief will of course be that it does not work, and that would be incorrect.

Regards,
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Ah... nothing like a CRC post to bring Michael out of the woodwork. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ

Post by mhester »

Craig,

Just saw this one and it got my blood boiling!

:x
shawn_ramsey
Participant
Posts: 145
Joined: Fri May 02, 2003 9:59 am
Location: Seattle, Washington. USA

Post by shawn_ramsey »

The question I have for the group is:

Isn't CRC really designed to detect difference in what you assume to be the same, not sameness in what you assume to be different?
Shawn Ramsey

"It is a mistake to think you can solve any major problems just with potatoes."
-- Douglas Adams
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

whooaa dude, that is far out

But absolutely true.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

... for some particular value of truth, anyway.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.