10 pounds in a 5 pound bag
Moderators: chulett, rschirm, roy
Craig,
the job ran through without a collision on all permutations of a 10-character string using CRC32. I just realized that this is not guarantee that it won't get a duplicate on a 9 or 11 length string - but the chances of a collision using 12 CRC32'd number on a 200 character string string are so infinitesmal to be effectively equal to 0, imho. I think the chances of getting a duplicate on 2 are small enough.
the job ran through without a collision on all permutations of a 10-character string using CRC32. I just realized that this is not guarantee that it won't get a duplicate on a 9 or 11 length string - but the chances of a collision using 12 CRC32'd number on a 200 character string string are so infinitesmal to be effectively equal to 0, imho. I think the chances of getting a duplicate on 2 are small enough.
I have a got a different approach to this problem.
May sound silly but it definetly works
Create a table with two fields (key number(20), charvalue varchar2(200))
populate this table with the character values (Key is auto generated / sequence).
Use the key value to replace the "200" length character string.
This approach definetly works
-Nagesh
May sound silly but it definetly works
Create a table with two fields (key number(20), charvalue varchar2(200))
populate this table with the character values (Key is auto generated / sequence).
Use the key value to replace the "200" length character string.
This approach definetly works
-Nagesh
Srinagesh,
actually, that is a really good solution. I was so concentrated on the string 'compression' function that the (now obvious) table-based solution that you have come up with completely escaped me. With th lookup file you could use a shorter key as well - just start with 1 and work upwards from there.
actually, that is a really good solution. I was so concentrated on the string 'compression' function that the (now obvious) table-based solution that you have come up with completely escaped me. With th lookup file you could use a shorter key as well - just start with 1 and work upwards from there.
It is an interesting solution, unfortunately I don't believe it will work for us. I've made the classic newbie mistake of not giving you 100 percent of the information needed to solve my problem.
I was looking for an function because it needs to be repeatable by the business partner we are creating the feed for that it will be incorporated in. At least I'm pretty sure that is one of the requirements... I need to double-check. So while 'DataStage-centric' solutions are cool, I think they are out of picture for what I need to do here.
If I need something repeatable, I'm assuming the CRC32 method would qualify, yes? Is it different than the CRC32() function available in DataStage? I ask because I ran some strings through it yesterday and got some rather large values out of the routine - 8 or 9 digits if I recall - rather than the 4 digits you mentioned, hence the question.
![Embarassed :oops:](./images/smilies/icon_redface.gif)
I was looking for an function because it needs to be repeatable by the business partner we are creating the feed for that it will be incorporated in. At least I'm pretty sure that is one of the requirements... I need to double-check. So while 'DataStage-centric' solutions are cool, I think they are out of picture for what I need to do here.
![Confused :?](./images/smilies/icon_confused.gif)
If I need something repeatable, I'm assuming the CRC32 method would qualify, yes? Is it different than the CRC32() function available in DataStage? I ask because I ran some strings through it yesterday and got some rather large values out of the routine - 8 or 9 digits if I recall - rather than the 4 digits you mentioned, hence the question.
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
Craig,
the answer comes as a 32-bit digit, so the displayed value is going to chew up a lot of spaces, what I meant is that this can be compressed into 4 bytes using a MBOC conversion. I have a routine here somewhere that does that {it also makes sure that no @FM is in the string, so that it can be written to a hashed file}.
But to keep things simple you can split your string into 5 substrings and concatenate the CRC each of them. Still mathematically inconceivable to get a collision and simpler to implement while staying under your 50 character maximum.
the answer comes as a 32-bit digit, so the displayed value is going to chew up a lot of spaces, what I meant is that this can be compressed into 4 bytes using a MBOC conversion. I have a routine here somewhere that does that {it also makes sure that no @FM is in the string, so that it can be written to a hashed file}.
But to keep things simple you can split your string into 5 substrings and concatenate the CRC each of them. Still mathematically inconceivable to get a collision and simpler to implement while staying under your 50 character maximum.
I'd be curious to see your routine as I'm not having alot of luck duplicating what you are saying here. Sorry.
I can find an 'MB0C' conversion code (zero not oh) but that seems to output a giant binary number. Did you perhaps mean "MO0C" to get Octal?![Confused :?](./images/smilies/icon_confused.gif)
I can find an 'MB0C' conversion code (zero not oh) but that seems to output a giant binary number. Did you perhaps mean "MO0C" to get Octal?
![Confused :?](./images/smilies/icon_confused.gif)
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
Craig,
to convert from an 32 bit integer number to a "packed" 4 byte number the basic method that I used in a function is:
This will return 4 bytes (or more when @FM is part of the code) for any input string.
to convert from an 32 bit integer number to a "packed" 4 byte number the basic method that I used in a function is:
Code: Select all
** (c) 2005 AWussing
WorkString = OCONV(CRC32(InString),'MB')
IF LEN(WorkString)<>32 THEN WorkString = STR('0',32-LEN(WorkString)):WorkString
Ans = ''
FOR PosNumber = 1 TO 32 STEP 8
NewChar = ICONV(WorkString[PosNumber,8],'MB0C')
IF (NewChar=@FM) THEN Ans := @VM:@VM ELSE Ans := NewChar
NEXT PosNumber
Thanks for your help with all this, Arnd. I've got something that seems to be working fine and giving them what they want - inspired by your code and guidance. I've stashed yours away for a rainy day, copyright and all, in case I find a need for it. They ended up wanting something more displayable ('queryable' as they put it) so I ended up taking a slightly different tack.
My incoming string gets hacked into 5 (as even as possible) pieces and each 'chunk' is CRC32'd. Instead of packing the results I ended up using the "MX" conversion code to translate each chunk's CRC32 value into Hex and then the 5 hex values are concatenated together. Gives them a consistant 40 character string that is purty and queryable - and unique to boot.
I've tested it with the 2 to 3 million record datasets we have available for it right now and there were no collisions across the generated keys so everyone is happy at the moment. :D
Thanks again.
My incoming string gets hacked into 5 (as even as possible) pieces and each 'chunk' is CRC32'd. Instead of packing the results I ended up using the "MX" conversion code to translate each chunk's CRC32 value into Hex and then the 5 hex values are concatenated together. Gives them a consistant 40 character string that is purty and queryable - and unique to boot.
![Wink :wink:](./images/smilies/icon_wink.gif)
I've tested it with the 2 to 3 million record datasets we have available for it right now and there were no collisions across the generated keys so everyone is happy at the moment. :D
Thanks again.
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
OK, 200 input characters (assume a-z,A-Z,0-9, about 64 possibilites?).
Each segment is 40 of these characters,
so the number of different possible character patterns is:
64^40 = 1.76 x 10^72
Each of these segments is translated by the CRC function into an 8 character string of hex. The number of different possible hex patterns, 00000000 to FFFFFFFF is:
16^8 = 4.29 x 10^9 (same as 2^32 of course)
So the ratio of possible string patterns to possible hex patterns
for each segment is:
1.76 x 10^72
---------------- = 4.1 x 10^62
4.29 x 10^9
Gosh, that's a big number.
Of course, your real world input string is most likely not a random set of characters, as there are probably recurring patterns of alpha and numeric characters that will produce localizations.
Still, how will you know if a collision takes place?
On the other hand, does it matter in your application?
Carter
Each segment is 40 of these characters,
so the number of different possible character patterns is:
64^40 = 1.76 x 10^72
Each of these segments is translated by the CRC function into an 8 character string of hex. The number of different possible hex patterns, 00000000 to FFFFFFFF is:
16^8 = 4.29 x 10^9 (same as 2^32 of course)
So the ratio of possible string patterns to possible hex patterns
for each segment is:
1.76 x 10^72
---------------- = 4.1 x 10^62
4.29 x 10^9
Gosh, that's a big number.
Of course, your real world input string is most likely not a random set of characters, as there are probably recurring patterns of alpha and numeric characters that will produce localizations.
Still, how will you know if a collision takes place?
On the other hand, does it matter in your application?
Carter
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
clshore,
to add to your computations, the CRC32() algorithm is designed to ensure that sequences that are quite similar will produce very dissimilar results, so that in this example the odds are further reduced by the data being constrained to mainly +-60 characters and maximum length of 200.
The likelihood of a collision with 60!/50!(60-50)! (75,394,027,566 possible string combinations) going into 2^32 (4,294,967,296) numbers is very high. But doing this 4 times brings the odds back up to 5.16x10^-73. I think that those are odds worth playing!
to add to your computations, the CRC32() algorithm is designed to ensure that sequences that are quite similar will produce very dissimilar results, so that in this example the odds are further reduced by the data being constrained to mainly +-60 characters and maximum length of 200.
The likelihood of a collision with 60!/50!(60-50)! (75,394,027,566 possible string combinations) going into 2^32 (4,294,967,296) numbers is very high. But doing this 4 times brings the odds back up to 5.16x10^-73. I think that those are odds worth playing!