10 pounds in a 5 pound bag

ArndW · Post by **ArndW** » Wed Oct 19, 2005 10:13 am

Craig,

10% finished so far and no duplicates. Have you got a fast machine?

chulett · Post by **chulett** » Wed Oct 19, 2005 10:23 am

They seem to think so - an HP Superdome running HP-UX.

ArndW · Post by **ArndW** » Thu Oct 20, 2005 1:45 am

Craig,

the job ran through without a collision on all permutations of a 10-character string using CRC32. I just realized that this is not guarantee that it won't get a duplicate on a 9 or 11 length string - but the chances of a collision using 12 CRC32'd number on a 200 character string string are so infinitesmal to be effectively equal to 0, imho. I think the chances of getting a duplicate on 2 are small enough.

srinagesh · Post by **srinagesh** » Thu Oct 20, 2005 4:31 am

I have a got a different approach to this problem.
May sound silly but it definetly works

Create a table with two fields (key number(20), charvalue varchar2(200))

populate this table with the character values (Key is auto generated / sequence).

Use the key value to replace the "200" length character string.

This approach definetly works

-Nagesh

ArndW · Post by **ArndW** » Thu Oct 20, 2005 4:52 am

Srinagesh,

actually, that is a really good solution. I was so concentrated on the string 'compression' function that the (now obvious) table-based solution that you have come up with completely escaped me. With th lookup file you could use a shorter key as well - just start with 1 and work upwards from there.

chulett · Post by **chulett** » Thu Oct 20, 2005 7:05 am

It is an interesting solution, unfortunately I don't believe it will work for us. I've made the classic newbie mistake of not giving you 100 percent of the information needed to solve my problem.

I was looking for an function because it needs to be repeatable by the business partner we are creating the feed for that it will be incorporated in. At least I'm pretty sure that is one of the requirements... I need to double-check. So while 'DataStage-centric' solutions are cool, I think they are out of picture for what I need to do here.

If I need something repeatable, I'm assuming the CRC32 method would qualify, yes? Is it different than the CRC32() function available in DataStage? I ask because I ran some strings through it yesterday and got some rather large values out of the routine - 8 or 9 digits if I recall - rather than the 4 digits you mentioned, hence the question.

ArndW · Post by **ArndW** » Thu Oct 20, 2005 9:50 am

Craig,

the answer comes as a 32-bit digit, so the displayed value is going to chew up a lot of spaces, what I meant is that this can be compressed into 4 bytes using a MBOC conversion. I have a routine here somewhere that does that {it also makes sure that no @FM is in the string, so that it can be written to a hashed file}.

But to keep things simple you can split your string into 5 substrings and concatenate the CRC each of them. Still mathematically inconceivable to get a collision and simpler to implement while staying under your 50 character maximum.

chulett · Post by **chulett** » Thu Oct 20, 2005 10:23 am

I'd be curious to see your routine as I'm not having alot of luck duplicating what you are saying here. Sorry.

I can find an 'MB0C' conversion code (zero not oh) but that seems to output a giant binary number. Did you perhaps mean "MO0C" to get Octal?

ArndW · Post by **ArndW** » Thu Oct 20, 2005 10:58 am

Craig,

to convert from an 32 bit integer number to a "packed" 4 byte number the basic method that I used in a function is:

Code: Select all

** (c) 2005 AWussing
   WorkString = OCONV(CRC32(InString),'MB')
   IF LEN(WorkString)<>32 THEN WorkString = STR('0',32-LEN(WorkString)):WorkString
   Ans = ''
   FOR PosNumber = 1 TO 32 STEP 8
      NewChar = ICONV(WorkString[PosNumber,8],'MB0C')
      IF (NewChar=@FM) THEN Ans := @VM:@VM ELSE Ans := NewChar
   NEXT PosNumber

This will return 4 bytes (or more when @FM is part of the code) for any input string.

chulett · Post by **chulett** » Thu Oct 20, 2005 2:53 pm

Thanks for your help with all this, Arnd. I've got something that seems to be working fine and giving them what they want - inspired by your code and guidance. I've stashed yours away for a rainy day, copyright and all, in case I find a need for it. They ended up wanting something more displayable ('queryable' as they put it) so I ended up taking a slightly different tack.

My incoming string gets hacked into 5 (as even as possible) pieces and each 'chunk' is CRC32'd. Instead of packing the results I ended up using the "MX" conversion code to translate each chunk's CRC32 value into Hex and then the 5 hex values are concatenated together. Gives them a consistant 40 character string that is purty and queryable - and unique to boot.

I've tested it with the 2 to 3 million record datasets we have available for it right now and there were no collisions across the generated keys so everyone is happy at the moment. :D

Thanks again.

clshore · Post by **clshore** » Thu Oct 20, 2005 4:21 pm

OK, 200 input characters (assume a-z,A-Z,0-9, about 64 possibilites?).
Each segment is 40 of these characters,
so the number of different possible character patterns is:

64^40 = 1.76 x 10^72

Each of these segments is translated by the CRC function into an 8 character string of hex. The number of different possible hex patterns, 00000000 to FFFFFFFF is:

16^8 = 4.29 x 10^9 (same as 2^32 of course)

So the ratio of possible string patterns to possible hex patterns
for each segment is:

1.76 x 10^72
---------------- = 4.1 x 10^62
4.29 x 10^9

Gosh, that's a big number.

Of course, your real world input string is most likely not a random set of characters, as there are probably recurring patterns of alpha and numeric characters that will produce localizations.

Still, how will you know if a collision takes place?

On the other hand, does it matter in your application?

Carter

ray.wurlod · Post by **ray.wurlod** » Thu Oct 20, 2005 5:08 pm

Write to a text file in a Sequential File stage. Specify compress or gzip in the filter.

ArndW · Post by **ArndW** » Fri Oct 21, 2005 2:16 am

clshore,

to add to your computations, the CRC32() algorithm is designed to ensure that sequences that are quite similar will produce very dissimilar results, so that in this example the odds are further reduced by the data being constrained to mainly +-60 characters and maximum length of 200.

The likelihood of a collision with 60!/50!(60-50)! (75,394,027,566 possible string combinations) going into 2^32 (4,294,967,296) numbers is very high. But doing this 4 times brings the odds back up to 5.16x10^-73. I think that those are odds worth playing!

chulett · Post by **chulett** » Fri Oct 21, 2005 7:05 am

They take them beyond the realm where I am going to worry about it... too many other fires burning. Just grateful one was put out.