10 pounds in a 5 pound bag

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Craig,

10% finished so far and no duplicates. Have you got a fast machine?
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

They seem to think so - an HP Superdome running HP-UX.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Craig,

the job ran through without a collision on all permutations of a 10-character string using CRC32. I just realized that this is not guarantee that it won't get a duplicate on a 9 or 11 length string - but the chances of a collision using 12 CRC32'd number on a 200 character string string are so infinitesmal to be effectively equal to 0, imho. I think the chances of getting a duplicate on 2 are small enough.
srinagesh
Participant
Posts: 125
Joined: Mon Jul 25, 2005 7:03 am

Post by srinagesh »

I have a got a different approach to this problem.
May sound silly but it definetly works

Create a table with two fields (key number(20), charvalue varchar2(200))

populate this table with the character values (Key is auto generated / sequence).

Use the key value to replace the "200" length character string.

This approach definetly works

-Nagesh
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Srinagesh,

actually, that is a really good solution. I was so concentrated on the string 'compression' function that the (now obvious) table-based solution that you have come up with completely escaped me. With th lookup file you could use a shorter key as well - just start with 1 and work upwards from there.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

It is an interesting solution, unfortunately I don't believe it will work for us. I've made the classic newbie mistake of not giving you 100 percent of the information needed to solve my problem. :oops:

I was looking for an function because it needs to be repeatable by the business partner we are creating the feed for that it will be incorporated in. At least I'm pretty sure that is one of the requirements... I need to double-check. So while 'DataStage-centric' solutions are cool, I think they are out of picture for what I need to do here. :?

If I need something repeatable, I'm assuming the CRC32 method would qualify, yes? Is it different than the CRC32() function available in DataStage? I ask because I ran some strings through it yesterday and got some rather large values out of the routine - 8 or 9 digits if I recall - rather than the 4 digits you mentioned, hence the question.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Craig,

the answer comes as a 32-bit digit, so the displayed value is going to chew up a lot of spaces, what I meant is that this can be compressed into 4 bytes using a MBOC conversion. I have a routine here somewhere that does that {it also makes sure that no @FM is in the string, so that it can be written to a hashed file}.

But to keep things simple you can split your string into 5 substrings and concatenate the CRC each of them. Still mathematically inconceivable to get a collision and simpler to implement while staying under your 50 character maximum.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I'd be curious to see your routine as I'm not having alot of luck duplicating what you are saying here. Sorry.

I can find an 'MB0C' conversion code (zero not oh) but that seems to output a giant binary number. Did you perhaps mean "MO0C" to get Octal? :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Craig,

to convert from an 32 bit integer number to a "packed" 4 byte number the basic method that I used in a function is:

Code: Select all

** (c) 2005 AWussing
   WorkString = OCONV(CRC32(InString),'MB')
   IF LEN(WorkString)<>32 THEN WorkString = STR('0',32-LEN(WorkString)):WorkString
   Ans = ''
   FOR PosNumber = 1 TO 32 STEP 8
      NewChar = ICONV(WorkString[PosNumber,8],'MB0C')
      IF (NewChar=@FM) THEN Ans := @VM:@VM ELSE Ans := NewChar
   NEXT PosNumber
This will return 4 bytes (or more when @FM is part of the code) for any input string.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Thanks for your help with all this, Arnd. I've got something that seems to be working fine and giving them what they want - inspired by your code and guidance. I've stashed yours away for a rainy day, copyright and all, in case I find a need for it. They ended up wanting something more displayable ('queryable' as they put it) so I ended up taking a slightly different tack.

My incoming string gets hacked into 5 (as even as possible) pieces and each 'chunk' is CRC32'd. Instead of packing the results I ended up using the "MX" conversion code to translate each chunk's CRC32 value into Hex and then the 5 hex values are concatenated together. Gives them a consistant 40 character string that is purty and queryable - and unique to boot. :wink:

I've tested it with the 2 to 3 million record datasets we have available for it right now and there were no collisions across the generated keys so everyone is happy at the moment. :D

Thanks again.
-craig

"You can never have too many knives" -- Logan Nine Fingers
clshore
Charter Member
Charter Member
Posts: 115
Joined: Tue Oct 21, 2003 11:45 am

Post by clshore »

OK, 200 input characters (assume a-z,A-Z,0-9, about 64 possibilites?).
Each segment is 40 of these characters,
so the number of different possible character patterns is:

64^40 = 1.76 x 10^72

Each of these segments is translated by the CRC function into an 8 character string of hex. The number of different possible hex patterns, 00000000 to FFFFFFFF is:

16^8 = 4.29 x 10^9 (same as 2^32 of course)

So the ratio of possible string patterns to possible hex patterns
for each segment is:

1.76 x 10^72
---------------- = 4.1 x 10^62
4.29 x 10^9

Gosh, that's a big number.

Of course, your real world input string is most likely not a random set of characters, as there are probably recurring patterns of alpha and numeric characters that will produce localizations.

Still, how will you know if a collision takes place?

On the other hand, does it matter in your application?

Carter
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Write to a text file in a Sequential File stage. Specify compress or gzip in the filter. :lol:
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

clshore,

to add to your computations, the CRC32() algorithm is designed to ensure that sequences that are quite similar will produce very dissimilar results, so that in this example the odds are further reduced by the data being constrained to mainly +-60 characters and maximum length of 200.

The likelihood of a collision with 60!/50!(60-50)! (75,394,027,566 possible string combinations) going into 2^32 (4,294,967,296) numbers is very high. But doing this 4 times brings the odds back up to 5.16x10^-73. I think that those are odds worth playing!
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

They take them beyond the realm where I am going to worry about it... too many other fires burning. Just grateful one was put out. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
Post Reply