override string similarity function

Posted: Mon Aug 04, 2014 3:45 am
by surfsup
Howdy,

I know this is a long shot and probably not possible (in an IBM-approved manner), but is there any way to override the string comparison function used in classification?

E.g.
The word HERTFORDSHIRE has a tolerance of 700, but it will not match IERTFORDSHIRE. It will happily match misspellings in the middle of the word, though.


Cheers,
A

Posted: Mon Aug 04, 2014 2:56 pm
by rjdickson
Hi,

I doubt that IBM will change the existing algorithms, as that would impact many, many existing customers.

My testing did show that they did match, but barely.

However, this is an example of something that may be fixable in Standardization. Can you verify the address? This would presumably change the city name to the correct name. If no verification is in play, can you standardize it to the correct spelling?

In other words - this may be a Standardization challenge versus a Matching challenge.

Posted: Mon Aug 04, 2014 5:59 pm
by ray.wurlod
For example, these two would match using Reverse Soundex, even without any other standardisation.
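To show why that works, here is a minimal Python sketch. It assumes Reverse Soundex is simply standard American Soundex applied to the reversed string; the function names are illustrative and this is not QualityStage's own code. Because the encoding keeps the first letter it sees, a bad first letter breaks ordinary Soundex, while the reversed form starts from the (correct) last letter.

```python
# Standard American Soundex letter groups; vowels, Y, H and W carry no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the 4-character Soundex code of word ('' if it has no letters)."""
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    result = word[0]
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


def reverse_soundex(word):
    """Assumed definition: Soundex of the reversed word."""
    return soundex(word[::-1])


print(soundex("HERTFORDSHIRE"), soundex("IERTFORDSHIRE"))
# H631 I631  (differ, because the first letter is kept verbatim)
print(reverse_soundex("HERTFORDSHIRE"), reverse_soundex("IERTFORDSHIRE"))
# E623 E623  (agree, because the error sits at the far end of the reversed word)
```

The same reasoning cuts the other way: an error in the last letter would break the reversed code but not the forward one.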

Posted: Tue Aug 05, 2014 4:58 am
by surfsup
Hi Rj,

This was just an example; there are any number of possible individual errors in the source, and providing for all their various combinations in the standardisation phase would be much more costly (in both time and resources) than enhancing the standardisation function (I don't expect IBM to change the existing functionality in the product).


Hi Ray,

Reverse Soundex on these individual words would work, but the source data contains various other errors (letters as numbers, numbers as individual letters or groups of letters, and letter replacements) within the same word at the beginning, middle and/or end.

My first stop to manage these errors would be to augment the piece of code that QS uses for classification of tokens (and hence write fewer standardisation rules and reuse more of the existing rule sets).
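
For illustration only, the kind of comparison meant here could be sketched in Python as below. This is not QualityStage code and not the algorithm QS uses for classification: the look-alike table, the function names and the use of `difflib` are all assumptions made for the sketch, and the score is a 0 to 1 ratio rather than a QS-style tolerance.

```python
import difflib

# Hypothetical table of digits commonly typed in place of look-alike letters.
LOOKALIKES = str.maketrans({"0": "O", "1": "I", "5": "S", "8": "B"})


def normalise(token):
    """Upper-case the token and map look-alike digits back to letters."""
    return token.upper().translate(LOOKALIKES)


def similarity(a, b):
    """Return a 0..1 similarity that treats every position in the word alike."""
    return difflib.SequenceMatcher(None, normalise(a), normalise(b)).ratio()


# Digit substitutions disappear after normalisation.
print(similarity("HERTFORDSHIRE", "HERTF0RD5HIRE"))
# A wrong first letter costs no more than a wrong letter in the middle.
print(similarity("HERTFORDSHIRE", "IERTFORDSHIRE"))
print(similarity("HERTFORDSHIRE", "HERTFORDXHIRE"))
```

A blanket digit-to-letter mapping would of course be wrong for tokens that are meant to contain digits (house numbers, postcodes), so in practice it would have to be limited to tokens expected to be words.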