Address Cleansing

InfoSphere's Quality Product

Moderators: chulett, rschirm

pklcnu
Premium Member
Posts: 50
Joined: Wed Aug 06, 2008 4:39 pm

Address Cleansing

Post by pklcnu »

Dear Experts

I need to do address cleansing. The table has different columns for the address, such as HouseNumber, StreetName, Location and PostCode.

The data in these columns is all mixed up. The following are some of the ways the data has been entered in the columns:
1) In the HouseNumber column the full address is entered
2) The house number and street name are mixed up across one column or the other
3) The house number is present in both the HouseNumber and StreetName columns
4) Some of the values in the columns are entered correctly

How exactly should I clean this type of data? What stages should I use for this, and how?

Please help me with this. Many thanks.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

That's exactly the kind of thing QualityStage does well. You can use your xxPREP rule set to lump absolutely every element into domain-specific single fields, and have the domain-specific rule sets (xxNAME, xxADDR, xxAREA) process these.

Or, instead of using the xxPREP rule set, you can assemble the single field in a Transformer stage.
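
(To make the pre-assembly step concrete outside QualityStage: conceptually, all the Transformer derivation or the xxPREP input needs is the mixed columns collapsed into one free-text field for the domain-specific rule sets to parse. A minimal sketch in Python, using the column names from the original post, might look like this.)

[code]
# Illustrative sketch only, not QualityStage itself: collapse the mixed
# address columns into one free-text field for the domain rule sets to parse.
def assemble_address(row):
    """Concatenate the mixed address columns into a single free-text field."""
    columns = ("HouseNumber", "StreetName", "Location", "PostCode")
    parts = [row.get(col, "") for col in columns]
    # Drop empty values and collapse repeated whitespace so blanks don't
    # produce spurious tokens for the parser.
    return " ".join(" ".join(p.split()) for p in parts if p and p.strip())

# Example with the kind of mixed-up data described in the original post:
row = {"HouseNumber": "12 High Street", "StreetName": "12 High St",
       "Location": "Leeds", "PostCode": "LS1 4DY"}
print(assemble_address(row))  # -> "12 High Street 12 High St Leeds LS1 4DY"
[/code]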
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Murali4u
Participant
Posts: 37
Joined: Sun Feb 21, 2010 12:27 pm

Post by Murali4u »

[quote="ray.wurlod"]That's exactly the kind of thing QualityStage does well. You can use your xxPREP rule set to lump absolutely every element into domain-specific single fields, and have the domain-specific rule sets ( ...[/quote]

Ray, I understand the scenario, but can you explain in detail how to do it?
Rockzz Tech
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

No, because the detail will depend completely on just how the data are "mixed up". Do the QualityStage Essentials class to learn about using the xxPREP rule set (among other things).

That said, my suggested approach will work "out of the box" - at least with those countries for which rule sets are supplied.

You can tweak results finely with overrides, but address cleansing is QualityStage's forte.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Murali4u
Participant
Posts: 37
Joined: Sun Feb 21, 2010 12:27 pm

Post by Murali4u »

[quote="ray.wurlod"]No, because the detail will depend completely on just how the data are "mixed up". Do the QualityStage Essentials class to learn about using the xxPREP rule set (among other things).

That said, my suggested approach will work "out of the box" - at least with those countries for which rule sets are supplied.

You can tweak results finely with overrides, but address cleansing is QualityStage's forte.[/quote]

Yes Ray, you are correct, but I am asking about this specific scenario.
Take the case where, while parsing a UK address, one column contains only the postcode, and some of the values in it are junk or purely numeric characters (UK postcodes are alphanumeric). I might use GBPREP or one of the GBxxxx rule sets.
1. Will it give a clean end result?
2. What should I do with the unhandled data?
3. What should I do to make the data cleaner?

Can you suggest a solution for this? Which rule set will give the clearest result?

Thanks in advance
Rockzz Tech
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The GBAREA rule set knows both formats of UK postal codes and how to handle them.

You can verify this by inspecting the Pattern Action Language for this rule set.

Just try it. You'll be pleasantly surprised.

As for unhandled patterns or data, investigate these with a view to tweaking how the rules are working for you, probably through the use of overrides.
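
(For a rough sense of the variation GBAREA has to cope with - this is only an illustrative sketch, not the rule set itself - UK postcodes come in several alphanumeric shapes, e.g. M1 1AE, B33 8TH, CR2 6XH, DN55 1PT, W1A 0AX, EC1A 1BB. A simplified check in Python might look like the following; the real rule set does far more, including parsing the code out of a larger address line and refining results through overrides.)

[code]
import re

# Simplified illustration only, not the GBAREA rule set: match the common
# UK postcode shapes (1-2 letters, digit, optional letter/digit, then
# digit plus two letters), with or without the internal space.
UK_POSTCODE = re.compile(r"^[A-Z]{1,2}[0-9][A-Z0-9]?\s*[0-9][A-Z]{2}$")

def looks_like_uk_postcode(value: str) -> bool:
    """Return True if the value matches the common UK postcode shapes."""
    return bool(UK_POSTCODE.match(value.strip().upper()))

for v in ["SW1A 1AA", "LS14DY", "12345", "#$#Csa"]:
    print(v, looks_like_uk_postcode(v))  # True, True, False, False
[/code]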
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

Murali4u wrote:Yes Ray, you are correct, but I am asking about this specific scenario.
Take the case where, while parsing a UK address, one column contains only the postcode, and some of the values in it are junk or purely numeric characters (UK postcodes are alphanumeric). I might use GBPREP or one of the GBxxxx rule sets.
1. Will it give a clean end result?
2. What should I do with the unhandled data?
3. What should I do to make the data cleaner?

Can you suggest a solution for this? Which rule set will give the clearest result?

Thanks in advance
As Ray says, xxPREP is the place to start for this sort of stuff. Then run domain-specific rules over address, locality, name, etc.
That said, unstructured data over multiple fields is rarely a cookie-cutter type deal. That's what they pay us the medium-sized bucks for, right? ;-)

- Try it: STAN it and see what you get.
- Run it through an Investigate job over each of the fields to do some pattern analysis. (If the data is spread across multiple fields and it's not obvious from the table metadata, word pattern analysis can help you work out the order in which the fields should be combined to make the most sense; see the sketch at the end of this post.)
- Check out the PAT file and see what it does.

The better you understand the data, the better idea you'll have about what to do with it.
As for what to do with the unhandled data, it depends. It could be junk, valid data put in the wrong place, or valid data that the rule set doesn't handle correctly. You need to run it through and see what you get.

Sorry I can't give the right answer straight up, but I don't know your data. When you understand it, the right answer will become apparent. :-)
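
(As a hedged illustration of the word-pattern idea mentioned above - this is not the Investigate stage, and the class letters are generic placeholders rather than QualityStage's own classification codes - a crude approximation in Python might look like this.)

[code]
import re
from collections import Counter

# Rough approximation of word-pattern analysis, NOT the Investigate stage:
# map each token to a generic class and count how often each pattern occurs.
def word_pattern(value: str) -> str:
    """Map tokens to classes: N=numeric, A=alphabetic, M=mixed, S=other."""
    classes = []
    for token in value.split():
        if token.isdigit():
            classes.append("N")
        elif token.isalpha():
            classes.append("A")
        elif re.search(r"\d", token):
            classes.append("M")
        else:
            classes.append("S")
    return " ".join(classes)

addresses = ["12 High Street", "Flat 3B London Road", "LS1 4DY", "#$#Csa jenny"]
print(Counter(word_pattern(a) for a in addresses).most_common())
# e.g. [('N A A', 1), ('A M A A', 1), ('M M', 1), ('S A', 1)]
[/code]

The frequent patterns tell you which shapes of data dominate and which oddballs need attention.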
Murali4u
Participant
Posts: 37
Joined: Sun Feb 21, 2010 12:27 pm

Post by Murali4u »

I understand everything you are trying to say, Stuart.

Let me take up the scenario and describe it in detail.

Hopefully this helps the newbies as well.

Take a column that contains name values, say for the UK or the US (I know you would apply the GBNAME or USNAME rule set). It has some valid and some invalid values such as '123', 'john', 'kennedy123', '#$#Csa jenny', etc. As you illustrated, this is not simply data put in the wrong place.

How do you do the standardization or correction of such values, and with which stages?
Could you suggest all the possible ways to do it, based on your experience?
:o
Rockzz Tech
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

There is an infinite number of ways, and therefore no-one is going to attempt to do what you ask.

From my experience, you do it by running the data through Standardization, then using Investigation on the unhandled patterns/data - at least the most frequently occurring ones - to refine your rules. Since you can't alter the built-in rule sets, the refinement is usually through overrides.

Ultimately, as Stuart has already said, you must understand your data.

A heavier-duty alternative is to make a copy of the rule set, and modify the copy. Now you can vary every possible aspect of the rule set.
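
(A minimal sketch of that review loop, outside QualityStage: take the unhandled-pattern column from the Standardize output - the field name below is hypothetical - and tally the patterns so the most frequent ones can be targeted first with overrides.)

[code]
from collections import Counter

# Hedged sketch: prioritise override work by counting unhandled patterns.
# "UnhandledPattern" is a hypothetical column name standing in for the
# Standardize output field that holds the pattern of the unhandled data.
def most_frequent_unhandled(records, pattern_field="UnhandledPattern", top=10):
    """Count unhandled patterns and return the most common ones."""
    counts = Counter(r[pattern_field] for r in records if r.get(pattern_field))
    return counts.most_common(top)

records = [
    {"UnhandledPattern": "N M A"},
    {"UnhandledPattern": "N M A"},
    {"UnhandledPattern": "S A"},
]
print(most_frequent_unhandled(records))  # -> [('N M A', 2), ('S A', 1)]
[/code]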
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

What is "valid"?

Is "R2D2" valid in a name context?

That's not the right question. Is "R2D2" (or "kennedy123" for that matter) valid in YOUR name context?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

The thing about this sort of work is that while there's a science to it, there's also an art.
It's hard to teach the art, especially over email. :wink:

As Ray and I have both said, the first step is understanding the data you have, and what data is acceptable in the situation.

As for what to do, it's as we have said: Investigate, STAN, know what the rules do and don't do.
Look at the unhandled data. Understand why it's there.
Look at the handled data. Understand why it was parsed the way it was. Is it parsed correctly?
Rinse and repeat.

It's not just a "plug it in and your problems are solved" type of activity you're doing here.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

stuartjvnorton wrote:Rinse and repeat.
I love it!!!
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Murali4u
Participant
Posts: 37
Joined: Sun Feb 21, 2010 12:27 pm

Post by Murali4u »

Thanks Ray and Stuart. I have done all the things that you both have set out, but out of curiosity I'm just asking how to cleanse and correct the data values. That's it. Thanks for your comments :) :)
Rockzz Tech
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

No idea at all. I have not seen your data, I have not seen your business rules that make those data "bad". So how am I expected to be able to know how to "fix" them? A totally unreasonable request.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Murali4u
Participant
Posts: 37
Joined: Sun Feb 21, 2010 12:27 pm

Post by Murali4u »

[quote="ray.wurlod"]No idea at all. I have not seen your data, I have not seen your business rules that make those data "bad". So how am I expected to be able to know how to "fix" them? A totally unreasonable request. ...[/quote]

Hi Ray, hold on for a minute. I didn't ask or request you to analyze my data or anything. I just asked what things I am supposed to do. Cool.
Rockzz Tech