Re: Can Access use Fuzzy Logic - RegEx
- From: "Kevin G. Anderson" <kgander@xxxxxxxxx>
- Date: 7 Apr 2006 07:07:57 -0700
David W. Fenton wrote:
"cassetti@xxxxxxxxx" <cassetti@xxxxxxxxx> wrote in
news:1144162213.753763.185180@xxxxxxxxxxxxxxxxxxxxxxxxxxxx:
Here's the issue:
I have roughly 20 MS excel spreadsheets, each row contains a
record. These records were hand entered by people in call centers.
The problem is, there can and are duplicate phone numbers, and
emails and addresses even person names. I need to sift through all
this data (roughly 300,000+ records and use fuzzy logic to break
it down, so that i have only unique records.
Can I use Access or what to sort through all this data?
I think you've asked the wrong question. You don't want fuzzy logic
but fuzzy criteria matching.
I've de-duped millions of records, and it's very hard work.
The first thing you need to do is decide what constitutes uniqueness
(name, name+address, etc.).
Then you need to get all records into the exactly the same format.
This means that you need to process certain fields to get them all
formatted the same. For instance, phone numbers might be entered as
123-456-7890 and (123) 456-7890 and 123.456.7890 and +1 123
456-7890. To de-dupe on phone numbers, you'd need to first process
out all the variations. What I'd do with phone numbers is first
strip out the leading +1 or 1, then I'd strip out all characters
except the numbers. I wouldn't reformat them, but leave them that
way because the de-duping process is not going to be helped by
adding more characters. When the final data is cleaned up, you may
choose to reformat the data (or just display it with the appropriate
parens and dashes).
Another problem is that if there are multiple phone numbers, you
might end up with them in the wrong columns. That is, if there are
just Phone1 and Phone2 instead of Home, Work, Mobile, you might end
up with two records that are the same person and the Phone1 field
might match the Phone2 field of the other record. You can
restructure the data to fix this problem (normalization), by putting
the phone numbers in a separate table with a link to the source
record and the phone number. Then you'd only have to search for dups
on a single field in this phone number table.
Of course, there could be lots of false matches if, say, there are
many different people sharing the same phone number, so you wouldn't
want to use phone number by itself to determine duplicates.
To do all this parsing, you'll need to familiarize yourself with the
VBA string functions, Replace(), Mid(), Left() and Right().
Addresses are much more complicated. I've got lots of code that I've
written to process addresses into standard formats, and to evalute
the contents and encode them in a format that is easily comparable.
But it's very complicated and I'm not prepared to share it here. The
main point I want to make is that this is a hard task.
Regular expressions are a much more powerful than VBA string functions
for parsing phone numbers, email addresses, etc. out of unstructured
data. And OLE makes it easy to use VBScript's RegEx object within an
Access app. Plus you'll be able to leverage your knowledge of RegEx in
just about every other programming evironment you encounter in the
future.
Search MSDN or the Internet to learn more about what you can do with
RegEx, or check out Jeffrey E. F. Friedl's "Mastering Regular
Expressions" book (published by O'Reilly).
Kevin
.
- Follow-Ups:
- Re: Can Access use Fuzzy Logic - RegEx
- From: David W. Fenton
- Re: Can Access use Fuzzy Logic - RegEx
- References:
- Can Access use Fuzzy Logic
- From: cassetti@xxxxxxxxx
- Re: Can Access use Fuzzy Logic
- From: David W. Fenton
- Can Access use Fuzzy Logic
- Prev by Date: Finding image dimensions in OLE Object field
- Next by Date: Re: Another Excel Automation Newbie Question
- Previous by thread: Re: Can Access use Fuzzy Logic
- Next by thread: Re: Can Access use Fuzzy Logic - RegEx
- Index(es):
Relevant Pages
|