Re: Can Access use Fuzzy Logic
- From: "David W. Fenton" <XXXusenet@xxxxxxxxxxxxxxxxxxx>
- Date: Wed, 05 Apr 2006 10:07:07 -0500
"cassetti@xxxxxxxxx" <cassetti@xxxxxxxxx> wrote in
news:1144162213.753763.185180@xxxxxxxxxxxxxxxxxxxxxxxxxxxx:
Here's the issue:
I have roughly 20 MS excel spreadsheets, each row contains a
record. These records were hand entered by people in call centers.
The problem is, there can and are duplicate phone numbers, and
emails and addresses even person names. I need to sift through all
this data (roughly 300,000+ records and use fuzzy logic to break
it down, so that i have only unique records.
Can I use Access or what to sort through all this data?
I think you've asked the wrong question. You don't want fuzzy logic
but fuzzy criteria matching.
I've de-duped millions of records, and it's very hard work.
The first thing you need to do is decide what constitutes uniqueness
(name, name+address, etc.).
Then you need to get all records into the exactly the same format.
This means that you need to process certain fields to get them all
formatted the same. For instance, phone numbers might be entered as
123-456-7890 and (123) 456-7890 and 123.456.7890 and +1 123
456-7890. To de-dupe on phone numbers, you'd need to first process
out all the variations. What I'd do with phone numbers is first
strip out the leading +1 or 1, then I'd strip out all characters
except the numbers. I wouldn't reformat them, but leave them that
way because the de-duping process is not going to be helped by
adding more characters. When the final data is cleaned up, you may
choose to reformat the data (or just display it with the appropriate
parens and dashes).
Another problem is that if there are multiple phone numbers, you
might end up with them in the wrong columns. That is, if there are
just Phone1 and Phone2 instead of Home, Work, Mobile, you might end
up with two records that are the same person and the Phone1 field
might match the Phone2 field of the other record. You can
restructure the data to fix this problem (normalization), by putting
the phone numbers in a separate table with a link to the source
record and the phone number. Then you'd only have to search for dups
on a single field in this phone number table.
Of course, there could be lots of false matches if, say, there are
many different people sharing the same phone number, so you wouldn't
want to use phone number by itself to determine duplicates.
To do all this parsing, you'll need to familiarize yourself with the
VBA string functions, Replace(), Mid(), Left() and Right().
Addresses are much more complicated. I've got lots of code that I've
written to process addresses into standard formats, and to evalute
the contents and encode them in a format that is easily comparable.
But it's very complicated and I'm not prepared to share it here. The
main point I want to make is that this is a hard task.
For names, you probably want to look into the name processing
schemes, like Soundex(), Soundex2() and Simul(). Plain old Soundex()
is virtually useless (the matches are way too loose), and Simul() is
rather heavy on processing, so I generally use a combination of
Soundex(), Soundex2() and substring comparisons. However, none of
these works well with anything but standard English/American names.
If there are lots of foreign names, it is going to be less
successful.
To summarize, the way I'd do this is import the spread***(s) into
Access, then add fields for the processed data (so I'd retain the
original values in case I needed to go back to them), and
restructure the data into more than one table. Then I'd do my
de-duping on the resulting fields. First I'd start with exact
matches on the processed/derived fields and then I'd work with
various forms of LIKE matches on the records to find "fuzzy"
matches.
The problem is that none of this can be easily done in an automated
fashion. It almost always requires a human being to make a final
determination of what really is a duplicate. What I do is try to do
as many of them with queries and code, and then have the remainder
evaluated by a human being. But with hundreds of thousands of
records, this can be impossible. So you're left with recognizing
that you can never get a perfectly de-duped resultset.
This is why it's important in a database application to have as many
protections against the production of duplicate records as possible.
This requires designing your data entry process to check for
duplicates and flag possible dupes for the users who then decide
whether or not to go ahead and create the new record, or use the
existing record. It's also important to build tools for the users so
that they can review possible duplicates and reconcile them. This
can be very complicated, and probably should be restricted to
administrative or well-trained users.
--
David W. Fenton http://www.dfenton.com/
usenet at dfenton dot com http://www.dfenton.com/DFA/
.
- Follow-Ups:
- Re: Can Access use Fuzzy Logic - RegEx
- From: Kevin G. Anderson
- Re: Can Access use Fuzzy Logic
- From: kaniest
- Re: Can Access use Fuzzy Logic - RegEx
- References:
- Can Access use Fuzzy Logic
- From: cassetti@xxxxxxxxx
- Can Access use Fuzzy Logic
- Prev by Date: Re: Question on primary key
- Next by Date: Re: Trouble with DAO "SEEK" in converting application to SQL Express back end.
- Previous by thread: Re: Can Access use Fuzzy Logic
- Next by thread: Re: Can Access use Fuzzy Logic
- Index(es):