Re: Can Access use Fuzzy Logic
- From: "David W. Fenton" <XXXusenet@xxxxxxxxxxxxxxxxxxx>
- Date: Thu, 06 Apr 2006 08:32:38 -0500
"kaniest" <kaniest@xxxxxxxxxxxxxxx> wrote in
news:443411b4$0$11068$e4fe514c@xxxxxxxxxxxxxx:
David W. Fenton wrote:
"kaniest" <kaniest@xxxxxxxxxxxxxxx> wrote in
news:4433f578$0$11079$e4fe514c@xxxxxxxxxxxxxx:
David W. Fenton wrote
"cassetti@xxxxxxxxx" <cassetti@xxxxxxxxx> wrote
Here's the issue:
I have roughly 20 MS excel spreadsheets, each row contains a
record. These records were hand entered by people in call
centers.
The problem is, there can be and are duplicate phone numbers, emails,
addresses, and even person names. I need to sift through
all this data (roughly 300,000+ records) and use fuzzy logic to
break it down, so that I have only unique records.
Can I use Access or what to sort through all this data?
I think you've asked the wrong question. You don't want fuzzy
logic but fuzzy criteria matching.
One way is finding groups of similar records, then deciding
statistically (fuzzy) whether they represent the same
object. If so, pick its most probable correct attributes.
This statement is meaningless to me. Can you amplify what you
mean by it and how it would be implemented?
Several links on fuzzy logic have been given.
But what's the real-world application to *this* problem space?
The first thing you need to do is decide what constitutes
uniqueness (name, name+address, etc.).
Actually, when using fuzzy logic you want to decide what
constitutes similarity. Uniqueness will already be defined
in the database.
Well, since the source data is a spreadsheet, there *aren't* any
definitions of what constitutes uniqueness.
Some people keep databases in directories and text files ...
Why not in a spreadsheet?
Huh?
The key point here is that the accuracy of any "fuzzy logic"
comparison is going to be increased if your source data is already
pre-processed and regularized. A spreadsheet has very few facilities
for maintaining data consistency during the data entry process. A
database *does* have the ability to define what constitutes
uniqueness at the db engine level, but in the present instance,
there is no engine enforcing any uniqueness rules in the data entry
process.
Of course, it could very well be that the spreadsheet is just a data
transfer medium, and not the data entry method. It may be that the
data actually comes out of a database and actually *did* have data
validation and uniqueness rules applied to it.
But the point I'm making is that any de-duping will be made more
accurate by pre-processing the data to weed out irregularities in
the data entry, by regularizing the format of the data values.
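As a rough illustration of that kind of pre-processing, here is a minimal Python sketch. The function names are hypothetical, and the phone rule assumes US-style 10-digit numbers with an optional leading country code of 1; real data would almost certainly need more rules than this.

```python
import re


def normalize_phone(raw: str) -> str:
    """Reduce a phone number to its digits only.

    Assumption: US-style numbers. An 11-digit number starting with '1'
    is treated as country code + 10-digit number.
    """
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits


def normalize_email(raw: str) -> str:
    """Trim surrounding whitespace and lower-case the address."""
    return raw.strip().lower()


def normalize_name(raw: str) -> str:
    """Collapse runs of whitespace and upper-case the name."""
    return " ".join(raw.split()).upper()
```

Once every value is in one format, "212 123-4567", "(212) 123-4567" and "1-212-123-4567" all compare as equal, so a plain equality test already catches many duplicates before any fuzzy matching is attempted.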
Similarity of phone numbers can also be defined as (a function
of) the number of common digits when compared in reverse.
Eh? Telephone numbers are unique once they are all in a common
format. What you've described sounds like you'd consider 212
123-4567 and 212 123-7654 as the same statistically, since they
have the same digits. The numbers 212 123-4567 and 212 124-4567
differ by only one number, but are less likely to be an
indication of a duplicate record than 212 123-4567 and 212
123-4568.
I meant the length of a common tail. That could be one of
the applied measures, along with many others that take phone
numbers into account.
"Common tail" is a statistical term (if I'm understanding it
correctly, like the long tail of a distribution), not a database
term. You'll have to explain how it applies to database operations.
So, I don't see much that can be gained by statistical analysis
of phone numbers.
The point is that several competing measures of similarity can be
weighted against each other. Simply pick measures that work,
whatever their "real" meaning may be.
I understand now. But it would help those of us who have no real
statistical background if, when you're posting in a database
newsgroup, you translate the jargon into terms that mean something.
This whole back and forth could have been avoided.
I don't know that the original poster has actually clarified whether
he really meant "fuzzy logic" as you're defining it, or if he meant
"fuzzy matches".
My address de-duping routines actually do use certain kinds of
measures of similarity between certain fields. But I've never done
any kind of real statistical evaluation of this, just an ad hoc
choice based on eyeballing the results.
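For anyone following along, the idea discussed above (a "common tail" measure for phone numbers, weighted against another measure) could be sketched in Python roughly as follows. This is not the routine described in this post; the function names, the record layout with "phone" and "name" keys, and the 0.6/0.4 weights are all assumptions chosen for illustration, and the weights would need tuning against real data.

```python
from difflib import SequenceMatcher


def common_tail_length(a: str, b: str) -> int:
    """Count how many trailing characters two strings share."""
    n = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        n += 1
    return n


def phone_similarity(a: str, b: str) -> float:
    """Common tail length as a fraction of the longer number (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return common_tail_length(a, b) / max(len(a), len(b))


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio from the standard library's difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: dict, r2: dict,
                      w_phone: float = 0.6, w_name: float = 0.4) -> float:
    """Weighted sum of the two measures; the weights are arbitrary placeholders."""
    return (w_phone * phone_similarity(r1["phone"], r2["phone"])
            + w_name * name_similarity(r1["name"], r2["name"]))
```

The tail measure is mostly useful when one copy of a number lacks the area code: "1234567" and "2121234567" share a tail of 7 digits. It says nothing about numbers that differ only in the last digit, which is one reason to combine it with other measures rather than rely on it alone.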
--
David W. Fenton http://www.dfenton.com/
usenet at dfenton dot com http://www.dfenton.com/DFA/