Re: Can Access use Fuzzy Logic



"kaniest" <kaniest@xxxxxxxxxxxxxxx> wrote in
news:443411b4$0$11068$e4fe514c@xxxxxxxxxxxxxx:

David W. Fenton wrote:
"kaniest" <kaniest@xxxxxxxxxxxxxxx> wrote in
news:4433f578$0$11079$e4fe514c@xxxxxxxxxxxxxx:

David W. Fenton wrote
"cassetti@xxxxxxxxx" <cassetti@xxxxxxxxx> wrote
Here's the issue:

I have roughly 20 MS Excel spreadsheets; each row contains a
record. These records were hand-entered by people in call
centers.

The problem is that there can be, and are, duplicate phone numbers,
emails, addresses, and even person names. I need to sift through
all this data (roughly 300,000+ records) and use fuzzy logic to
break it down, so that I have only unique records.

Can I use Access, or something else, to sort through all this data?

I think you've asked the wrong question. You don't want fuzzy
logic but fuzzy criteria matching.

One way is to find groups of similar records, then decide
statistically (fuzzily) whether they represent the same
object. If so, pick its most probable correct attributes.

This statement is meaningless to me. Can you amplify what you
mean by it and how it would be implemented?

Several links on fuzzy logic have been given.

But what's the real-world application to *this* problem space?

The first thing you need to do is decide what constitutes
uniqueness (name, name+address, etc.).

Actually, when using fuzzy logic you want to decide what
constitutes similarity. Uniqueness will already be defined
in the database.

Well, since the source data is a spreadsheet, there *aren't* any
definitions of what constitutes uniqueness.

Some people keep databases in directories and text files ...
Why not in a spreadsheet?

Huh?

The key point here is that the accuracy of any "fuzzy logic"
comparison is going to be increased if your source data is already
pre-processed and regularized. A spreadsheet has very few facilities
for maintaining data consistency during the data entry process. A
database *does* have the ability to define what constitutes
uniqueness at the db engine level, but in the present instance,
there is no engine enforcing any uniqueness rules in the data entry
process.

Of course, it could very well be that the spreadsheet is just a data
transfer medium, and not the data entry method. It may be that the
data actually comes out of a database and actually *did* have data
validation and uniqueness rules applied to it.

But the point I'm making is that any de-duping will be made more
accurate by pre-processing the data to weed out irregularities in
the data entry, by regularizing the format of the data values.
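
As a rough illustration of that kind of pre-processing, here is a minimal Python sketch. The function names and the specific rules (digits-only phone numbers, dropping a leading US "1", lower-casing emails, collapsing whitespace in names) are assumptions chosen for the example, not anything specified in this thread; real data would need rules fitted to what the call centers actually typed.

```python
import re

def normalize_phone(raw: str) -> str:
    """Keep digits only; drop a leading '1' from 11-digit numbers.

    Assumes US-style numbers, which is an assumption for this sketch.
    """
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits

def normalize_email(raw: str) -> str:
    """Trim surrounding whitespace and lower-case the address."""
    return raw.strip().lower()

def normalize_name(raw: str) -> str:
    """Collapse runs of whitespace and apply a consistent case."""
    return " ".join(raw.split()).title()

if __name__ == "__main__":
    print(normalize_phone("(212) 123-4567"))
    print(normalize_email("  Bob@Example.COM "))
    print(normalize_name("  john   SMITH "))
```

Once every value is in one format, exact-match comparisons catch many duplicates before any fuzzy matching is needed.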

Similarity of phone numbers can also be defined as (a function
of) the number of common digits when the numbers are compared in
reverse, starting from the last digit.

Eh? Telephone numbers are unique once they are all in a common
format. What you've described sounds like you'd consider 212
123-4567 and 212 123-7654 as the same statistically, since they
have the same digits. The numbers 212 123-4567 and 212 124-4567
differ by only one number, but are less likely to be an
indication of a duplicate record than 212 123-4567 and 212
123-4568.

I meant the length of a common tail. That could be one of
the applied measures, along with many others that take phone
numbers into account.
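
One possible reading of "length of a common tail" is the number of trailing digits two numbers share. The Python sketch below is an illustration under that assumption; the function name is hypothetical and nothing here is taken from an actual implementation by either poster.

```python
def common_tail_length(a: str, b: str) -> int:
    """Count how many trailing digits two phone numbers have in common.

    Non-digit characters (spaces, dashes, parentheses) are ignored.
    """
    digits_a = [c for c in a if c.isdigit()]
    digits_b = [c for c in b if c.isdigit()]
    count = 0
    # Walk both numbers from the last digit backwards until they differ.
    for x, y in zip(reversed(digits_a), reversed(digits_b)):
        if x != y:
            break
        count += 1
    return count

if __name__ == "__main__":
    print(common_tail_length("212 123-4567", "123-4567"))
    print(common_tail_length("212 123-4567", "212 124-4567"))
    print(common_tail_length("212 123-4567", "212 123-4568"))
```

Under this reading, a number entered without its area code still matches on all seven trailing digits, while two numbers that differ only in the last digit score zero, so the measure is probably only useful in combination with others.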

"Common tail" is a statistical term (if I'm understanding it
correctly, like the long tail of a distribution), not a database
term. You'll have to explain how it applies to database operations.

So, I don't see much that can be gained by statistical analysis
of phone numbers.

The point is that several competing measures of similarity can be
weighted against each other. Simply pick measures that work,
whatever their "real" meaning may be.
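
A minimal Python sketch of weighting several measures against each other might look like the following. The field names, the three measures, and the weights are all hypothetical choices made for illustration; in practice the weights would be tuned by inspecting the results on real data.

```python
from difflib import SequenceMatcher

# Hypothetical weights, chosen only for illustration.
WEIGHTS = {"name": 0.4, "phone": 0.4, "email": 0.2}

def name_similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def phone_similarity(a: str, b: str) -> float:
    """Fraction of digits shared at the tail end of two phone numbers."""
    da = [c for c in a if c.isdigit()]
    db = [c for c in b if c.isdigit()]
    if not da or not db:
        return 0.0
    tail = 0
    for x, y in zip(reversed(da), reversed(db)):
        if x != y:
            break
        tail += 1
    return tail / max(len(da), len(db))

def email_similarity(a: str, b: str) -> float:
    """1.0 for the same non-empty address (ignoring case), else 0.0."""
    a, b = a.strip().lower(), b.strip().lower()
    return 1.0 if a and a == b else 0.0

def record_similarity(r1: dict, r2: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of the per-field measures, between 0 and 1."""
    scores = {
        "name": name_similarity(r1["name"], r2["name"]),
        "phone": phone_similarity(r1["phone"], r2["phone"]),
        "email": email_similarity(r1["email"], r2["email"]),
    }
    total = sum(weights[k] * scores[k] for k in weights)
    return total / sum(weights.values())

if __name__ == "__main__":
    a = {"name": "John Smith", "phone": "212 123-4567", "email": "js@example.com"}
    b = {"name": "Jon Smith", "phone": "(212) 123-4567", "email": "JS@example.com"}
    c = {"name": "Mary Jones", "phone": "646 555-0100", "email": "mj@example.com"}
    print(record_similarity(a, b), record_similarity(a, c))
```

Pairs scoring above some cutoff would be flagged as probable duplicates for review; the cutoff, like the weights, is something to settle empirically.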

I understand now. But it would help those of us who have no real
statistical background if, when you're posting in a database
newsgroup, you translate the jargon into terms that mean something.
This whole back and forth could have been avoided.

I don't know that the original poster has actually clarified whether
he really meant "fuzzy logic" as you're defining it, or if he meant
"fuzzy matches".

My address de-duping routines actually do use certain kinds of
measures of similarity between certain fields. But I've never done
any kind of real statistical evaluation of this, just an ad hoc
choice based on eyeballing the results.

--
David W. Fenton http://www.dfenton.com/
usenet at dfenton dot com http://www.dfenton.com/DFA/