Re: Open source fuzzy project
- From: "Bruno Di Stefano" <Bruno.DiStefano@xxxxxxxxx>
- Date: 11 Jun 2006 11:39:04 -0700
Lee,
congratulations & thank you for dedicating your energies to
this problem. There is a need for more such initiatives.
Looking at your wiki, http://tickett.net/dedupe/index.php/Main_Page,
it seems to me that you need to split the problem into smaller, more
manageable, chunks and get help from others or re-use what others
have done.
In http://tickett.net/dedupe/index.php/Main_Page, you wrote:
+++++++++++++++++++++++
The project will be looking at data (the intention is to begin looking
at customer name/address data but this may widen over time) and ways
to intelligently detect duplicates using fuzzy matching methods and
algorithms.
+++++++++++++++++++++++
A bit further down, looking at
http://tickett.net/dedupe/index.php/Dirty_data,
it's clear that your application domain is duplication of entries
in mailing lists. One of the problems is that mailing lists are sold
and many organizations end up having the same individual's entry,
but written in a different way, as it appeared in each original mailing
list.
In a multicultural society like Canada, where I live, there are major
problems with common misspellings. I receive mail with my own
name spelt as follows:
- Bruno N. Di Stefano (correct with middle initial)
- Bruno Di Stefano (correct without middle initial)
- Bruno Distefano (incorrect)
- Bruno Di Stephano (incorrect)
- Bruno Distephano (incorrect)
- Bruno De Stefano (incorrect)
- Bruno De Stephano (incorrect)
- Bruno Destefano (incorrect)
- Bruno Destephano (incorrect)
- etc
I will spare you further erroneous variations, but I point out that the
most off the mark is probably Dastaphano (it beats me how it came up).
There are also variations where the "Di" is interpreted as a middle
name and the last name becomes "Stefano" or "Stephano" (this has
created problems at conferences where I was pre-registered
and where there were different line-ups to pick up badges and
other material).
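As a rough illustration of how such spelling variants could be scored,
here is a minimal sketch in Python. It assumes Levenshtein (edit)
distance on a normalized form of the name; the post does not name a
specific technique, so that choice, and the helper names
normalize_name, levenshtein and name_distance, are assumptions made
only for this example.

```python
import re


def normalize_name(name):
    """Lowercase the name and drop everything except the letters a-z."""
    return re.sub(r"[^a-z]", "", name.lower())


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def name_distance(x, y):
    """Edit distance between two names after normalization."""
    return levenshtein(normalize_name(x), normalize_name(y))


if __name__ == "__main__":
    reference = "Bruno Di Stefano"
    for variant in ["Bruno Distefano", "Bruno De Stefano",
                    "Bruno Di Stephano", "Bruno Destephano",
                    "Bruno Dastaphano"]:
        print(variant, name_distance(reference, variant))
```

A small distance only suggests a possible duplicate. Two different
people can have identical or nearly identical names, so any threshold
would have to be combined with other fields.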
Then there are different ways of writing the same address, e.g.
- 99 Fuzzy Logic Street (suite 1234)
- 99 Fuzzy Logic Street (# 1234)
- 99 Fuzzy Logic Street (unit 1234)
- 99 Fuzzy Logic Street (apt. 1234) ("apt" in an office building?)
- 1234-99 Fuzzy Logic Street
And of course there are those who write the unit number on a
separate line and those who drop the word "street".
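The address variants above could be reduced to one form along these
lines. This is only a sketch in Python: the regular expressions cover
just the spellings listed in this post, and the function name
normalize_address and the (unit, number, street) tuple layout are
assumptions for illustration, not part of the dedupe project.

```python
import re

# Hypothetical patterns; they cover only the variants listed in the post.
_UNIT_SUFFIX = re.compile(
    r"\(\s*(?:suite|unit|apt\.?|#)\s*(\d+)\s*\)\s*$", re.IGNORECASE)
_UNIT_PREFIX = re.compile(r"^(\d+)\s*-\s*(\d+)\s+(.*)$")
_CIVIC = re.compile(r"^(\d+)\s+(.*)$")
_STREET_WORD = re.compile(r"\s+(?:street|st\.?)$", re.IGNORECASE)


def normalize_address(raw):
    """Return (unit, civic_number, street_name) for one address line."""
    text = " ".join(raw.split())          # collapse runs of whitespace
    unit = None
    m = _UNIT_SUFFIX.search(text)
    if m:                                 # "... (suite 1234)" style
        unit = m.group(1)
        text = text[:m.start()].strip()
    else:
        m = _UNIT_PREFIX.match(text)
        if m:                             # "1234-99 ..." style
            unit = m.group(1)
            text = f"{m.group(2)} {m.group(3)}"
    m = _CIVIC.match(text)
    if not m:                             # no leading civic number found
        return (unit, None, text.lower())
    number, street = m.group(1), m.group(2)
    street = _STREET_WORD.sub("", street).lower()  # drop optional "street"
    return (unit, number, street)


if __name__ == "__main__":
    for line in ["99 Fuzzy Logic Street (suite 1234)",
                 "99 Fuzzy Logic Street (# 1234)",
                 "1234-99 Fuzzy Logic Street"]:
        print(normalize_address(line))
```

The sketch does not handle a unit number written on a separate line
without parentheses; that case would need the neighbouring line as
extra input.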
In the USA & Canada, postal codes are rather narrowly defined
geographically; that is, few people live in each postal area.
Duplicates are difficult. However, given the large number of people
of Korean and Chinese origin in the high tech industry, can you imagine
how many people have the last name "kim", "lee", "chan", "chung",
and "lam" in the same company, at the same address, and with the
same postal code? In one company where I worked, at one point
I counted 14 software designers whose last name was "Chan"
and some of them had similar initials.
All of the above is not to say that the task is impossible, but
that one probably needs to create some sort of metadata
format to convert from several possible formats to a standard
one. Of course, one needs also to have a "localisation" variable
to keep in mind linguistic issues. "Rue De La Liberte, 15" (written
with or without accents) may become "15 Freedom Road" in a
bilingual border area within a bilingual country.
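One possible shape for such a standard format, with a localisation
field, is sketched below in Python. The CanonicalAddress record, the
fold_accents and to_canonical helpers, and the two locale rules ("fr":
number after a trailing comma, otherwise number first) are all
assumptions made for this example. Translating a street name such as
"Rue De La Liberte" into "Freedom Road" is deliberately left out,
because it would need a locale-specific lookup table.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalAddress:
    number: str   # civic number as text
    street: str   # lowercased, accent-folded street name
    locale: str   # hypothetical localisation variable, e.g. "en" or "fr"


def fold_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_canonical(raw, locale):
    """Convert one raw address line into the canonical record."""
    text = fold_accents(" ".join(raw.split())).lower()
    if locale == "fr" and "," in text:
        # Assumed French-style layout: "<street>, <number>"
        street, _, number = text.rpartition(",")
        return CanonicalAddress(number.strip(), street.strip(), locale)
    # Assumed default layout: "<number> <street>"
    number, _, street = text.partition(" ")
    return CanonicalAddress(number, street, locale)


if __name__ == "__main__":
    print(to_canonical("Rue De La Libert\u00e9, 15", "fr"))
    print(to_canonical("15 Freedom Road", "en"))
```

With this, the accented and unaccented spellings of the French address
map to the same record, while the English rendering stays a separate
record until a translation step links the two.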
I will write in a separate posting about the methodology
(fuzzy vs crisp) issue (see posting of Dmitry Kazakov at
the end of this posting).
Best regards
Bruno Di Stefano
-- Bruno Di Stefano
----------------------------------------------------------
Bus.: nup...@xxxxxxxxxxxx www3.sympatico.ca/nuptek
IEEE: b.distef...@xxxxxxxx Bruno.DiStef...@xxxxxxxxx
http://bruno.distefano.googlepages.com/home
--------------------------------------------------------------------------------------
On 07/06/06, ltick...@xxxxxxxxx <ltick...@xxxxxxxxx> wrote:
Afternoon, Morning, Evening to you all,
I posted this message recently in comp.ai.fuzzy but it has been
suggested this may be more appropriate a place for it...
I have recently begun an open source project "dedupe" on sourceforge (
http://sourceforge.net/projects/dedupe ) and have setup a wiki (
http://dedupe.sourceforge.net ) to collate information and discuss the
project progress, direction etc!
If anyone is interested (just to look) or feels they can contribute
please get in touch!
Thanks
L
On 7 Jun 2006 15:09:26 -0700, ltick...@xxxxxxxxx wrote:
> I wonder if there is a distinct difference between fuzzy logic
> (math/truth/logic) and fuzzy matching (string based/ai)?
I'd say that the latter should be based on the former. But I never used
a fuzzy approach to string matching. In the crisp one, a pattern either
matches the string or not. Any pattern is equivalent to some set of
strings, usually represented as a cyclic graph. Each path in the graph
gives a matched string. One obvious fuzzy extension could be to make
the graph fuzzy. For example, an alternation operator P1 | P2 could be
extended by hanging possibilities/necessities on the operands. Atomic
patterns could yield possibilities/necessities of match. The problem
with this is that one should track all alternatives (so all paths of
the graph). Fuzzy graphs are min/max (min along the path, max across
paths) or max/min. They are a lot easier to navigate than probabilistic
ones (based on */+). There are some powerful heuristics to cut off many
less promising paths, but still there is an evident danger of
geometrical explosion.
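[Illustration added to the quoted text] The min/max rule described
above (min along a path, max across paths) can be sketched in Python as
follows. The graph encoding (a dict mapping each node to a list of
(next_node, possibility) pairs), the node names, and the function name
best_match_degree are assumptions made for this example; the single
cut-off shown is only the simplest one and stands in for the stronger
heuristics mentioned above.

```python
def best_match_degree(graph, node, goal, seen=frozenset()):
    """Return the max over all paths of the min possibility on the path."""
    if node == goal:
        return 1.0
    visited = seen | {node}        # nodes on the current path (no revisits)
    best = 0.0
    for nxt, degree in graph.get(node, []):
        if nxt in visited:
            continue               # do not loop inside a cyclic graph
        if degree <= best:
            continue               # this edge alone cannot beat the best path
        sub = best_match_degree(graph, nxt, goal, visited)
        best = max(best, min(degree, sub))
    return best


if __name__ == "__main__":
    # Hypothetical pattern P1 | P2 with possibilities on each step.
    graph = {
        "start": [("p1", 0.9), ("p2", 0.6)],
        "p1": [("end", 0.4)],
        "p2": [("end", 0.7)],
    }
    print(best_match_degree(graph, "start", "end"))
```

In the example, the path through p1 scores min(0.9, 0.4) and the path
through p2 scores min(0.6, 0.7), so the overall degree is the larger of
the two. The search still visits every path that survives the cut-off,
which is where the explosion mentioned above comes from.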
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
ltickett@xxxxxxxxx wrote:
Afternoon, Morning, Evening to you all,
I've just stumbled upon this group and begun reading and have found
some really interesting info/discussions!
I have recently begun an open source project "dedupe" on sourceforge (
http://sourceforge.net/projects/dedupe ) and have setup a wiki (
http://dedupe.sourceforge.net ) to collate information and discuss the
project progress, direction etc!
If anyone is interested (just to look) or feels they can contribute
please get in touch!
Thanks
L
ltickett@xxxxxxxxx