Re: Algorithm/Theory help: Patterns, comparing, calculating 'distance' between?
- From: "Ted Dunning" <ted.dunning@xxxxxxxxx>
- Date: Tue, 10 Jan 2006 00:34:47 GMT
The definition and computation distances between strings is a heavily
worked problem.
To get a handle on the literature, look for Levenshtein distance (check
my spelling, too) and for substring matching.
In a nutshell, most reasonable string edit distance measures involve a
normalization step (such as expansion of & to and and elimination of
repeated spaces for English text or conversion of a peptide sequence to
nucleotide sequence for genomic work).
Then you find the minimum cost sequence of edits that are required to
convert one string to another (or to a substring of another). To do
this, it is common to define a positive cost for each non-trivial edit
(deletion, transposition and so on). The cost may depend on the
content as well (so deleting a space after another space might cost
nearly nothing ... this could make the normalization redundant).
The minimum cost can then be used as a distance measure between the
strings. Moreover, if you define each edit as independent of all the
others in terms of computing cost, then you can use moderately
efficient algorithms (dynamic programming) to find the minimum cost and
the sequence of edits.
There is a natural inspiration for edit distance algorithms if you
think of random editing processes such as might occur in biological
evolution. The edit distance is just the negative logarithm of the sum
of the likelihood of the sequences of edits that get the same answer.
You can approximate the minimum edit cost form of distance in many
cases by looking at how many common substrings there are. The cost of
computing edit distance is M x N where M and N are the lengths of the
strings in question while the cost of most reasonable common substring
algorithm can be made M + N or even min(M, N) if one is much larger
than the other (as in genomics) or if the longer string is really a
great wad of shorter strings (as in comparing, say, names and
addresses).
Combining these two approaches is also possible.
Start with this. Most of the literature consists of adding frills to
this basic starting point.
Also, to your question about a generic starting point, I would
recommend making the cost of deletions and insertions the same and
forgetting about transpositions and combinations for the moment. You
may want to add transpositions later, possibly along with change (i.e.
delete + insert combo). Combinations larger than this are probably
best handled by the search algorithm.
[ comp.ai is moderated. To submit, just post and be patient, or if ]
[ that fails mail your article to <comp-ai@xxxxxxxxxxxxxxxxxx>, and ]
[ ask your news administrator to fix the problems with your system. ]
.
- References:
- Prev by Date: Re: Algorithm/Theory help: Patterns, comparing, calculating 'distance' between?
- Next by Date: CFP: TIME 2006 Temporal Representation and Reasoning Symposium
- Previous by thread: Re: Algorithm/Theory help: Patterns, comparing, calculating 'distance' between?
- Next by thread: Re: Algorithm/Theory help: Patterns, comparing, calculating 'distance' between?
- Index(es):
Relevant Pages
|