Re: Algorithm/Theory help: Patterns, comparing, calculating 'distance' between?



The definition and computation distances between strings is a heavily
worked problem.

To get a handle on the literature, look for Levenshtein distance (check
my spelling, too) and for substring matching.

In a nutshell, most reasonable string edit distance measures involve a
normalization step (such as expansion of & to and and elimination of
repeated spaces for English text or conversion of a peptide sequence to
nucleotide sequence for genomic work).

Then you find the minimum cost sequence of edits that are required to
convert one string to another (or to a substring of another). To do
this, it is common to define a positive cost for each non-trivial edit
(deletion, transposition and so on). The cost may depend on the
content as well (so deleting a space after another space might cost
nearly nothing ... this could make the normalization redundant).

The minimum cost can then be used as a distance measure between the
strings. Moreover, if you define each edit as independent of all the
others in terms of computing cost, then you can use moderately
efficient algorithms (dynamic programming) to find the minimum cost and
the sequence of edits.

There is a natural inspiration for edit distance algorithms if you
think of random editing processes such as might occur in biological
evolution. The edit distance is just the negative logarithm of the sum
of the likelihood of the sequences of edits that get the same answer.

You can approximate the minimum edit cost form of distance in many
cases by looking at how many common substrings there are. The cost of
computing edit distance is M x N where M and N are the lengths of the
strings in question while the cost of most reasonable common substring
algorithm can be made M + N or even min(M, N) if one is much larger
than the other (as in genomics) or if the longer string is really a
great wad of shorter strings (as in comparing, say, names and
addresses).

Combining these two approaches is also possible.

Start with this. Most of the literature consists of adding frills to
this basic starting point.

Also, to your question about a generic starting point, I would
recommend making the cost of deletions and insertions the same and
forgetting about transpositions and combinations for the moment. You
may want to add transpositions later, possibly along with change (i.e.
delete + insert combo). Combinations larger than this are probably
best handled by the search algorithm.

[ comp.ai is moderated. To submit, just post and be patient, or if ]
[ that fails mail your article to <comp-ai@xxxxxxxxxxxxxxxxxx>, and ]
[ ask your news administrator to fix the problems with your system. ]
.



Relevant Pages

  • Re: Minimum levenshtein distance for a set of words
    ... > | - set of all words which have the minimum distance if finite ... > The algorithm in there that perhaps comes closest to doing ... > smallest *total* distance to the given strings. ... For example for a set A with two elements and cost values all the same ...
    (sci.math)
  • Re: Verizon Reduces Prices for Phone Service
    ... and the website still indicated it cost $6.95 per month until ... > add a very low rate long distance plan. ... I make a lot of local toll calls. ...
    (comp.dcom.telecom)
  • Re: Food snob?
    ... Kind of like Skype; you have to ... It only costs money if you aren't going computer to computer, but it is really cheap, and isn't a monthly cost. ... distance is and that's your DSL bill. ... still isn't much (double that if you've got a crt). ...
    (rec.food.cooking)
  • Re: A tale of two technologies
    ... The problem with the rest is that there are good alternatives that cost ... I use Lingo for my long ... overseas when I'm on the road, so my VZW phone is perfect for travel. ... summer they use their VZW minutes if they have to call long distance during ...
    (alt.cellular.verizon)
  • Re: Minimum levenshtein distance for a set of words
    ... | - tells me how many words have this distance ... The algorithm in there that perhaps comes closest to doing ... smallest *total* distance to the given strings. ... Then I suppose by adjusting the weights you could find the ...
    (sci.math)