Re: tips for writing regular expression interpreter/compiler?



Derek Haas wrote:
> I am trying to implement a small subset of the regular expression
> languages found in perl/python/et cetera (maybe just Kleene closure
> and alternation for now).

There should still be the demo software for RE->FA in the
comp.compilers archive. Along with each demo (each about 400-500 lines
of programming) are texts that provide a general description of the
algorithms used to make the conversions.

The dfa and nfa programs perform conversions as above. The rex program
is grep, souped up to handle general boolean combinations (A intersect
B, A - B) and interleave products (e.g. x^y^z = x(y^z) | y(z^x) |
z(x^y); x^y = xy | yx).

The simplest approach: just represent a regular expression as you would
any operator expression, (with the operators being A*, A+, A?, A|B, AB
and constants 0, 1 and S for character sets S); and perform operations
by doing algebraic computations on these expressions (that's how nfa,
dfa and rex work; rex does computations on a need-to-use basis while
reading through input, which is the same strategy most grep
implementations take).

What NFA does is reduce an expression to normal form:
0, 1, S normal form
A* -> 1 | A+ = normal form
A+ -> A A* -> ...
A? -> 1 | A = normal form
A | B = normal form
0A -> 0 = normal form
1A -> A -> ...
xA = normal form
A* B -> B | A+ B = normal form
A+ B -> A (A* B) -> ...
A? B -> B | A B = normal form
(A|B) C -> AC | BC = normal form
(AB) C -> A (BC) -> ...
Denoting the normal form of an expression A by NF(A), then when it
proceeds to do is compute sets by the following:
<0> = {}; <1> = {1}; <S> = {S 1}
<A | B> = <A> union <B>
<xA> = {xA}
<A> = <NF(A)>, if A is not in normal form.
In the process of finding these sets, repeats may occur (e.g. <1+> = <1
1*> = <1*> = <1 | 1+> = {1} union <1+>), which are tracked and absorbed
as they occur.

Each expression A produces a set <A> whose elements are either 1 or
other expressions of the form xB. This yields the rules (A final) <->
(1 is in <A>); (x: A -> B) <-> {xB} is in A).

The starting expression is equated to the start state; and the
processing is run on <S>, and then on any of the other expressions
spawned from this until the full automaton is determined.

For nfa, and dfa, the GREP subset syntax is not used for S; S may only
be a symbolic name. But they both support non-recursive definitions.
(If you allow recursive definitions, then you no longer had regular
expressions but context-free expressions).
.



Relevant Pages

  • Re: Fraudulent eBay listing
    ... O.K. I had not noticed that you were using plain grep, ... in egrep), * simply stands for "zero or more of the preceding ... Interpret PATTERN as an extended regular expression. ...
    (rec.crafts.metalworking)
  • Re: grep
    ... > Grep understands two different versions of regular expression syntax: ... > will output any line that contains cat OR dog OR bird or any combination. ... > Does anyone know how to construct a regular expression or in any way get ... > bird dog cat ...
    (Fedora)
  • Re: [Q] Need help with grep expression
    ... post in to the correct news group - 'grep' is unix/linux shell scripting ... 'grep and replace' (linux command similar to 'sed')... ... then you'd need a regular expression and something like the ...
    (comp.lang.php)
  • Re: Text Editor
    ... > I was wondering if there is a text editor equivalent to TextPad for ... See the man pages for grep, egrep, fgrep. ... I tend to just use egrep and the simple regular expressions. ... " Grep understands two different versions of regular expression ...
    (Fedora)
  • Re: Create DFA from regular expression?
    ... > necessary first to create the NFA? ... You can create it directly from the regular expression using "derivatives", ... A DFA that accepts R will have a transition on 'a' from the ... and it's much easier just to make the NFA first and convert it. ...
    (comp.theory)