RfD - Escaped Strings (long)



RfD - S\" and quoted strings with escapes
21 August 2006, Stephen Pelc

20060821 First draft

Rationale
=========

Problem
-------
The word S" 6.1.2165 is the primary word for generating strings.
In more complex applications, it suffers from several deficiencies:
1) the S" string can only contain printable characters,
2) the S" string cannot contain the '"' character,
3) the S" string cannot be used with wide characters as dicussed
in the Forth 200x internationalisation and XCHAR proposals.

Current practice
----------------
At least SwiftForth, gForth and VFX Forth support S\" with very
similar operations. S\" behaves like S", but uses the '\' character
as an escape character for the entry of characters that cannot be
used with S".

This technique is widespread in languages other than Forth.

It has benefit in areas such as
1) construction of multiline strings for display by operating
system services,
2) construction of HTTP headers,
3) generation of GSM modem control strings.

The majority of current Forth systems contain code, either in the
kernel or in application code, that assumes char=byte=au. To avoid
breaking existing code, we have to live with this practice.

Considerations
--------------
We are trying to integrate several issues:

1) no/least code breakage
2) minimal standards changes
3) variable width character sets
4) small system functionality

Item 1) is about the common char=byte=au assumption.
Item 2) includes the use of COUNT to step through memory and the
impact of char in the file word sets.
Item 3) has to rationalise a fixed width serial/comms channel
with 1..4 byte characters, e.g. UTF-8
Item 4) should enable 16 bit systems to handle UTF-8 and UTF-32.

The basis of my current approach is to use the terminology of
primitive characters and extended characters. A primitive character
(called a pchar here)is a fixed-width unit handled by EMIT and
friends. It corresponds to the current ANS definition of a
character. An extended character (called an xchar here) consists
of one or more primitive characters and represents the encoding
for a "display unit". A string is represented by caddr/len
in terms of primitive characters.

The consequences of this are:

1) No existing code is broken.
2) Most systems have only one keyboard and only one screen/display
unit, but may have several additional comms channels. The
impact of a keyboard driver having to convert Chinese or Russian
characters into a (say) UTF-8 sequence is minimal compared to
handling the key stroke sequences. Similarly on display.
3) Comms channels and files work as expected.
4) 16-bit embedded systems can handle all character widths as they
are described as strings.
5) No conflict arises with the XCHARs proposal.

Multiple encodings can be handled if they share a common primitive
character size - nearly all of these are described in terms of octets:
TCP/IP, UTF-8, UTF-16, UTF-32, ...

The XCHARs proposal can be used to handle extended characters on the
stack. XEMIT and friends allow us to handle some additional odd-ball
requirements such as 9-bit control characters, e.g. for the MDB
bus used by vending machines.

Solution
--------
To ease discussion we refer to character handled by C@, C! and
friends as "primitive characters" or pchars. Characters that may
be wider than a pchar are called "extended characters" or xchars.
These are compatible with the XCHARs proposal. This proposal
does note requires systems to handle xchars, but does not
disenfranchise those that do.

S\" is used like S" but treats the '\' character specially. One
or more characters after the '\' indicate what is substituded.
The following list is what is currently available in the Forth
systems surveyed.

\a BEL (alert, ASCII 7)
\b BS (backspace, ASCII 8)
\e ESC (not in C99, ASCII 27)
\f FF (form feed, ASCII 12)
\l LF (ASCII 10)
\m CR/LF pair (ASCII 13, 10) - for HTML etc.
\n newline - CRLF for Windows/DOS, LF for Unices
\q double-quote (ASCII 34)
\r CR (ASCII 13)
\t HT (tab, ASCII 9)
\v VT (ASCII 11)
\z NUL (ASCII 0)
\" "
\[0-7]+ Octal numerical character value, finishes at the
first non-octal character
\x[0-9a-f]+ Hex numerical character value, finishes at the
first non-hex character
\\ backslash itself
\ before any other character represents that character

The following three of these cause parsing and readability
problems. As far as I know, requiring characters to come in
8 bit units will not upset any systems. Systems with characters
less than 7 bits are non-compliant, and I know of no 7 bit CPUs.
Al current systems use character units of 8 bits or more.

\[0-7]+ Octal numerical character value, finishes at the
first non-octal character
\x[0-9a-f]+ Hex numerical character value, finishes at the
first non-hex character

Why do we need two representations, both of variable length?
This proposal selects the hexadecimal representation, requiring
two hex digits. A consequence of this is that xchars must be
represented as a sequence of pchars. Although initially seen as a
problem by some people, it avoids the endian problems involved
in storing an xchar.

\ before any other character represents that character

This is an unnecessary general case, and so is not mandated.


Proposal
========

6.2.xxxx S\"
s-slash-quote CORE EXT

Interpretation:
Interpretation semantics for this word are undefined.

Compilation: ( "ccc<quote>" -- )
Parse ccc delimited by " (double-quote), using the translation
rules below. Append the run-time semantics given below to the
current definition.

Translation rules:
Characters are processed one at a time and appended to the
compild string. If the character is a '\' character it is
processed by parsing and substituting one or more characters
as follows:
\a BEL (alert, ASCII 7)
\b BS (backspace, ASCII 8)
\e ESC (not in C99, ASCII 27)
\f FF (form feed, ASCII 12)
\l LF (ASCII 10)
\m CR/LF pair (ASCII 13, 10)
\n implementation dependent newline, e.g. CR/LF, LF, or LF/CR.
\q double-quote (ASCII 34)
\r CR (ASCII 13)
\t HT (tab, ASCII 9)
\v VT (ASCII 11)
\z NUL (ASCII 0)
\" "
\xAB A and B are Hexadecimal numerical characters. The resulting
character is the conversion of these two characters.
\\ backslash itself

Run-time: ( -- c-addr u )
Return c-addr and u describing a string consisting of the translation
of the characters ccc. A program shall not alter the returned string.

See: 3.4.1 Parsing, 6.2.0855 C" , 11.6.1.2165 S" , A.6.1.2165 S"

Labelling
=========
ENVIRONMENT? impact
name stack conditions

Ambiguous conditions occur:
If a hex value is more than two characters
If \x is not followed by by two hexadecimal characters


Reference Implementation
========================
(as yet untested)
Taken from the VFX Forth source tree and modified to remove most
implementation dependencies. Assumes the use of the # and $ numeric
prefices to indicate decimal and hexadecimal respectively.

decimal

: PLACE \ c-addr1 u c-addr2 --
\ *G Copy the string described by c-addr1 u to a counted string at
\ ** the memory address described by c-addr2.
2dup 2>r \ write count last
1 chars + swap move
2r> c! \ to avoid in-place problems
;

: $, \ caddr len --
\ *G Lay the string into the dictionary at *\fo{HERE}, reserve
\ ** space for it and *\fo{ALIGN} the dictionary.
dup >r
here place
r> 1 chars + allot
align
;

: addchar \ char string --
\ *G Add the character to the end of the counted string.
tuck count + c!
1 swap c+!
;

: append \ c-addr u $dest --
\ *G Add the string described by C-ADDR U to the counted string at
\ ** $DEST. The strings must not overlap.
>r
tuck r@ count + swap cmove \ add source to end
r> c+! \ add length to count
;

: extract2H \ caddr len -- caddr' len' u
\ *G Extract a two-digit hex number in the given base from the
\ ** start of the* string, returning the remaining string
\ ** and the converted number.
base @ >r hex
0 0 2over >number 2drop drop
>r 2 chars /string r>
r> base !
;

create EscapeTable \ -- addr
\ *G Table of translations for \a..\z.
7 c, \ \a
8 c, \ \b
char c c, \ \c
char d c, \ \d
#27 c, \ \e
#12 c, \ \f
char g c, \ \g
char h c, \ \h
char i c, \ \i
char j c, \ \j
char k c, \ \k
#10 c, \ \l
char m c, \ \m
#10 c, \ \n (Unices only)
char o c, \ \o
char p c, \ \p
char " c, \ \q
#13 c, \ \r
char s c, \ \s
9 c, \ \t
char u c, \ \u
#11 c, \ \v
char w c, \ \w
char x c, \ \x
char y c, \ \y
0 c, \ \z

l: CRLF$ \ -- addr ; CR/LF as counted string
2 c, #13 c, #10 c,

internal
: addEscape \ caddr len dest -- caddr' len'
\ *G Add an escape sequence to the counted string at dest,
\ ** returning the remaining string.
over 0= \ zero length check
if drop exit endif
>r \ -- caddr len ; R: -- dest
over c@ [char] x = if \ hex number?
1 chars /string extract2H r> addchar exit
endif
over c@ [char] m = if \ CR/LF pair?
1 chars /string #13 r@ addchar #10 r> addchar exit
endif
over c@ [char] n = if \ CR/LF pair?
1 chars /string crlf$ count r> append exit
endif
over c@ [char] a [char] z within? if
over c@ [char] a - EscapeTable + c@ r> addchar
else
over c@ r> addchar
endif
1 chars /string
;
external

: parse\" \ caddr len dest -- caddr' len'
\ *G Parses a string up to an unescaped '"', translating '\'
\ ** escapes to characters much as C does. The
\ ** translated string is a counted string at *\i{dest}
\ ** The supported escapes (case sensitive) are:
\ *D \a BEL (alert)
\ *D \b BS (backspace)
\ *D \e ESC (not in C99)
\ *D \f FF (form feed)
\ *D \l LF (ASCII 10)
\ *D \m CR/LF pair - for HTML etc.
\ *D \n newline - CRLF for Windows/DOS, LF for Unices
\ *D \q double-quote
\ *D \r CR (ASCII 13)
\ *D \t HT (tab)
\ *D \v VT
\ *D \z NUL (ASCII 0)
\ *D \" "
\ *D \xAB Two char Hex numerical character value
\ *D \\ backslash itself
\ *D \ before any other character represents that character
dup >r 0 swap c! \ zero destination
begin \ -- caddr len ; R: -- dest
dup
while
over c@ [char] " <> \ check for terminator
while
over c@ [char] \ = if \ deal with escapes
1 /string r@ addEscape
else \ normal character
over c@ r@ addchar 1 /string
endif
repeat then
dup \ step over terminating "
if 1 /string endif
r> drop
;

: readEscaped \ "string" -- caddr
\ *G Parses an escaped string from the input stream according to
\ ** the rules of *\fo{parse\"} above, returning the address
\ ** of the translated counted string in *\fo{PAD}.
source >in @ /string tuck \ -- len caddr len
pad parse\" nip
- >in +!
pad
;

: S\" \ "string" -- caddr u
\ *G As *\fo{S"}, but translates escaped characters using
\ ** *\fo{parse\"} above.
readEscaped count state @ if
compile (s") $,
then
; IMMEDIATE


Test Cases
==========
Forth source to test the reference implementation.


--
Stephen Pelc, stephenXXX@xxxxxxxxxxxx
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads
.



Relevant Pages

  • [TOMOYO #15 3/8] Common functions for TOMOYO Linux.
    ... This file contains common functions (e.g. policy I/O, pattern matching). ... Since TOMOYO Linux is a name based access control, ... TOMOYO Linux's string manipulation functions make reviewers feel crazy, ... the Linux kernel accepts all characters but NUL character ...
    (Linux-Kernel)
  • RfD: Escaped Strings
    ... the S" string can only contain printable characters, ... the S" string cannot contain the '"' character, ... \b BS (backspace, ASCII 8) ... chars + swap move ...
    (comp.lang.forth)
  • RfD: Escaped Strings version 4
    ... the S" string can only contain printable characters, ... the S" string cannot contain the '"' character, ... as an escape character for the entry of characters that cannot be ... \b BS (backspace, ASCII 8) ...
    (comp.lang.forth)
  • RfD: Escaped Strings version 4
    ... the S" string can only contain printable characters, ... the S" string cannot contain the '"' character, ... as an escape character for the entry of characters that cannot be ... \b BS (backspace, ASCII 8) ...
    (comp.lang.forth)
  • Re: RfD: Escaped Strings
    ... the S" string can only contain printable characters, ... the S" string cannot contain the '"' character, ... \b BS (backspace, ASCII 8) ... \ ** escapes to characters much as C does. ...
    (comp.lang.forth)