Re: RfD: XCHAR wordset (for UTF-8 and alike)
- From: anton@xxxxxxxxxxxxxxxxxxxxxxxxxx (Anton Ertl)
- Date: Tue, 27 Sep 2005 16:09:09 GMT
Bernd Paysan <bernd.paysan@xxxxxx> writes:
>Problem:
>
>ASCII is only appropriate for the English language. Most western languages
>however fit somewhat into the Forth frame, since a byte is sufficient to
>encode the few special characters in each (though not always the same
>encoding can be used; latin-1 is most widely used, though).
Actually Unicode (in its UCS-4/UTF-32 encoding) would also fit in the
ANS Forth frame. However, most near-ANS code around has an
environmental dependency on 1 chars = 1 au, and I think that more
existing programs work with a system that uses 1-au chars and xchars
(even when processing wider xchars) than with a system that uses n-au
chars (n>1).
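
The dependency mentioned above can be made explicit at load time. A minimal sketch in standard Forth; the word name CHECK-CHAR-SIZE is made up for illustration and is not part of any proposal:

```forth
\ Refuse to continue on a system where a char is wider than
\ one address unit. CHECK-CHAR-SIZE is a hypothetical name.
: check-char-size ( -- )
  1 chars 1 <>
  abort" this program assumes 1 CHARS = 1 address unit" ;

check-char-size
```

A program that starts with such a check documents its environmental dependency instead of failing silently on an n-au-char system.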
> Since ANS Forth specifies ASCII encoding, only
>ASCII-compatible encodings may be used.
That sounds like a requirement, so it should be part of the
proposal, not the problem description.
The on-stack representation of ASCII characters should certainly be
ASCII. For the in-memory representation that would also have some
advantages: in particular, programs that access individual characters
using char (not xchar) words would work correctly on strings
consisting only of ASCII characters (and ANS Forth does not give any
guarantee for other characters anyway).
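
To illustrate, here is a rough sketch that counts the xchars in a string by stepping with XCHAR+ from the proposal. It assumes a system that provides XCHAR+ and a string that contains only complete xchars; the name X-COUNT is hypothetical. For a string consisting only of ASCII characters the result equals the length in address units, which is why char-based code keeps working on such strings:

```forth
\ Count the xchars in the string xc-addr u.
\ Assumes XCHAR+ is available and the string holds only
\ complete xchars. X-COUNT is a hypothetical name.
: x-count ( xc-addr u -- n )
  over + >r            \ R: end address of the string
  0 swap               ( n addr )
  begin
    dup r@ u<          \ still before the end?
  while
    xchar+             \ step over one xchar
    swap 1+ swap       \ and count it
  repeat
  drop r> drop ;

s" abc" x-count .      \ ASCII only: same as the length in au
```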
>Proposal
I would have waited for some more time (and experience) before making
such a proposal (I am still unsure which words to include and which
not). But since you made it, let's collect the feedback.
>Words:
>
>XC-SIZE ( xc -- u )
>Computes the memory size of the XCHAR xc in address units.
>
>XC@+ ( xc_addr1 -- xc_addr2 xc )
>Fetches the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
>location after xc.
>
>XC!+ ( xc xc_addr1 -- xc_addr2 )
>Stores the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
>location after xc.
This is unsafe, as it writes an unknown amount of data starting at
xc_addr1. One can use it safely in combination with XC-SIZE, but then
it is easier to use XC!+? (see below).
Providing this word but not XC!+? discourages safe programming
practices and encourages buffer overflows.
In other words, this might become Forth's strcat().
It's probably best not to standardize this word.
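
For comparison, a sketch of how a bounds-checked store could be layered on XC-SIZE and XC!+ as described in the proposal. It assumes a system that provides both words; the name SAFE-XC!+ is hypothetical, and this is not meant to be how any particular system implements XC!+?:

```forth
\ Store xc into the buffer xc-addr1 u1 only if it fits.
\ On success, xc-addr2 u2 is the rest of the buffer and flag is true.
\ On failure, nothing is written, the buffer is returned unchanged
\ and flag is false. Assumes XC-SIZE and XC!+ are available.
: safe-xc!+ ( xc xc-addr1 u1 -- xc-addr2 u2 flag )
  2 pick xc-size       ( xc addr u1 size )
  2dup u< if           \ u1 < size: not enough room
    drop rot drop      ( addr u1 )
    false exit
  then
  -                    ( xc addr u2 )
  >r xc!+ r>           ( addr2 u2 )
  true ;
```

The caller has to pass the remaining buffer size anyway, so the check costs one XC-SIZE call and removes the opportunity to forget it.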
>XCHAR+ ( xc_addr1 -- xc_addr2 )
>Adds the size of the XCHAR stored at xc_addr1 to this address, giving
>xc_addr2.
>
>XCHAR- ( xc_addr1 -- xc_addr2 )
>Goes backward from xc_addr1 until it finds an XCHAR so that the size of this
>XCHAR added to xc_addr2 gives xc_addr1. Note: XCHAR- isn't guaranteed to
>work for every possible encoding.
>
>X-SIZE ( xc_addr u -- n )
>n is the number of monospace ASCII characters that take the same space to
>display as the XCHAR string starting at xc_addr, using u address units.
Maybe another name would be harder to confuse with XC-SIZE. How about
X-WIDTH or XC-WIDTH?
>XKEY ( -- xc )
>Reads an XCHAR from the terminal.
>
>XEMIT ( xc -- )
>Prints an XCHAR on the terminal.
Currently Gforth also implements:
+X/STRING ( xc-addr1 u1 -- xc-addr2 u2 )
like 1 /STRING
-X/STRING ( xc-addr1 u1 -- xc-addr2 u2 )
like -1 /STRING
XC@ ( xc-addr -- xc )
like C@
DEFER XC!+? ( xc xc-addr1 u1 -- xc-addr2 u2 f )
safe version of XC!+, f specifies success
-TRAILING-GARBAGE ( addr u1 -- addr u2 )
remove trailing incomplete xc
Of course, some of these can be defined in terms of others, but it's
not clear to me yet which subset we want to select.
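
As an illustration of "defined from others", here is one possible layering on XC@+, XCHAR+ and XCHAR- from the proposal. The MY- names are hypothetical and chosen to avoid clashing with a system's own definitions; the sketch assumes the strings contain only complete xchars and that XCHAR- works for the encoding in use:

```forth
\ XC@ from XC@+: fetch, then discard the updated address.
: my-xc@ ( xc-addr -- xc )
  xc@+ nip ;

\ +X/STRING from XCHAR+: skip the first xchar of a non-empty string.
: my+x/string ( xc-addr1 u1 -- xc-addr2 u2 )
  over dup xchar+ swap -   ( addr u1 size )
  /string ;

\ -X/STRING from XCHAR-: step back over the xchar before xc-addr1.
: my-x/string ( xc-addr1 u1 -- xc-addr2 u2 )
  over dup xchar- swap -   ( addr u1 -size )
  /string ;
```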
>The following words behave different when the XCHAR extension is present:
That is actually a compatible extension of ANS Forth's CHAR and
[CHAR]; for ASCII characters they behave exactly the same, and for
others ANS Forth does not specify a behaviour. So I would not say
"behave different", but use wording such as "extend the semantics of
...."
>Open issues are file reading and writing (conversion on the fly or leave as
>it is?).
Definitely conversion on the fly. There must be only one character
encoding in memory. However, we have not implemented that yet.
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.complang.tuwien.ac.at/forth/ansforth/forth200x.html
EuroForth 2005: http://www.complang.tuwien.ac.at/anton/euroforth2005/