Re: RfD: XCHAR wordset (for UTF-8 and alike)



Bernd Paysan <bernd.paysan@xxxxxx> writes:
>Problem:
>
>ASCII is only appropriate for the English language. Most western languages
>however fit somewhat into the Forth frame, since a byte is sufficient to
>encode the few special characters in each (though not always the same
>encoding can be used; latin-1 is most widely used, though).

Actually Unicode (in its UCS-4/UTF-32 encoding) would also fit in the
ANS Forth frame. However, most near-ANS code around has an
environmental dependency on 1 chars = 1 au, and I think that more
existing programs work with a system that uses 1-au chars and xchars
(even when processing wider xchars) than with a system that uses n-au
chars (n>1).

> Since ANS Forth specifies ASCII encoding, only
>ASCII-compatible encodings may be used.

That sounds like a requirement and should therefore be part of the
proposal, not the problem description.

The on-stack representation of ASCII characters should certainly be
ASCII. For the in-memory representation that would also have some
advantages: in particular, programs that access individual characters
using char (not xchar) words would work correctly on strings
consisting only of ASCII characters (and ANS Forth does not give any
guarantee for other characters anyway).

>Proposal

I would have waited for some more time (and experience) before making
such a proposal (I am still unsure which words to include and which
not). But since you made it, let's collect the feedback.

>Words:
>
>XC-SIZE ( xc -- u )
>Computes the memory size of the XCHAR xc in address units.
>
>XC@+ ( xc_addr1 -- xc_addr2 xc )
>Fetches the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
>location after xc.
>
>XC!+ ( xc xc_addr1 -- xc_addr2 )
>Stores the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
>location after xc.

This is unsafe: starting at xc_addr1, it writes a number of address
units that depends on xc, without knowing how much space is left in
the buffer. One can use it safely in combination with XC-SIZE, but
then it is easier to use XC!+? (see below).

Providing this word, but not XC!+?, discourages safe programming
practices and encourages buffer overflows.

In other words, this might become Forth's strcat().

It's probably best not to standardize this word.
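
To make the "safe in combination with XC-SIZE" case concrete, here is
a minimal sketch of a bounds-checked store. CHECKED-XC!+ is a
hypothetical name chosen for illustration; the sketch assumes that the
system already provides XC-SIZE and XC!+ with the stack effects quoted
above, and it is not meant as the definitive implementation of XC!+?.

```forth
\ Sketch only: CHECKED-XC!+ is a hypothetical name. XC-SIZE and XC!+ are
\ assumed to be provided by the system with the stack effects quoted above.
: checked-xc!+ ( xc xc-addr1 u1 -- xc-addr2 u2 f )
    >r over xc-size          ( xc xc-addr1 size ) ( R: u1 )
    dup r@ u> if             \ xc needs more au than the u1 that remain
        drop nip r> false    \ nothing written: xc-addr1 u1 false
    else
        >r xc!+              ( xc-addr2 ) ( R: u1 size )
        r> r> swap - true    \ xc-addr2 u1-size true
    then ;

create buf 4 allot
char A buf 4 checked-xc!+    \ an ASCII xc takes 1 au: buf+1 3 true
drop 2drop                   \ discard the three results
```

The point of the design is that the remaining buffer length travels
with the address, so the caller cannot forget the check.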

>XCHAR+ ( xc_addr1 -- xc_addr2 )
>Adds the size of the XCHAR stored at xc_addr1 to this address, giving
>xc_addr2.
>
>XCHAR- ( xc_addr1 -- xc_addr2 )
>Goes backward from xc_addr1 until it finds an XCHAR so that the size of this
>XCHAR added to xc_addr2 gives xc_addr1. Note: XCHAR- isn't guaranteed to
>work for every possible encoding.
>
>X-SIZE ( xc_addr u -- n )
>n is the number of monospace ASCII characters that take the same space to
>display as the XCHAR string starting at xc_addr, using u address units.

Maybe another name would be harder to confuse with XC-SIZE. How about
X-WIDTH or XC-WIDTH?
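
On the note that XCHAR- is not guaranteed to work for every encoding:
for UTF-8 it does work, because continuation bytes (bit pattern
10xxxxxx) can be told apart from the first byte of a character. A
minimal sketch, assuming UTF-8, 1 chars = 1 au, and well-formed input;
U8-BACK is a hypothetical name used only for illustration:

```forth
\ Sketch only: UTF-8 specific, assumes 1 chars = 1 au and well-formed input.
\ U8-BACK is a hypothetical name; $ marks hexadecimal numbers.
: u8-back ( xc-addr1 -- xc-addr2 )
    begin
        1-                        \ step back one byte
        dup c@ $C0 and $80 <>     \ stop unless it is a continuation byte
    until ;
```

An encoding in which a trailing byte can also be a valid leading byte
does not allow this kind of backward scan, which is presumably what the
note is about.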

>XKEY ( -- xc )
>Reads an XCHAR from the terminal.
>
>XEMIT ( xc -- )
>Prints an XCHAR on the terminal.

Currently Gforth also implements:

+X/STRING ( xc-addr1 u1 -- xc-addr2 u2 )
like 1 /STRING

-X/STRING ( xc-addr1 u1 -- xc-addr2 u2 )
like -1 /STRING

XC@ ( xc-addr -- xc )
like C@

DEFER XC!+? ( xc xc-addr1 u1 -- xc-addr2 u2 f )
safe version of XC!+, f specifies success

-TRAILING-GARBAGE ( addr u1 -- addr u2 )
remove trailing incomplete xc

Of course, some of these can be defined in terms of others, but it's
not yet clear to me which subset we want to select for the proposal.
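
As an illustration of how some of these words follow from others, here
is a sketch that builds two of them from XCHAR+ and XC@+. The MY-
prefixed names are hypothetical and only avoid clashing with the
system's own definitions; XCHAR+ and XC@+ are assumed to exist with
the stack effects given above.

```forth
\ Sketch only: the MY- names are hypothetical. XCHAR+ and XC@+ are assumed
\ to be provided by the system.
: my+x/string ( xc-addr1 u1 -- xc-addr2 u2 )
    over dup xchar+ swap -    ( xc-addr1 u1 size ) \ size of the first xchar
    /string ;                 \ skip that many address units

: my-xc@ ( xc-addr -- xc )
    xc@+ nip ;                \ fetch the xchar, discard the updated address
```

Like 1 /STRING, MY+X/STRING does not check that the string is
non-empty; that is left to the caller.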

>The following words behave different when the XCHAR extension is present:

That is actually a compatible extension of ANS Forth's CHAR and
[CHAR]; for ASCII characters they behave exactly the same, and for
others ANS Forth does not specify a behaviour. So I would not say
"behave different", but use wording such as "extend the semantics of
...."

>Open issues are file reading and writing (conversion on the fly or leave as
>it is?).

Definitely conversion on the fly. There must be only one character
encoding in memory. However, we have not implemented that yet.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.complang.tuwien.ac.at/forth/ansforth/forth200x.html
EuroForth 2005: http://www.complang.tuwien.ac.at/anton/euroforth2005/