Re: RfD: c-addr/len



anton@xxxxxxxxxxxxxxxxxxxxxxxxxx (Anton Ertl) writes:

mhx@xxxxxx (Marcel Hendrix) writes:
anton@xxxxxxxxxxxxxxxxxxxxxxxxxx (Anton Ertl) writes Re: RfD: c-addr/len

Robert Epprecht <epprecht@xxxxxxxxx> writes:
I did not
know that there is no practical value in using CHARS.

The lack of practical value comes from the fact that maintained Forth
systems with "1 CHARS > 1" don't exist in practice.

If Unicode ever becomes necessary, I intend to go for 32-bit characters
(4 bytes) to keep things simple. Would there be problems (apart from
a slightly enlarged dataspace)?

Most programs have an environmental dependency on 1 CHARS = 1 and
would not work on your system unless you also make the address unit
32 bits (which would cause problems when calling foreign functions).

This isn't a dependency on 1 CHARS = 1; they depend on a unioctet
(one octet per character) encoding, i.e. the dependency is stricter.
That it works with UTF-8 is mere coincidence.

#download the UTF-8 example program
wget http://www.complang.tuwien.ac.at/forth/utf8/example.fs

This test checks quite another thing: whether the system is 8-bit
clean and accepts codes from the upper half of the code set as letters.
This invalidates your conclusion.

#now run it on the various Forth systems
gforth -e "include example.fs cr bye"
iforth "cr include example.fs cr bye"
vfxlin "cr include example.fs cr bye"
spf4 example.fs CR BYE

It worked on all these Forth systems that AFAIK have no particular
support for UTF-8. Note that the program uses a word name that is not
in ASCII.

The next test was cutting and pasting the program on the command line
rather than including it.

It worked on all systems, but there were some shortcomings:

On Gforth 0.6.2, command-line editing does not work properly for the
non-ASCII characters, but just pasting is fine.

Vfxlin has a similar problem; moreover, it works when the code is
copied after vfxlin was started, but not if it was copied before
(it shows "#" for the non-ASCII characters then; strange).

On iForth, the code looks funny in the command-line editor, but it
works.

On SP-Forth, there is no command-line editing, only backspace. That
worked perfectly, however, which was unexpected. Maybe SP-Forth has
special support for UTF-8. What does not work properly is the error
position indicator if there are non-ASCII characters before or in the
erroneous word.

So it worked only to the extent that strings were used as a whole;
wherever the text had to be processed, your test failed.

You didn't test anything where the character count matters, for
instance text formatting. Even simple line folding would show that.
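
To make the line-folding point concrete, here is a minimal sketch in
Python (not Forth, and not taken from any of the systems discussed).
The helper names fold_bytes, fold_chars and is_valid_utf8 are
hypothetical, and the sample string is an assumption chosen only
because it contains two 2-byte UTF-8 characters. It folds the same
text once by address units (bytes) and once by code points:

```python
# Hypothetical helpers for illustration only.

def fold_bytes(data: bytes, width: int) -> list:
    """Fold at every `width` bytes, ignoring character boundaries."""
    return [data[i:i + width] for i in range(0, len(data), width)]


def fold_chars(text: str, width: int) -> list:
    """Fold at every `width` code points."""
    return [text[i:i + width] for i in range(0, len(text), width)]


def is_valid_utf8(chunk: bytes) -> bool:
    """True if the chunk decodes as UTF-8 on its own."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


sample = "na\u00efve caf\u00e9"    # 10 code points, two of them non-ASCII
encoded = sample.encode("utf-8")   # 12 bytes: the two non-ASCII ones take 2 each

byte_lines = fold_bytes(encoded, 3)
char_lines = fold_chars(sample, 3)

# Byte-wise folding cuts the first 2-byte sequence in half, so the first
# two "lines" are not valid UTF-8 on their own.
validity = [is_valid_utf8(line) for line in byte_lines]
```

With this sample, the byte count (12) and the code point count (10)
differ, and the byte-wise fold at width 3 produces two broken lines,
while the code-point fold does not. Folding by display width would
additionally need combining characters and wide characters handled,
which this sketch does not attempt.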

The point of all that is to show that in most places one deals with
strings without having to know how they break into display characters;
even stuff deep inside a Forth system like the dictionary and file
inclusion (with parsing etc.) just deals with UTF-8 strings like it
deals with plain ASCII strings, and therefore it just works.

Sure, because this stuff depends on a unioctet encoding: it splits
text into words at ASCII blanks. And all the failures you noticed,
you attributed to a non-working error position indicator and the like.
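
For reference, the splitting at ASCII blanks can be sketched as
follows, again in Python and with a hypothetical helper name
(split_at_blanks) and made-up sample input. The sketch treats the
source purely as octets. In UTF-8 every byte of a multi-byte sequence
is 0x80 or above, so the octet 0x20 only ever occurs as a real blank;
in UTF-16 that does not hold, which is the contrast shown at the end:

```python
def split_at_blanks(data: bytes) -> list:
    """Split an octet string into words at ASCII blanks (0x20) only."""
    return [word for word in data.split(b" ") if word]


# A definition with a non-ASCII word name, encoded as UTF-8 (sample input).
source = ': gr\u00fc\u00dfe ." hi" ;'.encode("utf-8")
words = split_at_blanks(source)

# Every word is still valid UTF-8, because no multi-byte sequence
# contains the octet 0x20.
decoded = [word.decode("utf-8") for word in words]

# Contrast: in UTF-16LE, U+0120 is encoded as the octets 0x20 0x01,
# so an octet-wise search for blanks would split inside the character.
utf16_octets = "\u0120".encode("utf-16-le")
has_false_blank = b" " in utf16_octets
```

So the octet-wise parser gets the word boundaries right for UTF-8
input, but only because of how UTF-8 is laid out; it knows nothing
about how many characters each word contains.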

The only parts in Gforth that had to be changed to support UTF-8 and
other variable-width encodings were command-line editing and error
indication (because there display characters matter). And of course
we also added the xchars wordset so that applications can deal with
extended display characters, too.

Thus, you care only about Gforth and nothing else.


--
HE CE3OH...