aus and chars (was: CMOVE wrong?)



"J Thomas" <jethomas5@xxxxxxxxx> writes:
What good are raw address units? I remember asking somebody who was on
the standards committee in the early days about that, and he mumbled
something about somebody who'd been on the committee for a while who
had a system that was nibble-addressed, who insisted address units
shouldn't be bytes. But after they catered to him he dropped out.

A more relevant example in the early-90s timeframe is 16-bit
characters on byte-addressed machines, and Jack Woehr even implemented
such a Forth system to demonstrate that the Forth-94 approach works in
that setting (I don't think this system got much use, though).

If you can address individual nibbles I don't see much you can do with
those addresses.

You can use the native addresses of the hardware.

An alternative would be to use a different, non-native representation
for addresses, such that 1 chars = 1. The downside here is that there
would have to be conversions between the Forth addresses and the
native addresses in a number of places, in particular in all
memory-access primitives, and when communicating with non-Forth
software.

Given Forth's character as a language that is close to the metal, I
feel that such a non-native representation for addresses is somewhat
against the spirit of the language.

In any case, the Win32Forth users have some experience with using a
non-native address representation, as once upon a time they used
addresses relative to some base (and conversion words like REL>ABS
and ABS>REL) instead of the native addresses in their system.
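
To make the cost of such conversions concrete, here is a minimal sketch in
standard Forth. It is not Win32Forth's actual implementation: the buffer
IMAGE and the words REL-C@ and REL-C! are hypothetical names used only
for illustration, and REL>ABS and ABS>REL are redefined here in the
simplest possible way (an offset from a base address).

```forth
\ Hypothetical stand-in for a Forth memory image; relative addresses
\ are offsets (in address units) from its start.
create image 64 chars allot

: rel>abs ( rel -- abs )  image + ;   \ relative -> native address
: abs>rel ( abs -- rel )  image - ;   \ native -> relative address

\ Every memory-access word has to convert before touching memory.
: rel-c@ ( rel -- char )  rel>abs c@ ;
: rel-c! ( char rel -- )  rel>abs c! ;

char A 3 rel-c!      \ store at relative address 3
3 rel-c@ emit        \ fetch it back and print it
```

The same conversion is needed whenever an address is passed to or
received from non-Forth software, which is where it tends to get
forgotten.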

A language that took this approach is BCPL: it uses word addressing,
with consecutive words having consecutive addresses, and it does not
perform type checking; therefore it cannot use native addresses
(except on word-addressed machines). AmigaOS was partly written in
BCPL, and from what I read about it, the address conversion necessary
in many places was a significant pain.

So it seems to me that if you happen to have a system where an address
unit is smaller than a char, you could do:

: ALLOT CHARS ALLOT ;
: MOVE CHARS MOVE ;
: ERASE CHARS ERASE ;
: UNUSED CHARS UNUSED ;
: ALLOCATE CHARS ALLOCATE ;
: RESIZE CHARS RESIZE ;
: DUMP CHARS DUMP ;

: CHARS ; IMMEDIATE

and all your standard programs will work just as before

create a 5 cells allot
5 a 4 cells + !
a 4 cells erase
a 4 cells + @ .

would produce the wrong result on such a system. Ok, that would be
fixable by redefining CELLS (and FLOATS etc), but here's the killer:

s" abcdef" drop 3 chars + c@ emit

Here your system with the redefined CHARS would try to access the char
starting at the fourth au instead of at the fourth char of the string.
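
The offset arithmetic can be sketched as follows, assuming a hypothetical
system with 2 address units per char (say, 16-bit chars on a
byte-addressed machine). AUS/CHAR, REAL-CHARS and NOOP-CHARS are names
made up for this illustration; they stand for the system's proper CHARS
and for the redefined no-op CHARS, respectively.

```forth
2 constant aus/char   \ assumption: 2 address units per char

: real-chars ( n1 -- n2 )  aus/char * ;   \ what CHARS should compute
: noop-chars ( n -- n ) ;                 \ the redefined CHARS

3 real-chars .   \ offset of the fourth char, in address units
3 noop-chars .   \ offset used with the redefined CHARS
```

With the no-op version the offset stays at 3 address units, which on this
hypothetical system points into the middle of the second char, and plain +
has no way of knowing that it should scale.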

You cannot just redefine + in Forth to work as you would need. You
could do it in StrongForth.

Basically, what you are thinking of is something similar to the
approach that C took: automatic scaling of address arithmetic by type
size. That's appropriate for C with its type-aware compiler (and
even there I think it's not the greatest idea), but not for Forth.


However, I think that 1 CHARS = 1 is something that all current and
future standard systems (apart from that demonstration system by Jack
Woehr) will support, for the following reasons:

- There is a large amount of code around that assumes that 1 CHARS = 1.

- While there is hardware where an au is smaller than a byte, probably
nobody will implement a standard Forth system on such hardware (too
few resources).

- In the early 1990s the eventual transition to 16-bit fixed-width
Unicode characters (UCS-2) looked likely. However, UCS-2 was
superseded by the variable-width UTF-16 and the fixed-width
UCS-4/UTF-32, and apart from a few niches no transition to UCS-2 or
UCS-4 has happened. Instead, the transition to Unicode that I see
is mainly towards variable-width UTF-8 with 8-bit granularity; other
popular encodings like GB18030 (or GB2312 and GBK that GB18030 is
based on) are also variable-width with byte granularity.

The Forth-94 approach does not fit variable-width encodings, but
there is the xchars proposal for dealing with variable-width
encodings, and it is compatible with 1 CHARS = 1 for byte- or
word-addressed machines (especially if the encoding has byte
granularity).

So, there is no need to drop 1 CHARS = 1 to support Unicode.

- And even if somebody implemented a standard system on a
nibble-addressed machine, or a system with UCS-4 characters on a
byte-addressed machine, they would probably choose the BCPL-style
approach; while that is painful in places, being able to run the
code mentioned above without having to find all the places where
CHARS has been forgotten etc. is worth the pain.

If I am right about the support of 1 CHARS = 1 by all significant
systems, then you can just rely on that support; we might also
formally standardize this extension in Forth200x.
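
A program that relies on this can state the dependency explicitly. The
following is only a sketch of one way to do it, using [IF] and [THEN]
from the TOOLS EXT wordset and .( from CORE EXT; FOURTH-CHAR is a
hypothetical example word, not part of any proposal.

```forth
\ Refuse to load on a system where a char is not one address unit.
1 chars 1 <> [if]
  .( This program assumes 1 CHARS = 1 ) cr abort
[then]

\ After the check, plain + can be used for character offsets.
: fourth-char ( -- )  s" abcdef" drop 3 + c@ emit ;
fourth-char
```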

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2007: http://www.complang.tuwien.ac.at/anton/euroforth2007/