Re: RfD: XCHAR wordset (Version 3)
- From: Bernd Paysan <bernd.paysan@xxxxxx>
- Date: Tue, 25 Nov 2008 09:55:21 +0100
m_l_g3 wrote:
On Nov 24, 12:36 am, Bernd Paysan <bernd.pay...@xxxxxx> wrote:
Common encodings:
Input and files are commonly encoded in ISO-Latin-1 or UTF-8.
Believe me, a lot goes in CP-866.
Or the quite similar KOI-8. At least that's also ASCII compatible ;-).
X-SIZE ( xc_addr u1 -- u2 ) XCHAR
Computes the memory size of the first xchar stored at xc_addr in
pchars.
Is u1 used at all? What happens if u1<u2? In particular, with u1=0?
OK, this needs clarification: u1 is the buffer length. There is an ambiguous condition if the xchar is incomplete because the buffer is too short. X-SIZE should not access bytes outside of the buffer. I'll change the reference implementation so that it returns 0 when u1=0; otherwise, u1 is ignored in the UTF-8 case, since the first byte determines the length.
I think I should add some notes about ambiguous conditions, too.
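A minimal sketch of the X-SIZE behaviour described above, assuming UTF-8 and byte-sized pchars. The name utf8-x-size is hypothetical and this is not the RfD's reference implementation. Returning 1 for a stray continuation byte and 4 for every lead byte from F0 upwards are simplifying assumptions; a truncated sequence (u1 smaller than the result) remains an ambiguous condition and is not detected here.

```forth
hex
\ Hypothetical sketch: size in bytes of the first UTF-8 xchar in a buffer.
: utf8-x-size ( xc_addr u1 -- u2 )
  0= if  drop 0 exit  then          \ empty buffer: return 0 without touching memory
  c@                                \ only the first byte is inspected
  dup 0C0 u< if  drop 1 exit  then  \ ASCII, or a stray continuation byte (assumed size 1)
  dup 0E0 u< if  drop 2 exit  then  \ 110xxxxx: two-byte sequence
  dup 0F0 u< if  drop 3 exit  then  \ 1110xxxx: three-byte sequence
  drop 4 ;                          \ 11110xxx: four-byte sequence (larger lead bytes not rejected)
decimal
```

With this definition, s" A" utf8-x-size . prints 1, and any address with u1=0 gives 0.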
PARSE ( xc "text<xc>" -- addr u )
Parse a text in the input stream with the xchar xc as delimiter.
Then here will go WORD.
I consider WORD obsolescent (better use PARSE-NAME), especially when used with non-blank arguments.
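To illustrate the difference, here is a small sketch assuming a system that provides PARSE-NAME. The word names word-len and name-len are made up for this example.

```forth
\ Illustration only; both word names are hypothetical.
\ WORD copies the token into a transient counted string.
: word-len ( "name" -- u )  bl word count nip ;
\ PARSE-NAME returns an address/length pair pointing into the input buffer.
: name-len ( "name" -- u )  parse-name nip ;
```

Typing name-len hello . prints 5. The PARSE-NAME version has no counted-string length limit and no copy into a transient region.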
Do you plan on-the-fly encoding switching?
No. The only switch of encoding that's reasonably possible is from ASCII to some ASCII-compatible encoding, but going back is not a good idea.
As to changing the semantics of READ-FILE... currently, BIN does
nothing.
If you add conversion semantics to it, you will break A LOT of code.
In BIN mode, certainly no conversion may happen. In text mode, it's reasonable to assume that some conversion may happen, i.e. you set the file encoding, and it transcodes the file to the internal representation.
As a person with experience of working with text in the same language
but different encodings, I propose to leave READ-FILE/WRITE-FILE alone, and
have re-coding functions that work in memory.
With such a re-coding function, everybody can write his own READ/WRITE-RECODE-FILE word. I think this idea is useful.
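Such an in-memory re-coding word could look like the following sketch, assuming ISO-Latin-1 input (where each byte value equals its code point) and UTF-8 output. The name latin1>utf8 and its stack effect are hypothetical and not part of the RfD. CP-866 or KOI-8 would need a lookup table instead of the bit arithmetic used here.

```forth
hex
\ Hypothetical sketch: recode u1 Latin-1 bytes at c-addr1 to UTF-8 at c-addr2.
\ The destination needs room for up to 2*u1 bytes. u2 is the number of bytes written.
: latin1>utf8 ( c-addr1 u1 c-addr2 -- u2 )
  dup >r                         \ remember the start of the destination
  rot rot                        ( dst src u )
  over + swap ?do                ( dst )  \ I walks over the source bytes
    i c@ dup 080 u< if
      over c! 1+                 \ ASCII byte: copy unchanged
    else
      2dup 6 rshift 0C0 or swap c!   \ lead byte 110000xx
      03F and 080 or over 1+ c!      \ continuation byte 10xxxxxx
      2 +
    then
  loop
  r> - ;                         \ bytes written = end - start
decimal
```

A READ-RECODE-FILE word would then call READ-FILE into a scratch buffer and pass that buffer through such a word, leaving READ-FILE itself unchanged.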
As to your code, what is an xc: a Unicode code point or a sequence of
UTF-8 bytes packed into one cell?
In my code xc is a Unicode code point, but a sequence of UTF-8 bytes packed into one cell would be a valid implementation, too.
I would prefer to deal with code points since they IMO are more
meaningful.
My opinion, too. I will put this as a comment in the informal section.
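The two possible representations of an xc can be compared with a small sketch. It is restricted to two-byte UTF-8 sequences (U+0080 to U+07FF), and the word names cp>packed2 and packed2>cp are hypothetical. Packing the lead byte into the higher bits is one assumed layout; an implementation could choose another.

```forth
hex
\ Hypothetical sketch, valid only for code points U+0080..U+07FF.
: cp>packed2 ( codepoint -- packed )
  dup 6 rshift 0C0 or 8 lshift     \ lead byte 110xxxxx in bits 15..8
  swap 03F and 080 or or ;         \ continuation byte 10xxxxxx in bits 7..0
: packed2>cp ( packed -- codepoint )
  dup 8 rshift 01F and 6 lshift    \ five payload bits from the lead byte
  swap 03F and or ;                \ six payload bits from the continuation byte
decimal
```

For U+00E4 the code-point form is the cell value E4 (hex), while the packed form is C3A4 (hex). Arithmetic and comparisons on the code-point form follow Unicode order directly, which is one reason it is the more meaningful choice.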
--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/