Re: RfD: XCHAR wordset (Version 3)



m_l_g3 wrote:

On Nov 24, 12:36 am, Bernd Paysan <bernd.pay...@xxxxxx> wrote:
Common encodings:

Input and files are commonly encoded as iso-latin-1 or utf-8.

Believe me, a lot goes in CP-866.

Or the quite similar KOI-8. At least that's also ASCII compatible ;-).

X-SIZE ( xc_addr u1 -- u2 ) XCHAR
Computes the memory size of the first xchar stored at xc_addr in pchars.

is u1 used at all? what happens if u1<u2? In particular, with u1=0?

Ok, needs clarification: u1 is the buffer length; there's an ambiguous condition if the xchar is incomplete due to limited buffer length. X-SIZE should not access bytes outside of the buffer. I'll change the reference implementation so that it returns 0 when u1=0 - otherwise, u1 will be ignored in the UTF-8 case, since the first byte determines the length.

I think I should add some notes about ambiguous conditions, too.
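As a rough illustration of the X-SIZE behaviour described above for the UTF-8 case, here is a minimal sketch in Python rather than Forth. The name x_size and the handling of invalid lead bytes are assumptions made for the example, not part of the proposal; only the "return 0 when u1=0" rule and "the first byte determines the length" come from the discussion.

```python
def x_size(buf: bytes, u1: int) -> int:
    """Size in bytes (pchars) of the first UTF-8 xchar in a buffer of length u1.

    Hypothetical helper mirroring X-SIZE ( xc_addr u1 -- u2 ).
    """
    if u1 == 0:
        return 0  # empty buffer: no byte is accessed at all
    lead = buf[0]  # only the first byte is inspected
    if lead < 0x80:
        return 1   # plain ASCII
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    # Assumption: a stray continuation byte or invalid lead byte
    # is counted as a single pchar.
    return 1
```

Note that the result can be larger than u1 when the buffer ends in the middle of an xchar; that is the ambiguous condition mentioned above, and a caller would have to compare u2 against u1 to detect it.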

PARSE ( xc "text<xc>" -- addr u )
Parse a text in the input stream with the xchar xc as delimiter.

Then WORD will have to go here, too.

I consider WORD obsolescent (better use PARSE-NAME), especially when used with non-blank arguments.

Do you plan on-the-fly encoding switching?

No. The only switch of encoding that's reasonably possible is from ASCII to some ASCII-compatible encoding, but going back is not a good idea.

As to changing the semantics of READ-FILE... currently, BIN does nothing.
If you add conversion semantics to it, you will break A LOT of code.

In BIN mode, certainly no conversion may happen. In text mode, it's reasonable to assume that some conversion may happen, i.e. you set the file encoding, and it transcodes the file to the internal representation.

As a person having experience of working with text in the same language
but different encodings, I propose to leave READ/WRITE-FILE alone, and
have re-coding functions that work in memory.

With such a re-coding function, everybody can write his own READ/WRITE-RECODE-FILE word. I think this idea is useful.
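To make the idea concrete, here is a small sketch, in Python rather than Forth, of an in-memory re-coding function with a file-reading wrapper built on top of it. The names recode and read_recode_file are hypothetical, and the codec names ("cp866", "koi8_r", "utf-8") are those of Python's standard codecs, not anything defined by the RfD.

```python
def recode(data: bytes, src: str, dst: str) -> bytes:
    """Re-code a byte buffer in memory from encoding src to encoding dst."""
    # Decode to code points first, then encode to the target encoding.
    return data.decode(src).encode(dst)


def read_recode_file(path: str, src: str, dst: str = "utf-8") -> bytes:
    """Read a file unchanged (binary mode), then re-code its contents.

    The file access itself performs no conversion; all transcoding
    happens in memory, as proposed in the thread.
    """
    with open(path, "rb") as f:
        return recode(f.read(), src, dst)
```

The point of the split is that the file words stay untouched, so existing code that relies on them keeps working, and anyone who needs transcoding composes the two pieces.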

As to your code, what is an xc: a Unicode code point or a sequence of
utf8 bytes packed into one cell?

In my code xc is a Unicode code point, but a sequence of utf8 bytes packed into one cell would be a valid implementation, too.
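The two representations can be compared with a short sketch (Python, not Forth). The helper names and the big-endian byte order inside the cell are assumptions chosen for the example; the thread does not specify how the bytes would be packed.

```python
def pack_utf8(code_point: int) -> int:
    """Pack the UTF-8 bytes of a code point into one integer 'cell'.

    Assumption: bytes are packed big-endian, first byte most significant.
    """
    return int.from_bytes(chr(code_point).encode("utf-8"), "big")


def unpack_utf8(cell: int) -> int:
    """Recover the code point from a packed-UTF-8 cell."""
    length = max(1, (cell.bit_length() + 7) // 8)  # at least one byte
    return ord(cell.to_bytes(length, "big").decode("utf-8"))


# U+20AC (euro sign) is E2 82 AC in UTF-8.
euro_as_code_point = 0x20AC
euro_as_packed_bytes = pack_utf8(euro_as_code_point)  # 0xE282AC
```

Either value fits in a 32-bit cell, since a UTF-8 sequence as defined today is at most four bytes long.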

I would prefer to deal with code points since they IMO are more meaningful.

My opinion, too. I will put this as a comment into the informal section.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/