Re: RfD: XCHAR wordset (Version 3)



Anton Ertl wrote:
>> Hm, what about "distributing an internationalized program as source code"?
>> How do you do that when you don't know what kind of charsets the systems
>> use?
>
> As long as the program does not contain non-ASCII data, it does not
> need to know the encoding of the data it processes in order to work.

Any internationalized and localized program *will* contain non-ASCII data:
The strings for the translated texts.
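
To make that concrete, here is a minimal sketch in C (not Forth, and not part
of the proposal). The message text and the names msg_de and utf8_length are
made up for illustration. The string is written with explicit UTF-8 byte
escapes so that the example itself does not depend on the encoding of the
source file, which is exactly the problem under discussion.

```c
#include <stddef.h>

/* Count the code points in a NUL-terminated UTF-8 string by skipping
   continuation bytes (bit pattern 10xxxxxx). Assumes valid UTF-8 input. */
size_t utf8_length(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical translated message: the German word "Groesse" with o-umlaut
   (C3 B6) and sharp s (C3 9F) spelled out as UTF-8 bytes. The literal is
   split before "e" so that 'e' is not read as part of the hex escape. */
static const char msg_de[] = "Gr\xC3\xB6\xC3\x9F" "e";
```

With these definitions, strlen(msg_de) is 7 while utf8_length(msg_de) is 5.
The same text stored as Latin-1 would occupy 5 bytes, so source code that
carries it has to agree with the system on an encoding.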

Changing encodings is messy. The e-mail case you mentioned allows changing
encodings; there is even an RFC that defines all the available encodings. It
is messy, which is why the W3C dropped support for many different encodings
in XHTML (HTML still supports them).

Let's rephrase it:

* A standard system can provide ways to change the internal and external
encoding, to support legacy applications.

* A legacy application that uses one or several non-Unicode encodings is not
a standard program. That doesn't matter; it never has been one. It can use
the XCHAR words to deal with the non-Unicode encodings if the vendor provides
an extension to change the encoding, and use this in a transition phase
before converting the data to UTF-8.

* How to deal with multiple different encodings is outside the scope of the
current xchar proposal. IMHO this probably should stay outside the scope of
any standard, because different systems might have different legacy
requirements.

> Not on the Unix systems I use. I set LANG=C, and whenever I work on
> an account that doesn't, something fails pretty soon and reminds me to
> set LANG=C there, too.

Hm, setting LANG=C makes your local terminal UTF-8-unaware. Of course things
break when you log into a system that assumes UTF-8. I usually set only
LC_NUMERIC=C and leave LANG=de_DE.UTF-8; that fixes most of the annoying
problems, like programs writing PostScript files with German number
conventions. Gforth's io.c sets the LC_NUMERIC locale to C internally for
exactly this reason (otherwise f. would print 123,456 instead of 123.456).
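
The idea can be sketched in C roughly as follows. This is not the actual code
from Gforth's io.c; format_fixed is a hypothetical helper that only shows the
effect of overriding LC_NUMERIC after the user's locale has been adopted.

```c
#include <locale.h>
#include <stdio.h>

/* Format x with three decimals, using '.' as the decimal separator
   regardless of the user's locale. Hypothetical helper for illustration. */
int format_fixed(char *buf, size_t size, double x)
{
    setlocale(LC_ALL, "");       /* adopt the environment, e.g. de_DE.UTF-8 */
    setlocale(LC_NUMERIC, "C");  /* but keep C conventions for numbers */
    return snprintf(buf, size, "%.3f", x);
}
```

Without the second setlocale call, a de_DE locale would make the same
snprintf use a comma as the decimal separator.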

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/