Re: RfD: XCHAR wordset (Version 3)



According to Stephen Pelc <stephenXXX@xxxxxxxxxxxxxxxxxxxx>:
> It is also worth remembering that for UTF-8, a character number
> (I forget the exact terminology) is not the same as its byte
> sequence.

In the Unicode world, they say "code point". Code points range from 0 to
10FFFF. A few values are permanently excluded. This space of code points
is not all allocated yet, and no extension of that range is currently
foreseen. ASCII characters (0 to 127) map to the same code point values.
Actually, Latin-1 characters (0 to 255) also map to the same code point
values.
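
As a small illustration of the above, here is a Python sketch (Python is
only my choice for the example; the names MAX_CODE_POINT and
in_code_point_range are made up here). It checks that Latin-1 byte values
coincide with code point values, and that the range stops at 10FFFF.

```python
# Code points are plain integers; the range is 0..0x10FFFF.
MAX_CODE_POINT = 0x10FFFF

def in_code_point_range(n):
    """Return True if n lies within 0..0x10FFFF (hypothetical helper)."""
    return 0 <= n <= MAX_CODE_POINT

# Decode every possible octet as Latin-1 and compare each resulting
# code point with the octet value it came from.
latin1_bytes = bytes(range(256))
latin1_text = latin1_bytes.decode("latin-1")
same_values = all(ord(ch) == b for ch, b in zip(latin1_text, latin1_bytes))

print(same_values)
print(in_code_point_range(0x10FFFF), in_code_point_range(0x110000))
```

Note that this only tests the numeric range; it does not know about the
few values that are permanently excluded.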

Now, a "character" is something which is not easy to define in all
generality. A "character" could be seen as an elementary brick for
making words, but that is an awfully American point of view. To
designate a graphical unit, the term "glyph" is often used. In Unicode,
a glyph may result from the combination of several code points; for
instance, the U+00E9 code point stands for "LATIN SMALL LETTER E WITH
ACUTE" (very common in French, for instance), which can also be
represented as two successive code points, U+0065 (LATIN SMALL LETTER E)
and U+0301 (COMBINING ACUTE ACCENT). Both the single code point U+00E9
and the sequence U+0065 U+0301 designate the same glyph.
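
The e-acute example can be checked with Python's standard unicodedata
module. This is just a sketch: the variable names are mine, and
normalization (NFC/NFD) is the standard Unicode mechanism for comparing
such sequences, not something the RfD discusses.

```python
import unicodedata

precomposed = "\u00e9"       # single code point U+00E9
decomposed = "\u0065\u0301"  # U+0065 followed by U+0301

# As raw code point sequences, the two strings are different...
differ_raw = precomposed != decomposed

# ...but normalization maps each form onto the other.
equal_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
equal_nfd = unicodedata.normalize("NFD", precomposed) == decomposed

print(len(precomposed), len(decomposed), differ_raw, equal_nfc, equal_nfd)
```

So a comparison which works on code points alone will declare the two
spellings different unless it normalizes first.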


Unicode also defines some _encodings_ which tell how sequences of code
points may become streams of bits or bytes. They often use the term
"code unit" to designate an element of an encoded stream, e.g. an octet.
With UTF-8, each code point becomes a sequence of 1 to 4 octets; UTF-8
has some convenient properties with regard to ASCII: an ASCII character
(i.e. a code point between 0 and 127) is encoded as a single octet with
the same value, whereas all other code points are encoded as sequences
of 2 to 4 octets which all have a value between 128 and 244. Thus, an
ASCII stream is also a UTF-8 stream with the same meaning.
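
To make the 1-to-4-octet rule concrete, here is a short Python sketch
using the built-in UTF-8 codec. The helper name utf8_octets and the
sample characters are my own picks for illustration.

```python
def utf8_octets(ch):
    """Return the UTF-8 encoding of ch as a list of octet values."""
    return list(ch.encode("utf-8"))

# An ASCII string yields the same octets under both encodings.
ascii_text = "XCHAR wordset"
ascii_is_utf8 = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# U+0041, U+00E9, U+20AC and U+1D11E need 1, 2, 3 and 4 octets.
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]
lengths = [len(utf8_octets(ch)) for ch in samples]

# Every octet of a non-ASCII code point is 128 or above, so it can
# never be mistaken for an ASCII character.
high_only = all(octet >= 0x80 for ch in samples[1:] for octet in utf8_octets(ch))

print(ascii_is_utf8, lengths, high_only)
```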

In the early times of Unicode, the range was only 0 to FFFF; hence there
was an encoding (UCS-2) where each code point became a 16-bit word (i.e.
two octets on any octet-oriented stream). That 0..FFFF range is what is
now called the "BMP", the Basic Multilingual Plane. Java and Windows
jumped on the Unicode bandwagon at that point, and it turned out to be a
bit too early, because the 16-bit range proved to be too small for all
the scripts in the world (many Asian scripts use glyphs by the
thousands). Hence UTF-16 was invented, which is to UCS-2 what UTF-8 is
to ASCII. UTF-16 is an encoding over a stream of 16-bit code units. In
UTF-16, each code point becomes either one or two code units. Code
points from the 0..FFFF range become a single code unit, while code
points from the 10000..10FFFF range become two code units. This scheme
uses a reserved range in the original 0..FFFF range, called "surrogates"
(namely, U+D800 to U+DFFF). Surrogates are not supposed to appear alone:
a code point beyond FFFF is encoded as two successive surrogates, a high
one (D800..DBFF) followed by a low one (DC00..DFFF). UTF-16 is what Java
uses for its strings (the JVM offers access to 16-bit code units --
"char" in the Java world -- without defining an encoding in memory).
Windows now uses UTF-16, encoded in little-endian convention (the
Windows kernel expects such strings, but there are userland functions
which translate to and from UTF-8, for compatibility with application
code which wishes to remain close to ASCII).
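
The surrogate arithmetic is short enough to spell out. Below is a Python
sketch; to_surrogates and utf16_units are hypothetical helper names, and
the second one only cross-checks the hand computation against Python's
own UTF-16 codec.

```python
def to_surrogates(code_point):
    """Split a code point in 0x10000..0x10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the 10000..10FFFF range")
    offset = code_point - 0x10000    # a 20-bit value
    high = 0xD800 + (offset >> 10)   # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # lower 10 bits
    return high, low

def utf16_units(text):
    """Return the 16-bit code units of text, via the little-endian codec."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

pair = to_surrogates(0x1D11E)        # U+1D11E, MUSICAL SYMBOL G CLEF
print([hex(u) for u in pair])
print([hex(u) for u in utf16_units("\U0001d11e")])
```

Code points within 0..FFFF go through unchanged as a single code unit,
which is why the helper refuses them.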


The whole thing has some depressing consequences. Namely, even if you
can use a flat regular encoding (e.g. UTF-32, where each code point
becomes a sequence of exactly four octets), you still have to consider
the possibility that a glyph may use several code points. It is not
that Unicode is badly designed; rather, human scripts and languages are
just complex things which cannot really be compacted into a
one-size-fits-all encoding. Programmatically, you have to handle
variable-sized chunks of RAM.
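
Here is a rough Python sketch of what "variable-sized chunks" means even
with UTF-32. The function rough_clusters is a deliberately simplified
helper of mine: it only attaches combining marks to the preceding code
point, whereas real text segmentation has many more rules (Unicode
describes them in its Standard Annex #29).

```python
import unicodedata

def rough_clusters(text):
    """Group each code point with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # combining() returns a nonzero class for combining marks.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

word = "cafe\u0301"                          # "cafe" + COMBINING ACUTE ACCENT
utf32_size = len(word.encode("utf-32-le"))   # four octets per code point
clusters = rough_clusters(word)

print(len(word), utf32_size, len(clusters))
```

The encoding is perfectly regular (5 code points, 20 octets), yet the
last visible unit still spans two code points.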


> Without a shorthand term such as pchar, we won't be able to deal with
> internationalisation or localisation in a sensible way.

In the C world, they call them "multibyte characters" or "wide
characters", the latter shortened to "wchar" (or "wchar_t" for the
syntactic type). In most modern Unix-like systems, wchar_t is a 32-bit
type, and a wchar_t value contains a Unicode code point. On Windows,
wchar_t is a 16-bit type and contains a UTF-16 code unit: either a code
point from the "first plane" (the 0..FFFF range) or one half of a
surrogate pair.
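
If you want to see which case your own platform falls in without writing
C, Python's ctypes module exposes the C wchar_t type as ctypes.c_wchar.
A minimal sketch (the variable names are mine, and the second half
assumes that a 16-bit wchar_t holds UTF-16 code units):

```python
import ctypes

# Size of the platform's C wchar_t, in bits.
wchar_bits = ctypes.sizeof(ctypes.c_wchar) * 8

# Number of wchar_t elements needed for one code point beyond FFFF:
# two code units if wchar_t is 16-bit, one otherwise.
clef = "\U0001d11e"
if wchar_bits == 16:
    units_needed = len(clef.encode("utf-16-le")) // 2
else:
    units_needed = len(clef)

print(wchar_bits, units_needed)
```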


--Thomas Pornin