Re: RfD: XCHAR wordset (Version 3)
- From: Thomas Pornin <pornin@xxxxxxxxx>
- Date: 25 Nov 2008 12:18:53 GMT
According to Stephen Pelc <stephenXXX@xxxxxxxxxxxxxxxxxxxx>:
> It is also worth remembering that for UTF-8, a character number
> (I forget the exact terminology) is not the same as its byte
> sequence.
In the Unicode world, they say "code point". Code points range from 0 to
10FFFF. A few values are permanently excluded. This space of code points
is not all allocated yet, and no extension of that range is currently
foreseen. ASCII characters (0 to 127) map to the same code point values.
In fact, all Latin-1 characters (0 to 255) map to the same code point
values.
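
As a quick check of that mapping, here is a minimal Python sketch. The
helper name latin1_code_point is made up for this illustration; it relies
only on the standard "latin-1" codec.

```python
def latin1_code_point(byte_value: int) -> int:
    """Decode one Latin-1 byte and return the resulting Unicode code point.

    Hypothetical helper, for illustration only.
    """
    return ord(bytes([byte_value]).decode("latin-1"))


# Every byte value 0..255 should come back as the same number.
mismatches = [b for b in range(256) if latin1_code_point(b) != b]
print(mismatches)
```

An empty list means every Latin-1 value and its code point agree, ASCII
included since ASCII is the 0..127 subset.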
Now, a "character" is something which is not easy to define in all
generality. A "character" could be seen as an elementary brick for
making words, but that is an awfully American point of view. To
designate a graphical unit, the term "glyph" is often used. In Unicode,
a glyph may result from the combination of several code points; for
instance, the U+00E9 code point stands for "LATIN SMALL LETTER E WITH
ACUTE" (very common in French), which can also be represented as two
successive code points, U+0065 (LATIN SMALL LETTER E) and U+0301
(COMBINING ACUTE ACCENT). Both the single code point U+00E9 and the
sequence U+0065 U+0301 designate the same glyph.
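
To make the two spellings concrete, here is a small Python sketch using
the standard unicodedata module. The variable names are mine; NFC and NFD
are the Unicode normalization forms that compose and decompose such
sequences.

```python
import unicodedata

precomposed = "\u00e9"   # one code point: e with acute
decomposed = "e\u0301"   # two code points: e, then combining acute accent

# The two strings differ as code point sequences...
same_code_points = precomposed == decomposed
lengths = (len(precomposed), len(decomposed))

# ...but normalization maps each spelling onto the other.
composed_again = unicodedata.normalize("NFC", decomposed)
decomposed_again = unicodedata.normalize("NFD", precomposed)

print(same_code_points, lengths)
print(composed_again == precomposed, decomposed_again == decomposed)
```

Comparing strings "as the user sees them" therefore usually means
normalizing both sides first.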
Unicode also defines some _encodings_ which tell how sequences of code
points may become streams of bits or bytes. They often use the term
"code unit" to designate an element of an encoded stream, e.g. an octet.
With UTF-8, each code point becomes a sequence of 1 to 4 octets; UTF-8
has some convenient properties with regard to ASCII: an ASCII character
(i.e. a code point between 0 and 127) is encoded as a single octet with
the same value, whereas all other code points are encoded as sequences
of 2 to 4 octets which all have a value between 128 and 244. Thus, an
ASCII stream is also a UTF-8 stream with the same meaning.
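
The bit layout behind this can be sketched in a few lines of Python. This
is a hand-written illustration, not anybody's reference implementation;
the function name utf8_encode is mine, and it rejects the D800..DFFF
range, which is among the permanently excluded values.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8 (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code point range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("reserved values are not encodable")
    if code_point < 0x80:
        # ASCII: one octet, same value.
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for cp in (0x41, 0xE9, 0x20AC, 0x10FFFF):
    print(hex(cp), utf8_encode(cp).hex())
```

Every octet of a multi-octet sequence has its high bit set, which is why
such sequences can never be mistaken for ASCII.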
In the early times of Unicode, the range was only 0 to FFFF; hence there
was an encoding (UCS-2; the 0..FFFF range itself is now called the
"BMP", for Basic Multilingual Plane) where each code point became a
16-bit word (i.e. two octets on any octet-oriented stream). Java and
Windows jumped on the Unicode bandwagon at that point, and it turned out
to be a bit too early, because the 16-bit range proved to be too small
for all the scripts in the world (many Asian scripts use glyphs by the
thousands). Hence UTF-16 was invented, which is to UCS-2 what UTF-8 is
to ASCII. UTF-16 is an encoding over a stream of 16-bit code units. In
UTF-16, each code point becomes either one or two code units. Code
points from the 0..FFFF range become a single code unit, while code
points from the 10000..10FFFF range become two code units. This scheme
uses a reserved range in the original 0..FFFF range, called "surrogates"
(namely, U+D800 to U+DFFF). Surrogates are not supposed to appear on
their own; a code point beyond FFFF is encoded as two successive
surrogates. UTF-16 is what Java uses for its strings (the JVM offers
access to 16-bit code units -- "char" in the Java world -- without
defining an encoding in memory). Windows now uses UTF-16, encoded in
little-endian convention (the Windows kernel expects such strings, but
there are userland functions which translate to and from UTF-8, for
compatibility with application code which wishes to remain close to
ASCII).
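
The surrogate arithmetic can be sketched as follows in Python. The
function names are mine; the split into a high surrogate (D800..DBFF)
carrying the top 10 bits and a low surrogate (DC00..DFFF) carrying the
bottom 10 bits is how UTF-16 defines it.

```python
import struct


def utf16_code_units(code_point: int) -> list:
    """Return the UTF-16 code units for one code point (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code point range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("a surrogate is not a valid code point to encode")
    if code_point < 0x10000:
        return [code_point]              # one code unit, same value
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 | (offset >> 10)       # top 10 bits
    low = 0xDC00 | (offset & 0x3FF)      # bottom 10 bits
    return [high, low]


def utf16le_bytes(code_point: int) -> bytes:
    """Serialize the code units in little-endian order, as Windows does."""
    units = utf16_code_units(code_point)
    return struct.pack("<%dH" % len(units), *units)


print([hex(u) for u in utf16_code_units(0x1F600)])
print(utf16le_bytes(0x1F600).hex())
```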
The whole thing has some depressing consequences. Namely, even if you
can use a flat regular encoding (e.g. UTF-32, where each code point
becomes a sequence of exactly four octets), you still have to consider
the possibility that a glyph may use several code points. This is not
because Unicode is badly designed; rather, human scripts and languages
are just complex things which cannot really be compacted into a
one-size-fits-all encoding. Programmatically, you have to handle
variable-sized chunks of RAM.
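
A short Python sketch of that point, using the standard "utf-32-le" codec
(which writes no byte-order mark). The helper name utf32_length is mine.

```python
def utf32_length(text: str) -> int:
    """Number of octets the text occupies in UTF-32 (illustrative helper)."""
    return len(text.encode("utf-32-le"))


precomposed = "\u00e9"   # e with acute, as one code point
decomposed = "e\u0301"   # e followed by a combining acute accent

# Same glyph on screen, different storage size even in a flat encoding.
print(utf32_length(precomposed), utf32_length(decomposed))
```

Four octets per code point is guaranteed; four octets per glyph is not.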
> Without a shorthand term such as pchar, we won't be able to deal with
> internationalisation or localisation in a sensible way.
In the C world, they call them "multibyte characters" or "wide
characters", the latter abbreviated "wchar" (or "wchar_t" for the type
itself). In most modern Unix-like systems, wchar_t is a 32-bit type, and
a wchar_t value contains a Unicode code point. On Windows, a wchar_t is
a 16-bit type and contains a UTF-16 code unit: either a code point from
the "first plane" (the 0..FFFF range) or one half of a surrogate pair.
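
If you want to see which case applies on a given machine, Python's ctypes
module exposes the platform's wchar_t as ctypes.c_wchar. A minimal sketch
follows; the names WCHAR_BITS and wchar_units are mine, and the result
depends on the platform and C library in use.

```python
import ctypes

# Width of the platform's wchar_t, in bits (commonly 32 on Unix-like
# systems and 16 on Windows; other values are possible in principle).
WCHAR_BITS = ctypes.sizeof(ctypes.c_wchar) * 8


def wchar_units(code_point: int) -> int:
    """How many wchar_t values one code point needs here (illustrative)."""
    if WCHAR_BITS >= 32 or code_point < 0x10000:
        return 1
    return 2  # 16-bit wchar_t: a surrogate pair


print(WCHAR_BITS, wchar_units(0x41), wchar_units(0x1F600))
```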
--Thomas Pornin