Re: Case-sensitivity as option?
- From: Bernd Paysan <bernd.paysan@xxxxxx>
- Date: Tue, 06 Jan 2009 23:32:26 +0100
Thomas Pornin wrote:
> According to Bernd Paysan <bernd.paysan@xxxxxx>:
>> Unicode can have up to 2^31 code points, so no practical limitation.
>
> Code points beyond 0x10FFFF cannot be encoded with UTF-16, however. This
> is also the current limit of "valid code point values" in Unicode. Now
> that Microsoft has committed to UTF-16 (it is used throughout the
> Windows system libraries and kernel), it is unlikely that Unicode will
> allow code point values to exceed 0x10FFFF in the near future. That is
> still quite large.
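
The 0x10FFFF ceiling follows from surrogate arithmetic: 1024 high surrogates times 1024 low surrogates give 2^20 code points on top of the first 0x10000. A minimal Python sketch of that arithmetic (the function names are illustrative, not taken from any library):

```python
# UTF-16 surrogate ranges as defined by Unicode.
HIGH_START, HIGH_END = 0xD800, 0xDBFF   # 1024 high (lead) surrogates
LOW_START, LOW_END = 0xDC00, 0xDFFF     # 1024 low (trail) surrogates


def decode_pair(high, low):
    """Combine one surrogate pair into a code point."""
    if not (HIGH_START <= high <= HIGH_END and LOW_START <= low <= LOW_END):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


def encode_pair(cp):
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is outside the surrogate-encodable range")
    v = cp - 0x10000                     # 20-bit offset
    return HIGH_START + (v >> 10), LOW_START + (v & 0x3FF)


# The largest possible pair yields the UTF-16 limit.
largest = decode_pair(HIGH_END, LOW_END)
print(hex(largest))
```

The largest pair, 0xDBFF 0xDFFF, decodes to 0x10FFFF, so nothing above that value has a UTF-16 representation without changing the encoding itself.
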
The C# language specification already claims "Unicode characters with code
points above 0x10FFFF are not supported". I.e. if Unicode extends above the
limitations of UTF-16, it's Microsoft who will be left behind. IIRC, they
were already left behind when UTF-16 introduced surrogates, and it took
some time for them to recover.
IMHO, it would be fairly trivial to extend the available kernel calls in
Windows to UTF-8. Just use it as an 8-bit codepage, so that the *A system
calls can work with it (they are all code-page specific, anyway). The
original idea that the *W system calls contain one character per short is
already broken with UTF-16, so changing the assumption on *A system calls
for a specific code-page should not be such a tough problem, either.
BTW: If the "end" of UTF-16 comes near, you might as well define the
surrogates that lead to $0Fxxxx and $10xxxx as new super-surrogates, which
together give you the full 32-bit UCS-4 space. I think this is the
appropriate way to deal with Microsoft. I would suggest that the actual
code points from $000F0000 onwards should *not* be reserved, but that
super-surrogates must be used for these even though there would be a way to
encode them with normal surrogates.
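
To get a feel for the capacity of such super-surrogates, here is a rough Python sketch. The counting part is plain UTF-16 arithmetic; the `pack`/`unpack` bit layout is purely hypothetical (the post does not specify one) and only shows that two 17-bit super-surrogates are enough to carry a 32-bit value.

```python
SUPER_START, SUPER_END = 0xF0000, 0x10FFFF      # planes 15 and 16
SUPER_COUNT = SUPER_END - SUPER_START + 1       # 0x20000 values, i.e. 17 bits


def high_surrogate(cp):
    """High surrogate of the normal UTF-16 pair for a supplementary code point."""
    return 0xD800 + ((cp - 0x10000) >> 10)


def pack(value):
    """Hypothetical layout: top 15 bits in the first super-surrogate,
    low 17 bits in the second one."""
    if not 0 <= value < 2**32:
        raise ValueError("value does not fit in 32 bits")
    return SUPER_START + (value >> 17), SUPER_START + (value & 0x1FFFF)


def unpack(first, second):
    """Inverse of pack(); both arguments must lie in planes 15-16."""
    for cp in (first, second):
        if not SUPER_START <= cp <= SUPER_END:
            raise ValueError("not a super-surrogate")
    return ((first - SUPER_START) << 17) | (second - SUPER_START)


first, second = pack(0xFFFFFFFF)
print(hex(first), hex(second), hex(unpack(first, second)))
```

Under this assumed layout every super-surrogate is itself an ordinary surrogate pair whose high surrogate lies in 0xDB80..0xDBFF, so one extended character costs four 16-bit code units.
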
> The UTF-8 encoding scheme can be trivially extended to about 2^31 code
> points (going beyond would allow encoded text to include bytes equal to
> 0xFE or 0xFF, which would break BOM detection).
Well, BOM detection doesn't work with UTF-8 texts, anyway (and you shouldn't
put a BOM into UTF-8 texts, since it breaks ASCII compatibility, e.g.
for shell scripts). Even if you extend UTF-8 beyond 2^31 by allowing $FE
and $FF as lead bytes (for 7- and 8-byte sequences), you will not be able to
produce a valid BOM, and the other $FExx and $FFxx code points are valid
Unicode. In UTF-8, however, the byte following $FE or $FF must have the
form %10xxxxxx. BTW: I checked with the algorithm I use in bigFORTH: it
works fine for 2^32 and produces $FE as first byte. I think the generally
quoted limit of 2^31 comes from using signed ints ;-). There is a hard
limit of 42 (!) bits when you allow all prefixes up to $FF (followed by 7
bytes of the form %10xxxxxx).
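
For readers who want to check the numbers, here is a minimal Python sketch of such an extended UTF-8 encoder. It is not the bigFORTH algorithm, just the same bit layout carried on to lead bytes $FE (7 bytes, 36 bits) and $FF (8 bytes, 42 bits); the function name is made up for illustration.

```python
def encode_extended_utf8(value):
    """Encode an integer with the UTF-8 scheme extended to 8-byte sequences."""
    if not 0 <= value < 1 << 42:
        raise ValueError("value needs more than 42 bits")
    if value < 0x80:
        return bytes([value])            # plain ASCII, one byte
    # Pick the shortest sequence length whose payload can hold the value.
    for length in range(2, 9):
        lead_bits = max(0, 7 - length)   # payload bits left in the lead byte
        if value < 1 << (lead_bits + 6 * (length - 1)):
            break
    # Peel off 6 bits per continuation byte, lowest bits first.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (value & 0x3F))
        value >>= 6
    # Lead byte: `length` one-bits, then the remaining payload bits.
    lead = ((0xFF << (8 - length)) & 0xFF) | value
    return bytes([lead] + tail[::-1])


for n in (0x20AC, 2**31 - 1, 2**32, 2**42 - 1):
    print(hex(n), encode_extended_utf8(n).hex())
```

In this sketch 2^32 comes out as a 7-byte sequence starting with $FE, and 2^42 - 1 as $FF followed by seven $BF bytes. Since $FE and $FF are always followed by a %10xxxxxx byte, the byte pairs $FE $FF and $FF $FE cannot occur.
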
> UTF-32 (aka UCS-4) allows up to 2^32.
Yes.
--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/