Re: Case-sensitivity as option?



Thomas Pornin wrote:

According to Bernd Paysan <bernd.paysan@xxxxxx>:
Unicode can have up to 2^31 code points, so no practical limitation.

Code points beyond 0x10FFFF cannot be encoded with UTF-16, however. This
is also the current limit of "valid code point values" in Unicode. Now
that Microsoft has committed to UTF-16 (it is used throughout the
Windows system libraries and kernel), it is unlikely that Unicode will
allow code point values to exceed 0x10FFFF in the near future. That is
still quite large.
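
To make the 0x10FFFF ceiling concrete, here is a minimal Python sketch of how UTF-16 splits a code point above 0xFFFF into a surrogate pair. The name utf16_units is made up for illustration and does not come from any library. A pair carries 10 + 10 payload bits on top of an offset of 0x10000, which is where the limit comes from.

```python
def utf16_units(cp):
    """Return the UTF-16 code units (16-bit integers) for one code point."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("not encodable in UTF-16: %#x" % cp)
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate values are not characters: %#x" % cp)
    if cp < 0x10000:
        return [cp]                      # BMP: one unit, unchanged
    v = cp - 0x10000                     # 20-bit value
    high = 0xD800 | (v >> 10)            # upper 10 bits
    low = 0xDC00 | (v & 0x3FF)           # lower 10 bits
    return [high, low]

# Largest value two surrogates can express: offset plus 20 bits of payload.
MAX_UTF16 = 0x10000 + (1 << 20) - 1      # 0x10FFFF

print([hex(u) for u in utf16_units(0x1F600)])    # ['0xd83d', '0xde00']
print([hex(u) for u in utf16_units(MAX_UTF16)])  # ['0xdbff', '0xdfff']
```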

The C# language specification already claims "Unicode characters with code
points above 0x10FFFF are not supported". I.e. if Unicode extends above the
limitations of UTF-16, it's Microsoft who will be left behind. IIRC, they
were already left behind when UTF-16 introduced surrogates, and it took
some time for them to recover.

IMHO, it would be fairly trivial to extend the available kernel calls in
Windows to UTF-8. Just use it as an 8-bit codepage, so that the *A system
calls can work with it (they are all code-page specific, anyway). The
original idea that the *W system calls contain one character per short is
already broken with UTF-16, so changing the assumption on *A system calls
for a specific code-page should not be such a tough problem, either.

BTW: If the "end" of UTF-16 comes near, you could as well define the
surrogates that lead to $0Fxxxx and $10xxxx as new super-surrogates, which
together give you the full 32-bit UCS-4 space. I think this is the
appropriate way to deal with Microsoft. I would suggest that the actual
code points from $000F0000 onwards should *not* be reserved, but that
super-surrogates must be used for these even though there would be a way to
encode them with normal surrogates.
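
The arithmetic behind this can be sketched in Python under one possible reading of the idea: a value in $0Fxxxx carries the upper 16 bits, a value in $10xxxx carries the lower 16 bits, and each of the two is then written as an ordinary surrogate pair. That split, and the names surrogate_pair and super_surrogates, are assumptions made for illustration only; nothing like this is defined in Unicode.

```python
def surrogate_pair(cp):
    """Ordinary UTF-16 surrogate pair for a code point in 0x10000..0x10FFFF."""
    v = cp - 0x10000
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]

def super_surrogates(cp):
    """Hypothetical: encode a 32-bit value as two 'super-surrogates'."""
    if not 0 <= cp < (1 << 32):
        raise ValueError("value does not fit in 32 bits")
    high = 0x0F0000 | (cp >> 16)       # plane 15 holds the upper 16 bits
    low = 0x100000 | (cp & 0xFFFF)     # plane 16 holds the lower 16 bits
    # Each half is itself written as a normal surrogate pair,
    # so one 32-bit value costs four 16-bit code units.
    return surrogate_pair(high) + surrogate_pair(low)

print([hex(u) for u in super_surrogates(0xFFFFFFFF)])
# ['0xdbbf', '0xdfff', '0xdbff', '0xdfff']
```

Under this reading, 16 + 16 bits cover exactly the 32-bit space, and a code point such as $000F0000 would itself be written with four units rather than two.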

The UTF-8 encoding scheme can be trivially extended to about 2^31 code
points (going beyond would allow encoded text to include bytes equal to
0xFE or 0xFF, which would break BOM detection).

Well, BOM detection doesn't work with UTF-8 texts anyway (and you shouldn't
put a BOM into UTF-8 texts, since it breaks ASCII compatibility, e.g. for
shell scripts). Even if you extend UTF-8 beyond 2^31 by including $FE and
$FF as lead bytes (for 7- and 8-byte sequences), you will not be able to
produce a valid BOM, and the other $FExx and $FFxx code points are valid
Unicode. In UTF-8, however, the byte following $FE or $FF must be of the
form %10xxxxxx. BTW: I checked with the algorithm I use in bigFORTH: it
works fine for 2^32 and produces $FE as the first byte. I think the
generally quoted limit of 2^31 comes from using signed ints ;-). There is a
hard limit of 42 (!) bits when you allow all prefixes up to $FF (followed by
7 bytes of the form %10xxxxxx).
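
For readers who want to check the numbers, here is a minimal Python sketch of such an extended encoder. It is not the bigFORTH code, just the same prefix scheme carried on to $FE and $FF lead bytes; the name encode_extended_utf8 is made up, and sequences longer than 4 bytes are not valid UTF-8 under the current standard.

```python
def encode_extended_utf8(cp):
    """Encode an integer below 2**42 with the UTF-8 prefix scheme,
    allowing lead bytes up to 0xFF (non-standard beyond 4 bytes)."""
    if cp < 0 or cp >= (1 << 42):
        raise ValueError("value needs more than 42 bits")
    if cp < 0x80:
        return bytes([cp])                       # plain ASCII
    for n in range(2, 9):                        # n = total number of bytes
        # 6 bits per continuation byte, plus what is left in the lead byte
        payload_bits = 6 * (n - 1) + max(0, 7 - n)
        if cp < (1 << payload_bits):
            break
    prefix = (0xFF << (8 - n)) & 0xFF            # n leading 1 bits
    out = [prefix | (cp >> (6 * (n - 1)))]
    for i in range(n - 2, -1, -1):
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))   # %10xxxxxx
    return bytes(out)

print(encode_extended_utf8(0x20AC).hex())        # e282ac
print(encode_extended_utf8(2**32).hex())         # fe848080808080
print(len(encode_extended_utf8(2**42 - 1)))      # 8
```

Since $FE and $FF are always followed by a continuation byte in the range $80..$BF, the byte pairs $FE $FF and $FF $FE never appear in the output.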

UTF-32 (aka UCS-4) allows up to 2^32.

Yes.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/