Re: New Year's Resolution (was Re: cell phones, was: car help, was: Starving people refuse to eat food aid)



On Dec 28, 12:26 am, Mike Ash <m...@xxxxxxxxxxx> wrote:
In article
<121e774e-6a58-40e1-8e08-0fe7ffd3c...@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>,
cryptoguy <treifam...@xxxxxxxxx> wrote:
Bare Unicode would double the size, assuming our texts stayed in the
BMP. It would never triple it, unless you started to post in Linear B
or cuneiform. UTF-8 crunches it down, so all the characters that match
US-ASCII stay as one byte. Fortunately, this covers most characters used
in the mostly-English parts of the Internet.

There's no such thing as "bare Unicode". Unicode describes a mapping of
conceptual characters to code points, and a multitude of encodings which
map sequences of those code points to bytes. (It does more than this;
that is not meant to be a complete enumeration of what Unicode covers.)

If something is "encoded in Unicode", it could be in any number of
encodings. There's no one default encoding. UTF-16 produces two bytes
per code point in the BMP, four bytes per code point outside of it.
UTF-8 produces 1-4 bytes depending. UTF-32 produces 4 bytes per code
point all the time.
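
As a rough illustration of those sizes, here is a small Python sketch that
encodes the same text in each form and counts the bytes. The helper name
encoded_sizes and the sample strings are made up for this example and are
not from the thread; the byte order mark is left out so that only the
characters themselves are counted.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# Hypothetical sample strings, chosen to hit each size class.
samples = {
    "US-ASCII": "hello",
    "BMP, non-ASCII": "\u03b1\u03b2\u03b3",  # Greek alpha, beta, gamma
    "outside the BMP": "\U00010000",         # a Linear B syllable
}

for label, text in samples.items():
    print(label, "-", len(text), "code point(s):", encoded_sizes(text))
```

For the ASCII sample, UTF-16 doubles the UTF-8 size and UTF-32 quadruples
it; for the Linear B character all three encodings need four bytes.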

I believe you understand this, but I find that wording in which "Unicode"
refers to some specific (but unspecified) encoding increases the
confusion on the topic.

As noted above, I am not a standards wonk. I know that some modern OSs
(Win Mobile is one that springs to mind) use a two-byte encoding as the
default for characters in strings, internally. This is referred to
loosely as 'doublebyte' or 'Unicode'. For regular, English-language
strings, one byte of each pair is a null and the other is the US-ASCII
value (which comes first depends on byte order), so conversion is easy.
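
A minimal Python sketch of that layout, assuming the two-byte encoding in
question is UTF-16 (the thread does not name it). The helper name
narrow_ascii is invented for this example. It shows where the null byte
lands in each byte order and how plain English text can be narrowed back
to single bytes:

```python
text = "Hi"

le = text.encode("utf-16-le")  # little-endian: ASCII value first, null second
be = text.encode("utf-16-be")  # big-endian: null first, ASCII value second

print(le)  # b'H\x00i\x00'
print(be)  # b'\x00H\x00i'


def narrow_ascii(data):
    """Convert UTF-16LE bytes to US-ASCII bytes by dropping the null bytes.

    Raises ValueError if any character falls outside US-ASCII.
    """
    low, high = data[0::2], data[1::2]
    if len(data) % 2 or any(high) or any(b > 0x7F for b in low):
        raise ValueError("input is not pure US-ASCII in UTF-16LE")
    return bytes(low)


print(narrow_ascii(le))  # b'Hi'
```

The check on the low byte matters: a zero high byte only says the code
point is below 256, which also covers accented Latin-1 letters.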

This is done for localization purposes; if all the canned strings of
an application are stored in a table, and data strings are all handled
as doublebyte (aka 'Unicode', loosely speaking), it becomes much
easier to produce multi-language versions of the app; you only need to
translate the table and, in GUI apps, be aware of how differing
languages change the length of the displayed text.
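
The string-table idea can be sketched in a few lines of Python. Everything
here (the STRINGS table, its keys, the two languages, and the lookup
helper tr) is hypothetical and only meant to show the shape of the
approach, not how any particular OS or toolkit does it:

```python
# Hypothetical string table: one sub-table per language, shared keys.
STRINGS = {
    "en": {"greeting": "Hello", "quit": "Quit"},
    "de": {"greeting": "Hallo", "quit": "Beenden"},
}


def tr(key, lang, fallback="en"):
    """Look up a canned string, falling back to the default language."""
    table = STRINGS.get(lang, STRINGS[fallback])
    return table.get(key, STRINGS[fallback][key])


for lang in ("en", "de"):
    label = tr("quit", lang)
    # The displayed length changes with the language, which is what a
    # GUI layout has to allow for.
    print(lang, label, len(label))
```

The application code only ever refers to keys such as "quit", so adding
a language means adding one more sub-table.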

pt


