Re: New Year's Resolution (was Re: cell phones, was: car help, was: Starving people refuse to eat food aid)

Keith wrote:
Cryptoengineer <petert...@xxxxxxxxx> wrote:
"Keith F. Lynch" <k...@xxxxxxxxxxxxxx> wrote:
I've seen postings that were double or even triple size. See
I'm curious as to from where TC pasted that. The original message is

Thanks for finding that. I'm also curious what that 24-bit format is
called. There doesn't seem to be such a thing as UTF-24.
and it clearly goes to hell after the first 20 or so lines.

It's still UTF-8, or rather, a mangled UTF-8, but recognizable to any
competent software engineer.

UTF-8 is a variable-length encoding of Unicode characters which, in
most cases, reduces the space required, but in certain situations uses
up to 4 bytes per character. Characters in the range 0-127 require a
single byte, and are identical to US-ASCII.
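
As a rough check of those byte counts, here is a minimal Python sketch.
The sample characters and the helper name utf8_len are my own picks for
illustration, not anything from the thread:

```python
# Sample characters (illustrative choices), one for each UTF-8 length.
samples = ["e", "\u00e9", "\u6500", "\U00010000"]

def utf8_len(ch: str) -> int:
    """Number of bytes Python's built-in UTF-8 encoder emits for one character."""
    return len(ch.encode("utf-8"))

for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_len(ch)} byte(s)")

# Characters 0-127 come out byte-for-byte identical to US-ASCII.
print("e".encode("utf-8") == "e".encode("ascii"))
```

The last sample, U+10000, is the first Linear B code point and lies
outside the BMP, which is where the 4-byte form comes in.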

Unicode is a method of encoding characters with enough variety to
handle pretty much anything that has been used, ever. Two-byte Unicode
comprises the Basic Multilingual Plane (BMP), and covers most living
languages (a huge chunk is devoted to Chinese). There are longer
versions that cover dead languages, etc.

Some two-byte Unicode characters (those from U+0800 up) require a
3-byte encoding under UTF-8, which looks like this:

1110yyyy 10yyyyxx 10xxxxxx where yyyyyyyy and xxxxxxxx are the
high-order and low-order bytes of the two-byte Unicode value.
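
To make that bit layout concrete, here is a small sketch that packs a
code point into the three bytes by hand and compares the result with
Python's own encoder. The function name encode_3byte and the test
character (the euro sign, U+20AC) are my own choices for illustration:

```python
def encode_3byte(code_point: int) -> bytes:
    """Pack a code point in U+0800..U+FFFF into the 3-byte UTF-8 layout.

    Illustrative only: it does not reject the surrogate range
    U+D800..U+DFFF, which a real encoder must refuse.
    """
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("the 3-byte form only covers U+0800..U+FFFF")
    b1 = 0b11100000 | (code_point >> 12)          # 1110yyyy
    b2 = 0b10000000 | ((code_point >> 6) & 0x3F)  # 10yyyyxx
    b3 = 0b10000000 | (code_point & 0x3F)         # 10xxxxxx
    return bytes([b1, b2, b3])

packed = encode_3byte(0x20AC)
print(" ".join(f"{b:08b}" for b in packed))
print(packed == "\u20ac".encode("utf-8"))
```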

now compare this with the 'encrypted' value for 'e' that you

11100110 10010100 10000000

The value for 'e', 01100101, has been placed in the high-order Unicode
byte (yyyyyyyy), instead of the low-order one (xxxxxxxx), where it
should have been. In fact, for a value under 128, such as 'e', the
one-byte UTF-8 encoding should have been used, leaving it as 01100101.
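
Anyone can verify this by unpacking the three bytes. The sketch below
strips the prefix bits and reassembles the 16-bit value. The last check
reproduces the triplet by shifting 'e' into the high byte; that is only
my guess at what the buggy software did, not something established in
the thread:

```python
# The three bytes quoted above, written out bit for bit.
triplet = bytes([0b11100110, 0b10010100, 0b10000000])

# Drop the 1110 / 10 / 10 prefix bits and reassemble the code point.
code_point = (
    ((triplet[0] & 0x0F) << 12)
    | ((triplet[1] & 0x3F) << 6)
    | (triplet[2] & 0x3F)
)
high_byte, low_byte = code_point >> 8, code_point & 0xFF

print(f"code point: U+{code_point:04X}")
print(f"high byte:  {high_byte:08b} = {chr(high_byte)!r}")
print(f"low byte:   {low_byte:08b}")

# What a correct encoder produces for 'e': a single byte.
print("e".encode("utf-8"))

# Assumed failure mode (hypothetical): the character value was shifted
# into the high-order byte before being encoded.
print(chr(ord("e") << 8).encode("utf-8") == triplet)
```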

This three byte sequence is the result of a bug in the encoding
software, not some exotic encoding system. This fact is obvious to
anyone skilled in the art.

Keith: If a program makes what is clearly an
error in handling a well defined encoding system,
and spews garbage into a post, and you, without
mentioning that it is buggy output you're
referring to, use it as the sole example to
support your contention that 'UTF-8 can triple
the length of a usenet posting' (or words to
that effect)....

Once again, I never said that UTF-8 can triple the length of a Usenet

Here's what you said, in afu, on Nov 28:

- start quote -

Unicode is useful for someone who wants to read a mixture, on the same
page, of Arabic, Japanese, Hebrew, Korean, Greek, and Russian. I
don't think the other 99.99% of us should have to put up with an
ecoding that doubles or triples the sizes of all emails, newsgroup
postings, blog entries, and web pages, breaks nearly all existing
software, and makes new software much harder to write, for the
convenience of that 0.01%.

- end quote -

(I'll ignore the fact that the languages you cite are used (at a
guess) by about 25% of the world population, not 0.01%).

Bare Unicode would double the size, assuming our texts stayed in the
BMP. It would never triple it, unless you started to post in Linear B
or cuneiform. UTF-8 crunches it down, so all the characters that match
US-ASCII stay as one byte. Fortunately, this covers most characters
used in the mostly-English parts of the Internet.
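
The size arithmetic is easy to check. In this sketch I use UTF-16
without a byte-order mark as a stand-in for "bare" two-byte Unicode,
and the sample sentence is my own:

```python
# Sample text (my own), plain English and therefore pure US-ASCII.
text = "UTF-8 leaves plain English text at one byte per character."

as_ascii = text.encode("ascii")
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")  # 2 bytes per BMP character, no BOM

print("ASCII :", len(as_ascii), "bytes")
print("UTF-8 :", len(as_utf8), "bytes")
print("UTF-16:", len(as_utf16), "bytes")
print("UTF-8 identical to ASCII:", as_utf8 == as_ascii)
```

For text like this the UTF-8 form is byte-for-byte the same as the
ASCII form, and only the fixed two-byte form doubles it.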

And how am I supposed to know whether that
was a bug or simply a format I wasn't familiar
with? As you saw in the message whose URL I
gave, the message *did* contain meaningful
text, though it was in a weird format it
took me some time to figure out.

You should have known the same way I did: by being an educated
professional, skilled in the art, and up to date. The header stated
that it was UTF-8, and the mangled triplets were clearly incorrectly
coded 3-byte UTF-8.

... all I can say is that you open yourself to accusations of acting
in bad faith.

If that's what you still think, I guess I can't change your mind.
Nevertheless, you're wrong.

The facts remain:

1. You claimed Unicode 'doubles or triples' the length of all
newsgroup postings.

2. The only posting you can point to where UTF-8 encoded Unicode
significantly increased the size of a message turns out, on trivial
inspection, to be the result of a software bug.

3. You cite that message as the proof of your claim, but don't mention
its buggy nature.

I'll leave it for others to decide whether you posted in bad faith, or
out of ignorance.
