Re: UTF-8 Character Encodings and "NO-BREAK SPACE" (dec: 202, hex: CA) Character



On Feb 1, 5:25 pm, Ben C <spams...@xxxxxxxxx> wrote:
On 2009-02-01, mrde...@xxxxxxxxx <mrde...@xxxxxxxxx> wrote:



On Feb 1, 4:48 pm, Ben C <spams...@xxxxxxxxx> wrote:
On 2009-02-01, mrde...@xxxxxxxxx <mrde...@xxxxxxxxx> wrote:

On Feb 1, 4:14 am, Ben C <spams...@xxxxxxxxx> wrote:
On 2009-02-01, Andre de Cavaignac <m...@xxxxxxxxx> wrote:
[...]

The following hex is an example of the issue:
00000250  20 64 65 73 69 67 6e 2e  20 ca 49 0d 0a 68 61 76  | design. ?I..hav|
[...]
202 is definitely the circumflexed E in ISO-8859-1, and the Unicode
character 202 (U+00CA) is also the circumflexed E. But it may be the
NO-BREAK SPACE in some other encoding; if so, I don't know which one.
That would be one way to explain what is happening.
[...]
As it turns out, the problem is not with the encoding, but with the
headers that define the character set.  Both headers (MIME and HTML)
define the character set as UTF-8, however the document is actually
encoded in Mac-Roman.  In the Mac-Roman character set, 202 (0xCA) is
in fact the "NO-BREAK SPACE".
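
A quick way to check this claim is to decode the single byte 0xCA under each candidate encoding. The sketch below uses Python's standard codecs; the helper name `describe_byte` is made up for illustration and is not something from the thread.

```python
import unicodedata

# The byte under discussion, taken from the hexdump quoted in this thread.
RAW = b"\xca"

def describe_byte(data: bytes, encoding: str) -> str:
    """Decode one byte with the named codec and return its Unicode name."""
    char = data.decode(encoding)  # raises UnicodeDecodeError if invalid
    return f"U+{ord(char):04X} {unicodedata.name(char)}"

if __name__ == "__main__":
    for codec in ("mac_roman", "iso-8859-1"):
        print(codec, "->", describe_byte(RAW, codec))
    try:
        describe_byte(RAW, "utf-8")
    except UnicodeDecodeError as exc:
        # A lone 0xCA is a UTF-8 lead byte with no continuation byte.
        print("utf-8 -> invalid:", exc.reason)
```

Under these codecs the same byte is a NO-BREAK SPACE in Mac-Roman, a circumflexed E in ISO-8859-1, and not a valid sequence at all in UTF-8.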

Ah, that explains it. The headers say it's UTF-8, but the bytes are not
valid UTF-8. So the text editor falls back on its default. You would
expect the default to be ISO-8859-1 for most tools (giving you an E with
a circumflex), but evidently it's Mac-Roman for some.
You're probably using a Mac. Actually I can tell you are from the
headers on your message:

    X-HTTP-UserAgent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6;
    en-us)

When opened in a normal text editor, which tries to determine the type
of encoding from the byte stream itself (rather than a header), it is
properly opened as Mac-Roman.

I would think it's practically impossible in most cases to guess that
something is Mac-Roman rather than one of the other 8-bit encodings.
Your editor is just falling back on its default.
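
The "fall back on its default" behaviour described here can be sketched roughly as follows. This is only an assumption about how such a tool might work, not the logic of any particular editor; `decode_with_fallback` and its `fallback` parameter are hypothetical names.

```python
def decode_with_fallback(data: bytes, fallback: str = "iso-8859-1") -> tuple[str, str]:
    """Try strict UTF-8 first; if the bytes are invalid, use a fixed default.

    Returns the decoded text and the name of the codec that was used.
    Nothing here "detects" Mac-Roman: the fallback is simply whatever
    default the tool was configured with.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode(fallback), fallback

if __name__ == "__main__":
    sample = b" design. \xcaI\r\n"  # bytes from the hexdump in this thread
    for default in ("iso-8859-1", "mac_roman"):
        text, used = decode_with_fallback(sample, fallback=default)
        print(used, repr(text))
```

With an ISO-8859-1 default the byte comes out as a circumflexed E; with a Mac-Roman default it comes out as a no-break space, which matches what a Mac editor would show.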

Browsers are looking at the HTML header
(<meta content="text/html;charset=UTF-8" http-equiv="Content-Type">),
while normal text editors look at the raw file.  I suppose mail
clients are determining the encoding from the raw file, before
rendering it as HTML, and that is why it renders properly there.

There is undoubtedly a bug in one or more mail clients, which mark
text bodies as UTF-8, rather than their real encoding, Mac-Roman.

Certainly. Mac-Roman is rather a strange encoding to be using anyway. If
I were fixing that bug I'd make the contents UTF-8 rather than change
the header to Mac-Roman.
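
As a minimal sketch of that fix, assuming the body really is Mac-Roman, the bytes could be transcoded so that they match the UTF-8 header. The function name `mac_roman_to_utf8` is hypothetical.

```python
def mac_roman_to_utf8(data: bytes) -> bytes:
    """Re-encode Mac-Roman bytes as UTF-8 so the content matches a UTF-8 header."""
    return data.decode("mac_roman").encode("utf-8")

if __name__ == "__main__":
    original = bytes.fromhex("20 ca 49")   # <space>, 0xCA, "I"
    fixed = mac_roman_to_utf8(original)
    print(fixed.hex(" "))                  # the no-break space becomes two bytes
```

After transcoding, the no-break space is carried as the two-byte UTF-8 sequence C2 A0 rather than the single byte CA.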

Interestingly, Windows Mail and Outlook also render it
"correctly" (I'm guessing using Mac-Roman).  There must be a bit more
to it than a default fallback...

They may just be displaying nothing at all. They try to decode UTF-8,
find an octet sequence they don't like, and just move on. Are you sure
they're really showing a no-break space?
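
The difference between "displaying nothing at all" and showing a real character can be illustrated with Python's decoder error handlers. This is a sketch of two possible lenient strategies, not a claim about what any specific mail client does.

```python
# The three bytes from the hexdump: space, 0xCA, "I".
SAMPLE = bytes.fromhex("20 ca 49")

# A lenient UTF-8 decoder may silently drop the invalid byte ...
ignored = SAMPLE.decode("utf-8", errors="ignore")

# ... or substitute U+FFFD REPLACEMENT CHARACTER for it.
replaced = SAMPLE.decode("utf-8", errors="replace")

# Decoding as Mac-Roman yields an actual no-break space instead.
as_mac_roman = SAMPLE.decode("mac_roman")

if __name__ == "__main__":
    for label, text in (("ignore", ignored),
                        ("replace", replaced),
                        ("mac_roman", as_mac_roman)):
        print(f"{label:10} {text!r} length={len(text)}")
```

If a client dropped the byte, the result would be two characters long; a client that shows extra space between the ordinary space and the "I" is doing something other than dropping it.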

Well, they should be showing an E with an accent circumflex if they
are truly following UTF-8, so they must be handling that 0xCA
somehow...

Oddly enough, both Notepad and some simple .NET code
(File.ReadAllText) will try to use UTF-8, so it's not
platform-specific behavior.

If you look at the hex I displayed earlier, which is the raw text,
taken using different methods, you see this:
20 ca 49
which corresponds to:
<space>?I

This is clear both from the hexdump output above and from manually
looking the bytes up in the UTF-8 character tables. 20 is a space,
49 is an "I", and CA is most certainly between them. If mail were
decoding as UTF-8, you would expect a circumflex accent.
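
For comparison, it may help to look at how the two candidate characters are themselves encoded in UTF-8. The sketch below only uses the standard `str.encode` method; the variable names are illustrative.

```python
E_CIRCUMFLEX = "\u00ca"    # LATIN CAPITAL LETTER E WITH CIRCUMFLEX
NO_BREAK_SPACE = "\u00a0"  # NO-BREAK SPACE

# In UTF-8, every code point above U+007F takes at least two bytes.
e_bytes = E_CIRCUMFLEX.encode("utf-8")
nbsp_bytes = NO_BREAK_SPACE.encode("utf-8")

if __name__ == "__main__":
    print("E circumflex in UTF-8:  ", e_bytes.hex(" "))
    print("no-break space in UTF-8:", nbsp_bytes.hex(" "))
```

Neither character is the single byte CA in UTF-8, so a strict UTF-8 decoder would treat the sequence 20 CA 49 as malformed rather than as a circumflexed E; the E only appears when the byte is read as ISO-8859-1 or as a bare code point.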

They may just be ignoring it (they shouldn't if they are just decoding
as UTF-8), but they are definitely adding space where the character
belongs. A single "20" looks different than "20 CA" in the mail
readers.