Re: UTF-8 Character Encodings and "NO-BREAK SPACE" (dec: 202, hex: CA) Character
- From: mrdecav@xxxxxxxxx
- Date: Sun, 1 Feb 2009 15:05:48 -0800 (PST)
On Feb 1, 5:25 pm, Ben C <spams...@xxxxxxxxx> wrote:
On 2009-02-01, mrde...@xxxxxxxxx <mrde...@xxxxxxxxx> wrote:
On Feb 1, 4:48 pm, Ben C <spams...@xxxxxxxxx> wrote:
On 2009-02-01, mrde...@xxxxxxxxx <mrde...@xxxxxxxxx> wrote:
On Feb 1, 4:14 am, Ben C <spams...@xxxxxxxxx> wrote:
On 2009-02-01, Andre de Cavaignac <m...@xxxxxxxxx> wrote:
[...]
[...]The following hex is an example of the issue:
00000250 20 64 65 73 69 67 6e 2e 20 ca 49 0d 0a 68 61 76 | design. ?I..hav|
[...]202 is definitely the circumflexed E in ISO-8859-1, and the Unicode
character 202 (U+00CA) is also the circumflexed E. But it may be the
NO-BREAK SPACE in some other encoding. If so, I don't know which one. But
this is one way to explain what is happening.
As it turns out, the problem is not with the encoding, but with the
headers that define the character set. Both headers (MIME and HTML)
define the character set as UTF-8, however the document is actually
encoded in Mac-Roman. In the Mac-Roman character set, 202 (0xCA) is
in fact the "NO-BREAK SPACE".
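The difference between the two readings of byte 202 can be checked directly. The following is a minimal Python sketch, not anything from the original messages; the helper name `describe` is made up for illustration, and it assumes Python's built-in `mac_roman` and `latin-1` codecs stand in for Mac-Roman and ISO-8859-1.

```python
import unicodedata

# The single byte under discussion: decimal 202, hex CA.
raw = bytes([0xCA])

def describe(data: bytes, encoding: str) -> str:
    """Return the Unicode name of the one character `data` decodes to.

    `describe` is a hypothetical helper for this illustration only.
    """
    char = data.decode(encoding)
    return unicodedata.name(char)

for encoding in ("mac_roman", "latin-1"):
    print(encoding, "->", describe(raw, encoding))
# mac_roman -> NO-BREAK SPACE
# latin-1 -> LATIN CAPITAL LETTER E WITH CIRCUMFLEX
```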
Ah, that explains it. The headers say it's UTF-8, but the bytes are not
valid UTF-8. So the text editor falls back on its default. You would
expect the default to be ISO-8859-1 for most tools (giving you an E with
a circumflex), but evidently it's Mac-Roman for some.
You're probably using a Mac. Actually I can tell you are from the
headers on your message:
X-HTTP-UserAgent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6; en-us)
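The "try UTF-8, then fall back on a default" behavior described above might look roughly like the sketch below. It is only a guess at what such tools do, not the actual logic of any editor or mail client; the function name `decode_with_fallback` and its `fallback` parameter are invented for this example.

```python
from typing import Tuple

def decode_with_fallback(data: bytes, fallback: str = "latin-1") -> Tuple[str, str]:
    """Try strict UTF-8 first; if the bytes are not valid UTF-8,
    decode with the fallback 8-bit encoding instead.

    Returns the decoded text and the name of the encoding that was used.
    Hypothetical helper: real tools may choose their default differently.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode(fallback), fallback

# The three bytes from the hexdump: space, 0xCA, "I".
sample = bytes.fromhex("20 ca 49")

# 0xCA opens a two-byte UTF-8 sequence, but 0x49 is not a continuation
# byte, so strict UTF-8 decoding fails and the fallback is used.
print(decode_with_fallback(sample, fallback="latin-1"))
print(decode_with_fallback(sample, fallback="mac_roman"))
```

With `latin-1` as the default the middle character comes out as the circumflexed E; with `mac_roman` it comes out as a no-break space, which matches the two behaviors discussed in the thread.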
When opened in a normal text editor, which tries to determine the type
of encoding from the byte stream itself (rather than a header), it is
properly opened as Mac-Roman.
I would think it's practically impossible in most cases to guess that
something is Mac-Roman rather than one of the other 8-bit encodings.
Your editor is just falling back on its default.
Browsers are looking at the HTML header
(<meta content="text/html;charset=UTF-8" http-equiv="Content-Type">),
while normal text editors look at the raw file. I suppose mail
clients are determining the encoding from the raw file, before
rendering it as HTML, and that is why it renders properly there.
There is undoubtedly a bug in one or more mail clients, which mark
text bodies as UTF-8, rather than their real encoding, Mac-Roman.
Certainly. Mac-Roman is rather a strange encoding to be using anyway. If
I were fixing that bug I'd make the contents UTF-8 rather than change
the header to Mac-Roman.
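Making the contents match the UTF-8 header amounts to transcoding the body. A minimal sketch, assuming the body really is Mac-Roman throughout (the thread only establishes this for the 0xCA byte) and using a hypothetical helper name `macroman_to_utf8`:

```python
def macroman_to_utf8(data: bytes) -> bytes:
    """Reinterpret `data` as Mac-Roman and re-encode it as UTF-8.

    Hypothetical helper; it assumes the input is Mac-Roman, which
    cannot be verified from the bytes alone.
    """
    return data.decode("mac_roman").encode("utf-8")

# The three bytes from the hexdump: space, 0xCA, "I".
fixed = macroman_to_utf8(bytes.fromhex("20 ca 49"))
print(fixed.hex())  # 20c2a049
```

The no-break space (U+00A0) becomes the two-byte UTF-8 sequence C2 A0, so the result is valid under the declared charset.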
Interestingly, Windows Mail and Outlook also render it
"correctly" (I'm guessing using Mac-Roman). There must be a bit more
to it than a default fallback...
They may just be displaying nothing at all. They try to decode UTF-8,
find an octet sequence they don't like, and just move on. Are you sure
they're really showing a no-break space?
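The "find an octet sequence they don't like, and just move on" possibility can be illustrated with Python's lenient UTF-8 error handlers. This is only an analogy for how a client might behave, not a description of what Windows Mail or Outlook actually do.

```python
# The three bytes from the hexdump: space, 0xCA, "I".
sample = bytes.fromhex("20 ca 49")

# "ignore" silently drops the byte that is not valid UTF-8.
ignored = sample.decode("utf-8", errors="ignore")

# "replace" substitutes U+FFFD REPLACEMENT CHARACTER for it.
replaced = sample.decode("utf-8", errors="replace")

print(repr(ignored))
print(repr(replaced))
```

In the first case nothing at all appears between the space and the "I"; in the second a replacement character does. Neither result is a no-break space.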
Well, they should be showing an E with an accent circumflex if they
are truly following UTF-8, so they must be handling that 0xCA
somehow...
Oddly enough, both Notepad and some simple .NET code
(File.ReadAllText) will try to use UTF-8, so it's not a
platform-specific behavior.
If you look at the hex I displayed earlier, which is the raw text,
taken using different methods, you see this:
20 ca 49
which corresponds to:
<space>?I
This is clear both from the hexdump output above and from manually
looking it up in the UTF-8 character tables. 20 is a space, 49 is an
"I", and CA is most certainly between them. If mail was decoding as
UTF-8, you would expect an accent circumflex.
They may just be ignoring it (they shouldn't if they are just decoding
as UTF-8), but they are definitely adding space where the character
belongs. A single "20" looks different than "20 CA" in the mail
readers.