Re: When X.ZIP downLoads a post or eMail, Windows-1252 is the default.



In <news:slrnh41iu3.8ag.catwheezel@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx>,
Whiskers <catwheezel@xxxxxxxxxxxxx> wrote:

On 2009-06-22, »Q« <boxcars@xxxxxxx> wrote:
In <news:slrnh3uu7q.c4p.catwheezel@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx>,
Whiskers <catwheezel@xxxxxxxxxxxxx> wrote:

Organization: is an alien concept

On 2009-06-22, _@xxxxxxxxxxxxxxxxxxxxxxxxx
<_@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
When X.ZIP downLoads a post or eMail, Windows-1252 is the
default.

unKnown charSets, borked charSets ( e.g. UTF-7 ),
and subSet charSets ( e.g. ISO-8859-1 and US-ASCII ) are ignored;
Windows-1252, the default, is used instead.

It's very common for a newsReader to specify ISO-8859-1
when it should've specified Windows-1252.
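
For anyone who wants to see the behaviour described above in concrete terms, here is a minimal Python sketch. It is not X.ZIP's code (which isn't shown in this thread); the names OVERRIDDEN, effective_charset and decode_body are made up for illustration, and the override list contains only the examples named above. Aliases such as "latin1" are not handled.

```python
import codecs

# Hypothetical list: only the charsets the post names as "ignored".
OVERRIDDEN = {"utf-7", "iso-8859-1", "us-ascii"}

DEFAULT = "windows-1252"


def effective_charset(label):
    """Return the codec to decode with, given the declared charset label."""
    if not label:
        return DEFAULT
    name = label.strip().lower()
    if name in OVERRIDDEN:
        return DEFAULT
    try:
        codecs.lookup(name)  # raises LookupError for unknown charsets
    except LookupError:
        return DEFAULT
    return name


def decode_body(raw, label):
    """Decode raw bytes using the effective charset.

    errors="replace" is an assumption: Python's cp1252 codec leaves five
    bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so a strict decode
    could raise on arbitrary input.
    """
    return raw.decode(effective_charset(label), errors="replace")
```

With this, a body labeled ISO-8859-1 that contains the bytes 0x93 and 0x94 comes out with curly quotes rather than C1 control characters, while a body labeled UTF-8 or KOI8-R is decoded as declared.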

'Windows-1252' is a bodge peculiar to Microsoft products only, and
doesn't fit into the ISO standards.

Having an ISO number isn't what matters when it comes to charsets
for MIMEd stuff. What matters is IANA, and Windows-1252 is fine.
<http://www.iana.org/assignments/charset-reg/windows-1252>

IANA acknowledge that Windows-1252 exists; they make no value
judgement.

The people working on the HTML5 spec seem to agree with Relf that
ISO-8859-1 must (their word in the latest draft) be treated as
Windows-1252. AIUI, their reasoning is the same as Relf's, that
web servers/pages often specify ISO-8859-1 when they should have
specified Windows-1252.
<http://www.w3.org/TR/html5/infrastructure.html#character-encodings-0>

I'm not saying it's *good* that html and other MIMEish stuff has
become infested with MS' bad ideas, but it has.

W3.org (who are talking about the World Wide Web, not about usenet or
email) do not say that ISO-8859-1 must be treated as Windows-1252.
What they say is

.-----
| When a user agent would otherwise use an encoding given in the first
| column of the following table to either convert content to Unicode
| characters or convert Unicode characters to bytes, it must instead use
| the encoding given in the cell in the second column of the same row.
| When a byte or sequence of bytes is treated differently due to this
| encoding aliasing, it is said to have been misinterpreted for
| compatibility.
'-----

In any case that matters at all, that's treating ISO-8859-1 as
Windows-1252.
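
To make "any case that matters" concrete: the two charsets only disagree on bytes 0x80 through 0x9F, where ISO-8859-1 has C1 control characters and Windows-1252 has printable ones (with a handful left undefined). A small Python sketch, using only the standard codecs; the function name is made up for illustration:

```python
def bytes_decoded_differently():
    """List the byte values that ISO-8859-1 and Windows-1252 decode differently."""
    diffs = []
    for b in range(256):
        raw = bytes([b])
        as_latin1 = raw.decode("iso-8859-1")
        # errors="replace" because Python's cp1252 codec leaves a few
        # bytes in this range undefined; they become U+FFFD here.
        as_cp1252 = raw.decode("windows-1252", errors="replace")
        if as_latin1 != as_cp1252:
            diffs.append(b)
    return diffs


# A typical mislabeled body: curly quotes typed on Windows.
sample = b"\x93quoted\x94"
print(repr(sample.decode("iso-8859-1")))    # C1 control characters
print(repr(sample.decode("windows-1252")))  # curly quotation marks
```

Everywhere outside that range the two decodings are identical, so treating one as the other changes nothing for text that really is ISO-8859-1, unless it relies on C1 controls.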

That seems to be a bodge to acknowledge that there are web pages
which use Microsoft-only character encodings but have been
mis-identified as using ISO standard character encodings.

Except that those are *not* "Microsoft-only" character encodings,
that's correct. It takes care of the same problem that Relf's
treatment of ISO-8859-1 as Windows-1252 does in posts he receives;
that was my point.

So they want web browsers that use UTF-8 for display purposes, to
deliberately assume that some pages are mis-identified and to use
whichever character set contains the greater number of 'printable'
characters when converting the character sets for display.

Not UTF-8, but Unicode.
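
The distinction, roughly sketched in Python (the function name is hypothetical): decoding bytes gives Unicode characters, and UTF-8 is just one way of turning those characters back into bytes afterwards.

```python
def decode_then_encode(raw):
    """Decode Windows-1252 bytes to Unicode text, then encode that text as UTF-8."""
    text = raw.decode("windows-1252")  # a str of Unicode code points
    utf8 = text.encode("utf-8")        # a separate step: one byte form of that text
    return text, utf8


text, utf8 = decode_then_encode(b"\x80")  # 0x80 is the euro sign in Windows-1252
print(hex(ord(text)), utf8)
```

The draft's requirement is about the first step (bytes to Unicode characters); what the browser does with the characters after that is a separate matter.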

That's what the (mercifully small) table is about. They acknowledge that
this is a horrible bodge:

.-----
| Note: The requirement to treat certain encodings as other encodings
| according to the table above is a willful violation of the W3C
| Character Model specification.
'-----

(Some web browsers allow the user to change the character encoding
used for display, so those users still have the option to see
correctly encoded pages as they should be).

It's a draft, and even if that part of it remains in the eventual HTML5
spec, it'll be a while before there are conformant browsers; AFAIK,
there aren't any yet for HTML4. ;)

They do not say that a newsgroup article posted in ISO-8859-1 should
be treated as though it had been encoded in Windows-1252

Of course, they are two different kinds of media. But they have a
common problem (Windows-1252 mislabeled as ISO-8859-1) and there's a
common kludge/solution (treat ISO-8859-1 as Windows-1252). I'm not
advocating doing that, but it's not surprising that it works ok for
Relf.

- and they certainly do not pretend that the Windows character set
supplants the ISO one.

There's no need for one to supplant the other; as far as MIME goes,
they all have the same standing.