Re[3]: A message with octets up to 127 _forced_ by MUA into '7bit, us-ascii': an interesting and disturbing example of how some people (mis)understand RFC2045



On Fri, 10 Mar 2006, Maksym Kozub wrote:
MC> It causes the mail reader program to issue an annoying (and
MC> completely false) warning message about characters that it can't
MC> display properly.
MC> It causes the mail reader program to waste time in a character set
MC> conversion that turns out to be unnecessary.
Both points force me to ask a simple question. There are some MUAs
which cannot handle certain charsets (e.g. UTF-8 or JIS) properly.
Does it mean that other MUAs worldwide should prohibit their owners
from sending in those charsets?

Non sequitur.

The issue at hand is whether a completely unnecessary character set declaration should be used in spite of the very real possibility that doing so will cause hardship for the recipient.

Common sense dictates that you should not send JIS to someone you know cannot read Japanese. My environment is set to JIS, but the messages I send that contain no Japanese text are labeled US-ASCII (which they are), not Japanese.

The problem is that you seem to have the mistaken belief that the character set declaration has a semantic that it does not have: a declaration to the recipient that responses should be in that character set.

That Orwell-like situation seems to be
the extreme result of your logic.

Have you read Orwell, in particular "1984" and "Animal Farm"? Animal Farm is an allegory of the history of the late unlamented USSR, although it can also be applied to the history of any Communist revolution. 1984 is a grim view of (what was in 1948) the future, based upon British Socialist pessimism of the directions taken by the British Left (with the historical example of the USSR and the Spanish Republic) combined with completely incorrect assumptions of the USA. Above all, Orwell, a man with impeccable Socialist credentials, wrote about the dangers and horrors inflicted upon humanity by the Left.

Neither book has anything to do with the topics of character sets, e-mail, or mechanisms to achieve interoperability. It may be fashionable on college campuses to use Orwell's name to attack something one disagrees with (especially by kids who have never read his books), but that is not a valid technical argument.

I never do spam filtering based on charset indicated in the headers. I
understand that some people (including you) do that; I just don't take
it as a reason for other people's MUAs to behave in a certain way.

It isn't a reason; it is a consequence.

Back when character set handling was designed, spam was essentially unknown.

Well, as I already said, there is no such thing as KOI8-RU

Actually, there is. It's the old name for KOI8-U. I was one of the first external developers contacted by the Ukrainian KOI8 team to support it, and at that time it had the KOI8-RU name. Later, it was renamed to KOI8-U and KOI8-RU was deprecated.

To
prove that it _is_ an error on your part, here is a quote from
RFC 2319 on KOI8-U:
The lower part of the KOI8-U Ukrainian Character Set is a complete
copy of ASCII, just as it's used in KOI8-R and other non-ASCII
codepages.

This is unfortunate wording in RFC 2319, but it doesn't prove your argument. To understand how this actually works, take a look at ISO 2022. In 8-bit charsets, ASCII is in G0, and the extensions are in G1.

Note that UTF-8 is not a character set; it is an *encoding* of the Unicode character set which has the property of "0x00 - 0x7f is ASCII". That property is NOT in the Unicode character set, although U+0000 - U+007f in Unicode correspond to ASCII.

Hence the distinction in email between "character set" and "charset".
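
To make the point about octets concrete, here is a minimal Python sketch. Python and its codec names (`us-ascii`, `koi8_u`, `utf-8`) are my own choice for illustration and do not come from the RFCs. It checks that text containing only ASCII characters yields identical octets under all three labels, and that the labels only start to matter once a non-ASCII character appears.

```python
# Text that contains only ASCII characters.
text = "Plain English text, no Cyrillic."

ascii_octets = text.encode("us-ascii")
koi8u_octets = text.encode("koi8_u")
utf8_octets = text.encode("utf-8")

# The three labels describe exactly the same 7-bit octets.
same_octets = ascii_octets == koi8u_octets == utf8_octets
all_seven_bit = all(b < 0x80 for b in ascii_octets)
print(same_octets, all_seven_bit)  # True True

# A Ukrainian letter is where the encodings diverge.
letter = "ї"  # U+0457, CYRILLIC SMALL LETTER YI
koi8u_letter = letter.encode("koi8_u")  # one octet, in the upper half
utf8_letter = letter.encode("utf-8")    # a multi-octet sequence
encodings_differ = koi8u_letter != utf8_letter
```
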

MC> But your MUA, knowing that it can not display JIS, will issue an
MC> error message warning that it can not display Japanese text.
First, my MUA will not do that, since it can display Japanese
charsets.
Second, even if it weren't the case, should it be the reason for _your_
MUA to prevent you from sending messages in certain charsets?

The reason is courtesy.

You don't know what the recipient's capabilities are. The greater the demands that you place upon the recipient's capabilities, the less likely it is that he will be able to read your message. If these demands are unnecessary, then you may cause your recipient trouble for no good reason.

The only reason for tagging ASCII text with a charset tag other than US-ASCII is if the text uses ESC to indicate ISO 2022 shifts of G0. Japanese is the only language that still routinely uses that practice (the ISO-2022-JP charset). There are historical reasons why Japan uses it (a much-beloved computer system of the 1970s and 1980s that natively used 7-bit bytes for text); but Japan is also transitioning to Unicode.
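
The ISO-2022-JP case can be illustrated with a short Python sketch. The use of Python and its `iso2022_jp` codec is an assumption on my part, chosen only for illustration. It shows that Japanese text in this charset consists entirely of 7-bit octets and relies on ESC sequences to shift G0, whereas plain ASCII text run through the same codec contains no ESC and is octet-for-octet US-ASCII.

```python
ESC = b"\x1b"

japanese = "日本語"
plain = "Hello"

jp_octets = japanese.encode("iso2022_jp")
plain_octets = plain.encode("iso2022_jp")

# Japanese text: every octet is 7-bit, and ESC sequences switch G0.
jp_is_seven_bit = all(b < 0x80 for b in jp_octets)
jp_has_escape = ESC in jp_octets

# ASCII-only text: no shifts are needed, so the octets are plain US-ASCII.
plain_has_escape = ESC in plain_octets
plain_matches_ascii = plain_octets == plain.encode("us-ascii")

print(jp_is_seven_bit, jp_has_escape, plain_has_escape, plain_matches_ascii)
```

Without the charset label, a reader has no way to know that the 7-bit octets after an ESC are meant as Japanese, which is why this is the one case where a label other than US-ASCII is needed on 7-bit data.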

I think I might add to misunderstanding by saying something about
'default charset'. I must emphasize that it's not a charset which is
used by my MUA when it doesn't know which one to use. Instead, as I
said, it's a situation where my MUA has been _expressly instructed by
me_ to create all messages by default in a charset other than
US-ASCII.

You may believe that your "MUA has been expressly instructed by [you] to create all messages by default in a charset other than US-ASCII"; but that is not the way that MUAs work.

Instead, your MUA has been instructed that, when your outgoing messages use non-ASCII characters, they are to be sent by default in such-and-such charset (which is not US-ASCII).

More sophisticated MUAs are instructed that, when your outgoing messages use non-ASCII characters, they are to be sent by default in such-and-such charset (which is not US-ASCII) if it is possible to encode the text in that charset; otherwise they are to send it in UTF-8.
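
As a rough sketch of that selection logic (not the code of any particular MUA; the function name `pick_charset` and the `koi8_u` default are hypothetical, chosen for illustration), one could write something like the following in Python:

```python
def pick_charset(text, preferred="koi8_u"):
    """Return the least demanding charset label able to encode `text`.

    Order of preference: US-ASCII, then the user's configured charset,
    then UTF-8 as the fallback.
    """
    for candidate in ("us-ascii", preferred):
        try:
            text.encode(candidate)
            return candidate
        except UnicodeEncodeError:
            continue
    return "utf-8"


# ASCII-only text, Ukrainian text, and text the preferred charset cannot hold.
labels = [pick_charset(s) for s in ("Hello", "Привіт", "Привіт 日本")]
print(labels)  # ['us-ascii', 'koi8_u', 'utf-8']
```

The user's setting only comes into play in the second step; text that fits in US-ASCII never reaches it.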

Now, it is true that a few MUAs attempt (with highly varying degrees of success!) to use the charset of the message being replied-to as an overriding default and/or preserve the charset of MIME encoded-words. I alluded to this earlier, when I referred to your belief that the character set declaration has a semantic of "declaration to the recipient that responses should be in that character set" that it does not have.

I'll construct a stronger point. Forget about any sort of defaults,
whether 'default default' or 'user-instructed defaults'. Suppose
there is a particular counterpart whose MUA is able to treat e.g.
KOI8-U properly, and I know that. He likes (for whatever reason :))
receiving messages from me in KOI8-U, and I know that. He doesn't do
any charset-based spam filtering, and I know that. I type a message
for him in my MUA, and this MUA has a menu option: 'Character Set'.
The message is in English. All English letters are part of KOI8-U (see
my reference to RFC 2319 above), so it's completely legitimate for me
to select 'KOI8-R' there. To make it clear: even though it may be not
necessary/wise/whatever, it's not illegitimate either. I save a
message to my Outbox folder, I look at its headers, and - oops - it
says 'us-ascii'. Do you still say it's good behaviour?

Of course it's good behavior. The message is in US-ASCII, not KOI8-U.

There is absolutely no reason to declare KOI8-U unless there are KOI8-U characters in the text which require that declaration.

If you, in the course of editing it, insert any Ukrainian characters, then it promotes to KOI8-U. If you delete those characters, it demotes back to US-ASCII.
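
The same promote/demote behavior can be observed in an off-the-shelf library. The sketch below uses Python's standard `email.mime.text.MIMEText`, which is my own choice of example and not the MUA under discussion. As far as I know, in current Python 3 versions it labels the text US-ASCII when no charset is given and the text fits, and otherwise promotes to UTF-8 (not KOI8-U). The helper name `declared_charset` is hypothetical.

```python
from email.mime.text import MIMEText


def declared_charset(body):
    """Build a text/plain part with no explicit charset and report its label."""
    return MIMEText(body).get_content_charset()


draft = "See you on Friday."

before = declared_charset(draft)                 # ASCII only
promoted = declared_charset(draft + " Дякую!")   # Ukrainian text inserted
demoted = declared_charset(draft)                # Ukrainian text deleted again

print(before, promoted, demoted)
```
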

-- Mark --

http://panda.com/mrc
Democracy is two wolves and a sheep deciding what to eat for lunch.
Liberty is a well-armed sheep contesting the vote.