Re: Corrupted email subject with 2 different character encoding



On Wed, 28 Dec 2005, Wing wrote:
I'm now working on an auto-reply email system and found that the following
raw subject line is corrupted when displayed in Outlook 2003.
Subject: =?UTF8?Q?Testing
=E5=A4=9A=E5=B9=B4=E5=89=8D=E7=95=99=E5=AD=B8=E8=A5=BF=E7=8F=AD=E7=89=99=E7=9A=84=E7=84=A6=E5=B0=8F=E5=A7=90?=
<=?ISO-2022-JP?B?GyRCNkMkLCQvJE4lOSVGITwlOCU7JUMlSCRyN0gkKCEiGyhC?=>

That subject line contains errors in both of its "encoded words", a form of MIME syntax defined in RFC 2047.


The UTF-8 encoded word is incorrect in three ways:
  1) The charset name is wrong.  It should be "UTF-8", not "UTF8".
  2) There is whitespace after "Testing".  This is prohibited inside
     encoded words.  A space must be written as either "_" or "=20".
     [RFC 2047, pp 4-5]
  3) Encoded words are limited to a maximum of 75 characters, counting
     the charset, encoding, and delimiters; this one is way over the
     limit.  [RFC 2047, page 4]

The ISO-2022-JP encoded word is also incorrect. It is wrapped in angle brackets ("<" and ">"). This is not allowed: in a Subject line, an encoded word must be separated from any adjacent text by whitespace. [RFC 2047, page 7]
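As a rough illustration of the checks listed above, here is a minimal Python sketch. The function name check_encoded_word and the KNOWN_CHARSETS allowlist are made up for this example (a real checker would consult the IANA character set registry), and it covers only the problems discussed in this message, not all of RFC 2047.

```python
import re

# Overall shape of an RFC 2047 encoded word: =?charset?encoding?text?=
ENCODED_WORD = re.compile(r"=\?([^?]*)\?([^?]*)\?([^?]*)\?=")

# Hypothetical allowlist for this sketch only; a real checker would
# consult the IANA character set registry instead.
KNOWN_CHARSETS = {"US-ASCII", "UTF-8", "ISO-2022-JP"}

MAX_LENGTH = 75  # RFC 2047 limit, delimiters included


def check_encoded_word(token):
    """Return a list of problems found in one candidate encoded word."""
    match = ENCODED_WORD.fullmatch(token)
    if match is None:
        # Covers extra characters around the word, such as "<" and ">".
        return ["not a bare =?charset?encoding?text?= token"]
    charset, encoding, _text = match.groups()
    problems = []
    if charset.upper() not in KNOWN_CHARSETS:
        problems.append(f"unknown charset {charset!r}")
    if encoding.upper() not in ("B", "Q"):
        problems.append(f"unknown encoding {encoding!r}")
    if any(ch.isspace() for ch in token):
        problems.append("contains whitespace")
    if len(token) > MAX_LENGTH:
        problems.append(f"{len(token)} characters long, limit is {MAX_LENGTH}")
    return problems


# The two encoded words from the subject line under discussion.
utf8_word = (
    "=?UTF8?Q?Testing "
    "=E5=A4=9A=E5=B9=B4=E5=89=8D=E7=95=99=E5=AD=B8=E8=A5=BF"
    "=E7=8F=AD=E7=89=99=E7=9A=84=E7=84=A6=E5=B0=8F=E5=A7=90?="
)
jp_word = "=?ISO-2022-JP?B?GyRCNkMkLCQvJE4lOSVGITwlOCU7JUMlSCRyN0gkKCEiGyhC?="

for candidate in (utf8_word, jp_word, "<" + jp_word + ">"):
    print(check_encoded_word(candidate))
```

The allowlist is deliberate: Python's own codec lookup accepts "utf8" as an alias, so it cannot be used to catch the unregistered charset name.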

Thunderbird can display it without any problem.

Thunderbird is in violation of RFC 2047. This is good news for virus writers, since mishandling of such things is a good vector for viruses to attack. There are probably hackers looking at this bug in Thunderbird right now to see whether an attack is possible.


The subject line contains two different character encodings: UTF8 and
ISO-2022-JP. I found that the Japanese characters can be displayed
properly while the UTF8 text is corrupted.

Any clues to fix the problem?

The fix is to generate proper syntax when sending the message, and not to depend upon mail readers having bugs that accept incorrect syntax.
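One way to generate proper syntax, sketched here under the assumption that the auto-reply system can use Python's standard email package, is to let a library build the encoded words rather than assembling them by hand. The subject text is rebuilt from the bytes in the original header; the maxlinelen value of 76 is chosen for this example so that each continuation line (one leading space plus one encoded word) keeps the word within 75 characters.

```python
from email.header import Header, decode_header, make_header

# "Testing " followed by the UTF-8 bytes taken from the original header.
text = "Testing " + bytes.fromhex(
    "e5a49a e5b9b4 e5898d e79599 e5adb8 e8a5bf"
    "e78fad e78999 e79a84 e784a6 e5b08f e5a790"
).decode("utf-8")

# header_name lets the library account for "Subject: " on the first line.
subject = Header(text, charset="utf-8", header_name="Subject", maxlinelen=76)
encoded = subject.encode()
print("Subject: " + encoded)

# Decode it again to confirm nothing was lost on the way.
round_trip = str(make_header(decode_header(encoded)))
print(round_trip == text)
```

The library splits the text into several short encoded words on folded lines, uses the registered charset name, and never leaves raw whitespace inside a word.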


I tried to make sense of the ISO-2022-JP text but couldn't. The first character is the kanji for the verb "to be surprised, frightened, amazed, taken aback, etc." but is followed by hiragana suggesting an adverb, then another hiragana suggesting a noun. What then follows is a sentence fragment "carries the stage set," (yes, ending with a comma). At that point, I gave up.
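For anyone who wants to look at the ISO-2022-JP text themselves, the following sketch decodes the encoded word (with the angle brackets removed) using Python's standard library. It only shows the mechanics of decoding and says nothing about what the sender meant.

```python
from email.header import decode_header

# The ISO-2022-JP encoded word from the subject, without the "<" and ">".
word = "=?ISO-2022-JP?B?GyRCNkMkLCQvJE4lOSVGITwlOCU7JUMlSCRyN0gkKCEiGyhC?="

# decode_header undoes the base64 layer and reports the declared charset.
(raw, charset), = decode_header(word)

# The raw bytes are still ISO-2022-JP: the text sits between the escape
# sequences ESC $ B (switch to JIS X 0208) and ESC ( B (back to ASCII).
text = raw.decode(charset)

print(charset, len(text))
print(text)
```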

I recommend that you stick to a single character set, and not mix character sets in a message.

-- Mark --

http://panda.com/mrc
Democracy is two wolves and a sheep deciding what to eat for lunch.
Liberty is a well-armed sheep contesting the vote.