Re: Corrupted email subject with 2 different character encoding
- From: Mark Crispin <mrc@xxxxxxxxxxxxxxxxxx>
- Date: Wed, 28 Dec 2005 10:21:43 -0800
On Wed, 28 Dec 2005, Wing wrote:
I'm now working on an auto-reply email system and found that the following raw subject line is corrupted when displayed in Outlook 2003.
Subject: =?UTF8?Q?Testing =E5=A4=9A=E5=B9=B4=E5=89=8D=E7=95=99=E5=AD=B8=E8=A5=BF=E7=8F=AD=E7=89=99=E7=9A=84=E7=84=A6=E5=B0=8F=E5=A7=90?= <=?ISO-2022-JP?B?GyRCNkMkLCQvJE4lOSVGITwlOCU7JUMlSCRyN0gkKCEiGyhC?=>
That subject line has two errors in a form of MIME syntax called "encoded words".
The UTF-8 encoded word is incorrect in many ways:
1) The charset name is wrong. It should be "UTF-8", not "UTF8".
2) There is whitespace after "Testing". This is prohibited inside
encoded words. Space characters must be either "_" or "=20".
[RFC 2047, pp 4-5]
3) Encoded words are limited to a maximum of 75 characters; this
one is way over the limit. [RFC 2047, page 4]
The ISO-2022-JP encoded word is also incorrect. It is enclosed in angle brackets ("<" and ">"), which is not allowed. [RFC 2047, page 7]
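The checks listed above can be sketched as a small Python function. This is only an illustration, not a full RFC 2047 validator: the function name check_encoded_word and the short KNOWN_CHARSETS allow-list are assumptions made up for this example, and a real checker would consult the IANA character set registry.

```python
import re

# General shape of an encoded word: =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r"=\?([^?]*)\?([^?]*)\?([^?]*)\?=")

# Hypothetical allow-list for this sketch only; not an exhaustive registry.
KNOWN_CHARSETS = {"UTF-8", "ISO-2022-JP", "US-ASCII"}


def check_encoded_word(word):
    """Return a list of problems found in one candidate encoded word."""
    match = ENCODED_WORD.fullmatch(word)
    if match is None:
        # Covers stray characters around the word, such as "<" and ">".
        return ["not of the form =?charset?encoding?text?="]
    charset, encoding, text = match.groups()
    problems = []
    if charset.upper() not in KNOWN_CHARSETS:
        problems.append(f"unrecognized charset name {charset!r}")
    if encoding.upper() not in ("Q", "B"):
        problems.append(f"unknown encoding {encoding!r}")
    if " " in text or "\t" in text:
        problems.append("whitespace inside the encoded text")
    if len(word) > 75:
        problems.append(f"{len(word)} characters long; the limit is 75")
    return problems


utf8_word = ("=?UTF8?Q?Testing =E5=A4=9A=E5=B9=B4=E5=89=8D=E7=95=99=E5=AD=B8"
             "=E8=A5=BF=E7=8F=AD=E7=89=99=E7=9A=84=E7=84=A6=E5=B0=8F=E5=A7=90?=")
jp_word = "=?ISO-2022-JP?B?GyRCNkMkLCQvJE4lOSVGITwlOCU7JUMlSCRyN0gkKCEiGyhC?="

for candidate in (utf8_word, "<" + jp_word + ">", jp_word):
    print(candidate[:20] + "...", check_encoded_word(candidate))
```

Run against the subject line in question, the UTF8 word trips the charset, whitespace, and length checks, and the ISO-2022-JP word is rejected only while the angle brackets are attached.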
Thunderbird can display it without any problem.
Thunderbird is in violation of RFC 2047. This is good news for virus writers, since mishandling of such things is a good vector for viruses to attack. There are probably hackers who are now looking at this bug in Thunderbird to see if an attack is possible.
The subject line contains two different character encodings: UTF8 and ISO-2022-JP. I found that the Japanese characters are displayed properly while the UTF8 text is corrupted.
Any clues to fix the problem?
The fix is to generate proper syntax when sending the message, and not to depend upon mail readers having bugs that accept incorrect syntax.
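One way to get proper syntax is to let a mail library do the encoding instead of assembling encoded words by hand. The sketch below uses Python's standard email.header module; the auto-reply system's actual language is not stated in this thread, so Python is an assumption, and subject_text is just the decoded form of the UTF-8 part of the subject above.

```python
from email.header import Header, decode_header, make_header

# The text the sender presumably meant to put in the UTF-8 part of the subject.
subject_text = "Testing 多年前留學西班牙的焦小姐"

# maxlinelen=75 keeps every generated encoded word within the 75-character
# limit; header_name lets the library account for the "Subject: " prefix.
header = Header(subject_text, charset="utf-8", maxlinelen=75,
                header_name="Subject")
encoded = header.encode()

print("Subject: " + encoded)

# Round trip: decoding the generated header gives back the original text.
decoded = str(make_header(decode_header(encoded)))
print(decoded == subject_text)
```

The library chooses the encoding and splits long text into several encoded words separated by folding whitespace, so none of the three problems in the UTF8 word can arise.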
I tried to make sense of the ISO-2022-JP text but couldn't. The first character is the kanji for the verb "to be surprised, frightened, amazed, taken aback, etc." but is followed by hiragana suggesting an adverb, then another hiragana suggesting a noun. What then follows is a sentence fragment "carries the stage set," (yes, ending with a comma). At that point, I gave up.
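For anyone who wants to look at the ISO-2022-JP text themselves: the "B" encoding is plain base64, so the payload can be decoded in a couple of lines. This is a minimal sketch in Python, assuming a terminal that can display Japanese; the payload is copied from the subject line above.

```python
import base64

# Encoded text of the ISO-2022-JP word, without the =?...?B? and ?= framing.
payload = "GyRCNkMkLCQvJE4lOSVGITwlOCU7JUMlSCRyN0gkKCEiGyhC"

raw = base64.b64decode(payload)   # bytes, including the ISO-2022-JP escape sequences
text = raw.decode("iso-2022-jp")  # Unicode string

print(text)
```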
I recommend that you stick to a single character set, and not mix character sets in a message.
-- Mark --
http://panda.com/mrc
Democracy is two wolves and a sheep deciding what to eat for lunch.
Liberty is a well-armed sheep contesting the vote.