Re: persian languages charset, and what DOCTYPE?



On Sat, 8 Apr 2006, Harlan Messinger wrote:

> else, and the appearance in two places of "تست2", once after the
> date at the top, and once as the first item in the list of Recent Posts.
> The first one appears in the page source as "تست2"

Yes, I'd spotted that, and noted that if those bytes are interpreted as
utf-8 they turn out as Arabic-script characters, which made it seem as
if that part had been inserted into the page incorrectly.
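
For anyone who wants to reproduce that observation outside a browser, here is a minimal Python sketch. It assumes the stray characters are the bytes of a utf-8 sequence that got read as iso-8859-1; the variable names are mine, purely for illustration.

```python
# The text as it shows up in the page, written with escapes so that it
# survives any transport: O-slash, ordinal-a, O-slash, superscript-3, ...
seen = "\u00d8\u00aa\u00d8\u00b3\u00d8\u00aa2"

# Assumption: each character stands for one byte of a utf-8 sequence
# that was mistakenly read as iso-8859-1.
raw_bytes = seen.encode("iso-8859-1")
intended = raw_bytes.decode("utf-8")

print(raw_bytes.hex(" "))   # d8 aa d8 b3 d8 aa 32
print(intended)             # three Arabic-script letters followed by "2"
```

If the assumption were wrong, the decode step would most likely raise a UnicodeDecodeError rather than yield plausible letters.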

> and the second appears as
> "&Oslash;&ordf;&Oslash;&sup3;&Oslash;&ordf;2", the character entity
> representation of the same thing.

Blimey, so it does! I hadn't spotted that at first look. So it's
worse than just broken!!

Furthermore, I now see loads of hrefs like these:

http://journalhome.com/razavi/21877/%26Oslash%3B%26ordf%3B%26Oslash%3B%26sup3%3B%26Oslash%3B%26ordf%3B2.html

*Shudder*
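
To see how many layers of damage are stacked up in such an href, here is a rough Python sketch that peels them off one at a time. The last step rests on my assumption that the underlying text was utf-8 read as iso-8859-1; the names href_tail, step1 and so on are just illustrative.

```python
from html import unescape
from urllib.parse import unquote

# Final path component of the href quoted above.
href_tail = "%26Oslash%3B%26ordf%3B%26Oslash%3B%26sup3%3B%26Oslash%3B%26ordf%3B2.html"

# Layer 1: undo the percent-encoding, which exposes named character entities.
step1 = unquote(href_tail)

# Layer 2: resolve the entities, which yields the familiar stray characters.
step2 = unescape(step1)

# Layer 3 (assumption): treat those characters as utf-8 bytes read as iso-8859-1.
step3 = step2.encode("iso-8859-1").decode("utf-8")

for label, value in (("unquoted", step1), ("unescaped", step2), ("re-decoded", step3)):
    print(label, value)
```

So the file name has been through a wrong-charset step, then entity-encoding, then URL-encoding, in that order.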

For what it's worth - coming back to the تست2 which we saw, if I
convert[1] that from utf-8 to us-ascii encoding then the result reads:

&#1578;&#1587;&#1578;2

which can be decoded, e.g. with my trusty decoding ring (;-) at
http://ppewww.ph.gla.ac.uk/~flavell/unicode/unidata06.html
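
The same conversion and lookup can be sketched without a browser. This is only an approximation of what Composer does, using Python's standard xmlcharrefreplace error handler and the unicodedata module in place of the lookup table; it again assumes the page text was utf-8 read as iso-8859-1.

```python
import unicodedata

# Assumption: recover the intended text from the stray characters first.
text = "\u00d8\u00aa\u00d8\u00b3\u00d8\u00aa2".encode("iso-8859-1").decode("utf-8")

# us-ascii cannot hold Arabic-script letters, so each one is written
# as a decimal numeric character reference instead.
ascii_form = text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(ascii_form)   # &#1578;&#1587;&#1578;2

# Look up each code point by name, as one would in a Unicode table.
names = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]
for code_point, name in names:
    print(code_point, name)
# U+062A ARABIC LETTER TEH
# U+0633 ARABIC LETTER SEEN
# U+062A ARABIC LETTER TEH
# U+0032 DIGIT TWO
```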


At this kind of third-hand remove from the original complainant, and
with me understanding only the theory of the character representation,
without being able to read Farsi, and without the slightest inclination
to tangle with the mess that comes out of MS's attempts to extrude
something resembling HTML, I'm afraid I can't go much further than to
say that these pages seem to be dreadfully broken; it's a wonder that
anything comes out as intended.

good luck (you-all will need it!)

[1] By "convert" I mean, in Seamonkey (née Mozilla), manually set
View> Encoding to utf-8, then File> Edit Page, then in Composer,
"Save and change character encoding". Unfortunately it doesn't
offer us-ascii as an option, but any 8-bit encoding which doesn't
cover Arabic would suffice for this purpose - e.g. Armenian, Thai,
whatever you like. (Perhaps we should ask the Mozilla folks to
support saving in us-ascii explicitly?)