Re: Posting with XHR and ISO-8859-15



On 2008-12-21 04:12, Thomas 'PointedEars' Lahn wrote:
Conrad Lender wrote:
[...] I know that I could escape() the content and use
application/x-www-form-urlencoded; that would avoid the encoding
issue altogether, because the byte values would then all be <128.

You are mistaken. First of all, UTF-8 code units can be byte values
greater than 127 (0x7F).

All characters in the application/x-www-form-urlencoded format are
strictly 7-bit. If I'm not mistaken, the first 128 characters in ASCII,
Latin-9, and Unicode are the same, so there wouldn't be any troubles
with the encoding.
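
The 7-bit claim is easy to check. The following is a minimal sketch, assuming an engine that still ships the legacy escape() next to encodeURIComponent(); the helper name isSevenBit and the sample string are made up for illustration.

```javascript
// Hypothetical helper: true if every code unit in s is below 128.
function isSevenBit(s) {
  for (let i = 0; i < s.length; i++) {
    if (s.charCodeAt(i) > 0x7f) return false;
  }
  return true;
}

const sample = "café €";

// Both functions emit ASCII-only output, but the escape sequences differ.
const viaEscape = escape(sample);                         // "caf%E9%20%u20AC"
const viaEncodeURIComponent = encodeURIComponent(sample); // "caf%C3%A9%20%E2%82%AC"

console.log(isSevenBit(sample));                // false
console.log(isSevenBit(viaEscape));             // true
console.log(isSevenBit(viaEncodeURIComponent)); // true
```

So the transport itself is safe either way; what differs is which bytes the server has to reconstruct from the %-sequences.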

Second, you would need to use
encodeURIComponent() for that; escape() is not Unicode-safe. In
fact, using escape() would generate URL-encoded strings that do not
conform to the URI Specification (currently, RFC 3986), for example
escape("¿") == "%BF" here. While that was not a problem before
Unicode and the various Unicode Transformation Formats, it certainly
is now: encodeURIComponent("ÿ") == "%C3%BF", but unescape("%C3%BF")
== "Ã¿".
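
The mismatch between the two function pairs can be reproduced with a short sketch like the one below. It assumes an engine that still provides the legacy escape()/unescape() functions; the variable names are illustrative only.

```javascript
// escape() writes each code unit up to 0xFF as a single %XX (Latin-1 style).
const escaped = escape("¿");              // "%BF"

// encodeURIComponent() percent-encodes the UTF-8 bytes of the character.
const encoded = encodeURIComponent("ÿ");  // "%C3%BF"

// Decoding with the wrong partner: unescape() treats every %XX as one
// character, so the two UTF-8 bytes come back as two characters.
const mixed = unescape(encoded);          // "Ã¿" (U+00C3, U+00BF)

// Decoding with the matching partner restores the original.
const matched = decodeURIComponent(encoded); // "ÿ"

// The reverse mix fails outright: a lone 0xBF is not valid UTF-8.
let threwURIError = false;
try {
  decodeURIComponent(escaped);
} catch (e) {
  threwURIError = e instanceof URIError;
}
console.log(escaped, encoded, mixed, matched, threwURIError);
```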

I realize that escape() can't be used for all of the Unicode set, and
that escape/unescape and encodeURIComponent/decodeURIComponent have to
be used symmetrically. If the server can't handle UTF-8 sequences in
URIs, I can't use encodeURIComponent.

Actually, using the URL-encoded format was never an option. I can't use
encodeURIComponent() for the reason stated above, and escape() will fail
with Latin-9 characters that aren't in Latin-1 (like the € sign). I was
looking for a way to send raw 8-bit ISO-8859-15 data, and if that wasn't
possible, I'd use Unicode and add a patch to the server-side code.
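
To make the Latin-9 problem concrete, here is a rough sketch of the eight positions where ISO-8859-15 differs from ISO-8859-1, together with a hypothetical helper (toLatin9Bytes, not part of any library) that maps a string to Latin-9 byte values. It only illustrates the character mapping; it does not solve the question of how to get such bytes through the XHR object.

```javascript
// Unicode code point -> ISO-8859-15 byte, for the eight positions that
// differ from ISO-8859-1.
const LATIN9_SPECIAL = new Map([
  [0x20ac, 0xa4], // €
  [0x0160, 0xa6], // Š
  [0x0161, 0xa8], // š
  [0x017d, 0xb4], // Ž
  [0x017e, 0xb8], // ž
  [0x0152, 0xbc], // Œ
  [0x0153, 0xbd], // œ
  [0x0178, 0xbe], // Ÿ
]);
// Latin-1 characters at those byte positions do not exist in Latin-9.
const LATIN9_REPLACED = new Set(LATIN9_SPECIAL.values());

// Hypothetical helper: returns the Latin-9 byte values for str, or throws
// if a character has no Latin-9 representation.
function toLatin9Bytes(str) {
  const bytes = [];
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    if (LATIN9_SPECIAL.has(cp)) {
      bytes.push(LATIN9_SPECIAL.get(cp));
    } else if (cp <= 0xff && !LATIN9_REPLACED.has(cp)) {
      bytes.push(cp);
    } else {
      throw new RangeError("Not representable in ISO-8859-15: U+" + cp.toString(16));
    }
  }
  return bytes;
}

console.log(escape("€"));           // "%u20AC", not a Latin-9 byte
console.log(escape("¤"));           // "%A4", which a Latin-9 server reads as €
console.log(toLatin9Bytes("5 €"));  // [ 53, 32, 164 ]
```

The two escape() results show both failure modes: the euro sign comes out in the non-standard %uXXXX form, and the Latin-1 currency sign produces a byte that means something else in Latin-9.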

(Side note: obviously the whole application should be upgraded to
Unicode, but we're talking about 200k lines of code and 120 database
tables - if the client won't pay for it, it's not going to happen.)

The question was whether it's possible to send 8-bit data in a
different encoding than UTF-8.
....
With multipart/form-data, on the other hand, everything is possible
(see RFC 2388). The issue is, though, that current ECMAScript
implementations are Unicode-safe and thus use UTF-16 internally for
string values (even for "\xe9"). And UTF-16 is a superset of
ISO-8859-xx in the sense that not every character that can be encoded
in UTF-16 can also be encoded in ISO-8859-xx. But, AFAIK, every
character that can be encoded in UTF-16 can also be encoded in UTF-8.
Which would explain why UTF-8 is used by default.

It's the same character set, and both encodings can handle all of it
(although somewhat inefficiently outside the Basic Multilingual Plane).
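
A small sketch of the size difference, assuming a modern engine with the TextEncoder API (which always produces UTF-8); the helper name sizes is made up for this example.

```javascript
const encoder = new TextEncoder(); // TextEncoder only encodes to UTF-8

// Hypothetical helper: UTF-16 code units vs. UTF-8 bytes for a string.
function sizes(s) {
  return { utf16Units: s.length, utf8Bytes: encoder.encode(s).length };
}

console.log(sizes("A"));         // { utf16Units: 1, utf8Bytes: 1 }
console.log(sizes("é"));         // { utf16Units: 1, utf8Bytes: 2 }
console.log(sizes("€"));         // { utf16Units: 1, utf8Bytes: 3 }
// U+1D11E (musical G clef) lies outside the Basic Multilingual Plane:
// a surrogate pair in UTF-16, four bytes in UTF-8.
console.log(sizes("\u{1D11E}")); // { utf16Units: 2, utf8Bytes: 4 }
```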

I just saw your other reply; I know about the internal representation of
characters in ECMAScript, and I whole-heartedly agree that the
application needs to be upgraded. It's not my choice, though, unless I
want to do it in my spare time.

Since it appears to be impossible to send 8-bit non-UTF-8 request bodies
with the XHR object, and URL-encoding in the Latin-X sets would be
ambiguous, I'll go ahead and modify the server-side code for this
service. I'm not happy about it, because it means having different
character sets in at least one place, but it seems I have no choice.

Thanks for your reply.


- Conrad