Re: Posting with XHR and ISO-8859-15
- From: Conrad Lender <crlender@xxxxxxxxx>
- Date: Sun, 21 Dec 2008 05:05:33 +0100
On 2008-12-21 04:12, Thomas 'PointedEars' Lahn wrote:
Conrad Lender wrote:
[...] I know that I could escape() the content and use
application/x-www-form-urlencoded; that would avoid the encoding
issue altogether, because the byte values would then all be <128.
You are mistaken. First of all, UTF-8 code units can be byte values
greater than 127 (0x7F).
All characters in the application/x-www-form-urlencoded format are
strictly 7-bit. If I'm not mistaken, the first 128 characters in ASCII,
Latin-9, and Unicode are the same, so there wouldn't be any troubles
with the encoding.
Second, you would need to use
encodeURIComponent() for that; escape() is not Unicode-safe. In
fact, using escape() would generate URL-encoded strings that do not
conform to the URI Specification (currently, RFC 3986), for example
escape("¿") == "%BF" here. While that was not a problem before
Unicode and the various Unicode Transformation Formats, it certainly
is now: encodeURIComponent("ÿ") == "%C3%BF", but unescape("%C3%BF")
== "Ã¿".
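For anyone following along, the mismatch between the two function pairs can be checked with a small sketch like the one below. The variable names are only illustrative, and it assumes an environment where the legacy escape()/unescape() globals are still available (browsers and Node.js provide them).

```javascript
// escape() emits one %XX per Latin-1 code point (and %uXXXX above U+00FF),
// while encodeURIComponent() percent-encodes the UTF-8 bytes of the string.
const latin1Style = escape("\u00BF");            // "¿"
const utf8Style = encodeURIComponent("\u00FF");  // "ÿ"

// Decoding with the wrong partner reads each UTF-8 byte as a Latin-1 character.
const mismatched = unescape(utf8Style);
const matched = decodeURIComponent(utf8Style);

console.log(latin1Style);                    // %BF
console.log(utf8Style);                      // %C3%BF
console.log(mismatched === "\u00C3\u00BF");  // true ("Ã¿", two characters)
console.log(matched === "\u00FF");           // true ("ÿ")
```

The point is only that each encoder has to be paired with its own decoder; which pair a given server expects is a separate question.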
I realize that escape() can't be used for all of the Unicode set, and
that escape/unescape and encodeURIComponent/decodeURIComponent have to
be used symmetrically. If the server can't handle UTF-8 sequences in
URIs, I can't use encodeURIComponent.
Actually, using the URL-encoded format was never an option. I can't use
encodeURIComponent() for the reason stated above, and escape() will fail
with Latin-9 characters that aren't in Latin-1 (like the € sign). I was
looking for a way to send raw 8-bit ISO-8859-15 data, and if that wasn't
possible, I'd use Unicode and add a patch to the server-side code.
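To illustrate the escape() problem with Latin-9, here is a minimal sketch (variable names are illustrative only). It relies on the fact that byte 0xA4 is "¤" in ISO-8859-1 but "€" in ISO-8859-15, and again assumes the legacy escape() global is available.

```javascript
// "€" (U+20AC) is outside Latin-1, so escape() falls back to the
// non-standard %uXXXX form instead of a single byte.
const euro = escape("\u20AC");

// "¤" (U+00A4) becomes %A4; a server that decodes %A4 as ISO-8859-15
// would read that byte as "€" instead.
const currency = escape("\u00A4");

// "é" (U+00E9) is byte 0xE9 in both character sets, so it is unaffected.
const eAcute = escape("\u00E9");

console.log(euro);      // %u20AC
console.log(currency);  // %A4
console.log(eAcute);    // %E9
```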
(Side note: obviously the whole application should be upgraded to
Unicode, but we're talking about 200k lines of code and 120 database
tables - if the client won't pay for it, it's not going to happen.)
The question was whether it's possible to send 8-bit data in a
different encoding than UTF-8.
With multipart/form-data, on the other hand, everything is possible
(see RFC 2388). The issue is, though, that current ECMAScript
implementations are Unicode-safe and thus use UTF-16 internally for
string values (even for "\xe9"). And UTF-16 is a superset of
ISO-8859-xx in the sense that not every character that can be encoded
in UTF-16 can also be encoded in ISO-8859-xx. But, AFAIK, every
character that can be encoded in UTF-16 can also be encoded in UTF-8.
Which would explain why UTF-8 is used by default.
It's the same character set, and both encodings can handle all of it
(although somewhat inefficiently outside the Basic Multilingual Plane).
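A quick way to see this is to compare UTF-16 code units with UTF-8 bytes for a few characters. This is only a sketch; utf8Length is a hypothetical helper, and it assumes the TextEncoder global (modern browsers, Node.js 11 or later).

```javascript
// Hypothetical helper: number of bytes a string occupies in UTF-8.
const utf8Length = (s) => new TextEncoder().encode(s).length;

// "e", "é", "€", and U+1F600 (a character outside the BMP).
const samples = ["e", "\u00E9", "\u20AC", "\u{1F600}"];

const report = samples.map((s) => ({
  utf16Units: s.length,      // String length counts UTF-16 code units
  utf8Bytes: utf8Length(s),
}));

console.log(report);
// utf16Units: 1, 1, 1, 2  (the last one is a surrogate pair)
// utf8Bytes:  1, 2, 3, 4
```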
I just saw your other reply; I know about the internal representation of
characters in ECMAScript, and I whole-heartedly agree that the
application needs to be upgraded. It's not my choice, though, unless I
want to do it in my spare time.
Since it appears to be impossible to send 8-bit non-UTF-8 request bodies
with the XHR object, and URL-encoding in the Latin-X sets would be
ambiguous, I'll go ahead and modify the server-side code for this
service. I'm not happy about it, because it means having different
character sets in at least one place, but it seems I have no choice.
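For illustration, the kind of conversion such a server-side patch needs can be sketched as follows. The server's actual language is not stated in this thread, so JavaScript is used here purely as an assumption; utf8ToLatin9 and both tables are hypothetical names, and replacing unmappable characters with "?" is an arbitrary choice.

```javascript
// Code points where ISO-8859-15 differs from ISO-8859-1, mapped to their bytes.
const LATIN9_OVERRIDES = new Map([
  [0x20AC, 0xA4], [0x0160, 0xA6], [0x0161, 0xA8], [0x017D, 0xB4],
  [0x017E, 0xB8], [0x0152, 0xBC], [0x0153, 0xBD], [0x0178, 0xBE],
]);

// Latin-1 characters that those eight positions displaced; they have
// no byte in ISO-8859-15.
const DISPLACED = new Set([0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD, 0xBE]);

// Hypothetical helper: decode a UTF-8 request body and re-encode it as
// ISO-8859-15 bytes, substituting "?" for anything that does not fit.
function utf8ToLatin9(utf8Bytes) {
  const text = new TextDecoder("utf-8").decode(utf8Bytes);
  const out = [];
  for (const ch of text) {           // iterates by code point, not code unit
    const cp = ch.codePointAt(0);
    if (LATIN9_OVERRIDES.has(cp)) {
      out.push(LATIN9_OVERRIDES.get(cp));
    } else if (cp <= 0xFF && !DISPLACED.has(cp)) {
      out.push(cp);                  // same byte value as in Latin-1
    } else {
      out.push(0x3F);                // "?"
    }
  }
  return Uint8Array.from(out);
}

// Usage: "café 5€" arrives as UTF-8 and leaves as seven Latin-9 bytes.
const body = new TextEncoder().encode("caf\u00E9 5\u20AC");
const latin9 = utf8ToLatin9(body);
console.log(Array.from(latin9)); // [ 99, 97, 102, 233, 32, 53, 164 ]
```

In a real deployment an existing conversion library on the server would probably be preferable to a hand-written table.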
Thanks for your reply.
- Conrad