Re: Posting with XHR and ISO-8859-15
- From: Thomas 'PointedEars' Lahn <PointedEars@xxxxxx>
- Date: Sun, 21 Dec 2008 11:12:53 +0100
Conrad Lender wrote:
On 2008-12-21 04:12, Thomas 'PointedEars' Lahn wrote:
Conrad Lender wrote:
[...] I know that I could escape() the content and use
application/x-www-form-urlencoded; that would avoid the encoding
issue altogether, because the byte values would then all be <128.
You are mistaken. First of all, UTF-8 code units can be byte values
greater than 127 (0x7F).
All characters in the application/x-www-form-urlencoded format are
strictly 7-bit.
Incorrect. Can't be true because the HTML document character set is the
Universal Character Set, regardless of the encoding used. See HTML 4.01,
section 5.1, and RFC 3986, section 2.5, or simply test it.
If I'm not mistaken, the first 128 characters in ASCII, Latin-9, and
Unicode are the same,
Correct.
so there wouldn't be any troubles with the encoding.
Non sequitur.
Second, you would need to use encodeURIComponent() for that; escape()
is not Unicode-safe. In fact, using escape() would generate
URL-encoded strings that do not conform to the URI Specification
(currently, RFC 3986), for example escape("¿") == "%BF" here. While
that was not a problem before Unicode and the various Unicode
Transformation Formats, it certainly is now: encodeURIComponent("ÿ") ==
"%C3%BF", but unescape("%C3%BF") == "Ã¿".
I realize that escape() can't be used for all of the Unicode set, and
that escape/unescape and encodeURIComponent/decodeURIComponent have to be
used symmetrically.
The point of my example is that it is impossible for a decoder to recognize
whether it received a percent-encoding based on an 8-bit encoding or a
percent-encoding based on UTF-8 (no, UTF-8 is not an 8-bit encoding, only
its code units are 8 bits wide). Therefore, UTF-8-percent-encoding is
specified in RFC 3986 for code points above 0x7F.
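
A minimal sketch of that ambiguity, assuming an ECMAScript implementation that still provides the legacy escape()/unescape() functions; the variable names are only illustrative. The same percent-encoded string decodes to different text depending on which convention the decoder assumes:

```javascript
// "ÿ" is U+00FF. escape() emits the code unit value as "%XX";
// encodeURIComponent() emits the percent-encoded UTF-8 bytes.
const viaEscape = escape("\u00FF");
const viaEncodeURIComponent = encodeURIComponent("\u00FF");

// The wire format alone does not say which convention produced it.
// Reading UTF-8-based output with the 8-bit convention gives two characters.
const mismatched = unescape(viaEncodeURIComponent);
const matched = decodeURIComponent(viaEncodeURIComponent);

// Reading 8-bit-based output as UTF-8 fails outright here, because
// a lone 0xFF byte is not valid UTF-8.
let rejected = false;
try {
  decodeURIComponent(viaEscape);
} catch (e) {
  rejected = e instanceof URIError;
}

console.log(viaEscape, viaEncodeURIComponent, mismatched, matched, rejected);
```

The failure case is the lucky one; the mismatched case produces plausible-looking but wrong text without any error.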
If the server can't handle UTF-8 sequences in URIs, I can't use
encodeURIComponent.
True.
Actually, using the URL-encoded format was never an option. I can't use
encodeURIComponent() for the reason stated above, and escape() will fail
with Latin-9 characters that aren't in Latin-1 (like the € sign).
There is no way to tell which encoding would be used with escape().
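
To illustrate the Latin-9 problem, here is a small sketch, assuming an implementation whose escape() follows the behaviour described in Annex B of the ECMAScript specification. Byte 0xA4 means "€" in ISO-8859-15 but "¤" in ISO-8859-1, and escape() never produces it for the euro sign:

```javascript
const euro = "\u20AC";      // "€", at 0xA4 in ISO-8859-15
const currency = "\u00A4";  // "¤", at 0xA4 in ISO-8859-1

// Code units above 0xFF come out in the non-standard "%uXXXX" form,
// which is not a valid percent-encoding under RFC 3986.
const escapedEuro = escape(euro);

// Code units up to 0xFF come out as "%XX" of the code unit value,
// so "%A4" here means U+00A4, not the euro sign.
const escapedCurrency = escape(currency);

console.log(escapedEuro, escapedCurrency);
```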
I was looking for a way to send raw 8-bit ISO-8859-15 data, and if that
wasn't possible, I'd use Unicode and add a patch to the server-side code.
As I have said (see below), it is generally possible with
multipart/form-data (else you could not upload text files encoded so,
unchanged), however you also have to consider how the ECMAScript
implementation handles strings.
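
As a rough sketch of the string-handling side: converting an ECMAScript string into raw ISO-8859-15 byte values has to be done by hand. The helper name encodeLatin9 is hypothetical, and the sketch assumes an environment with typed arrays and a send() method that accepts them, which was not generally available when this thread was written:

```javascript
// Code points where ISO-8859-15 differs from ISO-8859-1, with their byte values.
const LATIN9_SPECIAL = new Map([
  [0x20AC, 0xA4], [0x0160, 0xA6], [0x0161, 0xA8], [0x017D, 0xB4],
  [0x017E, 0xB8], [0x0152, 0xBC], [0x0153, 0xBD], [0x0178, 0xBE],
]);
// The Latin-1 characters displaced from those positions do not exist in Latin-9.
const DISPLACED = new Set(LATIN9_SPECIAL.values());

// Hypothetical helper: returns the ISO-8859-15 bytes for `text`,
// or throws if a character cannot be represented.
function encodeLatin9(text) {
  const bytes = [];
  for (const ch of text) {            // iterates by code point
    const cp = ch.codePointAt(0);
    if (LATIN9_SPECIAL.has(cp)) {
      bytes.push(LATIN9_SPECIAL.get(cp));
    } else if (cp <= 0xFF && !DISPLACED.has(cp)) {
      bytes.push(cp);                 // same position as in ISO-8859-1
    } else {
      throw new RangeError("Not representable in ISO-8859-15: U+" +
        cp.toString(16).toUpperCase());
    }
  }
  return Uint8Array.from(bytes);
}

console.log(encodeLatin9("10 \u20AC caf\u00E9"));
```

The resulting byte array would still have to be wrapped in a multipart/form-data body with a matching charset declaration, and the server has to honour that declaration.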
(Side note: obviously the whole application should be upgraded to
Unicode, but we're talking about 200k lines of code and 120 database
tables - if the client won't pay for it, it's not going to happen.)
Until the database conversion is done, conversion from UTF-8 input to
ISO-8859-15 can happen server-side (PHP: unicode_encode()). Because you
won't have characters above code point 0xFF in the database, you won't have
false positives in queries.
Database tables can be recoded with little effort; BTDT for MySQL version
4.0 (does not store the encoding; data was partially ISO-8859-1 and
Windows-1252) to 5.x (using recode(1) and iconv(1) on the database dump to
convert it to UTF-8, then importing that in the new DB).
The server-side scripts do not need to be recoded.
[...] The question was whether it's possible to send 8-bit data in a
different encoding than UTF-8.
With multipart/form-data, on the other hand, everything is possible
(see RFC 2388). The issue is, though, that current ECMAScript
implementations are Unicode-safe and thus use UTF-16 internally for
string values (even for "\xe9"). And UTF-16 is a superset of
ISO-8859-xx in the sense that not every character that can be encoded
in UTF-16 can also be encoded in ISO-8859-xx. But, AFAIK, every
character that can be encoded in UTF-16 can also be encoded in UTF-8.
Which would explain why UTF-8 is used by default.
It's the same character set, and both encodings can handle all of it
(although somewhat inefficiently outside the Basic Multilingual Plane).
Define "same character set" and "both encodings".
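
The distinction between a character and its encoded forms can be sketched as follows, assuming an environment that provides TextEncoder (which always produces UTF-8). The same code point has different code units in UTF-16 and UTF-8:

```javascript
// "\xe9" is U+00E9 ("é"); the string holds one 16-bit code unit for it.
const eAcute = "\xe9";
const utf16Unit = eAcute.charCodeAt(0);
const utf8Bytes = Array.from(new TextEncoder().encode(eAcute));

// A code point outside the Basic Multilingual Plane (U+1D11E) takes
// two UTF-16 code units (a surrogate pair) and four UTF-8 code units.
const clef = "\u{1D11E}";
const clefUtf16Units = clef.length;
const clefUtf8Bytes = new TextEncoder().encode(clef).length;

console.log(utf16Unit.toString(16), utf8Bytes, clefUtf16Units, clefUtf8Bytes);
```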
[...] Since it appears to be impossible to send 8-bit non-UTF-8 request
bodies with the XHR object, and URL-encoding in those Latin-X sets would
be ambiguous, I'll go ahead and modify the server-side code for this
service. I'm not happy about it, because it means having different
character sets in at least one place,
No, it does not. There is a difference between the document character set
(which is always UCS) and the encoding used to represent characters from it.
but it seems I have no choice.
Exactly.
PointedEars