Re: HTML entities from input fields



On Wed, 11 Jan 2006, Jukka K. Korpela wrote:

> chernyshevsky@xxxxxxxxxxx wrote:
>
> > IE to encode characters outside of the current code-page
> > as HTML entities?
>
> You cannot force such grossly incorrect behavior.

Well, inasmuch as behaviour in this situation isn't defined, it's
hard to say that it's "incorrect", but it's certainly
counter-intuitive, and I'd rate it as distinctly sub-optimal, because
the results are ambiguous.
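
To illustrate the ambiguity, here is a minimal Python sketch. It assumes the
browser behaviour discussed in this thread (characters outside the page's
code-page are sent as decimal &#number; references) and assumes a
windows-1252 page; the function name is made up for the illustration and
Python's "xmlcharrefreplace" error handler merely stands in for the browser.

```python
# Hypothetical stand-in for the browser's submission step: encode the field
# value in the page's charset, replacing anything unrepresentable with a
# decimal character reference (assumed behaviour, as described in the thread).
def submit_like_legacy_browser(text: str, page_charset: str = "cp1252") -> bytes:
    return text.encode(page_charset, errors="xmlcharrefreplace")

# Case 1: the user pastes one Cyrillic letter (U+041F), which cp1252 lacks.
pasted = submit_like_legacy_browser("\u041f")

# Case 2: the user literally types the seven ASCII characters "&#1055;".
typed = submit_like_legacy_browser("&#1055;")

print(pasted)           # b'&#1055;'
print(pasted == typed)  # True: the server cannot tell the two cases apart
```

Both inputs arrive as the same bytes, so the receiving script has no way of
knowing whether to decode the reference or leave it alone.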

But over and above that, I would criticise the specification writers
for failing to grasp the fact that you can't prevent users from
submitting whatever characters they care to paste into the submission
fields: they should have made some kind of unambiguous provision for
what ought to happen in that case. I've seen at least two unambiguous
ways that it *could* be done (or "could have been done" if the
spec-writers had got there first) - but it seems too late to remedy
that now.

> You just need to be prepared for getting form data encoded that way,
> from IE and perhaps other browsers as well.

That's the reality of it, indeed.

> If you wish to be prepared for getting arbitrary character data (as a
> form designer should be, right?), make the page containing the form
> UTF-8 encoded. Browsers will then send the data in UTF-8 format
> (though of course, some old browsers may fail to do this - but there
> is little hope with them anyway).

Based on the observation that search services like Google have been
doing this for a couple of years already, it seems that they, at
least, rate this as practically feasible nowadays. Though they might
still have a fallback if they detect that NN4.* is calling (NN4.*
versions are quite capable of *rendering* utf-8, generally speaking:
and they indicate that capability in their Accept-charset header, but
when it comes to submitting utf-8 data, they get it horribly wrong).
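
On the receiving side, the "UTF-8 page plus fallback" idea might look
something like the following Python sketch. It is only an illustration under
stated assumptions: decode_form is a made-up helper, the fallback charset
(windows-1252) is an arbitrary choice, and nothing here reproduces what
Google or any real service actually does, nor how NN4 actually mangles its
submissions.

```python
from urllib.parse import parse_qs

# Hypothetical helper: try strict UTF-8 first, since a form on a UTF-8 page
# should be submitted as UTF-8. If the bytes are not valid UTF-8, fall back
# to an assumed legacy charset rather than rejecting the submission.
def decode_form(query: str, fallback_charset: str = "cp1252") -> dict:
    try:
        return parse_qs(query, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        # Assumption: the client ignored the page encoding.
        return parse_qs(query, encoding=fallback_charset, errors="replace")

# Well-formed UTF-8 submission of U+041F (percent-encoded bytes D0 9F).
print(decode_form("q=%D0%9F"))

# A lone byte E9 is not valid UTF-8, so the fallback charset is used.
print(decode_form("q=caf%E9"))
```

A fallback of this kind is guesswork by nature: a byte sequence that happens
to be valid UTF-8 will always be taken as UTF-8, whatever the client meant.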

Thanks for the cite. I think it says everything else I could want to
add on the topic...

cheers