Re: Understanding simplest HTML page



On Sun, 20 Nov 2005, Lachlan Hunt wrote:

> Alan J. Flavell wrote:
> > On Sat, 19 Nov 2005, Eric Lindsay wrote:
> > > So if that is at all likely, serving using UTF-8 seems a better
> > > choice.
> >
> > It's an option which has much to commend it, but the author must
> > be capable of handling it. Even the BBC managed to put invalid
> > character data into their utf-8-encoded pages sometimes.
>
> Ideally, the content author should not need to be aware of all the
> technical details of using a particular encoding, such issues should
> be handled by the CMS/authoring tool programmers and the system
> administrators.

Ideally, you're right. However, at the discussion level of this
group, I suggest it's still useful to understand the underlying
mechanics, seeing that there are plenty of ways things can go wrong,
as my example from the BBC showed (they had seemingly concatenated
some iso-8859-1 data into their pages for readers from the
subcontinent - which represent Urdu, Bengali and so on using utf-8
encoding).

> The content authors should be able to enter whatever characters they
> like and the CMS/authoring tool should ensure that they are encoded
> correctly.

It's a fine idea, certainly.

> It's extremely easy to validate UTF-8 or ISO-8859-1 input

If you mean "on knowing that some non-trivial input must be either
utf-8 or iso-8859-1, it's easy to decide which", then I agree with
you.

utf-8 can be formally validated, although there's some small
probability that data in another encoding will pass as valid utf-8 by
chance. iso-8859-1 is just a sequence of octets, with no internal
structure to check: it can only be verified by some kind of sanity
check on its content.
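
The distinction above can be sketched in a few lines of Python. This is
only an illustration, not anything from the thread: the helper name
is_valid_utf8 and the sample strings are assumptions made up for the
example. It relies on Python's built-in strict "utf-8" and "iso-8859-1"
codecs.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as well-formed UTF-8 (strict mode)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "cafe" with e-acute, written with escapes so the source stays ASCII.
utf8_bytes = "caf\u00e9".encode("utf-8")          # b'caf\xc3\xa9'
latin1_bytes = "caf\u00e9".encode("iso-8859-1")   # b'caf\xe9'

print(is_valid_utf8(utf8_bytes))    # True
print(is_valid_utf8(latin1_bytes))  # False: 0xE9 starts a sequence that never completes

# iso-8859-1 has no structure to violate: all 256 octet values decode.
every_octet = bytes(range(256))
print(len(every_octet.decode("iso-8859-1")))  # 256

# The "faking it" case: iso-8859-1 text (A-tilde, copyright sign) whose
# octets also happen to form one valid utf-8 sequence (e-acute).
ambiguous = "\u00c3\u00a9".encode("iso-8859-1")   # b'\xc3\xa9'
print(is_valid_utf8(ambiguous))     # True
```

The last case is why validation alone cannot prove that data was
*meant* as utf-8; it only becomes convincing once the input is
non-trivial.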

> (other encodings may be harder, but still possible)

Mozilla has routines for automatically guessing at character
encodings, which are useful when no other information is available:
but they can go sadly wrong.

> and, IMHO, there's no excuse (beyond ignorance) for any CMS to not
> validate the encoding of the input, and thus no excuse for invalid
> character data to appear.

I'll give you an example. The web is awash with windows-1252 data
purporting to be iso-8859-1. Its authors claim that "it works", and I
can only admit "yes, it gives a fair impression of working" since most
browser developers have felt it necessary to accommodate this misuse.
By rights it should be declared invalid, but that's not going to
happen in this age of the world.

Let's hope for better in the future.
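
As a rough illustration of the mislabelling described above, the same
octets can be decoded under both labels. The sample bytes and the
helper c1_offsets are hypothetical, invented for this sketch; "cp1252"
is Python's codec name for windows-1252. Flagging octets in the range
128-159 is one possible sanity check, under the assumption that real
iso-8859-1 text almost never contains C1 control characters.

```python
C1_RANGE = range(0x80, 0xA0)  # 128-159 decimal


def c1_offsets(data: bytes) -> list:
    """Return the offsets of all octets in the range 128-159."""
    return [i for i, b in enumerate(data) if b in C1_RANGE]


# Hypothetical page fragment produced by a Windows tool.
sample = b"Price: \x80 5 \x96 \x93cheap\x94"

as_latin1 = sample.decode("iso-8859-1")  # what the label claims
as_cp1252 = sample.decode("cp1252")      # what the author meant

print(c1_offsets(sample))      # [7, 11, 13, 19]
print(repr(as_latin1[7]))      # '\x80', a C1 control character
print(as_cp1252[7] == "\u20ac")  # True: the euro sign
```

A non-empty result from c1_offsets on data labelled iso-8859-1 is a
strong hint, though not proof, that the data is really windows-1252.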

utf-8 is different. There's even a security mandate against
attempting to process invalid coding.
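
The post does not name the mandate. A reasonable assumption is that it
refers to the rule (stated, for example, in the security considerations
of RFC 3629) that decoders must reject ill-formed sequences such as
"overlong" forms, which could otherwise smuggle characters like "/"
past a filter. A minimal sketch of a strict decoder refusing one,
using Python's built-in codec:

```python
# 0xC0 0xAF is an overlong two-octet form of "/" (U+002F).
# A lenient decoder might turn it into "/"; a conforming one must not.
overlong_slash = b"\xc0\xaf"

try:
    overlong_slash.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True

print(rejected)  # True

# Even in "replace" mode the octets become U+FFFD, never a "/".
replaced = overlong_slash.decode("utf-8", errors="replace")
print("/" in replaced)  # False
```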

> > beware of are the characters of the Windows-specific repertoire
> > 128-159 decimal. For example the euro character is *not* &#128;
> > (but I'd recommend coding that as &euro; anyway).
>
> For HTML, &euro; is acceptable, it will be handled just fine in all
> modern browsers.

And will be recognisable even in HTML browsers which for some reason
don't interpret it. Which was the point of recommending it.

> However, for XHTML, the numeric or hex character reference is better
> because entity references require a validating parser.

Fair comment.

> Besides, I'd recommend just using UTF-8 and entering .

For web authoring[1], it's a perfectly fine option, for those who are
comfortable with doing it. As I keep repeating. My point here was to
warn off anyone who *was* using &-notation from trying to reference
Windows-1252 code points in the range 128-159 (as happened in so much
HTML-extruding software from a certain dominant vendor, until quite
recently).
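
To make the warning concrete, here is a small Python sketch comparing
the safe spellings of the euro sign with the Windows-1252 code point.
It uses the standard library's html.unescape, which is documented to
follow the HTML 5 rules; those rules later wrote down the browser
accommodation, so the function remaps &#128; rather than treating it
as an error. That behaviour belongs to this one library and is not
something the thread promises of any particular browser.

```python
import html
import unicodedata

EURO = "\u20ac"

# Three correct ways to reference the euro sign.
print(html.unescape("&euro;") == EURO)    # True (named entity)
print(html.unescape("&#8364;") == EURO)   # True (decimal reference)
print(html.unescape("&#x20AC;") == EURO)  # True (hex reference)

# Taken literally, code point 128 is a control character, not a
# currency symbol.
print(unicodedata.category(chr(128)))     # Cc

# The HTML 5 rules nevertheless remap &#128; to the euro sign, which
# is the "fair impression of working" that keeps the misuse alive.
print(html.unescape("&#128;") == EURO)    # True
```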

regards

[1] For posting to Usenet, on the other hand, it's not without its
gotchas, as we see here. SCNR.