Re: Understanding simplest HTML page
- From: "Alan J. Flavell" <flavell@xxxxxxxxxxxx>
- Date: Sun, 20 Nov 2005 16:31:59 +0000
On Sun, 20 Nov 2005, Lachlan Hunt wrote:
> Alan J. Flavell wrote:
> > On Sat, 19 Nov 2005, Eric Lindsay wrote:
> > > So if that is at all likely, serving using UTF-8 seems a better
> > > choice.
> >
> > It's an option which has much to commend it, but the author must
> > be capable of handling it. Even the BBC managed to put invalid
> > character data into their utf-8-encoded pages sometimes.
>
> Ideally, the content author should not need to be aware of all the
> technical details of using a particular encoding; such issues should
> be handled by the CMS/authoring tool programmers and the system
> administrators.
Ideally, you're right. However, at the discussion level of this
group, I suggest it's still useful to understand the underlying
mechanics, seeing that there are plenty of ways things can go wrong,
as my example from the BBC showed (they had seemingly concatenated
some iso-8859-1 data into their pages for readers from the
subcontinent - which represent Urdu, Bengali and so on using utf-8
encoding).
> The content authors should be able to enter whatever characters they
> like and the CMS/authoring tool should ensure that they are encoded
> correctly.
It's a fine idea, certainly.
> It's extremely easy to validate UTF-8 or ISO-8859-1 input
If you mean "on knowing that some non-trivial input must be either
utf-8 or iso-8859-1, it's easy to decide which", then I agree with
you.
utf-8 can be formally validated, although there's some small
probability that data in another encoding happens to pass as valid
utf-8. iso-8859-1 is just a sequence of octets, and any octet sequence
is formally acceptable: it can only be verified by some kind of sanity
check on its content.
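
As an illustration of the asymmetry described above, here is a minimal Python sketch (not part of the original post; the function name is hypothetical). It assumes that a strict decode by Python's built-in codecs is an acceptable stand-in for formal utf-8 validation.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed utf-8 (strict decoding succeeds)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# A euro sign encoded as utf-8 is three octets and validates.
euro_utf8 = "\u20ac".encode("utf-8")        # b'\xe2\x82\xac'

# "cafe" with e-acute in iso-8859-1: the lone 0xE9 octet is not valid utf-8.
cafe_latin1 = b"caf\xe9"

# iso-8859-1 cannot be validated this way: every possible octet decodes.
all_octets = bytes(range(256)).decode("iso-8859-1")

# "Faking it": these two octets are valid utf-8 (e-acute) but could equally
# be two iso-8859-1 characters. Validation alone cannot tell which was meant.
ambiguous = b"\xc3\xa9"

print(is_valid_utf8(euro_utf8), is_valid_utf8(cafe_latin1), is_valid_utf8(ambiguous))
```

The longer and more varied the non-ASCII content, the less likely it is that data in another encoding passes the utf-8 check by accident.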
> (other encodings may be harder, but still possible)
Mozilla has routines for automatically guessing at character
encodings, which are useful when no other information is available:
but they can go sadly wrong.
> and, IMHO, there's no excuse (beyond ignorance) for any CMS to not
> validate the encoding of the input, and thus no excuse for invalid
> character data to appear.
I'll give you an example. The web is awash with windows-1252 data
purporting to be iso-8859-1. Its authors claim that "it works", and I
can only admit "yes, it gives a fair impression of working" since most
browser developers have felt it necessary to accommodate this misuse.
By rights it should be declared invalid, but that's not going to
happen in this age of the world.
Let's hope for better in the future.
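
A sanity check for this particular mislabelling can be sketched as follows (illustrative Python, not from the original post; the helper name and sample data are made up). It relies on the fact that octets 0x80-0x9F carry only control codes in iso-8859-1 but printable characters in windows-1252, so their presence in supposedly iso-8859-1 text is a strong hint.

```python
def c1_octets(data: bytes) -> list:
    """Return the octets in 0x80-0x9F found in data, in order of appearance."""
    return [b for b in data if 0x80 <= b <= 0x9F]


# Hypothetical content labelled iso-8859-1 but actually written as windows-1252.
sample = b"Price: \x80 5 \x93quoted\x94"

suspicious = c1_octets(sample)

# Taken at face value, the octets become invisible C1 control characters.
as_latin1 = sample.decode("iso-8859-1")

# Read as windows-1252, they become the euro sign and curly quotes.
as_cp1252 = sample.decode("cp1252")

print([hex(b) for b in suspicious])
print(as_cp1252)
```

This is a heuristic only: it says the label is probably wrong, not what the author intended.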
utf-8 is different. There's even a security mandate against
attempting to process invalid coding.
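
The usual example behind such a mandate is the "overlong" form, where a character is encoded in more octets than necessary and could slip past a filter that only looks for the short form. The sketch below (illustrative Python, not from the original post) assumes Python's strict utf-8 decoder as the validator and shows it refusing an overlong encoding of "/".

```python
def decode_strict(data: bytes) -> str:
    """Decode utf-8, raising UnicodeDecodeError on any ill-formed sequence."""
    return data.decode("utf-8", errors="strict")


# 0xC0 0xAF is an overlong two-octet form of "/" (U+002F); it is not valid utf-8.
overlong_path = b"..\xc0\xaf.."

try:
    decode_strict(overlong_path)
    rejected = False
except UnicodeDecodeError:
    rejected = True

print(rejected)
```

A decoder that quietly accepted the overlong form would hand "../.." to code that had already checked the raw octets for a slash and found none.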
> > beware of are the characters of the Windows-specific repertoire
> > 128-159 decimal. For example the euro character is *not* &#128;
> > (but I'd recommend coding that as &euro; anyway).
>
> For HTML, &euro; is acceptable, it will be handled just fine in all
> modern browsers.
And will be recognisable even in HTML browsers which for some reason
don't interpret it. Which was the point of recommending it.
> However, for XHTML, the numeric or hex character reference is better
> because entity references require a validating parser.
Fair comment.
> Besides, I'd recommend just using UTF-8 and entering .
For web authoring[1], it's a perfectly fine option, for those who are
comfortable with doing it. As I keep repeating. My point here was to
warn off anyone who *was* using &-notation from trying to reference
code points from Windows-1252 (as happens in so much HTML-extruding
software from a certain dominant vendor, until quite recently).
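
For anyone cleaning up such markup, the repair can be sketched as below (illustrative Python, not from the original post; the function name is hypothetical). It assumes that a decimal reference in the range 128-159 was meant as a windows-1252 octet, and rewrites it to the reference for the corresponding Unicode code point.

```python
import re

# Decimal numeric character references such as "&#128;".
_DECIMAL_REF = re.compile(r"&#(\d+);")


def fix_windows_references(html: str) -> str:
    """Rewrite &#128;..&#159; to the code points windows-1252 assigns them."""

    def replace(match):
        code = int(match.group(1))
        if 128 <= code <= 159:
            try:
                char = bytes([code]).decode("cp1252")
            except UnicodeDecodeError:
                # A few positions (129, for example) are unassigned in
                # Python's cp1252 codec; leave those untouched.
                return match.group(0)
            return "&#%d;" % ord(char)
        return match.group(0)

    return _DECIMAL_REF.sub(replace, html)


print(fix_windows_references("Cost: &#128;5, caf&#233;"))
# prints: Cost: &#8364;5, caf&#233;
```

Hexadecimal references (the &#x80; form) are deliberately not handled in this sketch.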
regards
[1] For posting to Usenet, on the other hand, it's not without its
gotchas, as we see here. SCNR.