Validating to a higher 4.01 standard (long)
- From: Lars Eighner <usenet@xxxxxxxxxxxxxxx>
- Date: Fri, 9 Jan 2009 23:16:21 +0000 (UTC)
Many people have written that they want to validate HTML 4.01 to a
higher standard so that it will be easier to process into other formats,
including various SGMLs.
Now of course in theory you can write your own DTD, post it in a public
site on the web, use your own PUBLIC identifier and include a pointer
to your DTD in your DOCTYPE. This is problematic for several reasons.
Real-world browsers do not parse documents against a DTD and probably never
will look up your personal DTD, so the most likely thing to happen is that they will
ignore your DOCTYPE and go into quirks mode or something of similar
horribleness.
In adverse circumstances, the most reasonable thing to do is to cheat, and
this is about cheating. *It* *is* *cheating*, just to be clear, and there
will be folks along shortly to point out how bad it is.
The first step is to install a local SGML parser and learn how to use it.
I'm not going to discuss how to do this because this post will be long
enough as it is. For parsers there is James Clark's SP, and the more recently
developed OpenSP (which is kind of a stupid name because SP *was* open
source). These packages contain parsers called nsgmls and onsgmls
respectively. You want the error output from these, and generally you will
throw away the regular parser output. Anyway, whatever parser you install, you
should learn to parse your documents with it as is before you go to the
next step.
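As a rough illustration of "keep the errors, throw away the rest", here is a minimal Python sketch. It assumes Python 3.7 or later, that onsgmls is on your PATH, and that it can already find the standard HTML 4.01 catalog (for example through your system's catalog setup). The function names are made up for this example.

```python
import subprocess


def error_command(document, parser="onsgmls"):
    """Build the command line; -s suppresses the normal parser output."""
    return [parser, "-s", document]


def parse_errors(document, parser="onsgmls"):
    """Run the parser and return its error messages, one per list item."""
    result = subprocess.run(
        error_command(document, parser),
        capture_output=True,  # errors arrive on stderr
        text=True,
    )
    return result.stderr.splitlines()
```

An empty list from parse_errors("index.html") would mean the parser had nothing to complain about.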
The next step is to copy these HTML 4.01 files to a new directory (so you do
not mess up your ability to check documents without any funny business):
HTML4.cat
HTML4.decl
HTMLlat1.ent
HTMLspecial.ent
HTMLsymbol.ent
frameset.dtd
loose.dtd
strict.dtd
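If you would rather script the copy, something like the following Python sketch should do. The directory arguments are placeholders for wherever your originals live and wherever you want the altered set to go; the function name is hypothetical.

```python
import shutil
from pathlib import Path

# The eight files named in the list above.
DTD_FILES = [
    "HTML4.cat", "HTML4.decl", "HTMLlat1.ent", "HTMLspecial.ent",
    "HTMLsymbol.ent", "frameset.dtd", "loose.dtd", "strict.dtd",
]


def copy_dtd_files(source_dir, target_dir):
    """Copy the files into target_dir, leaving the originals untouched."""
    source, target = Path(source_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in DTD_FILES:
        shutil.copy2(source / name, target / name)
        copied.append(target / name)
    return copied
```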
The third step is to make a simple edit in the *new* copy of HTML4.decl.
Look for:
NAMECASE GENERAL YES
ENTITY NO
and change it to:
NAMECASE GENERAL NO
ENTITY NO
This stops element names and attribute names from being case-insensitive.
Do *not* edit this:
OMITTAG YES
We are going to force elements that have content to use closing tags, but
this is not the place to do it. We are aiming at better HTML 4.01
documents, and in HTML 4.01, closing tags for empty elements *must* be
omitted.
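The NAMECASE edit is small enough to do by hand, but for the record here is a Python sketch of it. It works on the text of the declaration, changes only the GENERAL setting, and leaves OMITTAG and everything else alone. The function name is invented for this example.

```python
import re

# Matches "NAMECASE GENERAL YES" with any amount of whitespace between words.
NAMECASE_RE = re.compile(r"(NAMECASE\s+GENERAL\s+)YES")


def make_case_sensitive(decl_text):
    """Return decl_text with NAMECASE GENERAL switched from YES to NO."""
    new_text, count = NAMECASE_RE.subn(r"\1NO", decl_text, count=1)
    if count != 1:
        raise ValueError("NAMECASE GENERAL YES not found")
    return new_text
```

Read the *new* copy of HTML4.decl, pass its text through the function, and write the result back to the same file.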
Now I will summarize the carnage in the DTD because I am going to give one
that seems to work and you won't have to do it yourself unless you really
want to.
I picked the loose.dtd, which was stupid because I spent a lot of time
essentially changing it to something close to strict. I tried to eliminate
all the deprecated elements and attributes. Care is required to do this
because some attributes are deprecated in some elements but not deprecated
in others (that is, you can't do this with search-and-replace unless you
supervise every change).
Then I changed all the elements with both optional opening and closing tags
so that they required both.
So something like:
<!ELEMENT BODY O O (%flow;)* +(INS|DEL) -- document body -->
became this:
<!ELEMENT BODY - - (%flow;)* +(INS|DEL) -- document body -->
Then I changed all the elements that have content and optional closing tags
to require both. So:
<!ELEMENT P - O (%inline;)* -- paragraph -->
became:
<!ELEMENT P - - (%inline;)* -- paragraph -->
*But* care must be taken not to change empty elements. So leave this alone:
<!ELEMENT HR - O EMPTY -- horizontal rule -->
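The two tag-minimization changes above can be sketched as one regular-expression pass in Python. This is only an approximation of what was done by hand: it assumes each declaration puts the element name (or a parenthesized name group) directly after <!ELEMENT, and it skips any declaration whose content is EMPTY. The function name is hypothetical.

```python
import re

ELEMENT_RE = re.compile(
    r"(<!ELEMENT\s+(?:\([^)]*\)|\S+)\s+)"  # keyword plus name or name group
    r"[-O]\s+[-O]"                          # start-tag and end-tag minimization
    r"(?!\s+EMPTY\b)"                       # leave EMPTY elements alone
)


def require_tags(dtd_text):
    """Make both tags required for every element that has content."""
    return ELEMENT_RE.sub(r"\1- -", dtd_text)
```

Run on the three declarations quoted above, it rewrites the BODY and P lines to "- -" and leaves the HR line as it is.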
Finally I changed all the element names to lowercase.
<!ELEMENT BODY - - (%flow;)* +(INS|DEL) -- document body -->
became:
<!ELEMENT body - - (%flow;)* +(ins|del) -- document body -->
And of course those elements not only have to be changed where the element
is declared, but in all the various references to them, such as the INS
and DEL in the above.
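Since a missed reference is easy to overlook, a small checker may help. The Python sketch below lists all-uppercase names still left inside ELEMENT declarations after comments are stripped. It is a rough heuristic, not an SGML parser: it assumes no comment contains a ">" and it only looks at ELEMENT declarations. The names used are invented for this example.

```python
import re

# Uppercase words that are SGML keywords rather than element names.
KEYWORDS = {"ELEMENT", "EMPTY", "PCDATA", "CDATA", "RCDATA", "ANY", "O"}
DECL_RE = re.compile(r"<!ELEMENT\b[^>]*>")
COMMENT_RE = re.compile(r"--.*?--", re.DOTALL)
# An all-uppercase word that is not part of a %parameter; reference.
NAME_RE = re.compile(r"(?<![%\w])[A-Z][A-Z0-9]*\b")


def leftover_uppercase(dtd_text):
    """Return uppercase element names still present in ELEMENT declarations."""
    found = []
    for decl in DECL_RE.findall(dtd_text):
        body = COMMENT_RE.sub("", decl)
        for name in NAME_RE.findall(body):
            if name not in KEYWORDS:
                found.append(name)
    return found
```

An empty result does not prove the DTD is right, but a non-empty one points at names that still need lowering.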
Finally, you need to change the DOCTYPE of documents you want to check
by your new standard. Do not in any way mess with the PUBLIC identifier
which is the "-//W3C//DTD HTML 4.01 Transitional//EN" part. The whole point
of this exercise is that you want browsers to accept your documents as HTML
4.01 without question, and the PUBLIC identifier is what does that.
You simply want to change the HTML that comes right after DOCTYPE to
html (don't change it in the PUBLIC identifier). This is what tells the
SGML parser what the top-level element of your document is, and since all your
elements are now lowercase and the SGML parser does not allow case folding
because of what was changed in HTML4.decl, it has to look for html, not HTML.
So this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
becomes this:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
This shouldn't affect any browser's recognition of the document as 4.01
loose.
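For a pile of existing documents, the DOCTYPE change can be sketched in Python as below. The pattern touches only the name between DOCTYPE and PUBLIC, so the PUBLIC identifier and the URL stay exactly as they were. The function name is hypothetical.

```python
import re

# Only the document type name between DOCTYPE and PUBLIC is matched.
DOCTYPE_RE = re.compile(r"(<!DOCTYPE\s+)HTML(\s+PUBLIC\b)")


def lowercase_doctype_name(document_text):
    """Change <!DOCTYPE HTML PUBLIC ...> to <!DOCTYPE html PUBLIC ...>."""
    return DOCTYPE_RE.sub(r"\1html\2", document_text, count=1)
```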
Finally, when you invoke your SGML parser, point it at the HTML4.cat file
in the new directory you made. You can point it back at the standard DTDs
in case you want to verify that your documents are indeed still valid HTML
4.01. Because the PUBLIC identifier should be exactly the same, you cannot
make one big catalogue file that will validate both ways.
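Checking both ways then comes down to running the parser twice with different catalogs. Here is a Python sketch that builds the two command lines using the parser's -c option, which names a catalog file. The directory names in the usage note are placeholders, not anything the parser requires, and the function name is made up for this example.

```python
import os


def catalog_command(document, dtd_dir, parser="onsgmls"):
    """Build a command that validates document against the catalog in dtd_dir."""
    catalog = os.path.join(dtd_dir, "HTML4.cat")
    # -s: errors only; -c: use this catalog to resolve the PUBLIC identifier
    return [parser, "-s", "-c", catalog, document]
```

For example, catalog_command("index.html", "html401-strict") would check against the altered DTDs, and catalog_command("index.html", "html401-standard") against the untouched ones. Either list can be handed to subprocess.run.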
If you undertake to edit the DTDs yourself, I suggest you make frequent
backups and validate a test document against the DTD you are editing because
the parser begins by parsing the DTD and will let you know if you have
screwed up the DTD. Also, there is nothing to be done to the frameset.dtd.
It really just enables elements that are already defined in the other DTDs.
Here, without the comments, is my DTD, which seems to work:
<!-- BEGIN altered DTD -->
<!ENTITY % HTML.Version "-//W3C//DTD HTML 4.01 Transitional//EN">
<!ENTITY % HTML.Frameset "IGNORE">
<!ENTITY % ContentType "CDATA">
<!ENTITY % ContentTypes "CDATA">
<!ENTITY % Charset "CDATA">
<!ENTITY % Charsets "CDATA">
<!ENTITY % LanguageCode "NAME">
<!ENTITY % Character "CDATA">
<!ENTITY % LinkTypes "CDATA">
<!ENTITY % MediaDesc "CDATA">
<!ENTITY % URI "CDATA">
<!ENTITY % Datetime "CDATA" >
<!ENTITY % Script "CDATA" >
<!ENTITY % StyleSheet "CDATA" >
<!ENTITY % FrameTarget "CDATA" >
<!ENTITY % Text "CDATA">
<!ENTITY % head.misc "script|style|meta|link|object" >
<!ENTITY % heading "h1|h2|h3|h4|h5|h6">
<!ENTITY % list "ul | ol">
<!ENTITY % preformatted "pre">
<!ENTITY % Color "CDATA" >
<!ENTITY % HTMLlat1 PUBLIC
"-//W3C//ENTITIES Latin1//EN//HTML"
"HTMLlat1.ent">
%HTMLlat1;
<!ENTITY % HTMLsymbol PUBLIC
"-//W3C//ENTITIES Symbols//EN//HTML"
"HTMLsymbol.ent">
%HTMLsymbol;
<!ENTITY % HTMLspecial PUBLIC
"-//W3C//ENTITIES Special//EN//HTML"
"HTMLspecial.ent">
%HTMLspecial;
<!ENTITY % coreattrs
"id ID #IMPLIED
class CDATA #IMPLIED
style %StyleSheet; #IMPLIED
title %Text; #IMPLIED " >
<!ENTITY % i18n
"lang %LanguageCode; #IMPLIED
dir (ltr|rtl) #IMPLIED " >
<!ENTITY % events
"onclick %Script; #IMPLIED
ondblclick %Script; #IMPLIED
onmousedown %Script; #IMPLIED
onmouseup %Script; #IMPLIED
onmouseover %Script; #IMPLIED
onmousemove %Script; #IMPLIED
onmouseout %Script; #IMPLIED
onkeypress %Script; #IMPLIED
onkeydown %Script; #IMPLIED
onkeyup %Script; #IMPLIED " >
<!ENTITY % HTML.Reserved "IGNORE">
<![ %HTML.Reserved; [
<!ENTITY % reserved
"datasrc %URI; #IMPLIED
datafld CDATA #IMPLIED
dataformatas (plaintext|html) plaintext " >
]]>
<!ENTITY % reserved "">
<!ENTITY % attrs "%coreattrs; %i18n; %events;">
<!ENTITY % align "align (left|center|right|justify) #IMPLIED" >
<!ENTITY % fontstyle
"tt | i | b | big | small">
<!ENTITY % phrase "em | strong | dfn | code |
samp | kbd | var | cite | abbr | acronym" >
<!ENTITY % special
"a | img | object | br | script |
map | q | sub | sup | span | bdo | iframe">
<!ENTITY % formctrl "input | select | textarea | label | button">
<!ENTITY % inline "#PCDATA | %fontstyle; | %phrase; |
%special; | %formctrl;">
<!ELEMENT (%fontstyle;|%phrase;) - - (%inline;)*>
<!ATTLIST (%fontstyle;|%phrase;)
%attrs; >
<!ELEMENT (sub|sup) - - (%inline;)* >
<!ATTLIST (sub|sup)
%attrs; >
<!ELEMENT span - - (%inline;)* >
<!ATTLIST span
%attrs;
%reserved; >
<!ELEMENT bdo - - (%inline;)* >
<!ATTLIST bdo
%coreattrs;
lang %LanguageCode; #IMPLIED
dir (ltr|rtl) #REQUIRED >
<!ELEMENT br - O EMPTY >
<!ATTLIST br
%coreattrs; >
<!ENTITY % block
"p | %heading; | %list; | %preformatted; | dl | div |
noscript | noframes | blockquote | form | hr |
table | fieldset | address">
<!ENTITY % flow "%block; | %inline;">
<!ELEMENT body - - (%flow;)* +(ins|del) >
<!ATTLIST body
%attrs;
onload %Script; #IMPLIED
onunload %Script; #IMPLIED >
<!ELEMENT address - - ((%inline;)|p)* >
<!ATTLIST address
%attrs; >
<!ELEMENT div - - (%flow;)* >
<!ATTLIST div
%attrs;
%reserved; >
<!ENTITY % Shape "(rect|circle|poly|default)">
<!ENTITY % Coords "CDATA" >
<!ELEMENT a - - (%inline;)* -(a) >
<!ATTLIST a
%attrs;
charset %Charset; #IMPLIED
type %ContentType; #IMPLIED
name CDATA #IMPLIED
href %URI; #IMPLIED
hreflang %LanguageCode; #IMPLIED
target %FrameTarget; #IMPLIED
rel %LinkTypes; #IMPLIED
rev %LinkTypes; #IMPLIED
accesskey %Character; #IMPLIED
shape %Shape; rect
coords %Coords; #IMPLIED
tabindex NUMBER #IMPLIED
onfocus %Script; #IMPLIED
onblur %Script; #IMPLIED >
<!ELEMENT map - - ((%block;) | area)+ >
<!ATTLIST map
%attrs;
name CDATA #REQUIRED >
<!ELEMENT area - O EMPTY >
<!ATTLIST area
%attrs;
shape %Shape; rect
coords %Coords; #IMPLIED
href %URI; #IMPLIED
target %FrameTarget; #IMPLIED
nohref (nohref) #IMPLIED
alt %Text; #REQUIRED
tabindex NUMBER #IMPLIED
accesskey %Character; #IMPLIED
onfocus %Script; #IMPLIED
onblur %Script; #IMPLIED >
<!ELEMENT link - O EMPTY >
<!ATTLIST link
%attrs;
charset %Charset; #IMPLIED
href %URI; #IMPLIED
hreflang %LanguageCode; #IMPLIED
type %ContentType; #IMPLIED
rel %LinkTypes; #IMPLIED
rev %LinkTypes; #IMPLIED
media %MediaDesc; #IMPLIED
target %FrameTarget; #IMPLIED >
<!ENTITY % Length "CDATA" >
<!ENTITY % MultiLength "CDATA" >
<![ %HTML.Frameset; [
<!ENTITY % MultiLengths "CDATA" >
]]>
<!ENTITY % Pixels "CDATA" >
<!ENTITY % IAlign "(top|middle|bottom|left|right)" >
<!ELEMENT img - O EMPTY >
<!ATTLIST img
%attrs;
src %URI; #REQUIRED
alt %Text; #REQUIRED
longdesc %URI; #IMPLIED -- link to long description
(complements alt) --
name CDATA #IMPLIED
height %Length; #IMPLIED
width %Length; #IMPLIED
usemap %URI; #IMPLIED
ismap (ismap) #IMPLIED
vspace %Pixels; #IMPLIED >
<!ELEMENT object - - (param | %flow;)* >
<!ATTLIST object
%attrs;
declare (declare) #IMPLIED
classid %URI; #IMPLIED
codebase %URI; #IMPLIED
data %URI; #IMPLIED
type %ContentType; #IMPLIED
codetype %ContentType; #IMPLIED
archive CDATA #IMPLIED
standby %Text; #IMPLIED
height %Length; #IMPLIED
width %Length; #IMPLIED
usemap %URI; #IMPLIED
name CDATA #IMPLIED
tabindex NUMBER #IMPLIED
vspace %Pixels; #IMPLIED
%reserved; >
<!ELEMENT param - O EMPTY >
<!ATTLIST param
id ID #IMPLIED
name CDATA #REQUIRED
value CDATA #IMPLIED
valuetype (DATA|REF|OBJECT) DATA
type %ContentType; #IMPLIED >
<!ELEMENT hr - O EMPTY >
<!ATTLIST hr
%attrs; >
<!ELEMENT p - - (%inline;)* >
<!ATTLIST p
%attrs; >
<!ELEMENT (%heading;) - - (%inline;)* >
<!ATTLIST (%heading;)
%attrs; >
<!ENTITY % pre.exclusion "img|object|big|small|sub|sup">
<!ELEMENT pre - - (%inline;)* -(%pre.exclusion;) >
<!ATTLIST pre
%attrs; >
<!ELEMENT q - - (%inline;)* >
<!ATTLIST q
%attrs;
cite %URI; #IMPLIED >
<!ELEMENT blockquote - - (%flow;)* >
<!ATTLIST blockquote
%attrs;
cite %URI; #IMPLIED >
<!ELEMENT (ins|del) - - (%flow;)* >
<!ATTLIST (ins|del)
%attrs;
cite %URI; #IMPLIED
datetime %Datetime; #IMPLIED >
<!ELEMENT dl - - (dt|dd)+ >
<!ATTLIST dl
%attrs; >
<!ELEMENT dt - - (%inline;)* >
<!ELEMENT dd - - (%flow;)* >
<!ATTLIST (dt|dd)
%attrs; >
<!ENTITY % OLStyle "CDATA" >
<!ELEMENT ol - - (li)+ >
<!ATTLIST ol
%attrs; >
<!ENTITY % ULStyle "(disc|square|circle)">
<!ELEMENT ul - - (li)+ >
<!ATTLIST ul
%attrs; >
<!ENTITY % LIStyle "CDATA" >
<!ELEMENT li - - (%flow;)* >
<!ATTLIST li
%attrs; >
<!ELEMENT form - - (%flow;)* -(form) >
<!ATTLIST form
%attrs;
action %URI; #REQUIRED
method (GET|POST) GET
enctype %ContentType; "application/x-www-form-urlencoded"
accept %ContentTypes; #IMPLIED
name CDATA #IMPLIED
onsubmit %Script; #IMPLIED
onreset %Script; #IMPLIED
target %FrameTarget; #IMPLIED
accept-charset %Charsets; #IMPLIED >
<!ELEMENT label - - (%inline;)* -(label) >
<!ATTLIST label
%attrs;
for IDREF #IMPLIED
accesskey %Character; #IMPLIED
onfocus %Script; #IMPLIED
onblur %Script; #IMPLIED >
<!ENTITY % InputType
"(text | password | checkbox |
radio | submit | reset |
file | hidden | image | button)" >
<!ELEMENT input - O EMPTY >
<!ATTLIST input
%attrs;
type %InputType; text
name CDATA #IMPLIED
value CDATA #IMPLIED
checked (checked) #IMPLIED
disabled (disabled) #IMPLIED
readonly (readonly) #IMPLIED
size CDATA #IMPLIED
maxlength NUMBER #IMPLIED
src %URI; #IMPLIED
alt CDATA #IMPLIED
usemap %URI; #IMPLIED
ismap (ismap) #IMPLIED
tabindex NUMBER #IMPLIED
accesskey %Character; #IMPLIED
onfocus %Script; #IMPLIED
onblur %Script; #IMPLIED
onselect %Script; #IMPLIED
onchange %Script; #IMPLIED
accept %ContentTypes; #IMPLIED
%reserved; >
<!ELEMENT select - - (optgroup|option)+ >
<!ATTLIST select
%attrs;
name CDATA #IMPLIED
size NUMBER #IMPLIED
multiple (multiple) #IMPLIED
disabled (disabled) #IMPLIED
tabindex NUMBER #IMPLIED
onfocus %Script; #IMPLIED
onblur %Script; #IMPLIED
onchange %Script; #IMPLIED
%reserved; >
<!ELEMENT optgroup - - (option)+ >
<!ATTLIST optgroup
%attrs;
disabled (disabled) #IMPLIED
label %Text; #REQUIRED >
<!ELEMENT option - - (#PCDATA) >
<!ATTLIST option
%attrs;
selected (selected) #IMPLIED
disabled (disabled) #IMPLIED
label %Text; #IMPLIED
value CDATA #IMPLIED >
<!ELEMENT textarea - - (#PCDATA) >
<!ATTLIST textarea
%attrs;
name CDATA #IMPLIED
rows NUMBER #REQUIRED
cols NUMBER #REQUIRED
disabled (disabled) #IMPLIED
readonly (readonly) #IMPLIED
tabindex NUMBER #IMPLIED
accesskey %Character; #IMPLIED
onfocus %Script; #IMPLIED
onblur %Script; #IMPLIED
onselect %Script; #IMPLIED
onchange %Script; #IMPLIED
%reserved; >
<!ELEMENT fieldset - - (#PCDATA,legend,(%flow;)*) >
<!ATTLIST fieldset
%attrs; >
<!ELEMENT legend - - (%inline;)* >
<!ENTITY % LAlign "(top|bottom|left|right)">
<!ATTLIST legend
%attrs;
accesskey %Character; #IMPLIED >
<!ELEMENT button - -
(%flow;)* -(a|%formctrl;|form|fieldset|iframe) >
<!ATTLIST button
%attrs;
name CDATA #IMPLIED
value CDATA #IMPLIED
type (button|submit|reset) submit
disabled (disabled) #IMPLIED
tabindex NUMBER #IMPLIED
accesskey %Character; #IMPLIED
onfocus %Script; #IMPLIED
onblur %Script; #IMPLIED
%reserved; >
<!ENTITY % TFrame "(void|above|below|hsides|lhs|rhs|vsides|box|border)">
<!ENTITY % TRules "(none | groups | rows | cols | all)">
<!ENTITY % TAlign "(left|center|right)">
<!ENTITY % cellhalign
"align (left|center|right|justify|char) #IMPLIED
char %Character; #IMPLIED
charoff %Length; #IMPLIED " >
<!ENTITY % cellvalign
"valign (top|middle|bottom|baseline) #IMPLIED" >
<!ELEMENT table - -
(caption?, (col*|colgroup*), thead?, tfoot?, tbody+)>
<!ELEMENT caption - - (%inline;)* >
<!ELEMENT thead - - (tr)+ >
<!ELEMENT tfoot - - (tr)+ >
<!ELEMENT tbody - - (tr)+ >
<!ELEMENT colgroup - - (col)* >
<!ELEMENT col - O EMPTY >
<!ELEMENT tr - - (th|td)+ >
<!ELEMENT (th|td) - - (%flow;)* >
<!ATTLIST table
%attrs;
summary %Text; #IMPLIED
width %Length; #IMPLIED
border %Pixels; #IMPLIED
frame %TFrame; #IMPLIED
rules %TRules; #IMPLIED
cellspacing %Length; #IMPLIED
cellpadding %Length; #IMPLIED
%reserved;
datapagesize CDATA #IMPLIED >
<!ENTITY % CAlign "(top|bottom|left|right)">
<!ATTLIST caption
%attrs; >
<!ATTLIST colgroup
%attrs;
span NUMBER 1
width %MultiLength; #IMPLIED
%cellhalign;
%cellvalign; >
<!ATTLIST col
%attrs;
span NUMBER 1
width %MultiLength; #IMPLIED
%cellhalign;
%cellvalign; >
<!ATTLIST (thead|tbody|tfoot)
%attrs;
%cellhalign;
%cellvalign; >
<!ATTLIST tr
%attrs;
%cellhalign;
%cellvalign; >
<!ENTITY % Scope "(row|col|rowgroup|colgroup)">
<!ATTLIST (th|td)
%attrs;
abbr %Text; #IMPLIED
axis CDATA #IMPLIED
headers IDREFS #IMPLIED
scope %Scope; #IMPLIED
rowspan NUMBER 1
colspan NUMBER 1
%cellhalign;
%cellvalign; >
<![ %HTML.Frameset; [
<!ELEMENT frameset - - ((frameset|frame)+ & noframes?) >
<!ATTLIST frameset
%coreattrs;
rows %MultiLengths; #IMPLIED
cols %MultiLengths; #IMPLIED
onload %Script; #IMPLIED
onunload %Script; #IMPLIED >
]]>
<![ %HTML.Frameset; [
<!ELEMENT frame - O EMPTY >
<!ATTLIST frame
%coreattrs;
longdesc %URI; #IMPLIED
name CDATA #IMPLIED
src %URI; #IMPLIED
frameborder (1|0) 1
marginwidth %Pixels; #IMPLIED
marginheight %Pixels; #IMPLIED
noresize (noresize) #IMPLIED
scrolling (yes|no|auto) auto >
]]>
<!ELEMENT iframe - - (%flow;)* >
<!ATTLIST iframe
%coreattrs;
longdesc %URI; #IMPLIED
name CDATA #IMPLIED
src %URI; #IMPLIED
frameborder (1|0) 1
marginwidth %Pixels; #IMPLIED
marginheight %Pixels; #IMPLIED
scrolling (yes|no|auto) auto
height %Length; #IMPLIED
width %Length; #IMPLIED >
<![ %HTML.Frameset; [
<!ENTITY % noframes.content "(body) -(noframes)">
]]>
<!ENTITY % noframes.content "(%flow;)*">
<!ELEMENT noframes - - %noframes.content; >
<!ATTLIST noframes
%attrs; >
<!ENTITY % head.content "title & base?">
<!ELEMENT head - - (%head.content;) +(%head.misc;) >
<!ATTLIST head
%i18n;
profile %URI; #IMPLIED >
<!ELEMENT title - - (#PCDATA) -(%head.misc;) >
<!ATTLIST title %i18n>
<!ELEMENT base - O EMPTY >
<!ATTLIST base
href %URI; #IMPLIED
target %FrameTarget; #IMPLIED >
<!ELEMENT meta - O EMPTY >
<!ATTLIST meta
%i18n;
http-equiv NAME #IMPLIED
name NAME #IMPLIED
content CDATA #REQUIRED
scheme CDATA #IMPLIED >
<!ELEMENT style - - %StyleSheet; >
<!ATTLIST style
%i18n;
type %ContentType; #REQUIRED
media %MediaDesc; #IMPLIED
title %Text; #IMPLIED >
<!ELEMENT script - - %Script; >
<!ATTLIST script
charset %Charset; #IMPLIED
type %ContentType; #REQUIRED
src %URI; #IMPLIED
defer (defer) #IMPLIED
event CDATA #IMPLIED
for %URI; #IMPLIED >
<!ELEMENT noscript - - (%flow;)* >
<!ATTLIST noscript
%attrs; >
<!ENTITY % version "version CDATA #FIXED '%HTML.Version;'">
<![ %HTML.Frameset; [
<!ENTITY % html.content "head, frameset">
]]>
<!ENTITY % html.content "head, body">
<!ELEMENT html - - (%html.content;) >
<!ATTLIST html
%i18n; >
<!-- END DTD -->
--
Lars Eighner <http://larseighner.com/> usenet@xxxxxxxxxxxxxxx
Bush's third term begins Jan. 20th with an invocation by Rick Warren.
Obama: No hope; No change; More of the Same.