Re: Putting a "<" in an attribute value (was about CDATA sections)



Jon Noring wrote:

> As an addendum to my prior message where I asked if there is an
> absolute ban on using the "<" character in attribute values (for
> well-formed XML documents) no matter how the "<" is represented.
>
> Googling around at various "authorities" on this topic I get different
> answers. I suppose this is to be expected. <laugh/>

Yes. Google is a fine thing, but the pages it indexes are not vetted
by any authority.

> To summarize, there are four mechanisms by which the "<" character may
> be included in an attribute value, some or all of which are illegal
> per XML well-formedness rules:
>
> 1) <foo bar="is x < y ?">

No.

> 2) <foo bar="is x &lt; y ?">

Yes.

> 3) <foo bar="is x &#x003C; y ?">

Yes.

> 4) <foo bar="is x &lessthan; y ?">

That is well-formed.

> a) where in the DTD we have <!ENTITY lessthan "<">

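No. The declaration itself is legal XML, but the entity's replacement
text is a literal <, so referring to it in an attribute value is not
well-formed.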

> b) where in the DTD we have <!ENTITY lessthan "&lt;">

That's OK.

> c) where in the DTD we have <!ENTITY lessthan "&#x003C;">

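Not quite. A character reference inside an entity declaration is
expanded when the declaration is read (section 4.5 of the spec), so
the replacement text here is a literal <, just as in (a). This is why
the spec (section 4.6) declares lt itself with the double escape
"&#38;#60;".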
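
The six cases above can also be checked mechanically. What follows is
only a minimal sketch, assuming Python's standard-library ElementTree
(which sits on the non-validating expat parser) is a fair stand-in for
"a conforming XML parser"; the helper name is_well_formed is made up
for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the stdlib (expat-based) parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Templates for the element and for the internal DTD subset.
ATTRIBUTE = '<foo bar="is x {} y ?"/>'
DOCTYPE = '<!DOCTYPE foo [<!ENTITY lessthan "{}">]>'

cases = {
    "1":  ATTRIBUTE.format("<"),
    "2":  ATTRIBUTE.format("&lt;"),
    "3":  ATTRIBUTE.format("&#x003C;"),
    "4a": DOCTYPE.format("<") + ATTRIBUTE.format("&lessthan;"),
    "4b": DOCTYPE.format("&lt;") + ATTRIBUTE.format("&lessthan;"),
    "4c": DOCTYPE.format("&#x003C;") + ATTRIBUTE.format("&lessthan;"),
}

results = {label: is_well_formed(doc) for label, doc in cases.items()}
for label, ok in results.items():
    print(label, "accepted" if ok else "rejected")
```

Running it against a second parser (rxp, say) is a cheap way to tell a
parser quirk from a rule in the spec.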

> From the latest XML spec (section 3.1, rule 41 and associated WFC),
> see http://www.w3.org/TR/REC-xml/#NT-AttValue , it says
>
> "No < in Attribute Values.
> "The replacement text of any entity referred to directly or
> indirectly in an attribute value MUST NOT contain a <."
>
> So it is clear from this that #1 and #4a are illegal. But the others
> are ambiguous (section 2.4 essentially says numeric character
> references are equivalent to the escape strings.) It partly seems to
> hinge around the definition of an "entity".

All these terms have their formal definition in SGML (ISO 8879:1986).
You may want to borrow a copy of Goldfarb, C, "The SGML Handbook" (OUP)
to check them out, but beware the formal standards-ese language (Charles
is a lawyer :-) XML has inherited these definitions with very few
changes.

It may help to understand what happens: this check attaches to the
characters making up the file at the time of parsing, before general
entity references are replaced by their values. So a literal < in an
attribute value is not well-formed, but &lt; or &#x3c; is fine because
neither of them puts a literal < character in the value as written.
Once the document has passed these checks, an application will receive
a data representation of the document from the parser, which includes
both the structural information (where the markup nodes were) and the
character data content information (where the document text is). This
is variously known in assorted circles as "the grove", "the
post-schema-validation infoset" and other terms. How it is presented
to the application varies, but at this stage all physical markup has
disappeared (or rather, been turned into pointers of some kind) and
all entity references and character references have been resolved.
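
To see what "resolved" means from the application's side, here is a
small sketch, assuming Python's standard-library minidom as the
application-facing API (the thread does not name one) and using the
test document from this thread. The application reads a plain <
character, and the reference only reappears when the element is
written back out as markup.

```python
from xml.dom import minidom

SOURCE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE header [
<!ELEMENT header (#PCDATA)>
<!ATTLIST header title CDATA #REQUIRED>
<!ENTITY lessthan "&lt;">
]>
<header title="Is A &lessthan; B?"> ... </header>
"""

document = minidom.parseString(SOURCE)
header = document.documentElement

# What the application sees: the reference has already been resolved.
title = header.getAttribute("title")
print(title)

# Writing the element back out escapes the character as markup again.
serialized = header.toxml()
print(serialized)
```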

One way to get a handle on this (and to solve any other questions of
validity or invalidity) is to install a validating parser like onsgmls
or rxp, which run from the command line. onsgmls in particular is
useful (despite its now having some small areas of non-conformance)
in that it can output a format called ESIS, which is a line-by-line
echo of the markup interpretation. As an example, here is your XML
file:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE header [
<!ELEMENT header (#PCDATA)>
<!ATTLIST header title CDATA #REQUIRED>
<!ENTITY lessthan "&lt;">
]>
<header title="Is A &lessthan; B?"> ... </header>

and here is onsgmls's unsuppressed output (there's a -s option to turn
this off and simply report validity or not):

$ onsgmls -wxml /usr/share/sgml/xml.dcl test.xml
onsgmls:/usr/share/sgml/xml.dcl:1:W: SGML declaration was not implied
?xml version="1.0" encoding="ISO-8859-1"
Atitle CDATA Is A < B?
(header
- ...
)header
C
$

Ignore the warning about the SGML declaration for the moment. The ESIS
output clearly shows the data and markup being dissected and exposed
for processing. Lines beginning with A are attributes (name, declared
type, and value), ( is the start-tag of an element type, ) is the
end-tag, - is character data, and a final C reports that the document
was conforming.
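
If onsgmls is not to hand, a rough imitation of those lines can be
produced with any event-based parser. This is only a sketch, assuming
Python's standard-library expat bindings; it is not real ESIS. The
function name esis_like is hypothetical, every attribute is labelled
CDATA without consulting the DTD, the ?xml line is omitted, and because
expat does not validate, the final C only means the parse succeeded.

```python
import xml.parsers.expat

def esis_like(xml_bytes):
    """Return ESIS-style lines for a small XML document (not real ESIS)."""
    lines = []

    def start(name, attrs):
        # Attribute lines come before the start-tag line, as in onsgmls.
        for key, value in attrs.items():
            lines.append(f"A{key} CDATA {value}")  # type is assumed
        lines.append(f"({name}")

    def end(name):
        lines.append(f"){name}")

    def data(text):
        lines.append(f"-{text}")

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver each text run in one piece
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = data
    parser.Parse(xml_bytes, True)  # raises ExpatError if not well-formed
    lines.append("C")  # here: "parsed without error", nothing more
    return lines

SOURCE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE header [
<!ELEMENT header (#PCDATA)>
<!ATTLIST header title CDATA #REQUIRED>
<!ENTITY lessthan "&lt;">
]>
<header title="Is A &lessthan; B?"> ... </header>
"""

lines = esis_like(SOURCE)
print("\n".join(lines))
```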

> The plot thickens when looking at the 1998 first edition of the XML
> spec, http://www.w3.org/TR/1998/REC-xml-19980210.html#sec-starttags .
> It says:
>
> "No < in Attribute Values.
> "The replacement text of any entity referred to directly or
> indirectly in an attribute value (other than "&lt;") must not
> contain a <."

Right. That means the replacement text mustn't contain a literal <
sign like 4(a) above. It may well resolve to a < sign at the end of
the day, but for the purposes of this rule we're only concerned with
the characters actually in the replacement text, not what they
represent.
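
How replacement text is built from a declaration (section 4.5 of the
XML spec) can be sketched in a few lines: character references are
expanded once, general entity references are left alone. This is an
illustration under assumptions, not a parser; the function name
replacement_text is hypothetical and parameter-entity references are
ignored.

```python
import re

# Matches &#60; (decimal) or &#x3C; (hex); XML requires a lowercase "x".
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

def replacement_text(literal_value):
    """Expand character references only; keep general entity references."""
    def expand(match):
        hex_digits, dec_digits = match.groups()
        return chr(int(hex_digits, 16) if hex_digits else int(dec_digits))
    # A single pass: the expanded result is not rescanned.
    return CHAR_REF.sub(expand, literal_value)

for literal in ["<", "&lt;", "&#x003C;", "&#38;#60;"]:
    text = replacement_text(literal)
    verdict = "contains <" if "<" in text else "no literal <"
    print(f"{literal!r:14} -> {text!r:8} {verdict}")
```

The last literal is the one the spec itself uses to declare lt, and it
is the only numeric form here that survives the rule.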

> The difference between the current XML spec and the first 1998 spec
> is that in the 1998 spec it clearly says "&lt;" may be used to
> represent the literal "<" character in an attribute value (and I
> would assume, by extension in section 2.4, so would be &#x003C; or
> &#60;). So in the 1998 spec, #2 and #4b appear legal, and likely #3
> and #4c.

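Yes for #2, #3 and #4b. #4c fails under either edition, because the
character reference in the declaration is expanded when the
declaration is read, which leaves a literal < in the replacement text.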

> So what does the removal of the phrase '(other than "&lt;")' mean
> in the current XML spec edition? Was it removed because it is
> superfluous

Yes.

> (that is, &lt;, and &#x003C; are not considered "any
> entity" -- this is supported in that in section 2.4 XML calls &lt; a
> "string", not an "entity".) Or was it a change to have a total,
> absolute ban on using that character no matter how it is represented?

It was just to avoid clouding the issue, so far as I know.

///Peter
--
XML FAQ: http://xml.silmaril.ie/
