Whitespace-preserving Search & Replace in multiple XML documents



Dear Newsgroup,

I am looking for a way to search and replace some strings inside various
XML documents while at the same time binary-preserving all the
whitespace of each document (in particular the line ending convention,
the white space both *inside* the markup and inside the content).

So far, this sounds more like a plain text search-and-replace, but the
twist is that the strings should only be replaced if they match a
certain XML context (say: replace attribute name "jarfile" in any
element <jar> with attribute name "destfile", or change the entire
content of element <value>, but only when <value> immediately
follows an element <key> with a content of "OutputFile", etc.)
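
To make the first case concrete, here is a minimal sketch of the attribute
rename. It assumes Python and a regular-expression pass over the raw bytes;
neither comes from the post itself, and the names JAR_TAG, ATTR and
rename_attr are made up for illustration. It is not a real XML tokenizer:
it does not recognise comments, CDATA sections or processing instructions,
so text that merely looks like a <jar> tag inside those would be rewritten
too.

```python
import re

# A <jar ...> start tag or empty-element tag. Quoted attribute values are
# consumed whole, so a '>' inside a value does not end the tag early.
JAR_TAG = re.compile(rb'<jar(?=[\s/>])(?:[^>"\']|"[^"]*"|\'[^\']*\')*>')

# Inside a tag: either a quoted value (matched only so it can be skipped),
# or the attribute name "jarfile" followed by optional whitespace and '='.
ATTR = re.compile(rb'"[^"]*"|\'[^\']*\'|(?<=\s)jarfile(?=\s*=)')


def rename_attr(data: bytes) -> bytes:
    """Rename attribute jarfile to destfile in <jar> tags, byte for byte."""

    def fix_attr(m):
        # Quoted values come back unchanged; only the bare name is replaced.
        return b"destfile" if m.group(0) == b"jarfile" else m.group(0)

    def fix_tag(m):
        return ATTR.sub(fix_attr, m.group(0))

    return JAR_TAG.sub(fix_tag, data)


sample = (
    b'<project>\r\n'
    b'  <jar   jarfile = "a.jar"\r\n'
    b'\t\tnote="jarfile=x"/>\r\n'
    b'  <zip jarfile="b"/>\r\n'
    b'</project>\r\n'
)
result = rename_attr(sample)
```

Working on bytes (files opened with "rb" and "wb") is what keeps the CR/LF
line endings and the odd spacing inside the tag untouched; only the seven
bytes of the attribute name change.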

I don't even know if my problem has a "canonical" name, which pretty
much precludes a meaningful search on Google...

I know XSLT can do some (all?) of that, but:
a) These substitutions need to occur on many different XML files and the
XML contexts / search strings may differ from file to file, so I will
need many different stylesheets. (which could be generated
automatically, I suppose)
b) What guarantee do I have on binary-preservation of all whitespace?
(BTW this "weird" requirement arises from the need to keep the ability
to make plain textual diffs of those XML documents, which are stored
inside a source control system)

I have also looked at SAX parsers, thinking that maybe I could rely on
event notifications, but it seems that the events are not granular
enough for my situation (e.g.: AFAICT, no notification will tell me that
I have encountered a block of contiguous whitespace inside an element tag,
or what such a block is made of, for instance 3 SPC + LF + LF + TAB + TAB).
Also, the SAX parser does not seem to be able to tell me the exact
'slices' of input characters that it identified as element name,
attribute name, attribute value, whitespaces, entity reference, etc...
AFAICT, SAX will not tell me the difference between 'attr="&#x21;"' and
'attr="!"' ?
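
A small check of that last point, as a sketch assuming Python's standard
xml.sax module with its default expat-based parser (the post does not say
which SAX implementation was tried; AttrCollector and attr_values are
illustrative names):

```python
import xml.sax


class AttrCollector(xml.sax.ContentHandler):
    """Records the value of attribute 'attr' on every start tag."""

    def __init__(self):
        super().__init__()
        self.values = []

    def startElement(self, name, attrs):
        self.values.append(attrs.get("attr"))


def attr_values(xml_bytes):
    handler = AttrCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.values


escaped = attr_values(b'<e attr="&#x21;"/>')       # character reference
literal = attr_values(b'<e attr="!"/>')            # literal character
spaced = attr_values(b'<e   attr\n\t=  "!"  />')   # odd in-tag whitespace
```

All three calls hand the handler the same list, containing the single
string "!". By the time startElement fires, the character reference has
been resolved and the whitespace inside the tag is gone, so the original
bytes cannot be rebuilt from the events alone.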

Pointers, suggestions & comments appreciated.
Regards
_______________________________________________________
François Robert
(to mail me, reverse character order in reply address)