Re: Three broken compressors




"Sachin Garg" <schngrg@xxxxxxxxx> wrote in message
news:1148146253.577132.227590@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Matt Mahoney wrote:
cr88192 wrote:
"Jim Leonard" <MobyGamer@xxxxxxxxx> wrote in message


<snip>


Wikipedia actually makes very little use of XML. It is mainly used for
titles, timestamps, and user IDs on articles. The article is just one
long string in a <text> tag. All of the structure such as headings,
links, tables, lists, etc. use special characters embedded in the text.

I am not sure that using an xml file for a text compression comparison is
a good idea.

as noted, it seems most of the content found in the articles is plain
text...

as a result, xml likely only makes up a small portion of the content.

What about XML aware compressors? They can probably beat generic
compressors in your benchmark. Would you include XML parsing
compressors (like xmlppm) in your benchmark?


in my experience one can get almost deflate levels of compression simply by
using a binary coding. there are cases where deflate could do better, but
deflate is also hindered by having to process a lot of extra syntactic
overhead (brackets, quoting, repeated closing-tag names and the like) that a
specialized binary coding can simply ignore.

moreover, I suspect that building an actual compressor on top of this
(throwing better sub-tree elimination and huffman coding into the mix) could
do better still.

in my tests, my coding was both smaller than wbxml and handled dynamic
content better. in theory, using wbxml with premade dictionaries and schemas
would allow it to do better, but I was assuming dynamic content (this way,
the data can be decoded regardless of context).
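
to make the "decodable regardless of context" point concrete, here is a rough
sketch in python of one way a dictionary-free binary coding could work. this
is NOT the coding described above (the post gives no format details): the
opcodes, the NEW_STRING marker and the 2-byte length prefix are all made up
for illustration. names are sent inline the first time and as a one-byte
back-reference afterwards, so the decoder needs no premade dictionary or
schema.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical opcodes for this sketch; not taken from the post or from WBXML.
OP_OPEN, OP_CLOSE, OP_ATTR, OP_TEXT = 1, 2, 3, 4
NEW_STRING = 0xFF  # marker: a name seen for the first time follows inline


def encode(root):
    out = bytearray()
    table = {}  # name -> index, built on the fly (no premade dictionary)

    def put_text(s):
        data = s.encode("utf-8")  # sketch limit: at most 65535 bytes per string
        out.extend(struct.pack(">H", len(data)) + data)

    def put_name(s):
        if s in table:
            out.append(table[s])  # one-byte back-reference
            return
        out.append(NEW_STRING)
        put_text(s)
        if len(table) < NEW_STRING:  # indices 0..254 only
            table[s] = len(table)

    def walk(el):
        out.append(OP_OPEN)
        put_name(el.tag)
        for key, value in el.attrib.items():
            out.append(OP_ATTR)
            put_name(key)
            put_text(value)
        if el.text:
            out.append(OP_TEXT)
            put_text(el.text)
        for child in el:
            walk(child)
            if child.tail:
                out.append(OP_TEXT)
                put_text(child.tail)
        out.append(OP_CLOSE)  # no closing tag name needed

    walk(root)
    return bytes(out)


def decode(buf):
    table = []  # rebuilt in the same order the encoder built it
    stack, root, pos = [], None, 0

    def get_text():
        nonlocal pos
        (n,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        s = buf[pos:pos + n].decode("utf-8")
        pos += n
        return s

    def get_name():
        nonlocal pos
        code = buf[pos]
        pos += 1
        if code != NEW_STRING:
            return table[code]
        s = get_text()
        if len(table) < NEW_STRING:
            table.append(s)
        return s

    while pos < len(buf):
        op = buf[pos]
        pos += 1
        if op == OP_OPEN:
            el = ET.Element(get_name())
            if stack:
                stack[-1].append(el)
            else:
                root = el
            stack.append(el)
        elif op == OP_ATTR:
            key = get_name()
            stack[-1].set(key, get_text())
        elif op == OP_TEXT:
            s, top = get_text(), stack[-1]
            if len(top):  # text after a child element is that child's tail
                top[-1].tail = (top[-1].tail or "") + s
            else:
                top.text = (top.text or "") + s
        elif op == OP_CLOSE:
            stack.pop()
    return root


SAMPLE = "<log>" + "".join(
    f'<entry id="{i}" level="info"><msg>event {i}</msg></entry>'
    for i in range(50)
) + "</log>"
```

calling encode(ET.fromstring(SAMPLE)) and feeding the result to decode() gives
back an equivalent tree. note that attribute values and text are left
uncompressed here; how far this gets relative to deflate depends heavily on
the input.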

likewise, my coding was structured such that decoding managed to beat text
parsing wrt decode speed (and thus, implicitly, the deflate+xml combo, as
that involves both inflating and parsing).

text was also compressed, though this was done using a markov predictor,
which would not do as well as lz77 on "wordy" content. it worked well enough
at trimming down the strings for attributes and text contents.
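
for anyone unfamiliar with the idea, a predictor-style string compressor can
be sketched roughly as below. the post does not say how the predictor
actually worked, so everything here is an assumption: the 12-bit context
hash, the table size, and the flag-byte layout (similar in spirit to the
old PPP "Predictor" scheme) are only for illustration. the table guesses the
next byte from a hash of the last few bytes; a correct guess costs one bit,
a miss costs one bit plus the literal byte.

```python
TABLE_SIZE = 4096  # assumed size; one predicted byte per context hash


def _next_hash(h, byte):
    # Fold the latest byte into a 12-bit hash of the last few bytes.
    return ((h << 4) ^ byte) & 0xFFF


def compress(data: bytes) -> bytes:
    table = bytearray(TABLE_SIZE)  # context hash -> predicted next byte
    out = bytearray()
    h = 0
    for start in range(0, len(data), 8):
        flags = 0
        literals = bytearray()
        for bit, b in enumerate(data[start:start + 8]):
            if table[h] == b:
                flags |= 1 << bit      # hit: only the flag bit is stored
            else:
                table[h] = b           # miss: learn it and store the literal
                literals.append(b)
            h = _next_hash(h, b)
        out.append(flags)
        out.extend(literals)
    # 4-byte length header so the decoder knows where to stop
    return len(data).to_bytes(4, "big") + bytes(out)


def decompress(blob: bytes) -> bytes:
    n = int.from_bytes(blob[:4], "big")
    table = bytearray(TABLE_SIZE)
    out = bytearray()
    h = 0
    pos = 4
    while len(out) < n:
        flags = blob[pos]
        pos += 1
        for bit in range(min(8, n - len(out))):
            if flags & (1 << bit):
                b = table[h]           # decoder makes the same guess
            else:
                b = blob[pos]
                pos += 1
                table[h] = b
            out.append(b)
            h = _next_hash(h, b)
    return bytes(out)
```

this does well on short, repetitive strings (attribute values and the like),
but a miss on nearly every byte makes the output slightly larger than the
input, which fits the remark above about "wordy" text and lz77.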

the coding was also "namespace aware", though putting a namespace on every
tag would not help the size much (each namespace generated an extra marker
per tag, and was not viewed as part of the tag name itself).

not "all" core xml features were supported; in particular, I left out those
I had little use for (doctype, dtd's, ...). so all that was really supported
was the core syntax, cdata (kept distinct for various reasons), and
namespaces. some other features were included, but in retrospect were not
all that sensible (binary payloads, ...).


my thoughts:
why do we need doctypes, when one can do something similar, and imo better,
using namespaces, and without cluttering the syntax?...

why do we need dtd's, when something similar can be done via schemas,
without cluttering the syntax?...

....


however, I suspect that my thoughts are not novel, since I notice that in
most (non-w3c) uses of xml, people seem to do things about the same way as I
do anyway:

<foo xmlns="http://bar.org/foo"
     xmlns:bar="http://bar.org/bar">
  <bar:baz>something or another</bar:baz>
  ...
</foo>
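
as a small illustration of treating the namespace as something separate from
the tag name (as in the coding described earlier), here is how the example
above looks to a namespace-aware parser. this uses python's standard
xml.etree.ElementTree; split_name is a hypothetical helper, and the "..."
placeholder from the example is left out of the parsed document.

```python
import xml.etree.ElementTree as ET

DOC = """<foo xmlns="http://bar.org/foo"
     xmlns:bar="http://bar.org/bar">
  <bar:baz>something or another</bar:baz>
</foo>"""


def split_name(tag):
    """Split ElementTree's '{uri}local' tag form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return None, tag  # element is in no namespace


root = ET.fromstring(DOC)
# One (namespace, local name) pair per element, in document order.
names = [split_name(el.tag) for el in root.iter()]
```

the prefixes themselves ("bar:" and the default) disappear; what remains per
tag is a namespace uri and a local name, and a binary coding is free to store
those as two separate pieces.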

but, oh well, whatever...

Sachin Garg [India]
http://www.sachingarg.com


