Re: Three broken compressors
- From: "cr88192" <cr88192@xxxxxxxxxxxxxxxxxx>
- Date: Sun, 21 May 2006 07:23:43 +1000
"Sachin Garg" <schngrg@xxxxxxxxx> wrote in message
news:1148146253.577132.227590@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Matt Mahoney wrote:
cr88192 wrote:
"Jim Leonard" <MobyGamer@xxxxxxxxx> wrote in message
<snip>
as noted it seems, most of the content found in the articles is plain
text...
as a result, xml likely only makes up a small portion of the content.
Wikipedia actually makes very little use of XML. It is mainly used for
titles, timestamps, and user IDs on articles. The article is just one
long string in a <text> tag. All of the structure such as headings,
links, tables, lists, etc. use special characters embedded in the text.
I am not sure if using a xml file for text compression comparison is a
good idea.
What about XML aware compressors? They can probably beat generic
compressors in your benchmark. Would you include XML parsing
compressors (like xmlppm) in your benchmark?
in my experience one can get almost deflate levels of compression simply by
using a binary coding. there are cases where deflate could do better, but
deflate is also hindered by having to process a lot of extra
representational stuff that a specialized binary coding could ignore.
more so, I suspect building an actual compressor on top of this (throwing
better sub-tree elimination and huffman coding into the mix) could do
better.
in my case, my coding was both smaller and better handled dynamic content
than wbxml in my tests. in theory, using wbxml with premade dictionaries and
schemas would allow it to do better, but I was assuming dynamic content (in
this way, the data can be decoded regardless of context).
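a minimal sketch of what a binary coding with a dynamically built tag dictionary might look like. this is not the actual format described above: the opcodes (NEW_TAG, OLD_TAG, TEXT, END), the one-byte length and index limits, and the function names are all assumptions made for illustration, and attributes and tail text are ignored to keep it short. the point is only that a tag name is sent inline the first time and by index afterwards, so the decoder needs no premade dictionary or schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical opcodes, chosen for this sketch only.
NEW_TAG, OLD_TAG, TEXT, END = 0, 1, 2, 3


def _put_str(out, s):
    """Append a length-prefixed UTF-8 string (sketch limit: 255 bytes)."""
    raw = s.encode("utf-8")
    if len(raw) > 255:
        raise ValueError("string too long for this sketch")
    out.append(len(raw))
    out += raw


def _encode_into(elem, out, tags):
    if elem.tag in tags:
        # Tag seen before: send only its index in the dynamic dictionary.
        out += bytes([OLD_TAG, tags[elem.tag]])
    else:
        if len(tags) > 255:
            raise ValueError("too many distinct tags for this sketch")
        # First use: send the name inline and remember its index.
        tags[elem.tag] = len(tags)
        out.append(NEW_TAG)
        _put_str(out, elem.tag)
    text = (elem.text or "").strip()
    if text:
        out.append(TEXT)
        _put_str(out, text)
    for child in elem:
        _encode_into(child, out, tags)
    out.append(END)


def encode(root):
    out = bytearray()
    _encode_into(root, out, {})
    return bytes(out)


def decode(data):
    tags, stack, root, pos = [], [], None, 0

    def get_str():
        nonlocal pos
        n = data[pos]
        s = data[pos + 1:pos + 1 + n].decode("utf-8")
        pos += 1 + n
        return s

    while pos < len(data):
        op = data[pos]
        pos += 1
        if op in (NEW_TAG, OLD_TAG):
            if op == NEW_TAG:
                name = get_str()
                tags.append(name)  # rebuild the dictionary in the same order
            else:
                name = tags[data[pos]]
                pos += 1
            elem = ET.Element(name)
            if stack:
                stack[-1].append(elem)
            else:
                root = elem
            stack.append(elem)
        elif op == TEXT:
            stack[-1].text = get_str()
        elif op == END:
            stack.pop()
        else:
            raise ValueError("unknown opcode")
    return root


doc = "<list><item>a</item><item>b</item><item>c</item></list>"
blob = encode(ET.fromstring(doc))
restored = decode(blob)
```

for a small document with repeated tags like the one above, the encoded form comes out shorter than the text, and that is before any entropy coding is applied on top.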
likewise, my coding was structured such that decoding managed to beat out
text parsing wrt decode speeds (and thus, implicitly, the deflate+xml combo,
as this involves both inflating and parsing).
text was also compressed, however, this was done using a markov predictor
which would not do as well for "wordy" content as lz77. it worked well
enough though at trimming down the strings for attributes and text contents.
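as a rough illustration of the general idea, here is a sketch of a simple predictor-style coder: an order-2 context (the previous two bytes) guesses the next byte, a flag bit records whether the guess was right, and only the misses are stored as literals. this is a generic scheme of that family, not the predictor used in the coding described here; the function names, the 4-byte length prefix, and the 8-flags-per-byte layout are assumptions of the sketch.

```python
def compress(data: bytes) -> bytes:
    table = {}            # (prev2, prev1) -> last byte seen after that context
    a = b = 0             # the two previous bytes, zero at the start
    out = bytearray(len(data).to_bytes(4, "big"))  # length prefix (assumed layout)
    for i in range(0, len(data), 8):
        flags = 0
        literals = bytearray()
        for bit, byte in enumerate(data[i:i + 8]):
            if table.get((a, b)) == byte:
                flags |= 1 << bit          # correct guess: costs one bit
            else:
                table[(a, b)] = byte       # miss: learn it and store the literal
                literals.append(byte)
            a, b = b, byte
        out.append(flags)
        out += literals
    return bytes(out)


def decompress(blob: bytes) -> bytes:
    n = int.from_bytes(blob[:4], "big")
    table = {}
    a = b = 0
    out = bytearray()
    pos = 4
    while len(out) < n:
        flags = blob[pos]
        pos += 1
        for bit in range(min(8, n - len(out))):
            if (flags >> bit) & 1:
                byte = table[(a, b)]       # the model predicts this byte
            else:
                byte = blob[pos]           # literal follows in the stream
                pos += 1
                table[(a, b)] = byte
            out.append(byte)
            a, b = b, byte
    return bytes(out)


sample = b'<item id="1"/><item id="2"/>' * 20
packed = compress(sample)
```

a coder like this does well on short, repetitive strings such as attribute values, but it only ever saves one byte at a time, which is consistent with it trailing lz77 on long "wordy" text where whole phrases repeat.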
the coding was also "namespace aware", however, using namespaces on all tags
would not help much (as such, namespaces generated an extra marker per tag,
and were not viewed as part of the tag name itself).
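one way to picture "an extra marker per tag" is sketched below. the marker names ("ns", "tag") and the helper functions are hypothetical; the only thing relied on is that Python's xml.etree.ElementTree reports a namespaced tag as "{uri}local". the namespace is split off and emitted as its own small marker, and the tag name stays free of any prefix.

```python
import xml.etree.ElementTree as ET


def split_tag(tag):
    """Split ElementTree's '{uri}local' form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return None, tag


def tag_markers(root):
    ns_ids = {}   # namespace uri -> small integer id, assigned on first use
    out = []
    for elem in root.iter():
        uri, local = split_tag(elem.tag)
        if uri is not None:
            # Namespaced tag: one extra marker ahead of the tag itself.
            out.append(("ns", ns_ids.setdefault(uri, len(ns_ids))))
        out.append(("tag", local))
    return out


doc = '<foo xmlns:bar="http://bar.org/bar"><bar:baz>x</bar:baz><qux/></foo>'
markers = tag_markers(ET.fromstring(doc))
```

in this layout an un-namespaced tag costs nothing extra, while a document that puts a namespace on every tag pays for one marker per tag.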
not "all" core xml features were supported, in particular, those that are
of little use for my purposes (doctype, dtd's, ...). so all that really was
supported was the core syntax, cdata (distinguished for varying reasons),
and namespaces. some other features were included, but were in retrospect
not all that sensible (binary payloads, ...).
my thoughts:
why do we need doctypes, when one can do similar, and imo better, using
namespaces, and without cluttering the syntax?...
why do we need dtd's, when similar can be done via schemas and without
cluttering the syntax?...
....
however, I suspect that my thoughts are not novel, since in most
(non-w3c) uses of xml, people seem to do things about the same way as me
anyways:
<foo xmlns="http://bar.org/foo"
xmlns:bar="http://bar.org/bar">
<bar:baz>something or another</bar:baz>
...
</foo>
but, oh well, whatever...
Sachin Garg [India]
http://www.sachingarg.com