correction (Re: RFC, an ugly parser hack (and a bin-xml variant))
- From: "cr88192" <cr88192@xxxxxxxxxxxxxxxxxx>
- Date: Mon, 5 Sep 2005 14:12:02 +1000
>
> it is, well, significantly faster than my textual parser, largely because
> of the dramatic reduction in memory allocation. this is partly because, as
> a matter of the format's operation, most strings are merged. likewise, it
> is a bit smaller (around 20-30% the original size in my testing), which is
> a bit worse than what I can get from "real" compressors, but this is no
> big loss.
>
just checked and recalculated the percentages, and realized it was doing
somewhat better than this, eg, around 10% of the original size for some
larger files (around 900kB initial, around 1MB after being spit back out
from my app with different formatting).
binary files are presently about 2.5x as large as the output from gzip
(eg: initial about 900kB, my format about 100kB, gzip about 40kB).
somehow I had not taken this into account, remembering instead my initial
results with smaller xml files (eg: 1.5kB to 400 bytes, ...).
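for reference, the ratios implied by the figures above can be worked out as follows. this is only arithmetic on the approximate sizes quoted in this post; the variable names are made up for illustration.

```python
# Approximate sizes quoted in the post, in bytes.
original = 900_000       # input xml file
binary_format = 100_000  # output of the binary format
gzip_output = 40_000     # output of gzip on the input file


def percent_of(size, base):
    """Return size as a percentage of base."""
    return 100.0 * size / base


binary_pct = percent_of(binary_format, original)  # binary vs. original
gzip_pct = percent_of(gzip_output, original)      # gzip vs. original
binary_vs_gzip = binary_format / gzip_output      # how much larger than gzip

# The small-file case behind the earlier 20-30% estimate.
small_pct = percent_of(400, 1500)

print(f"binary format: {binary_pct:.1f}% of original")
print(f"gzip:          {gzip_pct:.1f}% of original")
print(f"binary / gzip: {binary_vs_gzip:.1f}x")
print(f"small file:    {small_pct:.1f}% of original")
```

so the large file lands near 11% of its original size, while the small file sits near 27%, which is where the earlier estimate came from.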
as for huffman compression, if done, the result would likely come at least
close to, or maybe beat, that of gzip. this is difficult to predict though,
given the significant differences between the algos (gzip might win due to
its ability to exploit patterns spanning multiple tags, but might be hurt by
its inability to deal with regular but predictable variations in the pattern).
gzip'ing the binary variant leads to an output of about 30kB, so about 10kB
less than gzip'ing the input file. a specialized compressor may thus have a
chance.
the idea would be to encode each tag as a huffman code, possibly using an
lz77 or markov variant for the strings (lz77+huffman is the basis of gzip
anyway), ...
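a minimal sketch of the "each tag as a huffman code" idea, assuming the tags are already available as a list of names. the tag names and counts are invented for illustration, and the codes are kept as bit strings rather than packed bits; this is not the format's actual encoder.

```python
import heapq
from collections import Counter


def huffman_codes(freqs):
    """Map each symbol to a bit string, given {symbol: count}."""
    if len(freqs) == 1:
        return {sym: "0" for sym in freqs}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {sym: ""})
            for i, (sym, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        # Prefix the two lightest subtrees with 0 and 1 and merge them.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]


# Hypothetical tag stream: a few tags dominate, as in most xml files.
tags = (["item"] * 50 + ["name"] * 25 + ["price"] * 20
        + ["note"] * 4 + ["catalog"] * 1)
codes = huffman_codes(Counter(tags))
encoded = "".join(codes[t] for t in tags)

for tag, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
    print(f"{tag:8s} {code}")
print(len(encoded), "bits, vs", 3 * len(tags), "bits at a fixed 3 bits per tag")
```

the frequent tags end up with the shortest codes, which is where the gain over a fixed-width tag index would come from. the strings would still need their own coder.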
but, then again, speed may no longer be good. by this point it may have
dropped somewhat below the speed of the normal text printer/parser,
effectively losing part of the gain.
actually, it may yet be slower than deflate, eg, given my tendency to be
lazy and use adaptive huffman coding most of the time (slower but generally
easier to manage than the static varieties used in gzip/deflate). actually,
the varieties I use are more often "quasi-static", eg, they only update
every so often, vs after every symbol (I can, for example, encode a few kB
and then rebuild the trees, which is faster than a pure-adaptive variant,
but slower than static). as a result, for decoding at least, I can still use
an index table (vs. having to resort to decoding the file a single bit at a
time). one then has to tune how often to rebuild the trees/tables
(rebuilding more often hurts speed, but typically helps compression).
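the "quasi-static" scheme described above could look roughly like this. it is a sketch under assumptions, not the code actually used: symbols are bytes, all counts start at 1, output is a bit string rather than packed bits, and the names (`build_code`, `rebuild_interval`) are made up. for brevity the decoder walks the bits one at a time; since the code stays fixed between rebuilds, a real decoder could build a lookup table at each rebuild instead.

```python
import heapq


def build_code(counts):
    """Build a Huffman code {symbol: bit string} from a list of counts."""
    heap = [(c, sym, {sym: ""}) for sym, c in enumerate(counts)]
    heapq.heapify(heap)
    next_id = len(counts)  # unique tie-breaker for merged nodes
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(data, rebuild_interval=2048):
    counts = [1] * 256          # every byte value starts out equally likely
    code = build_code(counts)
    out = []
    for i, byte in enumerate(data, 1):
        out.append(code[byte])
        counts[byte] += 1
        if i % rebuild_interval == 0:   # rebuild only every so often
            code = build_code(counts)
    return "".join(out)


def decode(bits, rebuild_interval=2048):
    counts = [1] * 256          # must mirror the encoder's starting state
    lookup = {c: s for s, c in build_code(counts).items()}
    out = bytearray()
    cur = ""
    for bit in bits:
        cur += bit
        sym = lookup.get(cur)
        if sym is None:
            continue            # not a complete code yet
        out.append(sym)
        counts[sym] += 1
        cur = ""
        if len(out) % rebuild_interval == 0:  # same schedule as the encoder
            lookup = {c: s for s, c in build_code(counts).items()}
    return bytes(out)


sample = b"<item><name>x</name></item>" * 400
packed = encode(sample, rebuild_interval=1024)
assert decode(packed, rebuild_interval=1024) == sample
print(len(sample) * 8, "bits in,", len(packed), "bits out")
```

the encoder and decoder have to agree on both the starting counts and the rebuild interval, since no tables are sent in the stream. `rebuild_interval` is the knob mentioned above: a smaller value adapts sooner but spends more time rebuilding.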
not like it matters probably.
I am just a lame hobbyist...