Re: XML-WRT 3.0 (a state-of-the-art XML compressor) has been released




Matt Mahoney wrote:
inikep@xxxxxxxxx wrote:
XML-WRT 3.0 (XML compressor) has been released at:
http://sourceforge.net/project/showfiles.php?group_id=176333

XML-WRT is a high-performance XML compressor. It transforms XML into a
more compressible form and uses zlib (default), LZMA, PPMVC, or FastPAQ8
as the back-end compressor. It is similar to XMill, but has many
improvements, such as a semi-dynamic dictionary.

best regards,
Przemyslaw

Version 3.0 goes from #24 to #8 on the large text benchmark.
http://cs.fit.edu/~mmahoney/compression/text.html#1653

Some results on enwik8 and enwik9 (compressed size in bytes):

                                       enwik8        enwik9
xml-wrt 2.0 -l6 -b255 -m255 -s -f8     23,199,202    196,914,328
xml-wrt 3.0 -l11 -b255 -m255 -f24      19,663,305    165,274,422
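
To put the sizes above in perspective, here is a small sketch that
converts them to bits per character and to the relative saving of 3.0
over 2.0. It assumes the standard input sizes of 10^8 bytes for enwik8
and 10^9 bytes for enwik9; the function names are mine, chosen only for
illustration.

```python
# Assumed uncompressed sizes of the benchmark files, in bytes.
ENWIK8_BYTES = 10**8
ENWIK9_BYTES = 10**9

# Compressed sizes quoted in the post: (enwik8, enwik9).
RESULTS = {
    "xml-wrt 2.0": (23_199_202, 196_914_328),
    "xml-wrt 3.0": (19_663_305, 165_274_422),
}


def bits_per_char(compressed: int, original: int) -> float:
    """Average number of output bits per input byte."""
    return 8 * compressed / original


def relative_saving(old: int, new: int) -> float:
    """Fraction by which `new` is smaller than `old`."""
    return (old - new) / old


if __name__ == "__main__":
    for name, (e8, e9) in RESULTS.items():
        print(f"{name}: enwik8 {bits_per_char(e8, ENWIK8_BYTES):.3f} bpc, "
              f"enwik9 {bits_per_char(e9, ENWIK9_BYTES):.3f} bpc")
    old8, old9 = RESULTS["xml-wrt 2.0"]
    new8, new9 = RESULTS["xml-wrt 3.0"]
    print(f"saving on enwik8: {relative_saving(old8, new8):.1%}")
    print(f"saving on enwik9: {relative_saving(old9, new9):.1%}")
```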

I am still testing it as a preprocessor for ppmonstr. Some results on
enwik8 (the 3.0 result is not posted yet):

ppmonstr J -m1700 -o16 = 19,055,092
xml-wrt 2.0 -l0 -w -s -c -b255 -m100 -e2300 | ppmonstr J -m1650 -o64 = 18,625,624
xml-wrt 3.0 -l0 -b255 -m255 -3 -s -e7000 | ppmonstr J -m1650 -o64 = 18,494,374

-- Matt Mahoney

Updated results on xml-wrt|ppmonstr: enwik9 goes from 150,651,873 to
150,004,636. The optimal dictionary size is now larger (up from 10000
to 20000).
http://cs.fit.edu/~mmahoney/compression/text.html#1500

The work on paq8hp4 shows that much better dictionary optimization is
possible. For example, simply sorting the dictionary makes it more
compressible. The paq8hp4 dictionary is sorted extensively by
syntactic category (person, place, adjective, adverb, etc.) and suffix
sorted within these groups. I understand from Alexander Ratushnyak
that this dictionary organization is being done manually with the help
of custom utilities. As a result, similar words are assigned codes
that differ only in the last few bits. paq8hp4 then uses sparse models
that discard those low bits when forming the context. You can examine
the dictionary by running paq8hp4, which leaves it behind as a
temporary file. To see how codes are assigned, compress the
dictionary with option -0.
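
The idea can be illustrated with a toy sketch. This is not paq8hp4's
actual code assignment: the word list, the category labels, the
alphabetical ordering of categories, and the choice of two low bits are
all assumptions made up for the example. It only shows how sorting by
category and then by suffix puts similar words at neighbouring codes,
so that dropping the low bits yields a shared context.

```python
# Hypothetical toy dictionary of (word, syntactic category) pairs.
WORDS = [
    ("walking", "verb"), ("London", "place"), ("quickly", "adverb"),
    ("talking", "verb"), ("Paris", "place"), ("slowly", "adverb"),
    ("running", "verb"), ("Berlin", "place"), ("often", "adverb"),
    ("jumping", "verb"), ("Madrid", "place"), ("never", "adverb"),
]


def suffix_key(word: str) -> str:
    """Sort key that compares words from their last letter backwards."""
    return word.lower()[::-1]


def assign_codes(words):
    """Sort by category, then by suffix, and number the words in order."""
    ordered = sorted(words, key=lambda wc: (wc[1], suffix_key(wc[0])))
    return {word: code for code, (word, _category) in enumerate(ordered)}


def sparse_context(code: int, low_bits: int = 2) -> int:
    """Discard the low bits of a code, keeping only the group it falls in."""
    return code >> low_bits


if __name__ == "__main__":
    codes = assign_codes(WORDS)
    for word, code in sorted(codes.items(), key=lambda item: item[1]):
        print(f"{code:04b}  context={sparse_context(code)}  {word}")
```

With four words per category, each category happens to fill exactly one
group of four codes, so all words in a category share one context after
the two low bits are dropped. A real dictionary would not line up this
neatly.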

-- Matt Mahoney
