new lame effort, self-compressed bin-xml...
- From: "cr88192" <cr88192@xxxxxxxxxxxxxxxxxx>
- Date: Mon, 5 Sep 2005 21:31:58 +1000
I am not sure if anyone will have any comments on this.
I know, this has little to do with compression in general.
any comments would be nice...
few have been interested in talking thus far, likely because this is
unlikely to have much relevance on the "general" stance.
it is, if anything, a dubious effort. anymore, I don't care, I will do my
own thing. to hell with most standards, it is not like I need to
interoperate with anything, most people obsessing on xml unconcerned with
much outside the web/network realms. hell, I use xml, partly because it
works as a fairly general and useful tool for things largely unrelated to
its original purpose...
so, ok, first off it is a binary xml format.
second off, it parses a lot faster than my text xml parser.
thirdly, for my input files it almost matches bytewise lz77 and is not that
far behind gzip.
it is currently under 1/2 the size of basic self-contained wbxml files
(likewise, my format supports namespaces, which is imo a plus).
(one can, eg, just deflate the xml, but then one still has to use the
textual parser, and thus can't squeeze any more speed out of it).
Some Example Sizes
900kB input (slighlty over 1MB when spit back out as a result of
formatting);
200kB wbxml output;
87kB my format;
40kB gzip.
some of my bytewise lz77 variants were going at around 60kB.
gzip'ing my format results in about 29kB for output.
both fpaq0 and vitter-huffman result in about 54kB for compressing my format
and around 330kB for the input xml.
I don't have any measured times, but I can tell the difference.
reading the 900kB xml file with my textual parser takes a few seconds,
wheras reading a file in my format, writing another file, reading in that
file, and writing out some textual xml (so I can look at it to make sure
things haven't gone weird) takes well under a second (xml writing tends to
be a bit faster than parsing in my experience).
in the case of my parser, according to the profiler most of the time goes
into allocating memory. as for the binary variant, it tends to merge a lot
of data and as a result needs to allocate less memory.
no schemas, dtd's, external strings tables, ... are used in the attempt to
save space. the format operates, as it does, dynamically, and based only on
the xml itself.
the format is bytewise for speed and simplicity reasons. huffman coding
could save some space, but would risk making the readers/writers slower.
for the most part strings are merged via mru caching. any new strings are
added to the mru, and any strings allready in the cache are moved to the
front of the list.
seperate mru lists are maintained for text, tag names, attribute names, and
namespace prefixes. the codespace assigned to tags and attributes, however,
overlaps and is distinguished via context.
inline text strings (eg: ones that are not present in the mru list),
however, are compressed by a simplistic markov predictor using an order-1
context and a 16kB window. this does a decent job at pruning down strings.
the encoding is specialized for utf-8, eg, to minimize the occurance of
needing to escape characters (however, at present the end-of-string marker
itself is not encoded, which could save a little bit of space at the cost of
a little weird logic).
the context is reset for each string as to better allow eliminating
potentially common patterns at the start of strings, and assuming little
correlation between the end of one string and the start of another.
prefix characters can be used for tag-names as a means by which to avoid
encoding the end-of-attributes or end-of-contents markers (no prefix char
implies needing to encode the markers). potentially, this could cause extra
strain on the mru or increase the need to encode literal strings, but in my
testing it is effective at reducing the size.
(I had also tried statistical approaches to eliminating end markers, but
these didn't turn out as well...).
....
Trimmed overview:
Byte Values
0x00: general purpose ending marker
0x01..0x1F: Special, Reserved
0x20..0x3E: Namespace Prefix MRU
0x3F: Namespace String
0x40..0x7E: Opening Tag/Attr MRU
0x7F: Opening Tag/Attr String
0x80..0xFE: Text MRU
0xFF: Text String
Node(Default Case)=[<NS>] <TAG> <ATTR*> 0 <BODY*> 0
Attr=[<NS>] <TAG> <TEXT*>
Body=<NODE>|<TEXT>
CDATA=0x12 <TEXT*> 0
Binary-Data=0x13 [<NS>] <TAG> <ATTR*> 0 <UVLI len> <BYTES[len]>
0x10, Integer Text String ('[-](0..9)+'), text looked like an integer, and
was encoded as such.
The MSB will indicate the presence of more bytes (high-low order), the LSB
of the final byte will give the sign. Range is limited to that of a 32 bit
signed integer.
0x11, Markov Text String
A variation of lz+markov modeling is used for encoding a string (as before,
the result is added to the MRU).
By default, Nodes will include both a body and attributes along with the
appropiate end markers.
As an exception to this, a prefix may be used with the tag name which will
modify the interpretation:
'/', omit contents and contents end marker;
':', omit attributes and attributes end marker;
'=', omit both attributes and contents.
Markov+LZ Coding
0xF0..0xFF: Reference.
The low 4 bits will give the index.
The next byte will give the match length.
As a special case, if the length is 0, this will serve as an escape for the
previous byte (1 is reserved).
At present I will specify that the window is 16kB.
A 12 bit hash will be used: (((s[-3]<<8)^(s[-2]<<4)^s[-1])&0xFFF).
Updating a hash will consist of:
hash=(hash<<4)^c;
Each string will start with an initial hash key of 0x000.
.
- Follow-Ups:
- Re: new lame effort, self-compressed bin-xml...
- From: jackokring
- Re: new lame effort, self-compressed bin-xml...
- Prev by Date: Re: How compress a string into a compressed string?
- Next by Date: attn: minette - really surprising finances - cozli ip - (1/1)
- Previous by thread: help with batch file for 7z command line?
- Next by thread: Re: new lame effort, self-compressed bin-xml...
- Index(es):
Relevant Pages
|