Re: fast static alphabetic (order preserving) compression algorithms/implementations
- From: Steven Pigeon <steven.pigeon@xxxxxxxxxxxx>
- Date: Mon, 09 Jul 2007 17:44:31 -0400
bryan@xxxxxxxxxx wrote:
Hello,
I am looking to identify a fast static alphabetic (order preserving)
compression algorithm for database indices (B+Trees) for use in an
open source project. The algorithm must be unencumbered by patents.
The goal would be to reduce (a) the space for the indices and (b) the
time required to perform key search on the compressed keys (hence the
compressed keys must have the same total order as the uncompressed
keys).
It is completely possible to do. While those aren't exactly
what you're looking for, they'll provide the theory and ideas
you need.
See:
http://www.nist.gov/dads/HTML/orderPreservingHuffmanCoding.html
http://www.springerlink.com/content/an9maht67tkd1a8n/
and google for "order-preserving database index compression" or other
compression keywords and you'll be on your way.
best,
S.
.
In the general case the keys are variable length unsigned byte[]s.
However, there are applications where a known structure is imposed on
the keys that can be used to choose the alphabet. One example is an
RDF database, where the keys are an array of three 64-bit long
integers serving as term identifiers for a "statement" {subjectId,
predicateId, objectId}. In this case the term identifiers form the
alphabet and the symbols are 64-bit integers rather than bytes.
There are several places within the database architecture where we
would use this algorithm. In nearly every case the keys will already
be in order and the #of keys will be bounded by the branching factor
of the B+Tree (the exception is a sort performed on compressed keys).
I imagine that some algorithms might exploit knowledge that the keys
are already in order.
Relevant algorithms that I have identified include Hu-Tucker and
variants. I have seen people indicate that arithmetic coding can be
order preserving as well. Finally, I have seen indications that
standard libraries such as the compression package integrated within
Java may produce alphabetic codes, but (a) these algorithms appear to
be adaptive (the dictionary is updated as the input is processed); and
(b) there does not appears to be any means to isolate the dictionary
required to decode the data from the compressed keys (with the
consequence that you can not compare the compressed keys since you
don't know which bytes form the dictionary and which bytes form each
key).
The database is Java and is an open source project. It would be ideal
to find an implementation in Java (or in C/C++ with a JNI interface
with a license no more restrictive than LGPL).
Thanks,
-bryan
- References:
- Prev by Date: Re: fast static alphabetic (order preserving) compression algorithms/implementations
- Next by Date: Re: fast static alphabetic (order preserving) compression algorithms/implementations
- Previous by thread: Re: fast static alphabetic (order preserving) compression algorithms/implementations
- Next by thread: Re: fast static alphabetic (order preserving) compression algorithms/implementations
- Index(es):
Relevant Pages
|