Re: fast static alphabetic (order preserving) compression algorithms/implementations



bryan@xxxxxxxxxx wrote:
Hello,

I am looking to identify a fast static alphabetic (order preserving)
compression algorithm for database indices (B+Trees) for use in an
open source project. The algorithm must be unencumbered by patents.
The goal would be to reduce (a) the space for the indices and (b) the
time required to perform key search on the compressed keys (hence the
compressed keys must have the same total order as the uncompressed
keys).

It is completely possible to do. While those aren't exactly
what you're looking for, they'll provide the theory and ideas
you need.

See:

http://www.nist.gov/dads/HTML/orderPreservingHuffmanCoding.html
http://www.springerlink.com/content/an9maht67tkd1a8n/


and google for "order-preserving database index compression" or other
compression keywords and you'll be on your way.

best,

S.



In the general case the keys are variable length unsigned byte[]s.
However, there are applications where a known structure is imposed on
the keys that can be used to choose the alphabet. One example is an
RDF database, where the keys are an array of three 64-bit long
integers serving as term identifiers for a "statement" {subjectId,
predicateId, objectId}. In this case the term identifiers form the
alphabet and the symbols are 64-bit integers rather than bytes.

There are several places within the database architecture where we
would use this algorithm. In nearly every case the keys will already
be in order and the #of keys will be bounded by the branching factor
of the B+Tree (the exception is a sort performed on compressed keys).
I imagine that some algorithms might exploit knowledge that the keys
are already in order.

Relevant algorithms that I have identified include Hu-Tucker and
variants. I have seen people indicate that arithmetic coding can be
order preserving as well. Finally, I have seen indications that
standard libraries such as the compression package integrated within
Java may produce alphabetic codes, but (a) these algorithms appear to
be adaptive (the dictionary is updated as the input is processed); and
(b) there does not appears to be any means to isolate the dictionary
required to decode the data from the compressed keys (with the
consequence that you can not compare the compressed keys since you
don't know which bytes form the dictionary and which bytes form each
key).

The database is Java and is an open source project. It would be ideal
to find an implementation in Java (or in C/C++ with a JNI interface
with a license no more restrictive than LGPL).

Thanks,

-bryan

.



Relevant Pages

  • fast static alphabetic (order preserving) compression algorithms/implementations
    ... The algorithm must be unencumbered by patents. ... time required to perform key search on the compressed keys (hence the ... RDF database, where the keys are an array of three 64-bit long ... standard libraries such as the compression package integrated within ...
    (comp.compression)
  • Re: How do I hash this?
    ... I got some primary key failures which would indicate matching keys. ... Yes they'll ultimately be loaded in to a database. ... Your hash function could either require that the 3 integers be sorted ... rest of the algorithm. ...
    (comp.programming)
  • Re: How do I hash this?
    ... I got some primary key failures which would indicate matching keys. ... Yes they'll ultimately be loaded in to a database. ... Your hash function could either require that the 3 integers be sorted ... rest of the algorithm. ...
    (comp.programming)
  • Recompressing poorly compressed files
    ... compression algorithm was not very good, and the 2nd one is very ... just compressing with a good algorithm in the first place of course. ... algorithms for compression and decompression, or even if the're a bit ... reversible (e.g. 'decompressing' a gif into a bitmap means losing some ...
    (comp.compression)
  • Re: Computer being developed modeled after human brain
    ... innate skills. ... run-length-encoding compression module. ... If the "edge detection" algorithm is not useful for ... edge detection algorithm will be replaced. ...
    (comp.ai.philosophy)