Re: A Revolutionary Breakthrough in Data Compression!



"sebastian" <sebastiangarth@xxxxxxxxx> wrote in message news:8b5cb235-d396-4243-9170-d200aeaf8a9e@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Not really. I just figured I'd make my first post on comp.compression
with a splash. =)

What I have got is a minor improvement of the classic Huffman
algorithm. So basically, the problem with the classic approach is that
you have to transmit the statistics along with the compressed data.
This is usually addressed by modifying the algorithm to arrange the
codes using the 'canonical' scheme, which in essense enables you to
send just the length of each code instead,
resulting in exactly 256 bytes of overhead.

Actually that's not true. There are of course cases in which you may need 256 bytes, but usually it will be far less.
1) it's unusual to allow such huge codelengths - as it's very impractical for decoding. You may only need 4 or 5 bits per symbol.
2) you could compress the lengths-data. and why wouldn't you? Deflate and BZIP2 both contain nice examples of how to do that.

As it turns out, we can actually do better than that, and it doesn't
even require rearrangement of the codes: transmit the Huffman 'tree'
itself!

I'll describe the method with a bit of pseudo-code:

EMIT_TREE(NODE)
IF NODE IS LEAF
EMIT_BIT(LEAF TAG)
EMIT_BYTE(NODE DATA)
ELSE
EMIT_BIT(NONLEAF TAG)
EMIT_TREE(LEFT CHILD OF NODE)
EMIT_TREE(RIGHT CHILD OF NODE)

Reconstruction from the stream is simple and unambiguous. And since a
Huffman tree is guaranteed to contain 511 or less nodes, the upper
bound for the total size of the compression model is 511 + ( 256 * 8 )
== 2559 bits == 319.875 bytes. This is, of course, 63.875 bytes larger
than the worst (and only) case for canonical encoding, but then the
lower bound is a mere 2.375 bytes, and thus the 'average' (whatever
that means) case should be somewhere in the middle.

Your lower bound seems a little odd to me, how did you get it?

Besides the possible savings in overhead, the 'classic' Huffman
algorithm can be used, which is much more simple and efficient than
the 'canonical' method.

Cheers,

Sebastian Garth

Well it's simpler alright, but they're provably equally efficient (codelengths are not changed)
It takes very little time to calculate the canonical codes..
Canonical codes also make decoding easier


Good luck
Harold

.



Relevant Pages

  • A Revolutionary Breakthrough in Data Compression!
    ... What I have got is a minor improvement of the classic Huffman ... This is usually addressed by modifying the algorithm to arrange the ... even require rearrangement of the codes: ... EMIT_TREE(RIGHT CHILD OF NODE) ...
    (comp.compression)
  • Re: A Revolutionary Breakthrough in Data Compression!
    ... "Harold Aptroot" wrote in message ... What I have got is a minor improvement of the classic Huffman ... This is usually addressed by modifying the algorithm to arrange the ... codes using the 'canonical' scheme, which in essense enables you to ...
    (comp.compression)
  • Re: answers please...
    ... number of output states, given a uniform number of input states. ... source with a non-uniform distribution. ... that codes are possible that generate shorter average bit-length. ... For Huffman, ...
    (comp.compression)
  • Re: answers please...
    ... number of output states, given a uniform number of input states. ... source with a non-uniform distribution. ... that codes are possible that generate shorter average bit-length. ... For Huffman, ...
    (comp.compression)
  • Algorithm for license codes
    ... I know that posting here the details of a new algorithm would irk most of us, but this is the only place I know where I can ask for a reliable advice. ... I'm studying the feasibility of an algorithm based on public key cryptography that is suited for generating strong license codes for product activation. ... The public key part are the polynomials, the private part are an equivalent representation of the polynomials such that it's easy, given Y, to compute the reverse X = F^-1. ...
    (sci.crypt)