Re: A Revolutionary Breakthrough in Data Compression!
- From: "Harold Aptroot" <harold.aptroot@xxxxxxxxx>
- Date: Sat, 26 Sep 2009 13:52:20 +0200
"sebastian" <sebastiangarth@xxxxxxxxx> wrote in message news:8b5cb235-d396-4243-9170-d200aeaf8a9e@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Not really. I just figured I'd make my first post on comp.compression
with a splash. =)
What I have got is a minor improvement of the classic Huffman
algorithm. So basically, the problem with the classic approach is that
you have to transmit the statistics along with the compressed data.
This is usually addressed by modifying the algorithm to arrange the
codes using the 'canonical' scheme, which in essense enables you to
send just the length of each code instead,
resulting in exactly 256 bytes of overhead.
Actually that's not true. There are of course cases in which you may need 256 bytes, but usually it will be far less.
1) it's unusual to allow such huge codelengths - as it's very impractical for decoding. You may only need 4 or 5 bits per symbol.
2) you could compress the lengths-data. and why wouldn't you? Deflate and BZIP2 both contain nice examples of how to do that.
As it turns out, we can actually do better than that, and it doesn't
even require rearrangement of the codes: transmit the Huffman 'tree'
itself!
I'll describe the method with a bit of pseudo-code:
EMIT_TREE(NODE)
IF NODE IS LEAF
EMIT_BIT(LEAF TAG)
EMIT_BYTE(NODE DATA)
ELSE
EMIT_BIT(NONLEAF TAG)
EMIT_TREE(LEFT CHILD OF NODE)
EMIT_TREE(RIGHT CHILD OF NODE)
Reconstruction from the stream is simple and unambiguous. And since a
Huffman tree is guaranteed to contain 511 or less nodes, the upper
bound for the total size of the compression model is 511 + ( 256 * 8 )
== 2559 bits == 319.875 bytes. This is, of course, 63.875 bytes larger
than the worst (and only) case for canonical encoding, but then the
lower bound is a mere 2.375 bytes, and thus the 'average' (whatever
that means) case should be somewhere in the middle.
Your lower bound seems a little odd to me, how did you get it?
Besides the possible savings in overhead, the 'classic' Huffman
algorithm can be used, which is much more simple and efficient than
the 'canonical' method.
Cheers,
Sebastian Garth
Well it's simpler alright, but they're provably equally efficient (codelengths are not changed)
It takes very little time to calculate the canonical codes..
Canonical codes also make decoding easier
Good luck
Harold
.
- Follow-Ups:
- Re: A Revolutionary Breakthrough in Data Compression!
- From: Harold Aptroot
- Re: A Revolutionary Breakthrough in Data Compression!
- References:
- A Revolutionary Breakthrough in Data Compression!
- From: sebastian
- A Revolutionary Breakthrough in Data Compression!
- Prev by Date: Re: A Revolutionary Breakthrough in Data Compression!
- Next by Date: Re: A Revolutionary Breakthrough in Data Compression!
- Previous by thread: Re: A Revolutionary Breakthrough in Data Compression!
- Next by thread: Re: A Revolutionary Breakthrough in Data Compression!
- Index(es):
Relevant Pages
|