Re: Huffman & Entropy Question



On Oct 14, 7:50 am, Thomas Richter <t...@xxxxxxxxxxxxxxxxx> wrote:
sebastian schrieb:

Clearly, if you have a proper model, typical strings are the most likely strings generated by the source, and thus you have *usually* the compression ratio Huffman grants you, but in the unlikely event of an atypical string you might have no compression at all, and rather an expansion.

Of course, expansion only occurs from to the overhead of transmitting
to model. In compressing the symbols alone, Huffman encoding
guarantees an output less than or equal to the input size.

That wasn't my point, though. My point was rather: "NO, IT DOESN'T".

Example: use the following Huffman code:

A       ->   1
B       ->   01
C       ->   000
D       ->   001

Does this Huffman code ensure "compression"?
Not at all. If I feed it by *any* message I like to, *most* of the
messages will be expanded, so it clearly doesn't ensure it.

The reason is that this works only well if we really have a random
source with p(A) = 1/2, p(B) = 1/4, p(C)=p(D)=1/8, at least
approximately. If the source derives from that, compression will be
lower, or even an expansion.

However, *even for such a source*, messages like

DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD

are *not* impossible. Just unlikely. So you do get, at best,
"compression on average" over a huge number of input messages. You do
not get a compression "for a message generated by the random source".
You only get that for "typical messages". The above message would be
rather a-typical, but not impossible. And hence, even for a random
source that *fits* the Huffman code, Huffman doesn't *ensure*
compression. It just makes *compression likely to happen*.

That was the point I was trying to make.

People here often mix the concepts of probabilities, relative
frequencies, models, random sources, specific sequences, random
sequences, etc. and hence create an immense noise level.

So long,
        Thomas

I apologize for reviving an old discussion, but I just wanted to set
the record straight. The Huffman algorithm is, in fact, guaranteed to
produce an output LESS THAN or EQUAL TO the original size. As an
example (sorry, I'm not very good at formulating mathematical proofs),
if you generate a 256 byte file with every possible permutation just
once, every symbol gets assigned an 8 bit code. Here's the output of a
program given that particular input set:

Huffman Test
File: '256.txt'
....reading file...
Uncompressed size: 256 bytes
....compressing...
*** Begin Debugging Log ***
0xd7 : 10100000
0xd8 : 10100001
'b' : 10100010
'Z' : 10100011
'(' : 10100100
0xf3 : 10100101
0xa0 : 10100110
0xf5 : 10100111
',' : 10101000
0x12 : 10101001
0xc1 : 10101010
0x83 : 10101011
0x1e : 10101100
0xaf : 10101101
0x15 : 10101110
0xb7 : 10101111
'j' : 10110000
'R' : 10110001
'4' : 10110010
0xf0 : 10110011
'c' : 10110100
'Y' : 10110101
0xe0 : 10110110
0xe1 : 10110111
0xc2 : 10111000
0x87 : 10111001
0 : 10111010
0xd9 : 10111011
'!' : 10111100
0x9e : 10111101
'z' : 10111110
'B' : 10111111
0xad : 10000000
0x8 : 10000001
0x4 : 10000010
0x1 : 10000011
'v' : 10000100
'F' : 10000101
'i' : 10000110
'S' : 10000111
0xb8 : 10001000
0xd : 10001001
'8' : 10001010
0xac : 10001011
0x99 : 10001100
0xf7 : 10001101
'u' : 10001110
'G' : 10001111
'=' : 10010000
0xbe : 10010001
0x8d : 10010010
0xfa : 10010011
'p' : 10010100
'L' : 10010101
'x' : 10010110
'D' : 10010111
'2' : 10011000
0x11 : 10011001
0xda : 10011010
0xdb : 10011011
'q' : 10011100
'K' : 10011101
0x91 : 10011110
0xf9 : 10011111
0x9c : 11100000
0xdf : 11100001
0xdd : 11100010
0xde : 11100011
0x3 : 11100100
0xd0 : 11100101
'y' : 11100110
'C' : 11100111
0x10 : 11101000
0xb2 : 11101001
'a' : 11101010
'[' : 11101011
0xe5 : 11101100
0xe2 : 11101101
0x94 : 11101110
0x98 : 11101111
0xbc : 11110000
0x19 : 11110001
0x82 : 11110010
'~' : 11110011
0x7f : 11110100
0x80 : 11110101
't' : 11110110
'H' : 11110111
'#' : 11111000
0xb6 : 11111001
'_' : 11111010
'^' : 11111011
0xe6 : 11111100
0xe7 : 11111101
0xc : 11111110
0xee : 11111111
'9' : 11000000
0xa9 : 11000001
0x5 : 11000010
0x6 : 11000011
0x81 : 11000100
0xfd : 11000101
'w' : 11000110
'E' : 11000111
'?' : 11001000
0xdc : 11001001
0x84 : 11001010
0x88 : 11001011
0xcd : 11001100
0xa6 : 11001101
0xc3 : 11001110
0x8b : 11001111
'}' : 11010000
'7' : 11010001
0xab : 11010010
0xd6 : 11010011
0xb9 : 11010100
0x16 : 11010101
'e' : 11010110
'W' : 11010111
0x92 : 11011000
0x8e : 11011001
0xc8 : 11011010
0x9f : 11011011
0xa1 : 11011100
''' : 11011101
's' : 11011110
'I' : 11011111
'1' : 00100000
'-' : 00100001
0xc5 : 00100010
0x93 : 00100011
0x95 : 00100100
0xf8 : 00100101
'{' : 00100110
'A' : 00100111
0x8c : 00101000
0x90 : 00101001
'h' : 00101010
'T' : 00101011
0xe : 00101100
0xb0 : 00101101
0xd4 : 00101110
0xd5 : 00101111
0xeb : 00110000
0xe8 : 00110001
'k' : 00110010
'Q' : 00110011
'5' : 00110100
0xb3 : 00110101
0xe3 : 00110110
0xe4 : 00110111
'r' : 00111000
'J' : 00111001
'm' : 00111010
'O' : 00111011
0x20 : 00111100
0x14 : 00111101
0xcc : 00111110
'.' : 00111111
0xa8 : 00000000
0xef : 00000001
'/' : 00000010
0xb4 : 00000011
0x9 : 00000100
0xae : 00000101
'>' : 00000110
';' : 00000111
0xc0 : 00001000
0xbf : 00001001
0xca : 00001010
0xa2 : 00001011
0x1f : 00001100
']' : 00001101
'g' : 00001110
'U' : 00001111
0x17 : 00010000
0xbb : 00010001
0xff : 00010010
0xfe : 00010011
'&' : 00010100
0x13 : 00010101
0x89 : 00010110
0xfb : 00010111
0x9a : 00011000
0x96 : 00011001
'o' : 00011010
'M' : 00011011
0xb : 00011100
0x1c : 00011101
'<' : 00011110
0x1a : 00011111
0xec : 01100000
0xed : 01100001
0x2 : 01100010
0xd3 : 01100011
0xa4 : 01100100
0xf2 : 01100101
0xcf : 01100110
':' : 01100111
0xcb : 01101000
'*' : 01101001
'n' : 01101010
'N' : 01101011
'|' : 01101100
'@' : 01101101
')' : 01101110
0xb5 : 01101111
'+' : 01110000
0xa5 : 01110001
0xf : 01110010
0xb1 : 01110011
'3' : 01110100
0xa7 : 01110101
0x85 : 01110110
0xfc : 01110111
0x8a : 01111000
0x86 : 01111001
0xa3 : 01111010
'%' : 01111011
0xa : 01111100
0x7 : 01111101
0xce : 01111110
'6' : 01111111
0xc4 : 01000000
0x8f : 01000001
'd' : 01000010
'X' : 01000011
0xd1 : 01000100
0xd2 : 01000101
'$' : 01000110
0xf4 : 01000111
0xe9 : 01001000
0xea : 01001001
0x18 : 01001010
0xba : 01001011
0xc6 : 01001100
0x97 : 01001101
'`' : 01001110
'\' : 01001111
'f' : 01010000
'V' : 01010001
0x1b : 01010010
0xbd : 01010011
'0' : 01010100
0xf1 : 01010101
0x9d : 01010110
0xf6 : 01010111
0xc7 : 01011000
0x9b : 01011001
0xaa : 01011010
0x1d : 01011011
0xc9 : 01011100
'"' : 01011101
'l' : 01011110
'P' : 01011111
Model: 63.875 bytes (511 bits)
Symbols: 256 bytes
Accumulator: 4 bytes
Total overhead: 323.875 bytes
*** End of Debugging Log ***
256 bytes processed in 0.36 seconds
Compressed size: 580 bytes (2.3e+02%)
....decompressing...
580 bytes processed in 0 seconds
....verifying...
Result: pass

In fact, the algorithm automatically adjusts itself in such a way that
no matter what the frequency of any given symbol is, a minimal code is
chosen such that the size of the input is never exceeded. It's an
interesting property that sets it apart from all others (that I know
of). If you still have any doubt, just imagine a 'two-bit' machine and
visualize every possible arrangement of the Huffman tree for a given
input, which really isn't too difficult.

And just to touch on the subject of entropy, I'm going to reiterate
what Thomas has said (although somewhat differently): Entropy has no
real bearing on the 'reality' of the state of a system. It is simply a
measure derived from whatever analysis is being applied to it, eg: the
'relative' order of the system as viewed by the observer. In other
words: Entropy is not 'absolute'.
.



Relevant Pages

  • Next set of vetting
    ... Harrington's Compression Method, ... The First Fundamental is the Huffman. ... Pascals triangle has applications in many many different fields. ... possible outcomes of that many bits, ...
    (comp.compression)
  • Re: Next set of vetting
    ... design a model for a data source that fits to that. ... In typical compression the usage ... You forget here that the input of the huffman must be "out of balance" ... you will find that the result is in balance. ...
    (comp.compression)
  • Re: Huffman versus arithmetic compress for fixed static compression
    ... Huffman performance. ... but the arithmetic coder (still overlooking possible implementation ... huffman for this kind of file compression. ... token for decompression. ...
    (comp.compression)
  • Re: Sorry for the irritation
    ...  Yet the message got no response. ... Yet there's plenty of time here to debate Perpetual Compression! ... is much more flexible than Huffman, but why not use Huffman, ... arithmetic coding here for optimal results. ...
    (comp.compression)
  • Re: Paying for assistance
    ... forget about the input being 8 bit sequences. ... This is just uninteresting, and forget about the dictionary, this is only a technicality and we can assume without loss of generality that the input is sorted by probability. ... The expected lengths of files 1,2,3 are then given by multiplying the expected bit count by the entropy of the given file, i.e. ... them 'bait' for a compression algorithm. ...
    (comp.compression)