Re: What is the state-of-the-art analysing hardware impact on achievable compression ratio



Claudio Grondi wrote:
> You write:
> > You could enumerate all 512 byte programs to get an exact answer
> > (although not very quickly). But I can tell you the upper bound.
> > There are 2^4096 possible programs so at most you could compress 2^4096
> > out of 2^8192 strings of length 1K bytes. This is a very tiny
> > fraction, less than one in 10^616.

> I know, but what I am after is some kind of understanding what is so special
> about the strings for which programs exist and the strings for which it is
> not possible to write a program when given the mentioned limits.
> How does the characteristics of the compressible or not compressible strings
> depend on the instruction set supported by the CPU?
> Could another CPU design change dramatically the set of the compressible
> strings, so, that the two sets does not have anything in common?

You are asking about algorithmic complexity. The strings that are easy
to compress are the ones that are easy to describe. The algorithmic or
Kolmogorov complexity K(L,x) of a string x is defined as the length of
the shortest program in language L that will output x. Two basic facts
about algorithmic complexity are:

1. Given any two languages, L1 and L2, then for all x, K(L1,x) <=
K(L2,x) + O(1). The complexity of x depends on L only up to a constant
independent of x.

2. There is no algorithm to compute K(L,x) for all x in any L.

To prove 1, any string encoded in L1 can be encoded in L2 by appending
a compiler for L1 written in L2.

An informal proof of 2 is like this. Suppose there is an algorithm to
compute K(English, x) for all x. Let x1 be the first string that
cannot be described in 100 words. Given the above algorithm, you could
try all x until you found x1. However, I have already described x1 in
less than 100 words, so our initial assumption leads to a
contradiction. A formal proof extends this to an arbitrary language L,
in which you could write an equivalent program to print x1:

x = "";
while (K(L,x) <= length_of_this_program)
x = next(x);
print x;

For more on the mathematical foundations of algorithmic complexity, see
Chaitin's classic paper,
http://www.cs.auckland.ac.nz/CDMTCS/chaitin/ibm.pdf

A possible reason why compressible strings are so rare according to the
counting theorem, yet so common in practice, might be because most
"useful" strings are the result of processes that can be computed or
simulated by a program. Given that ideal compression is not
computable, compression becomes the art of understanding the natural
processes that generate data.

-- Matt Mahoney

.



Relevant Pages