Re: What is the state-of-the-art analysing hardware impact on achievable compression ratio



"Matt Mahoney" <matmahoney@xxxxxxxxx> schrieb im Newsbeitrag
news:1131913112.544390.296720@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> Claudio Grondi wrote:
> > You write:
> > > You could enumerate all 512 byte programs to get an exact answer
> > > (although not very quickly). But I can tell you the upper bound.
> > > There are 2^4096 possible programs so at most you could compress
2^4096
> > > out of 2^8192 strings of length 1K bytes. This is a very tiny
> > > fraction, less than one in 10^616.
>
> > I know, but what I am after is some kind of understanding what is so
special
> > about the strings for which programs exist and the strings for which it
is
> > not possible to write a program when given the mentioned limits.
> > How does the characteristics of the compressible or not compressible
strings
> > depend on the instruction set supported by the CPU?
> > Could another CPU design change dramatically the set of the compressible
> > strings, so, that the two sets does not have anything in common?
>
> You are asking about algorithmic complexity.
I think I am not.
In my eyes down to the essence of compression there is also one kind of
compression possible: some of the content of the string to compress is
already there so it is enough to point to it and tell its length. The CPU
provides with its built-in functionality some of the strings even if they
have to be created in memory first (by running e.g. a program), but only a
fraction of all possible patterns can be created this way and memory and
time limits puts restrictions on the size of the strings which can be taken
as granted in order to point to their substrings. What I am after is the
answer to the question which strings can be considered as granted for the
kind of compression I describe above. The possible patterns of such strings
_must_ in my eyes be dependent on the CPU instruction set and of course the
length of the program used to generate them. I suggest with my question,
that there is maybe a generic dependency which is only a function of the CPU
instruction set. If not, limiting the size of the program will give a to
this size limited insight into the dependency from the CPU instruction set
only.
Is there any work done on it from this point of view, or do I still missing
something here?

Claudio

> The strings that are easy
> to compress are the ones that are easy to describe. The algorithmic or
> Kolmogorov complexity K(L,x) of a string x is defined as the length of
> the shortest program in language L that will output x. Two basic facts
> about algorithmic complexity are:
>
> 1. Given any two languages, L1 and L2, then for all x, K(L1,x) <=
> K(L2,x) + O(1). The complexity of x depends on L only up to a constant
> independent of x.
>
> 2. There is no algorithm to compute K(L,x) for all x in any L.
>
> To prove 1, any string encoded in L1 can be encoded in L2 by appending
> a compiler for L1 written in L2.
>
> An informal proof of 2 is like this. Suppose there is an algorithm to
> compute K(English, x) for all x. Let x1 be the first string that
> cannot be described in 100 words. Given the above algorithm, you could
> try all x until you found x1. However, I have already described x1 in
> less than 100 words, so our initial assumption leads to a
> contradiction. A formal proof extends this to an arbitrary language L,
> in which you could write an equivalent program to print x1:
>
> x = "";
> while (K(L,x) <= length_of_this_program)
> x = next(x);
> print x;
>
> For more on the mathematical foundations of algorithmic complexity, see
> Chaitin's classic paper,
> http://www.cs.auckland.ac.nz/CDMTCS/chaitin/ibm.pdf
>
> A possible reason why compressible strings are so rare according to the
> counting theorem, yet so common in practice, might be because most
> "useful" strings are the result of processes that can be computed or
> simulated by a program. Given that ideal compression is not
> computable, compression becomes the art of understanding the natural
> processes that generate data.
>
> -- Matt Mahoney
>


.



Relevant Pages


Loading