Re: how to estimate compressibility ?



Matt Mahoney <matmahoney@xxxxxxxxx> wrote:
: Ulrich Lauther wrote:
: > I have the following problem:
: > An algorithm generates a stream of characters that later on will be
: > compressed. In each step, I can select the character to generate from a set
: > of candidate characters using the last character (or possibly the last few
: > characters) generated to guide my decision.
: > Most - but of course not all - characters generated are zero.
: >
: > I assign a "cost" to each selection, which should reflect compressibility
: > of the generated sequence.
: > Currently, I assign a cost of 0 for a zero following a zero, a cost of 1
: > for a zero following a non-zero, and a cost of 2 in all other cases.
: >
: > Any better ideas?
: >
: > BTW, for compression I use a sequence of Burroughs-Wheeler transformation,
: > Move-to-front, Runlength-encoding, and Huffman-coding.

: I assume you are at the BWT+MTF+RLE stage, and now you want to assign
: costs to select Huffman codes? Your cost should be log 1/p, where p is
: the probability of the next symbol. (Cost will be proportional to your
: code length). You can estimate p by counting character occurrences.
: It will vary depending on the input. If your input is random data then
: BWT+MTF will have a uniform distribution without any long runs. If
: your input is highly compressible then BWT+MTF will have long runs of
: zeros and mostly small values otherwise.

: -- Matt Mahoney

No, I am way before that stage - and I could use any other compression
method.
As I said, I generate a data stream where I have a (limited) choice for
each character and I want to choose in a way that the resulting stream
becomes as compressible a possible.
Yes, I could use log 1/p as cost, but that whould take a very global look
at the problem and would not catch local properties of the data stream.


--
-lauther

[nosave]
----------------------------------------------------------------------------
Ulrich Lauther ph: +49 89 636 48834 fx: ... 636 42284
Siemens CT SE 6 Internet: Ulrich.Lauther@xxxxxxxxxxx
.



Relevant Pages

  • Re: "Read stuff from a file and chop it up to do stuff" code advice wanted.
    ... ;; This function returns TRUE if any character ... (if (char< char #\!) ... a stream and an array to hold characters in temp memory. ... ;; resulting string. ...
    (comp.lang.lisp)
  • Re: How to make Word STOP rewriting your copy
    ... The "character stream" that I was referring to is the GUI stream and not a physical storage stream (although I still believe the paragraph marker originally had it's roots there). ... the "power" of a tool can only be measured by how much PROODUCTIVE work can be done AT A GIVEN COST. ...
    (microsoft.public.mac.office.word)
  • Re: Read only last line-
    ... stream with a value that does not correspond to a position in the file ... example is putcon Windows, where one character ... That's the mapping you have to support. ... and fgetpos/fsetpos. ...
    (comp.lang.c)
  • Re: How do I stop a Winsock from buffering characters?
    ... it's applied at the OS level to the socket. ... Stream s = client.GetStream; ... from the client code and have the server see it right away. ... first character of the client send. ...
    (microsoft.public.windowsce.embedded)
  • Re: reading csv files
    ... >> stream, so, after the first name is extracted, the subsequent ... delimiter character, and not provide a complete solution to reading a CSV ... The next such call extracts 'yyyy', and ',', and the ...
    (alt.comp.lang.learn.c-cpp)