Re: Large text compression benchmark



tony.nospam@xxxxxxxxxxxxxxxxxxxxxxx wrote:
"Matt Mahoney" <matmahoney@xxxxxxxxx> writes:

I am creating a text compression benchmark using a large corpus of text
(1 GB of Wikipedia). It's goal is to promote research in natural
language processing applications (such as speech recognition). I
explain the rationale for this in the benchmark:
http://cs.fit.edu/~mmahoney/compression/text.html

I would appreciate any comments on whether you think the benchmark
achieves this goal or not, and what you might change about it.

As this is comp.speech.research I'll answer from the speech recognition
viewpoint - you'll get different answers if you ask computational
linguists.

The language modelling component of a large vocabulary speech
recognition system is very similar to text compression. Speech people
use the term "perplexity" which is directly related to the number of
bits needed to store a decision (2^PP) and so the file size.

Here are the main differences:

Speech people want clean tokenised input, that is every word separated
by whitespace with punctuation normalised and non-alphabetic items
converted to the spoken form. This is quite different from the
predict-every-bit approach of compression.

Speech people want defined training, development and (perhaps hidden)
test sets. They don't care (too much) about model size as in practical
use the material will be new.

As a consequence of the two items above, speech people have an "out of
vocabulary" problem - that is words that occur in the dev/test set that
aren't seen in training - this isn't a big issue but needs to be thought
about when presenting and comparing results.

Off the top of my head, I can't think of a benchmark in language
modelling that is freely available, so there is an opportunity here.
However, in order for this to be widely adopted I'd suggest that you
want to reduce the XML into a clean tokenised form and partition the
data. There are language modelling toolkits available (CMU/Cambridge
v2, SRILM, HTK LM) which would give baseline numbers.

Good luck.


Tony

Thanks for your comments. In my early dissertation research in 1999, I
was interested in estimating the complexity of language modeling. We
really don't know how hard the problem is. I made a graph where I
plotted compression ratio vs. size of the training set for about 30
language models and projected the trend to where it met Shannon's
estimate of the entropy of written English. The graph is here:
http://cs.fit.edu/~mmahoney/dissertation/ (I have since finished my
dissertation in another topic that had funding).

Most of the language models I studied were for speech, so to make any
useful comparison to data compression programs I had to convert
perplexity to compression ratio by making assumptions about word
length, out of vocabulary words, and so on. In any case the trend
suggests a model complexity of about 10^9 bits which agrees with the
estimates by Turing and Landauer, and is about how much language a
person processes in a lifetime.

One problem with the benchmark is it is not in the clean form you
described, which is also the form that Shannon used. It would be a lot
of work to retest all the compressors so I will might instead try to
estimate the entropy of the non-text parts, perhaps by comparing the
compression ratios for clean and dirty text for several programs.

-- Matt Mahoney

.



Relevant Pages

  • language origin, language evolution, evolutionary mutation
    ... The evolution of the Expressive voices into Language ... Vocal expression, animal voicing or speech, is an action, ...
    (sci.lang)
  • Re: Chess Bitch
    ... > Maybe such language is ... > used in the square, and maybe by the 30-year-old women in the square, most ... > of whom, if anything, are even less well-educated than Ms. Shahade. ... > frequently are dismissed for indulging in gender-based hate speech and/or ...
    (rec.games.chess.misc)
  • Re: OT: Some thoughts abouth thinking
    ... >> mean that on the order of 0.5 kg of brain tissue is spent ... >> on processing language. ... >> Based on that line of arguments, I find speech processing by ... In that context, the sense of smell is one of the oldest, as the ...
    (comp.dsp)
  • Re: Why do you like Mozarts music?
    ... >>> someone really wants to make a claim for a specific speech dialect ... > good measure, the language he actually spoke, Swedish. ... >> sentence I am writing now, and there are a lot of examples for this ... Beethoven does write complex phrase structures (known obligingly ...
    (rec.music.classical.recordings)
  • Re: Large text compression benchmark
    ... language processing applications. ... As this is comp.speech.research I'll answer from the speech recognition ... There are language modelling toolkits available (CMU/Cambridge ...
    (comp.speech.research)