Re: Quick Huffman theory question



On Apr 26, 7:46 am, digiol...@xxxxxxxxx wrote:
Yeah it must be partial context, I think that the PPMC implementation
was heavily modified to suit mobile devices. Im hoping to keep the
heap size below 256K which is reasonable for a mobile device (although
I think the GUI is included in this). Here are slides giving a brief
overview of how it works:http://www.mdt.tu-berlin.de/Members/rein/dcc06slides
there is a research paper that goes in to much more detail also.

My implementation is scalable in that only a chunk of one single sub
dictionary needs to be in memory at any one time which (in theory) in
the worst case should be 36Kb (and this could be MUCH improved). The
application compresses and decompresses, compression appears to
require a little more work.

I agree that word typing might be tricky considering that there are
multiple word types for some words, these words would probably be
duplicated in sub dictionaries each word type, the compression
algorithm could disambiguate either by choosing the most probable word
type or even better is that the algorithm could look at the following
word and see which would make the 3 word sequence (previous, current,
and following word) likely to be syntacially correct. I beleive that
this would be easily dealt with. The real question is how much benefit
this would give. Its difficult to know without implementing it I would
think.

The two top compressors in the large text benchmark (
http://cs.fit.edu/~mmahoney/compression/text.html ) both use word
categories as context. Both programs preprocess the input text by
substituting codes from a dictionary where words are grouped by
related meanings or syntactic roles. In paq8hp10 the organization was
done manually with the help of some utilities. paq8hp10 (a context
mixing compressor) takes advantage of this by using the high order
bits of the codes as context. In durilca4linux (PPM) the words were
grouped automatically by context. I don't know the details, but the
dictionary looks similar to some experimental results I got by
clustering the words in a vector space generated by hashing the
surrounding contexts.

However, good text compression requires gigabytes of memory. Natural
language has a very complex model, unlike most other data types. If
memory is tight, you can use a memory efficient form of BWT such as
BBB where the blocks are 80% of memory, although it still requires 4x
blocksize of temporary files for compression. Just about every other
program in the table would benefit from having 5 to 20 GB of memory.

If you only have 32K of memory, you should forget about all these
sophisticated techniques and use something like zip.

Also, don't bother with Huffman codes. Use arithmetic coding. Modern
processors have hardware multipliers, so it is fast. Look at the
specs for tornado. It uses LZ77 with arithmetic coding and it
decompresses faster than zip, gzip, or cabarc, all of which use
Huffman codes.

-- Matt Mahoney

.



Relevant Pages

  • New large block BWT compressor
    ... I was fooling around with BWT compression using large blocks (80% of ... available memory) and using a PAQ-like order 0 model back end (but no ... context mixing). ... although BWT is limited to contiguous contexts like PPM. ...
    (comp.compression)
  • Re: Was: Re: What is the state-of-the-art analysing hardware impact on achievable compre
    ... If you & I come to a rsult which improves compression; ... - Where comes the "redundancy" of data making compression possible from? ... proper and most useful context of describing data ... Context free langgaues are always either incomplete or inconsistent; ...
    (comp.compression)
  • Re: Was: Re: What is the state-of-the-art analysing hardware impact on achievable compre
    ... As I had a very hard time to get through the text of the message below I hope, that eventual responses put some more structure into it making it easier to see the main points. ... Where comes the "redundancy" of data making compression possible from? ... What comes in addition to my mind as aspects worth to be discussed in this context are also: ... Compression gets very intersting in this situation. ...
    (comp.compression)
  • Re: Quagga as border router
    ... I'm going to reply this first response in full context, ... already have my router config pretty well done, on a flash memory card, ... substantial enough Cisco router can house the entire v4 route table via ... Quagga, but I really didn't hear any 'technical' reason as to the ...
    (freebsd-net)
  • Re: Entropy
    ... but at the time the algo was rather slow. ... I have a big area of memory, showing the current symbol orders for every ... context, the index is output, and the symbol is moved to the front of the ... I don't want to take into consideration the space required for huffman ...
    (comp.compression)