Re: compressing a text file



junky_fellow@xxxxxxxxxxx wrote:
HI guys,

I am new to the field of data compression. I want to write an
algorithm to compress
the text file. One way I thought of replacing the frequently occuring
words with a smaller
symbol. Say, for example if "the" is repeated in the text file 1000
times I would replace
"the" with a new symbol "@" at all the 1000 places.
But there is a possibility that the new symbol "@" is already present
at some places
in the text file. So, I may mistook it as "the". Can anyone suggest me
how to solve
this problem ?

Thanx for any help/hints in advance ...

The solution is to use an "escape" symbol, which could be any character
that rarely occurs in your text. For example, if you use "\" as your
escape symbol, then you encode "@" as "\@" and you encode "\" as "\\".

This is a form of byte-aligned LZW compression. Decompression is
extremely fast. You can improve compression somewhat and avoid the
need for escape symbols if you don't restrict symbols to exactly 8
bits. The best you can do is assign codes of length lg 1/p bits, where
lg means log base 2 and p is the probability of the string encoded by
the symbol.

Other compression methods you might investigate are (from fastest to
slowest, and from worst to best compression) are LZ77, Burrows-Wheeler,
PPM, and context mixing.

-- Matt Mahoney

.



Relevant Pages

  • Re: Data-compression puzzle - why cant files be repeatedly compressed until theyre only 1 by
    ... reasonable speculation as to how data compression works. ... the decompressing algorithm takes up ...
    (rec.puzzles)
  • Re: data file similarity metric/software
    ... but a simple correlation function is just ... > when the two are compressed by a data compression algorithm they will ... This means that any data compression ... > algorithm is a metric of similarity. ...
    (comp.programming)
  • Re: data compression on data blocks
    ... >I have to implement a data compression to compress ... It is a good general purpose lossless ... properties that would make some other specific algorithm more ...
    (comp.programming)
  • Re: compressing a text file
    ... I am new to the field of data compression. ... algorithm to compress ... This is a form of byte-aligned LZW compression. ... need for escape symbols if you don't restrict symbols to exactly 8 ...
    (comp.compression)
  • Re: compressing a text file
    ... I am new to the field of data compression. ... algorithm to compress ... This is a form of byte-aligned LZW compression. ... need for escape symbols if you don't restrict symbols to exactly 8 ...
    (comp.compression)