Re: compressing a text file
- From: "Matt Mahoney" <matmahoney@xxxxxxxxx>
- Date: 20 Apr 2006 17:06:40 -0700
junky_fellow@xxxxxxxxxxx wrote:
Hi guys,
I am new to the field of data compression. I want to write an
algorithm to compress a text file. One approach I thought of is
replacing frequently occurring words with a shorter symbol. For
example, if "the" is repeated in the text file 1000 times, I would
replace "the" with a new symbol "@" in all 1000 places.
But the new symbol "@" may already be present at some places in the
text file, so I might mistake it for "the". Can anyone suggest how
to solve this problem?
Thanks in advance for any help/hints ...
The solution is to use an "escape" symbol, which could be any character
that rarely occurs in your text. For example, if you use "\" as your
escape symbol, then you encode "@" as "\@" and you encode "\" as "\\".
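
A minimal sketch of the escaping scheme in Python, assuming a single word ("the"), a single replacement symbol ("@"), and "\" as the escape symbol as in the example above. The names WORD, SYM, ESC, encode and decode are only illustrative:

```python
WORD = "the"   # the frequent word being replaced
SYM = "@"      # symbol that stands in for WORD
ESC = "\\"     # escape symbol (a single backslash)

def encode(text):
    """Replace WORD with SYM, escaping literal SYM and ESC characters."""
    out = []
    i = 0
    while i < len(text):
        if text.startswith(WORD, i):
            out.append(SYM)
            i += len(WORD)
        elif text[i] in (SYM, ESC):
            out.append(ESC + text[i])   # literal "@" -> "\@", literal "\" -> "\\"
            i += 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

def decode(data):
    """Invert encode(): an escaped character is literal, a bare SYM is WORD."""
    out = []
    i = 0
    while i < len(data):
        if data[i] == ESC:
            out.append(data[i + 1])     # take the next character literally
            i += 2
        elif data[i] == SYM:
            out.append(WORD)
            i += 1
        else:
            out.append(data[i])
            i += 1
    return "".join(out)
```

Note that this sketch replaces "the" wherever it occurs, including inside longer words such as "other". That is harmless, because decode() restores the original text exactly.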
This is a form of byte-aligned dictionary coding, in the same family
as LZW. Decompression is extremely fast. You can improve compression
somewhat and avoid the need for escape symbols if you don't restrict
symbols to exactly 8 bits. The best you can do is assign each symbol a
code of length lg(1/p) bits, where lg means log base 2 and p is the
probability of the string encoded by the symbol.
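
As a rough illustration of the lg(1/p) rule, the following Python sketch estimates p for each token from its relative frequency and reports the ideal code length in bits. The function name ideal_code_lengths and the whitespace-split tokens are assumptions made for the example. Real coders either round these lengths to whole bits (Huffman) or approach them closely (arithmetic coding).

```python
import math
from collections import Counter

def ideal_code_lengths(tokens):
    """Return {token: lg(1/p)} with p estimated as count / total."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # lg(1/p) = lg(total / count)
    return {tok: math.log2(total / n) for tok, n in counts.items()}

tokens = "the cat the mat the dog the cat".split()
for tok, bits in sorted(ideal_code_lengths(tokens).items()):
    print(f"{tok}: {bits:.2f} bits")
```

Here "the" makes up half of the tokens, so its ideal code is 1 bit, while a token that appears once in eight gets 3 bits.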
Other compression methods you might investigate (from fastest to
slowest, and from worst to best compression) are LZ77, Burrows-Wheeler,
PPM, and context mixing.
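
To get a feel for the first two methods without writing them yourself, here is a small sketch using Python's standard library: zlib implements DEFLATE (LZ77 plus Huffman coding) and bz2 is based on the Burrows-Wheeler transform. PPM and context mixing have no standard-library module, so they are left out. The helper name compressed_sizes is only illustrative, and the results depend heavily on the input.

```python
import bz2
import zlib

def compressed_sizes(data: bytes) -> dict:
    """Return the size in bytes of data before and after each compressor."""
    return {
        "original": len(data),
        "zlib (LZ77 + Huffman)": len(zlib.compress(data, 9)),
        "bz2 (Burrows-Wheeler)": len(bz2.compress(data, 9)),
    }
```

Calling compressed_sizes(open("book.txt", "rb").read()) on a text file of your own (the file name is a placeholder) shows how the two methods compare on your data.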
-- Matt Mahoney