Re: Compressibility of DNA
- From: "Matt Mahoney" <matmahoney@xxxxxxxxx>
- Date: 4 Dec 2005 17:44:39 -0800
BDH wrote:
> This is pretty obvious. LZW compresses some genomes to less than that;
> bacterial genomes are less redundant and only PPM does better than 2
> bits/bp on them.
> http://corpus.canterbury.ac.nz/details/large/RatioByRatio.html
> I actually wrote and used a program that stored entire genomes in
> memory using this representation some time ago.
PPM reports 1.97 bpc. PAQ6 results on e.coli (timed on a 750 MHz
Duron):
paq6v2 -5 x large\e.coli
large\e.coli 4638690 -> 1122269
1122304/4638690 in 892.60 sec. (1.9356 bpc, 24.19% at 5 KB/s)
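For anyone checking the arithmetic, here is a minimal sketch of how the bpc and percentage figures follow from the byte counts in the output above. The function name is just for illustration, and it assumes the e.coli file stores one base per byte, so that bits per character equals bits per base pair.

```python
def bits_per_char(compressed_bytes, original_bytes):
    """Compressed size in bits divided by the number of input characters."""
    return 8.0 * compressed_bytes / original_bytes

# Byte counts taken from the paq6v2 output above.
original = 4638690
compressed = 1122304

bpc = bits_per_char(compressed, original)
percent = 100.0 * compressed / original

# Saving relative to a plain 2 bits/base packing of a, c, g, t
# (assumes one base per input byte).
saving_vs_2bit = 100.0 * (1.0 - bpc / 2.0)

print(f"{bpc:.4f} bpc, {percent:.2f}% of original, "
      f"{saving_vs_2bit:.1f}% smaller than 2 bits/bp")
```

So the result is only a few percent below the trivial 2-bit packing.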
One type of redundancy in e.coli is the occurrence of reverse
complements (complementary palindromes). For example, if "aaact" occurs
with a certain frequency, then "agttt" occurs with about the same
frequency (swap a-t, c-g and reverse the order). However, PAQ doesn't
have a special model for this.
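A rough sketch of the swap-and-reverse operation, plus a k-mer counter that could be used to compare the frequency of a string with that of its reverse complement. This is not part of PAQ; the function names are made up for illustration, and the input is assumed to be lowercase a, c, g, t only.

```python
from collections import Counter

# Swap a<->t and c<->g (lowercase bases assumed).
COMPLEMENT = str.maketrans("acgt", "tgca")

def reverse_complement(seq):
    """Complement each base, then reverse the order."""
    return seq.translate(COMPLEMENT)[::-1]

def kmer_counts(seq, k):
    """Count every overlapping substring of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Toy input: a string followed by its own reverse complement.
toy = "aaact" + reverse_complement("aaact")
counts = kmer_counts(toy, 5)
print(reverse_complement("aaact"), counts["aaact"], counts["agttt"])
```

On a real genome one would run kmer_counts over the whole sequence and compare counts[s] with counts[reverse_complement(s)] for each s.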
I believe eukaryote DNA is more compressible than bacterial DNA because
the non-coding regions often contain long repeating sequences. Also,
there may be several copies of a gene. e.coli seems to lack these
features. Eukaryotes have regions of DNA that bind to gene regulation
proteins that are unnecessary in organisms without cell differentiation.
It is true that the 64 codons (3 base pairs each) map to only 21
meanings (20 amino acids and a stop signal), but this doesn't help.
There is no evolutionary pressure to choose one codon over another for
the same amino acid, so random (incompressible) mutations accumulate.
The mutation rate is much higher in bacteria than in higher organisms.
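To put numbers on the codon argument, here is a back-of-envelope sketch. It assumes, unrealistically, that all 21 meanings are equally likely and that the whole sequence is coding. It shows the saving that would exist if the choice among synonymous codons carried no information, which is exactly the saving that is lost when that choice is effectively random.

```python
import math

CODONS = 64      # 4**3 possible base triplets
MEANINGS = 21    # 20 amino acids plus a stop signal

bits_per_codon_raw = math.log2(CODONS)        # 6 bits, i.e. 2 bits/bp
bits_per_codon_meaning = math.log2(MEANINGS)  # about 4.39 bits

# Hypothetical cost per base pair if only the meaning had to be stored.
bits_per_bp_meaning = bits_per_codon_meaning / 3

# Bits per codon spent on the choice among synonymous codons.
synonymous_bits = bits_per_codon_raw - bits_per_codon_meaning

print(f"{bits_per_bp_meaning:.3f} bits/bp for meanings only, "
      f"{synonymous_bits:.2f} bits/codon in the synonymous choice")
```

If the synonymous choice is random, those extra bits per codon still have to be coded, and the total stays near 2 bits/bp.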
-- Matt Mahoney