Re: Compressibility of DNA



BDH wrote:
> This is pretty obvious. LZW compresses some genomes to less than that;
> bacterial genomes are less redundant and only PPM does better than 2
> bits/bp on them.
> http://corpus.canterbury.ac.nz/details/large/RatioByRatio.html
> I actually wrote and used a program that stored entire genomes in
> memory using this representation some time ago.

PPM reports 1.97 bpc. PAQ6 results on e.coli (timed on a 750 MHz
Duron):

paq6v2 -5 x large\e.coli
large\e.coli 4638690 -> 1122269
1122304/4638690 in 892.60 sec. (1.9356 bpc, 24.19% at 5 KB/s)
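
As a quick check of the summary line, the bpc and percentage figures can be recomputed from the two sizes PAQ printed. This is only a sketch: the constant and function names are made up for illustration, and it assumes the e.coli file stores one base per byte, so that bits per character equals bits per base pair. The last line shows the size a fixed 2 bits/bp packing would need, for comparison.

```python
# Sizes copied from the paq6v2 summary line above.
ORIGINAL_BYTES = 4638690    # large\e.coli, assumed one base per byte
COMPRESSED_BYTES = 1122304  # size used in PAQ's own summary line


def bits_per_char(compressed_bytes, original_bytes):
    """Compressed bits spent per input character."""
    return 8 * compressed_bytes / original_bytes


bpc = bits_per_char(COMPRESSED_BYTES, ORIGINAL_BYTES)
ratio = COMPRESSED_BYTES / ORIGINAL_BYTES

# Fixed 2-bit code: 4 bases per byte, rounded up to a whole byte.
packed_bytes = (ORIGINAL_BYTES + 3) // 4

print(f"{bpc:.4f} bpc, {100 * ratio:.2f}% of original")
print(f"fixed 2-bit packing: {packed_bytes} bytes")
```

Anything under 2.0 bpc here means the compressor found some structure beyond the four-letter alphabet.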

One type of redundancy in e.coli is the occurrence of reverse
complements (complemented palindromes). For example, if "aaact" occurs
with a certain frequency, then "agttt" occurs with about the same
frequency (swap a-t and c-g, then reverse the order). However, PAQ
doesn't have a special model for this.
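
The swap-and-reverse rule is easy to state in code. The following is a minimal sketch, not anything PAQ does; the helper names (`reverse_complement`, `kmer_counts`, `strand_symmetry`) are hypothetical, and it assumes a lowercase sequence containing only a, c, g and t.

```python
from collections import Counter

# a <-> t and c <-> g, as described above.
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Swap each base for its complement, then reverse the order."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_counts(seq, k):
    """Count every overlapping substring of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def strand_symmetry(seq, k):
    """Pair each k-mer's count with the count of its reverse complement."""
    counts = kmer_counts(seq, k)
    return {
        kmer: (n, counts.get(reverse_complement(kmer), 0))
        for kmer, n in counts.items()
    }


print(reverse_complement("aaact"))
print(strand_symmetry("aaactagttt", 5)["aaact"])
```

Run over the whole genome with a small k, the two numbers in each pair should come out close to each other if the symmetry holds. A model could exploit this by sharing statistics between a context and its reverse complement.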

I believe eukaryote DNA is more compressible than bacterial DNA because
the noncoding regions often contain long repeating sequences. Also,
there may be several copies of a gene. e.coli seems to lack these
features. Eukaryotes also have regions of DNA that bind gene regulation
proteins; such regions are unnecessary in organisms without cell
differentiation.

It is true that the 64 codons (3 base pairs each) map to only 21
meanings (20 amino acids and a stop signal), but this doesn't help.
There is no evolutionary pressure to choose one codon over another for
the same amino acid, so random (incompressible) mutations accumulate.
The mutation rate is much higher in bacteria than in higher organisms.
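
For scale, here is what the 64-to-21 mapping could save at best. This is a back-of-envelope sketch under deliberately unrealistic assumptions: every base is coding, all 21 meanings are equally likely, and the choice among synonymous codons carries no information. The variable names are only for illustration.

```python
from math import log2

CODONS = 4 ** 3   # 64 possible triplets
MEANINGS = 21     # 20 amino acids plus a stop signal

# Cost per base if every codon must be stored exactly.
raw_bits_per_base = log2(CODONS) / 3

# Cost per base if only the meaning of each triplet had to be stored.
meaning_bits_per_base = log2(MEANINGS) / 3

print(f"{raw_bits_per_base:.3f} vs {meaning_bits_per_base:.3f} bits per base")
```

The gap between the two numbers is the most the mapping could give a lossless compressor. It is not available in practice, because the choice among synonymous codons is part of the data and, being random, costs close to its full share of bits.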

-- Matt Mahoney
