Re: Yet again, human evolution: huh?



> > What is the backbone? (Deoxyribose ring at each position?)
> The ribose + phosphodiester is usually considered the backbone.

I'm confused: I thought RNA meant ribo-nucleic acid, which means ribose
is the sugar part of the base, whereas DNA is called deoxyribo-nucleic
acid, because instead of ribose it's using deoxyribose. But you say DNA
uses ribose just the same as RNA??

> > ... Each backbone sugar is connected to the next, the 3' to the next
> > 5' or vice versa, I forget which, ...
> Traditionally it's 5' to 3', probably because that's the direction of
> replication and transcription.

OK, I think I'll skip trying to learn specifically whether the 3' or
the 5' of one unit connects to the phosphoric acid of the next or
previous unit, and I'll just dumb it down to "replication and
transcription progress in the same direction in all known life on
Earth, and by convention reading out the sequence for the purpose of
publication etc. goes in that same direction".

Hmm, do replication and transcription go in the same direction along
the two parallel strands, or is one strand processed in one direction and
the other strand processed in the opposite direction because the
base+phos combo is physically oriented in the opposite direction?
Does unzipping of the two strands, prior to replication of each strand
separately, during mitosis, happen from one end or the other, or entire
strand in parallel at once? If unzipping happens from one end, but one
of the two strands is replicated or transcribed in the opposite
direction, how can that possibly work??

Hmm, if the two strands are physically laid out in opposite
directions, then whenever replication of a single strand happens, the
newly-built strand is being built in its own backwards direction,
right? (If the two strands run in true-parallel, same direction on both
strands, then this issue doesn't apply.)

Anyway, I'll trust that regardless of whether the two strands are
replicated and transcribed in true-parallel (same direction) or
anti-parallel (opposite directions), for each strand the replication
and transcription go in the same direction, and readout of base
sequence for purposes of these discussions and all genomic databases
also goes in that same direction for that one strand. I hope at least I
got that correct now.
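
For my own bookkeeping, here is a minimal sketch of what the anti-parallel convention means for readout, assuming standard Watson-Crick pairing (A-T, C-G) and that every strand is written 5' to 3'. The function name is my own invention, and the sample sequence is just the short one from the duplication example further down in this post.

```python
# Complement table for standard Watson-Crick base pairing.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand_5_to_3: str) -> str:
    """Return the partner strand, also written in its own 5'-to-3' direction."""
    # Reverse first (the partner runs the opposite way), then pair each base.
    return "".join(COMPLEMENT[base] for base in reversed(strand_5_to_3))

top = "AAGAAGCTAGTGTAAGA"
bottom = reverse_complement(top)
print(top)     # one strand, read 5' to 3'
print(bottom)  # its partner, read 5' to 3' along the opposite physical direction
```

Applying the function twice gives back the original strand, which is the sense in which each strand has its own readout direction.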

Regarding anti-parallel layout:
> Yes, that's correct. But I really don't want to get into that. I'm
> giving just exactly enough information to understand the phylogenetic
> discussion, and no more. Almost everything you say, here and below, is
> beyond what I need.

That's OK. You asked for feedback as to what parts were opaque or
caused confusion, and I simply replied from my own point of view,
bringing up questions that might confuse me if I didn't know the
answer, and asking some other related questions while we were on the
topic. I saw a later posting where you got rid of the five-species
example, using a four-species example instead. As a first tutorial as
to what's going on, I think that's a good idea, and don't know why I
didn't think of it myself, but I'm glad you did.

Still, the real hypothesis is that three (3) African apes, not just
two of the three, are in a small clade, split off from the other apes.
Maybe you can go into great detail with just two of them, such as
    Gorilla+Human ---*--- Orangutan+Gibbon
then once that's all explained, present just the table of results
for the other two possible 2/2 splits:
    Gorilla+Chimpanzee ---*--- Orangutan+Gibbon
    Chimpanzee+Human ---*--- Orangutan+Gibbon
If all those 2/2 splits show very high confidence, then we can
summarize the three results as a sure conclusion of the desired 3/2
split.

Note that the 3/2 split does *not* guarantee the African-ape
hypothesis, that Gorilla+Chimpanzee+Human are very closely related. The
opposite could be equally true, consistent with the 3/2 split: Gibbon
and Orangutan could be very closely related, and all three of Gorilla,
Chimpanzee and Human could be distant out-groups. It's only when you
build a phylogenetic tree that includes hundreds or thousands of species,
and see that the African apes remain together, separate from all other
species, that the African-ape-clade hypothesis is demonstrated
conclusively. Even then, in principle it's possible that some new
completely unexpected species that looks totally unlike apes and hasn't
been classified as apes might turn out to be very closely related to
the African apes. But if that ever happened, it would call a lot of
phylogenetics into question. I don't believe that would happen; that's my
prediction. But the current classification *is* falsifiable by that
means.

> They call them species in bacteria too, ...

I hope that poor terminology practice is abandoned sometime soon. In
the case of sexually reproducing life, where the concept of "species"
makes full sense, we'll still need terminology to deal with a
point midway through a speciation event, and ring species, etc. I'm not
sure how best to deal with those cases. I presume you saw the "ring"
species of birds that span from Mongolia through Eastern China and
Himalayas and Western China and back to Mongolia, where the two ends in
Mongolia are completely different species if seen alone, unable to mate
with each other, yet each link around the cycle is same-species, so by
transitivity the two species in Mongolia are of the same species
despite inability to mate with each other. I think in that case it's
reasonable to call them all one species, per the transitive closure of
mating relations, but call them different varieties in each region
around the closed-U, and have a measure of sequence around the closed-U
that approximately determines ability to mate. But in other cases where
there are two large but tight groups with almost no intermixing between
the two, but still *some* intermixing, with the middle individuals
fertile and able to breed with either side, perhaps they can be termed
different species, and the few exceptions in the middle can be
considered fertile cross-breeds. But if experts decide to clump the two
into a single species, and demote each of the two to a sub-species,
that would be fine with me too.

> > How do you deal with duplication events? For example, consider these:
> > AAGAAGCTAGTGTAAGA
> > GTAAGTAAGATGCTAGTGTAAGC
> > where the last part of the top genome is a good match for both the
> > first part and last part of bottom genome? Are all three bases (the
> > original in any position in top-right part, and each copy of it in
> > bottom-left and bottom-right parts) considered all the "same position"?
> In practice, one of them is considered the "original" sequence, and the
> other is considered a "duplicate" sequence, and you put gaps in the top
> sequence to cover the duplicate.

Oh, I hate that, presumably in this case done like this:
xxxxxxAAGAAGCTAGTGTAAGA
GTAAGTAAGATGCTAGTGTAAGC
since it completely discards the obvious descent from the original to
that moved copy (at very left above), and the study of point mutations
along that path of descent. I hope that practice changes. If the
software can't deal with indels, then I suggest instead of matching
emptiness to the moved copy, simply duplicate the original and put it
likewise in whatever position would line up with its descendant in the
second genome:
(copy) (.....actual.....)   Copy here means fake copy to make it align w/cop1.
GTAAGA AAGAAGCTAGT GTAAGA
GTAAGT AAGATGCTAGT GTAAGC
(cop1)             (cop2)   Cop<n> here are from actual duplication event.
It would be a pain to have to do this manually, so I suggest the
high-quality software be modified to do this copying automatically
during the phase where the two sequences are automatically lined up.
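
To show what is lost and gained, here is a rough sketch that counts columns in the two layouts for the example sequences above. It assumes the sequences are already lined up and that "x" marks a gap; the function name is hypothetical and this is not how any real alignment package works.

```python
def compare_columns(top: str, bottom: str, gap: str = "x"):
    """Count gap columns and point differences in two pre-aligned sequences."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    gaps = mismatches = 0
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            gaps += 1          # column carries no information about descent
        elif a != b:
            mismatches += 1    # column shows a point difference
    return gaps, mismatches

# Usual practice: gap characters opposite the duplicated segment.
gapped = compare_columns("xxxxxxAAGAAGCTAGTGTAAGA",
                         "GTAAGTAAGATGCTAGTGTAAGC")

# Suggested alternative: repeat the original segment opposite its moved copy.
copied = compare_columns("GTAAGA" + "AAGAAGCTAGT" + "GTAAGA",
                         "GTAAGT" + "AAGATGCTAGT" + "GTAAGC")

print(gapped)  # (6, 2): six gap columns, two point differences
print(copied)  # (0, 3): no gaps, and the A/T difference in the moved copy is kept
```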

I have a side question: When there's no duplication event, just a
simple insert or delete, I can understand a large segment of DNA
getting lost, and I can understand a large insertion that is a simple
pattern repeating over and over, and also horizontal gene transfer can
insert a large segment that came from somewhere totally alien. But does
it ever happen that an apparently random sequence is inserted, not from
any source, but brand-new random sequence of DNA bases out of nowhere?
If not, then I would assume whenever we find two sequences like this:
CACGAGCCATACGATATCAGT                     CCGTAGTGAGCACTATTAAACAGTTAGAGCGGTTT
CACGACCCATACGATATCAGTTTGTTCATTAGCTCAATAATTCCGTTGTGAGCACTATTAAACAGTTAGAGCCGTTT
if that middle part of the bottom sequence doesn't look like anything
available via lateral gene transfer then we must assume the bottom
sequence is ancestral and the top sequence derives from it as the
result of a deletion event? It can't reasonably be top-ancestral
bottom-large-random-insertion-event, right? (Note: I've thrown in a few
point mutations also just to make the data more realistic.)

> > other copy stayed x. To avoid this complication, you might wish to
> > explicitly exclude any such duplicated base from study here?
> Or I could just use sequences that have no duplicated bases.

Yes, that's what I meant by "exclude" there. You exclude such examples
from your tutorial because they introduce unnecessary complications. I
just meant you might add a note that you don't cover any such cases,
maybe in a footnote of caveats at the bottom, after all the explanation
of the easy cases is done. Or not. Your tutorial, your judgement how
much to mention. I'm just one of several free proofreaders for content.

> Point of terminology: in systematics, "branch point" and "node"
> generally mean the same thing. Sometimes we distinguish between terminal
> nodes and internal nodes, ...

Sorry, I didn't know the correct jargon for this specialty. If this
were math or software, I'd use the terms "leaf nodes" and "internal
nodes", but I didn't think "leaf nodes" would be understandable to
readers of your tutorial so I tried to guess what they would
understand, and I mis-guessed.

> What I have there is a partially unresolved tree of 5 taxa, with 2
> internal nodes. There are three possible fully resolved trees
> compatible with that unresolved tree.

Try explaining *that* to an absolute beginner at graphs!! :-(
Good thing you decided to scrap the 5-species stuff and use 4-species now.

> I think it's very confusing to talk of this as two trees instead of
> as one tree.

There are two rooted trees, what is usually meant by a tree, attached
back to back (trunk to trunk) to make one large unrooted tree (the
unusual kind being discussed here). In general at any inner link if you
cut there you get two rooted trees, and for these kinds of ternary
unrooted trees if you cut out any single internal node you get three
rooted trees. It all seems simple to me, don't know if it makes sense
to you, or more important to the intended reader of your tutorial.
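
In case it helps, here is a small sketch of that cutting claim, using the fully resolved five-species tree drawn later in this post. The internal node labels n1, n2, n3 and the function names are made up for illustration only.

```python
# Hypothetical labels n1..n3 stand for the three internal nodes.
EDGES = [("Orang", "n1"), ("Gib", "n1"), ("n1", "n2"), ("Gor", "n2"),
         ("n2", "n3"), ("Hum", "n3"), ("Chimp", "n3")]

def components(edges, removed_node=None):
    """Return the connected pieces left after optionally deleting one node."""
    nodes = {n for edge in edges for n in edge if n != removed_node}
    adj = {n: set() for n in nodes}
    for a, b in edges:
        if a in adj and b in adj:      # skip edges touching the deleted node
            adj[a].add(b)
            adj[b].add(a)
    seen, pieces = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, piece = [start], set()
        while stack:                   # plain depth-first search
            node = stack.pop()
            if node in piece:
                continue
            piece.add(node)
            stack.extend(adj[node] - piece)
        seen |= piece
        pieces.append(piece)
    return pieces

def cut_edge(edges, edge):
    """Return the pieces left after removing one link."""
    return components([e for e in edges if e != edge])

print(len(cut_edge(EDGES, ("n1", "n2"))))          # 2 pieces from cutting an inner link
print(len(components(EDGES, removed_node="n2")))   # 3 pieces from removing an internal node
```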

> There are no patterns that partly support it. Either they support it,
> contradict it, or are irrelevant to it.

I disagree. The hypothesis is:
    Gor+Chi+Hum / Ora+Gib
The following data is what you are looking for:
    Gor+Chi+Hum / Ora+Gib
The following sets of data are also consistent with the hypothesis:
    Chi+Hum / Gor+Ora+Gib      * Note this case
    Gor / Chi+Hum / Ora+Gib    * Note this case
    Gor+Chi+Hum / Ora / Gib
The following data is inconsistent with the hypothesis, except by
arguing special circumstances such as duplicate mutations in distant
branches or polymorphism in common ancestor or horizontal gene transfer:
    Gor+Gib / Ora+Chi+Hum

* The two cases flagged above clearly show a close relation between
chimp and human, separate from the relation between orangutan and
gibbon, which is *part* of the close clustering of all three African
apes, hence my claim these give some evidence of that hypothesis.
I claim that is indeed weak support, rather than totally neutral.

Sometimes in mathematics, it's best to derive a stronger theorem than
what we really want, and then generate what we want as a simple
corollary to the main theorem. In this case, we might generate the
stronger fully-resolved 5-species unrooted tree:
       Gib  Gor  Hum
        |    |    |
Orang---*----*----*---Chimp
and then the three-African-ape hypothesis would be one of the two
possible corollaries (the other being Hum+Chi cousins hypothesis).
(With the caveat that we haven't proven whether Ora+Gib are outgroups
for tight Gor+Hum+Chi clade, or whether Gor+Hum+Chi are the outgroups
with tight Ora+Gib.)

> In the first case, the time between speciation events is such that
> ancestral polymorphisms would be expected to have coalesced without any
> retention.

How is this a valid fact to assert, if the only information you have
are these five short DNA sequences, no information about branch lengths
as you claimed above?

> Remember, these are mitochondrial sequences.

No, I don't remember seeing this in the OP. Let me check there ...
indeed, the *only* mention in the entire article is way at the very end
in the reference:
> Molecular phylogeny and evolution of primate mitochondrial DNA.
I have no access to a technical library, so I didn't read that part.
You should have mentioned this was mitochondrial DNA way at the top.
Indeed, I would expect mitochondrial DNA to have very very little
polymorphism that lasts only a few generations. So with that
revelation, I agree ancestral polymorphism is unlikely, leaving
duplicate identical mutations in separate branches, and horizontal gene
transfer (via some virus vector such as ape influenza or monkey pox) as
the two remaining ways we can explain away the conflicting data.
It would help to have some statistical analysis of the general mutation
rate (before individuals are immediately killed off by fatal
mutations), hence the chance that the same exact neutral or beneficial
mutation would occur twice.

> In fact homoplasy is quite common in DNA sequences. There are only 4
> possible bases, after all.

I've heard that there's about one mutation per generation. For a genome
of several million bases, that's one mutation per several million bases
per generation, or one mutation per base per several million
generations. Are there several million generations separating African
apes and other apes, whereby we'd expect mutations in the same location
to recur all over the place, with one third of them resulting in the
same result, hence several such within any segment of the size you were
considering? I didn't think so, but was I mistaken?
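
Here is the back-of-envelope version of my arithmetic as a sketch. Everything in it is an assumption: one mutation per genome per generation, mutations spread evenly over all sites, three equally likely replacement bases, and a per-site chance small enough that I can ignore repeat hits within one lineage. The function name and the input numbers are placeholders, not measured values.

```python
def expected_parallel_hits(genome_size, generations, segment_length):
    """Rough expected count of sites in a segment that mutate to the same
    base independently in two lineages (small-rate approximation)."""
    per_site = generations / genome_size   # one mutation per genome per generation
    both_lineages = per_site ** 2          # the same site is hit in each lineage
    same_base = 1 / 3                      # three possible new bases, assumed equally likely
    return segment_length * both_lineages * same_base

# Placeholder inputs, NOT measured values.
print(expected_parallel_hits(genome_size=3_000_000,
                             generations=100_000,
                             segment_length=1_000))
```

The estimate scales with the square of generations divided by genome size, so whether coincident mutations are rare or common depends entirely on which real numbers belong in those two slots.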

> Please read just a little bit ahead before commenting.

Do you expect your novice readers to read the whole tutorial before
they understand any of it? Wouldn't it be more reasonable for it to be
self-explanatory in a single-pass forward reading?

> I don't think you understand how the chi square test works. There is
> only one test performed on the entire distribution. It asks whether the
> 7 patterns (and there really are only 7 relevant patterns) occur with
> equal frequency. They don't. That's all.

I don't need the chi squared test to see from the raw totals that the
African/nonAfrican split is way out ahead compared to all the rest.
Converting them all to chi-sq scores doesn't tell me, on the face of it,
just looking at the numbers and not knowing what to expect, whether the
result is really significant or not. On the other hand, 95% or better confidence
on one hypothesis, and only 50% or less for all the others, shows me
true and obvious significance. From later in your tutorial, I see
something better than 99.999999% confidence, which is super super good.
But that's too late. This reader is already saying "so what? why did
you bother to show me these meaningless (to me) chi-sq results here?"
at the point where the chi-sq has been shown but the P hasn't yet.
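
What I would have liked to see right at that point is the chi-sq number and its P side by side. A minimal sketch, assuming the test is a goodness-of-fit test of 7 pattern counts against equal expected frequencies (so 6 degrees of freedom). The counts below are invented by me purely for illustration and are NOT the tutorial's data. The P formula used here is the closed form that holds only for an even number of degrees of freedom.

```python
import math

def chi_square_uniform(observed):
    """Chi-square statistic against the null of equal expected counts."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

def p_value_even_df(stat, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    if df % 2:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = stat / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

counts = [40, 6, 5, 4, 3, 3, 2]   # made-up counts for 7 patterns
stat = chi_square_uniform(counts)
print(stat, p_value_even_df(stat, df=len(counts) - 1))
```

Showing both numbers together would let a reader like me see immediately what the chi-sq score is supposed to mean.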

> Here's what you can do. Go to Genbank and find a random sequence that
> has entries for all 5 species above. ...

I have no idea how to do that. Is there a simple tutorial for laypeople
who want to do such simple tasks? One time a year or so ago I saw a URL
for a set of provisional sequence data that was organized in a way
where I could just browse the data at random and pick some sequence
from somewhere and view/download. But I have no idea how to find
matching sequences in five different genomes, and I'm sure it can't be
done by manually browsing each of the five independently even if I had
URLs for each of the five.

> This is not an assumption of what I want to prove. It's a prediction
> based on experience, mine and that of everyone who has ever sequenced
> primate DNA.

I think you should have presented it clearly as such a prediction from
the model, a way of falsifying the model (hypothesis) if your
prediction turns out to be wrong in more than just a tiny fraction of
cases.

> > I believe that among five species there are thirty possible unrooted trees,

> 15.

Aha, there are indeed thirty unrooted trees if you distinguish the
inner branches as major and minor, for example one of them supported to
a 99.999999% confidence level and the other supported only to a 95% or
even 80% confidence level. But if you ignore branch lengths or
confidence levels, then indeed there are only 15 modulo the symmetry.
I had the right thing in mind, but said it wrong, sorry.

> > or considering just
> > the toplevel 3/2 split there are ten possible, correct? So there's
> > nothing unique about any particular one of them.
> No. However, if there is statistical support for any one tree as opposed
> to all other trees, that itself is an expectation of common descent.
> That is, common descent supposes that you will get one consistent tree
> from different samples, though it does not a priori tell you what tree
> to expect. Fiat creation has no such expectation.

We're in agreement on the facts. I just thought your original wording
was misleading, indicating that this one (of the ten possible) 3/2
split was unique. When there are ten possibilities that are special, any
one of the ten is possible, even with special creation you'll get one
of those ten for any given sequence data, it's just that if you use
several different sequences of DNA you'll get the same 3/2 split in all
cases with common descent whereas you'll get a random sample of the ten
possible splits with special creation.

But that's not quite correct. If the correct fully-resolved unrooted
tree is the one I showed above, repeated here again:
       Gib  Gor  Hum
        |    |    |
Orang---*----*----*---Chimp
then there are two different 3/2 splits that we expect to be supported
by the data:
Ora+Gib / Gor+Hum+Chi
Ora+Gib+Gor / Hum+Chi
For some sets of data, one or another might be strongly supported by
the data, and in some *both* might be strongly supported. So, on that
premise, looking at lots of sequences you'd expect a random mix of one
or the other or both strongly supported.
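
As a check on myself, here is a short sketch that reads the supported splits directly off that tree: each inner link, when cut, divides the five species into two groups. The internal node labels n1, n2, n3 and the function names are my own invention.

```python
LEAVES = {"Orang", "Gib", "Gor", "Hum", "Chimp"}
# Hypothetical labels n1..n3 stand for the three internal nodes.
EDGES = [("Orang", "n1"), ("Gib", "n1"), ("n1", "n2"), ("Gor", "n2"),
         ("n2", "n3"), ("Hum", "n3"), ("Chimp", "n3")]

def leaves_on_side(edges, start, blocked):
    """Leaves reachable from start without passing through the blocked node."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    stack, seen = [start], {blocked}
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adj[node] - seen)
    return seen & LEAVES

def internal_splits(edges):
    """One split of the species per inner link (links touching a leaf are skipped)."""
    splits = []
    for a, b in edges:
        if a in LEAVES or b in LEAVES:
            continue
        side = leaves_on_side(edges, a, blocked=b)
        splits.append((sorted(side), sorted(LEAVES - side)))
    return splits

for left, right in internal_splits(EDGES):
    print("+".join(left), "/", "+".join(right))
```

It prints exactly two splits, Gib+Orang against the three African apes and Chimp+Hum against the other three, matching the two listed above.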

In the general case, for any five species, there's only one possible
unrooted tree in the topological sense, fifteen possible ways that tree
can fit in with the five species (15 different ways to assign labels).
Given the general hypothesis of universal common descent, we have
common descent for these five species, hence only one of these fifteen
fully-resolved unrooted-trees is correct. But then we have two
different 3/2 splits supported, random mix of one or other or both with
different sets of sequence data.

On the other hand, with only four species, again there's only one
topological unrooted-tree, three possible labeled unrooted-trees, and
for whichever such unrooted tree is correct there's only one 2/2 split,
so there the split really is unique given the hypothesis.
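
The counts in this part of the thread can be checked with a few lines. This is a sketch under the usual assumption that "tree" means a fully resolved unrooted tree with labeled tips, for which the count is the product of the odd numbers from 1 up to 2n-5. The function name is mine.

```python
from math import comb

def unrooted_labeled_trees(n_species):
    """Number of fully resolved unrooted trees on n labeled tips: (2n-5)!!"""
    count = 1
    for odd in range(3, 2 * n_species - 4, 2):
        count *= odd
    return count

for n in (4, 5):
    # Each such tree has n - 3 inner links, hence n - 3 supported splits.
    print(n, unrooted_labeled_trees(n), n - 3)
# prints: 4 3 1
#         5 15 2

print(comb(5, 2))        # 10 possible 3/2 splits of five species
print(comb(4, 2) // 2)   # 3 possible 2/2 splits of four species
```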

It's sad that you had to delete one of the four apes to make the
tutorial manageable, but it seems to be a necessary decision.