Re: Sean Pitman: definitions wanted



Seanpit wrote:
> RobinGoodfellow wrote:
>
> > > > Are you certain? I've skimmed the 89 PNAS paper, and it appears that
> > > > there they used only single-residue substitutions.
> >
> > > Read page 2156, second paragraph: ". . . many residues are heavily
> > > mutagenized at the same time so that several substitutions may be seen
> > > in a particular sequence."
> >
> > I actually went back and read the entire paper quite a bit more
> > carefully, and realised I initially misunderstood what the authors did.
> > In that work, they didn't try any single-site substitutions: all
> > mutations were targeted at multiple sites. While I can't fault the
> > authors for their methodology, you can't really draw any conclusions
> > about the "frequency of the ARC repressor in sequence space", as you
> > would put it, except perhaps to propose some very loose lower bounds.
>
> Well, the authors themselves drew conclusions about the frequencies.

Not in the 89 PNAS paper on the ARC repressor, nor in the 90 Science
paper. But when it comes to the 90 Proteins paper - well, this is
where it gets good.

The abstract for that paper clearly states that only 25 residues of the
total 92 were mutagenized, so the idea that Sauer et al. would be
claiming 10^57 possible 92-residue lambda repressors seemed more than
a little strange. Yet you wrote, or rather copied and pasted from
another creationist, that "... there should be about 10^57
different allowed sequences for the entire 92 residue domain. ..." One
might wonder what is hidden in those ellipses, and after doing a
little searching (alas, still no electronic access to the original
paper - Wiley Interscience is being difficult), I found the following
on the web:

"Extrapolating to the rest of the protein indicates that there should
be about 10^57 different allowed sequences for the entire 92-residue
domain. Clearly, this is an extraordinarily rough calculation, and we
do not intend to suggest that we can accurately determine how many
sequences would actually adopt a structure resembling the N-terminal
domain of [lambda] repressor."

http://www.asa3.org/archive/asa/200009/0140.html

I.e., first of all, they extrapolated from the 25-residue region
to the entire protein, which is not really justified, as some regions of
a protein can be far more constrained than others. More importantly,
however, the authors themselves consider this calculation
"extraordinarily rough", and disclaim the suggestion that their work
could be used to determine the frequency of the "lambda repressor
function" in sequence space. So, aside from the fact that you
parroted (unknowingly, I hope) a creationist quote mine, you and your
creationist source seem to believe more strongly in this particular
result of the work of Sauer et al. than the authors do themselves!
This is quite amusing.
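
For what it's worth, the arithmetic behind that extrapolated figure is
easy to reproduce. The sketch below is only an illustration. It assumes
20 amino acids per position and independent sites, which are my
simplifications and not something taken from the paper. The alternative
per-site averages of 3 and 6 are hypothetical values, chosen purely to
show how sensitive the total is.

```python
import math

AMINO_ACIDS = 20        # assumption: standard 20-letter alphabet
LENGTH = 92             # residues in the domain, as quoted above
ALLOWED_LOG10 = 57      # the extrapolated 10^57 figure, as quoted above

# Size of the whole 92-residue sequence space, as a power of ten.
total_log10 = LENGTH * math.log10(AMINO_ACIDS)

# Fraction of that space the extrapolation would call "allowed".
fraction_log10 = ALLOWED_LOG10 - total_log10

# Average number of residues tolerated per site implied by 10^57,
# if every site is treated as independent (a geometric mean).
per_site = 10 ** (ALLOWED_LOG10 / LENGTH)


def allowed_log10(avg_per_site, length=LENGTH):
    """log10 of the allowed-sequence count for a given per-site average."""
    return length * math.log10(avg_per_site)


print(f"total space      ~ 10^{total_log10:.1f}")
print(f"allowed fraction ~ 10^{fraction_log10:.1f}")
print(f"implied average  ~ {per_site:.2f} residues per site")

# Hypothetical averages, only to show the sensitivity of the result.
for avg in (3, 6):
    print(f"average of {avg} per site -> ~10^{allowed_log10(avg):.0f} sequences")
```

Moving the assumed per-site average from 3 to 6 shifts the total by
nearly 28 orders of magnitude, which is one way to see why a figure
extrapolated from 25 of 92 residues deserves the "extraordinarily
rough" label.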

> > First, while their methods ensured that every residue was mutated, far
> > from every residue position received every possible amino acid. In
> > fact, if you read the paper, you'll see that at almost every site, only
> > one codon position was mutated at a time, making sure that the sampling
> > of amino acids at each position was distinctly biased. Furthermore,
> > heavily mutating multiple residue positions at a time can fail to give
> > you a good picture of the tolerance of individual amino acid
> > substitutions: a substitution may fail to show up as viable simply
> > because it is always coupled with some unfavorable substitutions (this
> > is the reverse of the problem I mentioned earlier).
>
> There was enough variability in the substitution positions, it seems to
> me, to deal with this problem to at least a fair degree.

Maybe, but it is equally possible that the site-by-site analysis of the
substitutions will reveal that an unfavorable substitution was always
coupled with at least one other unfavorable one. Even if not, there is
still the possibility that a substitution by itself will be tolerated,
but not coupled with a set of other specific ones. (E.g., radical
substitutions of two spatially close hydrophobic residues will have a
more destabilizing effect on the structure than a single such
substitution.) All these details are quite important if you wish to
obtain a reliable estimate of the number of sequences that will fold
into a specific shape, but I've learned by now that you are above such
minutiae.
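
The coupling problem can be shown with a toy model. This is a sketch
under invented assumptions: the additive penalty table, the threshold,
and the two-site "library" are all hypothetical and stand in for no real
measurement. It only shows how a substitution that is fine on its own
can be invisible in a library where sites are always mutated together.

```python
# Hypothetical destabilisation penalties, in arbitrary units. "wt" is the
# wild-type residue; "sub" is a radical substitution at that site.
PENALTY = [
    {"wt": 0.0, "sub": 1.0},   # site 0
    {"wt": 0.0, "sub": 1.0},   # site 1, imagined as spatially close to site 0
]
THRESHOLD = 1.5  # hypothetical: variants above this total do not fold


def is_functional(variant):
    """A variant folds if its summed penalty stays within the threshold."""
    total = sum(PENALTY[site][res] for site, res in enumerate(variant))
    return total <= THRESHOLD


def tolerated_alone(site, res):
    """Is the substitution viable in an otherwise wild-type background?"""
    variant = ["wt"] * len(PENALTY)
    variant[site] = res
    return is_functional(variant)


def seen_in_functional(library, site, res):
    """Does the substitution appear in any functional library member?"""
    return any(v[site] == res and is_functional(v) for v in library)


# A library in which both sites are always mutated together.
library = [("sub", "sub")]

print(tolerated_alone(0, "sub"))              # the substitution in isolation
print(seen_in_functional(library, 0, "sub"))  # the same substitution, as observed
```

In this toy setup the substitution at site 0 is viable by itself, yet it
never turns up in a functional clone, so a tally of "allowed" residues
taken from the library alone would wrongly exclude it.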

> > The authors
> > themselves realise the potential problems, and don't actually give any
> > figures for the "frequency" of the ARC repressor themselves in the
> > paper, since such figures would be seriously skewed as the result of
> > necessary limitations in their methodology. These figures, however,
> > can be obtained by simply looking at figure 3 of the paper, and
> > performing some simple multiplication, without understanding the
> > broader context and limitations of the work. I wonder if that's what
> > happened?
>
> I don't think this is what the authors were thinking here since Sauer
> did in fact make such estimates in his other papers.

In only one, and with a major disclaimer. As I suspected, some
creationists took the highly tentative result, disregarded all the
provisos, and ran with it. What a surprise!

> > > > I think you misunderstand my idea of a cluster. You cannot envision
> > > > sequence space as neatly as you would like to, as a set of rising
> > > > complexity levels based on protein length. Since there are
> > > > length-altering transformations, the sequence space being explored has
> > > > an effectively infinite dimension. (There is no 100-residue space,
> > > > 1000-residue space, 10000-residue space: there is only one space.)
> >
> > > This doesn't seem to be true. If you get a length altering mutation,
> > > you leave the sequence space you were in and move to a new sequence
> > > space. There are definitely limits to the sequence space of
> > > 100-residues and it is much different from the sequence space of 10,000
> > > residues. And, one can limit one's search to a particular sequence
> > > space.
> >
> > Sean, if the "random walk" can transition between any two points in
> > configuration space, then, by definition, they are part of the same
> > space.
>
> No, it's not. You are simply transitioning between different spaces
> with different ratios. The odds of success do in fact change with such
> transitions.

Fine, if you wish to re-invent terminology for describing Markov
processes, who am I to argue? You are the maverick mathematician here.
The random walk transitions back and forth between multiple spaces.
That justifies modeling the random walk across the multiple spaces as
uniform random sampling of one single space at a time. Yep, it's all
perfectly clear now.

> > You may talk about different regions of sequence space, if you
> > wish, but they are still very much part of the same space.
>
> This is just semantics. It is meaningless to this discussion.

Naturally. How could I have possibly thought that proper use of
terminology might be at all meaningful to the discussion? Silly me.

> The different "regions" of the overall sequence space, if you prefer that
> term, carry with them different odds of success. The point is still
> the same however you look at it and however you wish to label it.

Amazingly, I actually agree with this statement. It's the specific
claims that you make about these odds that I disagree with.

> > So, no, you
> > cannot limit your "search" to a particular space, unless you can
> > somehow disallow all length-altering transitions between states (i.e.
> > sequences).
>
> It is possible to search only one particular level in space. It is
> also possible to wander between multiple levels. The odds of success
> change, however, within different levels. This doesn't help you.

So, do you think that the distribution of "beneficial sequences" in the
"space" of level L is independent of the distribution of such sequences
at level L-1? Because, for your calculations to hold (or at least, to
somewhat accurately approximate random walk times), both distributions
must be uniform random, and therefore independent of one another.

> > Incidentally, defining regions of space as you do also fundamentally
> > conflicts with your complexity metric. Since your complexity metric is
> > based on the frequency of a particular function in sequence space,
> > there is no simple way to map sequence length onto complexity.
>
> Yes, there is. Each higher level of sequence space (increase in
> minimum size/or specificity) is a higher level of complexity.

Which is why, I suppose, you failed to answer (and possibly, to
understand) my question below.

> > Suppose
> > you have a "poorly specified" 1000 res sequence and a "heavily
> > specified" 100 res sequence, both of which have a frequency of 1e-60.
> > Are they in completely different sequence spaces of 100 and 1000
> > respectively, or are they in the same complexity space of 1e-60? Your
> > model is self-contradictory in its very basic definitions: no wonder
> > you are having so much trouble articulating them.
>
> Sequence space is built on levels of sequence size. The larger the
> minimum size requirements, the exponentially larger the sequence space.

How does sequence size relate to "minimum size requirements"? Does a
1000-res sequence belong to the 1000-res sequence space, or to the
400-res sequence space, if 400 happens to be the "minimum size
requirement" for some selectable level of that protein's function?

> Within this sequence space, one may find a greater ratio of sequences
> with lower specificity as compared to those with higher specificity
> requirements. The space is still the same, but the odds of success are
> very much different given different specificity requirements. The
> greater the specificity requirements at a given level of sequence
> space, the greater the level of functional complexity.

So, to repeat my question, are the "spaces" divided from each other by
possible protein lengths, or by possible complexity levels? If the
former, then you admit that the sequence space at a given level is far
denser than you claim, since you claim that many long proteins
don't have to be specified at all. (Elsewhere, you claimed
that the density for a "template-matching" function may be one in two: a
ridiculously high estimate, but whatever floats your boat.) If the
latter, then how do you determine to which "space" a sequence
belongs just by looking at it? Please answer my question about the
1000-res and 100-res sequences with the same ratios, instead of waving
your hands.
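
To put numbers on the 100-res versus 1000-res question, here is a
minimal sketch. It assumes a 20-letter alphabet and independent sites,
both simplifications of mine, and the function names are made up for
illustration. It asks what a frequency of 1e-60 implies per site at each
length.

```python
import math

AMINO_ACIDS = 20  # assumption: standard 20-letter alphabet


def avg_allowed_per_site(ratio, length):
    """Geometric-mean number of residues tolerated per site.

    Solves (allowed / 20) ** length == ratio for `allowed`, treating
    every site as independent (a simplifying assumption).
    """
    return AMINO_ACIDS * 10 ** (math.log10(ratio) / length)


def functional_count_log10(ratio, length):
    """log10 of the number of functional sequences at this length."""
    return length * math.log10(AMINO_ACIDS) + math.log10(ratio)


for length in (100, 1000):
    avg = avg_allowed_per_site(1e-60, length)
    count = functional_count_log10(1e-60, length)
    print(f"{length:>4} res: ~{avg:.1f} residues allowed per site, "
          f"~10^{count:.0f} functional sequences")
```

Under these assumptions the same ratio corresponds to about 5 tolerated
residues per site at 100 residues and about 17 per site at 1000, so
"same frequency" and "same length" pick out very different things.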

> This model is not at all contradictory or difficult to understand. I'm
> actually quite surprised that you and others in this forum seem to be
> having such difficulties understanding this concept when many others
> I've talked to, to include, geneticists, biologists, biochemists,
> mathematicians, and even high-school students do not have much trouble
> at all.

Whatever you say, Sean. No doubt countless lurkers support you in
e-mail, too. But the best test of the clarity and accuracy of your
model would be to submit it for publication. Perhaps to a mathematical
journal, where the reviewers aren't likely to understand biological
subtleties, and would review your article on its mathematical merits,
or to a specialized biochemical journal, where the reviewers would be
most interested in the biochemical merits of your argument. Or,
ideally, to one of the many bioinformatics journals, where the
reviewers would likely be familiar enough with math, biochemistry, and
evolutionary biology to assess the quality of your work on all
counts. Let us see how well your submission might do, and what
comments you might get back. I am sure that the people on the editorial
boards of scientific journals are at least as bright as those high
school students you've been lecturing.

> > > > The "cluster" is not based on simple Hamming Distance (the number of
> > > > single-residue substitutions) either: it is a small swath of our
> > > > infinite-dimensional sequence space that is connected under all
> > > > evolutionary transformations.
> >
> > > This is also not true when you consider the random-walk odds of
> > > success.
> >
> > This has absolutely nothing to do with the odds of success. I am
> > simply describing the configuration space where your "random walk" must
> > occur, if you truly wish to model evolution as a random walk on a
> > frozen configuration space.
>
> I do not wish to model evolution as a random walk on "frozen"
> configuration space.

Well, you aren't even doing that. You're modeling it as uniform random
sampling of the said space. Modeling it as a random walk would be a
step up.

> The space is not frozen at all, though it is more
> static than you seem to be suggesting at higher levels. But, even with
> a highly fluid sequence space, as far as "beneficial" sequences are
> concerned, the odds do not change as long as the ratios remain
> essentially the same. It is the ratios that are essentially static,
> not the locations of the beneficial islands within sequence space.

The odds would change if the distributions of "beneficial" states were
to change, even if the ratios did not. But the claim that ratios
remain static is another unsupported assertion of yours, and, on its
face, false. Do you really think a nylonase would be
beneficial regardless of whether or not nylon were present in the
environment?
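
On the first point, that the distribution matters even when the ratio is
fixed, here is a toy calculation. It is not a model of evolution: it
assumes a simple symmetric nearest-neighbour walk on a ring of states,
which is my own choice of example, and the sizes and target positions are
hypothetical. It compares the expected time to reach a "beneficial" state
for two layouts with the same ratio.

```python
def sampling_time(n, targets):
    """Expected draws until uniform sampling with replacement hits a target."""
    return n / len(targets)


def walk_time(n, targets, start):
    """Expected steps for a symmetric nearest-neighbour walk on a ring of
    n states to first reach any target, starting from `start`.

    Uses the gambler's-ruin result: if the nearest targets are a steps
    away in one direction and b steps away in the other, the expected
    time is a * b. `targets` must be non-empty.
    """
    if start in targets:
        return 0
    a = next(d for d in range(1, n + 1) if (start + d) % n in targets)
    b = next(d for d in range(1, n + 1) if (start - d) % n in targets)
    return a * b


N = 1000
spread = set(range(50, N, 100))     # 10 targets, evenly spaced
clustered = set(range(500, 510))    # 10 targets, in one distant cluster

for name, targets in (("spread", spread), ("clustered", clustered)):
    print(name,
          "ratio:", len(targets) / N,
          "sampling:", sampling_time(N, targets),
          "walk:", walk_time(N, targets, start=0))
```

The sampling estimate is identical for both layouts because it only sees
the ratio, while the walk times differ by about two orders of magnitude.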

> > > > So, for example, a 100-residue P and a
> > > > 200-residue protein Q would be right next to each other in this space,
> > > > assuming that P contains a 100-residue domain that is highly homologous
> > > > to Q.
> >
> > > That's not true as far as random walk distances/times are concerned.
> >
> > I am not talking about expected random walk times here: I'm talking
> > mutational distances. I don't ever refer to expected random walk times
> > as "distances": you shouldn't either.
>
> Lots of scientists refer to random walk times as "distances" - Because
> it really is a type of distance.

I would really like to see some citations, especially from the computer
science or statistical literature.

> The "mutational distances" must
> consider the odds of success as part of the "distance" to be covered.
> Multicharacter differences do indeed create much greater mutational
> distances than do single character differences.

To disambiguate my terminology: a "mutational distance" is the minimum
number of mutations needed to get from one point in configuration space
to another. Nothing more, nothing less. I spell this out so that you
and I don't quibble about semantics.
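
Under that definition, the distance depends entirely on which mutations
count as a single step. A minimal sketch follows, assuming a toy move set
of my own choosing: single-residue substitutions, plus insertion or
deletion of a contiguous block, counted as one event when `block_indels`
is set. The function name, the flag, and the toy sequences are
hypothetical. The value returned is the cost of the cheapest alignment
under those moves.

```python
def mutational_distance(a, b, block_indels=True):
    """Cheapest alignment cost between sequences a and b.

    Moves: substitute one residue (cost 1), or insert/delete a run of
    residues (cost 1 per run if block_indels, else cost 1 per residue).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = inf
            if i > 0 and j > 0:
                # Match (free) or substitution (cost 1).
                best = d[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            # Delete the run a[k:i] in one event.
            first = 0 if block_indels else max(i - 1, 0)
            for k in range(first, i):
                best = min(best, d[k][j] + 1)
            # Insert the run b[k:j] in one event.
            first = 0 if block_indels else max(j - 1, 0)
            for k in range(first, j):
                best = min(best, d[i][k] + 1)
            d[i][j] = best
    return d[n][m]


short = "ABCDE"            # toy stand-in for a single-domain protein
fused = "ABCDEFGHIJ"       # the same domain with a second one appended

print(mutational_distance(short, fused))                      # block events
print(mutational_distance(short, fused, block_indels=False))  # residue by residue
```

With block events allowed, the two toy sequences are one step apart; if
only single-residue changes are counted, they are five steps apart. Which
count is appropriate depends on which mutations the model admits.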

> > > Even if all 100 Q-residues were exactly the same as 100 P-residues,
> > > finding a fully functional 200aa P-sequence starting with Q is not
> > > going to be easy. Almost certainly they are not right next door to
> > > each other as far as random walk distance/time is concerned.
> >
> > But they are right next door to each other as far as the number of
> > mutations is concerned.
>
> They may be right next to each other as far as the fewest possible
> mutations are concerned, but they are not right next door to each other
> as far as the average number of mutations are concerned.
>
> > The transition probability, of going from P to
> > Q, would, of course, be lower than the transition probability of going
> > from P to some protein P' that is only one residue different from P.
>
> Exactly . . . that's the whole point. We are talking average
> distances/times here. We are not talking about the shortest possible
> distance/time since finding this shortest possible path is highly
> unlikely, at higher levels, this side of trillions of years of average time.

Indeed? So, you're claiming that the combination of two protein
domains, or a new beneficial interaction between two pairs of proteins
of known function is highly unlikely this side of a trillion years?
Communication with you is most beneficial, Sean. I learn new things
every day.

> > If you wish to argue this topic further, I suggest you learn some
> > terminology for describing Markovian processes. Your current use of
> > math and terminology suggest that you are arguing way outside your
> > field of expertise.
>
> My use of language here is quite easily understood. You're trying to
> argue semantics that have nothing to do with the main issue at hand.

The definitions have everything to do with the matter at hand. I am trying
to understand if the picture of "sequence space" that *I think* you
envision is indeed what you envision. Because what I think you
envision seems very wrong to me. So, there are two possibilities:
either you are indeed very wrong, or I misunderstand you. By now, I'm
fairly certain I understand you, but it wouldn't hurt to clarify a few
things here and there.

> > > > Even if you claim that huge neutral gaps exist in this space
> > > > (and they almost certainly don't), your calculations aren't adequate
> > > > for computing the size and distribution of such gaps.
> >
> > > They are adequate enough to get a very good idea about the average
> > > times involved - as far as being way over what even evolutionary time
> > > frames have to offer.
> >
> > Only if you model evolution as sampling entire sequences from a set of
> > unconnected "sequence spaces" uniformly at random, assuming a uniform
> > random distribution of "beneficial sequences" in each space.
>
> My argument is based on connected sequence spaces where beneficial
> sequences are indeed clustered and highly interconnected at low levels
> of minimum size and/or specificity requirements, but are less and less
> interconnected at higher and higher levels.

Are they less and less interconnected with the other sequences on their
level only, or with the sequences above and below as well? If the
latter is true, we would expect to find little or no homology between
beneficial sequences between higher and lower levels. Is that what we
observe?

> > That's not even a random walk: that's random sampling.
>
> At higher and higher levels, natural selection does indeed tend to keep
> islands intact. Those sequences that get mutated off the island, do
> indeed sample the surrounding sequence space in a random sampling type
> pattern - not a random walk.

Oho! Now this is a new claim, representing a slight but refreshing
refinement of your ideas. I assume that by "random sampling", you
don't mean that a sequence starts wildly mutating every amino
acid at once, but rather that after considerable time in the "neutral
gap", the effects of a random walk begin to approach those of random
sampling. What you are trying to say, using the proper terminology which
you deem so unimportant, is that after a certain amount of time has
passed, the random walk will approach a stationary distribution, and
that stationary distribution is uniform. Congratulations, Sean, you've
just made a step in the right direction!

Of course, proving that a particular random walk converges to a
stationary distribution *and* that such a distribution is uniform is
itself a pretty difficult problem, and not particularly relevant for
a process that cannot be adequately modeled by a random walk. But it
wouldn't be fair to ask you to do everything at once. Let's take it
one step at a time: hopefully, this will result in the evolution of
your arguments, and not a random walk.
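
A small numerical illustration of why uniformity cannot simply be
assumed. This is a sketch on a hypothetical five-state ring with
transition probabilities I made up. It iterates the chain for many steps
and compares an unbiased walk with one where a single state is "sticky",
a crude stand-in for a state favoured by selection.

```python
def ring_chain(n, stay_at_zero=0.0):
    """Transition matrix for a nearest-neighbour walk on a ring of n states.

    Every state steps left or right with equal probability, except that
    state 0 stays put with probability `stay_at_zero`. Use an odd n so
    the unbiased walk is not periodic.
    """
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        stay = stay_at_zero if i == 0 else 0.0
        P[i][i] += stay
        P[i][(i + 1) % n] += (1.0 - stay) / 2
        P[i][(i - 1) % n] += (1.0 - stay) / 2
    return P


def long_run_distribution(P, steps=5000):
    """Push a point mass at state 0 through the chain `steps` times."""
    n = len(P)
    dist = [1.0] + [0.0] * (n - 1)
    for _ in range(steps):
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return dist


unbiased = long_run_distribution(ring_chain(5))
sticky = long_run_distribution(ring_chain(5, stay_at_zero=0.8))

print([round(p, 3) for p in unbiased])
print([round(p, 3) for p in sticky])
```

The unbiased walk settles on equal weight for every state, while the
sticky chain puts 5/9 of its weight on state 0 and 1/9 on each of the
others. The long-run behaviour only looks like uniform sampling when the
transition probabilities happen to make it so.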

> Natural selection either discards such
> sequences or pulls them back to the starting point island. Some
> sequences, may undergo random walk for a while, but at higher levels,
> the odds of success are extremely remote this side of trillions of
> years of average time.

Right. Absent the math for anything other than random sampling from
one frozen, fixed-dimension, dubiously highly sparse sequence space, I
still don't believe you.

> > If you don't
> > understand the difference, I suggest a good college-level course on
> > stochastic processes.
>
> Well, given your arguments here, I'm not sure you've convinced me that
> my understanding is that far off when it comes to this particular
> topic.

To resolve your doubts, I'm sure that I haven't convinced you. Still,
if you have some free time, I hope that you educate yourself on the
difference between random walks and uniform random sampling, assuming
you still haven't. A good intro-level textbook on stochastic processes
can do wonders to help you hone your arguments.

> > > Look at the work done by Yockey and Sauer, for example. They took only
> > > one small part out of a larger system of function and both found that
> > > the ratio for this one small part was less than 1 in 1e60.
> >
> > Likely overestimates both. Yockey eventually corrected himself.
> > Sauer's methodology was not designed to accurately assess this "ratio"
> > of yours.
>
> These are published estimates in real journals. If you disagree,
> please do publish your own estimates for the ratios of such functions
> in sequence space.

So, if a published work is in agreement with your claims, it suddenly
becomes gospel truth? You seem to have no problem disputing a large
portion of published literature without ever feeling the need to submit
your disagreements for publication. Asking me to do so when I offer a
criticism of your interpretation of published data is more than a tad
hypocritical.

> > > Even if you argue for 1 in 1e35, it doesn't matter. The most likely probability
> > > is that the other parts in this system of function, of equivalent size,
> > > are at least as specified. The other cytochromes in the ETC chain all
> > > probably have equivalent specificity requirements. The same is true of
> > > all the parts of the flagellar system.
> >
> > It's certainly not true of globins, for example. In fact, Cyt C is
> > quite unusual by protein standards, as it displays a far greater degree
> > of conservation than usual. To simply apply the same degree of constraint
> > seen in Cyt C to every protein in existence is highly unjustified.
>
> Funny how Sauer and Yockey came to pretty much the same ratio using
> very different protein-based functions.

Funny how Yockey has revised his estimates down 25 orders of magnitude.
Funny how Lau and Dill (PNAS 87:638-642) arrived at a much higher
fraction of proteins that would fold into a Cytochrome C (1e-15, thank
your creationist source for pointing this out) using a different model.
Funny how Sauer strongly cautions against taking his results at face
value. Lots of funny things seem to be happening around here, Sean.

> While I agree that Cyt C does
> have a higher degree of specificity than most protein-based functions
> of equivalent minimum sizes, many proteins do indeed carry a fairly
> high degree of minimum specificity requirements. The other cytochromes
> in the ETC are likely to have ratios at least as high as 1 in 1e30.
> The same is true of the proteins used in flagellar motility.

I shudder to think about the orifice from whence these numbers came.

> > > A gap to the nearest beneficial island that is just a couple dozen
> > > *fully specified* multi-character differences wide is insurmountable.
> > > A gap size that is hundreds of fully specified character differences
> > > wide is almost infinitely more insurmountable.
> >
> > So, if there are differences of dozens of amino acids in the proposed
> > pathways for flagellar evolution, why do you assume that all of them
> > are "fully specified"?
>
> I don't. A gap of 40 or 50 residues specified at 1 in 1e30 would still
> be insurmountable.

And meeting the Jabberwock - the jaws that bite, the claws that catch!
- would indeed be a very scary, not to mention brillig, experience. If
such things existed.

Cheers,
Leonid.
