Re: Maximum, Average, and Likely Minimum Gap Distances



Seanpit wrote:
On Jul 13, 6:26 am, John Harshman <jharshman.diespam...@xxxxxxxxxxx>
wrote:
I'm talking about the entire sequence minimum needed to realize a
particular type of function. Where have you been? Why do you think I
keep talking about the likely minimum size needed for a function like
CytoC or lactase or rotary flagellar motility to work? These
different types of functions have different minimum structural
threshold requirements - obviously. One might argue, and be at least
somewhat reasonable at the same time, that the likely minimum size
requirement for CytoC functionality in a given life form is 80aa.
Might one? How would one argue this?

Read the papers I've listed for you. Or, provide some evidence of
your own to even suggest that the likely minimum CytoC size from the
perspective of any living thing is significantly less than 80aa. So
far, all you have is incredulity with no real evidence to back
yourself up - Certainly nothing that has actually been published.

Again you confuse evidence against X with evidence for X.

But how about a deal? I read one paper, you answer a question I want to ask you, in adequate detail. The first question would be how you can tell if two organisms belong to the same "kind" or a different one.

I will note that this is relevant to the discussion. You have said that some gene families really do contain homologous genes that are related by common descent. It would be nice to have a guide to tell me how to recognize such families and tell them from families that contain separately created genes. This would aid discussion.

One
could not make that argument for the flagellar motility function - not
remotely. Why? Because the likely minimum structural threshold
requirement needed to achieve the rotary flagellar motility function
is on the order of several thousand fairly specified residues at
minimum. For lactase it is on the order of several hundred - at
minimum.
Here you have gone from sheer number of residues to number of "fairly
specified" residues, as if that's the same thing. You keep flipping back
and forth among different versions, and this is what makes your
arguments so ambiguous.

I've always presented the minimum size requirement as a certain number
of fairly specified residue positions. The 80aa minimum size
requirement for CytoC isn't 80 absolutely specified residue
positions. How has that not been clear to you? - especially after I
directly pointed out to you that only about 27 of this 80aa minimum
were invariant? None of the systems I've presented are absolutely
specified. They all have a fair degree of flexibility. As far as what
I mean by "fair degree" read the Durston paper and note the concept of
FSC density in terms of "Fits" per residue position.

You're saying that "fairly specified" means an average of 2 possible amino acids per site? So how do you determine the number of amino acids in cytochrome c that are fairly specified? Do you start with the most conserved sites and add new sites till you get a mean of 2? Or what?
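
For readers following the "Fits" terminology, here is a minimal sketch of how a per-site value could be computed. It assumes the measure is the drop in Shannon entropy from a uniform 20-amino-acid null state (log2(20), about 4.32 bits) to the observed column of an alignment. The function name `fits_per_site` and the four-sequence toy alignment are hypothetical, chosen only for illustration; they are not data from the Durston paper or from any real cytochrome c alignment.

```python
import math
from collections import Counter

def fits_per_site(alignment):
    """Return log2(20) minus the Shannon entropy of each alignment column.

    Assumes an ungapped alignment of equal-length sequences and a uniform
    null state over 20 amino acids (an assumption, not taken from the thread).
    """
    null_bits = math.log2(20)
    n_cols = len(alignment[0])
    result = []
    for i in range(n_cols):
        counts = Counter(seq[i] for seq in alignment)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        result.append(null_bits - entropy)
    return result

# Hypothetical toy alignment: column 0 is invariant, columns 1 and 2
# each allow two residues at equal frequency.
toy_alignment = ["ACD", "ACE", "AGD", "AGE"]
fits = fits_per_site(toy_alignment)
for site, value in enumerate(fits):
    print(f"site {site}: {value:.2f} Fits")
```

Under these assumptions an invariant site scores log2(20) Fits, and a site that tolerates two equally frequent residues scores exactly one bit less. A per-site table of this kind is what a threshold for "fairly specified" would have to be applied to.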

The average gap size is based on the ratio of sequences that would be
able to produce the function in question - - or more relevantly, the
total number of all potentially beneficial sequences vs. non-
beneficial sequences.
Or, more simply, your "average gap size" is just the number of
constrained sites. A normal person would consider this to be the maximum
gap size.
It isn't the maximum gap size. For a 100aa system, it is quite
possible to have another system that shares absolutely no sequence
homology at all - thereby having a gap size of 100 residue location
differences.
You need to distinguish the gap between sequences, which you don't care
about, from the gap between islands, which you do. If, out of those 100
residues, only 30 are "fairly specified", then the gap size is at most
30, because changing only those 30 residues will move from one island to
another. Therefore in that case, 30 is the maximum gap size -- the
maximum distance between those two islands.
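
The distinction between the two kinds of gap can be sketched in a few lines. This is an illustration only: the two 10-residue sequences and the set of "fairly specified" positions are hypothetical, and `hamming` is just a name chosen here for a count of differing positions.

```python
def hamming(a, b, positions=None):
    """Count positions at which sequences a and b differ.

    If positions is given, only those indices are compared.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    if positions is None:
        positions = range(len(a))
    return sum(1 for i in positions if a[i] != b[i])

# Hypothetical sequences that differ at every position.
seq_a = "ACDEFGHIKL"
seq_b = "LKIHGFEDCA"

# Hypothetical set of constrained ("fairly specified") positions.
constrained = [0, 3, 7]

full_gap = hamming(seq_a, seq_b)                 # gap between sequences
island_gap = hamming(seq_a, seq_b, constrained)  # gap counted at constrained sites only
print(full_gap, island_gap)
```

The two sequences differ at all 10 positions, but only 3 of those differences fall on constrained sites, so on this reading the island-to-island distance is bounded by 3, not 10.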

You do have a point here. It is the number of fairly specified
residue positions that is key. A function that requires little or no
sequence specificity would have a very small maximum gap distance and
smaller still average and likely minimum gap distances.

For CytoC, in particular, it seems that all of the residue positions
have a certain degree of specificity. Some have more, some have
less. However, a given position will not tolerate some amino acid
options at all without a significant loss of CytoC function. Some
options would simply be too destabilizing to the overall function of
the system.

What is your quantitative value for "too destabilizing"?

Now, you might be able to find two or three positions in CytoC that
could tolerate all 20aa options without a complete loss of function,
but the point is essentially the same.

That is the maximum gap size. Now, you might argue that
at least some of these differences are functionally neutral - and that's
true. Increased flexibility at various positions increases the
overall number of potential sequences that can produce the type of
function in question - making it easier to find by a random search of
sequence space. I.e., it makes the size of the island or islands with
the function in question larger.
Yes. So the absolute number of residues is not relevant to the gap size,
right? It's the number of "fairly specified" residues that counts.

That's right. An increase in size alone is meaningless to the
argument. It has to be an increase in fairly specified residue
positions to create a linear increase in the maximum, average, and
likely minimum gap distances.

So the maximum gap size of a sequence is not the sequence length, but the number of "fairly specified" residues. I swear that's what Howard has been claiming and you keep denying.

That would seem to require only that *something* be in them, if indeed
there is such a requirement.
Not quite true. For the CytoC function it is true that only about 27
or so positions are completely non-variable. However, most of the
other positions are also very restrained as well - to only a handful
of options. There are a few that allow 8 or 9 residues, but even this
degree of flexibility isn't limitless. And, this is only considering
single replacement events - one at a time. Studies show that if more
than one position is replaced at the same time, the constraints are
even more restrictive.
This is why the likely size minimum for CytoC is more like 80aa rather
than 27aa. It also means that the maximum gap distance between a
different type of functional system with a similar minimum size
requirement of 80aa and CytoC functionality is 80aa differences, not
27.
How did you compute this number? How did you compute the likely gap size
of 30? And why did you say previously that the maximum gap size for
cytochrome c was 100?

I didn't argue that the likely gap size for a 100aa system at the
level of specificity of CytoC would be 30 residue differences.
That's Howard's strawman version of what I actually said. What I
really said is that 30 residue differences is the likely average gap
distance between 100aa systems at this level of specificity or FSC.
The average gap distance isn't the minimum likely gap distance. The
minimum likely gap distance depends upon the number of protein-based
systems of this size in the gene pool and is always smaller than the
average gap distance.

I also didn't say that the likely maximum gap distance for CytoC was
100 residue differences. It isn't. I used the 100aa number because
that is the number Yockey used to estimate the ratios of CytoCs in
100aa sequence space.

Is there a difference between "likely maximum gap distance" and "maximum gap distance"? And when you say "100aa system" are you referring to a protein with 100 residues or one with 100 "fairly specified" residues?

How many "fairly specified" residues are there in cytochrome c?

What you are talking about is the maximum possible random walk
distance - which is infinite regardless of the absolute number of
residue differences. It is just that as the gap distance gets
smaller, the odds that the random walk distance will in fact be
infinite decrease exponentially.
So you're measuring only Euclidean distances here.

That's right . . . However, a linear increase in the Euclidean
distance translates into an exponential increase in the average random
walk distance.

I thought you weren't concerned with random walk distances. Why are you bringing them up here?

But the maximum possible distance between two *islands* is equal
to the number of constrained sites in the target sequence, and that's
what Howard means by "maximum distance".
In order to know the distance between two islands of unknown position
in sequence space, you have to know something about the ratio of
islands vs. non-islands (or quarters vs. non-quarters). This ratio
will tell you the average expected distance between any particular
starting point and an island in sequence space. This isn't the
maximum possible distance. It is the average linear distance that is
expected to exist between any chosen starting point and any one of the
quarters in the circle. The greater the degree of sequence
flexibility, the more quarters there are in the circle and the less
the expected average distance between a chosen starting point and the
quarters in the circle.
You mean, in this analogy, the greater the degree of sequence
flexibility, the bigger the quarters are.

Either way, the statistics are the same.

I think not, unless each quarter/island is a set of points randomly distributed in sequence space.

One problem with your
procedure, if your analogy is at all useful, is that you're assuming
each island to consist of points (quarters) randomly distributed in
sequence space, when they are in fact tightly clustered.
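
The assumption under dispute can be made explicit with a small sketch. If functional sequences were scattered uniformly at random through sequence space (the very premise questioned above), then a functional fraction f would imply a typical distance to the nearest one: the smallest radius d at which a ball of that radius around the starting point is expected to contain at least one functional sequence. The function names and the example fractions below are hypothetical and are not figures from the thread.

```python
from math import comb

def ball_size(n, d, alphabet=20):
    """Number of length-n sequences within d substitutions of a given sequence."""
    return sum(comb(n, k) * (alphabet - 1) ** k for k in range(d + 1))

def radius_with_one_expected_hit(n, functional_fraction, alphabet=20):
    """Smallest d at which ball_size * functional_fraction reaches 1.

    Valid only under the assumption that functional sequences are
    uniformly and independently scattered; clustering breaks this.
    """
    for d in range(n + 1):
        if ball_size(n, d, alphabet) * functional_fraction >= 1:
            return d
    return n

# Tiny hypothetical example: binary alphabet, length 3 (8 sequences in total).
print(radius_with_one_expected_hit(3, 0.25, alphabet=2))
print(radius_with_one_expected_hit(3, 0.125, alphabet=2))
```

If islands are tightly clustered rather than scattered, the same fraction f is compatible with very different distances, which is why the ratio alone does not settle the question.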

Families of single proteins are indeed quite clustered at lower levels
of functional complexity. However, the distance between family
clusters of proteins, even at low levels, is quite significant.
Getting from one cluster of islands to the next cluster of islands is
a bit problematic for random walk. It isn't impossible at lower
levels, just like it isn't impossible to get all the way across 3-
letter word space without having to swim for it very much, but it
isn't as easy as getting from one island to the next in the same
family cluster.

The real problem arises once one starts moving beyond the level of the
single-protein family cluster at the level of a few hundred fairly
specified residue positions (forgive me if I don't always type out
"fairly specified" each and every time I present this idea.

You must, or nobody will know that's what you mean. "Number of residues" is quite a different thing from "number of fairly specified residues".

I figure
that most people can remember what I mean from one paragraph to the
next).

How, if you can't yourself? You seem unable to remember what you mean from one sentence to the next.

Once one starts considering functional systems beyond the
1000aa level of complexity (again 1000aa means "fairly specified aa" -
in case you forgot since the previous sentence), such systems of
higher complexity usually start requiring multiple proteins to form
them - as in systems like rotary flagellar motility, non-rotary
flagellar motility, ATPase, intracellular vesicle transport, etc.
Such complex multi-protein systems are not nearly as clustered
together in sequence space as were lower-level systems (comparable to
multi-word phrases, sentences, or paragraphs in a written human
language system).

How do you know this? The last time I asked, you pointed to a figure in PNAS that didn't even refer to sequence space but to a structural space, and in which there was no major difference in clustering between small and large protein families.

And islands too
seem not to be randomly distributed, but are themselves clustered. You
just can't use the Poisson distribution to calculate distances under
these conditions. A major assumption has been violated.

Not really a problem.

So you claim. What evidence do you have for this?

While there is certainly some clustering, which
would cause a decent modification in the calculations at lower levels,
this clustering becomes less and less clustered at higher and higher
levels of functional complexity - as you yourself can determine by
noticing an absolute increase in the number of required functional
residue positions with the increasing size of fairly specified
systems.

How can I determine this? Do you have a database listing the number of functional residue positions and the sizes of a bunch of systems?

Regarding the size of the islands, have you read any of the referenced
papers I gave you?
Not the ones you're thinking about now. It seems to me that, at most,
you know the sizes of three islands, one of them cytochrome c. Are there
more?

These estimates can be reasonably extrapolated to many other types of
functions. The Durston paper, in particular, analyzes the FSC of
dozens of proteins.

I'll gladly read the Durston paper if you agree to answer my question at the top of this post.

Regarding the distribution of quarters in the circle, the actual
location of the quarters in the circle is unknown. What is known is
that the overall distribution is somewhat clustered. However, it is
also known that this clustering effect gets less and less clustered at
higher and higher levels of functional complexity (i.e., minimum size
and/or specificity requirements).
I believe that's equivalent to an answer of "no" to my second question.

You'd be wrong then. It is known that the clustering effect becomes
less and less clustered. It is also known that even at low levels of
functional complexity the clustering effect between clusters or
families of islands isn't very clustered. There are significant
distances even between families of islands at the level of a few hundred
fairly specified residues.

How is this known?

That is why although evolution between
families happens at such levels in observable time, it isn't that
common. And, it shows an exponential decline in evolutionary
potential, even at these low levels, to the point of complete
disappearance well shy of the 1000aa level.
