Re: Part 1 (of 3): What are major aspects of evolutionary theory?



anon1@xxxxxxx wrote:
> > Even as recently as last month I was reading that haplotype blocks
> > were conserved in the same way linkage groups are; i.e. low
> > probability of being hit by a crossover. I had thus been regarding
> > the terms as synonymous.
>
> They are not synonymous.

Agreed. The real question is whether they are compatible, or whether
they assume incompatible underlying mechanisms. The alleged 'hotspots'
seem to be a new fact about the crossover mechanism which
invalidates assumptions from before the genomics era, possibly
including the very assumptions the Morgan measure is based upon.

> Of the various definitions of "linkage group"
> that I saw online, the most reasonable definition I saw was essentially
> any maximal connected subset per linkage pairs as the connection links.
> <http://opbs.okstate.edu/~melcher/MG/MGW1/MG111122.html>
> * Markers that have measurable recombination frequencies are said to
> be linked.
> * Markers related through a chain of linkage constitute a linkage
> group.
> For example, in mathematical graphs, if there are points A,B,C,D,E,
> where A is connected to B, B is connected to C, C is connected to E,
> and there are no other connections between any of those points, then
> A,B,C,E are one maximal connected set and D all by itself is another
> maximal connected set. By necessity, membership in such maximal
> connected subsets is an equivalence relation.
> Now a larger example in genetics: Suppose along a chromosome there are
> sites/markers/loci A,B,C,D,E,F,G such that each marker is linked to its
> immediate neighbors and also to its next-to-immediate neighbors, but
> not to its next-to-next-to-immediate neighbors. Now consider the most
> distant loci, A and G. The following are just a few of the chains of
> linkage from A to G: A-B-C-D-E-F-G, A-C-E-G, A-B-D-F-G,
> A-C-B-D-C-E-D-G-E-G, etc. It doesn't matter how many chains there are
> from A to G, it just takes one such chain to establish that A and G are
> in the same linkage group. Obviously in this case all seven loci are in
> a single linkage group.
> In most cases in genomes, a linkage group is in fact an entire
> chromosome, because for any two loci A,Z anywhere within a single
> chromosome there's some chain of other loci B,C,...,X,Y such that there's
> linkage between A and B, between B and C, ...etc..., between X and Y,
> between Y and Z, thereby establishing a chain A,B,C,...,X,Y,Z between A
> and Z. The only exception would be if there was a recombination "hot
> spot" so very very hot that there was no linkage between loci
> immediately on opposite sides of it, i.e. during every single instance
> of meiosis there's 50% chance of the two sides of that hotspot swapping
> and 50% chance of them not swapping (or swapping and then swapping back
> before meiosis is finished). Note it's *because* the whole chromosome
> is a single linkage group that it's possible to map entire chromosomes
> using fractions of a "Morgan", usually centiMorgans.
> <http://www.sizes.com/units/centimorgan.htm>
> The genetic distance between two loci is 1 cM if their statistically
> corrected recombination frequency is 1%; the genetic distance in
> centimorgans is numerically equal to the recombination frequency
> expressed as a percentage. Typically a genetic distance of 1 cM
> corresponds to a physical distance of roughly one million base pairs.
> Note that if the distance between two loci is 1 centiMorgan, then
> there's 99% linkage between them, clearly showing they are "linked". So
> if you have a map of loci showing them at various centiMorgan distances
> from the start, i.e. each locus is about 1 centiMorgan from any at the
> next tick mark on the scale, clearly you have a linkage every little
> step, hence a chain all the way from end to end. (Caveat: When the
> total accumulated distance approaches 50 centiMorgans, that does not
> mean that two loci spaced that far apart have 50% chance of a crossing
> between them, because compound probabilities aren't exactly additive,
> and aren't even close to additive over such large distances.
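
The "maximal connected subset" definition quoted above can be sketched in a few lines of Python. This is only an illustration: the function name linkage_groups and the two toy marker sets are made up for the example (they mirror the A..E and A..G examples in the quote), and "linked" is taken as a given yes/no relation rather than estimated from data.

```python
from collections import defaultdict

def linkage_groups(markers, linked_pairs):
    """Return the maximal connected subsets of markers under pairwise linkage."""
    neighbours = defaultdict(set)
    for a, b in linked_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, groups = set(), []
    for start in markers:
        if start in seen:
            continue
        # Depth-first walk collects everything reachable by a chain of linkage.
        group, stack = set(), [start]
        while stack:
            m = stack.pop()
            if m in group:
                continue
            group.add(m)
            stack.extend(neighbours[m] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

# Graph example: A-B, B-C, C-E linked; D linked to nothing.
small = linkage_groups("ABCDE", [("A", "B"), ("B", "C"), ("C", "E")])

# Chromosome example: each locus linked to neighbours one or two steps away.
loci = "ABCDEFG"
pairs = [(loci[i], loci[j]) for i in range(7) for j in range(i + 1, 7) if j - i <= 2]
chromosome = linkage_groups(loci, pairs)

print(small)
print(chromosome)
```

One chain is enough to merge two markers into a group, which is why the seven-locus example collapses into a single group even though A and G are not directly linked.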

I think there are two alternative pictures here.
One is that in meiosis a chromosome may or may not undergo a single
crossover - more than one at a push. In this case, the chance of two
short subsequences finding themselves on different sides is linear in
the distance between them. Here, the outcome boils down to a question
of whether a subsequence was crossed over at all or not at all.
The other picture is that crossover routinely occurs at *many* points.
Here, the outcome boils down to a question of whether a subsequence was
crossed over an even or an odd number of times.

I suspect the prevailing view has been changing from the former to the
latter picture over the last 25 years. It certainly makes a difference
to how long ago you might think the human race went through a
population bottleneck. Or you might just buy straight into the hotspot
idea, in which case it's irrelevant.
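
To make the two pictures concrete, here is a rough Python sketch. The second function is Haldane's mapping function, which follows from the even/odd picture if one assumes crossovers fall independently along the chromosome (a Poisson process with no interference). That assumption is mine for the sake of the example, not something established in this thread, and the function names are made up.

```python
import math

def recomb_single_crossover(d_morgans):
    """Picture 1: at most one crossover, so the chance is linear in distance."""
    return d_morgans

def recomb_many_crossovers(d_morgans):
    """Picture 2: recombinant only if an odd number of crossovers fall between
    the two loci. With a Poisson number of crossovers (mean d_morgans) this is
    Haldane's mapping function."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))

for d in (0.01, 0.10, 0.50, 2.00):
    print(d, recomb_single_crossover(d), round(recomb_many_crossovers(d), 5))
```

The two agree closely at short distances and diverge at long ones, where the second picture levels off below 50%.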

> Does anybody need a mathematical explanation of why it's not exactly
> additive even though it's close to additive over short distances?
> Hint: P*P vs. 2*P*(1-P)) vs. (1-P)*(1-P), where P=0.99, and that middle
> term is what you want to look at.)

From the link you gave, it looked to me as if it is additive by
definition.
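
For what it's worth, the hint can be worked through numerically. A minimal sketch, assuming crossovers in adjacent intervals are independent (my assumption; the function name recomb_over is made up): map distance in centimorgans adds up by definition, but the observed recombination fraction between the outer loci does not.

```python
P = 0.99  # chance of NO recombination within one 1 cM interval

none = P * P            # neither interval recombines
one = 2 * P * (1 - P)   # exactly one recombines: outer loci end up recombinant
both = (1 - P) ** 2     # both recombine: outer loci look non-recombinant again

def recomb_over(n_intervals, r=0.01):
    """Recombination fraction across n adjacent intervals of fraction r each,
    counting only odd numbers of exchanges, assuming independence."""
    return 0.5 * (1.0 - (1.0 - 2.0 * r) ** n_intervals)

print(none, one, both)
print(recomb_over(2), recomb_over(50))
```

Two 1 cM steps give a recombination fraction a little under 2%, and fifty such steps give well under 50%.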

> > It speculated that because the human species had been through a
> > severe population bottleneck in the distant past, there have not been
> > enough generations since that time for crossovers to have divided the
> > genome up very finely.
>
> You wrote the premise backwards. You meant to say because the human
> species had been through a severe population bottleneck in the
> *not*very*distant* past ...

Yes, I did put it the wrong way round, perhaps because I had in mind
that it must have been a long time ago in terms of human history.
Hence my inclination to identify it with the last time of speciation.
Suppose instead the population bottleneck happened 75% of the way into
the life of the species. We would, of course, have to consider the
people before the bottleneck as fully modern humans. Then how come it
was only after that time that they managed to spread to all the
continents (except Antarctica)?

> Yes, with that correction, I agree that is one good reason why today we
> observe a small number of different haplotype blocks compared to what
> would happen with an unlimited-size population over an unlimited amount
> of time. I think the other factor is the great disparity of "hotness"
> between various recombination sites. See later below...
>
> > I ... wondered if it was the same for most species (i.e. is there
> > always a strong founder effect involved in speciation?).
>
> That is one of the major hypotheses as to what triggers a species split
> (which is then followed by one of three scenarios: new tiny-population
> species goes extinct before it leaves any fossils, new tiny-population
> species drives parent species to extinction, both species survive long
> enough to leave parallel sets of fossils, sometimes both surviving long
> enough to split further). It seems to me that haplotype studies could
> answer this question, by checking several other species to see whether
> they likewise show a population bottleneck as evidenced by a
> restriction of haplotype variety just about at the point where the
> species diverged visibly (fossil characters) from the parent species.
>
> > http://www.wellcome.ac.uk/en/genome/thegenome/hg04f001.html
> 20/3/03. By RT
> A new map of human variation will greatly aid research on the genetic
> origins of disease.
> (Note this was written *before* the project was done, indicating what
> they *expected* to learn, not what was actually learned after
> completion and post-analysis.)
>
> Genetically speaking, humans are incredibly similar to one another.
> Any two unrelated genome sequences differ at only one position in a
> thousand, on average. The 0.1 per cent difference, which amounts to
> about three million base pairs of DNA in total, ...
> Much genetic diversity (around 90 per cent) consists of single
> nucleotide polymorphisms (SNPs), ...
> (What do the other kinds of diversity consist of? Tandem repeats?
> Rearrangements? Duplications? Block deletions? Point inserts and
> deletes? How do you compare large block changes with point changes on a
> fair basis? For example, does a single block of 50 bases that gets
> duplicated count the same as 50 separate SNPs, or just 1 or 2 SNPs?)

I *guess* the percentages are of loci with/without polymorphism, i.e.
it's real estate, with no weighting for higher degrees of polymorphism.
So 50 is 50 whether SNPs or not. I guess this, because surely they
would have to say if they meant anything else.

> SNPs are scattered liberally through the genome. While most of them
> are found outside genes and probably do not have any effect, those
> located in and around genes may contribute to the genetic basis of our
> biological individuality, ...
> SNPs outside any phenotype-affecting regions (exons, regulatory, etc.)
> are neutral, hence drift randomly, and it takes a long time for them to
> be fixed one way or the other, so SNPs tend to remain for a long time.
> But SNPs inside phenotype-affecting regions are sometimes neutral
> (3base->1aa coding synonyms) and sometimes not neutral, and the latter
> under selection pressure, so they tend to be rapidly moved to one
> extreme or the other i.e. eliminating one allele and fixing the other.
> So statistically, how much more frequent are neutral SNPs than selected
> SNPs compared to what you'd expect based on how many places they'd be
> neutral vs. selected? I.e. how much is SNP diversity reduced in
> selected places compared to neutral places?

I haven't the faintest idea. Because you speak of SNP diversity, I
assume you are not talking about prevalence of variants (i.e. allele
populations), but about prevalence of variation (i.e. sites having
variants).

If we are talking about non-neutral sites responsible for vital cell
chemistry or physiology, I would of course expect them to be very clean
of variation. I guess these make up the majority of the non-junk
genome, I guess they are the oldest, and I guess they are also
comparatively clean of interspecies variation. Fitness of these sites
is under constant maintenance selection, and this selection is very low
cost as it only requires the death of gametes or early stage embryos.

The above sorts of sites aside, I can see no a priori reason why the
prevalence of polymorphism in selected-for sites should be lower than
that of neutral sites. Remember you are looking at a snapshot of
ephemeral polymorphisms. The prevalences are equilibrium levels. Yes,
selectable sites are constantly being depolymorphised (either by
extinction or by becoming fixed). Likewise neutral ones are constantly
being taken out by drift to fixation. I believe because of sex
(heterozygosity shelters damaged sites from selection), there is a high
outstanding population of deleterious variants on the way to becoming
extinct, or nearly so. It is therefore not clear to me that selectable
sites should be particularly clean, if we are talking about the ones
actively evolving, or about the ones responsible for traits which make
a difference between species.

> For many years, researchers have been aware of a phenomenon called
> linkage disequilibrium - the tendency for alleles at separate sites in
> the genome (in this case SNP alleles) to be found together more
> frequently than would be expected by chance.
> Note, as was carefully explained in a Web site I found the other day
> <http://linkage.rockefeller.edu/wli/lld.html>
> linkage disequilibrium is a historical record, i.e. a state caused by
> accumulated history, as opposed to linkage which is merely conditional
> probabilities that apply regardless of the initial state. The cause of
> linkage disequilibrium is that in the distant past there was only one
> allele at each of the two sites, but then a mutation happened at one of
> the sites, so now there's one line of descent (at the single-chromosome
> level, half of a diploid genome) with that mutation and one line of
> descent without it. Then later another mutation happened at the other
> site. If it happened in the original line of descent, then there are
> now three lines of descent, with each single mutation and also the
> parent allele without either mutation. If the second mutation happened
> in the line of descent having the first mutation, then again there are
> now three lines of descent, but one with neither mutation, and one
> with both mutations, and one with just the first. Note in both cases
> only three of the four possible combinations occur at all, which
> already constitutes linkage disequilibrium. Over subsequent
> generations, if there are meiotic crossings frequently between those
> two sites, then the two independent variations are mixed in all
> combinations and the linkage disequilibrium disappears. But if the two
> sites are sufficiently closely linked that crossing-over hardly ever
> happens between them, then even over moderately long spans of time
> (tens of thousands of years) the linkage disequilibrium persists.
>
> If we observe linkage disequilibrium between two SNPs nowadays, that
> could mean either that two mutations occurred so recently that there
> hasn't been time for crossing-overs between them to homogenize them
> back to equilibrium, or that there are no hot spots between them so
> that even over long spans of time the ancestral linkage disequilibrium
> from the two long-ago mutations has persisted.
>
> If we observe the same two SNPs appearing in many widely separate
> native populations which couldn't possibly have all exchanged DNA with
> each other recently (assuming we see linkage disequilibrium between
> these two SNPs in the first place), that eliminates the recent-mutation
> hypothesis, leaving only the no-hot-spot hypothesis. Apparently there
> are in fact huge blocks of DNA bases which almost never cross over
> during meiosis anywhere within them, not a single crossing in tens of
> thousands of years apparently, the "haplotype blocks" being studied.
> Most such blocks contain many more than just two SNPs, whereby the
> linkage disequilibrium is much more obvious and "provable".

That's selling it to me. Especially if, as it seems, the population
genetics is backing up the genomics.
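
A small sketch of the quoted argument, for anyone who wants numbers. It uses the textbook measure D = f(AB) - f(A)*f(B) and the standard decay D_t = D_0 * (1 - r)**t under random mating. The haplotype frequencies, the recombination rates and the figure of 1000 generations are invented for illustration only.

```python
def linkage_disequilibrium(hap_freqs):
    """D = f(AB) - f(A) * f(B), from haplotype frequencies keyed AB, Ab, aB, ab."""
    f_A = hap_freqs["AB"] + hap_freqs["Ab"]
    f_B = hap_freqs["AB"] + hap_freqs["aB"]
    return hap_freqs["AB"] - f_A * f_B

def decayed(d0, r, generations):
    """Expected D after some generations with recombination fraction r."""
    return d0 * (1.0 - r) ** generations

# Two mutations on separate copies of the ancestral 'ab' haplotype:
# only three of the four combinations exist, so D is non-zero from the start.
freqs = {"AB": 0.0, "Ab": 0.25, "aB": 0.15, "ab": 0.60}
d0 = linkage_disequilibrium(freqs)

print(d0)
print(decayed(d0, 0.01, 1000))   # sites 1 cM apart: D is essentially gone
print(decayed(d0, 1e-5, 1000))   # sites inside a cold block: D barely moves
```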

> Recent studies suggest that the genome may be divided into a
> remarkably small number of blocks - just 200 000 or so. Recombination
> seems to be focused between the haplotype blocks, so large groups of
> alleles end up travelling together.
>
> Math: 3 billion total DNA bases, divided by 200 thousand blocks, yields
> an average of 15 thousand bases per block. 3 million base pairs
> different, 90% of them SNPs, gives 2.7 million SNPs, and divided by 200
> thousand blocks, yields an average of 13 or 14 SNPs per block.
> That means if each SNP is binary (two different alleles), there would
> be a potential for 2**13 to 2**14 different combinations of the
> alleles within a single block, hence a potential of appx. ten thousand
> different alleles at the block level. As I understand it, only a tiny
> fraction of those ten thousand possible combinations actually appears.
> If there has been no recombination within a single block in all the
> time since those 13 or 14 mutations occurred, then there should be 14
> or 15 block-alleles present (the original, plus one extra for each new
> mutation that occurred, regardless of whether the new mutation occurred
> in a tree of descent where earlier mutations had already occurred or
> not). Does anybody know the average number of alleles of such a
> haplotype block that has 13 or 14 SNPs within it? Is it like 20
> alleles, which means that recombination within that block has happened
> only about five times in all of human history, or more like 100
> alleles, which means that recombination within that block has happened
> about 90 times, or even more?

The complication here is that unless you sample every little village in
the world, you won't find the most recent few mutations that went in.
Also you won't see if the most recent few is actually thousands of
different very local ones because of the population explosion.
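
The block arithmetic quoted above is easy to check. A minimal sketch using only the round figures from the quoted article; the helper implied_recombinations encodes the quoted reasoning that every haplotype beyond "one per mutation plus the original" came from one recombination event, which is an assumption of that argument rather than an established result.

```python
genome_bp = 3_000_000_000    # total bases
blocks = 200_000             # haplotype blocks
differing_bp = 3_000_000     # the 0.1 per cent difference

snps = differing_bp * 90 // 100       # 90 per cent of the differences are SNPs
bp_per_block = genome_bp // blocks
snps_per_block = snps / blocks

def implied_recombinations(observed_haplotypes, n_snps):
    """Haplotypes in excess of n_snps + 1, read as recombination events."""
    return observed_haplotypes - (n_snps + 1)

print(bp_per_block, snps_per_block)
print(2 ** 13, 2 ** 14)
print(implied_recombinations(20, 14), implied_recombinations(100, 14))
```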

> Note that if at the time of the population bottleneck (or shortly
> after) only a single allele of a present-day haplotype block survived, we
> should be able to identify it. Survey all alleles of this block in
> populations around the world. If a block-allele occurs in *all* of
> them, it's probably the ancestral block-allele, whereas if it appears
> only in a few populations in just a few geographical regions and/or
> along a route of migration, then the block-allele originated due to a
> new mutation after the bottleneck and didn't have time to be
> distributed to all regions of Africa before the first migration out of
> Africa, or it originated *after* the first such migration. If more
> than one globally-distributed block-allele is found, it indicates that
> all such block-alleles were ancestral and survived through the
> bottleneck. It's remotely possible that more than one block-allele
> survived the bottleneck and for a while after, but then eventually
> drifted to fixation in all the world's populations, but that's highly
> unlikely. So if we identify only a single allele of a particular
> block, we can assume it probably was the only survivor of the
> bottleneck.

As I tried to say above, a single allele doesn't make a case of
polymorphism. Also a large part of the non-junk genome is highly
conserved, but capable of carrying SNPs. Certainly universal
distribution of an allele would strongly suggest it is ancient.
Chances are however the blocks have all undergone changes to neutral
sites, and each such change is not globally distributed geographically.
I assume it is possible to construct a genealogical tree explaining
the currently extant alleles of a particular block based on shortest
mutational distance from a single common ancestor (which probably will
not still be around, and so whose nature has to be hypothesised). This
technique can however give multiple solutions. The bottleneck
hypothesis is supported if there aren't multiple solutions, and the
root isn't too far back in the past given some assumption about
mutation rate.

> If we find a lot of such single-bottleneck-allele blocks
> and hardly any multiple-bottleneck-allele blocks, it would indicate a
> very small population through the bottleneck, a really severe
> bottleneck for sure! On the other hand if we discover that lots of
> multi-allele blocks survived the bottleneck, then maybe the bottleneck
> was only mild, that a decent-sized population existed at all times
> during the bottleneck.

That would be very strange, because the original reason for supposing
such a bottleneck was that human mitochondria have a recent common
ancestor. Your decent sized population would therefore have to be one
woman and many men - either that or they kept very complicated records
to enable them to breed all but one line of mitochondria to extinction.
I think we can take it that there was a bottleneck. The question is
whether it helps or hinders deducing whether haplotype blocks are a
reality.

> Therefore, rather than the
> millions of allele combinations potentially available, the human
> population seems to be made up of a more limited set of haplotype
> patterns.
>
> That math is not correct. With 2.7 million SNPs, if they were
> rearranged in all possible combinations, assuming each has only two
> alleles, the total number of combination alleles would be
> **HOLD YOUR BREATH** 2 ** (2.7 million) =appx= 10 ** (800 thousand)
> not merely "millions" as implied above, not a billion, not a
> quadrillion, not even as small as a GOOGOL, more like a GOOGOL to the
> eight-thousandth power!!!!!!!!

It makes you realise that even if there were never any mutations, there
is a near infinite fount of variability in sexually reproducing
organisms. Despite that, there are bounds. You couldn't for example
get from fur to feathers.
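
The size of that number is easiest to see in logarithms. A quick check in Python, assuming (as the quote does) 2.7 million independent two-allele SNPs; working with log10 avoids ever building the number itself.

```python
import math

snps = 2_700_000

log10_combos = snps * math.log10(2)   # exponent of 10 in 2 ** snps
googol_power = log10_combos / 100     # a googol is 10 ** 100
log10_people = math.log10(7e9)        # the 7 billion figure used in the quote

print(round(log10_combos), round(googol_power), round(log10_people, 1))
```

So the count is about 10 to the power 800 thousand, i.e. a googol raised to a power of roughly eight thousand, set against a population whose size is about 10 to the power 10.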

> But there are only 7 billion total people on Earth, so it'd be
> impossible to have more than an infinitesimal number of different
> combinations represented in our population. The 7 billion population
> would be the bottleneck, reducing the total number of combinations to
> no more than 14 billion (two copies of each combination in each
> individual, if you somehow decide which of the two copies of each
> chromosome is the first copy so you can think of all first-copies of
> all the chromosomes as one allele-combination). It's probably better
> to treat both copies of chromosomes together, so then there is only one
> combination of alleles per individual, 7 billion total, out of
> GOOGOL to the sixteen-thousandth possible combinations.
>
> Now if we break the genome into arbitrary blocks of size 15 thousand
> bases each, and calculate number of possible combinations of only the
> SNPs located within a single block, we get 2 ** (13 or 14) which is the
> ten thousand possible allelles of each block that I calculated earlier.
> So even if they said it wrong, "millions of combinations" isn't correct.
>
> This haplotype arrangement appears to be similar in all the different
> populations around the world, suggesting that many of them represent
> ancestral haplotypes that existed in the earliest humans.
>
> Suppose the bottleneck occurred a million years ago, and the first
> major migration out of Africa occurred only 30 thousand years ago. Then
> we would expect only about 3% of the mutations to have occurred since
> the first migration, the remaining 97% occurred while everyone was
> still in Africa. But only a tiny portion of East Africa was occupied by
> humans for most of that time, so that small region might have been
> homogenized by intermarriage between different local groups many times
> throughout most of the pre-migration period, before the population
> started to spread through a larger portion of Africa (just prior to the
> first out-of-Africa migration) and got too spread around for the
> different locales to exchange genes with each other. So perhaps 90% of
> the mutations occurred before the spreading apart occurred, likewise
> 90% of the breakages of old blocks to make two new smaller blocks
> occurred during that early homogeneous time. So 90% of the block
> structure would be shared across all human populations, with only 10%
> of block structure appearing in just part of the human population. So
> is that basically what they're saying?

I'm starting to think they were weaseling because they didn't know.

> The map will be based on DNA samples obtained from hundreds of people
> in geographically distinct populations: Nigerian Yorubas, Han Chinese,
> Japanese, and US residents of European origin.
>
> (That seems to have been the original plan. Apparently they later
> decided to survey only a small local population of Utah residents for
> the US sample. And for Japan, they picked a group in Tokyo.)
>
> A good way to think about haplotypes and haplotype blocks is to
> imagine the SNP alleles as children sharing school minibuses.
>
> Oh boy, I'm not the master of metaphor after all!
>
> By the way, I thought I invented the term "haplotype block" myself,
> after reading the article in _Science_, because when I started using the
> term here people said they had no idea what I meant. But I see the term
> was already in use by the HapMap founders before the project even
> started, so I claim innocence of the coining.
>
> Additional note: Very few DNA-base-pair-neighbors (appx. 200 thousand,
> out of 3.1 billion total) have crossed over even once since the
> population bottleneck, most (99.99%) haven't crossed even once during
> all that time, yet some very "hot" places have crossed many many times
> during the same time. Is that right, or did I miss something?

Dunno. And I'm getting innocenter by the minute.

> Does anybody know where I might find online the statistics of hotness
> of all the known DNA-base-pair-neighbors?
>
> More thoughts: During a time of stasis, when the local population stays
> the same, each couple on the average having two offspring that survive
> to the same point in the generation-cycle, when a neutral mutation
> occurs, generating a new allele of that locus, the average number of
> copies of that new allele remains at 1, which means that it's very
> likely to drop to 0 i.e. go extinct. As a result, very very few new
> alleles at such times would survive to the present. On the other hand,
> during a time of local population growth, a new allele would have a
> good chance of quickly attaining a reasonable number of copies, after
> which it'd have a decent chance of surviving over long term. Eventually
> it'd have p chance of surviving, where p is the proportion of
> individuals with that allele, which is 1/2n where n is the population
> size at the time of the mutation in a single individual. But there
> hasn't been enough time for drift to play out like that, so there'd
> still be a few copies remaining almost for sure. As a result of that
> analysis, the SNPs we see today should be of two types, those which
> appeared very very recently, like in the current individual or
> immediate parents, so there hasn't been time for them to disappear by
> accident, and those which appeared during a time of population growth.
> But *now* *is* a time of population growth, so the population-growth
> case includes the very-recent case, so we need consider only that one
> case.
>
> Now by measuring the alleles of SNPs in lots of different people, we
> should be able to observe how widely distributed they are in various
> populations in different geographic areas, and from that information we
> should be able to estimate when each SNP first occurred due to point
> mutation. Likewise by mapping the haplotype blocks, in particular by
> measuring how widely distributed crossing-overs at boundaries between
> adjacent blocks are compared to inheritance of the same ancestral pair
> of adjacent blocks, we may be able to estimate how long ago each
> crossing-over between two adjacent haplotype blocks occurred. Just like
> SNPs, new combinations of haplotype blocks due to crossing between
> blocks would tend to disappear except when such cross-overs occurred
> during times of population growth.

Tending to disappear? It seems that way, but the population does a
random walk, which has no 'tendency'.
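
Both statements can be true at once: the expected number of copies stays at 1 while the chance that any copy is left keeps shrinking. A rough sketch, assuming each copy leaves a Poisson-distributed number of copies in the next generation (a modelling assumption of mine, as are the function name and the growth figure of 1.5 copies per generation):

```python
import math

def survival_probability(mean_offspring, generations):
    """Chance that a single new copy still has descendants after the given
    number of generations, in a branching process with Poisson offspring."""
    q = 0.0  # probability the lineage is extinct by generation 0
    for _ in range(generations):
        # Extinct by the next generation if every child lineage dies out:
        # the Poisson generating function evaluated at q.
        q = math.exp(mean_offspring * (q - 1.0))
    return 1.0 - q

stasis = survival_probability(1.0, 100)   # stable population
growth = survival_probability(1.5, 100)   # growing population

print(round(stasis, 4), round(growth, 4))
```

Under these assumptions a new neutral variant in a stable population is very likely gone within a hundred generations, while in a growing population it has better than even odds of still being around.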

> I predict that when we calculate
> statistics of age of neutral SNPs and adjacent-haplotype-block-pair
> cross-overs, we'll get the same age-distribution within any single
> local population group, which will clearly show when various spurts of
> population growth occurred, either globally or within local
> populations.
>
> Furthermore, if we collect haplotype data for each living human, we may
> be able to use both SNP-origin ages and block-boundary cross-over ages
> to compute a complete pedigree for each such human, that is a complete
> family tree for all of Homo sapiens, going back all the way to the
> population bottleneck!! Should I explain how I believe this would be
> possible, what specific calculational methodology would accomplish that
> task/goal, or can you-all figure it out for yourselves after I have now
> spurred you with the idea?

This might show everything but that which is most of interest - the big
evolutionary innovations. The extant gene pool contains historical
information about changes that did *not* encounter strong selection.

Strong selection, and the past is obliterated.

> (Final note: I was thinking maybe this one article was getting too
> long, so I might split it into parts. But it's only 21k bytes, so what
> the heck, I'll keep it intact and see if the auto-moderator accepts
> it.)
> .

It wasn't truncated.

Nic
