
Definition 3.4:3 (Heterozygous effect s, after Gillespie 1998) The heterozygous effect h, given by

3.5. The coalescent model

Consider a sample of DNA sequences from a locus with no recombination. Looking backward in time, a single sequence that is the ancestor of all these sequences will eventually be found. The ancestral relationship creates a phylogeny of these sequences, also referred to as a gene genealogy or simply a genealogy, and defines the notion of a coalescent.

Namely, a coalescent is the lineage of alleles in a sample traced backward in time to the allele which is the most recent common ancestor (MRCA) of the whole sample (Fig. 1). When two sequences coalesce, i.e. when the number of lineages in the coalescent is reduced by one, this is called a coalescent event.

Definition 3.5:1 (Coalescent time, after Ewens 2003)

The number of generations between successive coalescent events is called the coalescent time. More specifically, the length of the period during which there were n ancestral alleles (sequences) is called n-coalescent time and denoted by Tn. This period is sometimes referred to as the state n of the coalescent process.

[Figure: genealogy of four sequences; labels in the figure: T2, T3, T4, Divergence, Coalescence]

Fig. 3.5:1. The coalescent of four sequences

The Wright-Fisher model implies that at any generation two randomly selected sequences can have the same ancestral sequence in the previous generation. Therefore, when a coalescent event occurs, it is between two randomly selected sequences. Following the process until the MRCA of the whole sample is found creates the ancestral relationship among the sequences (the genealogy), which is essentially a random tree. Note, however, that not every genealogy has the same probability. The branches of such a tree are classified as internal or external: an external branch connects directly to a sequence in the sample, and any other branch is internal.

A random tree of gene genealogy can also be generated by a top-down approach. Starting with the MRCA of the whole sample and splitting it into two descendant lineages creates the first divergence event (see Fig. 1). Then, by randomly picking one of the lineages and splitting it into two lineages, the second divergence event is modeled. Repeating this process until there are n lineages leads to the genealogy of n sequences in a sample. Remarkably, this top-down generation leads to a random tree with the same statistical properties as the one generated by coalescence (Fu, 2003). The top-down generation can also be applied to compute the probability of a given genealogy. Tajima (1983) proved that the probability P of a genealogy of n sequences with s branching points that lead to exactly two descendant sequences in the sample is given by

$$P = \frac{2^{\,n-1-s}}{(n-1)!}. \tag{3.5:1}$$
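The top-down construction can be checked numerically against Tajima's formula P = 2^(n-1-s) / (n-1)!. The following is a minimal sketch, not taken from the cited works: the function names and the nested-list tree representation are illustrative assumptions. It uses n = 4, where each possible value of s (1 or 2) identifies exactly one genealogy, so the simulated frequency of s can be compared with P directly. For larger n several genealogies share the same s, and the comparison would have to be made per genealogy.

```python
import random
from collections import Counter
from fractions import Fraction
from math import factorial


def tajima_probability(n, s):
    """P = 2^(n-1-s) / (n-1)! for a genealogy of n sequences with s
    branching points that lead to exactly two sampled sequences."""
    return Fraction(2 ** (n - 1 - s), factorial(n - 1))


def count_cherries(node):
    """Count branching points whose two descendants are both sampled sequences."""
    if not node:                      # an empty list represents a sampled sequence
        return 0
    left, right = node
    if not left and not right:
        return 1
    return count_cherries(left) + count_cherries(right)


def simulate_s(n, rng):
    """Grow a genealogy top-down (n >= 2) by splitting a randomly picked lineage."""
    root = [[], []]                   # first divergence event: MRCA -> two lineages
    leaves = [root[0], root[1]]
    while len(leaves) < n:
        lineage = leaves.pop(rng.randrange(len(leaves)))
        left, right = [], []
        lineage.extend([left, right])  # the picked lineage becomes a branching point
        leaves.extend([left, right])
    return count_cherries(root)


if __name__ == "__main__":
    rng = random.Random(0)
    runs = 20000
    counts = Counter(simulate_s(4, rng) for _ in range(runs))
    for s in sorted(counts):
        # simulated frequency next to the value given by the formula
        print(s, counts[s] / runs, float(tajima_probability(4, s)))
```

For n = 4 the formula gives 2/3 for s = 1 and 1/3 for s = 2, and the simulated frequencies should fluctuate around these values.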

The description of the coalescent time distributions starts with the Wright-Fisher model for the smallest sample exhibiting the effects of genetic drift, i.e. a sample composed of two chromosomes. The model assumes a population of haploid individuals (for example mtDNA sequences), which at time t ≥ 0 has size Nt. Since multinomial sampling from a given generation's gene pool is assumed, two individuals at generation t + 1 are descendants of a single member of generation t with probability pt = 1/Nt. Consequently, with probability qt = 1 – pt they are descendants of two different members.

This is reflected in the following distribution of the time to coalescence T2c of two randomly drawn chromosomes in a population with variable size Nt (Bobrowski and Kimmel 2004)

[Equation (3.5:2): distribution of T2c conditional on the population sizes Nt]

within a sample corresponds to the expected value of the coalescence time T2c in the model.

Moreover, the discrete nature of generations makes it easy to simulate the demography of the model. Therefore, using Monte-Carlo techniques, it is possible to estimate the unconditional coalescence distribution by averaging the distributions given by (2), each of which is conditional on a realization of Nt.
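The Monte-Carlo averaging described above can be sketched as follows. This is an illustration under stated assumptions rather than the method of Bobrowski and Kimmel (2004): the conditional distribution is assumed to be the product form implied by pt = 1/Nt and qt = 1 – pt (no coalescence in the first k – 1 generations back, then coalescence in the k-th), and the demography sampler `random_demography`, with independent uniformly distributed sizes, is purely hypothetical.

```python
import random


def coalescence_pmf(sizes_back):
    """Return [P(T2c = 1), P(T2c = 2), ...] for one demography.

    Assumed indexing: sizes_back[k - 1] is the population size of the
    generation k steps back from the sampled generation.
    """
    pmf, not_yet = [], 1.0
    for size in sizes_back:
        p = 1.0 / size                # p_t = 1/N_t: same parent
        pmf.append(not_yet * p)
        not_yet *= 1.0 - p            # q_t = 1 - p_t: different parents
    return pmf


def random_demography(generations, rng, low=50, high=150):
    """Hypothetical demography: independent uniform sizes in [low, high]."""
    return [rng.randint(low, high) for _ in range(generations)]


def monte_carlo_pmf(generations, runs, rng):
    """Average the conditional distributions over simulated demographies."""
    total = [0.0] * generations
    for _ in range(runs):
        pmf = coalescence_pmf(random_demography(generations, rng))
        total = [a + b for a, b in zip(total, pmf)]
    return [x / runs for x in total]


if __name__ == "__main__":
    averaged = monte_carlo_pmf(generations=500, runs=200, rng=random.Random(0))
    # probability of coalescence within the simulated horizon
    print(sum(averaged))
```

Any other demographic model can be substituted for `random_demography`, as long as it returns one realization of the sizes Nt per call.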

If the population has a constant size, the expected value of the coalescent time can be derived algebraically. When necessary, the actual fluctuating population size can be approximated by the long-term inbreeding effective population size Ne given by equation (3.2:22) in Theorem 3.2:1. Then the probability p2 that two randomly selected sequences come from a single ancestral sequence in the previous generation is (Fu 2003)

$$p_2 = \frac{1}{2N_e}, \tag{3.5:3}$$

and the probability q2 that they come from two different ancestral sequences is

$$q_2 = 1 - p_2 = 1 - \frac{1}{2N_e}. \tag{3.5:4}$$

Given that these sequences come from different parent sequences at generation t – 1, the probability that they still come from different ancestral sequences at generation t is also equal to q2, and the probability that they coalesce is p2. Therefore, the probability P(T2 = t) that the two sequences come from a single ancestral sequence T2 = t generations ago is

$$P(T_2 = t) = q_2^{\,t-1}\, p_2 = \left(1 - \frac{1}{2N_e}\right)^{t-1} \frac{1}{2N_e}. \tag{3.5:5}$$

Theorem 3.5:1 (Coalescent time for two chromosomes, after Fu 2003)

The coalescent time T2 , i.e. the waiting time until the next coalescent event occurs between two sequences, has the following properties

$$E(T_2) = 2N_e, \qquad \mathrm{Var}(T_2) \approx (2N_e)^2. \tag{3.5:6}$$

Proof

Formula (5) specifies the probability distribution of T2. It follows that the coalescent time T2 has a geometric distribution with probability of success p2 = 1/(2Ne). Since the mean of the geometric distribution is the reciprocal of the probability of success, the first part of equation (6) follows. From the properties of the geometric distribution, and taking into account that 2Ne >> 1, it follows that the variance of the coalescent time T2 is approximately given by

$$\mathrm{Var}(T_2) = \frac{1 - p_2}{p_2^{2}} = 2N_e(2N_e - 1) \approx (2N_e)^2,$$

which ends the proof.
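The mean and variance of T2 can also be checked by simulating the ancestry of two sequences directly. The sketch below is an illustration only: the function name `simulate_t2` is an assumption, and the population is modeled as 2Ne gene copies from which each lineage picks its parent uniformly at random in every generation.

```python
import random
from statistics import mean, pvariance


def simulate_t2(ne, rng):
    """Number of generations back until two lineages share a parent copy."""
    copies = 2 * ne
    t = 0
    while True:
        t += 1
        # each lineage picks its parent among the 2Ne copies; they pick
        # the same one with probability 1/(2Ne)
        if rng.randrange(copies) == rng.randrange(copies):
            return t


if __name__ == "__main__":
    ne = 10
    rng = random.Random(0)
    times = [simulate_t2(ne, rng) for _ in range(20000)]
    print("mean    :", mean(times), "expected about", 2 * ne)
    print("variance:", pvariance(times), "expected about", (2 * ne) ** 2)
```

A small Ne is used only to keep the run short. With 2Ne = 20 the exact geometric variance 2Ne(2Ne – 1) = 380 is still visibly below the approximation (2Ne)^2 = 400, and the gap shrinks in relative terms as Ne grows.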

Note that e^(–x) ≈ 1 – x when x is small. Since 1/(2Ne) is quite small in natural populations, the distribution of T2 given by (5) can be approximated by an exponential distribution whose probability density function f(t) satisfies (Fu 2003)

$$f(t) = \frac{1}{2N_e}\, e^{-t/(2N_e)}.$$
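The quality of the exponential approximation can be inspected by evaluating both expressions side by side. The following sketch compares the geometric probability (1 – 1/(2Ne))^(t–1) · 1/(2Ne) with the exponential density (1/(2Ne)) · e^(–t/(2Ne)); the function names and the chosen values of Ne and t are illustrative assumptions.

```python
import math


def geometric_pmf(t, ne):
    """P(T2 = t) for the geometric distribution with p2 = 1/(2Ne)."""
    p = 1.0 / (2 * ne)
    return (1.0 - p) ** (t - 1) * p


def exponential_density(t, ne):
    """Exponential density with rate 1/(2Ne), evaluated at t."""
    p = 1.0 / (2 * ne)
    return p * math.exp(-p * t)


if __name__ == "__main__":
    for ne in (2, 1000):
        for t in (1, 10, 100):
            g, e = geometric_pmf(t, ne), exponential_density(t, ne)
            # relative difference between the two expressions
            print(ne, t, g, e, abs(g - e) / g)
```

For Ne = 1000 the two values agree closely over the range shown, whereas for a very small population such as Ne = 2 the difference is clearly visible, which illustrates why the approximation relies on 1/(2Ne) being small.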

The distribution of the time to coalescence can also be computed for a sample composed of more than two chromosomes. Consider the genealogy of a sample of n sequences taken from a population of diploid individuals. From (4) it follows that the probability that two particular sequences from the sample do not coalesce is (2Ne – 1) / 2Ne: there are 2Ne possible ancestors for the second sequence, but only (2Ne – 1) of them differ from the ancestor of the first sequence. Similarly, the probability that the third sequence does not coalesce with either of the first two, given that these two have different ancestors, is (2Ne – 2) / 2Ne. Therefore the total probability that the first three sequences do not coalesce is the product of these two factors, and in general the probability qn that no two of the n sequences coalesce in the previous generation is

$$q_n = \prod_{i=1}^{n-1} \frac{2N_e - i}{2N_e} = \prod_{i=1}^{n-1} \left(1 - \frac{i}{2N_e}\right).$$

Expanding the product results in (Fu 2003)

$$q_n = 1 - \binom{n}{2} \frac{1}{2N_e} + O\!\left(\frac{1}{N_e^{2}}\right).$$

Therefore, the probability pn = 1 – qn that there is a coalescence among n sequences is

$$p_n \approx \binom{n}{2} \frac{1}{2N_e} = \frac{n(n-1)}{4N_e}. \tag{3.5:12}$$

The approximation given by (12) assumes that no multiple coalescences occur in one generation. It is valid when n(n – 1) << 4Ne.
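The validity condition n(n – 1) << 4Ne can be illustrated by comparing the exact one-generation probability, obtained from the product of the factors (2Ne – i) / 2Ne for i = 1, ..., n – 1, with the approximation n(n – 1) / (4Ne). This is a minimal sketch; the function names and the example values of n and Ne are assumptions chosen for illustration.

```python
def no_coalescence_probability(n, ne):
    """Exact q_n: all n sequences have distinct parents among 2Ne copies."""
    q = 1.0
    for i in range(1, n):
        q *= 1.0 - i / (2 * ne)
    return q


def coalescence_probability_approx(n, ne):
    """First-order approximation p_n = n(n - 1) / (4Ne)."""
    return n * (n - 1) / (4 * ne)


if __name__ == "__main__":
    for n, ne in ((10, 10000), (100, 1000)):
        exact = 1.0 - no_coalescence_probability(n, ne)
        approx = coalescence_probability_approx(n, ne)
        print(n, ne, exact, approx)
```

With n = 10 and Ne = 10000 the condition holds and the two values nearly coincide. With n = 100 and Ne = 1000 the quantity n(n – 1) exceeds 4Ne, and the approximation returns a value above 1, which cannot be a probability.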

Given the probability pn of a coalescence in one generation, it is possible to compute the distribution of the waiting time for the coalescent event when the coalescent is in state n, i.e. the distribution of the n-coalescent time Tn. Note that the probability that Tn = t is given by

$$P(T_n = t) = (1 - p_n)^{t-1}\, p_n.$$
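Since Tn waits for the first success in a sequence of generations, each with success probability pn, it can be sampled as a geometric random variable. The sketch below draws the coalescent times for states n, n – 1, ..., 2 using pk = k(k – 1) / (4Ne). Two points are assumptions not stated in the text above: the times for different states are drawn independently (the usual coalescent assumption), and their sum is interpreted as the time to the MRCA of the sample. The function names are illustrative, and the code requires k(k – 1) < 4Ne so that pk is a valid probability.

```python
import math
import random
from statistics import mean


def sample_geometric(p, rng):
    """Inverse-transform sample from P(T = t) = (1 - p)^(t - 1) p, t = 1, 2, ..."""
    u = rng.random()                       # uniform on [0, 1)
    return int(math.log(1.0 - u) / math.log(1.0 - p)) + 1


def sample_coalescent_times(n, ne, rng):
    """Return {k: T_k} for k = n, ..., 2, assuming independent states."""
    times = {}
    for k in range(n, 1, -1):
        p_k = k * (k - 1) / (4 * ne)       # one-generation coalescence probability
        times[k] = sample_geometric(p_k, rng)
    return times


if __name__ == "__main__":
    n, ne = 4, 100
    rng = random.Random(0)
    draws = [sample_coalescent_times(n, ne, rng) for _ in range(20000)]
    for k in range(n, 1, -1):
        # the mean of a geometric variable is 1/p_k = 4Ne / (k(k - 1))
        print(k, mean(d[k] for d in draws), 4 * ne / (k * (k - 1)))
    print("mean time to MRCA:", mean(sum(d.values()) for d in draws))
```

The printed sample means can be compared with 1/pk, the mean of a geometric distribution with success probability pk.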

Theorem 3.5:2 (Coalescent time for n chromosomes, after Fu 2003)