
APPLICATIONS IN EVOLUTIONARY GENETICS

4. THEORY OF NEUTRAL EVOLUTION 1. Foundations

4.2. Neutrality tests

As mentioned in section 4.1, testing for natural selection operating at the molecular level has become one of the important issues in contemporary bioinformatics. Such research relies on the development of neutrality tests, that is, statistics used to test null hypotheses derived from the predictions of the neutral model of evolution. These tests are often used to search for signatures of natural selection in genes, as presented in section 4.3.

There are two general types of tests of natural selection at the molecular level. The first type can be applied when the data consist of entire or partial coding sequences of a gene. In that case, comparing the frequencies of silent substitutions at the third codon position with the frequencies of substitutions at the first and second positions provides a handle on selective pressure. This approach was used in the study that led to the detection of perhaps the most spectacular example of natural selection, found at the ASPM locus, a major contributor to brain size regulation in primates (for more information about the evolution of ASPM see Evans et al. 2004, Zhang 2003).

In many cases, however, we have to deal with another type of data, consisting of sequences that are not only non-coding but also composed of nucleotides located at a considerable distance from each other. In such cases, a model of neutral evolution of the sequence has to be determined and its predictions compared with the data. Usually, this model is some modification of the Wright-Fisher model of genetic drift with mutation (Hartl and Clark 1997, Jobling et al. 2004). A significant departure from the predictions under neutrality (which serves as the null hypothesis) may provide evidence for selection (the alternative hypothesis of interest). However, other alternatives may also cause departures from the null and be confused with selection. Examples include population substructure and past changes in population size, as reviewed by Nielsen (2001). One common way to deal with this problem is to apply a number of tests, each sensitive to a different combination of factors, and to compare the results. Substructure, for example, can be approached by considering data from different subpopulations separately or by comparing the test results among loci. Another approach, presented in section 4.3.2, is based on formulating null hypotheses that assume population substructure. In this way, if proper critical values of the test are determined, substructure will not cause false positive test results.

Any analysis of SNP data aimed at detecting natural selection operating at some loci in the human population has to take into consideration the alternative departures from neutrality that can produce similar test outcomes. The alternatives that are feasible from the point of view of human population evolution are population growth and geographic substructure with migration. Section 4.3 shows how to deal with this problem by analyzing a battery of statistical tests that indicate the age of the predominant mutations, and how this information can be used to exclude the undesirable alternatives. The present section defines these neutrality tests.

Tests that indicate the age of alleles present in excess of the amount predicted under the neutral evolution model are based on the difference between different estimates of the composite parameter $\theta = 4N\mu$ ($N$ denotes the effective population size and $\mu$ is the mutation rate per nucleotide per generation). Such tests are Fu's tests belonging to the class $F'(r, r')$ (Fu 1997):

$$F'(r, r') = \frac{L'(r) - L'(r')}{\sqrt{\mathrm{Var}\left[L'(r) - L'(r')\right]}} \tag{4.2:1}$$

where $L'$ are estimates of the composite parameter $\theta$ in the form of linear functions of the $\eta_i$ (the numbers of segregating sites of type $i$, where $i = 1, 2, \ldots, n/2$ and $n$ is the sample size). The parameter of the function $L'$ determines how strongly rare alleles influence the estimate of $\theta$: more strongly for larger values, less strongly for smaller ones. Therefore, $\hat{\theta} = L'(0)$ is less influenced by rare alleles than $\hat{\theta} = L'(1)$. The class defined above covers many known tests, such as Tajima's (1989) test $T$ (for uniformity, we follow the nomenclature of Fu (1997), Wall (1999), and some other papers, although originally Tajima's test was named $D$), Fu and Li's (1993) test $D^*$, and Fu and Li's (1993) test $F^*$.

Definition 4.2:1 (Tajima test T, after Tajima 1989)

Tajima's $T$ test, which is the most widely used neutrality test (McVean 2002), is equivalent to $F'(0, 1)$. Other tests of the $F'(r, r')$ class include Fu and Li's test $D^*$ ($D^* = F'(1, \infty)$, so the test is sensitive to the existence of very rare alleles) and Fu and Li's test $F^*$. Since $F^* = F'(0, \infty)$, it should also be able to detect an excess of very rare alleles, presumably with greater power than $D^*$ because of a more extreme value of the first parameter in the function $F'$.
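To make the idea of contrasting two estimates of $\theta$ concrete, the following Python sketch evaluates Tajima's statistic in its standard published form (Tajima 1989), which compares the mean number of pairwise differences with the estimate based on the number of segregating sites. The function name `tajima_t`, the input format (a list of aligned, equal-length strings) and the toy alignments are assumptions made for this illustration only. They are not taken from the text above.

```python
from itertools import combinations
from math import sqrt


def tajima_t(seqs):
    """Tajima's (1989) statistic for aligned, equal-length sequences.

    Hypothetical helper for illustration; assumes no missing data.
    """
    n = len(seqs)
    if n < 4:
        # For n < 4 the variance constants below vanish.
        raise ValueError("need at least 4 sequences")

    # Number of segregating (polymorphic) sites.
    s = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if s == 0:
        return float("nan")  # statistic is undefined without polymorphism

    # Mean number of pairwise differences.
    n_pairs = n * (n - 1) / 2
    pi = sum(
        sum(a != b for a, b in zip(x, y)) for x, y in combinations(seqs, 2)
    ) / n_pairs

    # Constants from Tajima (1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    # Difference of the two estimates, scaled by its estimated std. deviation.
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Toy data: two singleton sites versus two intermediate-frequency sites.
rare = ["TA", "AT", "AA", "AA"]
common = ["TT", "TT", "AA", "AA"]
print(tajima_t(rare), tajima_t(common))
```

In this sketch an excess of rare variants gives a negative value and an excess of intermediate-frequency variants gives a positive one, which is the sense in which such tests carry information about the age of alleles.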

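Definition 4.2:2 (Fu and Li tests, after Fu and Li 1993)

Fu and Li's tests $D^*$ and $F^*$ are defined as

$$D^* = \frac{\frac{n}{n-1}\,\eta - a_n \eta_s}{\sqrt{u_{D^*}\,\eta + v_{D^*}\,\eta^2}}, \qquad F^* = \frac{\pi_n - \frac{n-1}{n}\,\eta_s}{\sqrt{u_{F^*}\,\eta + v_{F^*}\,\eta^2}}$$

where $\eta$ is the total number of mutations that occurred in the entire genealogy of $n$ genes, and $\eta_s$ is the number of singletons, i.e. nucleotides that appear only once at the site among the sequences in the sample. Following Fu and Li (1993), $a_n = \sum_{i=1}^{n-1} 1/i$ and $\pi_n$ is the mean number of nucleotide differences between two sequences. For mathematical definitions of the coefficients $u_{D^*}$, $v_{D^*}$, $u_{F^*}$ and $v_{F^*}$ (which are complicated functions of the parameter $n$ only) see Fu and Li (1993).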

Another category of tests is based on estimates of the probability of observing no more, or no fewer, than the observed number $k$ of haplotypes in a sample of $n$ sequences, assuming neutrality and no intra-locus recombination. This category includes Strobeck's test $S$ and Fu's test $F_s$, which are defined below.

Definition 4.2:3 (Strobeck's test S, after Cyran, Polańska, and Kimmel 2004)

Strobeck's test $S$ is defined as the estimate of the probability of having no more than the observed number $k$ of haplotypes in a sample, and it is given by

$$S = P(K \le k) = \frac{1}{S_n(\hat{\theta})} \sum_{i=1}^{k} \left|S_n^i\right| \hat{\theta}^{\,i}$$

where $S_n(\hat{\theta})$ denotes the generating function of the (unsigned) Stirling numbers of the first kind $S_n^i$, i.e.

$$S_n(\hat{\theta}) = \sum_{i=1}^{n} \left|S_n^i\right| \hat{\theta}^{\,i} = \hat{\theta}(\hat{\theta}+1)\cdots(\hat{\theta}+n-1)$$

Definition 4.2:4 (Fu's test Fs, after Cyran, Polańska, and Kimmel 2004)

Fu's test $F_s$ is given by:

$$F_s = \ln\frac{S'}{1 - S'}$$

where $S'$ is the estimate of the probability of having no fewer than the observed number $k$ of haplotypes in the sample.

[…] neutrality and no recombination, provides expected frequencies of haplotypes existing in a given number of copies (Hartl and Clark 1997). Therefore, it serves as a convenient reference for testing deviations from neutrality. It is used, for example, in Strobeck's test (see above).
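The two haplotype-number tests can be sketched numerically as follows. The sketch assumes the standard form of the Ewens sampling formula, in which the probability of $k$ haplotypes in a sample of $n$ sequences is proportional to the unsigned Stirling number of the first kind times $\theta^k$, and Fu's (1997) definition $F_s = \ln(S'/(1-S'))$. The function names and the toy parameter values are hypothetical, and the direct polynomial expansion is only suitable for moderate sample sizes.

```python
from math import log


def haplotype_number_pmf(n, theta):
    """Return a list p with p[k] = P(K = k) for k = 0..n (p[0] is 0).

    Assumes the Ewens sampling formula (neutrality, no recombination).
    """
    # coeffs[k] is the unsigned Stirling number of the first kind |S_n^k|,
    # obtained by expanding theta * (theta + 1) * ... * (theta + n - 1).
    coeffs = [0, 1]  # the polynomial "theta"
    for m in range(1, n):
        new = [0] * (len(coeffs) + 1)
        for k, c in enumerate(coeffs):
            new[k + 1] += c  # contribution of the factor's theta term
            new[k] += m * c  # contribution of the factor's constant m
        coeffs = new
    weights = [c * theta**k for k, c in enumerate(coeffs)]
    total = sum(weights)  # equals the generating function S_n(theta)
    return [w / total for w in weights]


def strobeck_s(n, theta, k_obs):
    """Probability of observing no more than k_obs haplotypes."""
    pmf = haplotype_number_pmf(n, theta)
    return sum(pmf[1:k_obs + 1])


def fu_fs(n, theta, k_obs):
    """Logit of the probability of observing no fewer than k_obs haplotypes."""
    if k_obs <= 1:
        return float("inf")  # S' = 1, so the logit diverges
    pmf = haplotype_number_pmf(n, theta)
    s_prime = sum(pmf[k_obs:])
    return log(s_prime / (1.0 - s_prime))


# Toy example: n = 3 sequences, theta = 1, two haplotypes observed.
print(strobeck_s(3, 1.0, 2), fu_fs(3, 1.0, 2))
```

In practice the value of $\theta$ passed to these functions would be an estimate obtained from the data. Which estimator is used is left open here.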

However, it is even more convenient to use coalescent simulations based on the IAM to compute a large sample of simulated distributions of variants. The value of the composite parameter $\theta$ is estimated from the haplotype sample, using the IAM-based expression for the expected total number $K$ of variants in a sample of $n$ sequences:

$$E(K) = \sum_{i=0}^{n-1} \frac{\hat{\theta}}{\hat{\theta} + i}$$

and comparing it with the observed number of different haplotypes. Simulated distributions of variants are then compared with the observed frequencies. Technically, to facilitate visual comparison, empirical and simulated cumulative counts $A(j)$ of haplotype variants existing in no more than $j$ copies in the sample of $n$ sequences ($j = 1, \ldots, n$) are compared. In addition, both the horizontal axis (the number $j$ of copies of a variant) and the vertical axis (the cumulative count $A(j)$ of variants existing in no more than $j$ copies) are standardized to the unit interval, by dividing by $n$ and $K$, respectively. The resulting graphs allow a visual comparison of the empirical distribution of variants (thick line) with multiple simulated distributions (thin lines), as presented for illustration in Fig. 1 and Fig. 2 for actual SNP data.
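The procedure described above can be sketched in Python as follows. The sketch assumes the standard IAM expectation $E(K) = \sum_{i=0}^{n-1} \theta/(\theta+i)$ and estimates $\theta$ by solving $E(K) = k_{obs}$ with bisection. For the simulation step it substitutes the Hoppe urn scheme for an explicit coalescent simulation, since both generate variant counts from the same sampling distribution under the IAM. All function names, the bisection bounds and the example parameters are assumptions made for this illustration.

```python
import random


def expected_variants(theta, n):
    """Expected number of distinct variants among n sequences under the IAM."""
    return sum(theta / (theta + i) for i in range(n))


def estimate_theta(k_obs, n, lo=1e-9, hi=1e6):
    """Solve E(K) = k_obs for theta by bisection (needs 1 < k_obs < n)."""
    if not 1 < k_obs < n:
        raise ValueError("k_obs must lie strictly between 1 and n")
    for _ in range(200):  # E(K) increases monotonically with theta
        mid = (lo + hi) / 2.0
        if expected_variants(mid, n) < k_obs:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def simulate_variant_counts(theta, n, rng):
    """One Hoppe-urn draw: copy numbers of the variants in a sample of n."""
    counts = []
    for i in range(n):
        if rng.random() < theta / (theta + i):
            counts.append(1)  # a new variant appears
        else:
            # copy an existing variant, chosen proportionally to its count
            j = rng.choices(range(len(counts)), weights=counts)[0]
            counts[j] += 1
    return counts


def standardized_cumulative(counts):
    """Return (j/n, A(j)/K) for j = 1..n, A(j) = variants with <= j copies."""
    n, k = sum(counts), len(counts)
    xs = [j / n for j in range(1, n + 1)]
    ys = [sum(1 for c in counts if c <= j) / k for j in range(1, n + 1)]
    return xs, ys


rng = random.Random(0)  # fixed seed so the sketch is reproducible
theta_hat = estimate_theta(k_obs=5, n=20)
curves = [
    standardized_cumulative(simulate_variant_counts(theta_hat, 20, rng))
    for _ in range(100)
]
```

Each element of `curves` corresponds to one thin line in the kind of graph described above. The empirical thick line would be obtained by passing the observed copy numbers to `standardized_cumulative`.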

Definition 4.2:5 (Kelly's test, after Kelly 1997)

Kelly's test $Z_{nS}$ is defined as the average (over all pairs $i, j$ of $K$ segregating sites) of the squared correlation of allelic identity $\delta_{ij}$ between sites $i$ and $j$

$$Z_{nS} = \frac{2}{K(K-1)} \sum_{i=1}^{K-1} \sum_{j=i+1}^{K} \delta_{ij}. \tag{4.2:8}$$

If the goal is eventually to determine the type of selection, Kelly's (1997) $Z_{nS}$ test should at first be set aside, as it produces similar, inflated patterns both for selective sweeps with recombination and for balancing selection. Note, however, that it is valuable to apply the $Z_{nS}$ test after one of these possibilities has been excluded based on the results of the tests given by Definitions 4.2:1 to 4.2:4. This is because the test is reported to have high power and can therefore be used to verify previously obtained results.

The squared correlation of allelic identity $\delta_{ij}$ is a standardized (that is, ranging from 0 to 1) measure of linkage disequilibrium $D_{ij}$ between loci $i$ and $j$. It is given by:

$$\delta_{ij} = \frac{D_{ij}^2}{p_i (1 - p_i)\, p_j (1 - p_j)}. \tag{4.2:9}$$

In the above formula $D_{ij} = p_{ij} - p_i p_j$, where $p_i$ and $p_j$ are the frequencies of mutant alleles at loci $i$ and $j$ respectively, whereas $p_{ij}$ is the frequency of sequences that have mutant alleles at both loci.
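Equations (4.2:8) and (4.2:9) translate directly into a short computation. The Python sketch below assumes that the sample is encoded as a list of haplotypes, each a sequence of 0/1 values with 1 marking the mutant allele, and that every column is a segregating site. The function name `kelly_zns` and this encoding are assumptions of the example.

```python
from itertools import combinations


def kelly_zns(haplotypes):
    """Average squared correlation of allelic identity over all site pairs.

    haplotypes: list of equal-length 0/1 sequences (1 = mutant allele);
    every column is assumed to be a segregating site.
    """
    n = len(haplotypes)
    k = len(haplotypes[0])  # number of segregating sites
    if k < 2:
        raise ValueError("at least two segregating sites are required")

    # Mutant allele frequency at each site.
    freqs = [sum(h[i] for h in haplotypes) / n for i in range(k)]
    if any(p <= 0.0 or p >= 1.0 for p in freqs):
        raise ValueError("every site must be polymorphic in the sample")

    total = 0.0
    for i, j in combinations(range(k), 2):
        p_i, p_j = freqs[i], freqs[j]
        # Frequency of sequences carrying the mutant allele at both sites.
        p_ij = sum(h[i] * h[j] for h in haplotypes) / n
        d_ij = p_ij - p_i * p_j
        total += d_ij**2 / (p_i * (1 - p_i) * p_j * (1 - p_j))

    return 2.0 * total / (k * (k - 1))


# Two completely associated sites versus two independent sites.
linked = [(1, 1), (1, 1), (0, 0), (0, 0)]
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(kelly_zns(linked), kelly_zns(independent))
```

The two toy samples illustrate the extremes of the statistic: complete association between sites and no association at all.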