
4.3. Search for selection at molecular level – case study

4.3.3. Artificial intelligence-based method

The required assumption for successful application of AI-based methods is that the mosaic of test outcomes, which makes direct inference so troublesome, contains enough information to discriminate between the presence of natural selection and its absence. The second prerequisite is expert knowledge about the presence of selection for given combinations of neutrality test outcomes. Given these two, it is possible in principle to train a knowledge-retrieving system and, after successful testing, to use it for other genes for which expert knowledge is unavailable. The author has studied the application of neural networks (in particular the PNN) and of three rough set approaches (CRSA, DRSA, and QDRSA) to the problem considered.

In the experiment with the PNN, in order to interpret the outcomes of the battery of seven tests mentioned above, the complex multi-null-hypotheses methodology was first applied to obtain labels (balancing selection or no evidence of such selection) for given combinations of test results computed under the classical null hypothesis. The goal of the experiment was to demonstrate that the information preserved in these test results (even computed without taking into account factors like population growth, recombination, and population substructure) is valuable enough to obtain reliable inferences.

A probabilistic neural network was used as the tool for this study. As presented in section 2.2.1, it is a specialized radial basis function (RBF) network applicable almost exclusively to classification problems in the probabilistic uncertainty model. On its outputs the network generates likelihood functions p(x|Cj) of input vectors x belonging to a given class Cj. One should notice that likelihood functions also define random abstract classes in the probabilistic uncertainty model.

On the other hand, multiplying the likelihood function by the prior probabilities of the classes (approximated by the frequencies of class representatives in the training set) and dividing the result by a normalizing factor that has the same value for all classes (and is therefore negligible in a decision rule discriminating between classes) yields the posterior probability P(Cj|x) of the given class Cj, given the input vector x. This posterior probability is also the main criterion of the decision rule in the probabilistic uncertainty model implemented by Bayesian classifiers.
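
In standard notation this is simply Bayes' rule; a restatement (added here for clarity) makes the role of the normalizing factor explicit:

$$P(C_j \mid x) \;=\; \frac{p(x \mid C_j)\,P(C_j)}{p(x)}, \qquad p(x) \;=\; \sum_k p(x \mid C_k)\,P(C_k).$$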

The decision rule mentioned above is very simple under the assumption of equal costs of all incorrect decisions (i.e., in the case considered, treating false positive and false negative answers equally). It reduces to choosing the class with the maximum posterior probability P(Cj|x).

Moreover, assuming equal frequencies of the representatives of all classes in the training set – which is the case in this study – the above rule is equivalent to choosing the class with the maximum likelihood p(x|Cj). Since the likelihood functions are generated by the output neurons of the probabilistic neural network, to obtain a decision one drives the inputs of the PNN with the given vector x and chooses the class corresponding to the neuron with the highest response.
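
The decision rule can be made concrete with a short sketch. The following minimal Python illustration of a PNN with Gaussian (Parzen-window) pattern units assumes equal class priors; the function and variable names are illustrative rather than taken from the author's implementation, and the exact kernel normalization by N is an assumption.

    import numpy as np

    def pnn_classify(x, train_X, train_y, s):
        # Pattern layer: one Gaussian kernel per training example; summation
        # layer: average the kernel responses per class, which estimates the
        # likelihood p(x|Cj); output: pick the class with the highest response
        # (maximum likelihood, valid under equal priors).
        N = train_X.shape[1]
        scores = {}
        for c in np.unique(train_y):
            patterns = train_X[train_y == c]
            d2 = np.sum((patterns - x) ** 2, axis=1)
            scores[c] = np.mean(np.exp(-d2 / (2.0 * (s * N) ** 2)))
        return max(scores, key=scores.get)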

The training of the probabilistic neural network is a one-epoch process, given the value of the parameter s denoting the width of the kernel in the pattern layer. Since the classification results depend strongly on the proper value of this parameter, in practice the one-epoch training should be repeated many times within a framework that optimizes the results with respect to s. Fortunately, the shape of the optimized criterion in the one-dimensional space of the parameter s is in the majority of cases not too complex, with one global extremum having a respectable basin of attraction. If the width parameter s is normalized by the dimensionality of the input data N in the argument of the kernel function, then the proper value of s very often lies within the range from 10 to 10^-1. In this study, where minimization of the decision error served as the criterion, the optimal value of s proved to be 0.175.
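
A sketch of the corresponding one-dimensional search is given below, reusing pnn_classify from the previous fragment; the grid bounds and data names are placeholder assumptions. The leave-one-out (jackknife) decision error plays the role of the minimized criterion:

    import numpy as np

    def jackknife_error(train_X, train_y, s):
        # Leave-one-out decision error of the PNN for a given kernel width s.
        n = len(train_y)
        mistakes = 0
        for i in range(n):
            mask = np.arange(n) != i
            pred = pnn_classify(train_X[i], train_X[mask], train_y[mask], s)
            mistakes += (pred != train_y[i])
        return mistakes / n

    def best_width(train_X, train_y):
        # One-dimensional search on a logarithmic grid spanning the typical
        # range; for the data of this study the reported optimum is s = 0.175.
        grid = np.logspace(-1, 1, 50)
        return min(grid, key=lambda s: jackknife_error(train_X, train_y, s))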

Table 12 presents the results of the PNN classification during jackknife cross-validation for s equal to 0.175 (in the study three PNNs were trained, each with a different kernel width; in jackknife cross-validation the PNN with s = 0.175 gave the best results). The decision error of this classifier in testing was only 6.25%, with an estimated standard deviation of this error equal to 0.067, proving very good classification abilities of the PNN.

Table 4.3:12 The results of the jackknife cross-validation procedure for the probabilistic neural network with parameter s = 0.175 (93.75% correct decisions)

Test number   Number of correct decisions   Percentage of correct decisions   Decision error
1             2                             100%                              0
2             1                             50%                               0.5
3             1                             100%                              0
4             2                             100%                              0
5             2                             100%                              0
6             2                             100%                              0
7             2                             100%                              0
8             2                             100%                              0
Average       15/16                         93.75%                            0.0625

To compare the three rough set-based approaches (CRSA, DRSA, and QDRSA) applied to testing for balancing selection in four genes involved in human familial cancers, consider the information system S = (U, Q, Vq, f) in which Q = C ∪ {d}. The haplotypes for particular loci were inferred and their frequencies were estimated using the Expectation-Maximization algorithm (Polańska 2003). The results of tests T, D*, F*, S, Q, B and ZnS, together with the decision concerning the evidence of balancing selection based on the multi-null methodology, are given in Table 13.

The rough set-based analysis of Decision Table 1 reveals that there exist two relative reducts: RED1 = {D*, T, ZnS} and RED2 = {D*, T, F*}. It is clearly visible that the core set is composed of tests D* and T, whereas tests ZnS and F* can be chosen arbitrarily, according to the automatic data analysis. However, since it is known that both Fu's tests F* and D* belong to the same family, and therefore their outcomes are rather strongly correlated, it is advantageous to choose Kelly's ZnS instead of the F* test. This is because ZnS outcomes are theoretically less correlated with the outcomes of test D*, which, as stated above, belongs to the core and is therefore required in any reduct. Decision Table 1 with the set of conditional attributes reduced to RED1 is presented in Table 14.
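
For a decision table of this size, relative reducts can be found by exhaustive search. Below is a small generic sketch (not the author's code); a decision table is represented as a list of (conditions, decision) pairs, with conditions given as a dict from test name to outcome. Applied to Decision Table 1, a search of this kind yields the two reducts named above.

    from itertools import combinations

    def consistent(rows, attrs):
        # attrs preserve the classification if no two rows with different
        # decisions agree on all attributes in attrs.
        for i, (ci, di) in enumerate(rows):
            for cj, dj in rows[i + 1:]:
                if di != dj and all(ci[a] == cj[a] for a in attrs):
                    return False
        return True

    def relative_reducts(rows, all_attrs):
        # Enumerate subsets by increasing size; keep minimal consistent ones.
        reducts = []
        for k in range(1, len(all_attrs) + 1):
            for subset in combinations(all_attrs, k):
                if any(set(r) <= set(subset) for r in reducts):
                    continue  # a smaller reduct is contained in it: not minimal
                if consistent(rows, subset):
                    reducts.append(subset)
        return reducts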

Table 4.3:13 Decision Table 1. The outcomes of the statistical tests for the classical null hypothesis

Gene    Population   D*   B    Q    T    S    ZnS   F*   Balancing Selection
ATM     AfAm         *    NS   NS   *    NS   NS    *    Yes
        Cauc         *    NS   NS   **   **   *     **   Yes
        Asian        NS   NS   NS   *    NS   *     NS   Yes
        Hisp         *    NS   NS   **   NS   *     *    Yes
RECQL   AfAm         NS   NS   NS   **   NS   NS    NS   Yes
        Cauc         *    NS   NS   **   NS   NS    **   Yes
        Asian        NS   *    *    *    NS   *     NS   Yes
        Hisp         *    NS   NS   **   NS   NS    *    Yes
WRN     AfAm         NS   NS   NS   NS   NS   NS    NS   No
        Cauc         *    NS   NS   NS   NS   NS    NS   No
        Asian        *    NS   NS   NS   NS   NS    NS   No
        Hisp         NS   NS   NS   NS   NS   NS    NS   No
BLM     AfAm         NS   NS   NS   NS   NS   NS    NS   No
        Cauc         NS   NS   NS   *    NS   NS    *    No
        Asian        NS   NS   NS   NS   NS   NS    NS   No
        Hisp         NS   NS   NS   NS   NS   NS    NS   No

Table 4.3:14 Decision Table 2, in which the set of tests is reduced to the relative reduct RED1 composed of tests: D*, T, and ZnS

Gene    Population   D*   T    ZnS   Balancing Selection
ATM     AfAm         *    *    NS    Yes
        Cauc         *    **   *     Yes
        Asian        NS   *    *     Yes
        Hisp         *    **   *     Yes
RECQL   AfAm         NS   **   NS    Yes
        Cauc         *    **   NS    Yes
        Asian        NS   *    *     Yes
        Hisp         *    **   NS    Yes
WRN     AfAm         NS   NS   NS    No
        Cauc         *    NS   NS    No
        Asian        *    NS   NS    No
        Hisp         NS   NS   NS    No
BLM     AfAm         NS   NS   NS    No
        Cauc         NS   *    NS    No
        Asian        NS   NS   NS    No
        Hisp         NS   NS   NS    No

After the reduction of the set of informative tests to RED1 = {D*, T, ZnS}, we considered the problem of the coverage of the discrete space generated by these statistics by the examples included in the training set. The results are given in Table 15, in which the domain of each test outcome (coordinate) is composed of three values: ** (strong statistical significance, p < 0.01), * (statistical significance, 0.01 < p < 0.05), and NS (non-significance, p > 0.05). A given point in the space is assigned to S (evidence of balancing selection), N (no evidence of balancing selection), or an empty cell (point not covered by the training data). The assignment is based on the raw training data with the conditional part reduced to the relative reduct RED1. Note that the percentage of points covered by training examples is only 30%: the 16 training rows occupy just 8 of the 3^3 = 27 points of the space.
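
This count can be verified in a few lines of Python (encoding ** as 'SS' and * as 'S'; the triples below are read off directly from Table 14):

    from itertools import product

    # Distinct (D*, T, ZnS) triples occupied by the 16 rows of Decision Table 2.
    covered = {('S', 'S', 'NS'), ('S', 'SS', 'S'), ('NS', 'S', 'S'),
               ('S', 'SS', 'NS'), ('NS', 'SS', 'NS'), ('NS', 'NS', 'NS'),
               ('S', 'NS', 'NS'), ('NS', 'S', 'NS')}
    space = set(product(('SS', 'S', 'NS'), repeat=3))
    print(len(covered), len(space), len(covered) / len(space))  # 8 27 0.296...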

Table 4.3:15 The discrete space of three tests: D*, T, and ZnS, based on Decision Table 2 (· marks an empty cell, i.e., a point not covered by the training data)

                     T = **          T = *           T = NS
              ZnS:   **   *   NS     **   *   NS     **   *   NS
D* = **              ·    ·   ·      ·    ·   ·      ·    ·   ·
D* = *               ·    S   S      ·    ·   S      ·    ·   N
D* = NS              ·    ·   S      ·    S   N      ·    ·   N

The next step was to apply the notion of relative value reducts to the particular decision rules in Decision Table 2. The resulting Decision Table 3 is presented in Table 16.

Table 4.3:16 Decision Table 3, based on relative value reducts for three tests: D*, T, and ZnS (· marks a condition dropped by the relative value reduct)

Gene    Population   D*   T    ZnS   Balancing Selection
ATM     AfAm         *    *    ·     Yes
        Cauc         ·    **   ·     Yes
        Asian        ·    ·    *     Yes
        Hisp         ·    **   ·     Yes
RECQL   AfAm         ·    **   ·     Yes
        Cauc         ·    **   ·     Yes
        Asian        ·    ·    *     Yes
        Hisp         ·    **   ·     Yes
WRN     AfAm         ·    NS   ·     No
        Cauc         ·    NS   ·     No
        Asian        ·    NS   ·     No
        Hisp         ·    NS   ·     No
BLM     AfAm         ·    NS   ·     No
        Cauc         NS   *    NS    No
        Asian        ·    NS   ·     No
        Hisp         ·    NS   ·     No

Table 17 presents information analogous to Table 15; here, however, the coverage is based on the points which are classified with the use of Decision Table 3. One should notice that the percentage of points covered by the algorithm is 74% (20 of the 27 points); however, since 11% (3 points, denoted by "-") are classified as both with and without evidence of balancing selection, only 63% (17 points) can be treated as really covered.

Based on Decision Table 3, AlgorithmCRSA has been obtained using CRSA. Note that this algorithm is simplified compared to the algorithm corresponding to Decision Table 2. At the same time it is more general, which can be observed by comparing Table 17 with Table 15. In the algorithm, the outcomes of the neutrality tests are designated as NS, S, and SS for non-significant, significant, and strongly significant, respectively.

Table 4.3:17 The discrete space of three tests: D*, T, and ZnS, based on Decision Table 3 (· marks a point not covered by the algorithm, and - a contradictory classification)

                     T = **          T = *           T = NS
              ZnS:   **   *   NS     **   *   NS     **   *   NS
D* = **              S    S   S      ·    ·   ·      ·    -   N
D* = *               S    S   S      S    S   S      ·    -   N
D* = NS              S    S   S      ·    S   N      ·    -   N

AlgorithmCRSA (Cyran 2009d)

BAL_SEL_DETECTED = False
BAL_SEL_UNDETECTED = False
CONTRADICTION = False
NO_DECISION = False

if T = SS or (T = S and D* = S) or ZnS = S then
    BAL_SEL_DETECTED = True

if T = NS or (T = S and D* = NS and ZnS = NS) then
    BAL_SEL_UNDETECTED = True

if BAL_SEL_DETECTED and BAL_SEL_UNDETECTED then
    CONTRADICTION = True

if (not(BAL_SEL_DETECTED) and not(BAL_SEL_UNDETECTED)) or CONTRADICTION then
    NO_DECISION = True
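
An executable transcription of AlgorithmCRSA may be helpful. The Python rendering below, with outcomes encoded as the strings 'NS', 'S' and 'SS', is an illustration rather than the author's code:

    def algorithm_crsa(T, Dstar, ZnS):
        # Decision rules read off from Decision Table 3 (see above).
        detected = T == 'SS' or (T == 'S' and Dstar == 'S') or ZnS == 'S'
        undetected = T == 'NS' or (T == 'S' and Dstar == 'NS' and ZnS == 'NS')
        if detected and undetected:
            return 'NO_DECISION'   # contradiction
        if detected:
            return 'BAL_SEL_DETECTED'
        if undetected:
            return 'BAL_SEL_UNDETECTED'
        return 'NO_DECISION'       # point not covered by any rule

    # Example: ATM in the AfAm population (D* = *, T = *, ZnS = NS):
    print(algorithm_crsa(T='S', Dstar='S', ZnS='NS'))  # -> BAL_SEL_DETECTED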

The algorithm generated by DRSA, called AlgorithmDRSA, is as follows.

AlgorithmDRSA (Cyran 2009d)

at_least.BAL_SEL_DETECTED = False
at_most.BAL_SEL_UNDETECTED = False
CONTRADICTION = False
NO_DECISION = False

if T >= SS or (T >= S and D* >= S) or ZnS >= S then
    at_least.BAL_SEL_DETECTED = True

if T <= NS or (T <= S and D* <= NS and ZnS <= NS) then
    at_most.BAL_SEL_UNDETECTED = True

if at_least.BAL_SEL_DETECTED and at_most.BAL_SEL_UNDETECTED then
    CONTRADICTION = True

if (not(at_least.BAL_SEL_DETECTED) and not(at_most.BAL_SEL_UNDETECTED)) or CONTRADICTION then
    NO_DECISION = True
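
The dominance-based conditions translate naturally into an ordinal encoding. The sketch below is again illustrative; the numeric coding NS=0, S=1, SS=2 is an assumption made for this fragment:

    LEVEL = {'NS': 0, 'S': 1, 'SS': 2}  # assumed ordinal coding of outcomes

    def algorithm_drsa(T, Dstar, ZnS):
        t, d, z = LEVEL[T], LEVEL[Dstar], LEVEL[ZnS]
        # ">=" and "<=" in AlgorithmDRSA compare outcomes along this order.
        at_least_detected = t >= 2 or (t >= 1 and d >= 1) or z >= 1
        at_most_undetected = t <= 0 or (t <= 1 and d <= 0 and z <= 0)
        if at_least_detected and at_most_undetected:
            return 'NO_DECISION'   # contradiction
        if at_least_detected:
            return 'at_least.BAL_SEL_DETECTED'
        if at_most_undetected:
            return 'at_most.BAL_SEL_UNDETECTED'
        return 'NO_DECISION'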

It turned out that the algorithm generated by QDRSA, AlgorithmQDRSA, is identical to AlgorithmDRSA when the whole universe U of the information system S is used for generating the algorithm. However, if the universe of the information system S is divided into two parts – the examples used for knowledge retrieval in the process of generating the decision algorithm, and those left for testing – then the algorithms generated by DRSA and QDRSA differ in some cases. Below, only those algorithms which differ between the two approaches are presented.

If the information about the RECQL gene is excluded from the information system S and left for testing in the cross-validation process, then DRSA and QDRSA generate the algorithms AlgorithmDRSA(-RECQL) and AlgorithmQDRSA(-RECQL), respectively. Since the general structure of both algorithms is identical to that of AlgorithmDRSA, only the two crucial if-then rules (the ones after the four initialization assignments and before the two contradiction/no-decision determining if-then rules) are presented below.

AlgorithmDRSA(-RECQL)

...
if (T >= S and D* >= S) or ZnS >= S then
    at_least.BAL_SEL_DETECTED = True

if T <= NS or (D* <= NS and ZnS <= NS) then
    at_most.BAL_SEL_UNDETECTED = True
...

AlgorithmQDRSA(-RECQL)

...
if T >= SS or (T >= S and D* >= S) or ZnS >= S then
    at_least.BAL_SEL_DETECTED = True

if T <= NS or (D* <= NS and ZnS <= NS) then
    at_most.BAL_SEL_UNDETECTED = True
...

It is visible that the difference is the existence of one more condition in the rule describing the detection of balancing selection. This condition reads: "if the outcome of the Tajima test is at least strongly statistically significant". It occurs in AlgorithmQDRSA(-RECQL) because the condition T = SS is the result of applying the relative value reduct to one of the rules in the information system S(-RECQL) analyzed with the QDRSA indiscernibility relation (2.3:5). After changing the condition in QDRSA to T >= SS, this condition is still not dominated by any other condition detecting balancing selection. Since it is not dominated, it must remain in the final decision algorithm presented above.

However, this is not the case in DRSA. The latter approach, when considering the dominance of the decision rules for the class at_least.BAL_SEL, compares the original (i.e., not reduced with the relative value reduct notion) condition (A) D* >= S and T >= SS and ZnS >= S with another original condition (B) D* >= S and T >= S and ZnS >= NS, instead of comparing (as QDRSA does) the condition (a) T >= SS with the condition (b) D* >= S and T >= S, these being the results of applying relative value reducts in the QDRSA sense to the original conditions (A) and (B), respectively.

It is clear that the rule with condition (A) is dominated by the rule with condition (B), and therefore condition (A) seems redundant in the DRSA sense for the class at_least.BAL_SEL. However, the rule with condition (a) is not dominated by the rule with condition (b), and this is the reason why condition (a) is present in AlgorithmQDRSA(-RECQL) while it is absent from AlgorithmDRSA(-RECQL). Conditions (B) and (b) are necessary in both approaches, and they reduce to the condition (b) present in both algorithms.

Finally, consider the influence of including the condition T >= SS in AlgorithmQDRSA(-RECQL). When this algorithm is applied to the interpretation of neutrality tests for the RECQL gene, i.e., the gene which was not present in the information system S(-RECQL), the decision error is reduced from 0.25 to 0 over the four populations. When the full jackknife method of cross-validation is applied, the decision error is reduced from 0.313 with DRSA, which seems rather unacceptable, to 0.125 with QDRSA. It is important to mention that at the same time the fraction of QDRSA NO_DECISION results increased from 0 to 0.188; however, in the case of a screening procedure, for which this methodology is intended, an unsure decision is also an indication for a more detailed study with the use of the multi-null hypotheses methodology.

4.4. Conclusions

Population geneticists have developed quite a number of statistical neutrality tests which serve to reject, at a given significance level, Kimura's model of neutral evolution described in section 4.1. Hence, in the post-genomic era researchers are armed with quite a number of statistical tests (see section 4.2) whose purpose is to detect signatures of natural selection operating at the molecular level. Positive signals generated by these tests, described in detail in section 4.2, can be interpreted as caused by the presence of natural selection. In the case study considered in section 4.3 the following neutrality tests were used: Tajima's T, Fu's D* and F*, Wall's Q and B, Kelly's ZnS and Strobeck's S.

However, because of factors like recombination, population growth, and/or population subdivision, the appropriate interpretation of the test results is very often troublesome (Nielsen 2001). When a given gene is tested with the aforementioned tests, some of them can give positive signals while others generate negative ones. Moreover, positive signals can be caused by population expansion or by the geographical structure of the population. On the other hand, the signatures of actual natural selection can be suppressed by recombination. All these factors make proper interpretation hard, and not necessarily univocal. The problem is that the mentioned departures from the selectively neutral classical model (i.e., a model with a panmictic population of constant size and no recombination) can produce, for some of these tests, results similar to those produced by the existence of natural selection.

Nevertheless, since the time of Kimura's famous book (Kimura 1985) until the present, geneticists have been searching for signatures of natural selection, treating Kimura's model of neutral evolution at the molecular level as a convenient null hypothesis, which is not fulfilled for particular loci under detectable selection. By moving the emphasis from selective forces to random genetic drift and neutral mutations, the neutral theory of molecular evolution gave birth to the mentioned neutrality tests, which treat this theory as a null model; statistically significant departures from it, discovered in loci under study, can be interpreted in favor of natural selection. The existence of rare, positive selection has been confirmed, for example, in the ASPM locus that contributes to brain size in primates (Evans et al. 2004, Zhang 2003).

An interesting example of another type of selection, called balancing selection (see section 3.4), has been detected by the author in the ATM and RECQL loci (see section 4.3). To overcome serious interpretation difficulties while searching for selection in ATM, RECQL, WRN and BLM, i.e., in four human familial cancer genes, the author has proposed the so-called multi-null-hypotheses methodology (part of this methodology was published in Cyran et al. 2004). However, this methodology is not appropriate for fast detection because of the long-lasting computer simulations required for estimating critical values under non-classical null hypotheses.

Yet, armed with reliable conclusions about balancing selection at ATM and RECQL and no evidence of such selection at WRN and BLM, obtained after a time-consuming search with the use of computer simulations, the author has proposed the use of machine learning methodology based only on knowledge of critical values for the classical null hypotheses (see section 4.3.3). Fortunately, critical values for classical nulls are known for all the neutrality tests applied, and therefore the outcomes of such tests can be used as inputs for artificial intelligence classifiers without additional stochastic computer simulations of alternative models.

In this methodology, described in section 4.3.3, the battery of test outcomes is considered as a set of conditional attributes, and the expert knowledge is delivered by applying the multi-null hypotheses method to a small number of genes. After cross-validation of the model, decisions concerning other genes can be made based on testing only against the classical null hypotheses and applying the decision algorithm inferred with the AI-based methodology. Such a strategy does not need intensive computer simulation, and is therefore much more time-efficient compared to the multi-null hypotheses approach.

The results of the application of rough set theory for knowledge acquisition and processing were published in (Cyran 2007a) for CRSA, (Cyran 2010) for CRSA and DRSA, and (Cyran 2009d) for QDRSA. In Cyran (2009b) the author presented the results of another study, based on the application of the probabilistic neural network (PNN) to the detection of natural selection at the molecular level. The advantage of the last proposed method is that it is not so time-consuming and, due to the good recognition abilities of probabilistic neural networks, it gives low decision error levels in cross-validation (see section 4.3.3 for results).

The comparison of CRSA with DRSA for this particular purpose is described in section 4.3.3, where it is proved that neither CRSA nor DRSA generates a decision algorithm that is optimal for the problem considered. The proof is done by a simple demonstration of another algorithm which is Pareto-preferred over those generated by both mentioned approaches. This algorithm can be obtained with QDRSA, the novel method proposed by Cyran (2009d).

The comparison of QDRSA with CRSA favors the former when preference order is present in the conditional and decision attributes. The resulting decision algorithms in QDRSA are more general, i.e., they cover more points of the input space. Moreover, in many
