
Piotr Szulc (Wrocław)

Małgorzata Bogdan (Wrocław)

Localizing Influential Genes with Modified Versions of Bayesian Information Criterion

Abstract Regions of the genome that influence quantitative traits are called quantitative trait loci (QTLs) and can be located using statistical methods. To this end, scientists use genetic markers with known genotypes and look for associations between these genotypes and trait values. A common method for this problem is linear regression. There are many model selection criteria for choosing predictors in linear regression. However, in the context of QTL mapping, where the number of available markers $p_n$ is usually larger than the sample size $n$, the classical criteria overestimate the number of regressors. To solve this problem, several modifications of the Bayesian Information Criterion have been proposed, and it has recently been proved that at least three of them, EBIC, mBIC and mBIC2, are consistent (also when $p_n > n$). In this article we discuss these criteria and their asymptotic properties and compare them in an extensive simulation study in the genetic context.

2010 Mathematics Subject Classification: Primary: 62J05; Secondary: 92D20.

Key words and phrases: statistical genetics, quantitative trait loci, model selection, sparse linear regression, Bayesian Information Criterion.

1. Introduction. Recent years have seen very rapid development of technologies supporting researchers in genetics. As a result, huge data sets are now available for statistical analysis. Effective acquisition of information from such sets requires close cooperation between geneticists, IT specialists and statisticians. The role of statisticians is to develop statistical methods that effectively separate essential information from random noise. In particular, the large size of these data sets requires new methods of correction for multiple testing, as well as new criteria for choosing essential explanatory variables.

A particular example of identifying explanatory variables is the problem of localizing genes responsible for quantitative traits (Quantitative Trait Loci, QTL). In this research the database contains genotypes of a large number of genetic markers. These are fragments of the DNA chain which can be found in different variants (alleles) in different individuals in a population. The form of a given marker in a studied individual can be established experimentally. In diploid organisms, where chromosomes appear in pairs, the genotype of a given marker is specified by the pair of alleles occurring on both chromosomes. From a statistician's point of view, genotypes of markers constitute qualitative explanatory variables. If a given marker lies close to a gene influencing a trait of interest, then we expect a statistical relationship between the genotype at that marker and the trait.

Affiliation: Institute of Mathematics and Computer Science, Wrocław University of Technology and Department of Mathematics and Computer Science, Jan Długosz University in Częstochowa

The main purpose of Genome Wide Association Studies (GWAS) is the identification of QTLs or disease-causing genes in human populations. Localization of genes in such outbred populations is relatively difficult. The problem comes from the fact that, due to cross-overs (exchange of genetic material among chromatids), which occur during the production of reproductive cells, the statistical relationship between a QTL genotype and the genotype of a neighboring marker might be very weak. Therefore scientists need to use a huge number of densely spaced markers, which necessitates the application of stringent multiple testing corrections. The classical approach to localizing QTLs in GWAS is based on standard t-tests or ANOVA tests, performed separately at every marker. To control for multiple testing the popular Benjamini-Hochberg procedure [3] is often used. As explained in [14], this standard approach suffers from an unnecessary loss of power, because the effects of genes unaccounted for by single-marker models inflate the variance of the "residual" error. Moreover, the correlation between a given marker genotype and a trait can be substantially influenced by other causal genes. As discussed in [3], the multiple regression approach to GWAS, presented below, helps to solve both of these problems.

Compared to GWAS, localization of QTLs is much easier when it takes place in experimental populations. Such populations are created on the basis of so-called inbred lines, which are obtained through successive crosses of closely related individuals. As a result of such crosses we can get individuals that at every locus (place on the genome) have the same alleles on both chromosomes. By crossing individuals from such inbred lines, scientists can control the process of cross-overs and create large experimental populations, for which one can theoretically predict the structure of correlation between genotypes at different markers. Inbred lines have been developed for many plants and animals. Since the genomes of experimental animals, such as mice or rats, are remarkably similar to the genomes of more complex mammals (e.g. humans), the results from studies in experimental populations often constitute the starting point for analyzing more complex organisms (see e.g. [17]). A description of traditional methods commonly used to localize QTLs in experimental populations can be found, for example, in the book [16] or the articles [10] and [11], while a good tutorial on GWAS is provided in [2].

2. Mathematical problem. To use statistical methods for QTL mapping, we need to encode genotypes for all individuals. Genome Wide Association Studies usually use Single Nucleotide Polymorphisms (SNPs). These markers typically have only two alleles: a and A, and the corresponding genotypes are often coded as

$$X = \begin{cases} -1/2 & \text{if } aa, \\ 0 & \text{if } aA, \\ 1/2 & \text{if } AA. \end{cases} \tag{1}$$
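As a small illustration of the coding in equation (1), the mapping can be written as a lookup table. This is only a sketch: the names `GENOTYPE_CODE` and `encode_genotypes` are hypothetical, and treating the labels "Aa" and "aA" as the same heterozygous genotype is an assumption about how the raw data might be labeled.

```python
# Additive coding of SNP genotypes, following equation (1).
GENOTYPE_CODE = {"aa": -0.5, "aA": 0.0, "AA": 0.5}


def encode_genotypes(genotypes):
    """Map genotype labels to the numeric values of equation (1)."""
    # Assumption: "Aa" and "aA" denote the same heterozygous genotype.
    return [GENOTYPE_CODE["aA" if g in ("aA", "Aa") else g] for g in genotypes]


coded = encode_genotypes(["aa", "aA", "AA", "Aa"])
print(coded)
```

With this coding the regression coefficient of a marker measures the trait difference between the two homozygotes.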

In this article we will also consider a backcross design, which is very popular in experimental studies. In the first step of this design, individuals from two distinct inbred lines are crossed. Let us denote by $a$ an allele from one of these lines and by $A$ an allele from the other line. Each individual in the population F1, resulting from crossing individuals from these two lines, has the genotype $aA$ at each locus. Then individuals from the F1 population are crossed with individuals from one of the initial inbred lines. The resulting individuals can have only two genotypes at each locus, e.g. $aa$ or $aA$. To code these genotypes we will use $-1/2$ for $aa$ and $1/2$ for $aA$.
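The backcross coding can be illustrated by simulating marker genotypes along a chromosome. The sketch below is not the authors' simulation code: it assumes equally spaced markers and Haldane's map function (no cross-over interference), neither of which is specified in the text, and the function name and parameters are hypothetical.

```python
import numpy as np


def simulate_backcross(n, markers_per_chrom, n_chrom=1, chrom_len_cm=100.0, seed=0):
    """Simulate backcross genotypes coded -1/2 (aa) and 1/2 (aA).

    Assumptions (not from the article): equally spaced markers and
    Haldane's map function for the recombination fraction.
    Requires markers_per_chrom >= 2.
    """
    rng = np.random.default_rng(seed)
    # Distance between adjacent markers in Morgans.
    d = chrom_len_cm / 100.0 / (markers_per_chrom - 1)
    # Haldane: probability of recombination between adjacent markers.
    r = 0.5 * (1.0 - np.exp(-2.0 * d))
    blocks = []
    for _ in range(n_chrom):
        # Allele inherited from the F1 parent at the first marker (0 = a, 1 = A).
        first = rng.integers(0, 2, size=(n, 1))
        # A recombination between adjacent markers flips the inherited allele.
        switches = rng.random((n, markers_per_chrom - 1)) < r
        rest = (first + np.cumsum(switches, axis=1)) % 2
        blocks.append(np.hstack([first, rest]))
    alleles = np.hstack(blocks)  # chromosomes are independent of each other
    return alleles - 0.5  # 0 -> -1/2 (aa), 1 -> 1/2 (aA)


X_bc = simulate_backcross(n=50, markers_per_chrom=5, n_chrom=10)
print(X_bc.shape)
```

Markers on the same chromosome are correlated through the shared recombination history, while markers on different chromosomes are independent.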

Now, QTL mapping is often performed by identifying the best regression model of the form

$$y_i = \sum_{j=1}^{p_n} \beta_{nj} x_{ij} + \epsilon_i, \quad i = 1, \ldots, n, \tag{2}$$

where $p_n$ is the number of available markers, $y_i$ is the trait value of the $i$-th individual, $x_{ij}$ is the genotype of the $j$-th marker for the $i$-th individual and the $\epsilon_i$'s are independent random variables with the normal distribution $N(0, \sigma^2)$.
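To make model (2) concrete, the following sketch generates trait values from a sparse coefficient vector. The function name, the choice of causal indices and the use of independent random genotypes are illustrative assumptions; only the model form and the $\pm 1/2$ coding come from the text.

```python
import numpy as np


def simulate_traits(X, causal_idx, effects, sigma=1.0, seed=0):
    """Draw y = X beta + eps with eps ~ N(0, sigma^2), as in model (2)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    beta[list(causal_idx)] = effects  # only the causal markers are nonzero
    y = X @ beta + rng.normal(0.0, sigma, size=n)
    return y, beta


# Placeholder design: independent genotypes coded -1/2 and 1/2 (an assumption;
# real markers on the same chromosome would be correlated).
rng = np.random.default_rng(42)
X = rng.choice([-0.5, 0.5], size=(50, 50))
y, beta = simulate_traits(X, causal_idx=[4, 20, 37], effects=[1.2, 1.6, 2.0])
print(y[:5])
```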

When choosing the best model we try to determine which of the regression coefficients are different from zero. There exist many model selection criteria which can be used for this purpose, but the classical ones, like the Akaike Information Criterion (AIC, [1]) or the Bayesian Information Criterion (BIC, [18]), are inappropriate. They were derived under the assumption that the sample size $n$ goes to infinity while the total number of available regressors $p_n$ remains constant. In our situation we have many markers (available regressors) and usually far fewer observations. Thus the asymptotics underlying BIC and AIC no longer describe the situation well. It turns out that for such large databases the classical criteria choose too many markers (see e.g. [13]). To solve this problem several new modifications of BIC have recently been proposed. In this article we focus on three of these modifications: mBIC [5], mBIC2 ([13], [14], [21]) and EBIC [8]. They were constructed under the asymptotic assumption that $p_n \to \infty$ and $p_{0n}/p_n \to 0$, where $p_{0n}$ is the number of nonzero coefficients. This assumption (called sparsity) is appropriate in our situation, because in the problem of locating genes we know that the number of influential genes is very small compared with $p_n$.

To define these new criteria, we first recall the Bayesian Information Criterion [18], which suggests choosing the model which minimizes the following formula:

$$\mathrm{BIC}(s) = n \ln(\mathrm{RSS}(s)) + v(s) \ln n, \tag{3}$$

where $s$ is a subset of $\{1, \ldots, p_n\}$ that contains the indices of the nonzero regression coefficients, $\mathrm{RSS}(s)$ is the residual sum of squares and $v(s)$ is the number of elements in $s$.
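Formula (3) can be evaluated directly for any candidate subset $s$. The sketch below assumes a least-squares fit without an intercept, matching model (2) as written; the helper names `rss` and `bic` and the toy data are hypothetical.

```python
import numpy as np


def rss(X, y, s):
    """Residual sum of squares of the least-squares fit on the columns in s."""
    s = list(s)
    if not s:
        return float(y @ y)  # empty model: no regressors and no intercept
    Xs = X[:, s]
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ coef
    return float(resid @ resid)


def bic(X, y, s):
    """BIC(s) = n ln(RSS(s)) + v(s) ln n, as in equation (3)."""
    n = len(y)
    return n * np.log(rss(X, y, s)) + len(list(s)) * np.log(n)


# Toy data: one strong causal marker (column 0) among five.
rng = np.random.default_rng(0)
X = rng.choice([-0.5, 0.5], size=(100, 5))
y = 3.0 * X[:, 0] + rng.normal(size=100)
print(bic(X, y, []), bic(X, y, [0]))
```

The model with the smaller value is preferred, so here the one-marker model should beat the empty model by a wide margin.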

In [5] a modification of BIC, called mBIC, was proposed. In this criterion it is assumed that the prior distribution for $p_{0n}$ is Binomial, $B\left(p_n, \frac{E p_{0n}}{p_n}\right)$, where $E p_{0n}$ is the expected value of $p_{0n}$. The formula for mBIC is

$$\mathrm{mBIC}(s) = n \ln(\mathrm{RSS}(s)) + v(s) \ln n + 2 v(s) \ln\left(\frac{p_n}{c} - 1\right), \tag{4}$$

where $c = E p_{0n}$. If $E p_{0n}$ is not known, the authors of [6] propose using $c = 4$ to keep the overall type I error below 10%.

It was shown that the additional penalty in mBIC is closely related to the Bonferroni correction for multiple testing. Specifically, under an orthogonal design and with known $\sigma$, the problem of choosing the best regressors can be solved using single-marker z-tests, and mBIC is practically equivalent to the Bonferroni correction at a Family Wise Error Rate roughly proportional to $1/n$ ([6], [13]). Later, in [13] and [14], a new version of mBIC, called mBIC2, was proposed. In this criterion the additional penalty is related to the Benjamini-Hochberg procedure [3] for multiple testing. The formula for mBIC2 is

$$\mathrm{mBIC2}(s) = n \ln(\mathrm{RSS}(s)) + v(s) \ln n + 2 v(s) \ln\frac{p_n}{c} - 2 \ln(v(s)!). \tag{5}$$

In [8] another modification of BIC, called EBIC, was proposed. Here the authors assume that the prior probability of choosing the model $s$ is proportional to $\binom{p_n}{v(s)}^{-\gamma}$ for some $\gamma \ge 0$. This assumption gives the formula for the EBIC family:

$$\mathrm{EBIC}_\gamma(s) = n \ln(\mathrm{RSS}(s)) + v(s) \ln n + 2\gamma \ln\binom{p_n}{v(s)}. \tag{6}$$
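The three criteria differ from BIC only in their extra penalty terms, which a short sketch can make explicit. The functions below follow formulas (4), (5) and (6) as written above and take the residual sum of squares as an input; the function names are hypothetical, and `lgamma` is used to evaluate $\ln(v!)$ and the log binomial coefficient without overflow.

```python
from math import lgamma, log


def log_binom(p, v):
    """Natural log of the binomial coefficient C(p, v)."""
    return lgamma(p + 1) - lgamma(v + 1) - lgamma(p - v + 1)


def mbic(rss, n, v, p, c=4.0):
    """Equation (4); requires p / c > 1."""
    return n * log(rss) + v * log(n) + 2 * v * log(p / c - 1)


def mbic2(rss, n, v, p, c=4.0):
    """Equation (5); lgamma(v + 1) equals ln(v!)."""
    return n * log(rss) + v * log(n) + 2 * v * log(p / c) - 2 * lgamma(v + 1)


def ebic(rss, n, v, p, gamma=1.0):
    """Equation (6); gamma = 0 recovers the classical BIC of equation (3)."""
    return n * log(rss) + v * log(n) + 2 * gamma * log_binom(p, v)


# Penalties for a model with 3 markers out of 200, n = 150, RSS = 120 (toy values).
for name, value in (("mBIC", mbic(120.0, 150, 3, 200)),
                    ("mBIC2", mbic2(120.0, 150, 3, 200)),
                    ("EBIC", ebic(120.0, 150, 3, 200))):
    print(name, round(value, 3))
```

Because the term $2\ln(v!)$ is subtracted, mBIC2 penalizes additional regressors less heavily than mBIC once the model already contains several variables.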

Simulation studies have shown good properties of these criteria ([15], [20], [21]). From a theoretical point of view, it is shown in [13] that mBIC is consistent and asymptotically Bayes optimal under sparsity (ABOS) when $p_{0n} \approx \mathrm{const}$ and the design matrix is orthogonal. Moreover, the same article shows that mBIC2 is consistent and ABOS under the same conditions ($p_{0n} \approx \mathrm{const}$), and additionally when $p_{0n}$ tends to infinity in such a way that $p_{0n}/p_n \to 0$. This result is especially important in the analysis of large data sets, where one expects that more regressors will yield a larger number of significant predictors. In [7] and [8] the properties of EBIC under nonorthogonal designs are investigated, which makes it possible to consider the practically relevant situation where $p_n > n$. It is shown that, under some regularity conditions on the design matrix, EBIC is consistent for $p_{0n} = \mathrm{const}$ and when the maximum size of the searched models is limited. Later, in [9], these results were extended and the consistency of EBIC was proved also for $p_{0n} \to \infty$. In [19] analogous results were proved for mBIC and mBIC2.

In the following sections we will present theorems about consistency of mBIC, mBIC2 and EBIC under nonorthogonal designs and compare those criteria using an extensive simulation study.

3. Consistency. When $p_n > n$ the regression design matrix is not of full rank and not all regression models are identifiable. In this situation proper statistical inference is possible e.g. under the assumption of sparsity and the additional assumption that "small" regression models are identifiable. In this article we will use the following identifiability assumption from [9].

Assumption 1 (Identifiability condition) Denote $X_n = (x_{ij})_{i=1,\ldots,n,\; j=1,\ldots,p_n}$ and let $X_n(s)$ be the matrix composed of the columns of $X_n$ with indices in $s$. Let $H_n(s) = X_n(s)[X_n(s)^T X_n(s)]^{-1} X_n(s)^T$ (the orthogonal projection onto the space spanned by the columns of $X_n(s)$) and $\Delta_n(s) = \mu_n^T [I_n - H_n(s)] \mu_n$, where $\mu_n = E y_n = X_n(s_{0n}) \beta_n(s_{0n})$ and $s_{0n}$ is the "true model", i.e. $s_{0n} = \{j : \beta_{nj} \ne 0,\ j \in \{1, \ldots, p_n\}\}$.

We assume that the following condition holds:

$$\lim_{n \to \infty} \min\left\{ \frac{\Delta_n(s)}{p_{0n} \ln p_n} : s_{0n} \not\subset s,\ v(s) \le k_n \right\} = \infty, \tag{7}$$

where $k_n = k p_{0n}$ for some fixed $k > 1$.

In short, this identifiability condition states that the distance between the vector of trait expectations and its projection onto the space spanned by the columns of the design matrix of any incorrect "small" regression model is sufficiently large. This is required to prevent the situation where the true model can be represented by more than one "small" combination of available regressors. A discussion of this condition and its relationship with other similar assumptions used in this context can be found in [9] or [19].
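The quantity $\Delta_n(s)$ from Assumption 1 is simply the squared length of the part of $\mu_n$ that the columns in $s$ cannot explain, so it can be computed as a residual sum of squares. The sketch below uses a least-squares solve instead of forming $H_n(s)$ explicitly, which is a numerical convenience rather than anything prescribed by the text; the function name and toy design are hypothetical.

```python
import numpy as np


def delta_n(X, beta, s):
    """Delta_n(s) = mu^T (I - H(s)) mu with mu = X beta.

    Since I - H(s) is an orthogonal projection, this equals the squared norm
    of the residual of mu regressed on the columns of X indexed by s.
    """
    mu = X @ beta
    s = list(s)
    if not s:
        return float(mu @ mu)
    Xs = X[:, s]
    coef, *_ = np.linalg.lstsq(Xs, mu, rcond=None)
    resid = mu - Xs @ coef  # (I - H(s)) mu
    return float(resid @ resid)


rng = np.random.default_rng(7)
X = rng.choice([-0.5, 0.5], size=(40, 6))
beta = np.array([1.5, 2.0, 0.0, 0.0, 0.0, 0.0])  # true model s0 = {0, 1}

print(delta_n(X, beta, [0, 1]))     # contains the true model: essentially zero
print(delta_n(X, beta, [0, 2, 3]))  # misses marker 1: strictly positive
```

Condition (7) requires that the second kind of value, minimized over all small models missing a true regressor, grows faster than $p_{0n} \ln p_n$.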

The following consistency theorems provide the conditions under which the modified versions of BIC are consistent. They were proved in [9] and [19].

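Theorem 3.1 Assume model (2), the identifiability condition (7), that $p_{0n} \ln p_n = o(n)$ and $\frac{\ln p_{0n}}{\ln p_n} \to \delta \ge 0$. Then

$$P\left( \min_{s:\, v(s) \le k_n,\ s \ne s_{0n}} \mathrm{EBIC}_\gamma(s) > \mathrm{EBIC}_\gamma(s_{0n}) \right) \to 1,$$

if $\gamma > \frac{1+\delta}{1-\delta} - \frac{\ln n}{2(1-\delta) \ln p_n}$.

Theorem 3.2 Assume model (2), the identifiability condition (7) and that $p_{0n} \ln p_n = o(\sqrt{n})$. Then

$$P\left( \min_{s:\, v(s) \le k_n,\ s \ne s_{0n}} \mathrm{mBIC}(s) > \mathrm{mBIC}(s_{0n}) \right) \to 1.$$

Theorem 3.3 Assume model (2), the identifiability condition (7) and that $p_{0n}^2 \ln p_n = o(\sqrt{n})$. Then

$$P\left( \min_{s:\, v(s) \le k_n,\ s \ne s_{0n}} \mathrm{mBIC2}(s) > \mathrm{mBIC2}(s_{0n}) \right) \to 1.$$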

Note that the condition on the rate of increase of p0n is the least restrictive for EBIC and the most restrictive for mBIC2.

4. Simulations.

4.1. Backcross design. In this section we simulate marker genotypes for a backcross design with ten chromosomes, each 100 cM (centiMorgans) long. Such a chromosome length is representative for genetic studies. The expected number of cross-overs in one meiosis over a distance of 100 cM is equal to 1. The traits were simulated according to model (2) with $\epsilon_i$ drawn from a normal distribution with $\mu = 0$ and $\sigma = 1$. Effect sizes (i.e. $\beta_{nj}$ in model (2)) were chosen evenly between 1.2 and 2.0, so that the power of detection of causal genes under Scenario 1 was moderate (between 60% and 80%). Note that the power depends on the ratio $\beta/\sigma$, thus our results are also representative for other values of $\sigma$ as long as $\beta/\sigma$ remains constant. For the sample size, the size of the database and the dimension of the true model we chose four scenarios, which are described in Table 1 ($p_n$ denotes the number of markers, $n$ is the number of individuals and $p_{0n}$ is the number of QTLs). We used stepwise selection to choose the best model. For mBIC and mBIC2 we set $c = 4$ and for EBIC $\gamma = 1$. We repeated all simulations 1000 times.
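The text does not detail the stepwise search, so the following is only a simplified stand-in: a greedy forward selection that adds the marker giving the largest decrease in mBIC and stops when no addition improves the criterion. A full stepwise procedure would also consider removing markers. All names and the toy data (independent genotypes, three strong effects) are hypothetical.

```python
import numpy as np
from math import log


def rss(X, y, s):
    """Residual sum of squares of the no-intercept least-squares fit on columns s."""
    if len(s) == 0:
        return float(y @ y)
    Xs = X[:, s]
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ coef
    return float(resid @ resid)


def mbic(rss_value, n, v, p, c=4.0):
    """mBIC of equation (4); requires p / c > 1."""
    return n * log(rss_value) + v * log(n) + 2 * v * log(p / c - 1)


def forward_select(X, y, c=4.0, max_size=None):
    """Greedy forward search minimizing mBIC (a sketch, not full stepwise)."""
    n, p = X.shape
    if max_size is None:
        max_size = min(p, n // 2)  # keep models small relative to n
    selected = []
    best = mbic(rss(X, y, selected), n, 0, p, c)
    while len(selected) < max_size:
        candidates = [j for j in range(p) if j not in selected]
        scores = [mbic(rss(X, y, selected + [j]), n, len(selected) + 1, p, c)
                  for j in candidates]
        k = int(np.argmin(scores))
        if scores[k] >= best:
            break  # no single addition improves the criterion
        best = scores[k]
        selected.append(candidates[k])
    return sorted(selected)


rng = np.random.default_rng(1)
n, p = 100, 20
X = rng.choice([-0.5, 0.5], size=(n, p))
beta = np.zeros(p)
beta[[2, 7, 11]] = 3.0
y = X @ beta + rng.normal(size=n)
selected = forward_select(X, y)
print(selected)
```

Swapping `mbic` for an mBIC2 or EBIC function changes only the penalty, so the same search loop can be reused to compare the criteria.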

We focused on the probability of choosing the true model (T), the average number of true discoveries divided by $p_{0n}$ (power, P) and the false discovery rate (FDR), which is the expected proportion of false discoveries among all discoveries.
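The three summary measures can be estimated from repeated simulations as simple averages. The sketch below assumes that each replicate yields a set of selected marker indices and that the false discovery proportion of an empty selection is taken as 0; the function name and the three toy replicates are hypothetical.

```python
def summarize(selections, true_set):
    """Estimate T, P and FDR from a list of selected index sets."""
    true_set = set(true_set)
    exact, power, fdp = 0, 0.0, 0.0
    for sel in selections:
        sel = set(sel)
        tp = len(sel & true_set)  # true discoveries
        fp = len(sel - true_set)  # false discoveries
        exact += sel == true_set
        power += tp / len(true_set)
        fdp += fp / max(len(sel), 1)  # convention: 0 when nothing is selected
    m = len(selections)
    return exact / m, power / m, fdp / m


# Three toy replicates with true model {1, 2, 3}.
T, P, FDR = summarize([[1, 2, 3], [1, 2], [1, 2, 3, 4]], true_set=[1, 2, 3])
print(T, P, FDR)
```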

Table 1: Parameters of simulations.

| | $p_n$ | $n$ | $p_{0n}$ |
|---|---|---|---|
| Scenario 1 | 50 | 50 | 3 |
| Scenario 2 | 100 | 75 | 5 |
| Scenario 3 | 150 | 110 | 7 |
| Scenario 4 | 200 | 150 | 9 |

As we can see in Table 2, the probabilities of finding the true model using mBIC and EBIC increase along with $n$, even though $p_n$ and $p_{0n}$ increase too (notice that $n \le p_n$ in every scenario, with strict inequality in Scenarios 2 to 4). In these four scenarios, mBIC performs better than EBIC. The probabilities of finding the true model are higher and, except in Scenario 1, the average numbers of false positives and the false discovery rates are lower than for EBIC, with similar average numbers of positive discoveries. mBIC2 has a lower probability of finding the true model than the other criteria in Scenarios 2 to 4. Note, however, that in our simulations $p_{0n}$ is almost linearly related to $n$, which mainly violates the assumptions of Theorem 3.3. It is nevertheless important to observe that mBIC2 has the highest power of all three criteria and that its FDR decreases toward 0 as $n$ increases. To show the advantages of mBIC2 we repeated the simulations from Scenario 4 with smaller effect sizes, ranging between 0.8 and 1.0. The results are shown in Table 3. As we can see, here the power of mBIC2 is substantially higher than that of the other criteria, with the false discovery rate controlled at approximately 10%.

4.2. GWAS. QTL mapping in experimental crosses does not require as many markers as Genome Wide Association Studies. To illustrate how our three criteria work with really large data sets we performed additional simulations in the context of GWAS. For this purpose we simulated 1000 individuals and 482906 markers, which is comparable to an Illumina 650K human array, typically used for SNP genotyping. We chose four scenarios, which are described in Table 4. The number of markers in the database $p_n$ and the dimension of the true genetic model $p_{0n}$ changed according to the functions $cn^2$ and $n^{0.49}$, respectively, where $c$ is chosen so that $p_n = 482906$ for $n = 1000$. We used these functions in order to fulfill the assumptions for the consistency of mBIC and EBIC. The causal markers included in the simulated model were selected so that they all have a minor allele frequency (MAF) exceeding 0.35 and are distant from each other. Since the power of gene detection depends on the product of the magnitude of its genetic effect (regression coefficient) and the standard deviation of the dummy variable coding for its genotype, our results are also representative for genes with smaller MAF and larger genetic effects. The traits were simulated according to model (2) with $\epsilon_i$ drawn from a normal distribution with $\mu = 0$ and $\sigma = 1$ and with effect sizes equal to 0.35. For mBIC and mBIC2 we set $c = 4$ and for EBIC $\gamma = 1$. We used the algorithm described in [8] to choose the best model. Due to the computational complexity of the search over large GWAS databases we repeated all simulations only 100 times.
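The scaling rules $p_n = cn^2$ and $p_{0n} = n^{0.49}$ can be checked with a few lines of arithmetic. In the sketch below, rounding $p_{0n}$ to the nearest integer is an assumption; it reproduces the $p_{0n}$ column of Table 4, while the computed $p_n$ values for $n < 1000$ come out close to, but not exactly equal to, the tabulated ones (within about 0.02%).

```python
n_full, p_full = 1000, 482906
c = p_full / n_full**2  # chosen so that p_n = 482906 when n = 1000

sample_sizes = (550, 700, 850, 1000)
p_n = [c * n**2 for n in sample_sizes]
p_0n = [round(n**0.49) for n in sample_sizes]  # assumption: nearest integer

for n, p, p0 in zip(sample_sizes, p_n, p_0n):
    print(f"n = {n:4d}   c*n^2 = {p:10.1f}   n^0.49 rounded = {p0}")
```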

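Table 2: Results of simulations.

| Criterion | Scenario | T | P | FDR |
|---|---|---|---|---|
| mBIC | Scenario 1 | 0.355 | 0.736 | 0.134 |
| mBIC | Scenario 2 | 0.714 | 0.953 | 0.048 |
| mBIC | Scenario 3 | 0.750 | 0.967 | 0.038 |
| mBIC | Scenario 4 | 0.868 | 0.994 | 0.014 |
| mBIC2 | Scenario 1 | 0.279 | 0.804 | 0.229 |
| mBIC2 | Scenario 2 | 0.467 | 0.979 | 0.130 |
| mBIC2 | Scenario 3 | 0.491 | 0.983 | 0.097 |
| mBIC2 | Scenario 4 | 0.470 | 0.998 | 0.076 |
| EBIC | Scenario 1 | 0.272 | 0.630 | 0.109 |
| EBIC | Scenario 2 | 0.675 | 0.934 | 0.056 |
| EBIC | Scenario 3 | 0.714 | 0.973 | 0.046 |
| EBIC | Scenario 4 | 0.761 | 0.996 | 0.027 |

Table 3: Scenario 4 with smaller effect sizes.

| Criterion | T | P | FDR |
|---|---|---|---|
| mBIC | 0.245 | 0.762 | 0.081 |
| mBIC2 | 0.335 | 0.915 | 0.107 |
| EBIC | 0.299 | 0.781 | 0.081 |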

When reporting our results, we replaced the probability of choosing the true model with the number of misclassifications $MC = p_{0n} - TP + FP$, averaged over 100 replicates, where TP is the number of true positives (so that $p_{0n} - TP$ is the number of undetected causal markers) and FP is the number of false positives. We treated a selected marker as a true discovery when its correlation with a marker used to generate the traits was greater than 0.7. The results are shown in Table 5.
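The misclassification count can be sketched as follows. The text does not say how several selected markers tagging the same causal marker are counted, so this example makes an explicit assumption: TP counts causal markers that have at least one selected marker with correlation above 0.7, and FP counts selected markers that are not correlated above 0.7 with any causal marker. Function names and the toy data are hypothetical, and all columns are assumed to be non-constant.

```python
import numpy as np


def column_corr(A, B):
    """Pearson correlations between the columns of A and the columns of B."""
    A = (A - A.mean(axis=0)) / A.std(axis=0)
    B = (B - B.mean(axis=0)) / B.std(axis=0)
    return A.T @ B / A.shape[0]


def misclassifications(X, selected, causal, threshold=0.7):
    """Return (MC, TP, FP) with MC = p0n - TP + FP under the stated assumption."""
    selected, causal = list(selected), list(causal)
    if not selected:
        return len(causal), 0, 0
    hit = column_corr(X[:, selected], X[:, causal]) > threshold
    tp = int(hit.any(axis=0).sum())     # causal markers tagged by a selected one
    fp = int((~hit.any(axis=1)).sum())  # selected markers tagging no causal one
    return len(causal) - tp + fp, tp, fp


rng = np.random.default_rng(3)
X = rng.choice([-0.5, 0.0, 0.5], size=(200, 4))
X[:, 1] = X[:, 0]  # marker 1 is a perfect proxy for causal marker 0
mc, tp, fp = misclassifications(X, selected=[1, 3], causal=[0, 2])
print(mc, tp, fp)
```

Only the correlations between selected and causal columns are computed, which avoids forming the full marker correlation matrix for GWAS-sized data.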

The results agree with the theoretical findings about the consistency of the modified versions of BIC and show that, as $n$ increases, we choose models which are closer to the true one (MC decreases toward zero). Under these four scenarios, mBIC2 performs better overall than the other criteria. It has substantially larger power, while controlling FDR at the assumed level. It also has the smallest number of misclassified regressors in Scenarios 1 to 3; in Scenario 4 EBIC is marginally lower (1.49 versus 1.67).

Table 4: Parameters of simulations.

| | $p_n$ | $n$ | $p_{0n}$ |
|---|---|---|---|
| Scenario 1 | 146095 | 550 | 22 |
| Scenario 2 | 236650 | 700 | 25 |
| Scenario 3 | 348939 | 850 | 27 |
| Scenario 4 | 482906 | 1000 | 30 |

Table 5: Results of simulations.

| Criterion | Scenario | MC | P | FDR |
|---|---|---|---|---|
| mBIC | Scenario 1 | 20.10 | 0.093 | 0.066 |
| mBIC | Scenario 2 | 15.24 | 0.396 | 0.014 |
| mBIC | Scenario 3 | 12.57 | 0.544 | 0.017 |
| mBIC | Scenario 4 | 3.70 | 0.883 | 0.007 |
| mBIC2 | Scenario 1 | 15.58 | 0.338 | 0.120 |
| mBIC2 | Scenario 2 | 8.20 | 0.734 | 0.078 |
| mBIC2 | Scenario 3 | 4.19 | 0.907 | 0.064 |
| mBIC2 | Scenario 4 | 1.67 | 0.983 | 0.038 |
| EBIC | Scenario 1 | 19.41 | 0.123 | 0.042 |
| EBIC | Scenario 2 | 11.72 | 0.544 | 0.023 |
| EBIC | Scenario 3 | 6.24 | 0.798 | 0.035 |
| EBIC | Scenario 4 | 1.49 | 0.963 | 0.013 |

5. Discussion. In this article we discussed three modifications of the Bayesian Information Criterion which can be used for the detection of genes influencing quantitative traits. We presented theoretical results on the consistency of these criteria under the assumption that the number of available regressors and the number of true signals increase with the sample size. These assumptions are relevant in the context of genetic studies, where the number of available markers usually substantially exceeds the sample size. Also, it is expected that by increasing the number of markers geneticists should be able to detect a larger number of causal genes.

While the results on consistency are of great value, in practical applications the rate of convergence of the probabilities of type I and type II errors to zero is also important. In [4] and [13] the related notion of asymptotic Bayes optimality under sparsity (ABOS) was introduced. It was proved that under orthogonal design matrices mBIC and mBIC2 have some asymptotic optimality properties, with mBIC2 being ABOS in a much wider range of the sparsity parameter $p_{0n}/p_n$. Recently similar results have been proved for EBIC (manuscript in preparation). We believe that the results on the asymptotic optimality of the modified versions of the Bayesian Information Criterion can be extended to the nonorthogonal case by assuming identifiability of "small" models, as in assumption (7).

In many practical cases of gene localization scientists do not use multiple regression models, but perform individual tests for the association between the trait and the genotype of each of the considered genes. As discussed in [14], this approach is much less powerful than the methods discussed in this article. The major advantage of model building procedures is that the variability of the trait related to the regressors included in the model is removed from the residual sum of squares, which results in increased power of detection of weaker genes.

In this article we discussed the application of modified versions of BIC to the identification of genes influencing quantitative traits which can be modeled by a normal distribution. In the case of non-normal continuous traits, modified versions of BIC can be used together with nonparametric rank regression. The application of the rank version of mBIC to gene identification was discussed in [22]. Modified versions of BIC were also successfully used in cases where the trait can be modeled by Generalized Linear Models, e.g. logistic regression (see [21]), or by the Zero-Inflated Generalized Poisson Regression (see [12]). We believe that the consistency and asymptotic optimality results on modified versions of BIC can be extended to these models and consider this an interesting topic for further research.

References

[1] H. Akaike, A new look at the statistical model identification, IEEE Trans. Automat. Control 19 (1974), pp. 716–723.

[2] D. J. Balding, A tutorial on statistical methods for population association studies, Nat. Rev. Gen. 7 (2006), pp. 781–791.

[3] Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal Statistical Society: Series B 57(1) (1995), pp. 289–300.

[4] M. Bogdan, A. Chakrabarti, F. Frommlet, J. K. Ghosh, Asymptotic Bayes optimality under sparsity of some multiple testing procedures, Annals of Statistics 39 (2011), pp. 1551–1579.

[5] M. Bogdan, J. Ghosh and R. W. Doerge, Modifying the Schwarz Bayesian information criterion to locate multiple interacting quantitative trait loci, Genetics 167 (2004), pp. 989–999.

[6] M. Bogdan, J. Ghosh, M. Zak-Szatkowska, Selecting explanatory variables with the modified version of Bayesian Information Criterion, Quality and Reliability Engineering International 24 (2008), pp. 627–641.

[7] J. Chen and Z. Chen, Extended BIC for small n-large-P sparse GLM (2010) (submitted, available at www.stat.nus.edu.sg/~stachenz/ChenChen.pdf).

[8] J. Chen and Z. Chen, Extended Bayesian information criterion for model selection with large model space, Biometrika 94 (2008), pp. 759–771.

[9] Z. Chen and Z. Luo, Extended BIC for linear regression models with diverging number of parameters and high or ultra-high feature spaces (2011) (technical report available at arxiv.org/abs/1107.2502v1).

[10] R. W. Doerge, Z-B. Zeng, B. S. Weir, Statistical issues in the search for genes affecting quantitative traits in experimental populations, Statistical Science 12 (1997), pp. 195–219.

[11] R. W. Doerge, Mapping and analysis of quantitative trait loci in experimental populations, Nature Reviews Genetics 3 (2002), pp. 43–52.

[12] V. Erhardt, M. Bogdan, C. Czado, Locating Multiple Interacting Quantitative Trait Loci with the Zero-Inflated Generalized Poisson Regression, Statistical Applications in Genetics and Molecular Biology 9(1) (2010), Article 26.

[13] F. Frommlet, M. Bogdan and A. Chakrabarti, Asymptotic Bayes optimality under sparsity for general priors under the alternative (2011) (technical report available at arxiv.org/abs/1005.4753v2).

[14] F. Frommlet, F. Ruhaltinger, P. Twaróg and M. Bogdan, A model selection approach to genome wide association studies, Computational Statistics and Data Analysis (2011) (doi:10.1016/j.csda.2011.05.005).

[15] W. Li and Z. Chen, Multiple interval mapping for quantitative trait loci with a spike in the trait distribution, Genetics 182(2) (2009), pp. 337–342.

[16] M. Lynch, B. Walsh, Genetics and analysis of quantitative traits, Sinauer, Sunderland, MA, 1998.

[17] T. Philips, Animal models for the genetic study of human alcohol phenotypes, Alcohol Research and Health 26, pp. 202–207.

[18] G. Schwarz, Estimating the dimension of a model, Annals of Statistics 6 (1978), pp. 461–464.

[19] P. Szulc, Weak consistency of modified versions of Bayesian Information Criterion in a sparse linear regression, Probability and Mathematical Statistics (in press).

[20] J. Zhao and Z. Chen, A two-stage penalized logistic regression approach to case-control genome-wide association studies (2010) (submitted, available at www.stat.nus.edu.sg/~stachenz/MS091221PR.pdf).

[21] M. Żak-Szatkowska, M. Bogdan, Modified versions of Bayesian Information Criterion for sparse Generalized Linear Models, Computational Statistics & Data Analysis 55(11) (2011), pp. 2908–2924.

[22] M. Żak, A. Baierl, M. Bogdan, A. Futschik, Locating multiple interacting quantitative trait loci using rank-based model selection, Genetics 176 (2007), pp. 1845–1854.

Localization of genes using modified versions of the Bayesian Information Criterion

Summary. Recent years have seen very rapid development of technologies supporting research in genetics. This progress has resulted in huge data sets. Effective acquisition of information from such sets requires close cooperation between geneticists, computer scientists and statisticians. The role of statisticians is to define precise criteria that guarantee an effective separation of essential information from random noise. In particular, the large size of these data sets requires new methods of correction for multiple testing and new criteria for selecting important explanatory variables.

A particular example of identifying explanatory variables is the problem of localizing genes responsible for quantitative traits (Quantitative Trait Loci, QTL). Genes are localized using so-called molecular markers. These are fragments of the DNA chain that can occur in different variants (alleles) in different individuals in a population. The form of a given marker in a studied individual can be established experimentally. In diploid organisms, in which chromosomes occur in pairs, the genotype of a given marker is specified by the alleles occurring on both chromosomes. From a statistician's point of view, marker genotypes constitute qualitative explanatory variables. If a given marker lies close to a gene influencing the studied trait, we can expect a statistical dependence between the genotype at this marker and the studied quantitative trait.

To identify important genetic markers, a multiple regression model is usually used. The number of independent variables can then be estimated using one of many model selection criteria. Unfortunately, it turns out that in the genetic context, where the number of markers substantially exceeds the number of observations, classical model selection criteria overestimate the number of important variables. To solve this problem, several new modifications of the Bayesian Information Criterion have recently been introduced. In this article we present three of these modifications, give results on the consistency of these methods when the number of available genetic markers grows with the sample size, and report the results of computer simulations illustrating the performance of these methods in the genetic context.

Key words: statistical genetics, model selection, sparse linear regression, Bayesian Information Criterion, quantitative analysis of gene localization


damental Problem of Technology of Wroclaw University of Technology. Currently he is a PhD student at the Department of Mathematics and Computer Science of Wroclaw University of Technology. He published one article in The Probability and Mathematical Statistics.

Piotr Szulc was born in Kalisz in 1986. He received his MSc Eng. degree in 2010 from Wrocław University of Technology in mathematics, specializing in mathematical statistics. Since 2010 he has been a PhD student at the Institute of Mathematics and Computer Science of Wrocław University of Technology.

Małgorzata Bogdan was born in 1966 in Wrocław. In 1992 she obtained an MSc degree in mathematics from the Faculty of Fundamental Problems of Technology of Wrocław University of Technology. In 1996 she obtained a PhD degree in mathematical sciences from the Department of Mathematics and Computer Science of Wrocław University of Technology. In 2009 she obtained a habilitation in technical sciences (specialization: computer science) from the Institute of Computer Science of the Polish Academy of Sciences. She is an Associate Professor in the Departments of Mathematics and Computer Science of Wrocław University of Technology and Jan Długosz University in Częstochowa. In 1999 she received a "Kolumb" scholarship from the Foundation for Polish Science for an 8-month stay in the Department of Statistics of the University of Washington. In 2012 she received a Fulbright Senior Advanced Researcher Award for a 9-month stay in the Department of Statistics of Stanford University. Małgorzata Bogdan has taught many courses at Purdue University, the University of Washington and Vienna University. She is a member of the Graduate Faculty in the Department of Statistics of Purdue University, where she served as a member of the PhD Committee of one of its PhD students. She is on the Editorial Board of Scientific Reports, published by Nature Group. She is interested in the analysis of high dimensional data and its applications in statistical genetics. She has published 27 articles in international journals in the fields of statistics, genetics and bioinformatics.

Małgorzata Bogdan was born in Wrocław in 1966. She received her master's degree in 1992 from Wrocław University of Technology, at the Faculty of Fundamental Problems of Technology, in applied mathematics. In 1996 she obtained a PhD in mathematical sciences in the field of mathematical statistics, and in 2009 a habilitation in technical sciences with a specialization in computer science. Małgorzata Bogdan is an associate professor at Wrocław University of Technology and Jan Długosz University in Częstochowa. As a scholarship holder of the Foundation for Polish Science, she spent 6 months in 2000 at the University of Washington in Seattle. In 2012 she received a Fulbright scholarship for experienced researchers for a 9-month stay in the Department of Statistics of Stanford University. She has repeatedly lectured at Purdue University and the University of Vienna. Her interests focus on the asymptotic properties of methods for the analysis of high-dimensional data and their applications to the analysis of genetic data.


Piotr Szulc

Wrocław University of Technology

Institute of Mathematics and Computer Science, Wybrzeże Wyspiańskiego 27, Wrocław 50-370 E-mail: Piotr.A.Szulc@pwr.wroc.pl

Małgorzata Bogdan

Wrocław University of Technology and Jan Długosz University in Częstochowa

Institute of Mathematics and Computer Science, Wybrzeże Wyspiańskiego 27, Wrocław 50-370 E-mail: Malgorzata.Bogdan@pwr.wroc.pl

(Received: 31st of May 2012)
