Challenges from new technologies and new biomarkers.



2.1. How to screen for promising intermediate markers?

Paolo Vineis¹ and Roel Vermeulen²

¹ Imperial College, London, United Kingdom

² IRAS, Utrecht, The Netherlands

Intermediate markers

Intermediate biomarkers directly or indirectly represent events on the continuum between exposure and disease. Intermediate biomarkers can provide important mechanistic insight into the pathogenesis of cancer, including early effects that are proximate to exposure and subsequent pre-neoplastic alterations. As such, they complement classic epidemiological studies that use cancer endpoints. In addition, intermediate biomarkers can provide initial clues about the carcinogenic potential of new exposures years before cancer develops (1–5).

One group of intermediate biomarkers, biomarkers of early biologic effect (6), generally measure early biologic changes that reflect early, non-clonal and generally non-persistent effects. Examples of early biologic effect biomarkers include measures of cellular toxicity, chromosomal alterations, DNA, RNA and protein/peptide expression, and early non-neoplastic alterations in cell function (e.g., altered DNA repair, altered immune function). Generally, early biologic effect markers are measured in substances such as blood and blood components (red blood cells, white blood cells (WBCs), DNA, RNA, plasma, sera, urine, etc.) because they are easily accessible and because in some instances it is reasonable to assume that they can serve as surrogates for other organs. Early biological effect markers can also be measured in other accessible tissues such as skin, cervical and colon biopsies, epithelial cells from surface tissue scrapings or sputum samples, exfoliated urothelial cells in urine; colonic cells in feces; and epithelial cells in breast nipple aspirates. Other early effect markers include measures of circulating biologically active compounds in plasma that may have epigenetic effects on cancer development (e.g., hormones, growth factors, cytokines).

For maximum utility, an intermediate biomarker must be shown to be predictive of developing cancer, preferably in prospective cohort studies (7) or potentially in carefully designed case-control studies of cases with low-stage/grade tumors. The criteria for validating intermediate biomarkers have been discussed by Schatzkin and colleagues (8,9) and focus on the calculation of the etiologic fraction (also known as the attributable fraction or proportion) of the intermediate endpoint, which varies from 0 to 1. The closer the etiologic fraction is to 1, the more closely the biologic marker reflects events, either directly or indirectly, on the causal pathway to disease. The availability of numerous prospective cohort studies with stored blood specimens should enhance our ability to rapidly test the relationship between a wide variety of early biologic effect markers, using both standard and emerging technologies (10,11), and cancer risk. Such studies could ultimately produce a new generation of endpoints to evaluate the carcinogenic potential and mechanisms of action of various risk factors. In addition, this line of research may one day identify a panel of intermediate markers, easily analyzed from blood samples, that can be used to identify individuals at elevated risk of developing cancer in the future, who may then benefit from targeted primary and secondary preventive strategies.
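Since the argument turns on the etiologic (attributable) fraction, a small numeric sketch may help. The formula below is Levin's classical population attributable fraction, used purely for illustration; the exact estimator discussed by Schatzkin and colleagues is not spelled out here, so the function name and parameterization are assumptions:

```python
def attributable_fraction(prevalence, relative_risk):
    """Levin's population attributable fraction for a binary marker.

    prevalence    - proportion of the population positive for the marker
    relative_risk - disease risk in marker-positive vs. marker-negative subjects
    The result lies in [0, 1) for relative_risk >= 1; values near 1 suggest
    the marker captures most events on the pathway to disease.
    """
    excess = prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)

# e.g. a marker carried by 30% of the population that triples disease risk
af = attributable_fraction(0.30, 3.0)   # 0.375
```

A marker with an attributable fraction this far below 1 would, on this criterion, reflect only part of the causal pathway.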

A second group of intermediate markers represents events further down the continuum from exposure to disease, where early hyperplastic or pre-neoplastic alterations may have occurred, sometimes due to clonal expansion of a genetically or epigenetically altered cell. These have been referred to as biomarkers of altered structure and function (12). Some of these events can be identified by standard histologic techniques, at times enhanced through the use of special methods. More subtle and earlier pre-neoplastic changes may be detected through proliferation and apoptosis assays as well as molecular analyses that reflect early clonal events in cell cycle control. These markers are frequently analyzed in tissues from organ sites of interest.

Screening for potential intermediate markers

Screening for potential intermediate markers has a strong parallel with the field of genomics, where both candidate gene approaches and genome-wide screens are performed to identify genetic risk factors. In the candidate gene approach, priors are formulated and single nucleotide polymorphisms (SNPs) in genes are selected. In genome-wide screens, no priors are formulated, as essentially all genes in the whole genome are studied (discovery approach). The methodologies in the domain of intermediate markers are less developed than in the field of genomics; however, similar approaches can be distinguished.

Candidate approach

Numerous markers have been proposed as markers for carcinogenesis. These were based mostly on either the observation that they were frequently detected in tumors (e.g. t(14;18), t(8;21), t(15;17), P53, KRAS mutations, and HPV, SV40 infections) or that they were associated with genotoxic exposures (e.g. chromosomal aberrations (CA), sister chromatid exchanges (SCE), micronuclei (MN), DNA adducts). Selection of intermediate endpoints based on the observation that a marker is linked with either the disease or the exposure of interest is a reasonable approach. However, its validity should still be tested against the full criteria of validity, i.e. that it is linked to both the exposure and the disease of interest (i.e. on, or reflecting, the causal pathway). Unfortunately, many of the historically proposed markers have not been properly validated or have turned out not to fulfil these criteria, with the exception of chromosomal aberrations, micronuclei and several virus infections such as human papillomavirus (HPV). In the 1990s, historical cohort studies from Scandinavian countries and Italy, using archived data from many laboratories active in the field of cytogenetic biomonitoring, evaluated the level of CA, SCEs and MN in peripheral blood cells (PBCs) as a biomarker of cancer risk (13–16). A significant association between the frequency of CA in healthy individuals and subsequent incidence of, or mortality from, all cancers was observed. These findings were subsequently confirmed by two independent cohorts in five Central-Eastern European countries (17), and to a lesser extent in the Czech Republic (18,19). No association was found between mean SCE level and cancer risk (13,20,21). The possible role of MN as a predictor of cancer risk has been studied in an international collaborative study, and preliminary results based on a cohort of almost 7000 subjects showed a significant association between MN frequency and the risk of cancer (22). Nine studies have examined the association between cancer at different sites and the levels of "bulky" DNA adducts. A global 73% excess of adduct levels was found in cases compared with controls among current smokers (95% CI: 31 to 115%). No association between cancer status and DNA adduct levels was found among former smokers, whereas never-smokers showed inconsistent results.

These observations are in accordance with the findings of three prospective studies (23–25), which also found that DNA adduct levels measured in WBCs were predictive of lung cancer. However, results were inconsistent in that two studies found the association only in current smokers, while the third found it in never-smokers (current smokers were not investigated). It is fair to conclude that the candidate approach to the selection of intermediate markers has until now not been overly successful. However, given the ever-increasing knowledge of the etiology of many diseases at the molecular level, improved molecular techniques and the availability of prospective studies with biological materials, the candidate approach should deliver promising markers in the near future.

Discovery approach

Recent breakthroughs in biotechnology (e.g. genomics, transcriptomics, proteomics, and metabonomics) offer the possibility of developing a new series of biomarkers of cancer risk. These techniques enable investigators to broadly explore biologic responses to exogenous and endogenous exposures, to evaluate potential modification of those responses by variants in essentially the entire genome, and to define tumors at the chromosomal, DNA, mRNA and protein levels. Given their ome-wide screening, these techniques do not require any prior knowledge and can be used purely as discovery tools. Of these new techniques, proteomics and metabonomics are hypothetically the most promising tools for identifying new markers. In light of the fact that the human genome consists of approximately 25,000–30,000 genes (26), a fraction of what was originally expected, it has become clear that mammalian systems are more complex than what can be determined by genes alone. Alternative splicing, as well as over 200 post-translational modifications, affects protein structure, function, stability and interactions with other molecules (27). It is therefore likely that a number of different proteins/peptides may be expressed by a single gene. Both proteomics and metabonomics have recently been adapted to high-throughput, highly sensitive technologies (e.g. SELDI-TOF/MS, NMR), making it possible to screen large numbers of samples for a large number of potential markers. The results until now have, however, not been as promising as originally thought. This is largely due to limitations in the current techniques, small sample sizes and generally poor study designs. As techniques are still improving and methodological issues are better taken into consideration, it is merely a matter of time before these techniques produce new leads in the discovery of novel intermediate markers. However, given that these techniques are still only discovery tools, these leads need to be carefully investigated and compared to existing biological information from in vivo or in vitro tests. Secondly, they should be confirmed in other independent studies (candidate approach), preferably using different platforms.

Conclusions

Future biomarker research will bring further improvements in the sensitivity, specificity and throughput of existing classical and omic methodologies. As more of these technologies become commercially available, their application in epidemiological studies will require large-scale efforts to validate and establish them as useful tools in biomarker research. Screening for potential biomarkers should be based on both candidate and discovery approaches.

References

1. Schatzkin A, Freedman LS, Schiffman MH, Dawsey SM. Validation of intermediate end points in cancer research. J Natl Cancer Inst 1990;82(22):1746–52.

2. National Research Council. Biological markers in environmental health research. Committee on Biological Markers of the National Research Council. Environ Health Perspect 1987;74:3–9.

3. Schulte PA, Rothman N, Schottenfeld D. Design considerations in molecular epidemiology. In: Schulte PA, Perera FP, editors. Molecular epidemiology: principles and practices. San Diego, CA: Academic Press; 1993. p. 159–98.

4. Schatzkin A, Gail M. The promise and peril of surrogate end points in cancer research. Nat Rev Cancer 2002;2(1):19–27.


5. Toniolo P, Boffetta P, Shuker DEG, Rothman N, Hulka B, Pearce N. Application of Biomarkers in Cancer Epidemiology. Lyon: IARC; 1997.

6. National Research Council. Biological markers in environmental health research. Committee on Biological Markers of the National Research Council. Environ Health Perspect 1987;74:3–9.

7. Schatzkin A, Freedman LS, Schiffman MH, Dawsey SM. Validation of intermediate end points in cancer research. J Natl Cancer Inst 1990;82(22):1746–52.

8. Schatzkin A, Freedman LS, Schiffman MH, Dawsey SM. Validation of intermediate end points in cancer research. J Natl Cancer Inst 1990;82(22):1746–52.

9. Schatzkin A, Gail M. The promise and peril of surrogate end points in cancer research. Nat Rev Cancer 2002;2(1):19–27.

10. Tomer KB, Merrick BA. Toxicoproteomics: a parallel approach to identifying biomarkers. Environ Health Perspect 2003;111(11):A578–9.

11. Nicholson JK, Wilson ID. Opinion: understanding 'global' systems biology: metabonomics and the continuum of metabolism. Nat Rev Drug Discov 2003;2(8):668–76.

12. National Research Council. Biological markers in environmental health research. Committee on Biological Markers of the National Research Council. Environ Health Perspect 1987;74:3–9.

13. Bonassi S, Znaor A, Norppa H, Hagmar L. Chromosomal aberrations and risk of cancer in humans: an epidemiologic perspective. Cytogenet Genome Res 2004;104(1–4):376–82.

14. Bonassi S, Hagmar L, Stromberg U, Montagud AH, Tinnerberg H, Forni A, et al. Chromosomal aberrations in lymphocytes predict human cancer independently of exposure to carcinogens. European Study Group on Cytogenetic Biomarkers and Health. Cancer Res 2000;60(6):1619–25.

15. Hagmar L, Bonassi S, Stromberg U, Brogger A, Knudsen LE, Norppa H, et al. Chromosomal aberrations in lymphocytes predict human cancer: a report from the European Study Group on Cytogenetic Biomarkers and Health (ESCH). Cancer Res 1998;58(18):4117–21.

16. Hagmar L, Bonassi S, Stromberg U, Mikoczy Z, Lando C, Hansteen IL, et al. Cancer predictive value of cytogenetic markers used in occupational health surveillance programs: a report from an ongoing study by the European Study Group on Cytogenetic Biomarkers and Health. Mutat Res 1998;405(2):171–8.

17. Boffetta P, van der Hel O, Norppa H, Fabianova E, Fucic A, Gundy S, et al. Chromosomal aberrations and cancer risk: results of a cohort study from Central Europe. Am J Epidemiol 2007;165(1):36–43.

18. Rossner P, Boffetta P, Ceppi M, Bonassi S, Smerhovsky Z, Landa K, et al. Chromosomal aberrations in lymphocytes of healthy subjects and risk of cancer. Environ Health Perspect 2005;113(5):517–20.

19. Boffetta P, van der Hel O, Norppa H, Fabianova E, Fucic A, Gundy S, et al. Chromosomal aberrations and cancer risk: results of a cohort study from Central Europe. Am J Epidemiol 2007;165(1):36–43.

20. Norppa H, Bonassi S, Hansteen IL, Hagmar L, Stromberg U, Rossner P, et al. Chromosomal aberrations and SCEs as biomarkers of cancer risk. Mutat Res 2006;600(1–2):37–45.

21. Hagmar L, Bonassi S, Stromberg U, Brogger A, Knudsen LE, Norppa H, et al. Chromosomal aberrations in lymphocytes predict human cancer: a report from the European Study Group on Cytogenetic Biomarkers and Health (ESCH). Cancer Res 1998;58(18):4117–21.


22. Znaor A, Fucic A, Strnad M, Barkovic D, Skara M, Hozo I. Micronuclei in peripheral blood lymphocytes as a possible cancer risk biomarker: a cohort study of occupationally exposed workers in Croatia. Croat Med J 2003;44(4):441–6.

23. Tang D, Phillips DH, Stampfer M, Mooney LA, Hsu Y, Cho S, et al. Association between carcinogen-DNA adducts in white blood cells and lung cancer risk in the physicians health study. Cancer Res 2001;61(18):6708–12.

24. Peluso M, Munnia A, Hoek G, Krzyzanowski M, Veglia F, Airoldi L, et al. DNA adducts and lung cancer risk: a prospective study. Cancer Res 2005;65(17):8042–8.

25. Bak H, Autrup H, Thomsen BL, Tjonneland A, Overvad K, Vogel U, et al. Bulky DNA adducts as risk indicator of lung cancer in a Danish case-cohort study. Int J Cancer 2006;118(7):1618–22.

26. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, et al. The sequence of the human genome. Science 2001;291(5507):1304–51.

27. Banks RE, Dunn MJ, Hochstrasser DF, Sanchez JC, Blackstock W, Pappin DJ, et al. Proteomics: new perspectives, new biomedical opportunities. Lancet 2000;356(9243):1749–56.


2.2. The use of DAGs in causality assessment for biomarkers

Sara Geneletti

Imperial College, London, United Kingdom

The present contribution aims to summarise some new approaches to the elucidation of "causal" pathways that involve biomarkers or gene-environment interactions. We do not address two philosophical issues that underlie such approaches. One issue concerns the definition of 'cause'. It is well known that the definition varies according to different schools of thought, although it has generally been agreed in the literature (1–4) that the key is to invoke interventions. These are defined as deliberate interference with the natural course of events, as for instance in an experiment. However, in epidemiology we usually deal with observations and not experiments, and thus we will sometimes use the word 'cause' to represent a more intuitive and speculative concept, in contrast to the more rigorous "interventional" concept. The second issue is related to the interpretation of graphical representations of causal pathways (DAGs, see below). This is a very powerful and general approach, and broadly speaking there are two approaches to causal inference that make use of Directed Acyclic Graphs (DAGs). One approach, put forward by Judea Pearl (1), combines graphical models with functional equations and counterfactuals. The second approach (2,5,6) is based on statistical decision theory and takes the view that the aim of causal inference is to inform future decision-making processes. This makes it particularly appropriate for epidemiological research, as it can readily be applied to health policy decisions. The decision theoretic approach is used in the exposition below. However, we focus on the practical side of graphical models rather than promoting a specific philosophical interpretation.

Causal inference

The main objective of causal inference is to separate the language of statistical association from the language of causality. In this contribution, we focus on the use of graphic models for causal inference in the context of biomarker research and gene-environment interactions. The web of relationships between disease, exposure, genes and other biomarkers is in fact easily visualisable in terms of graphical models.

Directed Acyclic Graphs (DAGs) are graphic representations of relationships among variables, and they can be used to disentangle causal relationships from simple statistical associations. However, causal assumptions must be applied to distinguish the causal arrows from the merely associated ones, and this external information needs to be explicitly stated as it is not in itself contained in a DAG. In particular, DAGs derived from observational data without interventions are just representations of conditional independences (see below) and thus are purely probabilistic until additional causal information can be brought to bear.


Components of the Decision Theoretic (DT) approach

Conditional Independences and DAGs

Conditional (in)dependence is a probabilistic concept: consider three variables A, B and C. We say A is independent of B conditional on C (written A ⊥ B | C) if P(A,B|C) = P(A|C)×P(B|C). Simply put, this says that if we know what C is, knowing what A is gives us no further information on B. An example from genetics is as follows: if we want to know the genetic makeup of an individual (B), we might be able to gain some information on it by looking at the genetic makeup of a sibling (A). If, however, we can access the genetic makeup of the parents (C), then knowing the sibling's makeup gives us no further information on B.
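This definition can be checked numerically. The sketch below simulates a linear-Gaussian stand-in for the sibling example, where C (a parental genotype score) is a common cause of the two siblings' scores A and B; the coefficients and the use of partial correlation as an independence measure are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
C = rng.normal(size=n)              # parents' genotype score (common cause)
A = C + rng.normal(size=n)          # sibling 1
B = C + rng.normal(size=n)          # sibling 2 (the individual of interest)

# marginally, A carries information about B ...
r_AB = np.corrcoef(A, B)[0, 1]                      # clearly non-zero (~0.5)

# ... but given C it carries none: the partial correlation is ~0
res_A = A - np.polyval(np.polyfit(C, A, 1), C)      # residualize A on C
res_B = B - np.polyval(np.polyfit(C, B, 1), C)      # residualize B on C
r_AB_given_C = np.corrcoef(res_A, res_B)[0, 1]      # ~0: A is indep. of B | C
```

The marginal correlation is substantial, while the partial correlation given C vanishes, which is exactly the content of A ⊥ B | C.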

DAGs encode such conditional independences. It is important to note that more than one DAG can encode the same set of conditional independences. The independence above corresponds to the three DAGs in Figure 2.1. below, however, only the first corresponds to the example given above.

1. A ← C → B

2. A → C → B

3. A ← C ← B

Fig. 2.1. A is independent of B conditional on C.

Interventions and Regimes

In order to make rigorous causal inference, interventions need to take place, or, when no interventions take place, conditional independences need to be found that enable us to interpret the consequences of natural events as those of interventions (as in the case of Mendelian randomization, see below). A clear example of why interventions and naturally occurring events do not necessarily lead to the same results is confounding. Often the key to making causal inference from observations is whether some randomization has occurred.

F → HDL → CVD

Fig. 2.2. F is an intervention on HDL (HDL-cholesterol intake), which is associated with CVD (cardiovascular disease).

Formally, causality is encoded in DAGs by using the concept of regime. A regime indicates the intervention status of a variable and is denoted by F. Thus if we are looking at a binary variable X (0,1) and its possible regimes, we will consider the "active treatment regime" F = 1, the "baseline treatment regime" F = 0 and the observational regime F = ∅ (the empty set), corresponding to setting X = 1 with no uncertainty, setting X = 0 with no uncertainty and, finally, letting X arise without manipulation of any sort. The observational regime is generally the data-gathering situation. Interventions are represented by decision nodes (square boxes) in augmented DAGs (2), and these can be used to make some causal inference from DAGs, as shown in Figure 2.2.
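The three regimes can be illustrated with a toy simulation in which an unobserved confounder U makes the observational contrast differ from the interventional one; all numeric values below are assumptions chosen for the illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
U = rng.binomial(1, 0.5, n)                       # unobserved confounder

# observational regime F = empty set: X arises naturally and depends on U
X = rng.binomial(1, np.where(U == 1, 0.8, 0.2))
Y = rng.binomial(1, 0.2 + 0.3 * U + 0.1 * X)      # Y depends on U and on X
obs_contrast = Y[X == 1].mean() - Y[X == 0].mean()

# interventional regimes: F = 1 sets X = 1, F = 0 sets X = 0, regardless of U
Y1 = rng.binomial(1, 0.2 + 0.3 * U + 0.1 * 1)     # F = 1
Y0 = rng.binomial(1, 0.2 + 0.3 * U + 0.1 * 0)     # F = 0
int_contrast = Y1.mean() - Y0.mean()              # true causal effect (0.1)
```

With these numbers the observational contrast comes out near 0.28, well above the causal effect of 0.1 recovered by the two interventional regimes, which is why data gathered under F = ∅ alone do not license causal claims.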


It is often questionable to extract a DAG from observational data and assume that the arrows represent causal relationships.

The role of DAGs in biomarker research

When the interest is in the role of a biomarker in a biological pathway, DAGs can be used in the first instance as a tool to play around with possible configurations. Here they are only a visual aid and encode no statistical information, and the arrows can be interpreted as speculative causal relationships. When data are available, DAGs can be extracted using conditional independence tests (1,7).

The DAG representations can be very useful for determining which set of relationships fits with the observed data. Note again that at this stage, DAGs represent purely statistical relationships and do not say anything about causal relationships. However, they can then be augmented by causal assumptions, if there are any plausible ones. For instance, time ordering can be invoked to restrict the number of DAGs, or knowledge about interventions or randomization can be included to constrain their number.

An example concerning intermediate biomarkers

A recent example of a difficult interpretation of data concerning an intermediate biomarker is the role played by C-Reactive Protein (CRP) in cardiovascular disease. Is CRP really an intermediate marker, i.e. does it lie on the biological pathway between the risk factors and the disease, or is it simply an indicator of the risk factors? For example, both metabolic changes, such as triglyceride or HDL-cholesterol elevation, and inflammation are risk factors for CVD: is CRP just an epiphenomenon of their alteration or is it a genuine mechanism that mediates their action on CVD?

Consider the following simple example. If we have observational (i.e. F = ∅) data on the variables HDL, CVD and CRP, then, in theory, we can determine what conditional (in)dependences exist between them. Say for instance that we have found that

CVD ⊥ CRP | HDL, F = ∅ [2.5]

i.e. knowing about CRP does not give us additional information about the likelihood of CVD if we already know the HDL levels in the observational regime. This results in the 3 DAGs in Figure 2.3A. These DAGs are associational — can we reduce them to one that represents causal relationships? We can do this by making explicit and external (to the DAG) assumptions:

1. CVD cannot cause increased HDL, because it is always the temporal endpoint i.e. CVD does not precede increased levels of HDL. This reduces the DAGs to those in Figure 2.3B.

2. We know that CRP production is governed by a particular gene G, and we have data on the population under study that indicate that people who naturally produce less CRP have (say) lower HDL levels than those who produce more CRP. From these data we get that G ⊥ HDL | CRP, F = ∅, that is, once we know the CRP level, the type of gene (mutated or not) gives us no further information about the level of HDL. If we believe that this emulates a controlled trial on CRP, and further that G ⊥ CVD | HDL, CRP (i.e. CVD does not depend on the gene when we know what the HDL and CRP levels are), then we can use this information to make causal inference. This reduces the possible DAGs to Figure 2.3C.

Fig. 2.3. (A) The three DAGs that encode the conditional independence CVD ⊥ CRP | HDL, F = ∅. (B) CVD has to be "downstream". (C) Additional information states that G ⊥ HDL | CRP, F = ∅. (The DAG panels are not reproduced here.)
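The reasoning in steps 1–2 can be checked numerically. The sketch below assumes a linear-Gaussian stand-in for the chain G → CRP → HDL → CVD of Figure 2.3C; the coefficients, the continuous "CVD risk score", and the use of partial correlation as an independence test are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def partial_corr(x, y, z):
    """Correlation of x and y after linearly residualizing both on z."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(rx, ry)[0, 1]

# hypothetical linear version of the chain G -> CRP -> HDL -> CVD
G = rng.binomial(1, 0.5, n).astype(float)   # CRP-governing gene
CRP = 0.8 * G + rng.normal(size=n)
HDL = 0.7 * CRP + rng.normal(size=n)
CVD = 0.6 * HDL + rng.normal(size=n)        # continuous risk-score stand-in

r_marg = np.corrcoef(CRP, CVD)[0, 1]        # marginal association, non-zero
ci1 = partial_corr(CVD, CRP, HDL)           # ~0: CVD indep. of CRP given HDL
ci2 = partial_corr(G, HDL, CRP)             # ~0: G indep. of HDL given CRP
```

Both stated conditional independences hold (partial correlations near zero) even though CRP and CVD are marginally associated, matching the pattern that singles out the chain DAG.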

3. Another way to move on from the scenario in Figure 2.3B is if we have the results of a randomized trial of a medication in another population that changes the levels of HDL (and only HDL), the aim of which is to reduce the incidence of CVD. We can think of this as a case where F = 1 for those who got the treatment and F = 0 for those who did not. If the results of the trial indicate that CVD incidence depends on the treatment, for instance that those taking the active treatment are less likely to experience CVD, then we get Figure 2.4A, provided we can extract CVD ⊥ F | HDL from the data, i.e. CVD depends on the treatment only through the HDL levels; if this does not hold, then the treatment is affecting CVD through both HDL and some other unobserved factors. If we also monitor the CRP levels and the treatment does not affect these, then we may conclude, based on the evidence above and that from the trial, that Figure 2.4B describes the associations between the variables and can, based on the causal assumptions we have made so far, be interpreted as encoding causal relationships.

Fig. 2.4. The DAGs resulting from the combination of the evidence from the observational and experimental regimes.


Depending on the set of conditional independences found in the data, different, and usually more complex, dependence structures can be found. The exposition above aims to show that the DAGs themselves are not causal, and can only be interpreted as such by making additional assumptions that contain the causal information. Further, this information is (a) non-trivial and (b) must be plausible in the context under investigation. For example, in point 3 above, we are drawing on information from another study to inform the results of the current study; we must have grounds to believe that the two populations are comparable (exchangeable) in order to combine the two pieces of evidence.

Association vs. Intervention in the context of “genetic causality”

Experiments and intervention are the gold standard for causal inference. If we have a controlled randomized trial then we can make inference about causal relationships between the treatment and the effect. When we are faced with observational data we cannot make causal inference without making some strong assumptions, as we will probably have to contend with confounders etc. Where does this leave us when we consider genetic data? Essentially, when someone says "this gene mutation causes increased incidence of this disease" they must have evidence to the effect that if someone were to intervene on that gene and mutate it, we would see increased incidence of the disease on average in the population under study. What type of evidence would be convincing? A randomized controlled trial with the mutation as a treatment would be, but this is not generally possible in a human population. Perhaps something similar in animals may be plausible, but then this might require additional assumptions (see examples of research on knocked-out mice, for example on genes for hypertension). Finally, if enough data are available on the mutation in a suitably diverse population, a Mendelian randomization argument can be convincing, as it replaces the controlled randomization of a trial with natural randomization, thus breaking the link with non-genetic confounders. Mendelian randomization can also be used as an instrumental variable to make causal inference about exposures (8).
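The instrumental-variable use of Mendelian randomization can be sketched under an assumed linear model in which genotype G influences the exposure but affects the outcome only through it, while an unmeasured confounder U distorts the naive regression; all coefficients are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
G = rng.binomial(1, 0.3, n).astype(float)   # the "naturally randomized" allele
U = rng.normal(size=n)                      # unmeasured confounder
X = 0.5 * G + U + rng.normal(size=n)        # exposure (e.g. a biomarker level)
Y = 0.4 * X + U + rng.normal(size=n)        # outcome; true causal effect 0.4

naive = np.polyfit(X, Y, 1)[0]              # confounded regression slope
# Wald/IV estimator: reduced-form slope divided by first-stage slope
iv = np.polyfit(G, Y, 1)[0] / np.polyfit(G, X, 1)[0]
```

The confounded slope overshoots the true effect of 0.4, while the Wald ratio recovers it, mirroring how natural randomization breaks the link with non-genetic confounders.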

When none of the above arguments can be brought to bear and the basis for the statement is that a statistical association between the mutation and the disease has been found, then a statement about causality can be made only in a speculative sense. This can be useful as it can help determine what types of studies or experiments are needed in order to make more rigorous causal inference. However, more evidence is needed before serious policy decisions can be made.

DAGs in gene-environment interactions

Another issue in epidemiology and causality is that of gene-environment interactions (GEI). DAGs can again offer a visual and also more rigorous description of GEIs. We look at Ottman's (9) models for GEI and cast them as conditional independences and corresponding DAGs.


First some notation and assumptions:

1. Gene: G = 1 mutation, G = 0 normal.

2. Exposure: W = 1 exposed, W = 0 unexposed.

3. Disease: Y = 1 presence, Y = 0 absence.

4. Assume that there are no confounders between W and Y.

5. E(Y | W = w, G = g) denotes the expected disease status when W = w and G = g. Although we only consider a binary response here, this can be extended to a continuous disease measure.

In the DAGs below, a double box means a deterministic node, and a hollow arrowhead means a deterministic relationship. Note that a deterministic node is like a functional relationship: it has no associated uncertainty. However, it is not like a decision node (with a single box), as it depends not on an external decision one could make, but on the interaction we are considering.

Ottman Models

F is written explicitly only where it is necessary in the exposition below. When it is not included, the expectation is conditional on F = ∅.

Model A

The genotype produces an exposure which can also be produced non-genetically. Ottman notes that this is not strictly an interaction model, so the DAG is:

Fig. 2.5. Ottman Model A. F is the intervention, W is the environmental exposure, G is the genotype and Y is the health outcome of interest.

Conditional independences are:

— Y ⊥ F | (G, W),

— G ⊥ F, and

— E(Y = 1 | G = 1, W = 1, F = ∅) − E(Y = 1 | G = 1, W = 0, F = ∅) > 0,

— E(Y = 1 | G = 0, W = 1, F = ∅) − E(Y = 1 | G = 0, W = 0, F = ∅) > 0,

that is, in the observational regime the exposure raises the probability of disease both when the genotype is present and when it is absent,

— E(Y = 1 | G = 1, F = ∅) − E(Y = 1 | G = 1, F = 0) > 0.


If we intervene and set the baseline condition (F = 0), then a subject with the genotype is less likely to get the disease than if she has the genotype and we do not intervene.

Model B

The genotype exacerbates the effect of the exposure, but there is no effect of the genotype on the unexposed. We introduce IW as the interaction indicator for W. It is a deterministic function of W, and we can see it as a "switch" for the influence of G on the W–Y relationship. Thus, when W = 1, IW is "on" (1) and allows G to interact with W and alter the effect on Y. When W = 0, IW is "off" (0) and blocks G's effect on Y. IW is a non-random node and takes on the same values as W; it has a W subscript to indicate that it depends only on W. The arrow between Y and W says that there is an association between Y and W irrespective of G.

Conditional independences:

— Y ⊥ G | IW = 0,

— Y ⊥̸ G | IW = 1,

— Y ⊥ F | (W, G),

— G ⊥ W.

Fig. 2.6. Ottman Model B.

The first conditional independence says that G does not affect Y when IW = 0 (which by definition means W = 0). The second (dependence) says that Y is affected by G when IW = 1. The final two are as before. So:

— E(Y = 1 | W = 1) − E(Y = 1 | W = 0) > 0 means that the exposure W has a positive effect.
— E(Y = 1 | G = 1, W = 0, IW = 0) − E(Y = 1 | G = 0, W = 0, IW = 0) = 0 means that there is no effect of G for W = 0.

And finally E(Y = 1 | G = 1, W = 1, IW = 1) − E(Y = 1 | G = 0, W = 1, IW = 1) > 0 means that G has a positive effect when W = 1.
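The switch behaviour of Model B can be illustrated with a small simulation (a sketch with invented risk values, not part of the original model specification): G and W are drawn independently, IW copies W deterministically, and G raises the risk only through the term gated by IW.

```python
import random

random.seed(0)

def simulate_model_b(n=200_000):
    """Simulate Ottman Model B: G affects Y only when the switch I_W = W is on.

    The risk values (0.05 baseline, +0.10 for W, +0.20 for G when W = 1)
    are invented for illustration.
    """
    counts = {}  # (g, w) -> (number of cases, number of subjects)
    for _ in range(n):
        g = random.random() < 0.5      # genotype, drawn independently of W (G ⊥ W)
        w = random.random() < 0.5      # environmental exposure
        i_w = int(w)                   # deterministic switch I_W = W
        p = 0.05 + 0.10 * w + 0.20 * (g and i_w)
        y = random.random() < p
        cases, total = counts.get((int(g), int(w)), (0, 0))
        counts[(int(g), int(w))] = (cases + y, total + 1)
    return {k: c / t for k, (c, t) in counts.items()}

risk = simulate_model_b()
print(round(risk[1, 0] - risk[0, 0], 2))   # effect of G when W = 0: ≈ 0.00
print(round(risk[1, 1] - risk[0, 1], 2))   # effect of G when W = 1: ≈ 0.20
```

The empirical risk differences reproduce the two displayed conditions: no effect of G among the unexposed, a clear effect among the exposed.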

Model C

In this scenario, the exposure exacerbates the effect of the genotype but has no effect on persons with the low-risk genotype. We introduce IG, which behaves in a similar way to IW above, by regulating G's interaction with W. The arrow between Y and G says that there is an association between Y and G irrespective of W.



Conditional independences:
— Y ⊥ W | IG = 0,
— Y ⊥̸ W | IG = 1,
— Y ⊥ F | (W, G),
— G ⊥ W.

The first conditional independence says that W does not affect Y when IG = 0; the second (dependence) says that Y is affected by W when IG = 1. The final two are as usual. So:
— E(Y = 1 | G = 1) − E(Y = 1 | G = 0) > 0 means that G has a positive effect.
— E(Y = 1 | G = 0, W = 1, IG = 0) − E(Y = 1 | G = 0, W = 0, IG = 0) = 0 means W has no effect when G = 0.

And finally E(Y = 1 | G = 1, W = 1, IG = 1) − E(Y = 1 | G = 1, W = 0, IG = 1) > 0 means there is a positive effect of W for G = 1.

Fig. 2.7. Ottman Model C.

Model D

Both exposure and genotype are required to increase risk.

Fig. 2.8. Ottman Model D.

Here I acts as a "switch" for both W and G and is defined as follows: I = "on" (1) if and only if W = G = 1; otherwise I = "off" (0).

Conditional independences:
— Y ⊥ (G, W) | I = 0,
— Y ⊥̸ (G, W) | I = 1,
— Y ⊥ F | (W, G, I = 1),
— G ⊥ W.



The first conditional independence says that Y does not depend on G or W when I = 0; the second says that it depends on both if I = 1. The third follows from the definition of I: Y can depend on F only when it can depend on W, so this conditional independence is the same as in the other cases, except that it only makes sense when I = 1.

— E(Y = 1 | G = 1, W = 1, I = 1) − E(Y = 1 | G = g, W = w, I = 0) > 0 together with

— E(Y = 1 | G = g, W = w, I = 0) − E(Y = 1 | G = 0, W = 0, I = 0) = 0 for all (g, w) ≠ (1, 1) means there is a positive effect only if both G and W are 1.
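Model D's switch can be sketched directly (a minimal sketch; the baseline and excess risks are invented for illustration): the deterministic node I is the logical AND of G and W, so all three (g, w) combinations with I = 0 share the same risk.

```python
def model_d_risk(g: int, w: int, baseline: float = 0.05, excess: float = 0.30) -> float:
    """Ottman Model D: risk rises above baseline only when the switch I = 1.

    I is the deterministic AND of G and W; the baseline and excess risk
    values are invented for illustration.
    """
    i = int(g == 1 and w == 1)   # I = "on" (1) if and only if W = G = 1
    return baseline + excess * i

risks = {(g, w): model_d_risk(g, w) for g in (0, 1) for w in (0, 1)}

# Y ⊥ (G, W) | I = 0: the three combinations with I = 0 share one risk.
for gw in sorted(risks):
    print(gw, round(risks[gw], 2))
```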

Model E

Fig. 2.9. Ottman Model E. It is not possible, without further information, to determine which of the two models represents the data situation under consideration.

Exposure and genotype both have some effect but together they have a worse (better) effect.

The right-hand DAG represents a different situation from the left-hand DAG; however, the two cannot be separated unless we can somehow see the effect of the interaction in a variable other than Y itself, for instance in another response. It is possible, for instance, that G affects one aspect of the disease and W another, and that together they exacerbate the disease. However, it is also possible that G affects the same aspect as W and that they interact to exacerbate it. The former situation is represented by the right-hand DAG, where G and W both affect the response but in a sense "separately", while the left-hand DAG represents the latter situation, where G and W interact as they jointly affect the disease.

In the DAG on the left-hand side, I is a deterministic function of both W and G and is defined as follows: I = 1 if and only if W = G = 1; otherwise I = 0. Here there are also arrows between W and Y and between G and Y; these indicate that there are effects of W on Y irrespective of G and vice versa. I can increase the effect only when I = 1.

— E(Y = 1 | G = 1, W = 1, I = 1) − E(Y = 1 | G = g, W = w, I = 0) = agw > 0 for g or w or both equal to 0.

— E(Y = 1 | G = 1, W = 0, I = 0) − E(Y = 1 | G = 0, W = 0, I = 0) = b > 0.
— E(Y = 1 | G = 0, W = 1, I = 0) − E(Y = 1 | G = 0, W = 0, I = 0) = c > 0.


And agw > b, c for all g, w ∈ {0, 1} not both 1. So the first line says that if both W and G are 1, then the effect is greater than for any combination of W and G not both 1. The second and third lines say that G and W each have an effect when the other is fixed at 0.

Comments

The DAGs above are not necessarily causal. Causal assumptions could be introduced via interventions and possibly Mendelian randomization. More work needs to be done in investigating how these structures can be found from data and when causal inference can be made.

Conclusions

We have shown that DAGs, as representations of conditional independence statements, can be useful in describing the associations between genes, and between genes and exposures. They can be viewed as representing "interventional" causal relationships when plausible assumptions can be brought to bear, and they can be used to make speculative causal statements to guide further research.

References

1. Pearl J. Causality: Models, Reasoning, and Inference. Cambridge: Cambridge University Press; 2000.
2. Dawid AP. Influence diagrams for causal modelling and inference. Int Statist Rev 2002;70:161–89.

3. Shafer G. The Art of Causal Conjecture. Massachusetts: MIT Press; 1996.

4. Rubin D. Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies. J Educat Psychol 1974;66(5):688–701.

5. Geneletti SG. Aspects of Causal Inference without Counterfactuals [PhD thesis]. University of London; 2005.

6. Dawid AP. Causal inference without counterfactuals. With discussion. J Am Statist Assoc 2000;95:407–48.

7. Heckerman D. A Bayesian approach to learning causal networks. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. Montreal, Quebec; 1995. p. 285–95.
8. Didelez V, Sheehan N. Mendelian randomisation: why epidemiology needs a formal language for causality. In: Russo F, Williamson J, editors. Causality and Probability in the Sciences. London: College Publications; 2007.

9. Ottman R. An epidemiologic approach to gene-environment interaction. Genet Epidemiol 1990;7(3):177-85.


2.3. A methodology to integrate individual data with ecologic data in molecular epidemiology

Cosetta Minelli1, Paolo Vineis2, Emanuela Taioli3, John Thompson4
1 Department of Health Sciences, University of Leicester, United Kingdom
2 Imperial College, London, United Kingdom
3 University of Pittsburgh, PA, United States of America
4 Department of Health Sciences, University of Leicester, United Kingdom

Summary

Our aim is to investigate whether the statistical power to detect gene-gene interactions in studies which measure the effects of two (or more) genes on a certain outcome can be improved by incorporating partial information available from studies which only provide data on the gene-disease association for one gene, with the addition of ecological data. The ecological data provide information on the prevalence of the unmeasured genotype in the studies reporting on only one gene-disease association.

The methodological approach, based on the use of Bayesian hierarchical models where the ecological information is modelled using informative prior distributions, represents a development of methods described in geographical epidemiology and social sciences for combining individual data with ecological data (1–2).

The approach will be illustrated by applying it to the evaluation of a possible inter-action between the GSTT1 gene and the GSTM1 gene on the risk of lung cancer.

Background

The evaluation of gene-gene and gene-environment interactions is problematic, due to the general lack of power of tests of interaction, not only in primary studies (3) but also in large meta-analyses (4). To test for interactions, it has been suggested that the study size should be at least four times larger than that needed to detect only the main effects (5). Extremely large, and often unrealistic, sample sizes are required in order to provide conclusive evidence of a gene-gene or gene-environment interaction, with the result that few interactions have been replicated (6–7). Therefore, in the evaluation of the presence and magnitude of such interactions, there is a need for efficient use of all evidence available.

The use of ecological inference

Aggregate-level data, defined at a group level and collected by studies known as ecological, are sometimes used in health care research with the aim of making inferences about the individuals within the groups or areas. This type of inference, referred to as ecological inference, suffers from a fundamental problem known as the ecological fallacy, which results from the fact that the association observed between the exposure


(risk factor/treatment) and the outcome at the aggregate level might not reflect the association at the individual level (8). The presence of ecological bias is theoretically certain whenever the relationship between exposure and outcome is not linear and there is within-group variability in the exposure. In practice, these conditions, and in particular the latter, will often be true to some extent. Moreover, ecological studies present problems of confounding similar to those of studies on individual data, although here, in addition to confounding due to factors that vary between individuals, there might be confounding due to factors that vary between areas. Nonetheless, ecological inference is often used in geographical epidemiology and social sciences, where data collected at the area level are readily available.

Although ecological inference alone cannot provide strong evidence about an association, it has been proposed that the use of aggregated data supplemented by individual data might represent the solution to the problem of ecological fallacy (9). In fact, the individual data provide evidence on the link between exposure and outcome at the individual level, which can inform the inference from the aggregated data, and statistical frameworks have been developed to accommodate this approach (9–11). Wakefield showed how even small samples of individual data collected within the areas, if carefully chosen, might lead to dramatic improvements in inference (1).

Another way of looking at the use of ecological inference in combination with individual data is that this approach might aim at increasing the statistical power of studies conducted at the individual level. An example of this has been recently presented by Jackson and colleagues in the field of environmental epidemiology, where the problem is often that of the studies which are underpowered to detect the effect of interest (2). Our aim is to adapt this approach to the evaluation of gene-gene interactions.

An example of gene-gene interaction

The GSTM1 and the GSTT1 genes encode enzymes which play a central role in the detoxification of major classes of tobacco carcinogens, and deletion of these genes is thought to be associated with susceptibility to lung cancer, especially in relation to tobacco smoking. We will evaluate the possible interaction between these two genes using data from the Genetic Susceptibility to Environmental Carcinogens (GSEC) database, which is a collection of datasets available on a number of genetic polymorphisms and the risk of cancer (12,13). The studies included in this database comprise those reporting on the risk of lung cancer associated with both genes, and those reporting on only one gene-disease association. For the latter studies, ecological data on prevalence of the mutant genotypes for the unmeasured genes can be derived from external information on the genotype distributions in the populations from which the study samples are drawn.

In the methods section, for simplicity, we will refer to the GSTT1 and GSTM1 genes as "gene1" and "gene2", respectively, and to lung cancer as "disease". Genotypes characterised by deletion of the genes, which represent the mutant genotypes, will be contrasted with genotypes characterised by presence of the genes, which represent the wild-type genotypes. The analyses of GSEC data will consist of pooled analyses of the original data, which are equivalent to meta-analyses on individual patient data.

Methods

Complete and partial evidence on the gene-gene interaction

The data from datasets which report direct evidence on the interaction between gene1 and gene2 can be presented in a 4×2 table as shown in Table 2.1., while studies measuring only gene1-disease or gene2-disease association can be presented in collapsed 2×2 tables (Table 2.2. and Table 2.3., respectively). The approach proposed in the following sections is based on reconstructing the 4×2 table from each of the two 2×2 tables, by using ecological data for the missing information (i.e. information on prevalence of gene2 deletion in the gene1-disease table, and information on prevalence of gene1 deletion in the gene2-disease table).

Table 2.1. 4×2 table: gene1-gene2-disease

                Gene1:     P            D
                Gene2:   P    D      P    D
Disease   No             a    b      e    f
          Yes            c    d      g    h

P = gene present; D = gene deleted.

Table 2.2. 2×2 table: gene1-disease

                Gene1:    P       D
Disease   No            a+b     e+f
          Yes           c+d     g+h

Abbreviations as in Table 2.1.

Table 2.3. 2×2 table: gene2-disease

                Gene2:    P       D
Disease   No            a+e     b+f
          Yes           c+g     d+h

Abbreviations as in Table 2.1.
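The collapsing from Table 2.1. to Tables 2.2. and 2.3. amounts to summing counts over the unmeasured gene; a small sketch with invented cell counts:

```python
# Collapse the full gene1-gene2-disease table (Table 2.1.) into the marginal
# 2x2 tables of Tables 2.2. and 2.3. Cell labels follow the text: controls
# (a, b, e, f) and cases (c, d, g, h), columns ordered (gene1 P, gene2 P),
# (P, D), (D, P), (D, D). The counts themselves are invented for illustration.
a, b, e, f = 40, 25, 30, 15   # disease = No
c, d, g, h = 20, 30, 25, 35   # disease = Yes

# Table 2.2. (gene1-disease): sum over gene2 status.
gene1_table = {"No": (a + b, e + f), "Yes": (c + d, g + h)}   # columns: gene1 P, D

# Table 2.3. (gene2-disease): sum over gene1 status.
gene2_table = {"No": (a + e, b + f), "Yes": (c + g, d + h)}   # columns: gene2 P, D

print(gene1_table)   # {'No': (65, 45), 'Yes': (50, 60)}
print(gene2_table)   # {'No': (70, 40), 'Yes': (45, 65)}
```

The proposed method runs this collapse in reverse, using ecological prevalence data to reconstruct the missing dimension.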


The meta-analytical approach

The meta-analysis which integrates complete and partial evidence on the gene-gene interaction is performed using a Bayesian hierarchical model, where the three odds ratios, ORGene1, ORGene2 and ORInteraction, are estimated using a logistic regression model with binomial distributions. The baseline risk of disease is modelled using random effects, while the three odds ratios are modelled using fixed effects.

For the studies which provide complete information on the gene1-gene2-disease associations (Table 2.1.), the logistic model of disease risk for the i-th study is:

Logit(Pi) = αi + β1Xgene1 + β2Xgene2 + β3Xgene1Xgene2 [2.1.]

where

αi — the baseline risk of disease, which is modelled as a random effect αi ~ N(mu.a, sd.a);
β1 — the log odds ratio for gene1, i.e. deletion vs. presence of the gene, in subjects where gene2 is present (logORGene1);
β2 — the log odds ratio for gene2 in subjects where gene1 is present (logORGene2);
β3 — the log odds ratio for the gene1-gene2 interaction, i.e. subjects with deletion of both genes vs. subjects with presence of both genes (logORInteraction).
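Under model [2.1.], these contrasts are differences of linear predictors on the logit scale; a sketch with invented coefficient values (note that the both-deleted vs. both-present contrast works out to β1 + β2 + β3, with β3 the departure from additivity of the two log odds ratios):

```python
import math

def linpred(alpha, b1, b2, b3, x1, x2):
    """Linear predictor of model [2.1.]: alpha + b1*x1 + b2*x2 + b3*x1*x2,
    with x1 = 1 for deletion of gene1 and x2 = 1 for deletion of gene2."""
    return alpha + b1 * x1 + b2 * x2 + b3 * x1 * x2

# Illustrative coefficients, not estimates from the GSEC data.
alpha, b1, b2, b3 = -2.0, 0.3, 0.2, 0.4

# OR for gene1 deletion when gene2 is present (x2 = 0): exp(b1).
or_gene1 = math.exp(linpred(alpha, b1, b2, b3, 1, 0) - linpred(alpha, b1, b2, b3, 0, 0))

# Both genes deleted vs. both present: exp(b1 + b2 + b3).
or_both = math.exp(linpred(alpha, b1, b2, b3, 1, 1) - linpred(alpha, b1, b2, b3, 0, 0))

print(round(or_gene1, 3))   # exp(0.3) ≈ 1.35
print(round(or_both, 3))    # exp(0.9) ≈ 2.46
```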

Studies that only provide information on the gene1-disease association (Table 2.2.) can contribute to the estimation of the three odds ratios with the addition of ecological data on the probability of carrying a deletion of gene2. For example, the probability of disease for individuals with the gene1-present genotype, P(D | p_gene1), is calculated as a mixture of the probabilities of disease in these subjects when carrying and when not carrying a deletion of gene2:

P(D | p_gene1) = P(D | d_gene2, p_gene1) P(d_gene2 | p_gene1) +
+ P(D | p_gene2, p_gene1) (1 − P(d_gene2 | p_gene1)) [2.2.]

where P(D | d_gene2, p_gene1) and P(D | p_gene2, p_gene1) are estimated in the logistic model. The probability P(d_gene2 | p_gene1) comes from ecological data and is modelled using an informative beta prior distribution for each study.

P(D | p_gene1) is thus the sum of two binomial distributions and can be approximated by a binomial distribution, although other approximations might be used (1). In the same way, the probability of disease for subjects with deletion of gene1, P(D | d_gene1), is:

P(D | d_gene1) = P(D | d_gene2, d_gene1) P(d_gene2 | d_gene1) +
+ P(D | p_gene2, d_gene1) (1 − P(d_gene2 | d_gene1)) [2.3.]

In general, it is important to note that the two conditional probabilities P(d_gene2 | p_gene1) and P(d_gene2 | d_gene1) should be modelled using external information on the degree of linkage disequilibrium (LD) between the two genes of interest, whenever such information is available. However, in the absence of LD the two genotypes can be assumed to be independent, so that:

P(d_gene2 | p_gene1) = P(d_gene2 | d_gene1) = P(d_gene2) [2.4.]

Studies that only provide information on gene2-disease association (Table 2.3.) can contribute to the estimation of the three odds ratios with the addition of ecological data on what is the probability of carrying a deletion of gene1. The equations are similar to those above.
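Equation [2.2.] is a two-component mixture; a minimal numeric sketch (all probabilities invented for illustration):

```python
# P(D | p_gene1) mixes the disease risk over the unobserved gene2 status,
# weighting by the ecological prevalence of the gene2 deletion.
p_d_given_d2 = 0.20   # P(D | d_gene2, p_gene1), estimated in the logistic model
p_d_given_p2 = 0.10   # P(D | p_gene2, p_gene1), estimated in the logistic model
prev_d2 = 0.45        # P(d_gene2 | p_gene1): ecological data (beta prior in the full model)

p_d = p_d_given_d2 * prev_d2 + p_d_given_p2 * (1 - prev_d2)
print(round(p_d, 3))   # 0.145
```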

For all models, diffuse prior distributions will be used, in particular diffuse normal distributions for mean parameters and diffuse uniform distributions on the standard deviation for scale parameters.

All meta-analytical models will be developed within a Bayesian framework using MCMC methods as implemented by WinBUGS software 1.4.1 (14), and convergence will be assessed graphically by running three parallel chains and using Brooks-Gelman-Rubin plots (15).
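Convergence monitoring across the parallel chains can be sketched with the basic Gelman-Rubin statistic (a simplified numeric version of what the Brooks-Gelman-Rubin plots display; the chains below are simulated draws, not WinBUGS output):

```python
import random
import statistics

def gelman_rubin(chains):
    """Basic potential scale reduction factor (R-hat) for parallel MCMC chains:
    compares between-chain and within-chain variance; values near 1 suggest
    convergence. A simplified form of the Brooks-Gelman-Rubin diagnostic."""
    n = len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    w = statistics.fmean(statistics.variance(c) for c in chains)  # within-chain variance
    b = n * statistics.variance(means)                            # between-chain variance
    var_hat = (n - 1) / n * w + b / n                             # pooled variance estimate
    return (var_hat / w) ** 0.5

random.seed(1)
# Three well-mixed chains drawn from the same distribution: R-hat close to 1.
chains = [[random.gauss(0, 1) for _ in range(5000)] for _ in range(3)]
print(round(gelman_rubin(chains), 2))   # ≈ 1.0
```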

References

1. Wakefield J. Ecological inference for 2×2 tables. J Royal Stat Soc A 2004;167(3):385–445.
2. Jackson C, Best N, Richardson S. Improving ecological inference using individual-level data. Stat Med 2006;25:2136–59.

3. Brookes ST, Whitely E, Egger M, Smith GD, Mulheran PA, Peters TJ. Subgroup analyses in randomized trials: risks of subgroup-specific analyses; power and sample size for the interaction test. J Clin Epidemiol 2004;57:229–36.

4. Altman DG, Bland JM. Interaction revisited: the difference between two estimates. Br Med J Clin Res Ed 2003;326:219.

5. Smith PG, Day NE. The design of case-control studies: the influence of confounding and interaction effects. Int J Epidemiol 1984;13:356–65.

6. Gauderman WJ. Sample size requirements for matched case-control studies of gene environment interaction. Stat Med 2002;21:35–50.

7. Wang S, Zhao H. Sample size needed to detect gene-gene interactions using association designs. Am J Epidemiol 2003;158:899–914.

8. Greenland S, Robins J. Invited Commentary: Ecologic studies — biases, misconceptions, and counterexamples. Am J Epidemiol 1994;139:747–60.

9. Wakefield J, Salway R. A statistical framework for ecological and aggregate studies. J Royal Stat Soc A 2001;164(1):119–37.

10. Prentice RL, Sheppard L. Aggregate data studies of disease risk factors. Biometrika 1995;82:113–25.


11. Best N, Cockings S, Bennett J, Wakefield J, Elliott P. Ecological regression analysis of environmental benzene exposure and childhood leukaemia: sensitivity to data inaccuracies, geographical scale and ecological bias. J Royal Stat Soc A 2001;164(1):155–74.

12. Taioli E. International collaborative study on genetic susceptibility to environmental carcinogens. Cancer Epidemiol Biomarkers Prev 1999;8(8):727–8.

13. Garte S, Gaspari L, Alexandrie AK, et al. Metabolic gene polymorphism frequencies in control populations. Cancer Epidemiol Biomarkers Prev 2001;10(12):1239–48.

14. Spiegelhalter DJ, Thomas A, Best NG, Lunn D. WinBUGS User Manual. Version 1.4.1. Cambridge: MRC Biostatistics Unit;2004.

15. Brooks SP, Gelman A. Alternative methods for monitoring convergence of iterative simulations. J Computat Graph Stat 1998;7:434–55.


2.4. Semi-Bayes for balancing false-positive and false-negative findings in molecular epidemiology studies

Ulf Stromberg

Department of Occupational and Environmental Medicine, Lund, Sweden

Background

Epidemiologic studies on direct genetic associations increasingly attract attention. Arguments advocating estimation of effect size rather than hypothesis testing (P-value) have been convincingly put forward (1,2). Nevertheless, hypothesis testing is primarily considered for evaluating the effects of each candidate genetic marker, for example, single nucleotide polymorphisms (SNPs) in genome-wide association studies. In molecular epidemiology, the growing feasibility of testing many thousands of associations in genome-wide studies has forced investigators to address the problem of multiple testing (3). To help investigators protect themselves from over-interpreting statistically significant findings that are not likely to signify a true effect (a problem connected to multiple comparisons), consideration of the false-positive report probability (FPRP) has been proposed (4). The FPRP approach simplifies the analysis by assuming a simple binary choice between the null hypothesis of no association and the hypothesis of an association of an assumed effect size. A crucial problem when calculating the FPRP is how best to assess the prior probability of no association (5). Another concern with the FPRP approach is that the data provided by a study are reduced to a single, binary observation of whether a statistically significant association was found (3).
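For reference, the FPRP of the test-based approach (4) combines the significance level, the power to detect the assumed effect size, and the prior probability of a true association; a sketch with illustrative inputs:

```python
def fprp(alpha: float, power: float, prior: float) -> float:
    """False-positive report probability: the probability that a statistically
    significant finding is a false positive, given the test size alpha, the
    power against the assumed effect size, and the prior probability of a
    true association."""
    return alpha * (1 - prior) / (alpha * (1 - prior) + power * prior)

# With a very low prior, as for an arbitrary SNP, even a "significant"
# result is most likely a false positive.
print(round(fprp(0.05, 0.8, 0.001), 2))   # 0.98
# With a 50% prior, as for a well-supported candidate, the FPRP is small.
print(round(fprp(0.05, 0.8, 0.5), 2))     # 0.06
```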

Here, an estimation-based semi-Bayes approach (6) is outlined. The approach offers an attractive alternative to the test-based FPRP approach, when the task is to carry out an initial selection of promising genetic markers for further analyses and studies.

Methods

The approach relies on the assumption that the varying true effects of the candidate markers cannot be graded a priori. First, the effect of each candidate marker, together with its 95% confidence interval (CI), is estimated using a conventional logistic regression model incorporating a single candidate marker and, possibly, some other influential variables. Second, a semi-Bayes method, adjusting the conventional effect estimates, is used. Such adjustments pull outlying effect estimates towards an overall average of the effect estimates and lead to narrower 95% CIs than the conventional estimation method. Third, for each candidate marker, a probability of marked effect is calculated (7); that is, of an effect above or below given effect sizes considered to have prioritized implications. In Bayesian analyses, by contrast to frequentist analyses, the concept of probability of marked effect is accepted (8). Those probabilities can be used to select the most influential genetic markers, since a high


probability indicates an influential marker. Investigators aim at selecting a limited number of promising candidate markers — the top candidates. The proposed estimation-based approach can yield a notably different list of top candidates than the FPRP approach, as we have demonstrated by using empirical data (6).

In the proposed Bayesian approach, the basic idea is that an over- or underestimation of the effect of a candidate marker is suspected to occur when the conventional point estimate of the log odds ratio (OR) is far away from the other estimates and unstable (that is, has a large standard error, yielding a wide 95% CI). Consequently, the variance of the distribution of log OR estimates across the candidate markers (V) is likely to overestimate the variance of the true individual log ORs (VT). In an empirical-Bayes approach, the goal is to estimate VT by "correcting" the overestimation of V, which is approximately done by subtracting the mean of the variances of the individual log OR estimates (VM) from V, that is, VT ≅ V − VM (9,10). The two quantities V and VM need to be estimated by an iterative computational procedure (9,11). However, when the proportion of true (marked) effects is low, the empirical-Bayes approach can produce an estimate of VT equal to zero, implying extreme stabilization of the individual effect estimates around an overall average effect and therefore results that are not useful. Therefore, a semi-Bayes approach is used, assuming VT > 0. A specified value of VT can be viewed as a lower limit on the variability of the log ORs and should not exceed the estimate of V. Given VT, the semi-Bayes adjusted OR estimates, with 95% CIs, can be computed for the candidate genetic markers (9).
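The adjustment step can be sketched as a normal-normal shrinkage (a simplified closed-form version in the spirit of reference 9; VT, the overall mean and the two estimates below are illustrative assumptions, not the iterative procedure described above):

```python
import math

def semi_bayes_adjust(log_or, se, vt, prior_mean=0.0):
    """Semi-Bayes adjustment of a single log OR estimate (a sketch).

    The estimate is pulled towards the overall mean with shrinkage factor
    se^2 / (se^2 + VT), and its variance shrinks accordingly, giving a
    narrower 95% CI than the conventional one.
    """
    shrink = se**2 / (se**2 + vt)
    adj_log_or = shrink * prior_mean + (1 - shrink) * log_or
    adj_var = (1 - shrink) * se**2          # = (1/VT + 1/se^2)^(-1)
    half_width = 1.96 * math.sqrt(adj_var)
    return adj_log_or, (adj_log_or - half_width, adj_log_or + half_width)

# An outlying, unstable estimate (log OR = 1.0, SE = 0.5) is pulled strongly
# towards the overall mean, while a precise one (SE = 0.1) barely moves.
overall_mean = 0.1
vt = 0.05   # assumed lower limit on the variability of the true log ORs
unstable, _ = semi_bayes_adjust(1.0, 0.5, vt, overall_mean)
precise, _ = semi_bayes_adjust(1.0, 0.1, vt, overall_mean)
print(round(unstable, 3), round(precise, 3))   # 0.25 0.85
```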

Conclusions

In the modern era of genomics, a crucial problem is how to select the most influential genetic markers for further analyses and studies. The proposed estimation-based approach offers an attractive alternative to the FPRP approach based on hypothesis testing. The estimation-based approach addresses the problem of false-positives: a statistically significant candidate marker may yield a relatively low probability of marked effect. At the same time, it addresses the problem of potential false-negatives: a statistically non-significant candidate marker may yield a relatively high probability of marked effect and, therefore, be selected for further investigations.

Our proposed approach is suited for performing an initial selection of promising genetic markers among a large number of candidates with limited or no prior functional information. More advanced empirical- and semi-Bayes approaches, taking into account possible correlation between the effect estimates, can be developed for further multivariate analyses (10). Only a limited number of markers can then be considered, due to statistical power reasons. The initially estimated effects for the selected genetic markers can be weakened or strengthened by further multivariate modelling, where gene-gene and gene-gene-environment interactions may be addressed.


References

1. Rothman KJ. Significance questing. Ann Intern Med 1986;105:445–7.

2. Walter SD. Methods for reporting statistical results from medical research studies. Am J Epidemiol 1995;141:896–906.

3. Thomas DC, Clayton DG. Betting odds and genetic associations. J Natl Cancer Inst 2004;96:421–3.

4. Wacholder S, Chanock S, Garcia-Closas M, El Ghormli L, Rothman N. Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. J Natl Cancer Inst 2004;96:434–41.

5. Matullo G, Berwick M, Vineis P. Gene-environment interactions: how many false positives? J Natl Cancer Inst 2005;97:550–1.

6. Stromberg U, Bjork J, Broberg K, Mertens F, Vineis P. Selection of most influential genetic markers among a large number of candidates based on effect estimation rather than hypothesis testing: An approach for genome-wide association studies. [Submitted].

7. Smith AH, Bates MN. Confidence limit analyses should replace power calculations in the interpretation of epidemiologic studies. Epidemiology 1992;3:449–52.

8. Clayton D, Hills M. Statistical models in epidemiology. Oxford (UK): Oxford University Press; 1993.

9. Greenland S, Poole C. Empirical-Bayes and semi-Bayes approaches to occupational and environmental hazard surveillance. Arch Environ Health 1994;49:9–16.

10. Hung RJ, Brennan P, Malaveille C, Porru S, Donato F, Boffetta P, et al. Using hierarchical modeling in genetic association studies with multiple markers: application to a case-control study of bladder cancer. Cancer Epidemiol Biomarkers Prev 2004;13:1013–21.

11. Steenland K, Bray I, Greenland S, Boffetta P. Empirical Bayes adjustments for multiple results in hypothesis-generating or surveillance studies. Cancer Epidemiol Biomarkers Prev 2000;9:895–903.


2.5. Strategic issues: genotype-phenotype correlations

Micheline Kirsch-Volders1, Raluca Mateuca1, Giuseppe Matullo2
1 Free University of Brussels, Belgium
2 Institute for Scientific Interchange (ISI) Foundation, Turin, Italy

General introduction

The mechanisms responsible for individual differences in response to mutagens/carcinogens are largely attributed to inherited variations in genes involved in metabolism, repair of DNA damage, regulation of apoptosis, control of the cell cycle and immune response. Due to these variations, some individuals exhibit an increased sensitivity compared to the general population. Therefore the identification of higher-risk individuals carrying genetic polymorphisms responsible for activation/detoxification of mutagens/carcinogens and for reduced DNA repair capacity has substantial preventive implications, as these individuals could be targeted for primary cancer prevention. The identification of susceptible individuals can be achieved by genotyping, phenotyping and assessment of functional enzyme activity.

In addition to genetic polymorphisms in metabolic and DNA repair genes, phenotypic markers have been widely used in recent years as a tool for elucidating the role of particular genes in human cancer (1–3). The major advantages and disadvantages of using phenotyping versus genotyping are summarized in Table 2.4.

Table 2.4. Advantages and disadvantages of phenotyping versus genotyping

Phenotyping
  Advantages:
  — Represents complex genotypes
  — Better predictive ability
  — Consistent results despite small sample size
  — Provides information on host's response to ongoing exposure
  — Can quantify response
  — Can study target tissue
  — Lower ethical considerations
  Disadvantages:
  — Relatively expensive
  — Less stable
  — Influenced by the environment
  — Reproducibility problems

Genotyping
  Advantages:
  — Polymorphism is a germ line trait (present in all tissues): a blood test faithfully represents the situation in other target organs
  — Simple, rapid, reliable, and high-throughput
  — More precise
  — Measure is stable
  Disadvantages:
  — Often a poor predictor of complex traits (e.g., efficiency of DNA repair) since: 1) protein expression, stability, activity; and 2) environmental and lifestyle factors are not taken into account
  — Weak association with disease
  — Low statistical power
  — Not modifiable for preventive trial
  — High ethical considerations


The phenotypic assays are based either on direct evidence of the mechanism underlying the phenotype or on indirect effects. As an example, the DNA repair phenotyping assays used to date can be broadly grouped into five categories: 1) indirect tests based on DNA damage induced by chemical or physical agents (e.g., the mutagen sensitivity assay, the G2-radiation assay, chromosome aberration or micronucleus assays, and the Comet assay); 2) indirect tests of DNA repair (e.g., unscheduled DNA synthesis); 3) tests based on more direct measures of repair kinetics (e.g., the host cell reactivation assay, repair of DNA adducts by immunoassay and the Comet-based DNA strand break repair assay); 4) direct assessment of functional activity of the DNA repair enzymes; and 5) combinations of more than one category of assays. The use of such tests in human populations has yielded positive and consistent associations between DNA repair capacity and cancer occurrence (with odds ratios in the range of 1.4–75.3, the majority of values lying between 2 and 10) (1). However, the studies have limitations, including small sample sizes, "convenience" controls, the use of a biomarker which can also be induced by non-DNA-damaging agents (the micronucleus), the use of cells different from the target organ and the use of mutagens that do not occur in the natural environment.

Genetic variants in candidate environmental disease genes

The US National Institute of Environmental Health Sciences (NIEHS) set up the Environmental Genome Project (EGP) in 1997 to investigate the role of common genetic polymorphisms (see box for genetic definitions) in environmentally induced disease. The project has focused on the discovery and annotation of single nucleotide polymorphisms (SNPs) in candidate environmental disease genes, with the creation of databases that integrate these data with annotated gene models. A total of 213 candidate genes with a possible role in environmental susceptibility to disease were analysed to determine whether functionally significant polymorphisms in the genes affected individuals' susceptibility to genotoxic environmental agents. These genes were selected because they were known to be involved in biological processes that are influenced by environmental exposures (such as gene expression, DNA repair, and metabolism). In many cases, loss-of-function mutations in these genes have previously been associated with serious disease states.

Genetic definitions

Allele: One of several forms of a gene; at the DNA sequence level it refers to one of several (usually, 2) nucleotide sequences at a particular position in the genome.

Genotype: The two specific alleles present in an individual; called a homozygote or heterozygote depending on whether the two alleles are identical or different.

Phenotype: The observable traits of an organism. The phenotype results from the combination of genetic and environmental factors.


Genetic definitions — cont.

Cellular phenotype: The observable traits of a cell. The phenotype results from the combination of genetic and environmental factors.

Polymorphism: The occurrence of multiple alleles at a specific site in the DNA sequence. Classically, a site has been called polymorphic if the rarer of the two alleles, called the minor allele, has a frequency above 1% in the population.

SNP (single nucleotide polymorphism): Polymorphism where multiple (usually, 2) bases (alleles) exist at a specific genomic sequence site within a population, such as A and G. In individuals, the possible combinations (genotypes) may be homozygous (AA or GG) or heterozygous (AG).

Heterozygosity: The frequency of heterozygotes in the population.

Haplotype: A combination of polymorphic alleles on a chromosome delineating a specific pattern that occurs in a population. The term is short for haploid genotype and has classically been used to describe the patterns of variation in a small segment of the genome where genetic recombination is rare, such as the HLA locus. However, when described as a haploid genotype it can refer to the specific arrangement of alleles along an entire chromosome observed in an individual, or in a specific region of a chromosome. For two SNPs with alleles A and G, and C and G, the possible haplotypes are AC, AG, GC and GG.

Linkage phase: The specific arrangement of alleles in the haplotypes. For an individual who is heterozygous at two SNPs, AG and CG (see above), the two haplotypes are either AC and GG, or AG and GC. These arrangements are referred to as the phases of the genotypes.
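The phase ambiguity described above can be made concrete with a short enumeration. The following Python sketch (the helper name is invented for illustration, not from the text) lists the possible haplotype pairs for a two-SNP genotype; note that only double heterozygotes are ambiguous:

```python
def possible_phases(genotype1, genotype2):
    """Enumerate the possible haplotype pairs (phases) for a two-SNP genotype.

    Each genotype is a 2-character string of alleles, e.g. "AG" and "CG".
    A double heterozygote yields two possible phases; any homozygous
    genotype at either SNP yields only one.
    """
    a1, a2 = genotype1
    b1, b2 = genotype2
    # The two ways of pairing the alleles across the two chromosomes.
    phase1 = (a1 + b1, a2 + b2)
    phase2 = (a1 + b2, a2 + b1)
    # Sort within each pair so that equivalent phases collapse to one.
    phases = {tuple(sorted(phase1)), tuple(sorted(phase2))}
    return sorted(phases)
```

For the AG/CG example in the text, this returns the two phases (AC, GG) and (AG, GC); for a genotype such as AA/CG it returns a single unambiguous phase.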

Linkage disequilibrium (LD): The statistical association between alleles at two or more sites (SNPs) along the genome in a population. Irrespective of the starting genetic composition of a population, over time the frequencies of the four possible haplotypes AC, AG, GC and GG are expected to become the numerical products of the constituent allele frequencies, that is, to reach an equilibrium state. Any departure from this state is called disequilibrium and is defined as D = P(AC)P(GG) - P(AG)P(GC) (using the above example), where P(.) refers to the frequency of that haplotype. LD is commonly measured by the statistic D', which is the absolute value of D divided by the maximum value that D could take given the allele frequencies; D' ranges between 0 (no LD) and 1 (complete LD). LD decays depending on the rate of recombination between the SNPs. Thus, the patterns of genomic recombination, and the occurrence of recombination hotspots and coldspots, affect the decay of LD and its local patterns. When two SNPs are in strong linkage disequilibrium, one or two of the four possible haplotypes may be missing. Another way of measuring LD is by the coefficient of determination between the two alleles of the two SNPs, a statistic called r². The value of r² (the square of the correlation coefficient) lies between 0 and 1, and its maximum possible value depends on the minor allele frequencies (MAFs) of the two SNPs. It has been used because its theoretical properties have been well studied and, most importantly, because it measures how well one SNP can act as a surrogate (proxy) for another.
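The D, D' and r² statistics can all be computed directly from the four haplotype frequencies. A minimal Python sketch (the function name and the standard normalisation used for D' are illustrative assumptions, not taken from the text):

```python
def ld_stats(h_ac, h_ag, h_gc, h_gg):
    """Compute D, D' and r^2 for two biallelic SNPs.

    SNP1 has alleles A/G and SNP2 has alleles C/G; the inputs are the
    four haplotype frequencies, which must sum to 1.
    """
    assert abs(h_ac + h_ag + h_gc + h_gg - 1.0) < 1e-9
    p_a = h_ac + h_ag   # frequency of allele A at SNP1
    p_g1 = 1.0 - p_a    # frequency of allele G at SNP1
    q_c = h_ac + h_gc   # frequency of allele C at SNP2
    q_g2 = 1.0 - q_c    # frequency of allele G at SNP2

    # Definition from the text: D = P(AC)P(GG) - P(AG)P(GC),
    # which is equivalent to h_ac - p_a * q_c.
    d = h_ac * h_gg - h_ag * h_gc

    # D' = |D| / Dmax, where Dmax depends on the sign of D
    # (the standard Lewontin normalisation, assumed here).
    if d >= 0:
        d_max = min(p_a * q_g2, p_g1 * q_c)
    else:
        d_max = min(p_a * q_c, p_g1 * q_g2)
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r^2: the squared correlation between the alleles at the two SNPs.
    r2 = d * d / (p_a * p_g1 * q_c * q_g2)
    return d, d_prime, r2
```

At equilibrium (each haplotype frequency equal to the product of its allele frequencies) all three statistics are zero; with only the AC and GG haplotypes present, D' and r² both equal 1, matching the "missing haplotypes" remark above.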

Tag SNPs (or tags): The set of SNPs selected for genotyping in a disease study. Given the considerable extent of LD in local genomic regions, the choice of these SNPs for genotyping in a disease association study is critical as long as the cost of genotyping remains substantial. The extensive correlation among neighbouring SNPs implies that not all of them need to be genotyped, since they provide (to some degree) redundant information. Tag SNP selection can be performed using a variety of methods, with the common goal of efficiently capturing the variation in the genomic region of interest.
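One simple strategy consistent with this goal is a greedy cover over pairwise r² values: repeatedly pick the SNP that captures the most not-yet-tagged SNPs at or above an r² threshold. The sketch below assumes that strategy for illustration; production tools such as Tagger (in Haploview) use more elaborate criteria, and the function name and threshold are assumptions, not from the text:

```python
def greedy_tag_snps(r2_matrix, threshold=0.8):
    """Greedy tag-SNP selection over a pairwise r^2 matrix.

    SNP i is considered to tag SNP j if r2_matrix[i][j] >= threshold.
    r2_matrix is a symmetric list of lists with 1.0 on the diagonal,
    so every SNP tags at least itself.
    """
    n = len(r2_matrix)
    untagged = set(range(n))
    tags = []
    while untagged:
        # Pick the SNP that would capture the most still-untagged SNPs.
        best = max(range(n),
                   key=lambda i: sum(1 for j in untagged
                                     if r2_matrix[i][j] >= threshold))
        covered = {j for j in untagged if r2_matrix[best][j] >= threshold}
        if not covered:   # safety stop; cannot occur with a unit diagonal
            break
        tags.append(best)
        untagged -= covered
    return tags
```

For three SNPs where the first two are in strong LD (r² = 0.9) and the third is nearly independent, the method selects one tag from the correlated pair plus the independent SNP, capturing the region's variation with two genotyped SNPs instead of three.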
