
Multi-scale network analyses for cancer omics data


Academic year: 2021


MULTI-SCALE NETWORK ANALYSES FOR CANCER OMICS DATA

Dissertation

for the purpose of obtaining the degree of doctor at Delft University of Technology (Technische Universiteit Delft),

by the authority of the Rector Magnificus, prof. ir. K. C. A. M. Luyben, chair of the Board for Doctorates,

to be defended publicly on Friday 30 October 2015 at 12:30

by

Sepideh BABAEI

Master of Science in Biomedical Engineering, born in Tehran, Iran.


promotor: Prof. dr. ir. M. J. T. Reinders and copromotor: Dr. ir. J. de Ridder

Composition of the doctoral committee: Rector Magnificus,

Prof. dr. ir. M. J. T. Reinders, promotor

Dr. ir. J. de Ridder, copromotor

Independent members:

Prof. dr. E. Marchiori, Radboud University

Prof. dr. M. Nykter, Tampere University of Technology

Prof. dr. L. F. A. Wessels, EWI, Delft University of Technology; Netherlands Cancer Institute

Prof. dr. B. van Steensel, Netherlands Cancer Institute; Erasmus University Medical Center

Prof. dr. J. Jonkers, Netherlands Cancer Institute; Leiden University

Prof. dr. R. van Ham, Delft University of Technology, substitute member

This work was supported by the BioRange programme of the Netherlands Bioinformatics Centre (NBIC).

Cover Image Credit: Ms. Annelies te Selle

Printed by: Proefschriftmaken.nl || Uitgeverij BOXPress

ISBN 978-94-6186-553-3

Copyright © 2015 by S. Babaei

An electronic version of this dissertation is available at


CONTENTS

1 Introduction
1.1 Cancer genomics
1.1.1 Classes of DNA alterations
1.1.2 Somatic and germline mutations in the cancer genome
1.1.3 Identifying driver mutations
1.1.4 Identifying cancer genes
1.2 Network-based data integration in cancer genomics
1.2.1 Omics data integration
1.2.2 Network-based data integration
1.3 Three-dimensional genome organization
1.3.1 3D genome conformation and cancer
1.4 Contribution of this thesis
1.4.1 Integration of omics data improves subtype classification
1.4.2 Recurrence of mutations should not only be defined in terms of the linear genome
1.4.3 Data integration requires scale-space analysis
1.4.4 Analysis of biological networks requires scale-space analysis
References

2 Integration of gene expression and DNA-methylation profiles improves molecular subtype classification in acute myeloid leukemia
2.1 Abstract
2.2 Introduction
2.3 Results
2.3.1 Gene expression profiles outperform DNA-methylation profiles
2.3.2 Early integration improves (or equals) predictive power
2.3.3 Late integration demonstrates best classification accuracy
2.4 Conclusions
2.5 Material and methods
2.5.1 AML dataset
2.5.2 Cytogenetical and molecular abnormalities in AML
2.5.3 Classification strategies
2.5.4 Classification accuracy and performance
2.6 Supporting information
References

3 Integrating protein family sequence similarities with gene expression to find signature gene networks in breast cancer metastasis
3.1 Abstract
3.2 Introduction
3.3 Material and methods
3.3.1 Data description
3.3.2 Protein-protein sequence similarity
3.3.3 Subnetworks construction
3.3.4 Signature concordance
3.3.5 Classification procedure
3.4 Results and discussion
3.4.1 Cross study prediction evaluation
3.4.2 Functionally coherent subnetworks
3.4.3 Consistent consensus genes across the datasets
3.5 Conclusion
References

4 Detecting recurrent gene mutation in interaction network context using multi-scale graph diffusion
4.1 Abstract
4.2 Background
4.3 Methods
4.3.1 Mutation data
4.3.2 Interaction graph construction
4.3.3 Diffusion kernel
4.3.4 Permutation procedure
4.3.5 Graph clustering
4.3.6 KEGG pathway enrichment
4.3.7 Mutual exclusivity analysis
4.4 Results and discussion
4.4.1 Identifying Genes Recurrently Mutated in Interaction Context (ReMIC)
4.4.2 ReMIC genes are organized in connected components
4.4.3 ReMIC gene clusters strongly enriched for cancer related pathways
4.4.4 ReMIC gene clusters exhibit a pattern of mutual exclusive mutation
4.4.5 ReMIC clusters include many non-CIS genes
4.4.6 ReMIC genes co-localize in leukemia pathways
4.5 Conclusions
4.6 Supporting information
References

5 3D hotspots of recurrent retroviral insertions reveal long-range interactions with cancer genes
5.1 Abstract

5.3 Results
5.3.1 Data
5.3.2 Insertion sites occur often in open chromatin compartments
5.3.3 Recurrently inserted bins are spatially co-localized
5.3.4 Detecting co-localized insertion clusters
5.3.5 Major CIS-genes are involved in CLICs
5.3.6 Insertions in CLICs exhibit a pattern of mutual exclusion
5.3.7 CLIC-loci coincide with cancer genes
5.3.8 CLIC-loci link distal insertions to known cancer genes
5.3.9 CLIC-loci interacting with cancer genes enrich for TFBSs
5.3.10 Insertions in CLIC-loci affect co-localized gene expression
5.4 Discussion
5.5 Methods
5.5.1 Binning of Hi-C and insertion data
5.5.2 Detecting ICs
5.5.3 Rank-based normalization
5.5.4 Detecting CLICs
5.5.5 Expression analysis
5.6 Supporting information
References

6 Hi-C chromatin interaction networks predict co-expression in the mouse cortex
6.1 Abstract
6.2 Author Summary
6.3 Introduction
6.4 Results
6.4.1 Intra-chromosomal Hi-C data
6.4.2 Multi-resolution Hi-C data
6.4.3 Chromatin interaction network (CIN)
6.4.4 CIN Topology
6.4.5 Co-expression network
6.4.6 Highly co-expressed genes are spatially co-localized
6.4.7 Chromatin interaction profiles as co-expression predictors
6.4.8 Topological descriptions of multi-resolution interaction networks increase the prediction performance
6.4.9 STMs improve the prediction performance
6.4.10 CIN topology differs per chromosome
6.4.11 Topological signatures of CINs
6.4.12 STMs effectively characterize the CIN of Chromosome 16 to predict co-expression
6.5 Discussion
6.6 Materials and Methods
6.6.1 Rank-based normalization of Hi-C contact matrices
6.6.2 Scale-aware topological measures

6.6.4 Supervised learning procedure
6.6.5 t-SNE map
6.7 Supporting information
References

7 Discussion
7.1 Network-based integration
7.1.1 Results of the thesis
7.1.2 Future outlook
7.2 Multi-scale analysis
7.2.1 Results of the thesis
7.2.2 Future outlook
7.3 Spatial organization of the genome
7.3.1 Results of the thesis
7.3.2 Future outlook
7.4 Closing Remarks
References

Summary
Samenvatting (Dutch summary)
Acknowledgment
Curriculum Vitæ
List of Publications

1. INTRODUCTION

Cancer is a complex disease characterized by uncontrolled division of cells. The recent accumulation of high-throughput data has provided a detailed description of the different cellular processes, including those at play in cancer cells. Combining omics measurement technologies with advanced computational methods promises to greatly improve our understanding of the initiation and progression of cancer. Despite these encouraging developments, survival rates for this disease remain low around the world, demonstrating that curing cancer is still very difficult.

We believe that a comprehensive understanding of the molecular mechanisms that underlie the development of cancer is the key to improving cancer diagnostics, prognostics, and targeted therapeutics. In this thesis we therefore contribute novel computational techniques to extract useful information from cancer omics data and demonstrate that these techniques enable novel insights into the molecular mechanisms of this complex disease.

1.1. CANCER GENOMICS

Cancer studies often focus on the starting point of this disease, the genome. The idea that cancer is a disease of the genome was proposed by Theodor Boveri in 1914 [1], even before the concept of genes had been developed. He suggested that uncontrolled growth of a tumor might be the consequence of an abnormal number of chromosomes (i.e. aneuploidy). In recent decades, sequencing technologies have enabled biologists to show that cancer is caused by alterations in genes that control the growth and division mechanisms of the cell [2].

1.1.1. CLASSES OF DNA ALTERATIONS

DNA sequence alterations can occur at different scales, ranging from large-scale chromosomal rearrangements to small-scale point mutations in genes. These alterations encompass various classes of mutations, including substitutions of single bases, insertions or deletions of DNA segments, rearrangements of DNA segments, and copy number variations that may result in amplification or deletion of a gene [3]. Another class of alterations is the insertion of a completely new DNA sequence into the host cell’s genome. A common source of such insertions is viruses such as the hepatitis B virus (frequently causing liver cancer [4]), the human papilloma virus (frequently causing cervical cancer [5]) and retroviruses such as the Rous sarcoma virus and the mouse mammary tumor virus [6]. For instance, infection with the retrovirus HTLV-I is associated with adult T-cell leukemia [7].

In addition to mutations that change the actual DNA code, there are epigenetic alterations. These changes, which include DNA (de)methylation or histone modifications, can also contribute to cancer by changing gene expression [8]. Finally, there is mounting evidence that the spatial organization of the genome in the cell nucleus has an important role in tumor initiation [9].

1.1.2. SOMATIC AND GERMLINE MUTATIONS IN THE CANCER GENOME

Cancer studies have focused on finding links between cancer development and genomic alterations. These alterations can either be transmitted through the germline or arise de novo as somatic mutations. Mutations can occur in any cell type. Mutations in a germ cell (egg or sperm, i.e. germline mutations) can be passed on to the next generation. A germline mutation can therefore be found in every cell of the descendant. Somatic mutations, in contrast, may arise in any cell in the body at any time during life. The census of cancer genes indicates that a large fraction of putative cancer genes (80%) harbor only somatic mutations, 10% harbor only germline mutations and 10% harbor both [3]. The association of germline mutations with clinical characteristics of cancer is mainly studied by genome-wide association studies (GWAS) [10]. Most cancer studies, however, analyze somatic mutations in cancer genomes. Important examples of such studies are The Cancer Genome Atlas (TCGA) [11] and the International Cancer Genome Consortium (ICGC) [12]. In the last two decades, sequencing of cancer genomes has identified more than 100 000 somatic mutations [3]. Over the next few years, tens of thousands more will undoubtedly be revealed as large-scale whole-genome sequencing of cancer genomes becomes commonplace.

1.1.3. IDENTIFYING DRIVER MUTATIONS

Importantly, not all somatic mutations in a cancer genome contribute to cancer development. Many mutations, called passengers, do not play a role in tumorigenesis because their impact on the function of the affected protein is minor or because the protein does not have an important function in tumor initiation and progression. A much smaller fraction of mutations, called driver mutations, are directly involved in cancer development. The first somatic driver mutation was identified in the HRAS gene in a human bladder cancer cell line [13]. Many subsequent studies have tried to identify cancer genes that harbor driver mutations and to distinguish these from random passenger mutations in human cancer tissues [3,14].

To date, more than 520 (2.3%) of the approximately 22 000 protein-coding genes in the human genome have been reported to show recurrent somatic mutations in cancer (Cancer Gene Census, accessed: Jan 2015) [15]. Forward genetic screens in mice, however, suggest that more than 2000 genes are potential cancer genes [16,17]. One of the key challenges in cancer research is therefore to derive a catalog of all driver mutations in cancer and thereby identify all cancer-associated genes. A second challenge is to determine the pathways through which these genes exert their effects on cellular processes. Ultimately, identification of cancer-causing genes and their associated pathways will improve targeted therapy for cancer treatment.

1.1.4. IDENTIFYING CANCER GENES

Depending on the type of data and the goal of the analysis, cancer studies may focus on either identifying biomarkers that can improve patient clinical classification (e.g. survival outcome, cancer subtypes) or determining cancer genes and the pathways through which these genes influence cellular function.

The first class of approaches consists of examining a cohort of cancer patients to identify marker genes on the basis of which patients can be stratified into cancer subtypes or into groups with poor or good prognosis after treatment [18–22]. For some cancer types, classification, staging of tumors, and treatment protocols are established by the presence of abnormal cancer genes [3]. A well-known example is subtype classification in acute myeloid leukemia (AML). AML is classified on the basis of the presence of abnormalities involving specific cancer genes such as NRAS and FLTK [23]. Each subtype has a characteristic gene expression profile, cellular morphology, clinical syndrome, prognosis and opportunity for targeted therapy.

The main challenge in such methods is finding robust markers, i.e. markers discovered in one cohort should remain predictive in an independent cohort [24]. Typically, gene signatures have been identified by analyzing gene expression measurements of tumor tissues. Several studies, however, showed that integration of gene expression with functional relations such as pathways or protein interactions results in more robust cancer markers [21,25].

The second class of approaches focuses on determining mutations that recur across multiple independent tumors. Genes that harbor recurrent mutations in their genomic vicinity are potential cancer-causing genes and may therefore be amenable to therapeutic intervention [26]. Many machine learning algorithms and statistical methods have been developed to determine which regions in the genome contain more recurring mutations than would be expected from the background mutation rate [17,27–30]. In many of these approaches, the significance estimates are based on a permutation procedure in which the mutation scores are randomized. In the case of copy number variation (CNV) detection, for instance, the log2 signal intensity ratios are circularly shifted along the genome and a distribution of aggregate signal intensities for these random signals is constructed [27]. This null distribution enables estimation of empirical p-values to distinguish driver mutations from random background mutations [27]. Genes that are located in significantly mutated regions are considered candidate cancer genes. Using this approach, classic examples of cancer genes, such as TP53 and KRAS, can easily be identified as they are frequently mutated across many cancer types [31].
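The circular-shift permutation procedure described above can be illustrated with a minimal sketch. It is not the implementation of [27]: the function names, the use of a summed log2 ratio as the aggregate score, and the toy profiles are all assumptions made for illustration only.

```python
import random


def circular_shift(profile, offset):
    """Rotate one tumor's log2-ratio profile by `offset` probes (wraps around)."""
    offset %= len(profile)
    return profile[-offset:] + profile[:-offset] if offset else list(profile)


def aggregate_score(profiles, start, end):
    """Sum the log2 ratios of probes [start, end) over all tumors (assumed score)."""
    return sum(sum(p[start:end]) for p in profiles)


def empirical_p_value(profiles, start, end, n_perm=1000, seed=0):
    """Fraction of permutations whose window score is at least the observed one."""
    rng = random.Random(seed)
    observed = aggregate_score(profiles, start, end)
    n_probes = len(profiles[0])
    hits = 0
    for _ in range(n_perm):
        # Each tumor is shifted independently: its own signal structure is
        # kept, but any alignment between tumors is destroyed.
        shifted = [circular_shift(p, rng.randrange(n_probes)) for p in profiles]
        if aggregate_score(shifted, start, end) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: 8 tumors, 200 probes, a shared gain at probes 10-14.
profiles = [[1.0 if 10 <= i < 15 else 0.0 for i in range(200)] for _ in range(8)]
p_recurrent = empirical_p_value(profiles, 10, 15)     # window on the shared gain
p_background = empirical_p_value(profiles, 100, 105)  # window without any gain
```

In this toy setting the window covering the shared gain receives a small empirical p-value, whereas a window without signal does not; real analyses would additionally need smoothing and multiple-testing correction, which are omitted here.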

1.2. NETWORK-BASED DATA INTEGRATION IN CANCER GENOMICS

1.2.1. OMICS DATA INTEGRATION

A single omics analysis can provide only a partial picture of what is wrong with a cancer cell. Since cancer is a multifactorial disease [32], integrating multiple layers of information, including data from epidemiological studies, clinical studies, and genomic and metabolomic profiling, is required to obtain a comprehensive understanding of the causes of cancer [33]. Therefore, data integration strategies in which multiple sources of data are used to improve knowledge discovery [34] are widely used in computational cancer studies [35].

Diverse computational tools have been developed to combine and analyze multiple levels of omics data [36–38]. For example, the Integrative Multi-species Prediction (IMP) tool provides a pooled dataset of gene expression profiles and pathway information, which is used for determining functionally related genes across multiple tissues and organisms [38]. Several consortia have been established with the aim of obtaining and analyzing a wide range of genomic and epigenomic signals. Examples include the cancer consortia TCGA [11] and ICGC [12], but also the generic consortia ENCyclopedia Of DNA Elements (ENCODE) [39] and the Epigenomic Roadmap data consortium [40]. As a further example, the Cancer Genome Project [3] focuses on integrating various datasets, including the Catalogue of Somatic Mutations in Cancer (COSMIC) [41] and publicly available gene expression profiles.

A common strategy for data integration is based on combining several (at least two) omics data types from the same set of samples [11,42]. For instance, the TCGA pilot project integrated gene expression profiles, DNA aberrations, and DNA methylation to find molecular signatures in glioblastoma samples [43]. Several cancer studies have integrated copy-number variation with gene expression data in cancer patients to detect genes that are associated with metastasis and reduced survival, for example in breast cancer [44,45], cervical cancer [46], lymphoma and gastrointestinal cancer [47]. Moreover, some studies have suggested that integrated analysis of gene expression and chromatin remodeling in tumor and normal tissues will be required to explain the mechanisms of breast cancer progression [48].

1.2.2. NETWORK-BASED DATA INTEGRATION

An important class of integrative approaches is network-based methods. Biological networks catalogue all interactions between a cell’s constituent molecules, including DNA, RNA, and proteins, and consequently provide a holistic description of the inner workings of the cell as a system. Network analysis therefore offers a natural framework for omics data integration and is a useful technique in cancer genomics studies. For this reason, many omics data integration techniques rely on network representations of the data [49–51].

While many biological networks are direct representations of measurement data (such as protein-protein interactions), substantial efforts have been made to infer networks by integrating many sources of indirectly related data. One prominent example is the class of Bayesian networks, which have been widely used in cancer research for omics data integration, through the design of appropriate prior distributions [52]. A well-known example is the PARADIGM(-shift) method that has been developed based on a Bayesian network to construct pathways from cancer transcriptomic profiling data sets [53]. It has been applied in several cancer genomic research studies, including TCGA data analysis [54].

Once constructed, biological networks can be analyzed and exploited to facilitate interpretation of measurement data. A common analysis is the characterization of network structure by calculating topological measures such as degree, betweenness or the Jaccard index [55,56]. Such topological measures can be used to quantify the functional association between nodes and edges in a biological network [57]. For example, degree centrality in a protein-protein interaction (PPI) network, which separates hub from non-hub nodes, is used to characterize the functional importance of proteins and their corresponding genes [58–60]. A well-known example is the cancer gene TP53, which is a highly connected (i.e. hub) node in the PPI network [61].
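Two of the topological measures mentioned above, degree and the Jaccard index of two nodes' neighborhoods, can be computed in a few lines. The sketch below assumes an undirected network stored as neighbor sets; the helper names and the toy edge list are hypothetical and do not come from any of the cited studies.

```python
def build_adjacency(edges):
    """Store an undirected network as a mapping from node to neighbor set."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree(adj, node):
    """Number of interaction partners of a node."""
    return len(adj[node])


def jaccard_index(adj, u, v):
    """Shared neighbors of u and v divided by all neighbors of u or v."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


# Hypothetical toy network: one hub connected to three other nodes.
toy_edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")]
toy_adj = build_adjacency(toy_edges)
hub_degree = degree(toy_adj, "hub")
similarity_a_b = jaccard_index(toy_adj, "a", "b")
```

A high degree marks a candidate hub, while a high Jaccard index indicates that two nodes share most of their interaction partners.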

Network topology, and more specifically network modularity, can be used for gene prioritization and to determine which pathways are associated with the cancer under study. In the pioneering work by Ideker et al. [50], protein-protein interaction (PPI) networks are used to identify active subnetworks that exhibit significant gene expression changes. This method was extended in several studies [21,62]. One notable example is the work of [21], which investigated integration strategies to generate informative predictors of breast cancer metastasis by finding discriminative gene subnetworks in the PPI network. Whether network modules are significantly associated with the phenotype under study can be assessed by means of permutation-based strategies [63]. In such approaches, candidate genes are scored for proximity in the context of the network. To assess significance, those scores are compared with scores computed on randomized networks that are obtained by switching interactions in a manner that preserves the interaction degree of every node [64].
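The degree-preserving randomization mentioned above can be sketched as repeated edge switching: two edges (a, b) and (c, d) are rewired to (a, d) and (c, b), so every node keeps its number of interactions. This is an illustrative sketch that assumes a simple undirected graph without self-loops or duplicate edges; the function names and the toy graph are assumptions, not the procedure of [64].

```python
import random
from collections import Counter


def node_degrees(edges):
    """Count how many edges touch each node."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts


def degree_preserving_shuffle(edges, n_swaps, seed=0):
    """Randomize a simple undirected graph by switching pairs of edges."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    if len(edge_list) < 2:
        return edge_list
    present = {frozenset(e) for e in edge_list}
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # consider both orientations of the second edge
        if len({a, b, c, d}) < 4:
            continue  # the switch would create a self-loop
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # the switch would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list


# Hypothetical toy graph: a ring of 10 nodes with two chords.
original = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5), (2, 7)]
randomized = degree_preserving_shuffle(original, n_swaps=20)
```

Scores computed on many such randomized networks form a null distribution against which the score on the real network can be compared.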

A powerful formulation in network analysis is the concept of information flow through a network using random walks on graphs [65]. The flow of information between two nodes is determined using an electric circuit analogy [66,67], in which edges are conductors and nodes through which a significant current flows are candidate causal nodes. This idea was applied to propagate biological process annotations over a PPI network by [68]. Tu et al. applied this method to detect causal genes and generate hypotheses on the underlying regulatory mechanisms by integrating eQTL data, protein-protein interactions, protein phosphorylation, and transcription factor binding sites [69]. Similarly, the diffusion kernel [70] and shortest path [56] algorithms are used to model flow in the network in order to identify subnetworks enriched in recurrently mutated proteins across patients [71,72]. These algorithms are also used for cancer gene prioritization based on assessing the impact of mutations on protein function [71,73,74].
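The notion of information flowing from seed nodes into their network neighborhood can be illustrated with a random walk with restart. Note that this is a related but simpler propagation scheme, not the diffusion kernel of [70] or the circuit model of [66,67]; the restart probability, the iteration count and the toy path graph are assumptions chosen for illustration.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, n_iter=100):
    """Propagate scores from seed nodes over an undirected network.

    At every step a walker either jumps back to a seed (probability
    `restart`) or moves to a uniformly chosen neighbor.
    """
    nodes = list(adj)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(start)
    for _ in range(n_iter):
        nxt = {n: restart * start[n] for n in nodes}
        for u in nodes:
            flow = (1.0 - restart) * score[u]
            if not adj[u]:
                nxt[u] += flow  # isolated node: the walker stays in place
                continue
            for v in adj[u]:
                nxt[v] += flow / len(adj[u])
        score = nxt
    return score


# Hypothetical toy network: a path a - b - c - d - e with a single seed at "a".
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"}}
scores = random_walk_with_restart(path, seeds={"a"})
```

On this path graph the scores decay with distance from the seed, which is the behavior that propagation-based prioritization methods exploit: nodes close to many mutated seeds accumulate high scores.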

1.3. THREE-DIMENSIONAL GENOME ORGANIZATION

The unfolded human DNA sequence is about 2 meters long, much longer than the diameter of an average mammalian cell nucleus. To fit within the cell’s nucleus, the genome has to be tightly packaged and therefore has a specific three-dimensional organization. Importantly, this non-random 3D structure is known to play a key role in regulating activity in the cell, for instance gene expression [75], and thereby in the cell’s biological functions [76–79]. Hence, to fully understand how the genome works, we need to know how the DNA is spatially organized in the cell nucleus [77].

Three decades ago, the fluorescence in situ hybridization (FISH) technique showed that a large fraction of the genome occupies preferred positions in the nucleus. This implies that chromosomes have specific territories, which results in a non-random spatial organization of the genome [80,81]. Moreover, recent techniques that map genome-wide chromatin interactions provide evidence that transcription occurs at specific nuclear sites, called transcription factories [82]. These factories might explain the non-random chromatin looping and gene clustering (Fig. 1.1).

In the last decade, the chromosome conformation capture (3C) technique was developed to derive chromosome folding at a much higher resolution than FISH [77]. To do this, the cross-linked chromatin is cut with a restriction enzyme such as HindIII [77]. Then, the cross-linked DNA fragments are religated. Spatially proximate fragments, which can be distal on the linear genome, will be ligated to each other as shown in Fig. 1.1. The spatial proximity of two genomic loci is then quantified by the number of ligation products between them. The 3C strategy was designed to determine the chromatin contact between two specific loci on the genome (i.e. a one versus one map). 3C was first applied to show that activation of the g-globin genes involves physical interaction between the activated gene and a more distal locus control region (LCR) as a consequence of a large chromatin loop [83].


The 3C technique has been modified to capture chromatin interactions between more than two loci. In general, the 3C technique and its variants (i.e. 4C, 5C, Hi-C) are based on the determination of the physical contact frequency between a pair of loci in the cross-linked DNA in cell populations. 3C was combined with microarrays to measure chromatin interactions between a specific locus and all other loci that are represented on the array (i.e. a one versus all map). This method is called chromosome conformation capture-on-chip (4C) [84]. In the chromosome conformation capture carbon copy (5C) strategy, 3C is combined with hybrid capture approaches to identify interactions between two specific sets of loci (e.g. between promoters and distal regulatory elements). It can be considered a many versus many mapping technology [85].

The all versus all, genome-wide version of the 3C technique, called Hi-C [86], is based on next-generation sequencing (NGS). The 3C strategy was modified by filling the ends of DNA fragments with biotinylated nucleotides to ensure that only ligation junctions are selected for further analysis. The ligation junctions are then directly sequenced and the reads are mapped back to the genome. When a read pair maps to two different restriction fragments, this is scored as an interaction between these two fragments. The Hi-C contact matrix is constructed from the ligation product library; each entry of the matrix indicates the number of ligation products between a pair of loci. The Hi-C resolution depends on the sequencing depth and ranges from 20 kb to 1 Mb for mammalian genomes [86,87].
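How a contact matrix arises from mapped read pairs can be sketched as follows. The example assumes a single chromosome, fixed-size genomic bins instead of restriction fragments, and read pairs given as two mapped positions; these choices, the function name and the toy coordinates are illustrative assumptions rather than part of the Hi-C protocol of [86].

```python
def contact_matrix(read_pairs, chrom_length, bin_size):
    """Count read pairs per pair of fixed-size bins on one chromosome."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos1, pos2 in read_pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


# Hypothetical toy data: three read pairs on a 3 kb chromosome, 1 kb bins.
pairs = [(100, 2500), (2600, 150), (1200, 1300)]
contacts = contact_matrix(pairs, chrom_length=3000, bin_size=1000)
```

Choosing a smaller bin size gives a higher-resolution matrix but fewer read pairs per entry, which is why the achievable resolution depends on sequencing depth.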

The 3C strategy and chromatin immunoprecipitation (ChIP) were combined to detect the chromatin interactions between sites bound by a chromatin-interacting protein [88]. This method, which combines ChIP with paired-end ditag (PET) sequencing, is called ChIA-PET. Both Hi-C and ChIA-PET can describe genome-wide chromatin interactions, but Hi-C determines protein-independent interactions whereas ChIA-PET determines protein-mediated functional interactions.

Using these newly developed techniques, the three-dimensional structure of the genome has been explored, improving our understanding of the organizational principles of chromosomes at different scales, ranging from intra-chromosomal to inter-chromosomal interactions. Intra-chromosomal interactions can have different functional effects on transcription. For example, chromatin loops that join the 5’ end of transcribed genes with the transcription termination site can enhance the transcription directionality of protein-coding genes [89]. Another type of intra-chromosomal loop brings distant enhancers into contact with promoters. The g-globin locus control region (LCR) was the first of many examples of this type of loop [83].

Functional interactions between distinct chromosomes (i.e. inter-chromosomal interactions) are also emerging as prominent functional regulators. For example, long-distance contacts among active genes on different chromosomes, which might be mediated by the same transcription factors, have been detected in mouse [75].

1.3.1. 3D GENOME CONFORMATION AND CANCER

The three-dimensional organization of the genome in the cell nucleus influences cellular activity. Although the genome encodes the genetic information in the linear DNA sequence, the appropriate expression of a gene is influenced by the non-random 3D organization of the genome [78]. The regulation of a gene is therefore controlled by physical contacts, established through chromatin looping, between the gene locus and regulatory elements that can be far from that gene on the linear genome.

Several cancer studies have used the 3C technique and its variants to identify long-range chromatin loops that can contribute to the regulation of cancer genes. It has been shown that genomically distal SNPs can promote cancer through the regulation of distal genes by large chromatin loops in colon cancer [90]. It has also been reported that cancer risk loci interact through long-range chromatin loops in lung cancer [91] and in multiple epithelial cancers, including colon, breast and prostate [92,93].

In cancer studies, Hi-C chromatin interaction maps make it possible to unravel the mechanisms by which somatic mutations deregulate cancer genes [94–96]. For example, it has been demonstrated that chromatin interactions between genomic loci influence the partner choice in chromosomal rearrangements in hematopoietic tumors [97]. By integrating Hi-C data and somatic copy-number variations, it has been shown that the spatial organization of the genome contributes to amplifications or deletions in cancer genomes [95,96]. It has also been reported that overexpression of oncogenic transcription factors in prostate cancer is associated with the 3D organization of the genome [94]. Moreover, long-range chromatin interactions between cancer risk loci have been identified using Hi-C contact maps in colorectal [98] and breast cancer [99].

1.4. CONTRIBUTION OF THIS THESIS

In this thesis, we study the molecular mechanisms of complex genetic diseases such as cancer in order to improve our ability to fight such diseases. Without exception, this is achieved by developing novel data-analysis algorithms that integrate multiple sources of data. Fig. 1.2 summarizes the different levels of omics data that are employed in this thesis.

1.4.1. INTEGRATION OF OMICS DATA IMPROVES SUBTYPE CLASSIFICATION

We start this thesis by investigating integration strategies for the case in which multi-omics data are available for the same sample. In Chapter 2, gene expression and DNA-methylation profiles (both from the same patient cohort) are integrated to find gene signatures that predict AML subtypes. We explore different integration strategies to derive signatures that optimize the predictive power. We find that prediction of molecular subtypes of AML can be further improved by integrating gene expression and DNA-methylation profiles.

Although the signatures that are discovered based on the multi-omics analysis are highly predictive, they may not be optimally suited to unravel the underlying disease mechanisms. To address this, in Chapter 3, we investigate the benefit of integrating patient-specific data with generic datasets that provide information on the interactions between proteins. For this purpose, we propose a novel protein similarity network, based only on protein sequences, that can be used in functional integrative strategies. We find more robust marker genes that are functionally related when the gene expression data are integrated with such a functional interaction network.


9 Chromatin-associated factors Gene Biotin dCTP fill in Endonuclease digestion Protein Protein Sonication Immunoprecipitation biotinilated linkers Contact library PCR with

speci c primers universal primersPCR with Multiplexed

amplification four base cutterDigestion with Ligation Inverse PCR Sonicate Pull down Mmel digestion Pull down B B B B B B B B B B B B B B B B DETECTION LIGATION CUTTING CROSSLINK COMPUTATIONAL ANALYSIS REVERSE CROSSLINKS 3C 4C 5C Hi-C ChIA-PET B B A B C

Figure 1.1: Gene clustering in the cell nucleus (Top Panel). Genes can be found in spatial proximity inside the nucleus even though their genomic loci are distal in the linear genome or not even on the same chromosome. Gene clustering can be due to A) sharing of the same transcription factory, B) co-association of multiple genes with the same splicing speckle (splicing speckles are nuclear structures that are enriched in splicing factors, which play an important role during transcription [100]) and C) unknown mechanisms, which may include chromosome territories and nuclear factor sharing (other than speckles or factories) (figure adapted from [82]). The 3C technique and its variants (Bottom Panel). The chromosome conformation capture technique is based on a set of biochemical approaches to determine the physical interaction of genomic loci. In 3C, its variants (4C, 5C, Hi-C) and ChIA-PET, the cross-linked chromatin is cut with a restriction enzyme such as HindIII. Then, the cross-linked DNA fragments are religated. Spatially proximate fragments, which can be genomically distal, will be ligated to each other. The variants differ in the method used to detect the interactions (figure adapted from [101]).


1.4.2. RECURRENCE OF MUTATIONS SHOULD NOT ONLY BE DEFINED IN TERMS OF THE LINEAR GENOME

Analysis of recurrence in terms of the linear genome is insufficient to detect all cancer driver genes [61]. The reason for this is twofold. Firstly, many different mutations can disrupt the same gene and secondly many different genes can disrupt the same cellular process.

It is frequently assumed that disruption of gene function can occur by mutations anywhere in the proximity of a gene. However, evidence is mounting that the 3D organization of the genome plays a major role in determining gene expression regulation [102,103]. As a result, genomically distal mutations may contribute to deregulation of the same target gene simply because they are proximal in the 3D physical environment as a result of long-range chromatin looping [78]. This would clearly invalidate detecting driver genes based on recurrent mutations in a region in the linear genome. Instead, recurrent mutations should be defined in the context of the 3D organization of the genome.

Secondly, genes interact with each other and, as such, can be considered as part of a pathway. It is long established that a mutation in any one of several genes can exert the same deleterious effect on the whole pathway. This explains why mutations of different genes represent alternative routes to the acquisition of the same cancer hallmark [26,104]. A compelling example is given by the Rb pathway. Mutations in four genes (RB1, CDKN2A, CDK4 and CCND1) can be considered recurrent because their effects on controlling the transition to the replication phase are similar [2].

This example illustrates that in some cases recurrence is better defined at the level of functional groups (pathways). These functional groups may vary considerably in their size. Somatic mutations tend to gather around particular spots that harbor functional structures of the genome. Such functional hot spots can range from single genes to large pathways [105].
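The difference between gene-level and pathway-level recurrence can be illustrated with a minimal sketch. The gene names are those of the Rb pathway example above; the samples, the extra gene `OTHER_GENE`, the pathway annotation and the function names are hypothetical and serve only to show the counting.

```python
from collections import Counter

# Hypothetical toy data: each tumor sample carries one mutated gene.
mutations = {
    "sample_1": {"RB1"},
    "sample_2": {"CDKN2A"},
    "sample_3": {"CDK4"},
    "sample_4": {"CCND1"},
    "sample_5": {"OTHER_GENE"},
}

# Hypothetical gene-to-pathway annotation.
pathways = {
    "Rb pathway": {"RB1", "CDKN2A", "CDK4", "CCND1"},
}


def gene_recurrence(mutations):
    """Count in how many samples each gene is mutated."""
    counts = Counter()
    for genes in mutations.values():
        counts.update(genes)
    return counts


def pathway_recurrence(mutations, pathways):
    """Count in how many samples each pathway has at least one mutated gene."""
    return {
        name: sum(1 for genes in mutations.values() if genes & members)
        for name, members in pathways.items()
    }


gene_counts = gene_recurrence(mutations)
pathway_counts = pathway_recurrence(mutations, pathways)
print("highest per-gene recurrence:", max(gene_counts.values()))
print("Rb pathway recurrence:", pathway_counts["Rb pathway"])
```

In this toy cohort no single gene is mutated in more than one sample, yet the pathway as a whole is hit in four of the five samples, which is the situation that a purely gene-level recurrence analysis would miss.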

In Chapters 4 and 5, we focus on detecting cancer-associated genes that are not recurrently mutated in their close genomic vicinity and thus are not detectable by typical statistical analyses. In Chapter 4, we consider insertional mutagenesis data in the context of a pathway neighborhood as encoded by a protein interaction network. We developed a network diffusion algorithm and confirmed that somatic mutations tend to co-localize in particular hotspots in the functional PPI network. Such hotspots can range in size from single genes to large pathways in which mutations are spread across many genes. In Chapter 5, we developed a statistical framework to show that mutations far apart in the genome may contribute to the deregulation of cancer genes through long-range chromatin looping that brings them into close spatial proximity.

1.4.3. DATA INTEGRATION REQUIRES SCALE-SPACE ANALYSIS

In both cases described above, defining recurrent mutations at the scale of a single gene is insufficient. Instead, we defined recurrent gene mutations at the scale of a pathway or in physical 3D space, encompassing multiple genes. It is a priori unknown what scale is the most appropriate. For this reason, we have proposed multi-scale analyses that allow us to analyze recurrent mutations across many different scales.

[Figure 1.2 image not reproduced. Labels: 3D chromosome conformation in cell nucleus; chromatin interaction network; co-expression network; protein network; Gene 1 and Gene 2 with mRNA expression and translation to Protein 1 and Protein 2; mutations; hypermethylation (tumor suppressor); hypomethylation (oncogene).]

Figure 1.2: Overview of integrative network analysis. The function of a cell is characterized by different layers of networks. The three-dimensional organization of the genome in the cell nucleus influences the cell function. Interactions between all genomic loci in the cell nucleus yield the chromatin interaction network. At a higher level, proteins also interact with each other. Multiple proteins bind together to perform functions in the machinery of the cell. All possible interactions between proteins are represented in the protein interaction network. Association networks that represent the functional relationship between molecules include the gene co-expression network and the protein sequence similarity network. Such biological networks can be integrated with sample-specific data, including DNA alterations, gene expression and methylation profiles, to improve our understanding of biology and eventually the mechanisms of complex genetic diseases such as cancer.


the field of computer vision. In an image, the physical scale of an object is an important property because objects are recognizable only at the appropriate scale. As a simple example, an apple is only meaningful at a scale of a few centimeters, which is a scale we can observe with the naked eye. On the other hand, molecules should be considered at the scale of several nanometers, and can thus only be observed by the human eye when aided by an electron microscope. A well-known application of the notion of scale is cartography, where maps are produced at different scales depending on the application of interest; a city map including details of all streets is very different from a map of the world that only shows the locations of major cities.

In some situations, however, an appropriate scale is not obvious. In such cases a comprehensive description of underlying structures can only be extracted when multiple scales are considered simultaneously. This was pioneered in computer vision techniques [106,107] where scale-space analysis is a well-established method to deal with the multi-scale nature of structure in images. In the scale-space framework, an image is represented by a family of smoothed signals through convolution with a Gaussian kernel. The width of the Gaussian kernel serves as a scale parameter: for small values, small-scale structures are apparent, whereas for large values large-scale structures can be identified.
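For concreteness, the scale-space representation described above can be written down for a one-dimensional signal. The notation is an assumption introduced here for illustration: $f$ denotes the original signal, $g_\sigma$ the Gaussian kernel of width $\sigma$, and $f_\sigma$ the smoothed signal at scale $\sigma$; for images the same construction is applied with a two-dimensional Gaussian.

```latex
f_\sigma(x) = (g_\sigma * f)(x)
            = \int_{-\infty}^{\infty} g_\sigma(x-u)\, f(u)\, \mathrm{d}u,
\qquad
g_\sigma(x) = \frac{1}{\sqrt{2\pi}\,\sigma}
              \exp\!\left(-\frac{x^{2}}{2\sigma^{2}}\right)
```

The family $\{f_\sigma\}_{\sigma > 0}$ is the scale-space of $f$: structures much narrower than $\sigma$ are averaged out in $f_\sigma$, so increasing $\sigma$ moves the description from fine to coarse structure.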

The concept of scale is also important in molecular biology, although in biology the elementary units are not pixels, as is the case for images. One approach would be to build a functional scale-space based on the Gene Ontology (GO) [108]. The hierarchical structure of the GO tree represents different scales for genes, ranging from broad categories containing many genes (e.g. the cellular component, biological process and molecular function categories at the root of the tree) to narrow categories (e.g. apoptotic protease activator activity, intracellular organelle lumen) [108].

Another type of scale-space can be constructed from the linear genome. Genome-wide profiling studies, such as the ENCODE project, showed that information in the linear genome is encoded at different scales [39]. These scales range from tens of bases (e.g. transcription factor binding sites) to megabases (e.g. nuclear lamina-associated domains (LADs) [109]). In this case the scale-space can be built from the location of the base pairs along the genome, somewhat similar to the spatial organization of image pixels. Here, too, a kernel convolution framework can be used to detect significant events in the genomic signals, with the width of the kernel function defining the scale parameter. By using kernel functions of variable width, a scale-space hierarchy can be built. This approach has been applied to insertional mutagenesis data [110], copy-number variation data [28,111], epigenomic data and DNA replication timing domains [112].
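The kernel convolution idea for the linear genome can be sketched as follows. This is a minimal illustration, not the method of the cited studies: it assumes a Gaussian kernel, and the event positions, grid spacing, kernel widths and function names are all hypothetical. It omits the significance testing that a real analysis would require.

```python
import math


def gaussian_kernel_density(positions, grid, width):
    """Sum of Gaussian kernels of the given width (in bp), one per event."""
    norm = 1.0 / (width * math.sqrt(2.0 * math.pi))
    return [
        sum(norm * math.exp(-0.5 * ((x - p) / width) ** 2) for p in positions)
        for x in grid
    ]


def count_peaks(values):
    """Count strict local maxima in a sampled curve."""
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    )


# Hypothetical event positions (bp): two tight clusters about 100 kb apart.
events = [10_000, 10_400, 10_900, 110_000, 110_300, 110_800]
grid = list(range(0, 120_001, 500))

# One smoothed signal per kernel width: a small scale-space of the events.
scale_space = {
    width: gaussian_kernel_density(events, grid, width)
    for width in (1_000, 100_000)
}

for width, density in scale_space.items():
    print(f"kernel width {width} bp: {count_peaks(density)} peak(s)")
```

With these toy positions the narrow kernel resolves the two clusters as separate peaks, whereas the wide kernel merges them into a single broad peak, so which structure is reported depends on the chosen scale.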

In this thesis, we particularly focused on functional (Chapter 4) and spatial (Chapter 6) scale-spaces. The functional scale-space is constructed on the basis of interactions between proteins. The relevant structures discovered in such a scale-space range from the physical binding of two proteins, to interactions between protein complexes and interactions of molecular pathways and connected components of the complete protein interaction network. The spatial scale-space is constructed based on the hierarchical structure of the three-dimensional physical location of the genome in the cell nucleus [113]. Here, scale ranges from hundreds of bases (e.g. small chromatin loops that connect genes and enhancers) to megabases (e.g. interactions between topologically associating domains (TADs) or chromatin compartments). For each data type, appropriate scale-space elements have to be determined. For instance, for protein interactions, a network-based space may be the most appropriate.

1.4.4. ANALYSIS OF BIOLOGICAL NETWORKS REQUIRES SCALE-SPACE ANALYSIS

The concept of a scale-space also plays an important role in the analysis of biological networks. For example, de Bivort et al. grouped genes by similarity of expression across multiple cellular conditions and then, by analysis of these multi-scale networks, demonstrated that properties of the transcriptional network are different when different numbers of genes are grouped into modules (i.e. networks of different scales are considered). More specifically, they generate sub-networks at three distinct scale levels: large scale where modules consist of many genes; medium and small scale where modules are comprised of a few genes and individual genes, respectively [105].

When a biological network is considered at multiple scales, different types of biological interactions are captured at different scales. For instance, direct functional relations between two genes can be determined at the small scale, whereas small signaling cascades that are characterized by a few interactions can be determined at a medium scale. At the larger scales, densely connected components of the network can be captured that may describe complete pathways.

When networks are considered in a scale-space context, the network topology should also be considered at different scales. For this purpose, scale-based descriptions of topological measures of a network can be used. For instance, functional interactions around proteins in the large PPI network have been investigated at different zoom-levels by Estrada et al. [114]. They have shown that, by using the multi-scale degree centrality, identification of essential proteins in the yeast PPI network can be improved compared to using standard centrality measures [114].

In this thesis, we proposed to apply diffusion kernels on discrete spaces [70] as the principal algorithm for analyzing biological networks in a multi-scale fashion (Chapter 4). The concept of diffusion kernels over a network is similar to the heat diffusion process in continuous space. In a network, the information on a node (e.g. the number of genomic alterations in the genomic vicinity of a gene) is diffused along the network edges (e.g. physical interactions between two genes), depending on the network topology that connects this node to the rest of the network.
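The analogy with heat diffusion can be made explicit. The following is a sketch in notation that is assumed here rather than taken from the thesis: $A$ is the adjacency matrix of an undirected network with $n$ nodes, $D$ the diagonal matrix of node degrees, $s_0$ the vector of initial node scores, and $\beta \ge 0$ the diffusion strength. It uses the standard exponential form of the graph diffusion kernel.

```latex
L = D - A, \qquad
K_\beta = \exp(-\beta L) = \sum_{k=0}^{\infty} \frac{(-\beta)^{k}}{k!}\, L^{k}, \qquad
s_\beta = K_\beta\, s_0, \qquad
\frac{\partial s_\beta}{\partial \beta} = -L\, s_\beta
```

The last identity is the discrete counterpart of the heat equation, with the graph Laplacian $L$ taking the role of the (negative) continuous Laplace operator. Two limits follow directly: $K_0 = I$, so nothing diffuses, and for a connected network $K_\beta \to \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}$ as $\beta \to \infty$, so every node ends up with the same share of the total score.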

Effectively this process performs a network smoothing. The level of smoothing determines the scale of the network neighborhood that will be taken into account. For low diffusion strength, information hardly diffuses and the topological properties of a node are determined by itself and a few well-connected neighbors. When the diffusion strength approaches infinity, the diffusion reaches equilibrium where all connected nodes receive the same diffusion contribution. Therefore, information on nodes diffuses to distant and low-degree nodes albeit in very small amounts.
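The two regimes described above can be reproduced on a toy network. The sketch below is a dependency-free illustration under stated assumptions, not the implementation used in Chapter 4: it assumes the exponential kernel exp(-beta * L) with L the graph Laplacian, approximates the matrix exponential by a Taylor series with scaling and squaring, and uses a hypothetical four-node path network with a single scored node. All function names are illustrative.

```python
def graph_laplacian(adjacency):
    """Return L = D - A for a symmetric 0/1 adjacency matrix without self-loops."""
    n = len(adjacency)
    return [
        [(sum(adjacency[i]) if i == j else 0) - adjacency[i][j] for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def diffusion_kernel(adjacency, beta, terms=12):
    """Approximate K = exp(-beta * L); adequate for small networks only."""
    n = len(adjacency)
    lap = graph_laplacian(adjacency)
    # Scale the exponent down until the Taylor series converges quickly.
    norm_bound = 2.0 * beta * max(sum(row) for row in adjacency)
    squarings = 0
    while norm_bound / (2 ** squarings) > 0.5:
        squarings += 1
    step = [[-beta * x / (2 ** squarings) for x in row] for row in lap]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in matmul(term, step)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    # Undo the scaling: exp(M) = (exp(M / 2**s)) ** (2**s).
    for _ in range(squarings):
        result = matmul(result, result)
    return result


def diffuse(adjacency, scores, beta):
    """Smooth node scores over the network at diffusion strength beta."""
    kernel = diffusion_kernel(adjacency, beta)
    return [sum(k * s for k, s in zip(row, scores)) for row in kernel]


# Hypothetical network: a path of four genes, with one alteration on the first.
path = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = [1.0, 0.0, 0.0, 0.0]

local = diffuse(path, scores, beta=0.1)     # weak diffusion
global_ = diffuse(path, scores, beta=50.0)  # close to equilibrium
print([round(x, 4) for x in local])
print([round(x, 4) for x in global_])
```

At the small diffusion strength almost all of the score stays on the altered node and its direct neighbor; at the large one the score is spread evenly over the connected nodes, while the total is preserved in both cases. For realistically sized networks a numerical library routine for the matrix exponential would be the more sensible choice.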

The diffusion concept has also been used to calculate topological measures of a network at different scales [115]. Consider the Jaccard index, which determines the fraction of common neighbors of two nodes, or a shortest path measure. Scale-based variants of these topological measures can be calculated by varying the diffusion strength. By increasing the amount of diffusion between nodes, the Jaccard index takes into account more and more extended neighborhoods. Similarly, the scale-aware shortest path considers paths that are longer than the standard shortest path between two nodes when the scale parameter increases. Therefore, these scale-aware topological measures enable us to construct descriptions of network structure across different scales simultaneously, ranging from direct interactions between nodes (small-scale) through more indirect interactions between connected components of the networks (medium-scale) and interactions between large modules in the network (large-scale).
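How a neighborhood-based measure changes with scale can be illustrated with the Jaccard index. The sketch below is deliberately simplified and is not the diffusion-based definition of [115]: it swaps the diffusion strength for a hop radius, so that the "scale" is the number of steps within which neighbors are collected. The network and the function names are hypothetical.

```python
def neighborhood(graph, node, radius):
    """Nodes reachable from `node` in 1..radius steps, excluding `node` itself."""
    frontier, seen = {node}, {node}
    for _ in range(radius):
        frontier = {nbr for n in frontier for nbr in graph[n]} - seen
        seen |= frontier
    return seen - {node}


def jaccard(graph, a, b, radius=1):
    """Shared fraction of the radius-hop neighborhoods of nodes a and b."""
    na = neighborhood(graph, a, radius)
    nb = neighborhood(graph, b, radius)
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


# Hypothetical network: the path A - C - E - D - B.
graph = {
    "A": {"C"},
    "B": {"D"},
    "C": {"A", "E"},
    "D": {"B", "E"},
    "E": {"C", "D"},
}

for radius in (1, 2, 3):
    print(radius, round(jaccard(graph, "A", "B", radius), 3))
```

At radius 1 (the standard Jaccard index) nodes A and B share no neighbors, at radius 2 they share one of three, and at radius 3 their neighborhoods coincide. A diffusion-based variant replaces these hard hop cut-offs with smoothly decaying weights, but the qualitative effect of enlarging the neighborhood is the same.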

In Chapter 6, we exploit these multi-scale descriptions to investigate the influence of the physical 3D structure of the genome on the spatial expression patterns of genes. Interactions between all genomic loci in the cell nucleus yield the chromatin interaction network. The scale-aware topological measures of the network were calculated by means of diffusion kernels. These measures enable us to construct descriptions of network structure across different scales simultaneously, ranging from direct interactions between genes (small-scale) through more indirect interactions between connected components (medium-scale) to chromatin compartment interactions (large-scale). We used the scale-aware measures to predict co-expression between gene pairs.

REFERENCES

[1] A. Di Lonardo, S. Nasi, and S. Pulciani, Cancer: We should not forget the past, Journal of Cancer 6, 29 (2015).

[2] B. Vogelstein and K. W. Kinzler, Cancer genes and the pathways they control, Nature medicine 10, 789 (2004).

[3] M. R. Stratton, P. J. Campbell, and P. A. Futreal, The cancer genome, Nature 458, 719 (2009).

[4] C. Seeger and W. S. Mason, Hepatitis b virus biology, Microbiology and Molecular Biology Reviews 64, 51 (2000).

[5] G. Clifford, J. Smith, M. Plummer, N. Munoz, and S. Franceschi, Human papillomavirus types in invasive cervical cancer worldwide: a meta-analysis, British journal of cancer 88, 63 (2003).

[6] P. K. Vogt, Retroviral oncogenes: a historical primer, Nature Reviews Cancer 12, 639 (2012).

[7] H. Zur Hausen, Viruses in human cancers, Science 254, 1167 (1991).

[8] K. Grønbæk, C. Hother, and P. A. Jones, Epigenetic changes in cancer, Apmis 115, 1039 (2007).

[9] K. L. Reddy and A. P. Feinberg, Higher order chromatin organization in cancer, in Seminars in cancer biology, Vol. 23 (Elsevier, 2013) pp. 109–115.

[10] Z. K. Stadler, P. Thom, M. E. Robson, J. N. Weitzel, N. D. Kauff, K. E. Hurley, V. Devlin, B. Gold, R. J. Klein, and K. Offit, Genome-wide association studies of cancer, Journal of Clinical Oncology 28, 4255 (2010).


[11] J. N. Weinstein, E. A. Collisson, G. B. Mills, K. R. M. Shaw, B. A. Ozenberger, K. Ellrott, I. Shmulevich, C. Sander, J. M. Stuart, C. G. A. R. Network, et al., The cancer genome atlas pan-cancer analysis project, Nature genetics 45, 1113 (2013).

[12] T. J. Hudson, W. Anderson, A. Aretz, A. D. Barker, C. Bell, R. R. Bernabé, M. Bhan, F. Calvo, I. Eerola, D. S. Gerhard, et al., International network of cancer genome projects, Nature 464, 993 (2010).

[13] E. P. Reddy, R. K. Reynolds, E. Santos, and M. Barbacid, A point mutation is responsible for the acquisition of transforming properties by the t24 human bladder carcinoma oncogene, (1982).

[14] J. R. Pon and M. A. Marra, Driver and passenger mutations in cancer, Annual Review of Pathology: Mechanisms of Disease (2014).

[15] P. A. Futreal, L. Coin, M. Marshall, T. Down, T. Hubbard, R. Wooster, N. Rahman, and M. R. Stratton, A census of human cancer genes, Nature Reviews Cancer 4, 177 (2004).

[16] I. P. Touw and S. J. Erkeland, Retroviral insertion mutagenesis in mice as a comparative oncogenomics tool to identify disease genes in human leukemia, Molecular Therapy 15, 13 (2007).

[17] J. Kool and A. Berns, High-throughput insertional mutagenesis screens in mice to identify oncogenic networks, Nature Reviews Cancer 9, 389 (2009).

[18] A. A. Alizadeh, M. B. Eisen, R. E. Davis, C. Ma, I. S. Lossos, A. Rosenwald, J. C. Boldrick, H. Sabet, T. Tran, X. Yu, et al., Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling, Nature 403, 503 (2000).

[19] A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini, Tissue classification with gene expression profiles, Journal of computational biology 7, 559 (2000).

[20] S. Ramaswamy, K. N. Ross, E. S. Lander, and T. R. Golub, A molecular signature of metastasis in primary solid tumors, Nature genetics 33, 49 (2003).

[21] H.-Y. Chuang, E. Lee, Y.-T. Liu, D. Lee, and T. Ideker, Network-based classification of breast cancer metastasis, Molecular systems biology 3 (2007).

[22] R. G. Verhaak, K. A. Hoadley, E. Purdom, V. Wang, Y. Qi, M. D. Wilkerson, C. R. Miller, L. Ding, T. Golub, J. P. Mesirov, et al., Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in pdgfra, idh1, egfr, and nf1, Cancer cell 17, 98 (2010).

[23] T. Haferlach, Molecular genetic pathways as therapeutic targets in acute myeloid leukemia, ASH Education Program Book 2008, 400 (2008).

[24] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany, Outcome signature genes in breast cancer: is there a unique set? Bioinformatics 21, 171 (2005).


[25] M. A. Pujana, J.-D. J. Han, L. M. Starita, K. N. Stevens, M. Tewari, J. S. Ahn, G. Rennert, V. Moreno, T. Kirchhoff, B. Gold, et al., Network modeling links breast cancer susceptibility and centrosome dysfunction, Nature genetics 39, 1338 (2007).

[26] D. Hanahan and R. A. Weinberg, Hallmarks of cancer: the next generation, cell 144, 646 (2011).

[27] R. Beroukhim, G. Getz, L. Nghiemphu, J. Barretina, T. Hsueh, D. Linhart, I. Vivanco, J. C. Lee, J. H. Huang, S. Alexander, et al., Assessing the significance of chromosomal aberrations in cancer: methodology and application to glioma, Proceedings of the National Academy of Sciences 104, 20007 (2007).

[28] C. Klijn, H. Holstege, J. de Ridder, X. Liu, M. Reinders, J. Jonkers, and L. Wessels, Identification of cancer genes using a statistical framework for multiexperiment analysis of nondiscretized array cgh data, Nucleic acids research 36, e13 (2008).

[29] C. W. Brennan, R. G. Verhaak, A. McKenna, B. Campos, H. Noushmehr, S. R. Salama, S. Zheng, D. Chakravarty, J. Z. Sanborn, S. H. Berman, et al., The somatic genomic landscape of glioblastoma, Cell 155, 462 (2013).

[30] D. Tamborero, A. Gonzalez-Perez, C. Perez-Llamas, J. Deu-Pons, C. Kandoth, J. Reimand, M. S. Lawrence, G. Getz, G. D. Bader, L. Ding, et al., Comprehensive identification of mutational cancer driver genes across 12 tumor types, Scientific reports 3 (2013).

[31] Z.-Z. Pan, D.-S. Wan, G. Chen, L.-R. Li, Z.-H. Lu, and B.-J. Huang, Co-mutation of p53, k-ras genes and accumulation of p53 protein and its correlation to clinicopathological features in rectal cancer, World J Gastroenterol 10, 3688 (2004).

[32] K.-I. Goh, M. E. Cusick, D. Valle, B. Childs, M. Vidal, and A.-L. Barabási, The human disease network, Proceedings of the National Academy of Sciences 104, 8685 (2007).

[33] K. Mitra, A.-R. Carvunis, S. K. Ramesh, and T. Ideker, Integrative approaches for finding modular structure in biological networks, Nature Reviews Genetics 14, 719 (2013).

[34] D. Gomez-Cabrero, I. Abugessaisa, D. Maier, A. Teschendorff, M. Merkenschlager, A. Gisel, E. Ballestar, E. Bongcam-Rudloff, A. Conesa, and J. Tegnér, Data integration in the era of omics: current and future challenges, BMC systems biology 8, I1 (2014).

[35] V. N. Kristensen, O. C. Lingjærde, H. G. Russnes, H. K. M. Vollan, A. Frigessi, and A.-L. Børresen-Dale, Principles and methods of integrative genomic analyses in cancer, Nature Reviews Cancer 14, 299 (2014).

[36] D. R. Rhodes, J. Yu, K. Shanker, N. Deshpande, R. Varambally, D. Ghosh, T. Barrette, A. Pander, and A. M. Chinnaiyan, Oncomine: a cancer microarray database and integrated data-mining platform, Neoplasia 6, 1 (2004).


[37] Q. Zhu, A. K. Wong, A. Krishnan, M. R. Aure, A. Tadych, R. Zhang, D. C. Corney, C. S. Greene, L. A. Bongo, V. N. Kristensen, et al., Targeted exploration and analysis of large cross-platform human transcriptomic compendia, Nature methods (2015).

[38] A. K. Wong, C. Y. Park, C. S. Greene, L. A. Bongo, Y. Guan, and O. G. Troyanskaya, Imp: a multi-species functional genomics portal for integration, visualization and prediction of protein functions and networks, Nucleic acids research 40, W484 (2012).

[39] E. P. Consortium et al., The encode (encyclopedia of dna elements) project, Science 306, 636 (2004).

[40] X. Zhou, D. Li, B. Zhang, R. F. Lowdon, N. B. Rockweiler, R. L. Sears, P. A. Madden, I. Smirnov, J. F. Costello, and T. Wang, Epigenomic annotation of genetic variants using the roadmap epigenome browser, Nature biotechnology 33, 345 (2015).

[41] S. A. Forbes, N. Bindal, S. Bamford, C. Cole, C. Y. Kok, D. Beare, M. Jia, R. Shepherd, K. Leung, A. Menzies, et al., Cosmic: mining complete cancer genomes in the catalogue of somatic mutations in cancer, Nucleic acids research , gkq929 (2010).

[42] C. J. Vaske, S. C. Benz, J. Z. Sanborn, D. Earl, C. Szeto, J. Zhu, D. Haussler, and J. M. Stuart, Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using paradigm, Bioinformatics 26, i237 (2010).

[43] R. McLendon, A. Friedman, D. Bigner, E. G. Van Meir, D. J. Brat, G. M. Mastrogianakis, J. J. Olson, T. Mikkelsen, N. Lehman, K. Aldape, et al., Comprehensive genomic characterization defines human glioblastoma genes and core pathways, Nature 455, 1061 (2008).

[44] K. Chin, S. DeVries, J. Fridlyand, P. T. Spellman, R. Roydasgupta, W.-L. Kuo, A. Lapuk, R. M. Neve, Z. Qian, T. Ryder, et al., Genomic and transcriptional aberrations linked to breast cancer pathophysiologies, Cancer cell 10, 529 (2006).

[45] R. Shen, A. B. Olshen, and M. Ladanyi, Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis, Bioinformatics 25, 2906 (2009).

[46] M. Lando, M. Holden, L. C. Bergersen, D. H. Svendsrud, T. Stokke, K. Sundfør, I. K. Glad, G. B. Kristensen, and H. Lyng, Gene dosage, expression, and ontology analysis identifies driver genes in the carcinogenesis and chemoradioresistance of cervical cancer, PLoS genetics 5, e1000719 (2009).

[47] F. Chibon, P. Lagarde, S. Salas, G. Pérot, V. Brouste, F. Tirode, C. Lucchesi, A. de Reynies, A. Kauffmann, B. Bui, et al., Validated prediction of clinical outcome in sarcomas and multiple types of cancer on the basis of a gene expression signature related to genome complexity, Nature medicine 16, 781 (2010).

[48] M. R. Quigley, O. Fukui, B. Chew, S. Bhatia, and S. Karlovits, The shifting landscape of metastatic breast cancer to the cns, Neurosurgical review 36, 377 (2013).


[49] A.-L. Barabási, N. Gulbahce, and J. Loscalzo, Network medicine: a network-based approach to human disease, Nature Reviews Genetics 12, 56 (2011).

[50] T. Ideker, O. Ozier, B. Schwikowski, and A. F. Siegel, Discovering regulatory and signalling circuits in molecular interaction networks, Bioinformatics 18, S233 (2002).

[51] D. Beisser, G. W. Klau, T. Dandekar, T. Müller, and M. T. Dittrich, Bionet: an r-package for the functional analysis of biological networks, Bioinformatics 26, 1129 (2010).

[52] S. Imoto, T. Higuchi, T. Goto, K. Tashiro, S. Kuhara, and S. Miyano, Combining microarrays and biological knowledge for estimating gene networks via bayesian networks, Journal of Bioinformatics and Computational Biology 2, 77 (2004).

[53] S. Ng, E. A. Collisson, A. Sokolov, T. Goldstein, A. Gonzalez-Perez, N. Lopez-Bigas, C. Benz, D. Haussler, and J. M. Stuart, Paradigm-shift predicts the function of mutations in multiple cancers using pathway impact analysis, Bioinformatics 28, i640 (2012).

[54] Cancer Genome Atlas Network, Comprehensive molecular characterization of human colon and rectal cancer, Nature 487, 330 (2012).

[55] W. Winterbach, P. Van Mieghem, M. Reinders, H. Wang, and D. de Ridder, Topology of molecular interaction networks, BMC systems biology 7, 90 (2013).

[56] J. I. F. Bass, A. Diallo, J. Nelson, J. M. Soto, C. L. Myers, and A. J. Walhout, Using networks to measure similarity between genes: association index selection, Nature methods 10, 1169 (2013).

[57] A.-L. Barabasi and Z. N. Oltvai, Network biology: understanding the cell’s functional organization, Nature Reviews Genetics 5, 101 (2004).

[58] X. He and J. Zhang, Why do hubs tend to be essential in protein networks? PLoS genetics 2, e88 (2006).

[59] E. Zotenko, J. Mestre, D. P. O’Leary, and T. M. Przytycka, Why do hubs in the yeast protein interaction network tend to be essential: reexamining the connection between the network topology and essentiality, PLoS computational biology 4, e1000140 (2008).

[60] H. Yu, P. M. Kim, E. Sprecher, V. Trifonov, and M. Gerstein, The importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics, PLoS computational biology 3, e59 (2007).

[61] B. Vogelstein, D. Lane, and A. J. Levine, Surfing the p53 network, Nature 408, 307 (2000).

[62] I. Ulitsky and R. Shamir, Identification of functional modules using network topology and high-throughput data, BMC systems biology 1, 8 (2007).


[63] E. J. Rossin, K. Lage, S. Raychaudhuri, R. J. Xavier, D. Tatar, Y. Benita, C. Cotsapas, M. J. Daly, I. I. B. D. G. Consortium, et al., Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology, PLoS genetics 7, e1001273 (2011).

[64] R. Milo, N. Kashtan, S. Itzkovitz, M. Newman, and U. Alon, On the uniform generation of random graphs with prescribed degree sequences, arXiv preprint cond-mat/0312028 (2003).

[65] P. G. Doyle and J. L. Snell, Random walks and electric networks, AMC 10, 12 (1984).

[66] S. Suthram, A. Beyer, R. M. Karp, Y. Eldar, and T. Ideker, eqed: an efficient method for interpreting eqtl associations using protein networks, Molecular systems biology 4 (2008).

[67] Y.-A. Kim, J. H. Przytycki, S. Wuchty, and T. M. Przytycka, Modeling information flow in biological networks, Physical biology 8, 035012 (2011).

[68] E. Nabieva, K. Jim, A. Agarwal, B. Chazelle, and M. Singh, Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps, Bioinformatics 21, i302 (2005).

[69] Z. Tu, L. Wang, M. N. Arbeitman, T. Chen, and F. Sun, An integrative approach for causal gene identification and gene regulatory pathway inference, Bioinformatics 22, e489 (2006).

[70] R. I. Kondor and J. Lafferty, Diffusion kernels on graphs and other discrete input spaces, in ICML, Vol. 2 (2002) pp. 315–322.

[71] F. Vandin, E. Upfal, and B. J. Raphael, Algorithms for detecting significantly mutated pathways in cancer, Journal of Computational Biology 18, 507 (2011).

[72] E. O. Paull, D. E. Carlin, M. Niepel, P. K. Sorger, D. Haussler, and J. M. Stuart, Discovering causal pathways linking genomic events to transcriptional states using tied diffusion through interacting events (tiedie), Bioinformatics 29, 2757 (2013).

[73] P. Kumar, S. Henikoff, and P. C. Ng, Predicting the effects of coding non-synonymous variants on protein function using the sift algorithm, Nature protocols 4, 1073 (2009).

[74] I. A. Adzhubei, S. Schmidt, L. Peshkin, V. E. Ramensky, A. Gerasimova, P. Bork, A. S. Kondrashov, and S. R. Sunyaev, A method and server for predicting damaging missense mutations, Nature methods 7, 248 (2010).

[75] S. Schoenfelder, I. Clay, and P. Fraser, The transcriptional interactome: gene expression in 3d, Current opinion in genetics & development 20, 127 (2010).

[76] T. Cremer and M. Cremer, Chromosome territories, Cold Spring Harbor perspectives in biology 2, a003889 (2010).


[77] J. Dekker, K. Rippe, M. Dekker, and N. Kleckner, Capturing chromosome conformation, Science 295, 1306 (2002).

[78] J. Dekker, M. A. Marti-Renom, and L. A. Mirny, Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data, Nature Reviews Genetics 14, 390 (2013).

[79] E. de Wit and W. de Laat, A decade of 3c technologies: insights into nuclear organization, Genes & development 26, 11 (2012).

[80] T. Cremer, M. Cremer, S. Dietzel, S. Müller, I. Solovei, and S. Fakan, Chromosome territories–a functional nuclear landscape, Current opinion in cell biology 18, 307 (2006).

[81] T. Cremer, C. Cremer, H. Baumann, E. Luedtke, K. Sperling, V. Teuber, and C. Zorn, Rabl’s model of the interphase chromosome arrangement tested in chinese hamster cells by premature chromosome condensation and laser-uv-microbeam experiments, Human genetics 60, 46 (1982).

[82] D. Rieder, Z. Trajanoski, and J. G. McNally, Transcription factories, Frontiers in genetics 3 (2012).

[83] B. Tolhuis, R.-J. Palstra, E. Splinter, F. Grosveld, and W. de Laat, Looping and interaction between hypersensitive sites in the active β-globin locus, Molecular cell 10, 1453 (2002).

[84] M. Simonis, P. Klous, E. Splinter, Y. Moshkin, R. Willemsen, E. de Wit, B. van Steensel, and W. de Laat, Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture–on-chip (4c), Nature genetics 38, 1348 (2006).

[85] J. Dostie, T. A. Richmond, R. A. Arnaout, R. R. Selzer, W. L. Lee, T. A. Honan, E. D. Rubio, A. Krumm, J. Lamb, C. Nusbaum, et al., Chromosome conformation capture carbon copy (5c): a massively parallel solution for mapping interactions between genomic elements, Genome research 16, 1299 (2006).

[86] E. Lieberman-Aiden, N. L. van Berkum, L. Williams, M. Imakaev, T. Ragoczy, A. Telling, I. Amit, B. R. Lajoie, P. J. Sabo, M. O. Dorschner, et al., Comprehensive mapping of long-range interactions reveals folding principles of the human genome, science 326, 289 (2009).

[87] J. R. Dixon, S. Selvaraj, F. Yue, A. Kim, Y. Li, Y. Shen, M. Hu, J. S. Liu, and B. Ren, Topological domains in mammalian genomes identified by analysis of chromatin interactions, Nature 485, 376 (2012).

[88] M. J. Fullwood, C.-L. Wei, E. T. Liu, and Y. Ruan, Next-generation dna sequencing of paired-end tags (pet) for transcriptome and genome analyses, Genome research 19, 521 (2009).


[89] S. M. Tan-Wong, J. B. Zaugg, J. Camblong, Z. Xu, D. W. Zhang, H. E. Mischo, A. Z. Ansari, N. M. Luscombe, L. M. Steinmetz, and N. J. Proudfoot, Gene loops enhance transcriptional directionality, Science 338, 671 (2012).

[90] J. B. Wright, S. J. Brown, and M. D. Cole, Upregulation of c-myc in cis through a large chromatin loop linked to a cancer risk-associated single-nucleotide polymorphism in colorectal cancer cells, Molecular and cellular biology 30, 1411 (2010).

[91] T. H. Vu, A. H. Nguyen, and A. R. Hoffman, Loss of igf2 imprinting is associated with abrogation of long-range intrachromosomal interactions in human cancer cells, Human molecular genetics 19, 901 (2010).

[92] N. Ahmadiyeh, M. M. Pomerantz, C. Grisanzio, P. Herman, L. Jia, V. Almendro, H. H. He, M. Brown, X. S. Liu, M. Davis, et al., 8q24 prostate, breast, and colon cancer risk loci show tissue-specific long-range interaction with myc, Proceedings of the National Academy of Sciences 107, 9742 (2010).

[93] Q. Wang, W. Li, Y. Zhang, X. Yuan, K. Xu, J. Yu, Z. Chen, R. Beroukhim, H. Wang, M. Lupien, et al., Androgen receptor regulates a distinct transcription program in androgen-independent prostate cancer, Cell 138, 245 (2009).

[94] D. S. Rickman, T. D. Soong, B. Moss, J. M. Mosquera, J. Dlabal, S. Terry, T. Y. MacDonald, J. Tripodi, K. Bunting, V. Najfeld, et al., Oncogene-mediated alterations in chromatin conformation, Proceedings of the National Academy of Sciences 109, 9083 (2012).

[95] J. Paulsen, T. G. Lien, G. K. Sandve, L. Holden, Ø. Borgan, I. K. Glad, and E. Hovig, Handling realistic assumptions in hypothesis testing of 3d co-localization of genomic elements, Nucleic acids research, gkt227 (2013).

[96] G. Fudenberg and L. A. Mirny, Higher-order chromatin structure: bridging physics and biology, Current opinion in genetics & development 22, 115 (2012).

[97] J. M. Engreitz, V. Agarwala, and L. A. Mirny, Three-dimensional genome architecture influences partner selection for chromosomal translocations in human disease, (2012).

[98] R. Jäger, G. Migliorini, M. Henrion, R. Kandaswamy, H. E. Speedy, A. Heindl, N. Whiffin, M. J. Carnicer, L. Broome, N. Dryden, et al., Capture hi-c identifies the chromatin interactome of colorectal cancer risk loci, Nature communications 6 (2015).

[99] N. H. Dryden, L. R. Broome, F. Dudbridge, N. Johnson, N. Orr, S. Schoenfelder, T. Nagano, S. Andrews, S. Wingett, I. Kozarewa, et al., Unbiased analysis of potential targets of breast cancer susceptibility loci by capture hi-c, Genome research 24, 1854 (2014).

[100] S. Lin, G. Coutinho-Mansfield, D. Wang, S. Pandit, and X.-D. Fu, The splicing factor sc35 has an active role in transcriptional elongation, Nature structural & molecular biology 15, 819 (2008).

[101] O. Hakim and T. Misteli, Snapshot: Chromosome conformation capture, Cell 148, 1068 (2012).

[102] M. B. Gerstein, A. Kundaje, M. Hariharan, S. G. Landt, K.-K. Yan, C. Cheng, X. J. Mu, E. Khurana, J. Rozowsky, R. Alexander, et al., Architecture of the human regulatory network derived from encode data, Nature 489, 91 (2012).

[103] X. Dong, C. Li, Y. Chen, G. Ding, and Y. Li, Human transcriptional interactome of chromatin contribute to gene co-expression, BMC genomics 11, 704 (2010).

[104] D. Hanahan and R. A. Weinberg, The hallmarks of cancer, Cell 100, 57 (2000).

[105] B. De Bivort, S. Huang, and Y. Bar-Yam, Empirical multiscale networks of cellular regulation, (2007).

[106] J. J. Koenderink, The structure of images, Biological cybernetics 50, 363 (1984).

[107] T. Lindeberg, Detecting salient blob-like image structures and their scales with a scale-space primal sketch: a method for focus-of-attention, International Journal of Computer Vision 11, 283 (1993).

[108] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, et al., Gene ontology: tool for the unification of biology, Nature genetics 25, 25 (2000).

[109] L. Guelen, L. Pagie, E. Brasset, W. Meuleman, M. B. Faza, W. Talhout, B. H. Eussen, A. de Klein, L. Wessels, W. de Laat, et al., Domain organization of human chromosomes revealed by mapping of nuclear lamina interactions, Nature 453, 948 (2008).

[110] J. de Ridder, J. Kool, A. Uren, J. Bot, L. Wessels, and M. Reinders, Co-occurrence analysis of insertional mutagenesis data reveals cooperating oncogenes, Bioinformatics 23, i133 (2007).

[111] E. van Dyk, M. J. Reinders, and L. F. Wessels, A scale-space method for detecting recurrent dna copy number changes with analytical false discovery rate control, Nucleic acids research 41, e100 (2013).

[112] T. A. Knijnenburg, S. A. Ramsey, B. P. Berman, K. A. Kennedy, A. F. Smit, L. F. Wessels, P. W. Laird, A. Aderem, and I. Shmulevich, Multiscale representation of genomic signals, Nature methods 11, 689 (2014).

[113] J. H. Gibcus and J. Dekker, The hierarchy of the 3d genome, Molecular cell 49, 773 (2013).

[114] E. Estrada and Ö. Bodin, Using network centrality measures to manage landscape connectivity, Ecological Applications 18, 1810 (2008).

[115] M. Hulsman, C. Dimitrakopoulos, and J. de Ridder, Scale-space measures for graph topology link protein network architecture to function, Bioinformatics 30, i237 (2014).

2

INTEGRATION OF GENE EXPRESSION AND DNA-METHYLATION PROFILES IMPROVES MOLECULAR SUBTYPE CLASSIFICATION IN ACUTE MYELOID LEUKEMIA

Erdogan Taskesen*, Sepideh Babaei*, Marcel Reinders, Jeroen de Ridder

* Equal contribution

This chapter has been published in BMC Bioinformatics (2015), vol. 16, no. Suppl 4, pp. S5 (doi: 10.1186/1471-2105-16-S4-S5).
