
DATA INTEGRATION STRATEGIES FOR BIOINFORMATICS

WITH APPLICATIONS IN BIOMARKER AND NETWORK DISCOVERY

Dissertation

for the purpose of obtaining the degree of doctor at the Technische Universiteit Delft, by authority of the Rector Magnificus prof. ir. K.C.A.M. Luyben, chair of the Board for Doctorates, to be defended publicly on Friday 13 June 2014 at 15:00

by

Marcel HULSMAN

engineer, born in Utrecht.


This dissertation has been approved by the promotor: Prof. dr. ir. M.J.T. Reinders

Composition of the doctoral committee:

Rector Magnificus, chairman
Prof. dr. ir. M.J.T. Reinders, Technische Universiteit Delft, promotor
Prof. dr. L. Wessels, Netherlands Cancer Institute
Prof. dr. J.T. Pronk, Technische Universiteit Delft
Prof. dr. J.N. Kok, Leiden University
Prof. dr. J. Heringa, Vrije Universiteit
Prof. dr. J. de Boer, University of Twente
Prof. dr. T. Heskes, Radboud University Nijmegen
Prof. dr. ir. A.P. de Vries, Technische Universiteit Delft / CWI, reserve member

Keywords: Data integration, normalization, kernel methods, causal inference, regulatory networks, biomarker discovery, materiomics

Printed by: Ipskamp Drukkers

Front & Back: M. Hulsman, with use of an iStock figure

Copyright © 2014 by M. Hulsman

ISBN 000-00-0000-000-0

An electronic version of this dissertation is available at


CONTENTS

1 Introduction
  1.1 Research challenges
  1.2 The need for integration
  1.3 Data, integration and learning
    1.3.1 The DIKW model
    1.3.2 The place of integration
    1.3.3 Relation to human and machine learning
    1.3.4 Relation to JDL data fusion model
    1.3.5 Summary
  1.4 Overview of integration levels
    1.4.1 Data integration
    1.4.2 Information integration
    1.4.3 Knowledge integration
    1.4.4 Comprehension integration
  1.5 Chapter overview
  References

2 Querying flexible data structures
  2.1 Introduction
  2.2 Approach and Results
    2.2.1 Annotating data: roles and relations
    2.2.2 Query language
    2.2.3 Optimization and scalability
  2.3 Methods
    2.3.1 Architecture
    2.3.2 Data representation
    2.3.3 Data operations
  2.4 Related work
    2.4.1 Data models
    2.4.2 Query systems
    2.4.3 Mediators and workflow tools
  2.5 Discussion
  2.6 Availability
  References

3 An algorithm-based topographical biomaterials library to instruct cell fate
  3.1 Introduction
  3.2 Results
    3.2.1 Cell seeding and culture
    3.2.2 Mitogenic effect of surface topographies
    3.2.3 Machine learning algorithms for identification of important topographic parameters
    3.2.4 Surface topography enhances osteogenic differentiation of hMSCs
  3.3 Discussion
  3.4 Materials and methods
    3.4.1 Design and fabrication
    3.4.2 Cell culture
    3.4.3 Immunofluorescence staining
    3.4.4 Image acquisition and analysis
    3.4.5 Data analysis
  3.A Supplement
  References

4 Analyzing surface induced cell responses
  4.1 Introduction
  4.2 Approach
    4.2.1 Topochip
    4.2.2 Analysis
    4.2.3 Data
  4.3 Results
    4.3.1 Scoring surface-induced cell responses
    4.3.2 Ranking surfaces on their effect on cell morphology
    4.3.3 Using surface patterns to guide cells
    4.3.4 Finding surface properties relevant to promote cell response
    4.3.5 Predicting surface responses in silico
  4.4 Discussion
    4.4.1 Computational implications
    4.4.2 Biological implications
  4.5 Materials and methods
    4.5.1 Topochip cell seeding
    4.5.2 Image Processing
    4.5.3 Measuring cell morphologies
    4.5.4 Kruskal Wallis test
    4.5.5 Regression models
  4.A Supplement
    4.A.1 Pipeline overview

5 Delineation of amplification, hybridization and location effects in microarray data yields better-quality normalization
  5.1 Background
    5.1.1 Technical effects
    5.1.2 Background removal after normalization phase
    5.1.3 Background removal within summarization phase
  5.2 Results and Discussion
    5.2.1 Algorithm overview
    5.2.2 Differentially expressed gene detection performance
    5.2.3 Differential gene finding - hMSC dataset
    5.2.4 Signal bias and background estimation
    5.2.5 Signal precision for low, medium and high expression spike-ins
    5.2.6 Inspection of dataset after normalization
  5.3 Discussion
  5.4 Conclusions
  5.5 Methods
    5.5.1 Data
    5.5.2 Spike-in performance
    5.5.3 Quantile-quantile normalization
    5.5.4 Availability
  5.A Biological variation versus technical variation
  5.B Spike-in genes
  5.C M-estimation
  5.D B-splines
  5.E Amplification
    5.E.1 Calculating distance to 3'end of transcript
    5.E.2 5'end bias can be explained by incomplete amplification
    5.E.3 Sequence affects amplification effect
  5.F Array location effect
  5.G Signal bias
  References

6 Evolutionary optimization of kernel weights improves protein complex co-membership prediction
  6.1 Introduction
  6.2 Methods
    6.2.1 Kernel combination methods
    6.2.2 Criteria
    6.2.3 Optimization algorithm
  6.3 Experiments
    6.3.1 Data
  6.4 Results
    6.4.1 Weighted kernels
    6.4.2 Individual kernels
    6.4.3 Computational cost
    6.4.4 Influence of noise
  6.5 Conclusion
  References

7 A probabilistic network simulation method to reveal gene regulatory pathways from gene perturbation experiments
  7.1 Introduction
  7.2 Approach
    7.2.1 Multiple Instance Classification
    7.2.2 Network simulation
    7.2.3 Network construction
    7.2.4 Network learning
    7.2.5 Link similarity
    7.2.6 Overview
  7.3 Results
    7.3.1 Genome wide effect prediction
    7.3.2 Inferring causal genes
    7.3.3 Interpretation of interaction scores
    7.3.4 Application to the MAPK pathways
  7.4 Discussion
  7.A Supplement
    7.A.1 Clustering and prototype representation
    7.A.2 Performance scores
    7.A.3 Causal gene inference
    7.A.4 Network simulation
    7.A.5 GO annotation kernel
  References

8 Discussion
  8.1 Introduction
  8.2 Research topics
    8.2.1 Information handling as part of the DIKW hierarchy
    8.2.2 Information integration hierarchy for a high-throughput screening platform
    8.2.3 Enabling information integration through normalization
    8.2.4 Handling object and feature heterogeneity using information integration
    8.2.5 Network integration through knowledge integration
  8.3 Integration: overall perspective and future directions
  8.4 Conclusion

Summary

Samenvatting

1. INTRODUCTION


Understanding the biology behind life is one of the great frontiers of science. In the last decades, research has opened up this black box of incredible complexity, in which many components work together as a dynamic system [1]. Understanding these systems is important, with valuable applications in health, energy and food (e.g. [2,3]).

To understand these systems, it needs to be determined how the individual components of these systems act on each other, and how these components together create the observed (emergent) behaviour of the organism as a whole.

Significant progress has already been made towards this goal. Much of the focus has been directed towards obtaining an understanding of the cell, the functional unit of life. In millions of research articles, small pieces of a large puzzle have been uncovered. This has revealed a picture in which information stored in the genome (genes) is the basis for constructing and regulating many machines (proteins) that together create and form the cellular environment. This picture is not static, but dynamically responds to external and internal stimuli by regulating the activity of the proteins. The regulatory network underlying these responses has been shown to have hierarchical and/or modular characteristics, revealing additional layers of order [4,5]. Together, the individual components work towards a common goal, such as reproduction or the execution of some specialized function as part of a multi-cellular organism.

1.1. RESEARCH CHALLENGES

Despite this progress, we still have uncovered only the proverbial tip of the iceberg. In most cases, there is at best only a partial understanding of the functionality of the components and subsystems, of the regulation mechanisms and causality (e.g. [6]). The scale and complexity of the underlying systems often make it difficult to explain observed system behaviour (phenotypes) from the observed behaviour of the individual components.

This is the case despite the availability of high-throughput experimental methods nowadays, which allow measurements to be taken over a large number of these components simultaneously. Reviewing the concept of protein function, Eisenberg et al. [7] ask:

Faced with the avalanche of genomic sequences and data on messenger RNA expression, biological scientists are confronting a frightening prospect: piles of information but only flakes of knowledge. How can the thousands of sequences being determined and deposited, and the thousands of expression profiles being generated by the new array methods, be synthesized into useful knowledge? What form will this knowledge take?

Ten years later, this question is still valid. For example, although we are able to observe which genes in a cell respond to a certain change in its environmental condition, it is in many cases a challenge to extend this analysis beyond those simple associations.

An important role here is played by our limited ability to grasp the details of the immense cellular complexity. This has its causes not only in the sheer amount of relevant information that is available, but also in the scale and complexity of the underlying system. The introduction of high-throughput methods has only accentuated this problem: now we are faced with experimental results in which e.g. genes from all over the cell are found to be responding to some experimental condition. One would like to put such results within the context of our knowledge on the workings of the inner cell, to determine what might have caused these genes to respond, or what effect this response might have on the cell. Although there can be many thousands of studies that are relevant to answering these questions, no person is able to integrate or even to grasp all that knowledge.

In too many cases, the progression of science in understanding such complex systems as the cell has therefore degenerated from ‘standing on the shoulders of giants’ to something more akin to crowd-surfing, where researchers hunt through the literature for background information and/or validation on certain genes, in search of those pieces of the puzzle that, together with the obtained results, form a coherent picture. More systemic explanations of genome-wide screens are difficult to obtain, given the sheer amount of relevant knowledge. Discussions are therefore often limited to either general descriptions at a rather high abstraction level, or very narrow discussions on single genes in which most of their overall context (e.g. other genes that interact) is left out.

This problem (amongst others) has led to a significantly increased role for information technology, and specifically bioinformatics, within biology. Automatic data integration and model construction has become a significant research area. This integration effort ranges from combining similar measurements in order to uncover response patterns, to combining diverse data sources to infer a genome-scale picture of the cellular systems. In essence, using information technology, we are trying to re-establish the giant’s shoulders, as they might help us in looking further, discovering more and more of the landscape of life.

1.2. THE NEED FOR INTEGRATION

An example in which the need for data integration is shown in more detail can be found in the analysis of SNPs [8]. SNPs, short for Single Nucleotide Polymorphisms, are small changes in the genome that might have functional consequences. Such consequences could for example be a certain (susceptibility to a) disease or an influence on a person’s height. Due to genomic linkage, SNPs might also be indicators for larger genomic changes. To discover which genomic loci play a functional role, numerous genome-wide association studies (GWAS) have been performed, where the SNP statuses of the genomic loci of persons with and without certain characteristics are compared. Although many loci associations have been found, the predictive power of these results has been disappointing. The SNPs that are found can predict only a fraction of the observed phenotype differences [9].

One of the factors that complicates such studies is one of size: large numbers of SNPs are found within genomes. The probability that one finds a SNP that correlates by chance with a certain phenotype is therefore rather high when not controlled by using large cohorts of many thousands of people. Because of this, one has to go through large amounts of data to find significant, useful patterns. Park et al. [10] show that using datasets beyond 100 thousand samples still increases the number of found loci.
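A small back-of-the-envelope sketch of this size problem is given below. It is purely illustrative: the SNP count and significance level are hypothetical round numbers, not values taken from the studies cited above.

    # Illustrative only: why genome-wide testing needs stringent thresholds and
    # large cohorts. The SNP count and alpha are hypothetical example values.
    n_snps = 1_000_000        # number of SNPs tested genome-wide (assumed)
    alpha = 0.05              # conventional per-test significance level

    # Expected number of SNPs passing the test purely by chance under the null:
    print(n_snps * alpha)     # 50000.0 -> far too many spurious hits

    # Bonferroni-style correction: the per-test threshold needed to keep the
    # family-wise error rate near alpha, which is why only strong or
    # well-replicated associations survive without very large cohorts.
    print(alpha / n_snps)     # 5e-08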

A second complicating factor is related to the complexity of the underlying system: phenotypes are often the result of the cooperation of numerous cellular components. This means that there can be numerous ‘SNP targets’ available that can affect the same phenotype. For example, disabling any of the essential genes in a pathway means disabling the whole pathway. Many distinct, possibly low-frequency, SNPs might thus cause a similar phenotype. On the other hand, SNP-based perturbations of cooperating genes might also cause combinatorial effects, which are not explainable as just the combination of the effects of the individual perturbations [11]. This combinatorial property is illustrated in the robustness shown by many biological systems, where individual gene perturbations show only limited effects compared to combinations of such perturbations [12].

The large role of complexity is supported by the fact that human height, which has a heritable component of approximately 80%, is still difficult to predict based on discovered GWAS loci. In fact, in a recent study by Allen et al. [13], SNP data of more than 180 thousand persons was analyzed together, finding 180 height-related loci. However, these loci together only explained about 16% of the variance in human height.

Unfortunately, taking into account combinatorial behaviour causes the analysis of SNPs to attain the properties of a combinatorial problem: the number of possible combinations becomes extremely large very quickly. This, in turn, increases the problem of size, as almost certainly some of the many possible combinations will relate to the phenotype just by chance. Combined with the fact that the (combinations of) SNPs might be of low frequency, the amount of SNP data required to control for this approaches the infeasible. Some of the SNPs might even be unique to a person, and one will never be able to determine their effect through a population-based study. Adding to these issues is the presence of measurement noise, which can have both technical and biological causes.
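How quickly the number of combinations explodes can be made concrete with a few lines of arithmetic; the SNP count below is again a hypothetical round number.

    # Illustrative only: combination counts grow explosively with interaction order.
    from math import comb

    n_snps = 1_000_000                 # hypothetical number of measured SNPs
    print(comb(n_snps, 1))             # 1,000,000 single-SNP tests
    print(comb(n_snps, 2))             # ~5.0e11 pairwise combinations
    print(comb(n_snps, 3))             # ~1.7e17 three-way combinations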

It is therefore increasingly recognized that charting the effects of genome variations will require more than SNP data alone [14]. Stranger et al. note in their review on the progress made by GWAS studies [15], that:

A system genetics approach is thus needed, in which large sets of genetic variants and/or genes are analyzed together, genetic data are integrated with external functional data types, and the results inform the biology of the complex trait directly.

This integration of our gathered knowledge on the workings of the cellular systems can help to link SNPs together that are likely to be causal factors for similar phenotypes [16]. Such SNPs can then be observed in combination, thereby increasing statistical power dramatically.

These three problems (size, complexity and noise) are by no means unique to SNP analysis, as they find their basis in the properties of the underlying cellular systems. The same problems occur, for example, also in gene expression-based analyses. Together, they show how central the integration of data has become to biological problems: solutions all require the combined use of large amounts of homogeneous (replications) and/or heterogeneous data (diverse information on the workings of the cell). In this thesis, we explore such integration in a variety of settings.

1.3. DATA, INTEGRATION AND LEARNING

We are drowning in information but starved for knowledge. (John Naisbitt)


Before studying integration, one first has to understand what it is that we are integrating. Various data types might need to be integrated, such as raw high-throughput data, information on gene functions, relations or interactions between proteins, patterns in conditional gene responses, models of the regulatory networks, or knowledge obtained from research papers. What is common and what is different between these data source types? And how does what we integrate influence how we integrate?

To put these questions into context, we first investigate how terms such as data, information, relations, patterns, models or knowledge relate to each other. There is already quite some difference in opinion about what these terms actually represent by themselves. For example, Zins [17] lists 130 definitions of data, information and knowledge. Some of these terms have been debated since the ancient Greek philosophers [18].

1.3.1. THE DIKW MODEL

It has been recognized that studying these terms in relation to each other makes their meaning clearer. Rowley [18] even notes that

it may be difficult to justify any discussion of information that does not also explore knowledge and vice versa.

The basic model generally used for this purpose is the D-I-K-W pyramid, linking Data, Information, Knowledge and Wisdom together. Here, we discuss a combination of two versions described by Bellinger et al. [19] and Ackoff [20], and extend it with concepts that have a role in integration.

Based on Ackoff, we consider a slightly extended version of the model, consisting of five different levels, which we define as follows (a short code sketch after this list illustrates the first three levels):

• Data: signifiers. Numeric values, words, or, more generally, representations of (abstract) objects. Data does not represent meaning in itself, beyond the intensional meaning of its signifiers. It just exists, representing which objects have been encountered.

Examples: 3.14, ‘gene’, ‘SWI4’.

• Information: descriptions. Data is given meaning (semantics) through relations (structure). These relations link predicates with signifiers.

Examples: ‘SWI4’ ‘is a’ ‘gene’, ‘SWI4’ ‘is a’ ‘transcription factor’, ‘SWI4’ ‘interacts with’ ‘SWI6’.

• Knowledge: useful facts. Obtained by selection of information or extraction from information.

Examples: ‘SWI4’ ‘is a’ ‘transcription factor’ (selection). ‘SWI4’ ‘changes’ ‘expression’ ‘during the’ ‘cell cycle’ (extraction of a pattern from microarray results). The criterion ‘fact’ considers to what extent such statements are found to be true in the available information (its significance). A significant pattern is not necessarily useful, however, so a second criterion considers to what extent the pattern connects with existing knowledge (its specificity).

(18)

1

6 1.INTRODUCTION

• Comprehension: explanatory models. An interpolative and probabilistic reasoning process [20], in which knowledge is connected into a system, allowing for explanations and predictions based on causality and implications.

Example: ‘SWI4 binds with SWI6 to form the SBF complex. Together with the MBF complex, it regulates cyclins and DNA synthesis/repair genes in the late G1 cell cycle phase. These cyclins in turn have a positive feedback on SWI4 expression, explaining why it is regulated in a cell cycle dependent manner’.

• Wisdom: deep understanding. A combined evaluation of comprehensions, leading to a recognition of general principles, and an ability to apply this extrapolatively to new situations.

Examples: ‘in closed systems, entropy tends to increase’ or ‘dynamic order requires regulation’, and being able to apply these principles to situations in companies, society, the internet, et cetera.
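The following minimal sketch illustrates the difference between the first three levels, using the SWI4 examples above; the chosen encoding (strings and triples) is just one possible, hypothetical representation, not a formalism used in this thesis.

    # Data: bare signifiers, no structure or meaning beyond the symbols themselves.
    data = {"SWI4", "SWI6", "gene", "transcription factor"}

    # Information: relations (subject, predicate, object) that give the data meaning.
    information = {
        ("SWI4", "is a", "gene"),
        ("SWI4", "is a", "transcription factor"),
        ("SWI4", "interacts with", "SWI6"),
    }

    # Knowledge: a useful fact, obtained here by selection from the information.
    knowledge = {t for t in information
                 if t[1] == "is a" and t[2] == "transcription factor"}
    print(knowledge)   # {('SWI4', 'is a', 'transcription factor')}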

Note that an excellent review of the DIKW literature can be found in [18], which should allow one to put these definitions into their context. Here, we just replicate for comparison a table from this review, which distinguishes Ackoff’s definitions [20] from those of Zeleny [21] (Table 1.1). Note that Zeleny does not have a separate understanding level, but instead considers an extra ‘enlightenment’ level. The wisdom and enlightenment levels of Zeleny are, however, more or less comparable to the understanding and wisdom levels of Ackoff.

              | Zeleny [21]  | Ackoff [20]
Data          | Know nothing | Symbols
Information   | Know what    | Data that are processed to be useful; provides answers to who, what, where and when questions
Knowledge     | Know how     | Application of data and information; answers how questions
Understanding |              | Appreciation of why
Wisdom        | Know why     | Evaluated understanding
Enlightenment | Attaining the sense of truth, the sense of right and wrong, and having it socially accepted, respected and sanctioned |

Table 1.1: Replicated from [18], comparing Ackoff’s and Zeleny’s definitions of the DIKW hierarchy.

1.3.2. THE PLACE OF INTEGRATION

TRANSFORMATIONS

The DIKW model has often been represented as a hierarchy, in which each step describes an improved grasping of the underlying concepts (Figure 1.1a). Each transition between these steps is characterized by an increase in the associations that link the underlying data. Bellinger et al. [19] represented this as a connectedness axis. With this ‘horizontal linking’, there is also a corresponding tendency to form ‘vertical linking’, i.e. the formation of a hierarchy of concepts, in which low-level concepts are connected into higher-order concepts/ideas. Chun [22] describes this as an order/structure axis, in which data is considered to be physically structured, information cognitively structured, and knowledge ‘belief structured’. Here, we represent both concepts as, respectively, an association and a conceptualization axis. A visualization of the model according to these two axes is shown in Figure 1.1b. The transformations between the steps are each annotated with a keyword, describing the way in which data is linked to get from a lower level to a higher level (see also [19]). One can relate these steps to the association and conceptualization axes, as shown in Table 1.2.

Figure 1.1: a) A representation of the DIKW pyramid [18]. b) The model as discussed in this introduction. Through various transformations, data is processed into wisdom. This transformatory process entails an increase in both conceptualization (vertical) and association (horizontal).

Description                                                                                                                                 | Association | Conceptualization
Predicates relate data elements together to describe objects                                                                               | Relation    | Object
Patterns compare objects based on ‘predicates’, in order to recognize commonalities (templates), such as classes, concepts or correlations | Similarity  | Template
Processes causally link objects, based on ‘patterns’, in order to form a working model                                                     | Causality   | Model
Principles discover commonalities in the way our ‘world’ works                                                                             | Commonality | Coherency

Table 1.2: A description of the four transformatory steps in the DIKW hierarchy, and their relation to the association and conceptualization axes. The mentioned terms are only indicative, and should be interpreted broadly. For example, a model can also mean a proof, and causality can also mean implication.

This brings us back to integration. Note that, similar to these transformation steps, integration can also be seen as a process in which one associates data, and generates higher-order concepts from such data. For example, one can integrate microarray data to improve the power to find differentially expressed genes, integrate protein interaction evidence to decide on interaction existence, or integrate multiple interactions into one (functional) pathway. Basically, any task which links underlying data together can be seen as an integration task. As such, one can also see the transformations in the DIKW hierarchy as actually representing integration tasks.

This immediately gives us a way to distinguish various levels of data integration. Thus, integration should not be seen as taking place at a certain hierarchy level; rather, it is the way in which we connect data to achieve such a level. This also highlights the prevalence of integration, as basically every task in which we attempt to attain knowledge, comprehension or wisdom is in essence an integration task.

BI-DIRECTIONALITY

The arrows in Figure 1.1b that accompany the transformations should not be interpreted as suggesting a one-directional flow of concepts, from data to wisdom. They depict a conditional requirement, e.g. wisdom requires comprehension, knowledge requires information. In fact, the flow of concepts can be considered to be bi-directional. It has been observed that (extracted) knowledge can again become information, for example for purposes of communication [17]. This has been a rich source of confusion in understanding the distinction between these concepts, as a statement can be both information as well as knowledge. Similarly, in Ackoff [20], it is considered that comprehension can lead to new knowledge. Wisdom has, in itself, already the connotation of application: being wise is reflected in one’s actions and comprehensions of new situations.

The (downward) expression of learned concepts to lower levels in the hierarchy can be extended to the whole hierarchy. Starting from wisdom, we propose the following (Figure 1.2):

• Insight: based on wisdom, acquire new comprehension.

• Reasoning: based on comprehension, one can argue towards new knowledge, using deduction and induction.

• Recognition: knowledge extracted from information, or obtained through arguments, can be described as new information.

• Representation: recognized objects or concepts may be represented as new signifiers (e.g. forming words using letter symbols). Example: ‘SWI4’. This allows for a practical description of the relations between new, higher-level concepts.

The fact that a certain data element can be assigned to multiple levels shows that the DIKW hierarchy is not so much about assigning a certain piece of ‘data’ to a certain level in the hierarchy. Instead, its main point is in its distinction between the different transformatory steps. It thus describes a transformation process, not a classification scheme. In summary, it can be seen as representing the way in which, through learning and science, knowledge, comprehension and wisdom are acquired².

² This does not mean that all learning follows this hierarchy. For example, learning a certain procedure does


What does this mean for data integration? With data elements that might be assigned to multiple levels of the hierarchy, integration tasks might encompass multiple transformation steps in the hierarchy. This has some important implications for the way in which tools that support this process should be designed. A separate transformation step cannot always be handled in isolation from other transformation steps, even if the individual tasks that need to be performed for each transformation are distinct from each other. In section 1.5, we discuss how this plays a role in the work described in this thesis.

Figure 1.2: The process of learning. A model relating data, information, knowledge, comprehension and wisdom.

1.3.3. RELATION TO HUMAN AND MACHINE LEARNING

Too often we forget that genius, too, depends upon the data within its reach, that even Archimedes could not have devised Edison’s inventions. (E. Dimnet)

THE ITERATIVENESS OF LEARNING

The bi-directional flow of concepts hints at an important aspect of these transformation processes, namely their unified nature. For example, while reading the definition of information, one might have asked oneself what the input is which enables the transformation of data into information. The answer to this is that a higher-level concept (knowledge) is required, enabling one to determine e.g. that the symbol ‘blue’ is an instance of color, and that some data elements describe properties of a common ‘gene’. Similarly, determining which patterns to search for as useful knowledge is largely based on how it fits into one’s comprehension of the ‘world’. Thus, what happens at the lower levels of the hierarchy is dependent on the higher levels. In fact, Rowley [18] even proposes to turn the wisdom pyramid upside down:

Perhaps the wisdom hierarchy upended ... is more evocative. This is a wisdom funnel, where data naturally becomes more concentrated, but the whole edifice is delicately balanced on wisdom and will collapse without sufficient wisdom.

thereby suggesting that our capacity to associate and conceptualize data is dependent on our already available knowledge, understanding and/or wisdom. Similarly, in the context of machine learning, Michalski et al. [23] note that:

One lesson from the failures of earlier tabula rasa and knowledge-poor learning systems is that to acquire new knowledge a system must already possess a great deal of initial knowledge.

From this it follows that learning is essentially an iterative process. Our capability to infer information, knowledge or comprehension from data is dependent on our already existing grasp of the world around us. In each iterative step, we build up this knowledge, which in turn allows us to make another step, and so on. Illustrating this iterative nature, it has often been recognized that various scientists, independently of each other, can make a similar discovery around the same time [24]. Knowledge acquired at one point in time enables multiple people to make the same, iterative, step, where this was not yet possible before without that knowledge.

This highlights the importance of being able to use the available knowledge. This is especially an issue in the academic world, where the learning process is performed in a massively parallel fashion. It is quite obvious that the speed of scientific progress is dependent on our ability to distribute and integrate the acquired knowledge. Providing better access to available knowledge, for which there appears to be room in biology as argued earlier, can increase both the number and ‘size’ of the discoveries (i.e. size of each individual iterative step) that are made.

LEARNING LIMITATIONS

What are the fundamental limitations preventing us from doing this? We identify two basic limitations. First, our ability as humans to attain and store large amounts of information (not yet identified as knowledge) is rather limited. Human memory is much more focused on storing knowledge and understanding, in the form of a mind model formed of concepts and their relations (next to our non-declarative memory) [25]. This is a problem, as relevant (new) knowledge might be hidden in the available information, and only becomes apparent when considered in conjunction with other knowledge, e.g. when one attempts to form a model. An example of this would be measuring interactions in order to detect possible regulatory interactions. Only by observing the measured interactions together in the context of an (existing) regulatory model might one be able to infer which of the individual interactions is actually true.


The second limitation has to do with the nature of iterative learning. In the context of the DIKW model, an individual learning step in this iteration can be seen as a dynamic process, in which one moves up and down the hierarchy, trying to associate and conceptualize data, searching for those associations and concepts that elicit further understanding. Thus, one can see it as a search operation, where one walks across a space of possible patterns and models, until the outcome ‘fits’, i.e. appears to correctly describe or predict reality, thereby leading to further knowledge, comprehension or even wisdom³. Michalski et al. [23] refer to this phase as ‘discovery’.

Our limited working memory, however, as well as the enormous search space of possible patterns/models, puts limits on our ability to discover. In educational psychology, this is known as Cognitive Load Theory [26], which describes how our learning ability is influenced by the number of concepts that need to be put into a ‘mind model’ (schema) simultaneously. More concepts represent more load, making it progressively more difficult to organize the concepts. This appears to be the main bottleneck which prevents us from working with overly large models, such as we find in biology.

This automatically leads us to the question: to what extent can machines support us in our quest to understand the cellular systems? Can we automate the required integration tasks?

MACHINE LEARNING

The improvements in computer hardware have indeed enabled researchers to address the two mentioned limitations. Computers have as advantages their much higher (serial) speed, as well as larger (working) memories. First, much research has been performed on the development of data storage models, allowing us to store and efficiently address large amounts of information (thereby addressing the first limitation). Secondly, machine learning research has developed many techniques to recognize patterns (or learn models) using large amounts of input information (addressing the second limitation). Together, they allow the effective integration of acquired knowledge in the learning process.

Machine learning cannot work wonders, however. Although computers can solve the working memory size problem, they cannot really solve the search space problem, as, unconstrained, it increases exponentially for a linear increase in the number of model parameters. With this also comes the problem of overfitting, i.e. the danger of finding a model that, by chance, fits the available data well, but will not generalize to new data [27]. So, each discovery step cannot be made larger without limits.

³ We noted before that our ordering of data into information requires knowledge, which might have evoked the question how learning can be started when there is no knowledge yet available. This bootstrapping problem is solved by the dynamic search process described here, as it allows us to try different data associations/conceptualizations, in order to find the knowledge that ‘works’.

More importantly, machines are not very good at conceptualization and abstract reasoning. To solve a problem, humans use various problem-solving techniques, which generally focus on subdividing a problem into subproblems, and then solving these individually, thereby reducing the search space. This activity can require a lot of background knowledge, insight and conceptualization, and this is currently a weak point of machine learning algorithms. Generally speaking, one could say that machine learning algorithms are often used to solve single discovery steps, while humans still perform the more iterative aspect of learning/problem solving⁴. Chapter 2 in [23] refers to this, and notes that

Hence, I think it is quite appropriate that a large part of the effort in the domain of ‘machine learning’ is really directed at ‘machine discovery’.

Despite these limitations of machine learning, it appears that its strengths and weaknesses complement the learning/problem solving ability of humans well. An interesting illustration of this was given by Gilski and Jaskolski [28], who described how protein structures were solved through a game, in which a combination of algorithmic tools and players’ intuition and insight was used to find better-scoring structures. With this, a problematic structure was solved which had remained unsolved for 15 years due to the failure of other approaches.

1.3.4. RELATION TO JDL DATA FUSION MODEL

As discussed, the DIKW model allows us to distinguish between different levels of integration, which we represent in Figure 1.2 as data integration, information integration, knowledge integration, and comprehension integration. With integration being a transformatory step in the model, this means that e.g. data integration transforms data into information and thus occurs at the information level (Figure 1.2). Similarly, the integration of information occurs at the knowledge level, where information is conceptualized into patterns. Thus, integration is not the same as combining data: combining just gathers, while integration also processes the data, taking its associations into account.

Interestingly, the different levels of integration discussed here correlate well with the JDL data fusion model, which was developed in the context of military applications. Extensions and revisions were proposed in [29,30], where the following levels were considered:

• Level 0: Signal assessment. Estimation of states of sub-object entities (e.g. signals, features). Process: detection. Product: estimated signal state.

• Level 1: Object assessment. Estimation of states of objects (e.g. vehicles, buildings), by assigning attributes. Process: attribution. Product: estimated entity state.

• Level 2: Situation assessment. Estimation of relations among entities (e.g. aggregates, cuing, intent, acting on). Process: relation. Product: estimated situation state (aggregation).

• Level 3: Impact assessment. Estimation of impacts (e.g. consequences of threat activities on one’s own assets and goals). Process: game-theoretic interaction. Product: estimated situation utility (effect).

• Level 4: Process refinement. Adapt data acquisition and processing to support mission objectives. Process: control. Product: action.

⁴ Another factor which plays a role here is that, in academic research, many of these discoveries (or hypotheses) need to be validated independently through extra experiments, to make sure that we do not acquire incorrect knowledge. Normally, machines do not have the means to do this, although they might suggest experiments that would be most informative.


Although this JDL model describes a rather specific application domain, the similarities with the integration levels in the DIKW model are obvious. Object assessment (level 1) describes the handling of object predicates (data integration). Situation assessment (level 2) describes the recognition and aggregation into patterns (information integration). Impact assessment (level 3) describes the comprehension of the systems and processes (knowledge integration). Finally, the evaluation of the processes (comprehension integration) leads to process refinement (level 4).

Level 0, signal assessment, was proposed as an addition to the JDL model by Steinberg [29], to describe the process in which one obtains a data element, for example by time-averaging a signal or extracting data from an image region. In the DIKW model, this could have been added as a ‘detection integration’ phase. However, we think this type of integration is actually better represented as an instance of pattern discovery (or situation assessment in terms of the JDL model), in which one determines whether a signal (pattern) can be found or not. This again highlights the iterative, bi-directional nature of the model.

In the next section, we will discuss these various forms of integration in more detail.

1.3.5. SUMMARY

• Integration is a process in which data elements become progressively more ordered through associations and conceptualization.

• Data, information, knowledge, comprehension and wisdom describe to what extent data elements have been integrated.

• The transformations from one level to another in this hierarchy involve different forms of integration, respectively based around predicates, patterns, processes and principles. These transformations are referred to as data integration, information integration, knowledge integration and comprehension integration.

• Individual transformations cannot (always) be approached as separate subproblems. Instead, they take part in the process of learning in a unified manner.

• Integration is not just a bottom-up process, it requires input from higher levels in the hierarchy to perform its transformations.

• Central to integration is the discovery of good conceptualizations and associations. These discovery steps represent a search problem.

• There is an iterative aspect to integration: one is ‘confined’ by existing knowledge. The amount of existing knowledge influences the size of the discovery step that can be made.

• Humans have difficulty handling discovery steps in which large amounts of information need to be considered. This is however one of the strengths of machine learning algorithms. Human learning and machine discovery are complementary to each other.


1.4. OVERVIEW OF INTEGRATION LEVELS

1.4.1. DATA INTEGRATION

Within computer science, the storage of information has been studied for more than four decades. To allow machine access to information (which is necessary for data analysis), data has to be stored in machine-readable, structured formats. Databases are such organized collections of data, and are in widespread use. Well known are relational, object-oriented, and hierarchical approaches. Their common goal is to capture the object structure (i.e. what elements describe an object) and the relations between objects, as well as to allow high-level query languages to access and use objects and relations.

Most of the difficulties in data integration are not in defining a data structure, however, but rather in linking multiple heterogeneous data sources. Such linking is required to enable interoperability over the different data sources. Sheth [31] considers the following heterogeneity levels:

• System: file systems, communication protocols

• Syntactic: tables, XML

• Structural: relational database, RDF

• Semantics: object identities, attribute domains

A common approach to solve the heterogeneity of data sources has been the construction of data warehouses [32]. Such warehouses store data from multiple sources in one common data structure (schema). In this case, the heterogeneity problems are solved by the database designer. This can take quite a lot of work, however, as for example the addition of a data source will often require that one writes a new importer and modifies the schema. Also, keeping a warehouse up to date with respect to its data sources is often not trivial. To handle the latter issue, research has shifted from data warehouse approaches (in which all data is stored in a common schema) towards looser approaches such as federated databases and mediators [32]. These loose approaches perform data integration on demand, such that the integrated database is always up to date. Data source additions are however still complicated, as data schemas have to be manually mapped onto each other. Also, query execution is generally less efficient.

To handle schema mapping, various theoretical approaches (e.g. global-as-view, local-as-view) have been proposed [33]. Automating schema mapping is however a difficult problem, as knowledge is required on the (common) semantics of the data sources. Due to this, in the last couple of years, focus has shifted towards semantic integration. Whereas in the previously described approaches there is still work required for each data source that is added, the goal here is to remove this altogether. This is accomplished by describing each data source in terms of a common semantic description, called an ontology. Such an ontology defines predicates and their relationship to each other, and solves the semantic heterogeneity problem. Also, data is stored in a common structure, often RDF (solving structural heterogeneity), which is a form of XML (solving syntactic heterogeneity). Communication takes place through standard internet protocols (solving system heterogeneity). This way, all levels of heterogeneity are removed, and data integration can be performed automatically. Enabling such data integration is now just a problem of representing data as information, using the common semantic contract. Instead of users having to solve the integration problem, the burden is now shifted towards the data source developers. This form of integration enables maximal interoperability, which has led to the concept of the semantic web, where the goal is to represent all data sources in such a format, enabling querying across many available data sources without any extra required integration effort.
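A toy stand-in for this style of integration is sketched below: two hypothetical sources are mapped onto one shared vocabulary of predicates, after which a single query runs over the merged triples. The source names, predicates and gene symbols are illustrative assumptions, and plain Python tuples stand in for a real RDF store.

    # Source A (e.g. an interaction database), already using the shared vocabulary.
    source_a = {("SWI4", "interacts_with", "SWI6")}

    # Source B (e.g. an annotation table) with its own column names, mapped on import.
    source_b_rows = [{"symbol": "SWI4", "type": "transcription factor"}]
    source_b = {(row["symbol"], "is_a", row["type"]) for row in source_b_rows}

    merged = source_a | source_b   # after mapping, merging is a plain union

    # One query over both sources: transcription factors with at least one interaction.
    tfs = {s for (s, p, o) in merged if p == "is_a" and o == "transcription factor"}
    print({s for (s, p, o) in merged if p == "interacts_with" and s in tfs})   # {'SWI4'}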

While the semantic web approach solves the data integration problem and allows for information selection, its querying capabilities are not well suited for pattern finding (information integration, i.e. data analysis). By standardizing data structures to their greatest common denominator (triples), and using query languages which are more or less closed for such a structure, there will in many cases remain a significant data handling burden for the data analyst, who often has to implement this using custom scripts. Thus, while the semantic web solves the interoperability between data sources, it still lacks the interoperability with data analysis algorithms. In this thesis, we propose a solution that deals with interoperability of data sources as well as data analysis (see section 1.5).

1.4.2. INFORMATION INTEGRATION

Information integration, often referred to as data fusion, differs from data integration in that it performs a data set reduction instead of a data set union. The goal of this reduction is to extract knowledge from the underlying information.

Data can be reduced by selection, in which one specifies a query, based on existing knowledge, that searches for linked information. For example, ‘Is there an interaction measured between SWI4 and SWI6?’. A special case is deduction, based on the semantic representation of information. An example of such a query could be ‘Is there a pathway that connects SWI4 and SWI6?’. This would require one to determine whether there is a path of interactions that connects SWI4 and SWI6, which involves some reasoning over the measured interactions⁵.

Much research is being performed on automating such types of queries on semantic data representations. This requires that the query processor understands the concept of a pathway, and is able to use this concept to execute the query. This is often done through semantic representations, with axiomatic descriptions of the relations (e.g. is_a or part_of) and logic axioms that allow the derivation of new rules (e.g. if A part_of B and B part_of C, then A part_of C).
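The kind of deductive query described above can be made concrete with a short sketch; the gene names and interaction edges below are hypothetical, and a breadth-first search stands in for the reasoning performed by a real query processor.

    from collections import deque

    interactions = {("SWI4", "SWI6"), ("SWI6", "MBP1"), ("MBP1", "CLN1")}

    def connected(a, b, edges):
        """Is there a path of (undirected) interactions between a and b?"""
        neighbours = {}
        for x, y in edges:
            neighbours.setdefault(x, set()).add(y)
            neighbours.setdefault(y, set()).add(x)
        seen, queue = {a}, deque([a])
        while queue:
            node = queue.popleft()
            if node == b:
                return True
            for nxt in neighbours.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return False

    print(connected("SWI4", "CLN1", interactions))   # True: SWI4 - SWI6 - MBP1 - CLN1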

Another way in which data can be reduced is by means of information extraction, in which statistics are used to find patterns. Detecting patterns makes use of statistical tests, correlation-based association tests, regression approaches, and class or cluster descriptors. There are two main approaches to finding patterns: finding specified patterns (supervised) and finding unspecified patterns (unsupervised). This strongly relates to the two criteria that were specified in the definition of knowledge we gave, namely significance (is the pattern really there) and specificity (does the pattern connect with existing knowledge). Patterns found through unsupervised learning are not specified (i.e. do not directly connect with existing knowledge), and will require additional analysis (e.g. a GO enrichment on a clustering result) for them to become useful knowledge. Instead, supervised pattern finding can lead directly to both significant and specific results (knowledge), for example finding the transcription factor that binds to a significant number of promoters of a specified set of genes. One could say that the large role of supervision in machine learning has its basis in the properties of knowledge.

⁵ One could argue that this should be seen as knowledge integration. However, in our view the goal here is not to understand (the why) but to know (the what), i.e. is there a pathway, yes or no? Answering the question only requires that one determines if a known pattern (‘pathway’) occurs in the available interaction information. Knowledge integration would instead involve causality or implication.
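The GO enrichment mentioned above is one standard way to turn an unsupervised pattern, such as a cluster, into something that connects with existing knowledge. A minimal sketch using a hypergeometric test is shown below; all counts are hypothetical.

    from scipy.stats import hypergeom

    M = 6000   # genes on the array
    n = 200    # genes annotated with some GO term
    N = 50     # genes in the cluster of interest
    k = 12     # cluster genes carrying the annotation

    # Probability of observing k or more annotated genes in the cluster by chance.
    p_value = hypergeom.sf(k - 1, M, n, N)
    print(p_value)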

An often encountered difficulty in pattern extraction is that finding patterns across multiple objects requires such objects to be comparable to each other. Usually feature-based data descriptions are used, in which each object has certain attributes (features) in numerical form. However, not all attributes can be represented in numerical format. Examples of such attributes are molecule graphs or DNA sequences. To still be able to extract patterns from such objects, distance-based [34] or kernel-based data representations [35] can be used. This approach is explored in two chapters in this thesis (see section 1.5).
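As an illustration of a kernel over non-numerical objects, the sketch below computes a simple k-mer ‘spectrum’ similarity between two DNA sequences; it is a generic textbook-style example, not necessarily one of the kernels used later in this thesis.

    from collections import Counter

    def spectrum_kernel(seq_a, seq_b, k=3):
        """Inner product of the k-mer count profiles of two sequences."""
        counts_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
        counts_b = Counter(seq_b[i:i + k] for i in range(len(seq_b) - k + 1))
        return sum(counts_a[kmer] * counts_b[kmer] for kmer in counts_a)

    print(spectrum_kernel("ACGTACGT", "ACGTTTTT"))   # 4: the shared 3-mers ACG and CGT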

1.4.3. KNOWLEDGE INTEGRATION

The purpose of knowledge integration is to bring together knowledge into a system or model, such that it leads to comprehension. Such a system or model should be composed of rules, explaining how one state leads to another, either causally or by (logical) implication. Causes/implications should be seen very broadly: e.g. gravity ‘causes’ the moon to orbit around the earth, or one can ‘understand’ a person by knowing what drives him, and so on. Knowledge integration thus ranges from finding a mathematical proof, which relates ‘what is known’ through logical implication to a new theorem, to the ‘intuitive’ understanding of a person. As such, it is a broad field, in which many sciences have some role.

Within bioinformatics, one of the most prevalent knowledge integration methods is the ‘reverse engineering’ of causal models based on the observed actions of a system [36,37]. One of the main problems here is the derivation of causal relations from correlations. This is difficult, as one cannot directly derive a) the causal direction and b) the causal order. Solving the problem of causal directionality (i.e. determining if gene A influences gene B, gene B influences gene A, or both) often requires either perturbations (measuring the effects of a known intervention) or time-course measurements (measuring how a perturbation propagates through the network over time). Both indicate causal directions which can be used in the reverse engineering process. The problem of causal order is more difficult. The question here is whether an observed correlation is direct (e.g. due to a physical interaction), or whether there are other causal players in between. Solutions to this problem assume that the strongest correlations occur between direct interaction partners, and that the physical interaction network is sparse (e.g. by using Gaussian graphical models [38] or LASSO regression [39]). One can also use measurements on the physical interaction network, thereby enforcing an ‘informed sparsity’. Another solution is to use multiple perturbations and determine how the effects are nested with respect to each other [40]. The use of the physical interaction network in these types of analyses is explored in this thesis.
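The idea behind Gaussian graphical models can be illustrated with a small numpy sketch: partial correlations, read off from the inverse covariance (precision) matrix, separate direct from indirect associations. The simulated chain a -> b -> c is an assumption made only to keep the example self-contained; in practice, sparse estimators such as those referenced above are used.

    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.normal(size=500)
    b = a + 0.3 * rng.normal(size=500)        # b is driven by a
    c = b + 0.3 * rng.normal(size=500)        # c is driven by b, so a -> b -> c
    X = np.column_stack([a, b, c])

    precision = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.sqrt(np.diag(precision))
    partial_corr = -precision / np.outer(d, d)   # off-diagonal entries are partial correlations

    # a and c are strongly correlated, but their partial correlation given b is near zero,
    # suggesting that the a-c association is indirect.
    print(np.corrcoef(a, c)[0, 1], partial_corr[0, 2])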

(29)

1.5.CHAPTER OVERVIEW

1

17

1.4.4. COMPREHENSION INTEGRATION

Comprehension integration requires that one evaluates one’s understanding as obtained through knowledge integration. While machines might be able to find models that can explain a certain behaviour, in general they are not really capable of stepping back and evaluating such an ‘understanding’, let alone evaluating it in the context of other obtained understandings. Still, insights obtained by humans through comprehension integration can be applied to machine-based algorithms, by encoding them in the model-building algorithms. An example of this is the application of Occam’s razor, which numerous model-building algorithms include as an extra regularization term to guide the algorithm towards simpler solutions (e.g. [41]).
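How such a parsimony preference can be encoded is sketched below with a generic L1-penalized regression (a stand-in example, not a specific algorithm from the cited work): the penalty pushes uninformative coefficients to exactly zero, so the simpler model is preferred.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 20))
    y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)   # only 2 features matter

    # Least squares plus an alpha * sum(|w_i|) penalty acting as the 'Occam' term.
    model = Lasso(alpha=0.1).fit(X, y)
    print(np.count_nonzero(model.coef_))   # close to 2: most weights are zeroed out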

1.5. CHAPTER OVERVIEW

In this thesis, integration has been approached from several perspectives. A visualization in terms of the earlier described classification of integration levels is shown in Figure 1.3.

[Figure 1.3 is a diagram: the vertical axis lists the types of integration (data, information, knowledge, comprehension), the horizontal axis the input data type (homogeneous within-chip and between-chip data, multiple heterogeneous data sources on proteins and the physical and functional network, and data source independent), with the chapters placed on these axes: chapter 2 (Ibidas), chapters 3/4 (materials), chapter 5 (normalization), chapter 6 (kernels) and chapter 7 (perturbation networks).]

Figure 1.3: An overview of how the different topics discussed in this thesis relate to the integration types. Note that if one wants to perform information integration, one is also automatically required to perform data integration. This is represented through the light blue bar. The actual focus of the algorithm is shown through dark blue bars. The black lines indicate where multiple integration levels are connected.


CHAPTER 2: INTEGRATION ACROSS DATA STRUCTURES. LINKING DATA INTEGRATION AND DATA ANALYSIS.

As discussed in subsection 1.4.1, the optimal way to represent data is dependent on the integration task that has to be performed. In this chapter, we attempt to strengthen the link between data and information integration, thereby improving not only the interoperability between data sources, but also the interoperability with data analysis algorithms. This led to the development of a problem-solving environment for data handling in bioinformatics, allowing one to quickly integrate and explore data.

CHAPTERS 3 AND 4: INTEGRATION WITHIN A MEASURING CHIP. CHARACTERIZING MATERIAL ACTIVITY.

In chapter 4 we explore how one can extract measurements from a chip in which materials are tested for their conduciveness to various cell parameters. Two kinds of information are dealt with: one in which features are selected and extracted from the chip images, and one in which the performance of the different materials is linked to their material descriptors. We are able to create a predictive model of material performance that can predict the performance of new materials that are not on the chip.
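The last step, predicting the performance of unseen materials from their descriptors, can be sketched generically as a regression problem; the random data and the choice of model below are placeholders and do not reflect the actual pipeline of chapters 3 and 4.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(2)
    descriptors = rng.uniform(size=(300, 10))      # e.g. feature size, spacing, coverage
    response = 2 * descriptors[:, 0] + descriptors[:, 3] + 0.1 * rng.normal(size=300)

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(descriptors[:250], response[:250])   # train on measured surfaces
    predictions = model.predict(descriptors[250:]) # predict surfaces not on the chip
    print(predictions[:3])                         # predicted responses for unseen designs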

CHAPTER 5: INTEGRATION ACROSS MEASURING CHIPS. ESTIMATION AND REMOVAL OF TECHNICAL EFFECTS FOR MICROARRAYS.

Microarray gene expression is sensitive to even the slightest changes in experimental conditions. In this chapter, we use patterns in the signal across multiple chips to identify the underlying technical effects. Based on these patterns, models are created that can be fitted to these effects, which are subsequently used to remove these technical effects. We show how this significantly improves the precision of the signal across chips, allowing one to improve differential expression detection.
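For context, the sketch below shows plain quantile normalization, a common baseline for making expression distributions comparable across arrays; the chapter itself proposes a model-based alternative, and the toy data here is random.

    import numpy as np

    rng = np.random.default_rng(3)
    expr = rng.lognormal(mean=2.0, sigma=1.0, size=(1000, 4))   # 1000 probes, 4 arrays
    expr[:, 1] *= 1.5                                           # array 1 carries a global technical bias

    ranks = expr.argsort(axis=0).argsort(axis=0)          # per-array ranks of each probe
    mean_quantiles = np.sort(expr, axis=0).mean(axis=1)   # average distribution over arrays
    normalized = mean_quantiles[ranks]                    # every array now shares that distribution

    print(expr.mean(axis=0))          # array means differ before normalization
    print(normalized.mean(axis=0))    # and are identical afterwards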

CHAPTER 6: INTEGRATION ACROSS DATA SOURCES. PREDICTING INTERACTIONS USING A WEIGHTED COMBINATION OF PROTEIN AND INTERACTION INFORMATION.

In this chapter, we explore how we can predict protein interactions using numerous heterogeneous data sources, with measurements on proteins and interactions. We link these measurements using a kernel framework, and explore how we can take into account the importance of each data source by weighting their contributions to the combined kernel. We show how such weighting can improve the performance of the interaction prediction.
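The core idea of a weighted kernel combination can be sketched as follows; the kernel normalization, the toy kernels and the weight values are illustrative assumptions, not the implementation of chapter 6.

```python
import numpy as np

def normalize_kernel(K):
    """Scale a kernel matrix so that all diagonal entries equal 1,
    putting kernels from different data sources on a comparable scale."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def combine_kernels(kernels, weights):
    """Weighted sum of normalized kernel matrices.
    `kernels` is a list of (n x n) similarity matrices, one per data source;
    `weights` are non-negative importance weights (hypothetical values here)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # keep the combination on a fixed scale
    return sum(w * normalize_kernel(Ki) for w, Ki in zip(weights, kernels))

# Toy example with three random positive semi-definite kernels
# standing in for three data sources.
rng = np.random.default_rng(0)
def random_kernel(n=5):
    X = rng.normal(size=(n, 3))
    return X @ X.T

K_combined = combine_kernels([random_kernel(), random_kernel(), random_kernel()],
                             weights=[0.5, 0.3, 0.2])
print(K_combined.shape)   # (5, 5)
```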

CHAPTER 7: INTEGRATION ACROSS NETWORKS. LINKING THE PHYSICAL AND FUNCTIONAL NETWORK: PREDICTING PERTURBATION CAUSES AND EFFECTS.

This chapter explores how we can combine a multitude of features that might be predictive for the activity of a regulatory network. Specifically, we link the protein, interaction and pathway level, allowing us to combine physical and functional information. The goal of this work is to be able to predict perturbation causes and effects. This work is a combination of information integration (finding predictive feature patterns for interaction activity) and knowledge integration (combining these active interactions into a predictive network). We validated the results in Saccharomyces cerevisiae, showing that we could use the data to predict with reasonable accuracy how perturbations of genes would propagate through the regulatory network, as well as predict which perturbation could be responsible for certain observed effects.
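To make 'predicting how a perturbation propagates' concrete, the sketch below follows the edges of a directed regulatory network whose predicted activity exceeds a threshold, collecting the downstream genes. The example network, the activity scores and the threshold are made up for illustration; chapter 7 derives interaction activities from the integrated data instead.

```python
from collections import deque

def predict_effects(network, edge_score, source, threshold=0.5):
    """Breadth-first propagation of a perturbation through a regulatory network.

    `network`    : dict mapping a gene to the genes it regulates
    `edge_score` : dict mapping (regulator, target) to a predicted activity in [0, 1]
    `source`     : the perturbed gene
    Returns the set of genes predicted to be affected.
    """
    affected, queue = set(), deque([source])
    while queue:
        gene = queue.popleft()
        for target in network.get(gene, []):
            if edge_score.get((gene, target), 0.0) >= threshold and target not in affected:
                affected.add(target)
                queue.append(target)
    return affected

# Toy yeast-like network with hypothetical activity scores.
network = {"GCN4": ["HIS4", "ARG1"], "ARG1": ["ARG3"]}
edge_score = {("GCN4", "HIS4"): 0.9, ("GCN4", "ARG1"): 0.8, ("ARG1", "ARG3"): 0.3}
print(predict_effects(network, edge_score, "GCN4"))   # {'HIS4', 'ARG1'}
```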


REFERENCES

[1] J. Shapiro, Revisiting the central dogma in the 21st century, Annals of the New York Academy of Sciences 1178, 6 (2009).

[2] J. Ostrowski and L. Wyrwicz, Integrating genomics, proteomics and bioinformatics in translational studies of molecular medicine, Expert review of molecular diagnostics 9, 623 (2009).

[3] W. Feero, A. Guttmacher, and F. Collins, Genomic medicine—an updated primer, New England Journal of Medicine 362, 2001 (2010).

[4] J. Ihmels, G. Friedlander, S. Bergmann, O. Sarig, Y. Ziv, and N. Barkai, Revealing modular organization in the yeast transcriptional network, Nature genetics 31, 370 (2002).

[5] A. Barabási and Z. Oltvai, Network biology: understanding the cell's functional organization, Nature Reviews Genetics 5, 101 (2004).

[6] M. Gerstein, C. Bruce, J. Rozowsky, D. Zheng, J. Du, J. Korbel, O. Emanuelsson, Z. Zhang, S. Weissman, and M. Snyder, What is a gene, post-ENCODE? History and updated definition, Genome research 17, 669 (2007).

[7] D. Eisenberg, E. Marcotte, I. Xenarios, and T. Yeates, Protein function in the post-genomic era, Nature 405, 823 (2000).

[8] A. Corvin, N. Craddock, and P. Sullivan, Genome-wide association studies: a primer, Psychological medicine 40, 1063 (2010).

[9] G. Gibson, Hints of hidden heritability in GWAS, Nat Genet 42, 558 (2010).

[10] J. Park, S. Wacholder, M. Gail, U. Peters, K. Jacobs, S. Chanock, and N. Chatterjee, Estimation of effect size distribution from genome-wide association studies and implications for future discoveries, Nat Genet 42, 570 (2010).

[11] H. Cordell, Epistasis: what it means, what it doesn’t mean, and statistical methods to detect it in humans, Human Molecular Genetics 11, 2463 (2002).

[12] J. Ihmels, S. Collins, M. Schuldiner, N. Krogan, and J. Weissman, Backup without redundancy: genetic interactions reveal the cost of duplicate gene loss, Molecular Systems Biology 3 (2007).

[13] H. Allen, K. Estrada, G. Lettre, S. Berndt, M. Weedon, F. Rivadeneira, C. Willer, A. Jackson, S. Vedantam, S. Raychaudhuri, et al., Hundreds of variants clustered in genomic loci and biological pathways affect human height, Nature (2010).

[14] K. Pattin and J. Moore, Exploiting the proteome to improve the genome-wide genetic analysis of epistasis in common human diseases, Human genetics 124, 19 (2008).

[15] B. Stranger, E. Stahl, and T. Raj, Progress and promise of genome-wide association studies for human complex trait genetics, Genetics 187, 367 (2011).

[16] A. Califano, A. J. Butte, S. Friend, T. Ideker, and E. Schadt, Leveraging models of cell regulation and gwas data in integrative network-based association studies, Nature genetics 44, 841 (2012).

[17] C. Zins, Conceptual approaches for defining data, information, and knowledge, Journal of the American Society for Information Science and Technology 58, 479 (2007).

[18] J. Rowley, The wisdom hierarchy: representations of the dikw hierarchy, Journal of Information Science 33, 163 (2007).

[19] G. Bellinger, D. Castro, and A. Mills, Data, information, knowledge, and wisdom, URL: http://www.systems-thinking.org/dikw/dikw.htm (2004).

[20] R. Ackoff, From data to wisdom, Journal of applied systems analysis 16, 3 (1989).

[21] M. Zeleny, Knowledge-information autopoietic cycle: towards the wisdom systems, International Journal of Management and Decision Making 7, 3 (2006).

[22] W. Chun, The knowing organization: How organizations use information to construct meaning, create knowledge, and make decisions, (1998).

[23] R. Michalski, J. Carbonell, and T. Mitchell, Machine learning: An artificial intelligence approach, (1985).

[24] D. Lamb and S. Easton, Multiple discovery: The pattern of scientific progress, (1984).

[25] L. Squire, Memory systems of the brain: A brief history and current perspective, Neurobiology of learning and memory 82, 171 (2004).

[26] J. Sweller, Cognitive load during problem solving: Effects on learning, Cognitive science 12, 257 (1988).

[27] D. Hawkins, The problem of overfitting, Journal of chemical information and computer sciences 44, 1 (2004).

[28] F. Khatib, F. DiMaio, S. Cooper, M. Kazmierczyk, M. Gilski, S. Krzywda, H. Zabranska, I. Pichova, J. Thompson, Z. Popović, et al., Crystal structure of a monomeric retroviral protease solved by protein folding game players, Nature structural & molecular biology 18, 1175 (2011).

[29] A. Steinberg, Revisions to the JDL data fusion model, Tech. Rep. (DTIC Document, 1999).

[30] J. Llinas, C. Bowman, G. Rogova, A. Steinberg, E. Waltz, and F. White, Revisiting the JDL data fusion model II, (2004).

[31] A. Sheth, Changing focus on interoperability in information systems: From system, syntax, structure to semantics, Kluwer International Series in Engineering and Computer Science, 5 (1999).

[32] C. Goble and R. Stevens, State of the nation in data integration for bioinformatics, Journal of Biomedical Informatics 41, 687 (2008).

[33] A. Levy, A. Mendelzon, and Y. Sagiv, Answering queries using views, in Proceedings of the fourteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems (ACM, 1995) pp. 95–104.

[34] E. Pękalska and R. Duin, The dissimilarity representation for pattern recognition: foundations and applications, (2005).

[35] T. Gärtner, A survey of kernels for structured data, ACM SIGKDD Explorations Newsletter 5, 49 (2003).

[36] M. Bansal, V. Belcastro, A. Ambesi-Impiombato, and D. Di Bernardo, How to infer gene networks from expression profiles, Molecular Systems Biology 3 (2007).

[37] F. Markowetz and R. Spang, Inferring cellular networks–a review, BMC Bioinformatics 8, S5 (2007).

[38] A. Dobra, C. Hans, B. Jones, J. Nevins, G. Yao, and M. West, Sparse graphical models for exploring gene expression data, Journal of Multivariate Analysis 90, 196 (2004).

[39] N. Meinshausen and P. Bühlmann, High-dimensional graphs and variable selection with the lasso, The Annals of Statistics 34, 1436 (2006).

[40] F. Markowetz, J. Bloch, and R. Spang, Non-transcriptional pathway features reconstructed from secondary effects of RNA interference, Bioinformatics 21, 4026 (2005).

[41] E. Someren, L. Wessels, M. Reinders, and E. Backer, Regularization and noise injection for improving genetic network models, Computational and statistical approaches to genomics, 279 (2006).

(35)

2. IBIDAS: QUERYING FLEXIBLE DATA STRUCTURES TO EXPLORE HETEROGENEOUS BIOINFORMATICS DATA

Marc Hulsman, Jan J. Bot, Arjen P. de Vries, Marcel J.T. Reinders

This chapter has been published in Data Integration in the Life Sciences (DILS), 7970: 23-37 (2013) [1]. Reproduced with kind permission from Springer Science and Business Media.


Background: Nowadays, bioinformatics requires the handling of large and diverse datasets. Analyzing this data often demands significant custom scripting, as reuse of code is limited due to differences in input/output formats between both data sources and algorithms. This recurring need to write data-handling code significantly hinders fast data exploration.

Results: We argue that this problem cannot be solved by data integration and standardization alone. We propose that the integration-analysis chain misses a link: a query solution which can operate on diversely structured data throughout the whole bioinformatics workflow, rather than just on data available in the data sources. We describe how a simple concept (shared 'dimensions') allows such a query language to be constructed, enabling it to handle flat, nested and multi-dimensional data. Due to this, one can operate in a unified way on the outputs of algorithms and the contents of files and databases, directly structuring the data in a format suitable for further analysis. These ideas have been implemented in a prototype system called Ibidas. To retain flexibility, it is directly integrated into a scripting language.

Conclusions: We show how this framework enables the reuse of common data operations in different problem settings, and for different data interfaces, thereby speeding up data exploration.


2.1. INTRODUCTION

Research in the field of biological systems has become a strongly data-driven activity. Measurements are performed at multiple levels (genomics, transcriptomics, etc.), and combined with already-available information, which can be accessed through the more than 1300 available public data sources [2]. Handling these large and diverse datasets can be a time-consuming and complex task, requiring the development of many custom-written data-handling scripts. This problem has attracted significant attention from researchers, which has led to the development of numerous approaches to improve this process [3].

Generally this is solved in a bottom-up fashion, where one starts from the data sources and enables structured access to the data, for example by making use of warehouses, webservices or the semantic web, e.g. [4–6]. Bottom-up approaches have some limitations however. Data interfaces offer relatively limited functionality (for computational and security reasons, as queries often run on public servers). Furthermore, queries spanning multiple data sources are often not supported, as data can only be queried when it is available within a common (or in case of the semantic web: similar) data store. For example, comparing organisms by linking BLAST results with gene ontology (GO) annotation data requires manual linking of the two data sources. Finally, once the data has been retrieved and processed, further use of the query functionality offered by the data sources is only possible by importing the data back into a data store. For example, performing queries on the results of a differential expression analysis requires one to put the data back into a database/triple store. Due to this overhead, most users will elect to write custom data-handling code instead. Especially within a bioinformatics research context, the mentioned limitations are encountered often.
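To illustrate the kind of custom 'glue' code referred to here, the sketch below joins a BLAST tabular output to GO annotations by hand. The file names and the two-column annotation format are assumptions made for this example.

```python
import csv
from collections import defaultdict

# Hypothetical GO annotation file: columns <gene_id> <go_term>
go_terms = defaultdict(set)
with open("go_annotation.tsv") as fh:
    for gene_id, go_term in csv.reader(fh, delimiter="\t"):
        go_terms[gene_id].add(go_term)

# Hypothetical BLAST tabular output (-outfmt 6): query, hit, % identity, ...
# Keep only strong hits and attach the GO terms of the hit gene.
results = []
with open("blast_hits.tsv") as fh:
    for row in csv.reader(fh, delimiter="\t"):
        query, hit, identity = row[0], row[1], float(row[2])
        if identity >= 90.0:
            results.append((query, hit, sorted(go_terms.get(hit, ()))))

for query, hit, terms in results[:5]:
    print(query, hit, ",".join(terms))
```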

Why do so many trivial data-handling tasks still require custom scripting solutions, while high-level data-handling query languages such as SPARQL, SQL or XPath are available? Fundamentally, the underlying cause for these problems is that both data integration and data analysis play a large role in bioinformatics. These two tasks have very different requirements. Data integration favors the absence of data structures, such as tables/matrices/nested arrays, as mapping these structures onto each other can be a difficult process. Data analysis on the other hand requires such data structures to allow for easy reasoning and aggregation across related data elements (Figure 2.1ab). Current query languages however do not support this complete range of data structuring, but only a limited subset. For example, RDF/SPARQL focuses on data integration, reducing datasets to collections of single facts; similarly, SQL focuses on relational tables; and XPath queries are used to query hierarchical descriptions of objects (XML). None of these query languages handle analysis-focused data structures (e.g. matrices) well. Support for data-handling operations within and between algorithms is therefore more or less absent, while this is exactly the area where it is most often needed. Therefore, most of the complex data-handling operations are still performed by the user, often by implementing them in custom-written scripts.

To solve this problem, we propose a query language that can operate on data, irrespective of whether it is stored in simple or more complicated data structures. That is, we solve data-handling issues at the language level ('top-down'). Note that the bottom-up and top-down approaches are complementary: top-down needs bottom-up, as it
