
Poznań University of Technology
Faculty of Computing
Institute of Computing Science

and

European Center for Bioinformatics and Genomics

PhD thesis

GPU ACCELERATED GRAPH CONSTRUCTION FOR THE WHOLE GENOME ASSEMBLY

Michał Kierzynka

Promotor

prof. zw. dr hab. inż. Jacek Błażewicz

Assistant promotor

dr inż. Paweł Wojciechowski

Poznań, 2014


I would like to thank:

both supervisors for guidance and useful remarks,

Wojciech Frohmberg for many years of joint work,

my parents for their love and support,

my fiancée for motivation and her optimism,

and everyone who gave me a helping hand.


Contents

1 Introduction
   1.1 Computational biology and the genome assembly
   1.2 Goals and scope of the thesis

2 Biological primer
   2.1 DNA, RNA and genetic code
   2.2 Protein structure
   2.3 Sequence alignment

3 Basic notions in algorithms theory and computational complexity
   3.1 Graph theory
   3.2 Computational complexity
   3.3 Dynamic programming
   3.4 Pairwise sequence alignment algorithms
      3.4.1 The Needleman-Wunsch algorithm
      3.4.2 The Smith-Waterman algorithm

4 Genome sequencing technologies
   4.1 The Sanger method
   4.2 Sequencing by hybridization
   4.3 Next-Generation Sequencing
      4.3.1 454 sequencing
      4.3.2 Illumina sequencing
      4.3.3 SOLiD sequencing
      4.3.4 Ion Torrent sequencing
   4.4 Summary

5 DNA assembly problem
   5.1 Definition
   5.2 Methods
      5.2.1 Original DNA graph models
      5.2.2 DNA overlap graphs
      5.2.3 Induced subgraphs of de Bruijn graphs
      5.2.4 Discussion
   5.3 State-of-the-art algorithms
   5.4 Conclusions

6 Pairwise sequence alignment on GPU
   6.1 Background
   6.2 GPGPU and CUDA programming model
   6.3 GPU implementation of sequence alignment
   6.4 Computational tests

7 Algorithm for the DNA overlap graph construction
   7.1 K-mer characteristics of the sequences
   7.2 Selection of overlapping sequences and graph construction
   7.3 Improvements of the original algorithm
   7.4 Analysis of the algorithm with respect to its parallelization

8 Computational tests
   8.1 Description of data sets
      8.1.1 In silico data sets
      8.1.2 Whole genome data sets
   8.2 Quality of the graph
      8.2.1 Results on in silico data sets
      8.2.2 Results on whole genome data sets
   8.3 Testing environment
      8.3.1 Time and memory consumption
      8.3.2 Scalability of the parallel algorithm

9 Conclusions

A Software usage
   A.1 Capabilities of the program
   A.2 Parameters of the program
   A.3 System requirements
   A.4 Software availability

Bibliography


Abstract

Over the last few decades molecular biology has developed very significantly as a science. However, the data coming from biochemical experiments could not be analyzed in depth without advanced computational methods. It is at the intersection of these two fields that the science called bioinformatics was born. One of the key problems of bioinformatics is de novo DNA assembly, which consists in reconstructing the sequence of an unknown genome from overlapping short DNA fragments read by a sequencer. The problem belongs to the class of strongly NP-hard problems, and it can therefore be called computationally difficult. This is of particular importance in the context of so-called next-generation sequencing, in which the amount of data coming from a single experiment is much larger than in the case of traditional sequencing.

This thesis is a response to the problem outlined above. In particular, it presents a new heuristic algorithm for the construction of DNA overlap graphs, which in turn are widely used in the process of de novo assembly. In this model a directed graph G = {V, A} is considered, in which the set of vertices V corresponds to the individual sequences, and the arcs from set A join overlapping DNA fragments. The proposed algorithm is based on a k-mer analysis of biological sequences. For every sequence the algorithm computes its so-called full and partial k-mer characteristics, which are subsequently used to sort the sequences. The sorting is performed in such a way as to maximize the probability that similar sequences occur within a close neighborhood, inside which individual pairs of sequences form so-called promising pairs. Next, their mutual similarities, or more precisely their overlap properties, are verified with the exact Needleman-Wunsch pairwise alignment algorithm. Pairs of sequences that successfully pass the verification are joined in the graph by a directed arc. The proposed algorithm has complexity O(n · log_2 n + n), where n is the number of sequences in a given problem instance. The quality of the graph is then further improved with four additional procedures. As a result, the accuracy of the proposed algorithm is very high.

The algorithm presented in the thesis was tested on a variety of data sets, ranging from synthetically generated ones to data coming from real sequencers. The former made it possible to precisely measure the quality of the constructed graphs using a confusion matrix. The sensitivity of the proposed method ranged between 97% and 99%, which, combined with a high precision reaching 99%, gave very good results. It was also shown that the method copes very well even with data containing sequencing errors. Moreover, the quality of the graph was verified indirectly on the basis of two data sets coming from an Illumina sequencer. The method was compared with the best available de novo assembly algorithms and in many cases surpassed the other methods in the quality of the generated solutions. This proves not only the high accuracy of the graph construction method, but also its ability to efficiently process large data sets.

To make the computations run in as short a time as possible, much attention was paid to the parallelization of the method. The most time-consuming procedure, i.e. the exact alignment of sequences, was implemented on graphics cards. Tests carried out on a wide spectrum of real data showed a very high performance of the proposed method and an almost linear scalability when multiple graphics cards are used. As a result, the discussed graph construction algorithm can verify, in a reasonably short time, a larger number of promising pairs of sequences within a neighborhood, which positively affects the quality of the constructed solution. Furthermore, the remaining parts of the algorithm were also analyzed with respect to parallelization, which was carried out with more traditional multithreading models on central processing units.

Summing up, the thesis presents a new, highly efficient and accurate method for the construction of DNA overlap graphs. The obtained results are very promising, which encourages further work. A natural continuation of the presented achievements is the development of an advanced method for traversing the graph and assembling so-called contigs and scaffolds, i.e. reconstructed fragments of a genome. However, this is a very complex topic, which goes beyond the scope of this thesis.


Chapter 1

Introduction

1.1 Computational biology and the genome assembly

Modern science, although divided into many different branches, increasingly often tends to merge two or more of them to study and eventually solve a given problem more deeply and more efficiently. A prime example of this phenomenon is the interdisciplinary combination of biology and computer science that has become commonplace within the last few decades. Even though computer science is considered to be a relatively new field, the first algorithms date back to ancient times, e.g. the famous sieve of Eratosthenes.

Indeed, the first functional computers were not developed until the 1930s [RRH00], but the initial concept of a computing engine was introduced one century earlier by Charles Babbage, an English polymath who is considered to be the father of the computer [Hal70]. Likewise, biologically related studies also have their origins in the ancient civilizations, but the term biology came into its modern usage at the beginning of the 19th century [Ric02]. However, a discovery that marked the beginning of modern molecular biology was made only in 1953, when the structure of deoxyribonucleic acid (DNA) was discovered by James Watson and Francis Crick [WC53], supported by Rosalind Franklin and Raymond Gosling [FR53] as well as Maurice Wilkins et al. [WMH53]. This pioneering work was of great importance, as DNA is the basic carrier of genetic information of animate matter and defines the rules governing all living organisms. Since then, the scientific community has observed a rapid development of the aforementioned branch of biology, which deals with the molecular basis of biological activity. However, the large data volumes generated by biochemical experiments could not be processed without computer methods and algorithms. In this context a new field of science has emerged: computational biology, otherwise known as bioinformatics. It has not only allowed existing problems to be better understood, but has also opened up a whole spectrum of completely new possibilities. As a result, the determination of long genomes of living organisms became feasible. Moreover, people started to analyze the sequence of nucleic acids and the structure of proteins, which made it possible to reveal their functions. This had a lot of practical implications, e.g. the discovery of the causes of many diseases or the development of new drugs and vaccines, among others.

However, it is not only computer science that contributes to the development of biologically related problems. A number of advances in computer science were in turn inspired by biological phenomena like natural evolution/coevolution or the human immune system, e.g. in computer networks [OSB07, SP12] or scheduling theory [SS14].

One of the projects that had a great impact on the knowledge in molecular and computational biology was the Human Genome Project (HGP). The ultimate goal of this scientific venture, launched in 1990 and lasting 13 years, was to investigate in depth the sequence and structure of the human genome. The foremost motivation behind this kind of activity was, and still is, the hope of inventing new cures for diseases that so far were considered in medicine as incurable. Although the project was declared complete in 2003, the initial drafts of the human genome were already published in 2001 [Con01, Ven01] by the HGP Consortium and the commercial Celera Corporation, respectively. The research revealed the staggering number of over 3 billion base pairs coding 20-25 thousand genes. Exploration and processing of such an amount of data were possible only due to the development of technology that usually accompanies this type of research. Hence, this project is a perfect example of mutual cooperation between biochemists and computer scientists. Furthermore, the high complexity level of models as well as the large amount of data are the main reasons for the constantly growing number of efficient bioinformatic algorithms and refined methods, which are indispensable in many different areas of molecular biology.

The analysis of genetic sequences has become absolutely essential over the past years, and currently almost every research effort in molecular biology involves this kind of analysis [AH04]. Even though sequencing technology has evolved quickly over the last four decades, biochemical limitations still prevent the machinery from reading long DNA molecules. This means that it is not feasible to read the whole genome at once. Instead, only short DNA strands of at most a few hundred nucleotides may be sequenced. For this reason, once the process of sequencing is finished, the original genome needs to be reconstructed, i.e. assembled, by some computational method. This reconstruction is based on the overlaps between pairs of individual reads. It is worth noting that if the original sequence of a genome is not known beforehand, this process is called de novo assembly. Otherwise, it is often referred to as resequencing.

Historically, the first popular sequencing method was developed by Frederick Sanger and his colleagues in 1977 [SNC77]. From then on, the chain termination method was the most widely used sequencing technique for around 25 years. In 1988 Ed Southern proposed a new method called sequencing by hybridization (SBH) [Sou88]. This initiated a wider scientific discussion regarding an effective reconstruction method from short DNA fragments [BS88, LFK+88, BKK+97, BHKdW99, BFK+99]. It also turned out that graph theory is very useful in this context, as graphs constitute a good representation of overlapping oligonucleotides (which are the equivalents of reads in SBH) and traversing algorithms may be used to reconstruct the desired sequence, in this case of a known length.

Although Pavel Pevzner suggested in 1989 that this problem may be solved polynomially when no sequencing errors are present [Pev89], this question was finally settled only 10 years later in [BHKdW99]. The more realistic model with errors was shown to be strongly NP-hard [BK03]. However, with the advent of Next-Generation Sequencing (NGS) technologies, e.g. 454 Life Sciences in 2005 [MEea05] or Illumina in 2007, things got even more difficult. The main differences between traditional sequencing methods and NGS are the amount of produced data and the cost per sequenced nucleotide: NGS technologies tend to output huge data volumes for much less money. Therefore, sequencing of whole genomes became more affordable, but at the same time the associated computational difficulty increased substantially. Additionally, in NGS methods the reads come from both strands of the DNA double helix and it is not known from which one. As a result, the number of sequences to consider needs to be doubled.

The process of DNA de novo assembly is computationally hard and its complexity depends mainly on the number of sequences. Under some strict assumptions the problem may be formulated as the Shortest Superstring Problem (SSP), which was shown to be NP-hard [GMA80]. However, in realistic biochemical experiments reads usually contain errors, a genome is covered unevenly and it contains repetitive regions. These properties, combined with the immense number of sequences produced by NGS, make the outlined problem extremely difficult. To address this problem efficiently, several graph-based heuristic approaches have been proposed recently, e.g. [ZB08, KTN+14, SD12, BBF+09], and many of them place a special emphasis on parallelization. However, none of them was designed to take advantage of modern massively parallel computing architectures, e.g. graphics cards. The graphics processing units (GPUs), although primarily designed to render three-dimensional scenes, have gained a lot of attention in recent years, as they may serve as efficient accelerators also for general-purpose computations. Interestingly, the computational power of GPUs exceeded that offered by central processing units (CPUs) in the year 2003, and the gap between these two architectures has been growing ever since.

However, due to the specific hardware architecture, not every algorithm may be executed on a GPU efficiently. Fortunately, recent research [LR09, BFK+11, FK10, BFKW13, FKBW12] has shown a great potential of graphics processing units in bioinformatics. One disadvantage of such algorithms is that they are often much more complex than their CPU equivalents. To partially overcome this problem the idea of hybrid programming [CLC08] was introduced. According to this idea, those calculations that cannot be efficiently performed on a GPU are executed on a traditional CPU. The higher processing power delivered by accelerators may speed up the computations and, if used properly in heuristic algorithms, increase their accuracy. This is especially important in the considered problem, as the quality of assembled DNA fragments is often unsatisfactory. This may result both from simplified graph models and from traversing algorithms, which can now be more sophisticated given more computational power. Therefore, there is still a lot of headroom for improvements in the outlined problem.

1.2 Goals and scope of the thesis

The primary goal of this thesis is to propose an accurate, parallel and efficient method for graph construction that may subsequently be used in the process of whole genome de novo assembly. The proposed algorithm also needs to be implemented and tested on various data sets in order to evaluate its performance. Such a formulation of the problem is a direct result of the discussion presented in the introductory section and will be explained in more detail throughout this work. The realization of the main goal will advance the science in the area of DNA assembly as well as parallel processing.

Since the scope of this research is somewhat broad, let us now outline the main points that are considered in this thesis. First of all, based on the literature, the author examined different graph representations for the DNA assembly together with the corresponding pros and cons. As a result, the best model was chosen for further considerations. The next activity concerns the development of a highly parallel GPU-accelerated library for pairwise sequence alignment. This implementation is a vital part of the algorithm for graph construction, which heavily relies on the accurate comparison of biological sequences. Therefore, the implementation was extensively tested and compared to other state-of-the-art solutions. However, the central point of this thesis is a novel algorithm for graph construction. It is mainly based on a detailed k-mer analysis of the biological sequences. The k-mer approach was selected as a method of alignment-free sequence comparison and it was employed to efficiently preselect the pairs of sequences that are likely to overlap. The graph construction algorithm is studied in detail together with all the proposed improvements. Furthermore, an analysis of the algorithm with respect to its parallelization was performed as well.

At this point it is worth noting that the entire algorithm was implemented using the C/C++ programming language and some external libraries like NVIDIA CUDA, Pthreads and OpenMP. The implementation was subsequently tested on two types of data sets: in silico (that to some extent were created by a computer) and real paired-end sequences coming from Illumina sequencing machines (the data come from the University of Luxembourg and the European Center for Bioinformatics and Genomics, among others). The former type of data sets was required to precisely measure the accuracy of the designed algorithm. On the other hand, the real data sets were used to test the method on large, real-world instances and to indirectly estimate the quality of the resulting graph.

It should also be stressed that the algorithm for graph traversal, i.e. the part of the DNA assembly that is responsible for the actual reconstruction of contigs/scaffolds, is out of the scope of this thesis. Nevertheless, it is ongoing research that is being carried out at the Poznan University of Technology.

The organization of this thesis is as follows. Chapter 2 presents a primer on molecular biology. Its goal is to provide a glimpse into the world of DNA, RNA, proteins and sequence alignment. Chapter 3, on the other hand, introduces the basic notions and definitions from graph theory and the theory of algorithms. Moreover, it presents the idea of dynamic programming and the basic algorithms for pairwise sequence alignment. This information is intended to give the reader a better understanding of the problem from the computational point of view. In Chapter 4 the most popular sequencing technologies are briefly explained. Chapter 5, in turn, presents an informal definition of the DNA assembly problem. It also discusses the two graph representations that may be found in the literature. Finally, the pros and cons of selected state-of-the-art algorithms for graph construction are considered. The architecture of a GPU as well as the background of sequence alignment methods using this architecture are presented in Chapter 6. Moreover, this chapter focuses on the details of the new implementation, which is also extensively tested. Chapter 7 describes a novel algorithm for the DNA overlap graph construction and explains how the k-mer characteristics are employed. Furthermore, the parallelization analysis of the proposed algorithm is also carried out. Chapter 8 presents the properties of the data sets which are then used to assess the accuracy and performance of the algorithm. Conclusions are drawn in Chapter 9. Areas of possible further development are pointed out as well. Finally, Appendix A contains relevant information regarding the practical aspects of software usage. The bibliography is provided at the very end of this thesis.


Chapter 2

Biological primer

The DNA assembly is an interdisciplinary problem involving two branches of science: biology and computer science. Therefore, in order to properly understand this problem, a brief introduction to molecular biology is required as well. This chapter is an attempt to explain some of the basic concepts in this area, e.g. what DNA and RNA are, or how we define a gene and the genetic code. Moreover, a separate section presents the idea behind protein structure. Finally, the problem of pairwise sequence alignment is also outlined.

2.1 DNA, RNA and genetic code

Deoxyribonucleic acid (DNA) molecules are the basic carriers of genetic information used to construct, develop and maintain all known living organisms and many viruses. They are also responsible for the heredity of living organisms, i.e. passing genetic traits to offspring, and thereby for the process of evolution on Earth. An elementary DNA unit carrying information is called a nucleotide. A DNA chain consists of a sequence of nucleotides. At a lower level, each nucleotide is composed of a nucleobase (nitrogenous base), a five-carbon sugar (2-deoxyribose) and a phosphate group. Furthermore, depending on the type of nitrogenous base, four different nucleotides are distinguished: cytosine (C), guanine (G), adenine (A) and thymine (T). In a single DNA strand the individual nucleotides are linked with asymmetric phosphodiester bonds and, as a result, form a so-called polynucleotide. The asymmetric ends of a polynucleotide are called the 5' and 3' ends. The DNA molecules are double-stranded: the two chains combined by hydrogen bonds constitute a right-handed spiral (with about 10-10.5 nucleotides per turn), commonly known as a double helix. The hydrogen bonds link pairs of complementary nucleotides, i.e. adenine with thymine and cytosine with guanine. Two such nucleotides are called a base pair. Because of the complementary base pairing, a sequence of nucleotides from a single strand is sufficient to represent the whole genetic information. However, both strands are needed e.g. in the process of DNA replication. A diagram of DNA structure is presented in Figure 2.1.

Figure 2.1: The structure of deoxyribonucleic acid (DNA). Illustration by artist Darryl Leja. Courtesy: National Human Genome Research Institute.

The complete genetic material of a given organism, called its genome, is stored inside the cell nucleus in the form of chromosomes, see Figure 2.2. For the vast majority of life forms, the basic information carrier is DNA. However, for many types of viruses the genetic information is encoded by ribonucleic acid (RNA). The chemical structure of RNA is very similar to DNA. The main differences include: RNA contains ribose instead of deoxyribose, and thymine is replaced in RNA by a nitrogenous base called uracil (U). Moreover, RNA is usually single-stranded, while DNA is double-stranded. The RNA molecules also play a key role in the process of protein synthesis. The flow of genetic information from DNA to proteins through RNA constitutes the central dogma of molecular biology [Cri56, Cri70]. This process is depicted in Figure 2.3 and is briefly described below. First of all, it involves a few types of RNA molecules:

• mRNA (messenger RNA): transcribed directly from a DNA strand, it is then translated into a polymer of amino acids (a protein),

• tRNA (transfer RNA): reads the genetic code and transfers amino acids to the growing polypeptide chain during the process of translation,

• rRNA (ribosomal RNA): the RNA component of the ribosome; it provides a mechanism for decoding mRNA into amino acids.


Figure 2.2: DNA and chromosomes within the cell nucleus. By artist Darryl Leja. Courtesy: National Human Genome Research Institute.

The whole process starts with so-called transcription, in which the genetic information held in DNA is transferred into new mRNA molecules. Subsequently, the process of translation takes place, which transforms the mRNA molecules into polypeptide chains, called proteins. More precisely, the ribosomes read the information from mRNA in the form of so-called triplet codons. These are the three-nucleotide-long subsequences in the mRNA chain which unambiguously determine the corresponding amino acids, cf. Table 2.1. The codons are then matched to the anti-codons in the tRNA and thereby consecutive amino acids are added to the growing polypeptide chain. It is worth noting that even though there are 4^3 = 64 different codons, they correspond to only 20 amino acids. Furthermore, only 61 codons actually represent amino acids, and the three remaining combinations serve as terminators. The process of translation begins with a start codon (AUG) and proceeds until a stop codon is identified. The polypeptide chains, or proteins, are finally released from the ribosome.

Figure 2.3: A diagram of the central dogma of molecular biology. Modified version of a figure by Jane Wang from http://scq.ubc.ca

The way the genetic information is translated into polypeptide chains is referred to as the genetic code. This is a universal code and in general applies to almost all living organisms.

1st \ 2nd       U              C              A              G

U          UUU Phe, F     UCU Ser, S     UAU Tyr, Y     UGU Cys, C
           UUC Phe, F     UCC Ser, S     UAC Tyr, Y     UGC Cys, C
           UUA Leu, L     UCA Ser, S     UAA Stop       UGA Stop
           UUG Leu, L     UCG Ser, S     UAG Stop       UGG Trp, W

C          CUU Leu, L     CCU Pro, P     CAU His, H     CGU Arg, R
           CUC Leu, L     CCC Pro, P     CAC His, H     CGC Arg, R
           CUA Leu, L     CCA Pro, P     CAA Gln, Q     CGA Arg, R
           CUG Leu, L     CCG Pro, P     CAG Gln, Q     CGG Arg, R

A          AUU Ile, I     ACU Thr, T     AAU Asn, N     AGU Ser, S
           AUC Ile, I     ACC Thr, T     AAC Asn, N     AGC Ser, S
           AUA Ile, I     ACA Thr, T     AAA Lys, K     AGA Arg, R
           AUG Met, M     ACG Thr, T     AAG Lys, K     AGG Arg, R

G          GUU Val, V     GCU Ala, A     GAU Asp, D     GGU Gly, G
           GUC Val, V     GCC Ala, A     GAC Asp, D     GGC Gly, G
           GUA Val, V     GCA Ala, A     GAA Glu, E     GGA Gly, G
           GUG Val, V     GCG Ala, A     GAG Glu, E     GGG Gly, G

Table 2.1: Genetic code (rows: 1st nucleotide, columns: 2nd nucleotide; within each cell the 3rd nucleotide varies)
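To make the use of Table 2.1 concrete, the following sketch (an illustration added here, not part of the thesis software) translates an mRNA fragment into a peptide exactly as described above: it reads triplet codons from the start codon until a stop codon appears. Only a handful of codons are included in the lookup table for brevity; a complete version would list all 64 entries.

    #include <iostream>
    #include <string>
    #include <unordered_map>

    int main() {
        // Small subset of the genetic code from Table 2.1; '*' marks a stop codon.
        const std::unordered_map<std::string, char> codon_table = {
            {"AUG", 'M'}, {"UUU", 'F'}, {"GCU", 'A'}, {"UGG", 'W'},
            {"UCU", 'S'}, {"UAA", '*'}, {"UGA", '*'}, {"UAG", '*'}};

        const std::string mrna = "AUGUUUGCUUAA";  // hypothetical transcript
        std::string peptide;

        // Translation: read consecutive triplet codons until a stop codon.
        for (std::size_t i = 0; i + 3 <= mrna.size(); i += 3) {
            const char aa = codon_table.at(mrna.substr(i, 3));
            if (aa == '*') break;  // terminator: the polypeptide is released
            peptide += aa;
        }
        std::cout << peptide << '\n';  // prints "MFA"
    }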

Finally, it should be stressed that not every DNA fragment is subject to transcription and thereby indirectly to protein synthesis. Only selected parts of DNA, called genes, participate in this process. However, the notion of a gene is much more complex and actually wider. A gene may be defined as: a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions, and/or other functional sequence regions [Pea06, Pen07]. Therefore, it might be a stretch of DNA or RNA coding a polypeptide or an RNA chain holding a function in the organism. As a result, genes are responsible for building and maintaining the whole living organism. Importantly, genes are also responsible for passing genetic traits to offspring.

2.2 Protein structure

Proteins are organic compounds consisting of at least one polypeptide chain, each of which is made of amino acids arranged in a linear way. Amino acids that have been incorporated into polypeptides are also referred to as residues. The sequence of residues in a polypeptide is determined by the sequence of a gene and is called the primary structure. The proteins of living organisms consist of 20 types of different amino acid monomers, which differ in polarity and in charge at the physiological pH of 7.4, see Table 2.2. Because of different patterns of hydrogen bonds between backbone amide and carboxyl groups, proteins usually form some specific, non-random conformations called secondary structure. The basic types of conformations include α-helices, β-pleated sheets and β-hairpins. Moreover, amino acid side chains may interact and bond in a variety of ways. These interactions and bonds determine the protein tertiary structure, which is described by the coordinates of the molecule's atoms in three-dimensional space. However, the structure depends not only on disulfide and hydrogen bonds or hydrophobic interactions but also on the environment in which a given protein is synthesized and folded. Furthermore, the tertiary structure of a protein is considered to be largely determined by its primary and secondary structure and, therefore, to some extent it is possible to predict this spatial arrangement from the sequence of amino acids and their conformations. The folding process of protein molecules is often simulated in silico (by a computer program) [SNPG02, VBBP10], though this is a very complex task. Such simulation capabilities are at the cornerstone of some disease research and computational drug design, among others. The three-dimensional structure of proteins may also be determined to some degree by means of X-ray crystallography or nuclear magnetic resonance spectroscopy. Additionally, as already mentioned, a protein may be composed of more than one polypeptide chain, each of which is also known as a protein subunit. In this case its structure depends on all the subunits and is referred to as the quaternary structure. In other words, the quaternary structure describes the arrangement of multi-subunit complexes.

Name           Abbreviations   Side chain polarity   Side chain charge (pH 7.4)
Alanine        Ala, A          nonpolar              neutral
Arginine       Arg, R          polar                 positive
Asparagine     Asn, N          polar                 neutral
Aspartic acid  Asp, D          polar                 negative
Cysteine       Cys, C          nonpolar              neutral
Glutamic acid  Glu, E          polar                 negative
Glutamine      Gln, Q          polar                 neutral
Glycine        Gly, G          nonpolar              neutral
Histidine      His, H          polar                 mostly neutral
Isoleucine     Ile, I          nonpolar              neutral
Leucine        Leu, L          nonpolar              neutral
Lysine         Lys, K          polar                 positive
Methionine     Met, M          nonpolar              neutral
Phenylalanine  Phe, F          nonpolar              neutral
Proline        Pro, P          nonpolar              neutral
Serine         Ser, S          polar                 neutral
Threonine      Thr, T          polar                 neutral
Tryptophan     Trp, W          nonpolar              neutral
Tyrosine       Tyr, Y          polar                 neutral
Valine         Val, V          nonpolar              neutral

Table 2.2: Amino acids

2.3 Sequence alignment

In the context of computational biology, sequence alignment is defined as a process of determining the corresponding nucleotides or amino acid residues in two or more biologically related sequences. Therefore, sequence alignment helps to identify regions of similarity that may be a result of functional, structural or evolutionary relationships among the sequences. In particular, when two sequences are the subject of such a comparison, the process is called pairwise sequence alignment. Multiple sequence alignment (MSA) extends this approach to more sequences by introducing more sophisticated methods [BFKW13, FK10]. This section focuses, however, on pairwise sequence alignment, as it is exploited later in this thesis.

Depending on the application and the similarity of the sequences, one may employ different methods of alignment. In general, three basic types of pairwise sequence alignment may be distinguished:

• global: the goal of the global alignment is to arrange every single residue from both sequences; it is most useful when the sequences are similar and roughly of equal length,

• semi-global: the semi-global alignment is very similar to the global alignment, but terminal residues (i.e. those at the beginning or at the end of the sequences) do not have to be aligned; it may be especially useful when the first part of one sequence overlaps with the second part of the other sequence,

• local: two sequences may not be very similar but may contain highly similar regions; the local alignment is an attempt to find such regions.

Figure 2.4: Examples of pairwise sequence alignment: (a) input sequences, (b) global alignment, (c) semi-global alignment, (d) local alignment.

A typical representation of aligned sequences is in the form of a matrix where the rows contain the sequences, and identical or similar characters are aligned in successive columns. Example global, semi-global and local alignments are presented in Figure 2.4. The sequences may be of different length and therefore they may be supplemented by gaps, denoted by '-'. Note that each column contains at least one character. This means that a column containing two gaps is redundant and may be removed without any information loss.

Let us now have a brief glance at the semantic meaning of mismatches and gaps within alignments. When two sequences are related, it usually means that they have diverged from a common ancestor by mutation processes. The most common mutation is when one nucleotide or residue is changed into another. An example of this event, called substitution, may be observed in the fourth column of the local alignment in Figure 2.4d (C has been substituted by A, or vice versa). Another kind of mutation is when a part of a sequence is lost or removed, which is referred to as deletion. Although deletions frequently affect only single characters, longer segments may be missing as well. One example of such a deletion (but in two different interpretations) can be seen in Figures 2.4b and 2.4c, more specifically between ATT and the second part of the first sequence. The last type of mutation, called insertion, refers to the opposite situation, i.e. when one or more nucleotides/residues are inserted into a sequence. In fact, deletion and insertion are indistinguishable given that no additional knowledge about the sequences is available. In other words, when two sequences are compared, it usually cannot be determined which mutation has occurred: insertion into the first sequence or deletion from the second one. As a result, these events are often referred to as indels.


Chapter 3

Basic notions in algorithms theory and computational complexity

The chapter presents basic notions and definitions from the theory of algorithms and computational complexity needed to understand the combinatorial nature of the methods presented later in this thesis. At the beginning a short compendium of graph theory fundamentals is outlined. Subsequently, the notions from the theory of combinatorial problems are explained. These include, i.a., the big O notation, the P and NP classes, the polynomial transformation and computational complexity. The reader may find the latter definition particularly useful, e.g. in the context of designing efficient, parallel methods. Finally, the idea of dynamic programming is introduced, as the algorithm proposed in this thesis is partially based on this concept.

3.1 Graph theory

This section focuses on presenting the basic notions from graph theory. However, it is only a brief introduction to this topic and more detailed information may be found in the literature, e.g. [CLZ10, GY03].

An undirected graph G is an ordered pair G = {V, E} consisting of a finite nonempty set of objects called vertices V and a finite, but possibly empty, set of unordered pairs of usually distinct vertices of G called edges E.

A directed graph D, otherwise known as a digraph, is an ordered pair D = {V, A} consisting of a finite nonempty set of objects called vertices V and a finite, but possibly empty, set of ordered pairs of usually distinct vertices of D called arcs A. In the literature, the arcs are also referred to as directed edges.

Directed graphs are a more natural representation when the order or direction between pairs of objects in a given problem does matter, as is the case with the DNA overlap graphs considered in this thesis. For this reason, more emphasis in this section is placed on digraphs, although the corresponding notions are usually quite similar also for undirected graphs.

A single arc a = (u, v) joins two vertices u, also called the tail, and v, called the head. It is also said that vertex u is a direct predecessor of vertex v and vertex v is a direct successor of vertex u. In this case, u and v are adjacent vertices, while u and a are incident, and so are v and a. Moreover, two adjacent vertices may be called neighbors. If arc a = (u, u) of digraph D connects vertex u to itself, then a is called a self-loop, or simply a loop. Otherwise, i.e. when an arc connects distinct vertices, it is called a proper arc. A multi-arc is a set of two or more arcs having the same tail and the same head. A simple digraph is a digraph with no self-loops and no multi-arcs. Note that this kind of digraph is considered in the context of the DNA graphs. Finally, the underlying graph of a digraph D is an undirected graph which results from removing all the designation signs of head and tail from the arcs of D.

An important notion in graph theory is the degree of a vertex v, denoted as deg(v). In the case of undirected graphs, it is the number of proper edges (i.e. those connecting distinct vertices) incident on v plus twice the number of self-loops, also incident on v. A vertex of degree 0 is called an isolated vertex. On the other hand, in digraphs, the indegree deg+(v) and the outdegree deg-(v) of a vertex v are distinguished. The former is defined as the number of arcs directed to v. The latter refers to the number of arcs directed from v. Note that each self-loop at v increases the value of both the indegree and the outdegree of v by one. An isolated vertex v is one with deg+(v) = deg-(v) = 0.
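As a small illustration (added here; the thesis itself contains no such listing), the sketch below stores a simple digraph as adjacency lists and computes the indegree deg+(v) and outdegree deg-(v) of every vertex according to the definitions above. The arcs are arbitrary example data.

    #include <iostream>
    #include <vector>

    int main() {
        const int n = 4;                       // vertices 0..3
        std::vector<std::vector<int>> adj(n);  // adj[u] = heads of arcs with tail u

        // Arcs of a small example digraph (a simple digraph: no self-loops,
        // no multi-arcs).
        adj[0] = {1, 2};
        adj[1] = {2};
        adj[2] = {3};

        std::vector<int> indeg(n, 0), outdeg(n, 0);
        for (int u = 0; u < n; ++u) {
            outdeg[u] = static_cast<int>(adj[u].size());
            for (int v : adj[u]) ++indeg[v];   // each arc (u, v) raises deg+(v)
        }

        for (int v = 0; v < n; ++v)
            std::cout << "vertex " << v << ": deg+ = " << indeg[v]
                      << ", deg- = " << outdeg[v] << '\n';
    }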

Multiple practical problems that are modeled with graphs (or digraphs) usually boil down to a specific way of traversing vertices and edges. Any traversal of consecutive elements of graph G may be defined as a walk [GY03], i.e. an alternating sequence of its vertices and edges:

    W = v_0, e_1, v_1, ..., e_n, v_n    (3.1)

where v_0 is the initial vertex, v_n is the final or terminal vertex, and the edges are denoted by e_j = {v_{j-1}, v_j} for j = 1, 2, ..., n. Moreover, if edge e_j is directed from v_{j-1} to v_j for each j = 1, 2, ..., n, then W is a directed walk. The number of edges (or arcs), including repetitions, is referred to as the length of a walk. If the initial vertex is the same as the final vertex, the walk is closed, otherwise it is open. A trail in a graph is defined as a walk with unique edges, i.e. no edge is repeated. Similarly, a path in a graph is a trail such that no vertex is repeated (excluding the initial and the final one, which may be the same). A closed trail of length at least 1 is called a circuit, whereas a closed path of length at least 1 is referred to as a cycle. Note that a graph that has no cycle is called acyclic. It should be stressed that equivalent terminology (i.e. directed trail, path, circuit and cycle) applies also to digraphs.

From the historical perspective, the most famous trail and circuit are the Eulerian ones. An Eulerian trail in graph G is a walk that contains every edge of G exactly once. Likewise, an Eulerian circuit of graph G is a circuit that contains every edge of G exactly once. These issues were studied for the first time in 1736 by Leonhard Euler, solving the famous problem of the Seven Bridges of Königsberg. Furthermore, a graph which has an Eulerian circuit is called an Eulerian graph. In the case of digraphs the definitions are formulated in the same way, only a directed trail and circuit are considered there.

Another important collection of graph definitions concerns their connectivity properties. The length of the shortest walk between two vertices in a graph is called the distance. The corresponding notion in the context of digraphs is called the directed distance. A graph is defined as connected if there is a walk between any selected pair of vertices. A digraph, in turn, is weakly connected if its underlying graph is connected. Additionally, a strongly connected digraph is one with a directed walk between any selected pair of vertices.

A subgraph H of graph G is a graph where the set of vertices V_H is a subset of V_G and the set of edges E_H is a subset of E_G, i.e. V_H ⊂ V_G and E_H ⊂ E_G. Moreover, the induced subgraph of graph G on a selected set of vertices W = {w_1, ..., w_k}, denoted as G(W), is a subgraph of graph G which has W as its vertex set and contains every edge of G whose endpoints are in W. Furthermore, a connected subgraph H of graph G constructed in such a way that there is no other vertex in G connected to H is called a connected component. In other words, a connected component is a maximal connected subgraph. Similarly, a strongly connected component is defined as a connected subgraph H of digraph D such that there is no other vertex in D that is bidirectionally connected to H. These notions are especially important in the context of Chapter 5.

However, probably the foremost kind of graphs in this thesis are those possessing Hamiltonian paths. A Hamiltonian path in graph G is defined as a path traversing all the vertices of G. Correspondingly, a Hamiltonian cycle in graph G is a cycle containing all the vertices of G. In the case of digraphs the definitions are formulated in the same way, only a directed path and cycle are considered there. A graph that has a Hamiltonian cycle is called a Hamiltonian graph, named after the famous mathematician William Rowan Hamilton, who in 1857 invented the icosian game, nowadays also known as Hamilton's puzzle. Nevertheless, it should be stressed that it was Thomas Kirkman who studied the more general problem prior to Hamilton.

As already noted, many practical problems, especially in the area of operations research and bioinformatics, may be represented and actually solved with graph models and the related algorithms. Many of these problems are typically modeled with so-called weighted graphs. A weighted graph is a graph in which every edge is associated with a weight (usually an integer or a real number) representing its importance, length, cost or anything else relevant to a given model. The weight of a walk is the sum of the weights of all the edges traversed by this walk. In this context, a prime example is the Travelling Salesman Problem [App06], which can be stated as follows: given a weighted graph, the task is to find the Hamiltonian cycle with the lowest weight. This problem, despite being first formulated in 1930, is still one of the most intensively studied, and, slightly modified, appears as a subproblem in the whole genome assembly.
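To see why such problems are computationally demanding, the brute-force sketch below (illustrative only; the weight matrix is made up) solves a tiny Travelling Salesman instance exactly as stated above: it enumerates every Hamiltonian cycle through vertex 0 of a complete weighted graph and keeps the lowest total weight. The number of cycles grows as (n-1)!, which is precisely why realistic instances call for heuristics.

    #include <algorithm>
    #include <iostream>
    #include <limits>
    #include <numeric>
    #include <vector>

    int main() {
        // Hypothetical symmetric weight matrix of a complete graph on 4 vertices.
        const std::vector<std::vector<int>> w = {{0, 10, 15, 20},
                                                 {10, 0, 35, 25},
                                                 {15, 35, 0, 30},
                                                 {20, 25, 30, 0}};
        const int n = static_cast<int>(w.size());

        std::vector<int> perm(n - 1);
        std::iota(perm.begin(), perm.end(), 1);  // permute vertices 1..n-1

        int best = std::numeric_limits<int>::max();
        do {  // every permutation yields one Hamiltonian cycle through vertex 0
            int cost = w[0][perm.front()];
            for (int i = 0; i + 2 < n; ++i) cost += w[perm[i]][perm[i + 1]];
            cost += w[perm.back()][0];  // close the cycle
            best = std::min(best, cost);
        } while (std::next_permutation(perm.begin(), perm.end()));

        std::cout << "lowest cycle weight: " << best << '\n';  // 80 for this data
    }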

3.2 Computational complexity

This section is a brief introduction to the computational complexity of combinatorial problems. More detailed information regarding this topic may be found in [GJ79, J.88, Dro97].

At the very beginning let us recall that combinatorial problems are those that concern a finite, or at least enumerable, collection of objects and parameters of integer type. In combinatorial theory three classes of problems may be distinguished: decision, search and optimization ones. The decision problems may be formulated in the form of a question with "yes" or "no" being the only possible answers (e.g. does a given graph contain a Hamiltonian cycle?). On the other hand, the aim of the search problems is to find a feasible solution for defined parameters of the problem (e.g. finding a Hamiltonian path in a given graph). If no feasible solution exists, an algorithm designed to solve a given problem should report this fact. Finally, the optimization problem class is a special case of the search problem class. An optimization problem requires maximization or minimization of a given objective function (e.g. finding a Hamiltonian cycle with a minimum cost in a given graph). An exact algorithm solving an optimization problem should either return the optimal solution or report that no feasible solution exists. Additionally, for each optimization problem one may also formulate its corresponding decision and search versions (but not vice versa). In this case the question usually concerns the existence of a feasible solution with the objective function value greater (or smaller) than a predefined value. In some combinatorial problems the search and optimization versions may be equivalent, as the only feasible solution is at the same time the optimal one.

Let π denote a decision problem with a finite set of parameters (the values of which have not been set) and a question with "yes" or "no" being the only possible answers. An instance I of problem π is obtained by specifying particular values for all the problem parameters. As a result, π may be defined as a set of all specific instances D_π. Additionally, we may define a subset Y_π ⊆ D_π containing all the instances I for which the answer is "yes". Let us also define the size of a problem instance I, denoted by N(I). This size determines the length of a finite string of symbols x(I) needed to encode the I-specific parameters according to an adopted encoding rule. All the symbols used for encoding purposes belong to a predefined alphabet σ. From the computational complexity perspective all the possible encoding schemes are equivalent given that:

• the numbers used to encode I are represented with a base greater than 1 (e.g. binary encoding),

• the coding is not redundant, i.e. there are no unnecessary symbols in x(I).

A procedure solving a given problem is called an algorithm. An algorithm may be given either in the form of an ordered set of steps to perform or in the form of a computer program. One may say that algorithm A solves problem π if it finds a solution for each instance I ∈ D_π. A measure of the execution time of algorithm A solving problem π of size N(I) = n is the maximum number of elementary steps needed to solve any instance of that size. This measure is usually referred to as the function of computational complexity η_A(n).

Definition 3.2.1. Algorithm A is of computational complexity O(f(N(I))) if for a given η_A(N(I)) there exists a constant number c > 0 such that η_A(N(I)) ≤ c · f(N(I)) for almost all I ∈ D_π. Function f(N(I)) is referred to as an upper bound of function η_A(N(I)).
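As a worked example (added here for illustration, not part of the original text): if an algorithm A performs η_A(n) = 3n^2 + 5n + 7 elementary steps on instances of size n, then choosing f(n) = n^2 and c = 4 gives

    3n^2 + 5n + 7 ≤ 4n^2  for all n ≥ 7,

so A is of computational complexity O(n^2).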

The algorithms may in general be divided into two groups depending on their computational complexity. A given algorithm is polynomial in time if its computational complexity is O(f(n)), where f(n) = a_k · n^k + a_{k-1} · n^{k-1} + ... + a_0 is a polynomial function and n = N(I). All the algorithms with computational complexity that cannot be bounded this way are called exponential in time. Additionally, one may also distinguish a third class of so-called pseudopolynomial algorithms. These are the algorithms whose execution time can be bounded by a polynomial function depending on the instance size N(I) and the maximum value of any problem parameter, Max(I). In practice, the polynomial (including pseudopolynomial) algorithms are usually acceptable in terms of their run time. On the contrary, the exponential algorithms are usually considered useless, as their execution time is prohibitively long.

However, there are a number of problems for which no polynomial algorithms are known. Moreover, for some problems such algorithms may simply not exist. We usually refer to these problems as intractable. It is of great importance to determine whether a given problem is intractable or not, to be able to decide what kind of algorithm needs to be developed, i.e. exact or approximate/heuristic, to solve this problem. There are a few classes of computational complexity of combinatorial problems. Yet, in order to introduce them, it is necessary to choose a particular model of computing system. In this context the most commonly used model is the Deterministic Turing Machine (DTM), which is an example of a realistic computer system. Any algorithm that is polynomial in time on a DTM may also be executed in polynomial time on any other realistic machine. On the other hand, the Nondeterministic Turing Machine (NDTM) will be considered as an example of a non-deterministic model. In this model, a machine is capable of executing an unbounded number of computations in a given time unit. Most importantly, if an NDTM solves problem π in polynomial time, then there exists a polynomial function p such that a DTM solves the same problem in time O(2^{p(N(I))}), I ∈ D_π. As a result, an NDTM may be simulated by a DTM in exponential time. It is worth noting that in general the reverse statement is not true.

Definition 3.2.2. A decision problem belongs to the computational complexity class P if there exists an algorithm solving this problem in polynomial time on a DTM.

The above definition introduces the complexity class P, which stands for Polynomial. Another class of decision problems is called NP, which stands for Nondeterministic Polynomial. This class is considered to cover more problems. Its definition is given below.

Definition 3.2.3. A decision problem belongs to the computational complexity class NP if there exists an algorithm solving this problem in polynomial time on an NDTM.

Both definitions presented above indicate that P ⊆ NP. Even though it is possible that P = NP, the majority of the scientific community suspects that P ≠ NP. However, despite some attempts [V.10], it has not been proved yet.

To define the next class of computational complexity, the definition of polynomial transformation needs to be introduced.

Definition 3.2.4. A polynomial transformation of problem π_2 to problem π_1, denoted by π_2 ∝ π_1, is a function f : D_{π_2} → D_{π_1} satisfying the following conditions [BJJ83]:

• for any instance I_2 ∈ D_{π_2} the answer is "yes" if and only if the answer for f(I_2) ∈ D_{π_1} is "yes" too,

• the execution time of function f on a DTM is upper-bounded by a polynomial g(N(I_2)) for any instance I_2 ∈ D_{π_2}.

Definition 3.2.5. A decision problem π_1 is NP-complete if π_1 ∈ NP and any decision problem π_2 ∈ NP can be polynomially transformed into π_1 (π_2 ∝ π_1).

As a consequence of the definitions presented above, in order to prove that a decision problem π_1 is NP-complete one needs to first show that it belongs to the NP class of problems, and subsequently transform any known NP-complete problem π_{NP} into π_1 (π_{NP} ∝ π_1). To prove that a given problem belongs to the NP class, an algorithm solving this problem in polynomial time on an NDTM should be constructed. The first problem that was proved to be NP-complete was the boolean satisfiability problem. Unless P = NP, it is impossible to create a polynomial algorithm solving any NP-complete problem on a DTM. And vice versa, a construction of a polynomial algorithm for a DTM solving any NP-complete problem would prove that P = NP. In other words, if any NP-complete problem could be solved by a polynomial algorithm, all the problems from the NP-complete class would be solvable in polynomial time too. Additionally, some NP-complete problems may be answered by pseudo-polynomial algorithms. Those decision problems that cannot be solved in pseudo-polynomial time constitute the strongly NP-complete class, given that P ≠ NP. More formally:

Definition 3.2.6. Let π_p be a subproblem of decision problem π obtained by restriction of the domain D_π to only those instances for which Max(I) ≤ p(N(I)), where p is a polynomial. Problem π is strongly NP-complete if π ∈ NP and π_p is NP-complete.

Therefore, to prove strong NP-completeness of a given problem, one should find a polynomial p such that π_p is NP-complete. To simplify this process let us introduce the definition of a pseudo-polynomial transformation.

Definition 3.2.7. A pseudo-polynomial transformation of decision problem π_2 into problem π_1 is a function f : D_{π_2} → D_{π_1} satisfying the following conditions:

• for any instance I_2 ∈ D_{π_2} the answer is "yes" if and only if the answer for f(I_2) ∈ D_{π_1} is "yes" too,

• the execution time of function f on a DTM is upper-bounded by the following polynomials: g_1(N(I_2)) and g_2(Max(I_2)) for any instance I_2 ∈ D_{π_2},

• there exists a polynomial p_1 such that p_1(N(f(I_2))) ≥ N(I_2) for each instance I_2 ∈ D_{π_2},

• there exists a polynomial p_2 such that Max(f(I_2)) ≤ p_2(Max(I_2), N(I_2)) for each instance I_2 ∈ D_{π_2}.

It has been proved that if π_2 is strongly NP-complete, π_1 ∈ NP and there exists a pseudo-polynomial transformation of π_2 into π_1, then π_1 is strongly NP-complete.

Definition 3.2.8. The polynomial Turing transformation of problem π_2 to problem π_1, denoted by π_2 ∝_T π_1, is an algorithm solving problem π_2 on a DTM in polynomial time using some hypothetical polynomial-time procedure solving problem π_1 (also on a DTM).

According to the above definition, a search problem π_1 is defined as NP-hard if any other NP-hard (or NP-complete) problem π_2 may be transformed using the polynomial Turing transformation to π_1, i.e. π_2 ∝_T π_1. Moreover, strongly NP-hard problems are defined as follows:


Definition 3.2.9. Let π_p be a subproblem of problem π obtained by restriction of the domain D_π to only those instances for which Max(I) ≤ p(N(I)), where p is a polynomial. Problem π is strongly NP-hard if π ∈ NP and π_p is NP-hard.

Each optimization problem has its decision as well as its search version. From the computational complexity perspective both of them, i.e. the decision and search versions of a given problem, are not computationally harder than the optimization version of the same problem. It is also crucial to know that the optimization version of a given problem is at least as complex as the search version of the same problem, which in turn is at least as complex as the corresponding decision problem. As a consequence, an optimization problem is called (strongly) NP-hard if its decision or search version has been proved to be (strongly) NP-complete or (strongly) NP-hard, respectively. In general the inverse statement is not true, i.e. computational simplicity of the decision version does not imply computational simplicity of the search and optimization versions of the same problem. This means that the existence of a polynomial algorithm solving a given decision or even search problem does not determine the computational complexity of the corresponding optimization problem, which can still be NP-hard or even strongly NP-hard. In order to solve any (strongly) NP-hard problem in a reasonable time one needs to design a heuristic (or approximation) algorithm. Such an algorithm may, however, provide suboptimal solutions.

3.3 Dynamic programming

This section describes the idea of dynamic programming, which is a popular method used to solve selected problems from the area of operations research or bioinformatics, among others. Let us start from a rather informal description of the method. Dynamic programming allows a given problem to be solved optimally by dividing it into smaller subproblems. Once these are solved, the results are iteratively combined to reach an overall solution. The interesting property of the method is that it remembers the result for each already computed subproblem and, therefore, the solution to every subproblem is computed only once. Sometimes this allows a lot of time to be saved, especially when the number of repeating subproblems tends to grow exponentially.
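As a minimal illustration of this property (an example added here, not taken from the thesis), the sketch below memoizes Fibonacci numbers: the naive recursion solves the same subproblems exponentially many times, whereas storing each result ensures that every subproblem is computed exactly once.

    #include <iostream>
    #include <vector>

    // Returns F(k), remembering every already computed subproblem in `memo`.
    long long fib(int k, std::vector<long long>& memo) {
        if (k <= 1) return k;
        if (memo[k] >= 0) return memo[k];  // subproblem already solved: reuse it
        memo[k] = fib(k - 1, memo) + fib(k - 2, memo);
        return memo[k];
    }

    int main() {
        const int n = 50;
        std::vector<long long> memo(n + 1, -1);  // -1 marks "not yet computed"
        std::cout << fib(n, memo) << '\n';       // 12586269025, in linear time
    }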

However, dynamic programming addresses only a particular class of optimization problems, so-called multi-stage decision processes with the Markov property. One example of such a problem considered in this thesis is finding an optimal alignment between two biological sequences. In this class of problems the state of a process at any stage depends on a decision which can be chosen from a set of feasible decisions D. Let s_1, s_2, ..., s_N denote the states of a process in consecutive stages and x_1, x_2, ..., x_N the decisions that lead to the corresponding states. A course of a multi-stage decision process may be defined as follows:

    s_k = T(s_{k-1}, x_{k-1}),  for k = 2, 3, ..., N    (3.2)

where T is a transformation depending on a given state s_{k-1} and decision x_{k-1} ∈ D_{k-1}. The process is deterministic if the result s_k of transformation T is unequivocally determined by the pair (s_{k-1}, x_{k-1}).

A sequence of decisions made in a multi-stage decision process is evaluated by a scalar function f(s_1, s_2, ..., s_N; x_1, x_2, ..., x_N). A desired course of such a process results in function f reaching its extremum (minimum or maximum). The following paragraphs assume that f is minimized, without losing the generality of the problem in question.

An optimal strategy is defined as a sequence of feasible decisions x_1, x_2, ..., x_N that minimizes the value of function f. State s_1 is considered the initial state of the system. A multi-stage decision process with the Markov property is a process in which at any stage the decision that is to be made does not depend on the history, but on the current state only. In other words, such a process guarantees that at any stage the final value of function f depends only on the current stage and future decisions.

The invention of dynamic programming is attributed to the famous mathematician Richard Ernest Bellman, who formulated the Principle of Optimality in 1953 [R.E53], which was further described in [R.E54, R.E57]. This principle is presented below.

Definition 3.3.1. An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

To explain, let z = f_1(s_1, x_1) + f_2(s_2, x_2) + ... + f_N(s_N, x_N) be a criterion function to be minimized. The Principle of Optimality implies that:

    min_{x_1,...,x_N} {f_1(s_1, x_1) + f_2(s_2, x_2) + ... + f_N(s_N, x_N)}
        = min_{x_1} {f_1(s_1, x_1) + min_{x_2,...,x_N} {f_2(s_2, x_2) + ... + f_N(s_N, x_N)}}    (3.3)

Let us define q_k as follows:

    q_k(s_k) = min_{x_k,...,x_N} {f_k(s_k, x_k) + ... + f_N(s_N, x_N)},  for k = 1, ..., N    (3.4)

Then, using the above definition:

    q_N(s_N) = min_{x_N} {f_N(s_N, x_N)}    (3.5)

    q_k(s_k) = min_{x_k} {f_k(s_k, x_k) + q_{k+1}(s_{k+1})},  for k = N, ..., 1    (3.6)

The Bellman equation, also known as the dynamic programming equation, is a result of Formula 3.2:

    q_k(s_k) = min_{x_k} {f_k(s_k, x_k) + q_{k+1}[T(s_k, x_k)]},  for k = N, ..., 1    (3.7)

where q_{N+1}(s_{N+1}) = 0.

The procedure of dynamic programming involves N stages, at which the corresponding values of q_k are computed, as in Formula 3.7. Note that each consecutive q_k is obtained by minimizing the sum of the previously calculated value q_{k+1} and the impact of the possible decisions considered at a given stage. Finally, q_1 at the initial state of the system is the value of the optimal solution. By tracking back the calculations already performed one may deduce the optimal values of the decision variables, starting from the optimal decision x_1 until the whole sequence of optimal decisions x_1, x_2, ..., x_N is known.

It should be stressed that in dynamic programming any step requires finding the minimum value of f_k over the set of feasible decisions D_k. As a result, the computational complexity of this method is exponential, unless the cardinality of D_k can be bounded polynomially with respect to the problem size for each k. In the latter case the algorithm becomes pseudo-polynomial in time (see the bounded knapsack problem in [GR79]).

3.4 Pairwise sequence alignment algorithms

One of the algorithms that is extensively used as a building block in this thesis and is worth mentioning here is an algorithm for pairwise sequence alignment (cf. Section 2.3). To be more precise, the literature describes three algorithms that are based on the same principle, i.e. dynamic programming, and therefore to some degree work in the same way. Namely, the Needleman-Wunsch [NW70] algorithm is used to compute the global alignment, its modification may be exploited in the case of semi-global alignment, and finally for the local alignment the algorithm of Smith-Waterman [SW81] is usually employed. It should be stressed that although these are not the only methods for pairwise sequence alignment, they are the most commonly used, since they solve the outlined problem exactly according to a given mathematical model. Moreover, these methods are often accompanied by Gotoh's enhancement [Got81], which additionally makes it possible to model affine gap penalties. Taking this into account, let us now consider the idea behind the three algorithms.

In this section the following notation will be used:

• A: a set of characters (nucleotides or amino acids), i.e. an alphabet,
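The remaining notation and the detailed algorithms follow in Sections 3.4.1 and 3.4.2. As a preview, the sketch below (an illustration added here; the scoring scheme of match +1, mismatch -1 and linear gap penalty -1, as well as the example sequences, are arbitrary choices rather than the parameters used in this thesis) computes the Needleman-Wunsch global alignment score of two short sequences. A complete implementation would also trace back through the matrix to recover the alignment itself.

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        const std::string a = "GATTACA", b = "GCATGCU";
        const int match = 1, mismatch = -1, gap = -1;
        const int m = static_cast<int>(a.size()), n = static_cast<int>(b.size());

        // H[i][j] = best score of aligning the prefixes a[0..i) and b[0..j).
        std::vector<std::vector<int>> H(m + 1, std::vector<int>(n + 1, 0));
        for (int i = 1; i <= m; ++i) H[i][0] = i * gap;  // leading gaps in b
        for (int j = 1; j <= n; ++j) H[0][j] = j * gap;  // leading gaps in a

        for (int i = 1; i <= m; ++i)
            for (int j = 1; j <= n; ++j) {
                const int s = (a[i - 1] == b[j - 1]) ? match : mismatch;
                H[i][j] = std::max({H[i - 1][j - 1] + s,  // (mis)match
                                    H[i - 1][j] + gap,    // gap in b
                                    H[i][j - 1] + gap});  // gap in a
            }
        std::cout << "global alignment score: " << H[m][n] << '\n';  // 0 here
    }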
