
University of Warsaw

Faculty of Mathematics, Informatics and Mechanics

Anna Macioszek

Student no. 318992

HMM-based method for identifying enrichment in signal from sequencing-based experiments

PhD dissertation in COMPUTER SCIENCE

Supervisor:

dr hab. Bartosz Wilczyński, Institute of Informatics

Warsaw, March 2022


Supervisor's statement

I hereby confirm that the presented thesis was prepared under my supervision and that it fulfils the requirements for the degree of PhD in Computer Science.

Date Supervisor's signature

Author's statement

I hereby declare that the presented thesis was prepared by me and none of its contents was obtained by means that are against the law. The thesis has never before been a subject of any procedure of obtaining an academic degree.

Moreover, I declare that the present version of the thesis is identical to the attached electronic version.

Date Author's signature


I would like to thank my fiancé Maciek: for his help, continuous moral support, advice, delicious meals and so much more. Thanks :*


Abstract

With the advent of so-called Next-Generation Sequencing (NGS), many protocols for NGS-based experiments emerged. For many of them, a crucial step in the downstream analysis is a procedure called peakcalling, which aims at identifying enrichment in signal over a genome. While there are many tools available for peakcalling, in many cases they still yield unsatisfactory results. In this dissertation a new tool for peakcalling is described, called HERON - HiddEn maRkov mOdel based peakcalliNg. It is intended to work well with signal that contains long, weakly enriched peaks, similar to signals resulting from ChIP-seq against the H3K27me3 or H3K9me3 modifications. It is based on a Hidden Markov Model with continuous emissions. HERON is compared to three other peakcalling tools - MACS, SICER and BayesPeak. Their performance is assessed on multiple simulated datasets and on two sets of experimental data, coming from healthy and cancerous human tissues. Additionally, a package for simulating various data from NGS-based experiments is presented. Using this package, the four peakcallers were tested in various scenarios and their strengths and weaknesses were identified. It is shown that HERON returns more reliable results than the other peakcallers for signal with long peaks.

Streszczenie

With the development of so-called second-generation sequencing techniques (NGS - Next Generation Sequencing), many protocols for experiments based on them have been developed. For many of them, a crucial element of the analysis of their results is a procedure called peakcalling, which aims at identifying enriched regions in the signal covering the genome. There are many tools that perform this procedure (called peakcallers), yet in many cases they still do not give satisfactory results. In this dissertation I present a new tool for identifying enrichment, called HERON - HiddEn maRkov mOdel based peakcalliNg. HERON is intended for analysing signal containing long, weakly enriched regions; an example of such a signal is one coming from a ChIP-seq experiment performed against the H3K27me3 modification. HERON is based on Hidden Markov Models with continuous emissions. In this work I compare it to three other peakcallers - MACS, SICER and BayesPeak. Their performance is assessed on multiple simulated datasets and on two sets of experimental data, coming from healthy and cancerous human tissue. Additionally, a package for simulating data from NGS-based experiments is presented. Using it, the four peakcallers were tested in various scenarios and their strengths and weaknesses were identified. I show that HERON gives more reliable results than the other tested peakcallers on data containing long regions of enrichment.


Keywords

peakcalling, NGS-based experiments, ChIP-seq, Hidden Markov Models

Thesis domain (Socrates-Erasmus subject area codes)

11.9 Other Mathematics, Informatics

Subject classification

Applied computing → Life and medical sciences → Bioinformatics

Title of the thesis in Polish

Oparta na HMM-ach metoda identyfikowania wzbogacenia w sygnale z eksperymentów opartych na sekwencjonowaniu


Contents

Contents
List of figures
List of tables
List of acronyms

1. Introduction
1.1. DNA and chromatin - structure and functions
1.2. Sequencing
1.3. NGS-based experiments
2. Peakcalling
2.1. Introduction
2.2. Example approaches to peakcalling
2.2.1. MACS
2.2.2. BayesPeak (HMM-based approach)
2.2.3. SICER
2.2.4. Other approaches
2.3. Challenges
3. HERON - HMM-based peakcaller
3.1. Hidden Markov Models
3.1.1. Definitions and notation
3.1.2. Forward algorithm
3.1.3. Viterbi algorithm
3.1.4. Baum-Welch algorithm
3.1.4.1. Gaussian emissions
3.1.4.2. Negative binomial distribution
3.2. Implementation
3.3. M-step for negative binomial distribution
3.4. Multiple samples
3.4.1. Replicates
3.4.2. Subpopulations
3.5. Usage
4. Simulating data
4.1. Workflow
4.1.1. ChIP-sim
4.2. Generated datasets
4.3. Results
4.3.1. Single sample
4.3.2. Multiple samples
4.3.3. Multiple samples with subpopulations
4.4. Summary
5. Testing on experimental data
5.1. Data description
5.2. Results
5.2.1. Single sample
5.2.2. Multiple samples
5.2.3. Multiple samples with subpopulations
5.3. Summary
6. Testing on experimental data from cancer tissues
6.1. Data description
6.2. Results
6.2.1. Single sample
6.2.2. Multiple samples
6.2.3. Multiple samples with subpopulations
6.3. Fitting distributions
6.4. Summary
7. Summary
Bibliography


List of Figures

1.1. DNA structure
1.2. Chromatin structure
1.3. Example mapping and coverage
2.1. Example signal with ambiguous peaks
2.2. MACS - shifting reads after mapping
2.3. BayesPeak overview
3.1. Example Hidden Markov Model
3.2. HERON's workflow
4.1. Data simulation package
4.2. Example simulated signal fragment
4.3. Example screenshot with samples divided into subpopulations
4.4. Resolution vs peak width
4.5. Peakcalling assessment depending on width of the true peaks
4.6. Peakcalling assessment depending on the enrichment
4.7. Peakcalling on multiple simulated samples
4.8. Discovering subpopulations in simulated data; easier case
4.9. Discovering subpopulations in simulated data; harder case
5.1. Example signal from H3K27me3 ChIP-seq on fetal brain
5.2. Length of peaks
5.3. Expression of genes within peaks
5.4. HERON on multiple samples vs on single sample
5.5. Differentiating tissues with HERON - means of emissions
6.1. Example coverages from glioma sample
6.2. Length of peaks
6.3. Expression of genes within peaks
6.4. Expression of genes within peaks for example patients
6.5. Expression of genes within peaks; only genes differing between patients
6.6. Repeatability of peaks between patients
6.7. Repeatability of transcripts between patients
6.8. Peakcalling on multiple samples
6.9. Peakcalling on subpopulations; differentiating glioma grades
6.10. Peakcalling on subpopulations; differentiating experiments
6.11. Fitting distribution to experimental data


List of Tables

5.1. Results for Roadmap Epigenomics data
6.1. Available data from cancer tissues
6.2. Difference between expressions of genes overlapping peaks detected using one and many files
6.3. Comparison of Gaussian and negative binomial distribution fitted to experimental data


Acronyms

ATAC Assay for Transposase-Accessible Chromatin.

bp base pair.

ChIP Chromatin immunoprecipitation.

DA Diffuse Astrocytoma; identifier of patients with grade II or III glioma in the data from cancer tissues.

DNA deoxyribonucleic acid.

EM Expectation Maximisation.

FDR False Discovery Rate.

GB Glioblastoma; identifier of patients with grade IV glioma in the data from cancer tissues.

Gb giga bases or giga base pairs.

H3K27ac Acetylation of the 27th lysine (K) of histone H3.

H3K27me3 Triple methylation of the 27th lysine (K) of histone H3; similarly for H3K4me3 and other H3K*me3 modifications.

HERON HiddEn maRkov mOdel based peakcalliNg; a novel program for peakcalling.

HMM Hidden Markov Model.

HTS High-Throughput Sequencing.

JAMM Joint Analysis of NGS replicates via Mixture Model clustering; a program for peakcalling.


kb kilo bases or kilo base pairs.

MACS Model-based Analysis of ChIP-Seq; a program for peakcalling.

Mb mega bases or mega base pairs.

MLE Maximum Likelihood Estimator.

NB Negative binomial (distribution).

NGS Next Generation Sequencing.

PA Pilocytic Astrocytoma; identifier of patients with grade I glioma in the data from cancer tissues.

PG Pediatric Glioblastoma; identifier of patients with grade IV glioma in the data from cancer tissues.

RNA ribonucleic acid.

RPKM Reads Per Kilobase per Million reads.

SICER Spatial clustering for Identification of ChIP-Enriched Regions; a program for peakcalling.

TF transcription factor.

TPM Transcripts Per Million.

TPR True Positive Rate.

TSS Transcription Start Site.

ZINBA Zero-Inflated Negative Binomial Algorithm; a program for peakcalling.


Chapter 1

Introduction

In this dissertation I present HERON, a new tool that performs a procedure called peakcalling, i.e. a procedure aiming at identifying enrichment in signal from sequencing-based experiments. Before I describe how it works, I need to introduce some basic concepts and terminology that are used in the following chapters. In particular, in this chapter I give a brief introduction to the biological background. In section 1.1 I describe various phenomena happening in living cells; I focus especially on those regarding DNA and gene expression regulation. In section 1.2 I briefly characterise the concept of DNA sequencing and its methods. Finally, in section 1.3 I introduce various experiments that are based on DNA sequencing, including the experiments whose results HERON is designed to analyse. After this introduction, in chapter 2 I move to the subject of peakcalling: I characterise the procedure and present several currently available approaches, as well as various challenges that are still not fully addressed. In the next chapter I describe HERON - the concept behind it, its various features and the details of the algorithms and implementation. The following three chapters contain an assessment of HERON's performance; I test it on various data, both experimental and simulated, compare it to other peakcallers and describe its strengths and limitations. Additionally, in chapter 4, I present my approach to generating simulated data, which allows us to evaluate peakcallers' results. Eventually, in the last chapter I present a brief summary of the subjects discussed here.

1.1. DNA and chromatin - structure and functions

An important role in living organisms is played by deoxyribonucleic acid (DNA). It is present in almost every cell of every organism and in many viruses. Within it, so-called genetic information is stored; in a way it is a set of instructions that regulates the life of the whole cell: its behaviour, mechanisms, responses to various stimuli, proliferation etc.

Structurally, DNA is a long chain of molecules called nucleotides. Its length varies; currently the shortest known genome is 143,795 nucleotides long (in the bacterium Hodgkinia cicadicola), while the longest one is 150 billion nucleotides long (in the plant Paris japonica).

Every nucleotide in DNA is composed of a sugar molecule (specifically deoxyribose), a phosphate and a nucleobase (often called simply a base). There are four basic types of nucleobases that build a DNA chain: adenine, cytosine, guanine and thymine, often denoted as A, C, G and T, respectively. It is worth noting that often, for the sake of simplicity, a nucleotide with an adenine nucleobase is wrongly referred to as adenine, and so on. Algorithmically speaking, a DNA chain can be viewed as a long string over the {A, C, G, T} alphabet. The whole genetic information can be represented as a sequence of those four letters.

DNA does not usually appear in cells as a single strand; rather, there are two strands of DNA bound to each other. The sequences of the strands are complementary to each other - every adenine almost always pairs with a thymine in the other strand, and a cytosine almost always pairs with a guanine. Due to tensions in this double-stranded structure it tends to coil around itself and form the so-called double helix. The structure of DNA is presented in figure 1.1. One can see that from the chemical point of view we can distinguish two ends of a DNA strand, called the 5' end and the 3' end ("five prime" and "three prime", respectively). The names come from how the atoms in nucleotides are conventionally numbered: the carbon atoms in the sugar molecule are labeled 1', 2', 3', 4' and 5'; the 3' end of the chain is the one where the 3' carbon atom is available to potentially connect to the next nucleotide, while the 5' end is the one where the 5' atom is available. The two strands are antiparallel to each other, i.e. where one has the 5' end, the other has the 3' end.

Another important type of molecule in living organisms are proteins. Proteins are chains of amino acids and they serve many different functions in cells: they catalyse reactions, transport molecules, make up a cell's skeleton etc. Similarly to nucleotides in DNA, denoted by convention A, C, T or G, most amino acids known to appear in protein sequences have their own one-letter abbreviation. Based on the structure of the protein, one can distinguish between the two ends of the chain; one is called the N-terminus and has a free amine group (i.e. -NH2), while the other is called the C-terminus and has a free carboxyl group (-COOH). By convention, the sequence of a given protein is reported from the N-terminus to the C-terminus, usually using the one-letter abbreviations. When one wants to refer to a specific amino acid in the protein's sequence, usually its number (counting from the N-terminus) and the abbreviation are reported; for example, 16S would denote the 16th amino acid in the sequence, which is a serine (denoted "S").

Figure 1.1: Schematic representation of DNA structure. The light yellow line represents the so-called backbone of a strand. Green, yellow, blue and orange shapes represent nucleobases, while the purple pentagon is a deoxyribose. A deoxyribose together with a phosphate and a single base forms a nucleotide. In the last deoxyribose of each strand the carbon atoms are numbered. One can see that the end where the 5' atom is free to bind to the next nucleotide is called the 5' end; similarly for the 3' end. The two strands bind to each other in an antiparallel way, i.e. at each end of the helix one of the strands has a 5' end and the other has a 3' end. Note that in the legend below the DNA structure the nucleobases have additional hydrogen atoms where, in the structure, they bind to the sugar molecule; similarly, the sugar molecule has additional hydrogens and an OH group where it binds to a nucleobase or to the next nucleotide. These additional atoms are marked in red. Figure created by me.

The entire sequence of DNA in a single organism is often called a genome. Within the genome there are sequences called genes, usually defined as sequences that encode the synthesis of a gene product, either a protein or a ribonucleic acid (RNA), and as basic units of heredity. Whenever a product of a certain gene is needed, a complex of proteins called the transcription machinery binds to the specific gene, locally unfolds the helix to access the single strand and transcribes its sequence into an RNA. RNA is similar to DNA, as it is a chain of nucleotides and has a 5' and a 3' end, but there are two main differences: first, RNA is built of ribonucleotides, while DNA is built of deoxyribonucleotides, which means the latter lack one of the -OH groups present in the ribose of ribonucleotides. Second, instead of thymine, RNA contains uracil (U). The RNA containing the gene's sequence (translated into ribonucleotides, with U's instead of T's) is called a transcript; it can undergo various modifications, including cutting out some parts of it called introns. Introns are sequences that are present in the gene but are not needed in the final product of the gene; e.g. in the case of protein-coding genes they do not encode any protein fragment. The processed transcript can then be translated into a protein on a complex of proteins and RNA called a ribosome. The nucleotide sequence is translated into an amino acid sequence using a code of triplets, i.e. three nucleotides of RNA determine a single amino acid in a protein. The process of transcribing the gene's sequence to RNA and then potentially building a protein on this template is called gene expression. The term can also mean a numerical assessment of how much RNA or protein is being created on the template of this gene. Currently the most widely used units for describing expression levels are based on the results of RNA-seq experiments, on which I expand in section 1.3. Genes that code sequences of proteins that are constantly needed in large quantities, for example proteins forming ribosomes, usually have high expression, while genes that code sequences of proteins needed only under specific circumstances or only in specific types of cells often have much lower expression.

It is worth mentioning that while both strands of DNA essentially contain the same information and in theory it could be possible to utilise both strands to create RNA chains from the same gene, usually only one strand is used in the transcription process of a single gene. In the context of a single gene, the strand where the gene is localised is called the sense strand, and the complementary one is the antisense strand. The antisense strand is the one that is used in the transcription process; RNA is created by synthesising a chain complementary to the antisense DNA strand, and hence its sequence is identical to the sequence of the sense strand (only with U's instead of T's). The transcription happens from the 5' end of the sense strand to the 3' end (the reverse for the antisense strand); the transcript is elongated toward its 3' end. Sometimes (especially in bacteria) the same fragment of a DNA strand can be sense for one gene and antisense for another, because genes sometimes overlap with each other. A sense strand is sometimes called a positive-sense, positive, (+) or Watson strand, while an antisense strand is called a negative-sense, negative, (-) or Crick strand. In the context of a whole genome, the sense and antisense strands are defined purely by convention and I expand more on this in section 1.2.

A single genome can have from hundreds to tens of thousands of protein-coding genes; they may take up only a small portion of the genome, especially in more complex organisms. For example, the human genome is approximately 3 billion nucleotides long, but only approximately 1-2% of it consists of protein-coding genes; estimates of the number of non-coding genes vary more, but they are usually between 1 and 3% of the genome. The rest is called non-coding sequence and at least some part of it plays a role in regulating the expression of genes. Two types of sequences within a genome are worth mentioning here: promoters and enhancers. Promoters are localised near genes, specifically near the transcription start site (TSS), i.e. the place where the transcription of a specific gene begins. A promoter is the sequence to which the transcription machinery binds first, so promoters are essential in the gene expression process. Their sequence is also often recognisable by proteins called transcription factors (or TFs for short); these can bind to a specific promoter and either enhance or repress expression of its gene. Similarly to promoters, enhancers can indirectly influence the expression level of genes via transcription factors that bind to them. Unlike a promoter, an enhancer is not necessarily localised near the gene's transcription start site; it can be far from the gene or even within it (specifically within its introns). It can still indirectly interact with the promoter of the gene by localising close to it in 3D space.

While in the majority of simple single-cell organisms DNA is usually stored simply as a double helix, often in a circular form, in more complex organisms - animals, plants, fungi and protists - DNA is stored in a more complicated way, shown in figure 1.2. Here, DNA is bound to proteins called histones. Eight histones, H2A, H2B, H3 and H4, each one in two copies, form a complex around which the DNA helix is wrapped. The DNA together with all the histones forms a nucleosome. It is often additionally stabilised by histone H1.

The DNA fragment wrapped in a nucleosome is usually approximately 146 base pairs (bp)¹ long. Stretches of DNA between nucleosomes are called linker DNA and they can be up to 80 bp long. The whole complex of DNA, histones and various other proteins (like transcription factors) bound to it is called chromatin. It is stored in a separate compartment of the cell, called the nucleus.

Histones play an important role in regulating gene expression. Though every nucleosome consists of the same eight histones, they can be slightly modified by attaching some molecule, like a methyl group or an acetyl group, to a specific amino acid in their sequence.

Such a modification changes the chemical properties of the histone; for example, addition of an acetyl group to the amino acid lysine in histone H3 neutralises its alkaline character, hence weakening its binding to the acidic DNA, causing loosening of the DNA helix and increasing its accessibility for various proteins, including the transcription machinery. Other modifications can act conversely and cause DNA to locally bind tighter to the histones, making this specific fragment less accessible. The regions of DNA that are available for proteins and not masked by histones are often referred to as open chromatin; in particular, promoters of active genes (i.e. genes that are being expressed), often called simply active promoters, are localised in open chromatin. When many nucleosomes in a row are modified in a way that increases accessibility of the DNA, they are usually further away from each other; sometimes some nucleosomes are even depleted and only long linker DNA remains. Such regions are often referred to as euchromatin. On the other hand, when multiple nucleosomes carry modifications that cause them to bind tighter to the DNA, they are usually more densely packed, forming so-called heterochromatin. Using the described mechanisms a cell can control which genes are accessible and which are not; for example, the regions of DNA containing genes important for functioning as a neuron are densely packed and inaccessible in muscle cells (forming heterochromatin), and loose and easily available in neurons (forming euchromatin).

¹Base pair (bp) is the basic unit of length of a double-stranded helix. It is equal to the number of nucleotides in a single strand. It is often used with prefixes (often without the "p"): kb or kbp means kilo base pairs (1,000 base pairs), Mb or Mbp means mega base pairs (a million) and Gb or Gbp means giga base pairs (a billion). The equivalent for single-stranded DNA/RNA is bases: b, kb, Mb or Gb. Whether "kb" means "kilo base pairs" or "kilo bases" can usually be deduced from the context; the same holds for Mb and Gb.

Figure 1.2: Schematic representation of chromatin structure. The yellow line is double-stranded DNA and it is wrapped around octamers of histones. The green, blue, purple and orange balls are the histones that form the octamer. The red shape is histone H1, which stabilises the whole nucleosome. On one of the histones a histone modification is shown - an additional chemical group (methyl, acetyl etc.) attached to a certain amino acid in the histone's sequence. There is also a transcription factor (pink shape) bound to one of the linkers between nucleosomes. Note that this is a very simplified schematic that does not reflect the histones' shape and many other details. Figure created by me.

Histone modifications serve different roles, depending on which chemical group is added and to which amino acid. Their names are usually constructed from three parts: the name of the histone they occur on, the number and one-letter abbreviation of the amino acid they affect, and the description of the chemical group that is added. For example, H4K16ac denotes acetylation ("ac") of the 16th amino acid ("16") of histone H4, which is a lysine ("K").
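
Because this naming convention is fully regular, such names can be taken apart mechanically. The sketch below is a hypothetical helper (not part of any tool described in this thesis) that splits a modification name into the histone, the affected amino acid with its position, and the added chemical group.

```python
import re

# Name layout: histone (e.g. H3, H2A), one-letter amino acid, its position,
# and the modification suffix (e.g. "ac", "me3").
_MOD_RE = re.compile(r"^(H[0-9][AB]?)([A-Z])(\d+)([a-z]+\d*)$")

def parse_modification(name: str):
    """Split e.g. 'H3K27me3' into ('H3', 'K', 27, 'me3')."""
    match = _MOD_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognised histone modification name: {name}")
    histone, residue, position, group = match.groups()
    return histone, residue, int(position), group

assert parse_modification("H3K27me3") == ("H3", "K", 27, "me3")
assert parse_modification("H4K16ac") == ("H4", "K", 16, "ac")
```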

Patterns of histone modifications seem to be distinctive for various tissues and types of cells, suggesting that this type of gene expression regulation plays an important role in cell differentiation, i.e. the process in which a cell changes from one type to another, usually a more specialised one. It can also be used as a way to regulate gene expression in response to some outside stimuli; for example, if a cell is under a stress stimulus that requires specific genes to be expressed, it can activate a chain of various reactions that result in a change in the histone modification pattern, which will ease access to the genes that are necessary to respond to the stimulus.

One example of a histone modification is the triple methylation of the 4th lysine of histone H3, called H3K4me3 for short. This modification regulates gene expression by attracting a complex of proteins called NURF, which remodels chromatin to enhance accessibility of the DNA fragment and thus enable transcription of the genes within the fragment. It is highly enriched near transcription start sites (TSSs) of active promoters and mostly absent anywhere else. It plays an important role in stem cells and their differentiation, but it is also present in differentiated cells. H3K4me3 is also involved in the process of repairing DNA double-strand breaks. Another example is the acetylation of the 27th lysine of histone H3, denoted H3K27ac. Similarly to H3K4me3, it is associated with higher average expression of genes. It is present at active promoters, but also - unlike H3K4me3 - at active enhancers, so it can be found both close to and far from TSSs. The 27th lysine of H3 can also be methylated; the triple methylation is called H3K27me3 and it usually acts antagonistically to H3K27ac. This tri-methylation is mediated by a complex of proteins called a polycomb repressive complex (PRC). H3K27me3, unlike H3K4me3 and H3K27ac, is associated with lower expression. It is found mostly in regions of heterochromatin. H3K27me3 and H3K4me3 sometimes co-exist on the same histones, even though they have antagonistic effects on gene expression. Regions where such coexistence occurs are called bivalent domains. They are found mostly in embryonic stem cells and seem to be crucial for maintaining a cell's pluripotency, i.e. its ability to differentiate into different cell types [5].

Another way of regulating gene expression is via transcription factors (TFs). They usually recognise some specific motif in the DNA sequence, for example a fragment of a promoter or an enhancer, bind to it and then either help other proteins bind and initiate transcription or, conversely, inhibit transcription by repelling other proteins. Similarly to histone modifications, they play an important role in the process of differentiation and they can be used as a way to dynamically change gene expression patterns according to the cell's current needs. Sometimes the interaction is connected to changes in the DNA's position in space; for example, if an enhancer is far from the gene whose expression it regulates (at least in terms of linear distance along the DNA sequence), the DNA sometimes forms a loop so that the enhancer is brought close to the gene in space, and so the TFs that bind to the enhancer, or other proteins interacting with the TFs, can also interact with the gene or its promoter and promote transcription. Patterns of transcription factor binding and the shape of the DNA helix in 3D space are other important features distinguishing different cell types.

Together, the various means of gene expression regulation create a complicated network of so-called regulatory elements. They regulate gene expression and control when every transcript is created, hence essentially controlling the vast majority of mechanisms and phenomena happening in cells. These processes determine the function of each cell, regulate the process of differentiation and are responsible for responding to stimuli from the outside.

Disruption of this regulatory network is observed in many diseases, especially in various cancers. Furthermore, it seems that specific features of this disruption are characteristic of different types of cancer. In conclusion, examining the whole network and learning in detail how it operates seems crucial to fully understand how our organisms work; it can also help us understand how various diseases develop, how to properly recognise them early and, potentially, how to fight them.

1.2. Sequencing

Learning the sequence of the genome is a key step in understanding various mechanisms concerning the genome, including gene expression regulation. The process of discovering the sequence of a DNA fragment (or a whole genome) is called sequencing. The first DNA sequencing methods were developed around 1970; the major breakthrough came in 1977, with the development of a new method by Frederick Sanger. Sanger sequencing allowed sequencing of a single DNA fragment around 1,000 nucleotides long in a matter of hours, with an estimated accuracy of 99.95%. It quickly became the standard method for sequencing, as it was the most efficient of all the available techniques. It still had important limitations, mainly its relatively high cost and low throughput. Various improvements were introduced, but sequencing a genome as large as the human genome (around 3.2 Gb) remained very difficult.

The second major breakthrough came with the invention of so-called next generation sequencing (NGS for short). It allowed sequencing millions of short DNA fragments at the same time. Similarly to Sanger sequencing, it is based on the elongation of a DNA strand carried out by a polymerase, a protein capable of building a DNA or RNA strand based on a DNA or RNA template. The polymerase used in NGS creates a DNA strand on a DNA template; another polymerase, which creates RNA on a DNA template, is part of the transcription machinery. The created strand's sequence is complementary to the template. To carry out the sequencing procedure, the DNA fragments to be sequenced are amplified and stabilised on a solid support (e.g. in the case of the sequencing developed by Illumina they are ligated onto a flow cell). Next, the polymerase starts creating strands complementary to the stabilised ones, using provided nucleotides. In the case of Illumina sequencing, the nucleotides are modified by adding a fluorophore; each type of nucleotide (A, C, G and T) has its own fluorophore that emits light of a different colour. Once a nucleotide is added to a newly created strand, it emits light registered by a scanner; the colour of the light tells us which nucleotide was added, and hence which nucleotide is at this position in the sequenced strand.

The added fluorophore has another important function, apart from emitting light characteristic of a specific nucleotide: it also blocks the addition of new nucleotides. Once the light is registered, the fluorophore is washed away to allow the next round of adding nucleotides. Thanks to that, we can ensure that in every copy of the same fragment the same nucleotide is being added at the same time; a typical mistake causes one of the fragments to be shifted relative to the rest as far as the position being sequenced is concerned, and adds noise to the light registered by the scanner. The more fragments are shifted, the harder it is to identify the current nucleotide. This is the main issue that restricts the length of the sequenced fragments - because errors inevitably happen, the later the cycle of sequencing, the more strands are shifted relative to the rest and the less reliable the read nucleotide is. Because of that, the sequenced fragment is usually tens, less often hundreds, of nucleotides long.

While various NGS methods can differ in how exactly the nucleotides are marked and read, the basic principle remains the same. As a result of this procedure we get millions of short strings, called reads, representing the sequences of the DNA fragments present in the sequenced sample. Often, along with the reads, we also get some sort of estimate of how certain we are that at a specific position a specific nucleotide was indeed present; usually the later the nucleotide in the read, the lower the certainty. A possible improvement to the method is called paired-end sequencing - in this setting every DNA fragment is sequenced from both ends. The fragments are usually longer than the obtained reads - for example we can have fragments a couple of hundred nucleotides long and from each of them obtain a pair of reads, each tens of nucleotides long. In this setting, apart from the set of reads, we also get the information about which read was paired with which.

The NGS techniques, sometimes called High-Throughput Sequencing (HTS), are much cheaper and faster than Sanger sequencing, and often more reliable - the frequency of an incorrectly read nucleotide is estimated at 10⁻² to 10⁻³. Their main disadvantage is the length of the obtained reads. This poses a substantial challenge, especially in the case of more complex genomes, where many repeated sequences are present. If we want to sequence the whole genome of an individual, it needs to be assembled from the short reads, based on how they overlap with each other. There are various factors that influence how easy it is to sequence some fragment and then assemble it, such as its accessibility (euchromatin is easier to access and sequence than heterochromatin), repetitiveness (long repeats of the same short sequence, or worse - of a single nucleotide, are hard to assemble) or GC-content (i.e. the fraction of G and C bases in the DNA; both unusually high and unusually low GC-content make the assembly harder) [12].

In spite of these challenges, genomes of many organisms were successfully sequenced, sometimes by pairing NGS with Sanger sequencing. In many cases, including the human genome, the sequence is incomplete; it lacks the sequence of regions that are especially hard to sequence or assemble. For many species a so-called reference genome sequence was constructed. It is not supposed to be the genome sequence of every individual of a given species, because individuals can have slightly different genomes from each other; in the case of the human genome, the intraspecies variability is estimated at 0.1%. It is simply a reference, or a standard; all research on a given organism's genome is done and described in reference to this sequence, which substantially facilitates reporting and comparing obtained results.

Because the sequences of both DNA strands are complementary to each other, only one of them is reported in databases. As both strands contain genes and regulatory elements, it is impossible to choose one as the "more important" one; the choice is purely conventional.

For complex organisms, whose DNA is stored in linear fragments called chromosomes, the decision is usually made as follows: every chromosome has an element called a centromere, which divides it into unequal parts. The strand that has its 5' end at the shorter part is the one reported in the database. For simpler organisms like bacteria, which do not have linear chromosomes with centromeres, the choice is less standardised. By convention, the reported strand is referred to as the sense strand (or positive, (+), Watson), and the complementary one as the antisense strand (or negative, (-), Crick). Usually they both contain genes, and for many genes their sense strand (i.e. the strand on which their sequence is stored) is actually the antisense strand in the context of the whole genome, and the other way around. It is therefore important to distinguish whether we are talking about the sense and antisense strand in the context of the whole genome or of a single gene.

The sequencing techniques are continually improved and new ones are being developed. In particular, some new techniques, collectively named third generation sequencing, allow sequencing of much longer fragments than NGS, up to a hundred kb. They are still much more prone to errors than first or second generation sequencing, and hence cannot be used in many applications where a low error rate is crucial. Despite this restriction, the length of reads obtained through these methods is a huge advantage and, by pairing them with next generation or Sanger sequencing, they have already helped resolve some of the missing sequences in published genomes; for example, the current version of the human reference genome, called hg38 [65], is partially assembled thanks to long third generation reads. Nevertheless, for many applications NGS methods are still preferred.

1.3. NGS-based experiments

The fact that NGS allows sequencing millions of fragments at the same time opened up many possibilities. Many technologies based on next generation sequencing, called NGS-based technologies, emerged. They allowed examination of many phenomena on the scale of the whole genome, for example DNA shape, gene expression or patterns of histone modifications [47] [45] [24].

While the exact protocol of the experiment depends on what exactly we want to discover, there is one feature that all the NGS-based experiments have in common: as a result of the experiment on a cell or a population of cells we obtain a sample with lots of short DNA fragments, and we sequence them using some next-generation sequencing platform. As a result we get millions of short reads that represent sequences of DNA fragments from the sample of interest. Usually the next step is to map those reads to the genome, i.e. find where these DNA fragments come from in the genome of the analysed organism. For that we need the reference sequence of the whole genome of the organism. Once the reads are mapped, for every position in the genome we can calculate the number of reads mapped there; this value is usually referred to as coverage or signal. In figure 1.3 a schematic representation of an example mapping and the resulting coverage is shown.


Figure 1.3: Example mapping and coverage. At the bottom, a reference sequence of a genome is shown. In the middle there are reads mapped to this reference. At the top there is the signal (or coverage), calculated for every position as the number of reads mapped to it.
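
Computing such a coverage track from mapped reads is conceptually simple. The sketch below is only an illustration of the idea (real pipelines typically rely on dedicated tools); it assumes reads are given as (start, end) intervals in 0-based, end-exclusive coordinates and counts how many reads overlap each position of a single chromosome.

```python
def coverage(reads, chrom_length):
    """Per-position read coverage from mapped read intervals.

    reads: iterable of (start, end) pairs, 0-based, end-exclusive.
    Returns a list of length chrom_length with the number of reads
    covering each position.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (chrom_length + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1
    # Prefix sums turn the difference array into the coverage signal.
    signal, running = [], 0
    for value in diff[:chrom_length]:
        running += value
        signal.append(running)
    return signal

# Three reads stacking over a tiny 10 bp "chromosome".
assert coverage([(0, 4), (2, 6), (5, 9)], 10) == [1, 1, 2, 2, 1, 2, 1, 1, 1, 0]
```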

One example of an NGS-based experiment is the ChIP-seq (Chromatin ImmunoPrecipitation followed by sequencing) protocol developed in 2007 by Barski et al. [2]. ChIP-seq allows identification of sites in DNA where a specific protein binds, for example some transcription factor or a histone with a specific modification. It is done in several steps:

1. Crosslinking - using formaldehyde, links between DNA and the proteins are fixed.

2. Fragmentation - DNA is cut into short fragments, usually several hundred nucleotides long, commonly with sonication. The proteins that were bound in vivo to the DNA are expected to still be bound to the same fragments.

3. Immunoprecipitation - using antibodies specific to the protein of interest, these proteins, together with the fragments of DNA they were bound to, are extracted.

4. Washing off the proteins and antibodies to get only the DNA fragments.

5. Sequencing the DNA fragments with NGS.

The experiment is done on a whole population of cells, so we can obtain many DNA fragments from the same place in the genome. After mapping the reads to the genome, we expect large sets of reads (and hence high coverage) in the regions where the protein of interest was frequently bound in the analysed cells. Theoretically we do not expect to observe any reads in the regions where the protein did not bind in any of the analysed cells; however, in reality we will always get some noise over the whole genome. Furthermore, due to various factors, such as differences in chromatin accessibility or systematic errors in the mapping procedure, the noise is not uniform [78] and it is often not obvious how to distinguish it from the meaningful regions of enrichment, representing our sites of interest. Because of that, for many ChIP-seq experiments control samples are also generated. Often they are prepared similarly to the ChIP-seq sample, but without the immunoprecipitation step; they are usually called "input" samples. There the obtained signal is not biased towards the regions we want to discover, but if there are any universal biases they will be present in both the ChIP-seq and the control sample. Downstream analysis that aims to identify regions of protein binding in the ChIP-seq signal can take advantage of the control sample by identifying those universal biases and distinguishing them from the meaningful enrichment, for example by calling peaks only in regions where the "input" sample shows no enrichment.

Another example of an NGS-based experiment is ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing), developed by Buenrostro et al. in 2013 [9] [10]. It is designed to find regions of open chromatin, i.e. regions where chromatin is loose and DNA is more accessible to various proteins, such as the transcription machinery. These regions are found wherever DNA needs to be accessed - mainly at active promoters or active enhancers.

The ATAC-seq protocol uses a protein called transposase Tn5. It is a protein naturally occurring in some bacteria, responsible for integrating a specific DNA sequence into various places of the bacterial genome. While the natural transposase has very low activity, in ATAC-seq a hyperactive mutant is used. It inserts short sequences called adapters at every place in the DNA it can access, hence marking loci of open chromatin. Next, the DNA fragments tagged with the adapters can be extracted and sequenced.

Similarly to ChIP-seq, ATAC-seq results in a signal over the whole genome, with peaks in the regions of interest (in this case, regions of open chromatin) and noise of varying strength across the whole genome. Again, downstream analysis needs to take that into account and try to distinguish peaks emerging from noise from the meaningful peaks at the sites of interest. In the case of ATAC-seq, there is usually no control sample.

Another example of an experiment based on NGS is RNA-seq [82]. Its aim is to measure the expression level of genes. As was described in section 1.1, when a product of a gene is needed, its sequence is transcribed into an RNA strand, called a transcript, which is then transported out of the cell's nucleus and potentially undergoes various modifications, like cutting out introns. This process, as well as the numerical assessment of its frequency for a given gene, is called expression. To perform RNA-seq, we extract the transcripts from a population of cells and transcribe them in vitro back into DNA (as the NGS technologies are designed to sequence DNA, not RNA). The DNA chains are fragmented, amplified and sequenced. The obtained reads are mapped to the genome². Then, for every type of transcript, the number of reads mapped to its location in the genome can be calculated.

²In the case of RNA-seq, the mapping procedure is more complicated. One needs to take into account that the reads can potentially come from fragments where an intron was cut out; this means that two fragments that are next to each other in the sequenced read can be separated by a long gap in the DNA sequence.

It is important to stress that the numbers of mapped reads do not correlate linearly with the numbers of transcripts that were produced in vivo in the analysed population of cells. Transcripts have varying length, from 20 bases for short, non-coding RNAs that serve mostly regulatory functions, up to millions of bases (for example, the longest currently identified transcript in the human genome is 205 kb long, excluding introns). At the same time the sequenced reads usually have similar length within a single experiment, and they are usually much shorter than most of the analysed transcripts (from 20 to 100 bases). Because of that, the number of mapped reads is a misleading way of assessing expression. Let us assume two genes, A and B, have the same expression in vivo, i.e. the same number of transcripts is created from them per minute and the same number of transcripts was present in the cell population when the RNA-seq experiment was conducted. A has length l and B has length 10l. Assuming the RNA-seq experiment was conducted without any errors or biases, the same number of transcripts from each was fragmented and sequenced; however, because the transcripts have different lengths, a different number of reads was obtained from each of them. Assuming no further errors in the sequencing and mapping procedure, we can expect 10 times more reads mapped to gene B than to gene A, even though their expression is the same. Taking into account only raw read counts therefore makes the comparison of expression between genes impossible. Additionally, if we want to compare expression between different experiments, we should take into account how the sequencing depth influences the results; the sequencing depth in NGS-based experiments reflects how often, on average, a single nucleotide of a genome is sequenced, and it depends mostly on the DNA or RNA concentration in the obtained sample and on the level of amplification. The higher the sequencing depth, the higher the obtained read counts will be. For these reasons some normalisation method is usually applied to obtain a more robust and informative assessment. One of the widely used ways of reporting expression is RPKM - Reads Per Kilobase of transcript, per Million of mapped fragments [51] (sometimes described shortly as Reads Per Kilobase Million). It is calculated as follows: all the raw read counts are divided by a scaling factor equal to the total number of mapped reads divided by one million. This way we obtain RPM, i.e. Reads Per Million. Then, for every transcript, the RPM value is divided by the length of the transcript in kilobases. It can be written as follows:

Reads Per Million. Then for every transcript the RPM value is divided by the length of the transcript in kilobases. It can be written as follows:

RP KMi = ni

N 106li(1kb)

= ni106 N li(1kb)

,

where RP KMi is the RPKM value for transcript i, ni is the number of reads mapped to i, N is the total number of mapped reads and l(1kb)i is the length of i in kb.

In 2012, Wagner et al. [80] argued that RPKM is not an accurate and robust measure for reporting expression. They claimed that a transcription measure such as RPKM should be proportional to the relative molar RNA concentration; they showed that the average relative molar RNA concentration should be the same between samples of the same species, yet the average RPKM varies. They proposed a new measure called TPM - Transcripts Per Million, and showed that its average value is indeed constant between samples of the same species, as expected. TPM can be calculated as follows:

$$\mathrm{TPM}_i = \frac{A_i \cdot 10^6}{\sum_j A_j}, \qquad \text{where} \quad A_i = \frac{n_i}{l_i},$$

$n_i$ is the number of reads mapped to transcript $i$ and $l_i$ is the length of $i$ (in bases). The sum in the denominator of the first equation runs over all transcripts. The initial definition proposed by Wagner et al. additionally multiplied $A_i$ by the average read length, but as this factor cancels in the first equation it is usually omitted from the definitions.
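
The difference between the two measures is only the order of the two normalisation steps (by sequencing depth and by transcript length), which the sketch below makes explicit. It is an illustrative implementation of the formulas above, not code from any of the tools discussed in this dissertation.

```python
def rpkm(counts, lengths_bp):
    """RPKM per transcript from raw read counts and transcript lengths (in bases)."""
    total_reads = sum(counts)
    return [
        n * 1e6 / (total_reads * (l / 1e3))  # reads per kilobase, per million mapped reads
        for n, l in zip(counts, lengths_bp)
    ]

def tpm(counts, lengths_bp):
    """TPM per transcript: length-normalise first, then scale to one million."""
    rates = [n / l for n, l in zip(counts, lengths_bp)]  # A_i = n_i / l_i
    total_rate = sum(rates)
    return [a * 1e6 / total_rate for a in rates]

# Two transcripts with equal expression but 10x different length (genes A and B above):
# the longer one collects ten times more reads, yet both measures equalise them,
# and TPM values always sum to one million.
counts, lengths = [100, 1000], [1_000, 10_000]
print(rpkm(counts, lengths))  # [90909.09..., 90909.09...]
print(tpm(counts, lengths))   # [500000.0, 500000.0]
```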

There are many other NGS-based experiments. Some are similar to ATAC-seq, in the sense that they aim to find regions of open chromatin, like DNase-seq [64] [7] or MNase-seq [19]. Some are similar to ChIP-seq, i.e. they aim to find regions of protein binding, like CUT&RUN [67] and CUT&Tag [33]. Some try to capture the 3D structure of chromatin, like Hi-C [42]. Many of them can be performed not only on a whole population of cells, as was described in the examples above, but even on a single cell. For example, single-cell RNA-seq has been gaining popularity lately [73] [74]; there is also single-cell ATAC-seq [11] [14], single-cell ChIP-seq [62] [22] and many more [72]. They usually require a different kind of downstream analysis. In this dissertation I focus on the experiments performed on a population of cells, sometimes called bulk sequencing.

The NGS-based technologies provide not only possibilities, but also challenges. They produce lots of data that need careful analysis, creating a need for new bioinformatic tools. For example, after performing a ChIP-seq experiment we are usually interested in finding sites of protein binding; that means we need a tool that will be able to identify peaks emerging from the stacking of reads representing the properly immunoprecipitated DNA fragments and distinguish them from peaks caused by noise. In general, NGS-based experiments tend to produce a lot of data, where the meaningful information is surrounded by inevitable noise; the task of separating them is not trivial. Tools that perform this task need to be able to handle large data, take into account various possible sources of noise, and preferably keep a high level of both specificity and sensitivity. While there are many tools available for many NGS-based experiments, there are still cases where the available tools fail, and hence there is an ongoing need to create new ones, often designed to work only in some specific, narrow cases.


Chapter 2

Peakcalling

In this chapter I describe the procedure of peakcalling. In section 2.1 I introduce the main idea behind the procedure and define the terms that I use throughout this dissertation. In section 2.2 I present example programs that perform peakcalling, including MACS and BayesPeak. Finally, in section 2.3 I discuss various issues and challenges that concern peakcalling.

2.1. Introduction

Various NGS-based experiments, including ATAC-seq and ChIP-seq, aim at identifying sites of interest in the genome. They are designed so that the obtained sample is enriched in fragments of the genome coming from the sites of interest. However, the sample never consists only of those fragments, due to imperfections in the protocol; for example, in the case of ChIP-seq, the immunoprecipitation step does not necessarily select only the fragments that were bound to the protein of interest, possibly because the antibodies were not perfectly specific. Furthermore, during the sequencing step and the subsequent mapping of the reads additional noise is usually introduced [78].

As was mentioned in section 1.3, after mapping the reads, for every position in the genome we can calculate the number of reads that mapped there; I refer to this value as the signal or coverage. We often want to identify the sites where significantly more reads were mapped (or, in other words, where the signal is significantly higher) compared to the regions around them. These sites are often referred to as peaks and they are expected to represent the sites of interest (e.g. sites of protein binding in the case of ChIP-seq or sites of open chromatin in the case of ATAC-seq). More formally, we want a function or procedure that takes the signal over the whole genome as input and returns a binary segmentation of the genome, where each position is labelled either as "peak" or "not peak". The procedure of identifying peaks is usually called peakcalling, and I will refer to the programs that perform it as peakcallers.


Figure 2.1: Example signal along with two sets of peaks identified within it by two different programs - set 1 is identified by MACS (described in section 2.2.1) and set 2 by my program HERON, described in chapter 3. An example fragment of a genome is shown; specifically, an approximately 330 kb long fragment of human chromosome 3. The coverage comes from ChIP-seq against the H3K27me3 modification in fetal brain, performed as a part of the Roadmap Epigenomics project. I write more about this data in chapter 5.

Peaks can have different characteristics that make them easier to detect for some methods and harder for others; these include their width (or length), shape and height, both absolute (i.e. how many reads form the peak) and relative to the surrounding region (i.e. how much higher the peak is compared to the background). When a peak is very long it is sometimes referred to as a domain. The regions outside peaks are often called background or noise. I will also use these terms to describe the numerical assessment of signal strength in these regions. The average height of peaks compared to the noise around them is sometimes referred to as the signal-to-noise ratio¹. Usually the higher the ratio, the easier the peaks are to detect.

¹Mathematically speaking, the signal-to-noise ratio is defined as the ratio between the strength of the meaningful signal in the observed signal and the strength of the noise, i.e. the rest. In the case of NGS data, various methods of estimating it have been proposed (see for example [85], [1], [53]). Here we will not use any mathematical definition, as the general intuitive one will suffice.

It is noteworthy that in this context a peak has no strict mathematical definition. Furthermore, depending on the experiment conducted, different patterns might be considered a peak. In particular, the same signal could be interpreted differently depending on what experiment produced it - in figure 2.1 I present an example of such a signal, along with two sets of peaks identified within it. If the signal comes from an experiment that gives short, well-positioned peaks - such as ATAC-seq, or ChIP-seq against a transcription factor - the first set of peaks could be considered more reliable. On the other hand, if the signal comes from an experiment that gives long domains - such as ChIP-seq against H3K27me3 (triple methylation ("me3") of the 27th lysine ("K") of histone H3) or H3K9me3 (triple methylation of the 9th lysine of histone H3) - the second set could be a more reasonable choice.

This poses an important question about validating the results of peakcalling. How do we assess whether a peakcaller gives satisfactory results? One approach is to check whether the peaks have the characteristics that we expect from the sites of interest of the given experiment. For example, let us say that the target protein of the ChIP-seq experiment is some transcription factor that recognises a specific sequence motif in a gene's promoter, binds to it and enhances the gene's expression. In that case we expect the discovered peaks to be short, localised inside promoters near the specific motif, and correlated with enhanced expression of the nearby gene. On the contrary, if the ChIP-seq was performed against some histone modification that is found mostly in regions of closed heterochromatin, we expect the peaks to be long and correlated with lowered expression of the genes within them. Another approach would be to generate simulated data, where we know exactly where the peaks are and can check whether our method found them correctly. Simulation approaches allow us to easily estimate the specificity and sensitivity of any tested method. On the other hand, one can never be sure whether the simulated data properly captures all the characteristics of real data, and specifically of the noise within it. I present the first approach to peakcalling testing in chapters 5 and 6, and the latter in chapter 4.

There are various approaches to the peakcalling procedure itself. The easiest, naive approach would be to define a threshold and consider every region with signal above it a peak (a minimal sketch of this baseline is shown below). Most signals from NGS-based experiments, however, are too noisy for this approach to be applicable. In the following sections I present different methods that use more sophisticated statistical frameworks.
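
The sketch below is a minimal illustration of this thresholding baseline (it is not code from any of the peakcallers discussed in this chapter): it reports maximal runs of positions whose signal exceeds a fixed threshold as peaks.

```python
def threshold_peaks(signal, threshold):
    """Naive peakcalling: report maximal runs of positions with signal > threshold.

    signal: per-position coverage values; returns a list of (start, end) regions,
    0-based and end-exclusive.
    """
    peaks, start = [], None
    for pos, value in enumerate(signal):
        if value > threshold and start is None:
            start = pos                      # a run of enriched positions begins
        elif value <= threshold and start is not None:
            peaks.append((start, pos))       # the run ends
            start = None
    if start is not None:                    # run extending to the end of the signal
        peaks.append((start, len(signal)))
    return peaks

assert threshold_peaks([0, 1, 5, 6, 2, 0, 7, 7], threshold=3) == [(2, 4), (6, 8)]
```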

2.2. Example approaches to peakcalling

2.2.1. MACS

MACS (Model-based Analysis of ChIP-Seq) [87] is one of the most widely used peakcallers. It was initially designed to work with ChIP-seq data, but it can also be used with ATAC-seq or many other NGS-based experiments. One of its main features is the ability to build a model that calculates the optimal value for shifting the reads prior to the peakcalling. The need for read shifting stems from the fact that DNA fragments are sequenced only from the 5' end, and the length of the sequenced fragments is limited by the technology used. This means that the place that corresponds to the actual localisation of the protein of interest is shifted, with respect to the peak observed in the coverage of reads, in the 3' direction of the strand the reads were mapped to (see figure 2.2). Since the fragments are equally likely to come from both strands, the peaks that we want to identify should have a similar distribution of reads on the sense and antisense strand, but shifted relative to each other. To take this into account, MACS first shifts all the reads towards the 3' end by some number of nucleotides. This number can be either given by the user or estimated by MACS's model. The estimation is done as follows: a window of a fixed size 2 * bandwidth is slid over the whole genome to find windows with signal mfold times stronger than the signal coming from randomly distributed reads. Both bandwidth and mfold can be provided by the user; the default values are 300 and 32, respectively. The bandwidth should be equal to the average size of fragments after sonication.


Figure 2.2: Simplified overview of a ChIP-seq experiment. The blue ball represents the protein of interest, whose position we want to determine. The waves represent the fragments after sonication and immunoprecipitation, coming from the sense (blue) or antisense (red) strand. Only the bolded beginning of a fragment is sequenced and mapped. Because of this, the reads mapped to both the sense and antisense strand are shifted towards the 5' end with respect to the actual locus of the protein binding. The distance between the peaks mapped to different strands is denoted by d. After estimating its value, MACS shifts all the reads by d/2 towards the 3' end, so that their localisation is closer to the actual localisation of the protein of interest. Figure reprinted from the paper Evaluation of Algorithm Performance in ChIP-seq Peak Detection by Wilbanks and Facciotti, published in PLoS ONE in 2010 [84] under a Creative Commons license.

MACS randomly selects 1000 of the enriched windows and separates the reads from the sense and antisense strand. Then, the mean distance d between the reads on the different strands is calculated, and half of this distance is set as the number of bases by which all the reads will be shifted.
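To make the shifting step more tangible, the following is a deliberately simplified, hypothetical Python sketch. It is not MACS's implementation: here d is estimated from the reads of a single toy window rather than averaged over 1000 selected windows, and all names are illustrative.

```python
import numpy as np

def estimate_shift(plus_starts, minus_starts):
    """Crude estimate of d within one enriched window: the distance between the
    average 5' position of plus-strand reads and that of minus-strand reads."""
    return float(np.mean(minus_starts) - np.mean(plus_starts))

def shift_reads(read_starts, strands, d):
    """Shift every read by d/2 towards its 3' end (+d/2 on the plus strand,
    -d/2 on the minus strand), as described above."""
    shift = int(round(d / 2))
    return [pos + shift if strand == "+" else pos - shift
            for pos, strand in zip(read_starts, strands)]

# Toy window: plus-strand reads pile up about 100 bp upstream of minus-strand reads.
plus, minus = [1000, 1010, 1020], [1100, 1110, 1120]
d = estimate_shift(plus, minus)   # 100.0
print(shift_reads(plus + minus, ["+"] * 3 + ["-"] * 3, d))
# [1050, 1060, 1070, 1050, 1060, 1070]
```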

The parameter d, either estimated or provided by the user, is additionally used to determine the size of the window that is used to identify the enriched regions. After shifting all the reads by d/2, MACS slides a window of size 2d across the genome. It is assumed that the signal from each of the windows comes from a Poisson distribution with parameter λ. The Poisson distribution reflects the probability of k random events occurring, assuming each one occurs with the same probability; the parameter λ represents the expected value of the number of such events. Intuitively, such a setting seems to fit NGS data: the random event is the presence of a read at a given position, while the expected value of the number of reads is the mean coverage. However, in practice, the signal from NGS-based experiments is much more dispersed than one would assume if it came from a Poisson distribution, in which both the mean and the variance are equal to the same parameter λ. Experimental data tends to have a much larger sample variance than sample mean, and so the Poisson distribution tends to fit it poorly [30], [21].

For every window of size 2d, the number of reads mapped within it is counted; I denote this value as k. The probability that k reads were mapped in this window is denoted $P_{\hat{\lambda}}(k)$; it is the density of the Poisson distribution with parameter $\hat{\lambda}$ at point k. To take into account various local changes in background intensity, the parameter $\hat{\lambda}$ is estimated locally. Specifically, for a given window it is calculated as follows:

$$\hat{\lambda} = \max(\hat{\lambda}_{1\mathrm{kb}}, \hat{\lambda}_{5\mathrm{kb}}, \hat{\lambda}_{10\mathrm{kb}}, \hat{\lambda}_{\infty}),$$

where $\hat{\lambda}_x$ is the maximum likelihood estimator of the parameter λ, estimated in a window of size x centered on the given window; size $x = \infty$ means it is estimated from the whole genome. The maximum likelihood estimator in the case of the Poisson distribution is simply the arithmetic mean of the values. If a control sample is available, the values used to estimate the $\hat{\lambda}$ parameters are the coverages in the control sample. If it is not, the coverages from the ChIP sample are used and $\hat{\lambda}_{1\mathrm{kb}}$ is not calculated.
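The following Python sketch illustrates this max-over-scales estimate under some simplifying assumptions of my own: the control coverage is a hypothetical per-base read count along one chromosome, and the returned value is a per-base rate that, in MACS, would still have to be rescaled to the size of the scanning window (2d) before computing the p-value.

```python
import numpy as np

def local_lambda(control_coverage, center, scales=(1_000, 5_000, 10_000)):
    """Max-over-scales estimate of the local background rate around `center`."""
    control_coverage = np.asarray(control_coverage, dtype=float)
    # The maximum likelihood estimator of a Poisson rate is the arithmetic mean.
    candidates = [control_coverage.mean()]            # genome-wide estimate
    for size in scales:
        lo = max(0, center - size // 2)
        hi = min(len(control_coverage), center + size // 2)
        candidates.append(control_coverage[lo:hi].mean())
    return max(candidates)

rng = np.random.default_rng(1)
coverage = rng.poisson(1.0, size=50_000)              # flat, illustrative background
print(local_lambda(coverage, center=25_000))
```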

Once the probability $P_{\hat{\lambda}}(k)$ is calculated, the window is considered a peak if this value is below a given p-value threshold ($10^{-5}$ by default). Peak windows that are no further from each other than g are merged. By default, the parameter g is equal to d, i.e. half of the window size used to identify enriched regions. Additionally, the user can add the "broad" argument to indicate that they expect broader peaks in their data; in this case, g = 4d. The user can also specify a custom value for g.
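The merging step itself is simple interval arithmetic; a possible sketch (function and argument names are mine, not MACS's) is shown below.

```python
def merge_windows(windows, gap):
    """Merge sorted (start, end) windows whose distance does not exceed `gap`."""
    merged = []
    for start, end in sorted(windows):
        if merged and start - merged[-1][1] <= gap:
            # Close enough to the previous window: extend it instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(merge_windows([(100, 300), (350, 500), (2000, 2200)], gap=100))
# [(100, 500), (2000, 2200)]
```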

MACS is capable of handling multiple samples at once, for example when we have many replicates of the same experiment. When provided with multiple files, MACS sums up the coverages, hence strengthening the signal at the sites of peaks that are common between the samples.

When provided with an input file, MACS can compute an empirical estimate of the False Discovery Rate (FDR) for each peak. To do so, it runs the peakcalling procedure once again using the parameter $\hat{\lambda}$ with which the peak was called, but with the samples swapped, i.e. treating the ChIP sample as the control sample and vice versa. Because in this setting we do not expect to observe any meaningful peaks, any called peaks are considered false positives. The FDR is defined as the total number of peaks found in this way divided by the total number of peaks found with the classic procedure.
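The resulting estimate is a single ratio over the two peak sets; a trivial sketch with hypothetical arguments could look as follows.

```python
def empirical_fdr(peaks_sample_swapped, peaks_original):
    """Peaks called after swapping ChIP and control are treated as false positives."""
    if not peaks_original:
        return 0.0
    return len(peaks_sample_swapped) / len(peaks_original)

print(empirical_fdr([(10, 50)] * 12, [(10, 50)] * 400))   # 0.03
```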


2.2.2. BayesPeak (HMM-based approach)

Another approach is to use a Hidden Markov Model (HMM) to model the coverage. HMMs (described in more detail in section 3.1) have often been used to model various genome-scale data, including results from NGS-based experiments (see for example Ji et al. 2005 [31], Munch et al. 2006 [52]). This approach is also implemented in BayesPeak [69], a Bioconductor package for R.

BayesPeak first divides the genome into adjacent windows, as presented in figure 2.3; the windows are denoted $S_t$, $S_{t+1}$, etc. Every window is assumed to be in one of two states: "peak" or "no peak". Every window emits two values, denoted $Y_t^+$ and $Y_t^-$, that represent the coverage of reads mapped to the positive and negative strand in this window. The window's length should be chosen depending on the reads' length, so that most reads map to two neighbouring windows. As a result, the signals $Y_t^+$ and $Y_{t+1}^-$, i.e. the numbers of reads mapped to the positive and the negative strand in neighbouring windows, have the same dependence on the sequence of states. Because of that, the states in the Hidden Markov Model used by BayesPeak actually represent two adjacent windows (denoted $Z_t$, $Z_{t+1}$ and $Z_{t+2}$ in the figure). While using states representing only individual windows would be more straightforward, such an approach would mean that the state of any window $S_t$ depends only on the previous window $S_{t-1}$. As the values emitted by neighbouring windows are strongly correlated due to the way the window size is chosen, such an assumption might be an oversimplification and might make discovering biologically meaningful enrichment difficult. This is why the "double windows" $Z_t$ are used, allowing the additional dependence on the window $S_{t-2}$ to be taken into consideration.

Every such double window can be in one of four states: "both windows within it are peaks" (denoted 3), "the first is a peak" (denoted 2), "the second is a peak" (1) or "both are background" (0). The states 1, 2 and 3 are assumed to have the same effect on the enrichment. The signal from those states comes from a negative binomial distribution. It is similar to the Poisson distribution, but it has two parameters, usually denoted p and r, allowing more flexibility. Specifically, unlike in the Poisson distribution, the mean and the variance are not modelled by the same parameter (section 3.1.4.2 gives more details on this distribution). Because of that, the negative binomial distribution is more suitable for modelling data with variance higher than the mean, such as coverage from NGS-based experiments.
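For concreteness, in one common parameterisation of the negative binomial distribution (where X counts the failures observed before the r-th success, each trial succeeding with probability p; this is not necessarily the exact convention used in [69]) we have

$$\mathbb{E}[X] = \frac{r(1-p)}{p}, \qquad \mathrm{Var}[X] = \frac{r(1-p)}{p^2} = \frac{\mathbb{E}[X]}{p} \geq \mathbb{E}[X],$$

so for p < 1 the variance exceeds the mean, which matches the overdispersion typically seen in NGS coverage.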

Estimating the parameters of a negative binomial distribution can be challenging, which is why BayesPeak takes advantage of the fact that it can be expressed as a Poisson-Gamma mixture; that is, if a variable X has a negative binomial distribution (written $X \sim NB(p, r)$), it can be expressed as a Poisson distribution with parameter λ, which itself is a random variable with a gamma distribution ($X \sim \mathrm{Poisson}(\lambda)$, $\lambda \sim \Gamma(\alpha, \beta)$). Instead of directly estimating the parameters p and r of the negative binomial distribution, BayesPeak estimates the parameters α and β of the Γ distribution. If a control sample is available, the actual distribution of reads differs for every window and is defined as $\mathrm{Poisson}(\lambda \gamma^{n})$, where γ is an additional parameter, while n is the number of reads in the control sample in the given window. More formally, it can be written as follows:


Figure 2.3: Schematic overview of BayesPeak's approach. Red and blue arrows represent reads mapped to the positive and negative strand, respectively. Regions $S_t$ represent the windows into which the genome is partitioned, while $Z_t$ represent the "double windows" created by merging two neighbouring $S_t$ windows. The HMM used by BayesPeak models the sequence of the $Z_t$ double windows. Figure reprinted from the paper BayesPeak: Bayesian analysis of ChIP-seq data by Spyrou et al., published in BMC Bioinformatics in 2009 [69] under a Creative Commons license.

$$Y_t^+, Y_{t+1}^- \mid Z_t = 0 \sim \mathrm{Poisson}(\lambda_0 \gamma^{n_t})$$
$$Y_t^+, Y_{t+1}^- \mid Z_t = 1, 2, 3 \sim \mathrm{Poisson}((\lambda_0 + \lambda_1)\gamma^{n_t})$$
$$\lambda_0 \sim \Gamma(\alpha_0, \beta_0) \qquad \lambda_1 \sim \Gamma(\alpha_1, \beta_1),$$

where $n_t$ is the number of 5' ends of control-sample reads mapped to either strand in windows t and t + 1, and γ is set to 1 if the control file is not available.
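The mixture representation is easy to check numerically. The short Python snippet below (with arbitrary, illustrative values of α and β, and β treated as a rate parameter, so numpy's scale is 1/β) draws counts whose rate is itself Gamma-distributed and shows the resulting overdispersion; it does not reproduce BayesPeak's actual model with the γ term.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, beta = 2.0, 0.1
lam = rng.gamma(shape=alpha, scale=1.0 / beta, size=100_000)   # lambda ~ Gamma(alpha, beta)
counts = rng.poisson(lam)                                      # X | lambda ~ Poisson(lambda)

# Marginally the counts are negative binomial: the sample variance (about 220)
# is far larger than the sample mean (about 20).
print(counts.mean(), counts.var())
```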

The parameters of the HMM are estimated using Markov Chain Monte Carlo. Similarly to the Expectation-Maximisation algorithm, which is described in section 3.1.4, the estimation is done in iterations until convergence. In the initial step, the parameters are initialised with some prior distributions: α, β and γ with a Gamma distribution, while p and r with a Beta distribution. These distributions can be defined as follows:


$$x \sim \mathrm{Gamma}(a, b) \iff f(x \mid a, b) = \frac{b^a}{\Gamma(a)} x^{a-1} e^{-bx}$$
$$x \sim \mathrm{Beta}(a, b) \iff f(x \mid a, b) \propto x^{a-1}(1 - x)^{b-1},$$

where f is a density function and Γ is the gamma function.

Each iteration consists of two steps. In the first one, the sequence of states is simulated given the current parameter values and the observed emissions (i.e. the signal from the analysed experiment). In the second one, the parameters are simulated given the complete data, i.e. the sequence of both emissions and states. At the final step, each double window $Z_t$ is assigned to each state of the HMM with some probability. These probabilities can then be translated into posterior probabilities for each window $S_t$ being either a peak or not.

To identify peaks, a threshold for these probabilities needs to be chosen, above which the window is considered a peak. By default the threshold is set to 0.5; it can be set to any other value, depending on whether we want more or less probable regions, but the authors note that on the data on which BayesPeak was tested the majority of posterior probabilities were either very close to zero or to one, so any threshold from a wide range of values would yield very similar results.
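Turning the per-window posterior probabilities into peak intervals then amounts to thresholding and joining adjacent above-threshold windows. Below is a possible post-processing sketch, not BayesPeak's own code; the window size, the threshold and the function name are illustrative.

```python
def windows_to_peaks(posteriors, window_size, threshold=0.5):
    """Join adjacent windows whose posterior peak probability reaches the threshold."""
    peaks, current = [], None
    for i, p in enumerate(posteriors):
        if p >= threshold:
            start = i * window_size
            if current and current[1] == start:
                current = (current[0], start + window_size)   # extend the current peak
            else:
                if current:
                    peaks.append(current)
                current = (start, start + window_size)        # start a new peak
    if current:
        peaks.append(current)
    return peaks

print(windows_to_peaks([0.1, 0.9, 0.95, 0.2, 0.7], window_size=100))
# [(100, 300), (400, 500)]
```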

2.2.3. SICER

SICER (Spatial clustering approach for the Identification of ChIP-Enriched Regions) [86] is another peakcaller, one that defines a spatial clustering method to identify enrichment. Its authors claim it is especially well suited for discovering long domains.

Similarly to BayesPeak, SICER first partitions the genome into non-overlapping windows of some fixed length. Next, in every window a score is calculated as:

$$\mathrm{score} = -\log P(\mathrm{number\_of\_reads}, \lambda),$$

where $P(x, \lambda)$ is the Poisson probability of observing x reads given the rate λ. The parameter λ is the same for every window and is equal to the average number of reads in a window. In other words, the score reflects how unlikely it is that the observed number of reads would map to this specific window by chance, assuming a uniform probability of mapping a read anywhere in the genome. It is worth noting that such an approach does not take into account various local changes in background intensity (though they might be taken into account at a later stage, if a control file is available). Based on the total number of reads in the sample and the desired p-value, a threshold $l_0$ is calculated; windows with read counts above the threshold are called "eligible". A sequence of eligible windows is called an island; the sequence can be interrupted by a sequence of ineligible windows, as long as the length of this gap does not exceed some predefined threshold g (either with the default value or provided by the user). The authors note that the p-value used to determine the threshold $l_0$ should not be too small, as the islands will be filtered further and there is no need for overly rigorous initial criteria. The recommended value is 0.2.
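The score of a single window can be computed directly from the Poisson probability mass function; a minimal sketch using SciPy is given below (the numbers are arbitrary, and SICER itself only computes scores for windows that already passed the eligibility threshold).

```python
from scipy.stats import poisson

def window_score(read_count, mean_reads_per_window):
    """-log of the Poisson probability of seeing `read_count` reads in one window,
    assuming reads are uniformly distributed over the genome."""
    return -poisson.logpmf(read_count, mean_reads_per_window)

# A window with 9 reads where only 2 are expected on average scores highly.
print(window_score(9, 2.0))
```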
