MACS - Example approaches to peakcalling - HMM-based Method for Identifying Enrichment in Signa

2.2. Example approaches to peakcalling

2.2.1. MACS

MACS (Model-based Analysis of ChIP-Seq) [87] is one of the most widely used peak-callers. It was initially designed to work with ChIP-seq data, but it can be also used with ATAC-seq or many other NGS-based experiments. One of its main features is the ability to build a model that calculates the optimal value for shifting the reads prior to the peakcalling. The need for read shifting stems from the fact that DNA fragments are sequenced only from the 5' end, and the length of the sequenced fragments is limited by the used technology. It means that the place that corresponds to the actual localisation of the protein of interest is shifted in respect to the peak observed in the coverage of reads in the 3' direction of the strand they were mapped to (see gure 2.2). Since the fragments are equally likely to come from both strands, the peaks that we want to identify should have similar distribution of reads on the sense and antisense strand, but shifted relative to each other. To take this into account, MACS rst shifts all the reads towards 3' end by some number of nucleotides. The number can be either given by the user or estimated by MACS's model. The estimation is done as follows: the window of a xed size 2∗bandwidth is slided over the whole genome to nd windows with signal stronger mfold times than signal coming from randomly distributed reads. Both bandwidth and mfold can be pro-vided by the user; the default vaules are 300 and 32, respectively. The bandwidth should

Figure 2.2: Simplied overview of a ChIP-seq experiment. The blue ball represents the protein of interest, whose position we want to determine. The waves represent the frag-ments after sonications and immunoprecipitation, coming from sense (blue) or antisense (red) strand. Only the bolded beginning of a fragment is sequenced and mapped. Be-cause of this, the reads mapped to both sense and antisense strand are shifted to 5' end in respect to the actual locus of the protein binding. The distance between the peaks mapped to dierent strands is denoted by d. After estimating its value, MACS shifts all the reads by d/2 toward 3' end, so that their localisation would be closer to the actual localisation of the protein of interest. Figure reprinted from paper Evaluation of Algo-rithm Performance in ChIP-seq Peak Detection by Wilbanks and Facciotti, published in PloS one in 2010 [84] under Creative Commons license.

be equal to an average size of fragments after sonication. MACS randomly selects 1000 of the enriched windows and separates reads from the sense and antisense strand. Then, the mean distance d between the reads on dierent strands is calculated and half of this distance is set as the numbers of bases all the reads will be shifted.

The parameter d, either estimated or provided by the user, is additionally used to deter-mine the size of the window that is used to identify the enriched regions. After shifting all the reads by d/2, MACS slides a window of a size 2d across the genome. It is assumed that the signal from each of the windows comes from Poisson distribution with param-eter λ. Poisson distribution reects probability of k random events occuring, assuming

each one occurs with the same probability; parameter λ represents expected value of the number of such events. Intuitively, such a setting seems to t the NGS data: the random event is a presence of a read in a given position, while the expected value of number of reads is the mean coverage. However, in practice, the signal from NGS-based experiments is much more dispersed than one would assume if it came from Poisson distribution, in which both the mean and the variance are equal to the same parameter λ. Experimental data tends to have much larger sample variance than sample mean, and so the Poisson distribution tends to t it poorly [30] [21].

For every window of size 2d, number of reads mapped within it is counted; I denote this value as k. The probability that k reads were mapped in this window is denoted Pˆλ(k); it is a density of Poisson distribution with parameter ˆλ in point k. To take into account various local changes in background intensity, the parameter ˆλ is estimated locally. Specically, for a given window it is calculated as follows:

λ = max(ˆˆ λ_1kb, ˆλ_5kb, ˆλ_10kb, ˆλ∞),

where ˆλx is a maximum likelihood estimator for parameter λ, estimated in a window of size x, centered in the given window; size x = ∞ means it is estimated from the whole genome. The maximum likelihood estimator in the case of Poisson distribution is simply an arithmetic mean of the values. If a control sample is available, the values used to estimate ˆλ parameters are the coverages in the control sample. If it is not, the coverages from the ChIP sample are used and the ˆλ1kbis not calculated.

Once the probability Pλˆ(k) is calculated, if it is below a given threshold p-value (10⁻⁵ by default) the window is considered a peak. The peak windows that are no further from each other than g are merged. By default, g parameter is equal to d, i.e. half of the window size used to identify enriched regions. Additionally the user can add "broad"

argument to denote that they expect broader peaks in their data; in this case, g = 4d.

The user can also specify a custom value for g.

MACS is capable of handling multiple samples at once, for example when we have many replicates of the same experiment. When provided with multiple les, MACS sums up the coverages, hence strenghtening signal at the sites of peaks that are common between the samples.

When provided with an input le, MACS can compute an empirical estimate of False Discovery Rate (FDR) for each peak. To do so, it runs the peakcalling procedure once again using parameter ˆλ with which the peak was called, but with samples swapped - i.e.

treating ChIP sample as control sample and vice-versa. Because in this setting we do not expect to observe any meaningful peaks, any called peaks are considered false positives.

The FDR is dened as the total number of peaks found in this way devided by the total number of peaks found with the classic procedure.

2.2.2. BayesPeak (HMM-based approach)

Another approach is to use Hidden Markov Model (HMM) to model the coverage. HMMs (described in more details in section 3.1) have often been used to model various genome-scale data, including results from NGS-based experiments (see for example Ji et al. 2005 [31], Munch et al. 2006 [52]). This approach is also implemented in BayesPeak [69], a Bioconductor package for R.

BayesPeak rst divides the genome into adjacent windows, as presented in gure 2.3;

windows are denoted as St, St+1 etc. Every window is assumed to be in one of two states: "peak" or "no peak". Every window emits two values, denoted Yt⁺ and Yt⁻, that represent the coverage of reads mapped to the positive and negative strand in this window. The window's length should be chosen depending on the reads' length, so that most reads would map to two neighbouring windows. As a result signal Yt⁺ and Y_t+1⁻ , i.e. number of reads mapped to the positive and negative strand in neighbouring windows has the same dependence on the sequence of states. Because of that, the states in the Hidden Markov Model used by BayesPeak actually represent two adjacent windows (denoted Zt, Z_t+1and Zt+2in the gure). While using states representing only individual windows would be more straightforward, such approach would mean that the state of any window Stdepends only on the previous window St−1. As the values emitted by the neighbouring windows are strongly correlated due to the way a window's size is chosen, such assumption might be an oversimplication and might make discovering biologically meaningful enrichment dicult. This is why the "double windows" Zt are used, allowing to take into consideration the additional dependence on the window St−2. Every such double window can be in one of four states: "both windows within it are peaks" (denoted 3), "the rst is a peak" (denoted 2), "the second is a peak" (1) or "both are background"

(0). The states 1, 2 and 3 are assumed to have the same eect on the enrichment. Signal from those states comes from negative binomial distribution. It is similar to the Poisson distribution, but it has two parameters, usually denoted p and r, allowing more exibility.

Specically, unlike in Poisson distribution, mean and variance are not modelled by the same parameter (section 3.1.4.2 gives more details on this distribution). Because of that, the negative binomial distribution is more suitable for modelling data with variance higher than mean, like coverage from NGS-based experiments.

Estimating parameters of negative binomial distribution can be challenging, which is why BayesPeak takes advantage of the fact that it can be expressed as a Poisson-Gamma mixture; that is, if a variable X has negative binomial distribution (noted X ∼ NB(p, r)) it can be expressed as a Poisson distribution with parameter λ, which itself is a random variable with distribution gamma (X ∼ P oisson(λ), λ ∼ Γ(α, β)). Instead of directly estimating parameters p and r of the negative binomial distribution, BayesPeak estimated the parameters α and β of the Γ distribution. If control sample is available, the actual distribution of reads diers for every window and is dened as P oisson(λγⁿ), where γ is an additional parameter, while n is a number of reads in control sample in the given window. More formally, it can be written as follows:

Figure 2.3: Schematic overview of BayesPeak's approach. Red and blue arrows represent reads mapped to the positive and negative strand, respectively. Regions St represent windows into which the genome is partitioned, while Zt represent "double windows"

created by merging two neighbouring Stwindows. HMM used by BayesPeak models the sequence of the Zt double windows. Figure reprinted from paper BayesPeak: Bayesian analysis of ChIP-seq data by Spyrou et al., published in BMC Bioinformatics in 2009 [69]

under Creative Commons license.

P (Y_t⁺, Y_t+1⁻ |Z_t= 0) ∼ P oisson(λ0γⁿ^t) P (Y_t⁺, Y_t+1⁻ |Z_t= 1, 2, 3) ∼ P oisson((λ0+ λ1)γⁿ^t)

λ₀∼ Γ(α₀, β₀) λ1∼ Γ(α₁, β1),

where nt is the number of 5' ends of reads mapped to either strand in windows t and t + 1, and γ is equal to 1 if the control le is not available.

The parameters of the HMM are estimated using Markov Chain Monte Carlo. Similarly to Expectation-Maximisation algorithm, that is described in section 3.1.4, the estimation is done in iterations until convergence. In the initial step, parameters are initialised with some prior distributions: α, β and γ with Gamma distribution, while p and r with Beta distribution. The distributions can be dened as follows:

x ∼ Gamma(a, b) ⇐⇒ f (x|a, b) = b^a

Γ(a)x^a−1e^−bx x ∼ Beta(a, b) ⇐⇒ f (x|a, b) ∝ x^a−1(1 − x)^b−1 where f is a density function, and Γ is a gamma function.

Each iteration consists of two steps - in the rst one, the sequence of states is simulated given the current parameters' values and the observed emissions (i.e. the signal from the analysed experiment). In the second one, the parameters are simulated given the complete data, i.e. the sequence of both emissions and states. At the nal step, each double-window Ztis assigned to each state of the HMM with some probability. It can be then translated into posterior probabilities for each window Stbeing either a peak or not.

To identify peaks, a threshold for these probabilities needs to be chosen, above which the window is considered a peak. By default the threshold is set to 0.5; it can be set to any other value, depending whether we want more or less probable regions, but the authors note that on the data BayesPeak was tested the majority of posterior probabilities were either very close to zero or to one, so any threshold from a wide range of values would yield very similar results.

W dokumencie HMM-based Method for Identifying Enrichment in Signal from Sequencing-based Experiments (Stron 31-36)