SICER - Example approaches to peakcalling

2.2. Example approaches to peakcalling

2.2.3. SICER

x ∼ Gamma(a, b) ⇐⇒ f (x|a, b) = b^a

Γ(a)x^a−1e^−bx x ∼ Beta(a, b) ⇐⇒ f (x|a, b) ∝ x^a−1(1 − x)^b−1 where f is a density function, and Γ is a gamma function.

Each iteration consists of two steps - in the rst one, the sequence of states is simulated given the current parameters' values and the observed emissions (i.e. the signal from the analysed experiment). In the second one, the parameters are simulated given the complete data, i.e. the sequence of both emissions and states. At the nal step, each double-window Ztis assigned to each state of the HMM with some probability. It can be then translated into posterior probabilities for each window Stbeing either a peak or not.

To identify peaks, a threshold for these probabilities needs to be chosen, above which the window is considered a peak. By default the threshold is set to 0.5; it can be set to any other value, depending whether we want more or less probable regions, but the authors note that on the data BayesPeak was tested the majority of posterior probabilities were either very close to zero or to one, so any threshold from a wide range of values would yield very similar results.

The score of an island is equal to the sum of scores of all the eligible windows within this island. Islands with score above some threshold sT are considered peaks. If the control le is not available, the threshold is calculated so that the following condition is met: the expected number of islands considered peaks, assuming the eligible windows are distributed randomly and independently, is below some value E (by default equal to 1000). If we use a control le, the rst step is similar; using some not restrictive value E, SICER rst lters the candidate islands. Next, for every island we calculate a p-value as P^∞_n=n_sP (n, cnc), where ns, nc are numbers of reads in the ChIP sample and in the control, respectively; P (x, y) is a density of Poisson distribution in x with parameter y and c is a scaling factor equal to the ratio between the total number of ChIP reads and control reads. The islands with enrichment atypical comparing to control le will have atypically small p-value. As we are only interested in the islands that are signicantly enriched, not the ones that have unusually low coverage, the islands with ns ≤ cn_c are discarded. Finally, islands with statistically signicant enrichment are identied with a p-value threshold using Bonferroni correction for multiple testing.

In both cases, i.e. with and without control le, to calculate the p-value threshold we need to assume some distribution M(s) of scores for islands. It can be dened in a recursive manner, by partitioning an island into the last eligible window (for which the distribution of scores is easily dened) and the rest of the island (for which the distribution can be dened by repeating this procedure). To obtain an analytical expression in a closed form descibing this distribution, and specically its asymptotic behaviour (as we are mostly interested in probability of an island getting a high score), the authors additionally assume that its behaviour is similar to an exponential decay (M(s) ∼ αe^−βs). Using this assumption, they can derive the equation for M(s) in a closed form. Once the distribution M (s) is dened, one could calculate the threshold s above which the island should be considered eligible if we want to meet the conditions dened above. The authors assume that the enriched islands - that is, peaks - are rare and hence independent, and so they note that the number of islands of score s is simply equal to M(s) ∗ L, where L is the size of the whole genome.

2.2.4. Other approaches

There are many other approaches to peakcalling (see for example the following reviews and papers: Park 2009 [55],Pepke et al. 2009 [57], Wilbanks et al. 2010 [84] Micsinai, Parisi et al. 2012 [50]. Thomas et al. 2017 [76] and Jeon et al. 2020 [29]). In most cases, some distribution needs to be assumed; while Poisson distribution was one of the rst used (MACS [87], HPeak [59], SIPeS [81], SICER [86]), the negative binomial is currently the most popular one (CisGenome [30], BayesPeak [69], ZINBA [61]), but also sometimes binomial (Peak-seq [63]), Gaussian or Gaussian mixture (Sole-Search [6], JAMM [27]) or even exponential and log-normal (Genrich [21]).

As was mentioned in section 2.1, the observed signal depends on many factors. One of such factors is mappability, i.e. assessment of how certain we can be about a read mapping to the specic location. One way of taking this into account is to use a control le, but some

argue it might not be sucient and one should incorporate the additional factors directly in the model, whenever possible. Furthermore, the control le is not always available. An example peakcaller that more directly takes mappability into account is Peak-Seq [63]. It calculates a so-called mappability map; for some given k (by default equal 30), for every position in the genome, Peak-Seq calculates the number of occurences of a k-mer starting in this position in the whole genome sequence. I will refer to this number as nk,x, where xis the position it was calculated for and k is the length of the k-mer. nk,xequal 1 means that a k-mer starting at x is unique, i.e. there are no sequences of length k identical to the one starting at x anywhere else in the genome. If one wants to map a read that has a sequence of length k identical to the k-mer starting at x, one can be sure that the read comes from this position, as there are no other positions in the genome with the same exact sequence; because of that, mapping to such positions is unambigous. On the other hand, high value of nk,x means there are more positions in the genome where exactly the same sequence of length k occurs. Mapping a read with such sequence is ambigous;

we can never be sure which of the nk,x positions the read comes from. For every read mapped to some position x, the higher the nk,xthe less certain we can be about the read being mapped correctly. For the sake of simplicity of the given examples I assumed that all the reads have the same length and that during mapping procedure we are looking for an exact match (i.e. we do not take into account sequencing errors or dierences between the sequence of the analysed genome and the one used as a reference); however in the cases when these assumptions are not true the basic idea behind the mappability map stays the same: the higher the nk,x, the more ambiguous the mapping of reads is to the given position x. Peak-Seq identies regions where signal is substantially higher that one would expect from a random distribution of reads; the mappability map is used in this step to more accurately model this random background.

ZINBA (Zero-Inated Negative Binomial Algorithm) [61] allows to use various additional variables; not only mappability, but also GC-content or other characteristics dened by user. ZINBA divides the genome into windows and assigns every window to one of three components: zero-signal, unenriched and enriched regions. The assignment is performed iteratively using Expectation Maximisation algorithm.

Many peakcallers divide the genome into windows for the analysis, and hence they need some resolution, either calculated by the peakcaller itself or given by the user. To perform that task, JAMM (Joint Analysis of NGS replicates via Mixture Model clustering) [27]

uses method developed by Shimazaki et al. [66] for choosing optimal bin size of a time histogram.

Some peakcallers do not work on raw coverage calculated as number of mapped reads (potentially elongated or shifted, like in MACS). F-seq [8] uses Gaussian kernel density estimator to smoothen the signal. The bandwidth parameter that controls how smooth the resulting signal will be is dependent on the expected length of the peaks we want to discover, which is provided by the user.

W dokumencie HMM-based Method for Identifying Enrichment in Signal from Sequencing-based Experiments (Stron 36-39)