Multiple samples - HMM-based Method for Identifying Enrichment in Signal from Sequencing-based

3.4.1. Replicates

Sometimes for a single experiment multiple replicates are available; they can be either biological or technical replicates, i.e. either an experiment is performed on dierent sam-ples, that are expected to have the same characteristics (e.g. the same tissue, but taken from dierent patients) or it is performed multiple times on the same sample to assess the technical reproducibility of the biochemical procedure. In such cases we are often interested in discovering peaks from multiple replicates at once. Let us say we want to identify regions in DNA where a specic transcription factor binds; we conduct multiple ChIP-seq experiments to obtain more robust results. We are interested in a consensus set of peaks between the results, not necessarily a seperate set for every sequenced sam-ple. One approach would be to run multiple peakcallings, one for every sample, and then merge the results in some way; for example, take only regions identied as peaks

in every sample, or at least in some number of samples. Another approach would be to run peakcalling once on all the samples at once, and obtain one set of peaks for all the samples. Some peakcallers, including HERON and MACS, allow providing multiple

les as an input. The latter approach has the advantage of utilising more information at once; this might be especially benecial in the case of signals with low signal-to-noise ratio, where peaks are often dicult to identify for many peakcallers. Peaks with low enrichment can be hard to detect in a single sample and can be easily interpreted as noise; however, if multiple samples have a weak peak at the same position, the chance that the peak emerged randomly at the exact same position is very small, and we might be more certain that it is a real peak, not an artifact. As noise is assumed to be mostly random, we expect only the actual regions of enrichment to be highly repetitive between replicates¹. Another advantage of this approach is purely practical: when one identies seperate set of peaks for each replicate it is not obvious how to determine a consesus set of peaks; running peakcaller once removes this obstacle.

When multiple samples treated as replicates are provided, two approaches are available in HERON: one can either sum up the coverages from all the samples and analyse the single sample, or treat the samples separately, as multidimensional emissions. In theory, the latter approach could have the advantage over the rst one, as it retains the knowledge of how the signal is distributed between the samples, and it potentially can utilise more information. A weakly enriched peak present in all the samples in the rst approach is indistinguishable from a very high peak present in only one of them; as the rst one is common between the samples one could assume it is an actual peak, while the second one could be just a strong artifact. In such situation the rst approach would be favourable, as it would be able to distinguish between the two types of peaks. In practice, however, such situations rarely occur. As it will be shown in the following sections, the dierence between the two approaches seems to be negligible; furthermore, in some cases simply suming up the coverages yields better results. The possible explanation here would be that the additional information we retain by treating the samples separately in fact is not important for the segmentation of the genome into states and hence only introduces noise and makes it harder for the algorithm to converge.

When the samples are treated separately, the initial emission means are calculated seper-ately for every sample; i.e. the initial mean for state "background" will be the following vector: [qbackground(A), q_background(B), ...], where qstate(X) is a quantile appropriate for state state from sample X, and A, B, ... are the provided samples.

When multiple samples are provided, it is not always obvious which state should be considered a "peak" state after the EM procedure. In the simplest case, it is the state for which every sample has the maximum emission's mean. It may happen, however, that such state doesn't exist; in these cases, the state for which most samples has the

1As was already mentioned, noise is not entirely random; various biases might be present in every sample at the same position. Including control les in our analyses can improve the results, as it allows us discovering these regions of noise, repeatable between replicates.

maximum mean is considered "peak". For example, let us consider three samples: A, B and C, and the following set of means estimated for each of them:

A B C

State1 : 3 3 3 State2 : 2 4 1 State3 : 1 2 2

There is no state for which all the samples have the maximum mean emission: samples A and C have maximum value in state 1, while sample B has maximum value in state 2.

We choose the state with the most maxima - in this case state 1. If there is more than one state that satises this condition - that is, there are at least two states with the same number of samples achieving maximum - HERON chooses the one from these states for which the mean averaged over the samples is the highest. For example:

A B C

State1 : 2 0 8 State2 : 7 2 2 State3 : 1 9 0

In this case sample A has maximum mean in state 2, sample B in state 3, and sample C in state 1. Means averaged over the samples are equal to ¹⁰₃,¹¹₃ and ¹⁰₃, respectively; based on the condition stated above, we choose state 2. In the rare cases where this condition is also met by more than one state, we choose the one from the potential states which has the highest mean in any of the samples. For example:

A B C

State1 : 3 0 8 State2 : 7 2 2 State3 : 0 9 0

Based on the previously dened criteria, we consider states 1 and 2 as potential peak states, as they both have maximum mean in one sample (as does state 3) and they have the highest mean averaged over samples. We choose state 1, as 8 is the highest from the considered maxima. Note that we don't choose state 3, even though 9 > 8, because state 3 does not satisfy the previous criterium (i.e. the average mean is lower than for states 1 and 2).

While it is highly unlikely that this procedure will not be conclusive, it is theoretically possible; in such cases the "peak" state is chosen by random from all the candidate "peak"

states, i.e. the states that fulll all the previous conditions. In every case when the choice is not obvious, i.e. when there is no one state with maximum among all the samples, HERON issues a warning message, as the results in such a case should be treated more carefully.

3.4.2. Subpopulations

HERON provides an additional feature that allows to analyse multiple samples coming from dierent subpopulations; for example from dierent tissues, grades of tumour or conditions. If the provided samples are divided into groups, HERON will attempt to identify set of peaks common for all the groups and sets of peaks unique for every group.

I will present the idea behind it on an example where we have two groups, but it can be easily generalised to any number of populations.

When samples are divided into two groups, HERON uses HMM with 5 states. Their means are initialised so that they can be interpreted as:

1. no-signal in all samples 2. background in all samples

3. background in group A, peak in group B; i.e. peaks unique for group B 4. peak in group A, background in group B; i.e. peaks unique for group A 5. peak in all samples; i.e. peaks common between the groups

The exact values of means are calculated in a similar way to the simple case with one pop-ulation. For every sample quantiles representing "no signal", "background" and "peak"

are calculated. Next the vector of means for each state is made up from the corresponding values; for example, for state 4 the vector would be:

[q_peak(A1), q_peak(A2), ..., q_backround(B1), q_background(B2), ...],

where qstate(X) is a quantile appropriate for state state from sample X; A1, A2, ... are samples from group A and B1, B₂, ...are samples from group B.

For more groups, we have always one state for "no-signal" in all samples and then all possible combinations of states "background" and "peak"; for example for three groups we have "no signal in all", "background in all", "background in A, peak in B, C", "back-ground in A, B, peak in C" etc.

Additionally for Gaussian emissions covariance matrix can be initialised as "grouped";

it is a full covariance matrix, but in the rst step variances between groups are equal to zero. It is only for initialisation; during the training it changes to regular full matrix.

The tting process is not controlled in any way, i.e. we do not ensure that the states will retain their original interpretation. The states' means might converge to something dicult to interpret.

As is shown in Section 4.3.3, this feature gives satisfactory results only when the peaks are exceptionally well enriched; I was not able to achieve acceptable results on exper-imental data (see sections 5.2.3 and 6.2.3). Possible improvement would be to change the Expectation Maximisation algorithm in the training process from unsupervised to

supervised. By adding some restrictions on the emissions' means - for example, that in every step means that are supposed to represent background have to be below the ones that represent peak - we would ensure that the states would retain their original interpre-tation. On the other hand, such modications to the EM algorithm would mean that the estimates might converge to something far from any optimum. There could be various approaches to this issue, however I will not present them in this dissertation. Overall, while conceptually interesting and potentially useful, without further modication this feature is not particularly useful and as of today I do not recommend using it.

W dokumencie HMM-based Method for Identifying Enrichment in Signal from Sequencing-based Experiments (Stron 62-66)