
4.3.1. Single sample

I wanted to test how the resolution used during peakcalling, i.e. the width of the windows the genome is segmented into, influences the results. I also wondered how it depends on the width of the true peaks we want to discover. To assess this, I ran multiple peakcallings with various values of the resolution and on different data generated from peaks of various lengths. In figure 4.4 I present the results. To measure peakcalling quality I used the Jaccard index between the set of the true peaks and the discovered ones, defined as the fraction between the number of bases shared between the true and discovered peaks and the total number of bases in both datasets. Formally it can be written as:

Figure 4.3: Example screenshots showing simulated data for a case with subpopulations. In the (a) panel, coverages obtained for example "patients" from each group are shown: three from the Light Side (blue tracks) and three from the Dark Side (red tracks). At the bottom, the peaks used to simulate this data are shown: as one can see, some are common between the groups and some unique to one of them. While the peaks within a group have the same coordinates, they can differ as far as their intensity is concerned. In this example, peaks of width 5000 are shown; the average enrichment from background is equal to 3, while the average enrichment from peaks is equal to 10. In the (b) panel a comparison of different enrichments is shown: for one example patient (Anakin Skywalker), three tracks are shown, with enrichment from peaks equal to 3, 10 and 15, respectively. Enrichment from background is constant and equal to 3.

Figure 4.4: Jaccard index between the true peaks and the called ones, depending on the peak width and the resolution. The x axis represents the width of the true peaks, and the y axis represents the window width used for peakcalling. Colour represents the Jaccard index. Peakcalling is done using the Gaussian distribution. The average enrichment from peaks is equal to 5, and in the background to 1.

J = |true peaks ∩ predicted peaks| / |true peaks ∪ predicted peaks|

where |regions| denotes the total number of bases in the regions. The closer the index is to 1, the more similar the obtained results are to the true peaks we want to discover.
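As an illustration, the index can be computed over per-base sets; the sketch below is my own simplification using explicit base sets rather than interval arithmetic, and is not how HERON is implemented:

```python
def interval_bases(intervals):
    """Expand (start, end) half-open intervals into a set of base positions."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end))
    return bases

def jaccard(true_peaks, predicted_peaks):
    """Jaccard index: shared bases divided by total bases in either peak set."""
    t = interval_bases(true_peaks)
    p = interval_bases(predicted_peaks)
    if not t and not p:
        return 1.0  # two empty peak sets are trivially identical
    return len(t & p) / len(t | p)

# Example: one true peak of 1000 bases, one prediction shifted by 200 bases:
# 800 shared bases out of 1200 total gives an index of about 0.667.
print(jaccard([(0, 1000)], [(200, 1200)]))
```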

I discovered that for very short peaks (between 50 and 600 bases) HERON gives very poor results regardless of the resolution used (Jaccard index < 0.2 for peaks shorter than 200 bases, < 0.5 for peaks shorter than 600 bases). On the contrary, for very long peaks (between 10 and 40 kb) the results seem very good for a very wide range of window widths (Jaccard index > 0.95 in almost every case). Perhaps not surprisingly, the Jaccard index is low when the window width is larger than the peak width; in particular, it tends to be close to or equal to zero when the resolution is at least 10 times larger than the width of the sought peaks. In such cases even well enriched peaks are usually hard for the algorithm to notice, as their enrichment is distributed over a large region of noise and becomes undetectable. What is less intuitive is that too short windows also cause the Jaccard index to drop to zero. This is probably caused by the fact that in such a setting the algorithm needs to estimate many more transitions between windows, while, especially in the case of low enrichment or high noise, many of them introduce additional noise instead of meaningful information.

In summary, HERON gives very good results when used to discover long domains, up to thousands of bases. In such cases the choice of window width does not seem to be crucial.

Judging by various additional results on both simulated and experimental data, most of which are not shown here, we recommend using a resolution between 500 and 2000; the default value is set to 600.
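For illustration, segmenting the genome into windows of a given resolution amounts to simple integer binning; the sketch below is my own, with hypothetical function and argument names, and is not HERON's code:

```python
def bin_coverage(read_starts, genome_length, resolution):
    """Count reads (by start position) falling into consecutive
    windows of width `resolution` covering the genome."""
    n_windows = -(-genome_length // resolution)  # ceiling division
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // resolution] += 1
    return counts

# Three windows of width 500 over a 1500-base genome:
# reads at 10 (window 0), 550 and 560 (window 1), 1490 (window 2).
print(bin_coverage([10, 550, 560, 1490], genome_length=1500, resolution=500))
# [1, 2, 1]
```

The choice of `resolution` here plays the same role as the window width discussed above: coarser bins smear short peaks into the background, while very fine bins multiply the number of transitions the model has to estimate.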

I compared my approach to three other peakcallers: MACS, BayesPeak and SICER (described in sections 2.2.1, 2.2.2 and 2.2.3, respectively). I ran MACS in two settings: one with default parameters and one with the additional parameter "broad", which makes it search for longer domains. BayesPeak and SICER were run with default parameters.

HERON was run with resolution 1000 and two distributions: Gaussian and negative binomial. In figures 4.5 and 4.6 I present results obtained for four different types of data.

Three measures are used to assess them: Jaccard index, True Positive Rate (TPR) and False Discovery Rate (FDR). True Positive Rate is defined as the number of nucleotides correctly predicted as belonging to a peak divided by the total number of nucleotides belonging to true peaks. False Discovery Rate is defined as the number of nucleotides incorrectly predicted as belonging to a peak divided by the total number of nucleotides predicted as belonging to a peak, both correctly and incorrectly.
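These two measures can be illustrated with a short sketch over per-base sets; this is my own simplification and is not tied to any peakcaller's implementation:

```python
def tpr_fdr(true_bases, predicted_bases):
    """TPR: correctly predicted peak bases / all true peak bases.
    FDR: incorrectly predicted peak bases / all predicted peak bases."""
    tp = len(true_bases & predicted_bases)
    fp = len(predicted_bases - true_bases)
    tpr = tp / len(true_bases) if true_bases else 0.0
    fdr = fp / len(predicted_bases) if predicted_bases else 0.0
    return tpr, fdr

# True peak covers bases 0-999; the prediction covers bases 500-1499,
# so half of the true peak is recovered and half of the call is wrong.
tpr, fdr = tpr_fdr(set(range(1000)), set(range(500, 1500)))
print(tpr, fdr)  # 0.5 0.5
```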

In figure 4.5 I show how peakcalling quality depends on the width of the true peaks for each of the peakcallers used. Two settings are shown: with low and with high average enrichment on peaks. When the enrichment is low, HERON calls more true positives and has a better Jaccard index than MACS and BayesPeak regardless of the peak width (for example, the Jaccard index for MACS and BayesPeak is always lower than 0.2, while for HERON it is always higher than 0.2), and better than SICER for peaks longer than 5 kb (for SICER it is lower than 0.75, while for HERON it is higher). It tends to have a worse FDR than MACS and a slightly worse one than BayesPeak for short peaks (for the shortest peaks, HERON's FDR is > 50%, BayesPeak's ∼ 40%, and MACS's < 3%), but is comparable to them for long peaks (for peaks longer than 7500 bases all peakcallers have FDR lower than 10%). When the enrichment is high, HERON consistently has TPR close to 100%, but for shorter peaks this comes at the cost of a higher FDR (up to 50%). MACS gives better results than the other peakcallers for short peaks; while HERON seems to have a slightly better TPR, MACS has a noticeably better FDR (for the shortest peaks: < 5% for MACS and > 40% for HERON) and Jaccard index (> 0.75 for MACS and < 0.55 for HERON). BayesPeak also seems better than HERON in terms of FDR and Jaccard index for very short peaks (1 kb), though not as good as MACS. However, the longer the true peaks, the lower the FDR for HERON; as a result, for peaks longer than 2 kb it already has a better Jaccard index than MACS (> 0.75 for HERON and < 0.65 for MACS). While for MACS both TPR and Jaccard index consistently drop with rising peak width, for HERON and SICER the trend is reversed: the longer the peaks, the better the results. SICER seems comparable to HERON for high enrichment, but for peak length equal to 30 kb it calls only 35 peaks, 33 of which are false positives; as a result, its TPR and Jaccard index drop drastically and its FDR gets close to 100%. Overall, HERON yields better results than MACS and BayesPeak and comparable ones to SICER for long peaks (> 3 kb) and for low enrichment (average signal from peaks equal to 3), and much better results than all the peakcallers for very long peaks (30 kb) regardless of enrichment and for peaks longer than 7.5 kb for low enrichment.

Figure 4.5: Peakcalling results on simulated data. The x axis represents the width of the true peaks, and the y axis represents three measures: (a) Jaccard index between the true peaks and the called ones, (b) True Positive Rate and (c) False Discovery Rate. In the left column, the average enrichment simulated on peaks is low and equals 3; in the right column, it is high and equals 15. In all the plots the average noise is constant and equals 3.

In figure 4.6 I show how the results depend on the average enrichment on peaks. This time the width of the true peaks is constant; results for short (1 kb) and long (20 kb) peaks are shown. As one may expect, the higher the enrichment, the better the results for every peakcaller, at least in terms of Jaccard index and TPR. As far as FDR is concerned, the trend is not that obvious: it tends to drop for BayesPeak and SICER but rises slightly for MACS; for HERON, it depends on the data. Investigating the false positives called by MACS suggested a possible explanation: it probably stems from the fact that some reads that come from peaks can be mapped to other regions due to sequence similarity, and with rising enrichment more such reads get wrongly assigned to a place where there was no peak. In that sense, some slight rise in FDR with rising enrichment might be inevitable for every peakcaller. One could try to filter the peaks after the peakcalling based on the underlying sequence and remove peaks with overly similar sequence as potential artifacts; that could lower the FDR, at the cost of also lowering the TPR. Because the rise in FDR is not very steep, I decided not to perform such a procedure. This figure shows similar trends to the ones presented in the previous one; for long peaks, HERON gives consistently better results than the other peakcallers, even for very low enrichment (for example, the Jaccard index for HERON is higher than 0.8, while for MACS it is lower than 0.2). For long peaks SICER is comparable to HERON when the enrichment is higher (they have very similar Jaccard index, TPR and FDR), but worse for less enriched data (for example, for the lowest enrichment HERON with the Gaussian distribution has a Jaccard index equal to 0.86 and a TPR equal to 89%, while for SICER these metrics are equal to 0.6 and 61%, respectively). For short peaks HERON and SICER give similar results, at least in terms of the metrics shown here. For short peaks HERON tends to have a better TPR than MACS or BayesPeak (usually from 10 to 20 percentage points higher), but also a worse FDR (from 5 to 45 percentage points higher). For very low enrichment the Jaccard index is better for HERON; however, it is still very low (below 0.25 for average enrichment equal to 2). It seems that for short peaks and low enrichment none of the tested peakcallers gives results of a reasonable quality, suitable for any downstream analysis. For short peaks and better enrichment, MACS gives the best results: close to HERON in terms of TPR (i.e. close to 100%) and much better in terms of FDR (close to zero) and Jaccard index (for average enrichment higher than 10, the Jaccard index is > 0.8 for MACS and < 0.55 for HERON). Overall, HERON's performance is best for the long peaks regardless of enrichment; for shorter peaks, it outperforms BayesPeak and MACS when the enrichment is low (≤ 5).

To summarise, HERON tends to have better sensitivity than the other peakcallers, at the cost of specificity. MACS tends to call very few artifacts, often at the cost of not calling many true positives either (in some cases MACS did not call any peaks at all).

SICER seems comparable to HERON in some cases, but not for very long peaks or poor enrichment. BayesPeak tends to give results comparable to MACS for shorter peaks and slightly better for longer ones (> 5 kb), but not as good as HERON and SICER. While for short peaks and good enrichment MACS is a reasonable choice, for long peaks HERON definitely gives much better results. It also gives better results for short peaks and very low enrichment, though in the case of data of such low quality the more reasonable choice would be to repeat the experiment and obtain better enrichment, rather than depend on the results of the peakcalling.

Figure 4.6: Peakcalling results on simulated data. The x axis represents the average enrichment simulated on peaks, and the y axis represents three measures: (a) Jaccard index between the true peaks and the called ones, (b) True Positive Rate and (c) False Discovery Rate. In the left column, the true peaks are short (1 kb); in the right column, they are long (20 kb). In all the plots the average noise is constant and equals 3.

Peakcalling on the simulated data did not show much difference between using the Gaussian and the negative binomial distribution. In some cases the Gaussian distribution yielded slightly better results, but overall it seems both distributions fit the simulated coverage equally well.

4.3.2. Multiple samples

In order to check how the number of input files influences the results, I ran HERON for four different peak widths, providing it with different numbers of files, from 1 to 5. In figure 4.7a I present how the Jaccard index between true and predicted peaks depends on the number of files used. Perhaps not surprisingly, the more files we use, the better the results. The biggest gain comes from adding a second file to a single one; in the case of longer peaks (≥ 5 kb) there is not much difference between using two files and more. For the shortest peaks (1000 b) there is still a slight rise in the Jaccard index up to 4 files used, but adding a 5th one does not make much difference. This trend is similar for both distributions and for various signal-to-noise ratios (not shown here). It is worth noting that for short peaks the gain from using additional files is substantial: from a Jaccard index around 0.25 we can go up to 0.5. The longer the peaks, the better the results for a single file, and hence there is less to gain from adding input files when the peaks are long. For example, for peaks of width 5 kb we go from a Jaccard index around 0.75 to 0.85, and even using 8 files does not give better results (not shown here). Overall, it seems adding files is beneficial, especially for cases that are harder for HERON, like short peaks and a weak signal-to-noise ratio. However, the improvement is usually negligible above 4 files.

As described previously in section 3.4.1, when multiple samples treated as replicates are provided, two approaches are available: one can either sum up the coverages from all the samples and analyse the single resulting sample, or treat the samples separately, as multidimensional emissions. I compared results obtained on simulated data for various numbers of samples for the two approaches; the comparison is shown in figure 4.7b. As one can see, the difference seems negligible; furthermore, in some cases simply summing up the coverages yields better results. As was already mentioned, this might stem from the fact that keeping the samples separated introduces too much noise, while adding little information compared to simply summing the signal.
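The two ways of combining replicates can be sketched with NumPy; the array names and shapes below are illustrative assumptions of mine, not HERON's internals:

```python
import numpy as np

# Simulated per-window read counts for 3 replicate samples over 8 windows.
rng = np.random.default_rng(0)
coverages = rng.poisson(lam=5, size=(3, 8))

# Approach 1: sum the replicates and run a one-dimensional HMM
# on the total signal (one scalar emission per window).
summed = coverages.sum(axis=0)       # shape (8,)

# Approach 2: keep the replicates separate as multidimensional emissions;
# each window then emits a vector of 3 counts.
multidim = coverages.T               # shape (8, 3)

print(summed.shape, multidim.shape)  # (8,) (8, 3)
```

The summed representation discards which replicate contributed each read, while the multidimensional one retains it at the price of more emission parameters to fit; as noted above, on the simulated data this extra information rarely paid off.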

4.3.3. Multiple samples with subpopulations

As described in 4.2, I generated datasets belonging to two subpopulations. The datasets are named after characters from Star Wars, and the populations they are divided into are called the Dark Side and the Light Side. There are some peaks common to both populations and some unique to one of them. Within a population, every generated sample has the same set of peaks as far as their coordinates are concerned, though they might differ in their height. I generated the datasets with different peak widths, enrichments and levels of noise. In figure 4.3 an example screenshot from a genome browser is shown.

As explained in section 3.4.2, when HERON is provided with samples divided into subpopulations, it initialises the parameters of the emissions for the states of the HMM so that the states represent the following features: (1) no signal in all samples, (2) background in all samples, (3) peaks unique to the first group, (4) peaks unique to the second group and (5) peaks common between the groups. Then the fitting process begins, and it is not controlled in any way, so it is possible that after reaching convergence the states can no longer be interpreted as in the initial step. Below I present results for two example settings: the first is an "easy" case for which the results look promising, and the second is more challenging.

Figure 4.7: Peakcalling with HERON on multiple samples. The x axis represents the number of samples used for peakcalling and the y axis represents the Jaccard index. TPR and FDR are not shown here, because the conclusions drawn from these metrics agree with each other. Results for four different true peak widths are shown: 1 kb, 3 kb, 5 kb and 20 kb. (a) Comparison between the Gaussian and the negative binomial distribution. (b) Comparison between the two approaches to dealing with multiple files: merging them and treating them separately. In (a) the latter approach is used, but the results obtained with the first approach were comparable (for simplicity I am not showing them here). For similar reasons, in (b) only results obtained with the Gaussian distribution are shown. Average enrichment in the data used here was equal to 3, and average signal from noise
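The five-state initialisation described above can be sketched as follows, assuming Gaussian emissions with per-sample means; the function and the values are illustrative, not HERON's actual code:

```python
import numpy as np

def init_state_means(n_group1, n_group2, background_mean, peak_mean):
    """Initial per-sample emission means for the five HMM states:
    (1) no signal anywhere, (2) background everywhere,
    (3) peaks only in group 1, (4) peaks only in group 2,
    (5) peaks in all samples."""
    n = n_group1 + n_group2
    in_group1 = np.arange(n) < n_group1          # mask of group-1 samples
    means = np.zeros((5, n))                     # state 1: all zeros
    means[1, :] = background_mean                # state 2: background
    means[2, :] = np.where(in_group1, peak_mean, background_mean)  # state 3
    means[3, :] = np.where(in_group1, background_mean, peak_mean)  # state 4
    means[4, :] = peak_mean                      # state 5: peak everywhere
    return means

# Seven Dark Side and seven Light Side samples, background 1, peaks 15.
print(init_state_means(7, 7, 1.0, 15.0).shape)  # (5, 14)
```

The EM training then updates these means freely, which is exactly why the interpretations attached at initialisation may or may not survive to convergence.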

In figure 4.8 I present results obtained for example datasets. HERON was provided with seven samples from the group Dark Side and seven from the group Light Side. Each sample was simulated using peaks of width 20 kb; the average enrichment on peaks was set to 15, and the average noise to 1. Such a setting should be easy for HERON to analyse, as the signal-to-noise ratio is very high and the peaks' length matches the length for which HERON achieves results close to perfect. In figure 4.8A an example screenshot from a genome browser is shown (for clarity only four signals are shown, two from each population). As one can see, peaks are clearly visible in the signal and should be easily distinguished from noise;

furthermore, peaks unique to each of the groups and those common between them are easily detectable, at least to the human eye. Below the screenshot, three plots are shown:

each presents the mean signal emitted by every state in every sample. Each dot represents one sample, and the dots are coloured according to their subpopulation. For readability, the dots are slightly jittered along the x axis, so that they do not overlap with each other too much. The first dotplot represents the empirical mean level of signal if we divide the genome into five groups: (1) no signal at all in any sample, (2) no peak in any sample, (3) peaks unique to the Dark Side, (4) peaks unique to the Light Side and (5) peaks common to both Sides. We hope that the final genome segmentation into states will resemble the actual segmentation into those groups, and so we hope that the final emission means of the states will resemble the empirical means presented in this plot. Obviously, there will always be some differences between these segmentations, mainly due to the resolution of the results and the ambiguous division between the states "no-signal" (which does not necessarily have zero signal in all the samples) and "noise" (which could have signal equal to zero in some windows, sometimes even for all the samples). The second dotplot represents the initialised emission means, i.e. the means of the distributions that were estimated from the data before the training process. Finally, the third dotplot represents the final emission means, obtained after the algorithm's convergence. As one can clearly see, the results here are very satisfying. The initial means were already close to the empirical means, i.e. those that we hope the final results will be close to. After the training step, the means got even closer. We can easily identify in the final results which states represent peaks unique to the Dark Side, unique to the Light Side and common between them. In the screenshot in panel A of the figure, below the signals, two sets of intervals are shown: first, the intervals used as "true peaks" in the simulation process, i.e. the peaks that we want to discover; second, the obtained segments belonging to each state. We can see that state 3 mostly overlaps the regions of peaks unique to the Dark Side, state 4 overlaps peaks unique to the Light Side, and finally state 5 overlaps peaks common to both groups. The Jaccard indexes between these respective regions are between 0.88 and 0.89.
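Computing the empirical per-sample means shown in the first dotplot amounts to averaging each sample's coverage over the windows grouped by region class; the following is a minimal sketch under an assumed data layout, not HERON's code:

```python
import numpy as np

def empirical_state_means(coverage, region_class, n_states=5):
    """coverage: (n_samples, n_windows) array of per-window signal.
    region_class: (n_windows,) array assigning each window to one of the
    five region classes (0: no signal, 1: background, 2: Dark-only peaks,
    3: Light-only peaks, 4: common peaks).
    Returns an (n_states, n_samples) matrix of mean signal per class."""
    return np.array([coverage[:, region_class == s].mean(axis=1)
                     for s in range(n_states)])

# Four samples over 100 windows with randomly assigned region classes.
rng = np.random.default_rng(1)
cov = rng.poisson(3, size=(4, 100)).astype(float)
classes = rng.integers(0, 5, size=100)
print(empirical_state_means(cov, classes).shape)  # (5, 4)
```

Each row of the result corresponds to one dot column in the plot, with one value (dot) per sample.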

Figure 4.8: Attempt at differentiating between populations using HERON. Here the "easy" case is shown; peaks are 20 kb long, the average signal from background is equal to 1, and from peaks to 15. (A) Screenshot from a genome browser presenting the data. The four top tracks are the signals used in the analysis; for simplicity, only two samples out of seven are shown for each group. Below the signals, the true peaks used to generate the simulated data are shown, separately for each population. Below them, there are the final states obtained by HERON. In panels (B)-(D), dotplots representing means are shown; each dot represents one sample and is coloured according to its population. (B) Empirical means of the signal in the following regions, mimicking the states of the HMM: (1) no signal in all the samples, (2) background in all the samples, (3) background in the Dark Side samples and peak in the Light Side ones, (4) the other way around, and finally (5) peak in all the samples. (C) Means of emissions in each of the states, as initialised by HERON before the EM step. (D) Means of emissions after the algorithm's convergence.

In figure 4.9 another case is shown. Here the peaks were still 20 kb long and the average simulated noise was equal to 1, but the average signal from peaks was equal to only 5 instead of 15.

The dotplots are similar to the ones in the previous figure: the first represents empirical means over regions with no signal, noise, peaks unique to either of the Sides and common peaks; the second represents the initialised means, and the last the means obtained with the Baum-Welch algorithm. As one can see, the first two dotplots are quite similar to the previous ones. The third one, however, shows that the algorithm failed to retain the initial interpretation of the states. The first state still represents regions with no or very little coverage, the second represents background in both groups, and the third peaks unique to the Light Side population; the fourth state, however, seems to represent both peaks unique to the Dark Side and those common between the populations, as can also be observed in the screenshot from the genome browser in panel 4.9A. The last state again represents background, just like state two. In the screenshot it can be seen that the background regions switch quite often between state 2 and state 5. Overall, HERON failed to identify peaks unique to the Dark Side, and the interpretation of the final results is not as obvious as in the first case.

Note that the characteristics of the "difficult" case, that is the one with the worse signal-to-noise ratio for which I did not obtain satisfying results, are not actually particularly uncommon as far as experimental data is concerned. On the contrary, ChIP-seqs against H3K27me3 often have a very poor signal-to-noise ratio; average noise equal to 1 and average enrichment equal to 5 (as was simulated in the second case) might actually be considered an average signal for an H3K27me3 ChIP-seq experiment. Enrichment equal to 15, as was simulated in the first case, is unusually high and rather rare in experimental data.

Additionally, note that in both cases I provided HERON with seven files for each group; it is more common to have fewer experiments performed on the same tissue, so as far as the number of samples is concerned, both cases represented exceptionally good data. Overall, the satisfying results were obtained for data of extremely good quality, which is very rare in the datasets currently available. The data that yielded unsatisfactory results had characteristics much more common among the currently published datasets.

I tested this feature in many more scenarios: with shorter peaks, fewer samples, and different noise and enrichment levels, as well as with different HERON parameters (different distributions, window widths, initial filtering thresholds and initial means).

Surprisingly, for some rare cases HERON yielded better results when provided with fewer samples; for example, for peak length equal to 20 kb, noise level equal to 1 and enrichment level equal to 5, HERON gives poor results for seven or six samples, but quite good results for five samples. For four samples the results are again far from satisfactory. Overall, the results were similar to the ones presented above: in most cases when the peaks are long and the enrichment is very high, HERON did distinguish between peaks unique to each population and common ones. In the other cases, however, the distinction failed.

In summary, currently the feature of recognising peaks unique to some subpopulations does not seem to be particularly useful. It works well for data of exceptionally high quality, but in the case of such data any other approach, like running peakcalling for every sample and then finding intersections and complements of the resulting peak sets, should give similar results. The obvious advantage of using HERON over other approaches would be the convenience of being able to complete the entire process in one step. However, as mentioned before, due to the constraint of requiring exceptionally high quality data for the algorithm to converge to reasonable results, I do not find this to be a practical application of my tool, at least in its current implementation. Perhaps changing the EM algorithm to aid the process of convergence, so that the states retain their initial interpretations, might improve the results.
