5.2.1. Single sample
In table 5.1 I present results obtained by dierent peakcallers; for each peakcaller number of peaks, their mean and total length is provided. Results for 10 samples are shown, and additionally at the end results averaged over all the 10 samples.
peakcaller number of peaks mean length total length Thymus
MACS 376 947 1 348 508 050 398
SICER 37 866 4 752 168 444 952
HERON NB 22 635 51 041 1 155 315 00
HERON Gaussian 11 911 11 386 135 624 571
HERON + control 8 643 13 599 117 532 571
Fetal brain
MACS 140 814 2 838 399 550 284
SICER 43 361 6 312 269 756 665
HERON NB 22 609 33 036 746 910 571
HERON Gaussian 22 918 10 934 250 586 571
HERON + control 24 988 18 669 466 492 571
Spleen
MACS 890 810 1 041 926 899 430
SICER 13 542 5 354 66 149 446
HERON NB 21 718 54 449 1 182 529 000
HERON Gaussian 9 307 7 790 72 499 571
HERON + control 78 566 5 074 398 615 483
Brain germinal matrix
MACS 362 179 1 078 390 491 953
SICER 31 763 5 960 186 721 069
HERON NB 22 260 29 197 649 931 000
HERON Gaussian 34 207 8 744 299 095 571
HERON + control 23 601 11 688 275 859 571
Adult liver 1
MACS 656 334 1 415 928 431 221
SICER 58 885 2 975 171 932 798
HERON NB 27 149 55 861 1 516 562 000
HERON Gaussian 38 167 36 662 1 399 297 571
HERON + control 9 153 125 103 1 145 070 072
Adult liver 2
MACS 340 775 835 284 557 333
SICER 33 494 3 860 123 817 922
HERON NB 20 982 52 376 1 098 946 000
HERON Gaussian 24 942 20 188 503 523 571
HERON + control 11 895 46 669 555 133 000
Adult liver 3
MACS 446 222 2 113 942 992 165
SICER 89 189 3 403 297 803 084
HERON NB 24 759 63 564 1 573 787 000
HERON Gaussian 28 500 56 655 1 614 676 571
HERON + control 53 307 30 067 1 602 788 000
Brain hippocampus middle 1
MACS 703 943 1 125 791 689 408
SICER 35 820 4 482 156 660 848
HERON NB 22 423 46 809 1 049 601 000
HERON Gaussian 28 304 21 331 603 763 571
HERON + control 13 135 65 063 854 600 00
Brain hippocampus middle 2
MACS 351 257 996 349 856 931
SICER 37 494 5 426 196 545 974
HERON NB 24 100 48 204 1 161 727 000
HERON Gaussian 33 312 20 825 693 711 571
HERON + control 13 526 21 880 295 942 571
Brain hippocampus middle 3
MACS 341 561 2 213 755 773 491
SICER 42 762 4 219 176 637 128
HERON NB 31 710 44 229 1 402 496 000
HERON Gaussian 29 914 12 166 363 932 571
HERON + control 15 051 102 752 1 546 521 693
averaged
MACS 461 084 1 500 627 830 261
SICER 42 418 4 674 181 446 989
HERON NB 24 035 47 877 1 153 780 457
HERON Gaussian 26 148 20 668 593 671 171
HERON + control 25 187 44 056 725 855 553
Table 5.1: Results of peakcalling performed on H3K27me3 ChIP-seq data from Roadmap Epigenomics project from ten dierent samples. In the nal section, the results are aver-aged over the samples.
As one can see, HERON returns 8 - 78 thousands of peaks (in 8 out of 10 cases less than
40 thousands); they have mean width between 11 kb and 125 kb. In 7 out of 10 cases, HERON with negative binomial distribution returns longer peaks than other variants; in the remaining cases, it is the Gaussian with control. MACS, however, returns between 140 to 890 thousands of peaks, and their mean width ranges from 835 to 2838 bases.
Total length of peaks, i.e. the number of nucleotides covered by peaks, seems to be similar between the two peakcallers. SICER calls between 12 and 89 thousands of peaks, which is sligthly more than HERON and much fewer than MACS; their mean length is between 2 kb and 6 kb, so sligthly higher than MACS's, but still signicantly lower than HERON's. As a result, total length of peaks is usually the lowest for SICER.
To better understand these results, I decided to compare the characteristics of the ob-tained peaks to the characteristics we expect H3K27me3 peaks to possess. This
modi-cation is known to form long domains, up to thousands of kb, and so we can expect peaks to have similar length. In gure 5.2 I present the length of peaks obtained by the peakcallers on all the tissues. As one can see, for almost all the tissues HERON returns peaks larger than SICER and MACS, both in terms of mean and median; in most cases the rst quartile of lengths for HERON is above the third quartile for SICER and MACS.
The only exception is spleen; on this tissue mean and median length of peaks returned by HERON with control le is not much larger than for MACS and smaller than for SICER. It is worth noting, though, that the variance of peak length is very high here and the widest outliers have as much as 30 Mb length. Therefore, I conclude that at least as far as length is concerned HERON returns more reliable results than both MACS and SICER.
Another important characteristic of H3K27me3 modication is that it is connected to repressive functions, and so one can expect genes that lie within its domains to have lowered expression. I checked the expression of genes within peaks discovered by each of the tested methods and compared it to the expression of all the genes. Results are presented in gure 5.3. It seems that all the peakcallers give quite reasonable results -in all the cases, expression of genes with-in the discovered peaks is lower than overall expression of all the genes. The dierence is statistically signicant according to both one-sided t-test [71] and one-sided Mann-Whitney test [49] [3] at the level of signicance α < 0.051. Interestingly, expression of genes within MACS peaks is substantially higher than of those within other peaks (again, it is signicant at the level α < 0.052). It might mean that some regions identied as peaks by MACS do not in fact represent regions where in vivo H3K27me3 modication was present, as the genes within them do not seem to have repressed expression.
1The p-values were estimated using R. For t-test I obtained the following estimates: MACS: 1.913−4, SICER: 2.379−38, HERON with negative binomial distribution: 6.903−17, with Gaussian: 1.49−17, with control le: 6.959−36. As far as Mann-Whitney test is concerned, for MACS I obtained value equal to 1.08−211; for the rest of the peakcallers it was equal to 0, which means it was less than 2.225−308(it is the lowest non-zero oat on most computers, including the one I ran calculations on).
2The p-values for t-test, as estimated using R, were equal: SICER: 5.902−15, HERON with negative binomial distribution: 2.187−4, with Gaussian: 6.326−5, with control: 3.285−13. For Mann-Whitney test they were all equal to 0, which means they were less than 2.225−308.
Figure 5.2: Boxplots with length of peaks obtained by the tested peakcallers on various tissues. On the left, whole boxplots are shown, with outliers as large as 30 Mb. On the right zoomed in boxplots are shown - here the x-axis is trimmed to range (0, 50 kb). For readability, outliers are not shown here.
Figure 5.3: Median expression of genes in RPKM. "HERON G" denotes HERON with Gaussian distribution; "G + C" means "Gaussian distribution + control le", and "NB"
means negative binomial distribution. The rst column represents all the genes; the fol-lowing columns represent genes that overlap with peaks identied by specic peakcaller.
Here the results from all the tissues at once are shown. As H3K27me3 peaks are known to have repressive eect on expression, we expect the "all genes" bar to be the highest;
indeed, all the peakcallers seem to return regions with lowered expression.
Figure 5.4: Peakcalling with HERON on multiple adult liver tissue samples from Roadmap Epigenomics. In each barplot, the rst three bars represent results of a peak-calling done on a single le (denoted A, B or C). The last bar represents results of a peakcalling done on all the three les at once. The left plot shows mean length of the obtained peaks, and the right one mean expression of genes overlapping the peaks. The results shown here are obtained by running HERON with Gaussian distribution; using negative binomial distribution yielded similar results, at least as far as the overall con-clusions are concerned.
5.2.2. Multiple samples
For two of the analysed tissues - adult liver and brain hippocampus middle - three samples were available. I decided to check if number of used input les inuences HERON's results in the case of experimental, not simulated data. I ran HERON on all the samples at once and compared the results to the ones obtained using only one sample. Specically, I compared the mean peak length and mean expression of genes within peaks for the single sample and for all three. In gure 5.4 I show the results for the liver samples. It can be seen that peakcalling on a single le returns longer peaks, but the genes within them have higher expression. It can be concluded that the peakcalling on all the samples gives a more robust set of peaks; they might be shorter, but we can be more cartain that they indeed correlate with lowered expression. The regions idenited with all the les can be viewed as a consensus set of regions between those identied on the single les; they are probably more reliable, though they are shorter and overall cover less bases in the genome. Results for hippocampus samples (not shown here) yielded similar conclusions.
It is relevant to note here the variance within the results, even for the same tissue. For example, for the liver tissue the total number of bases identied as within peaks by MACS varies from 284 Mb to 928 Mb, and by HERON with control le - from 555 Mb to 1.6 Gb.
Simmilar dierences can be observed for the brain hippocampus middle samples; other metrics, like number of peaks within genes or median expression also tend to vary. This observation makes evident how important it is to use multiple les whenever possible;
using only one can result in obtaining biased and untrustworthy results.
Figure 5.5: Attempt at dierantiating between two tissues using HERON. Each dot represents a sample and is coloured according to its tissue. x axis represents states, and y axis represents mean of distribution of emission in each of the states. The plot shows the nal means obtained by HERON as a result of EM algorithm.
5.2.3. Multiple samples with subpopulations
The patterns of H3K27me3 modication are distinctive for various tissues [37] [18]. I wanted to check whether HERON would be able to catch the diernces in peak patterns between two tissues; I chose adult liver and hippocampus for this test, as they both are available in three samples, which might help HERON to converge to reliable results.
In gure 5.5 I present the nal means of emissions, as estimated by HERON after EM algorithm's convergence. The plot is similar to the ones presented in the previous section (see gures 4.8 and 4.9); x axis represents states, and y axis - mean of distribution of emission in each of the states; each dot represents a single sample, and they are slightly shued along the x axis, so that they would not entirely overlap; they are coloured according to their tissue. We want to obtain states representing peaks unique for the liver, unique for the hippocampus and common between the tissues; i.e. one state in which all the blue dots would be much higher than the red ones, one reversed, and one where all the dots are high. As one might have expected after not so promising results on the simulated data (section 4.3.3), HERON did not return the desired states; we have simply ve states with ascending means, and the trend is similar for both tissues. I tried the same procedure with various parameters (number of used les, distribution, window width, initial means, including control les), but I did not manage to obtain satisfying results.