Results - HMM-based Method for Identifying Enrichment in Signal from Sequencing-based Experimen

5.2.1. Single sample

In table 5.1 I present results obtained by dierent peakcallers; for each peakcaller number of peaks, their mean and total length is provided. Results for 10 samples are shown, and additionally at the end results averaged over all the 10 samples.

peakcaller number of peaks mean length total length Thymus

MACS 376 947 1 348 508 050 398

SICER 37 866 4 752 168 444 952

HERON NB 22 635 51 041 1 155 315 00

HERON Gaussian 11 911 11 386 135 624 571

HERON + control 8 643 13 599 117 532 571

Fetal brain

MACS 140 814 2 838 399 550 284

SICER 43 361 6 312 269 756 665

HERON NB 22 609 33 036 746 910 571

HERON Gaussian 22 918 10 934 250 586 571

HERON + control 24 988 18 669 466 492 571

Spleen

MACS 890 810 1 041 926 899 430

SICER 13 542 5 354 66 149 446

HERON NB 21 718 54 449 1 182 529 000

HERON Gaussian 9 307 7 790 72 499 571

HERON + control 78 566 5 074 398 615 483

Brain germinal matrix

MACS 362 179 1 078 390 491 953

SICER 31 763 5 960 186 721 069

HERON NB 22 260 29 197 649 931 000

HERON Gaussian 34 207 8 744 299 095 571

HERON + control 23 601 11 688 275 859 571

Adult liver 1

MACS 656 334 1 415 928 431 221

SICER 58 885 2 975 171 932 798

HERON NB 27 149 55 861 1 516 562 000

HERON Gaussian 38 167 36 662 1 399 297 571

HERON + control 9 153 125 103 1 145 070 072

Adult liver 2

MACS 340 775 835 284 557 333

SICER 33 494 3 860 123 817 922

HERON NB 20 982 52 376 1 098 946 000

HERON Gaussian 24 942 20 188 503 523 571

HERON + control 11 895 46 669 555 133 000

Adult liver 3

MACS 446 222 2 113 942 992 165

SICER 89 189 3 403 297 803 084

HERON NB 24 759 63 564 1 573 787 000

HERON Gaussian 28 500 56 655 1 614 676 571

HERON + control 53 307 30 067 1 602 788 000

Brain hippocampus middle 1

MACS 703 943 1 125 791 689 408

SICER 35 820 4 482 156 660 848

HERON NB 22 423 46 809 1 049 601 000

HERON Gaussian 28 304 21 331 603 763 571

HERON + control 13 135 65 063 854 600 00

Brain hippocampus middle 2

MACS 351 257 996 349 856 931

SICER 37 494 5 426 196 545 974

HERON NB 24 100 48 204 1 161 727 000

HERON Gaussian 33 312 20 825 693 711 571

HERON + control 13 526 21 880 295 942 571

Brain hippocampus middle 3

MACS 341 561 2 213 755 773 491

SICER 42 762 4 219 176 637 128

HERON NB 31 710 44 229 1 402 496 000

HERON Gaussian 29 914 12 166 363 932 571

HERON + control 15 051 102 752 1 546 521 693

averaged

MACS 461 084 1 500 627 830 261

SICER 42 418 4 674 181 446 989

HERON NB 24 035 47 877 1 153 780 457

HERON Gaussian 26 148 20 668 593 671 171

HERON + control 25 187 44 056 725 855 553

Table 5.1: Results of peakcalling performed on H3K27me3 ChIP-seq data from Roadmap Epigenomics project from ten dierent samples. In the nal section, the results are aver-aged over the samples.

As one can see, HERON returns 8 - 78 thousands of peaks (in 8 out of 10 cases less than

40 thousands); they have mean width between 11 kb and 125 kb. In 7 out of 10 cases, HERON with negative binomial distribution returns longer peaks than other variants; in the remaining cases, it is the Gaussian with control. MACS, however, returns between 140 to 890 thousands of peaks, and their mean width ranges from 835 to 2838 bases.

Total length of peaks, i.e. the number of nucleotides covered by peaks, seems to be similar between the two peakcallers. SICER calls between 12 and 89 thousands of peaks, which is sligthly more than HERON and much fewer than MACS; their mean length is between 2 kb and 6 kb, so sligthly higher than MACS's, but still signicantly lower than HERON's. As a result, total length of peaks is usually the lowest for SICER.

To better understand these results, I decided to compare the characteristics of the ob-tained peaks to the characteristics we expect H3K27me3 peaks to possess. This

modi-cation is known to form long domains, up to thousands of kb, and so we can expect peaks to have similar length. In gure 5.2 I present the length of peaks obtained by the peakcallers on all the tissues. As one can see, for almost all the tissues HERON returns peaks larger than SICER and MACS, both in terms of mean and median; in most cases the rst quartile of lengths for HERON is above the third quartile for SICER and MACS.

The only exception is spleen; on this tissue mean and median length of peaks returned by HERON with control le is not much larger than for MACS and smaller than for SICER. It is worth noting, though, that the variance of peak length is very high here and the widest outliers have as much as 30 Mb length. Therefore, I conclude that at least as far as length is concerned HERON returns more reliable results than both MACS and SICER.

Another important characteristic of H3K27me3 modication is that it is connected to repressive functions, and so one can expect genes that lie within its domains to have lowered expression. I checked the expression of genes within peaks discovered by each of the tested methods and compared it to the expression of all the genes. Results are presented in gure 5.3. It seems that all the peakcallers give quite reasonable results -in all the cases, expression of genes with-in the discovered peaks is lower than overall expression of all the genes. The dierence is statistically signicant according to both one-sided t-test [71] and one-sided Mann-Whitney test [49] [3] at the level of signicance α < 0.05¹. Interestingly, expression of genes within MACS peaks is substantially higher than of those within other peaks (again, it is signicant at the level α < 0.05²). It might mean that some regions identied as peaks by MACS do not in fact represent regions where in vivo H3K27me3 modication was present, as the genes within them do not seem to have repressed expression.

1The p-values were estimated using R. For t-test I obtained the following estimates: MACS: 1.913⁻⁴, SICER: 2.379⁻³⁸, HERON with negative binomial distribution: 6.903⁻¹⁷, with Gaussian: 1.49⁻¹⁷, with control le: 6.959⁻³⁶. As far as Mann-Whitney test is concerned, for MACS I obtained value equal to 1.08⁻²¹¹; for the rest of the peakcallers it was equal to 0, which means it was less than 2.225⁻³⁰⁸(it is the lowest non-zero oat on most computers, including the one I ran calculations on).

2The p-values for t-test, as estimated using R, were equal: SICER: 5.902⁻¹⁵, HERON with negative binomial distribution: 2.187⁻⁴, with Gaussian: 6.326⁻⁵, with control: 3.285⁻¹³. For Mann-Whitney test they were all equal to 0, which means they were less than 2.225⁻³⁰⁸.

Figure 5.2: Boxplots with length of peaks obtained by the tested peakcallers on various tissues. On the left, whole boxplots are shown, with outliers as large as 30 Mb. On the right zoomed in boxplots are shown - here the x-axis is trimmed to range (0, 50 kb). For readability, outliers are not shown here.

Figure 5.3: Median expression of genes in RPKM. "HERON G" denotes HERON with Gaussian distribution; "G + C" means "Gaussian distribution + control le", and "NB"

means negative binomial distribution. The rst column represents all the genes; the fol-lowing columns represent genes that overlap with peaks identied by specic peakcaller.

Here the results from all the tissues at once are shown. As H3K27me3 peaks are known to have repressive eect on expression, we expect the "all genes" bar to be the highest;

indeed, all the peakcallers seem to return regions with lowered expression.

Figure 5.4: Peakcalling with HERON on multiple adult liver tissue samples from Roadmap Epigenomics. In each barplot, the rst three bars represent results of a peak-calling done on a single le (denoted A, B or C). The last bar represents results of a peakcalling done on all the three les at once. The left plot shows mean length of the obtained peaks, and the right one mean expression of genes overlapping the peaks. The results shown here are obtained by running HERON with Gaussian distribution; using negative binomial distribution yielded similar results, at least as far as the overall con-clusions are concerned.

5.2.2. Multiple samples

For two of the analysed tissues - adult liver and brain hippocampus middle - three samples were available. I decided to check if number of used input les inuences HERON's results in the case of experimental, not simulated data. I ran HERON on all the samples at once and compared the results to the ones obtained using only one sample. Specically, I compared the mean peak length and mean expression of genes within peaks for the single sample and for all three. In gure 5.4 I show the results for the liver samples. It can be seen that peakcalling on a single le returns longer peaks, but the genes within them have higher expression. It can be concluded that the peakcalling on all the samples gives a more robust set of peaks; they might be shorter, but we can be more cartain that they indeed correlate with lowered expression. The regions idenited with all the les can be viewed as a consensus set of regions between those identied on the single les; they are probably more reliable, though they are shorter and overall cover less bases in the genome. Results for hippocampus samples (not shown here) yielded similar conclusions.

It is relevant to note here the variance within the results, even for the same tissue. For example, for the liver tissue the total number of bases identied as within peaks by MACS varies from 284 Mb to 928 Mb, and by HERON with control le - from 555 Mb to 1.6 Gb.

Simmilar dierences can be observed for the brain hippocampus middle samples; other metrics, like number of peaks within genes or median expression also tend to vary. This observation makes evident how important it is to use multiple les whenever possible;

using only one can result in obtaining biased and untrustworthy results.

Figure 5.5: Attempt at dierantiating between two tissues using HERON. Each dot represents a sample and is coloured according to its tissue. x axis represents states, and y axis represents mean of distribution of emission in each of the states. The plot shows the nal means obtained by HERON as a result of EM algorithm.

5.2.3. Multiple samples with subpopulations

The patterns of H3K27me3 modication are distinctive for various tissues [37] [18]. I wanted to check whether HERON would be able to catch the diernces in peak patterns between two tissues; I chose adult liver and hippocampus for this test, as they both are available in three samples, which might help HERON to converge to reliable results.

In gure 5.5 I present the nal means of emissions, as estimated by HERON after EM algorithm's convergence. The plot is similar to the ones presented in the previous section (see gures 4.8 and 4.9); x axis represents states, and y axis - mean of distribution of emission in each of the states; each dot represents a single sample, and they are slightly shued along the x axis, so that they would not entirely overlap; they are coloured according to their tissue. We want to obtain states representing peaks unique for the liver, unique for the hippocampus and common between the tissues; i.e. one state in which all the blue dots would be much higher than the red ones, one reversed, and one where all the dots are high. As one might have expected after not so promising results on the simulated data (section 4.3.3), HERON did not return the desired states; we have simply ve states with ascending means, and the trend is similar for both tissues. I tried the same procedure with various parameters (number of used les, distribution, window width, initial means, including control les), but I did not manage to obtain satisfying results.

W dokumencie HMM-based Method for Identifying Enrichment in Signal from Sequencing-based Experiments (Stron 91-98)