Summary

As with the results obtained on healthy tissues from the Roadmap Epigenomics project, described in chapter 5, peakcalling on H3K27me3 ChIP-seq data using HERON yielded much longer peaks than SICER or MACS (see figure 6.2). Analyses of the expression of genes that overlap the identified peaks suggest that all three peakcallers return reliable results: in all cases the mean expression of those genes was significantly lower than the mean expression of all genes (figure 6.3). As more samples were available in this chapter than in the Roadmap Epigenomics data, I could additionally identify genes that seem to carry the H3K27me3 modification only in a subset of patients, and check whether in such cases the presence or absence of the modification can be used as a predictor of expression level. Indeed, the methylated genes tended to have lower expression than those without the modification (see figure 6.5).

Interestingly, peakcalling on multiple samples gave different results from those in the previous chapter. In chapter 5, peakcalling on multiple samples yielded shorter peaks, and the expression of genes overlapping those peaks was lower than in the case of peakcalling on a single sample. Here the results were similar only with respect to peak length. The expression of genes that overlap peaks called using multiple files tended to be similar to or even higher than the expression of those that overlap peaks called using a single sample. This might be due to the characteristics of the analysed samples - cancer tissues tend to be much more variable and to have more aberrant expression than healthy ones, so assessing the results of peakcalling based merely on expression might yield misleading conclusions.

In this chapter I also showed that the negative binomial distribution seems to fit the experimental data better than the Gaussian distribution. The results of the peakcalling, however, suggest there is no obvious gain from using it. This may stem from the fact that the negative binomial distribution is more difficult to fit, and using it can cause various issues with parameter estimation and convergence of the algorithm.

| Patient ID | On peaks, Wasserstein distance, NB | On peaks, Wasserstein distance, Gaussian | On peaks, Kullback-Leibler divergence, NB | On peaks, Kullback-Leibler divergence, Gaussian | Outside of peaks, Wasserstein distance, NB | Outside of peaks, Wasserstein distance, Gaussian | Outside of peaks, Kullback-Leibler divergence, NB | Outside of peaks, Kullback-Leibler divergence, Gaussian |
|---|---|---|---|---|---|---|---|---|
| PA01 | 1.748 | 7.856 | 0.061 | 0.414 | 0.108 | 0.598 | 0.04 | 7.266 |
| PA02 | 1.337 | 1.688 | 0.226 | 0.244 | 0.101 | 0.728 | 0.025 | 6.414 |
| PA03 | 1.236 | 5.928 | 0.046 | 0.36 | 0.09 | 0.525 | 0.04 | 7.154 |
| PA04 | 1.209 | 6.054 | 0.048 | 0.346 | 0.116 | 0.63 | 0.05 | 7.388 |
| PA05 | 1.15 | 5.96 | 0.044 | 0.36 | 0.12 | 0.583 | 0.049 | 7.935 |
| PA06 | 0.884 | 4.46 | 0.04 | 0.3 | 0.1 | 0.539 | 0.041 | 8.093 |
| PA07 | 0.9 | 4.446 | 0.043 | 0.32 | 0.133 | 0.685 | 0.046 | 7.068 |
| DA01 | 0.69 | 3.403 | 0.034 | 0.228 | 0.123 | 0.55 | 0.052 | 8.445 |
| DA02 | 0.345 | 0.676 | 0.059 | 0.107 | 0.18 | 0.626 | 0.083 | 8.552 |
| DA03 | 1.165 | 5.089 | 0.053 | 0.35 | 0.063 | 0.439 | 0.03 | 8.352 |
| DA04 | 0.81 | 3.863 | 0.032 | 0.276 | 0.072 | 0.443 | 0.038 | 8.993 |
| DA05 | 1.494 | 5.977 | 0.053 | 0.361 | 0.068 | 0.43 | 0.036 | 8.765 |
| DA06 | 1.363 | 6.303 | 0.046 | 0.364 | 0.13 | 0.742 | 0.055 | 7.23 |
| GB01 | 1.535 | 7.27 | 0.053 | 0.395 | 0.118 | 0.658 | 0.049 | 6.978 |
| GB02 | 1.217 | 6.521 | 0.05 | 0.38 | 0.209 | 0.853 | 0.077 | 6.745 |
| GB03 | 1.027 | 5.084 | 0.05 | 0.325 | 0.223 | 0.766 | 0.09 | 6.919 |
| GB04 | 0.509 | 3.456 | 0.026 | 0.281 | 0.016 | 0.284 | 0.004 | 9.581 |
| GB05 | 0.321 | 2.054 | 0.023 | 0.19 | 0.163 | 0.642 | 0.09 | 7.539 |
| GB06 | 0.409 | 2.452 | 0.025 | 0.222 | 0.187 | 0.718 | 0.076 | 7.8 |
| GB07 | 0.868 | 4.743 | 0.041 | 0.327 | 0.172 | 0.767 | 0.077 | 7.118 |
| GB08 | 1.441 | 6.481 | 0.052 | 0.367 | 0.12 | 0.621 | 0.055 | 6.9 |
| GB09 | 0.276 | 0.959 | 0.015 | 0.11 | 0.178 | 0.702 | 0.092 | 8.193 |
| GB10 | 0.782 | 4.189 | 0.045 | 0.296 | 0.219 | 0.724 | 0.113 | 8.28 |
| PG11 | 0.492 | 1.97 | 0.029 | 0.126 | 0.197 | 0.586 | 0.96 | 9.09 |

Table 6.3: Accuracy of fit for two distributions, Gaussian and negative binomial ("NB" in the header), for data from H3K4me3 peaks and from outside of peaks for every patient. For both measures, the smaller the value, the better the fit; in every pair of columns the smaller value is always the one for the negative binomial distribution.
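The kind of comparison reported in Table 6.3 can be illustrated with a minimal sketch. It is not the code used to produce the table: it assumes integer per-window counts, fits both distributions by the method of moments, measures the Kullback-Leibler divergence from the empirical distribution to the fitted one, and discretises the Gaussian into unit-wide bins. The function names (`nb_pmf`, `gaussian_bin_mass`, `compare_fits`, `sample_nb`), the `eps` floor and the simulated input are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def nb_pmf(k, r, p):
    """P(X = k) for a negative binomial with size r and success probability p."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)


def gaussian_bin_mass(k, mu, sigma):
    """Gaussian probability mass falling into the unit bin [k - 0.5, k + 0.5]."""
    lower = (k - 0.5 - mu) / (sigma * math.sqrt(2))
    upper = (k + 0.5 - mu) / (sigma * math.sqrt(2))
    return 0.5 * (math.erfc(lower) - math.erfc(upper))


def compare_fits(counts, eps=1e-12):
    """Return {model: (wasserstein, kl)} for moment-matched NB and Gaussian fits."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    if var <= mean:
        raise ValueError("method-of-moments NB fit needs variance > mean")
    p = mean / var              # NB success probability
    r = mean * p / (1 - p)      # NB size (dispersion) parameter
    support = range(max(counts) + 1)
    freq = Counter(counts)
    empirical = [freq[k] / n for k in support]
    models = {
        "NB": [nb_pmf(k, r, p) for k in support],
        "Gaussian": [gaussian_bin_mass(k, mean, math.sqrt(var)) for k in support],
    }
    results = {}
    for name, probs in models.items():
        total = sum(probs)
        probs = [q / total for q in probs]  # renormalise on the observed support
        # KL(empirical || model); eps guards against vanishing Gaussian tails
        kl = sum(e * math.log(e / max(q, eps))
                 for e, q in zip(empirical, probs) if e > 0)
        # 1-D Wasserstein distance on a unit grid: summed absolute CDF difference
        wasserstein, cdf_gap = 0.0, 0.0
        for e, q in zip(empirical, probs):
            cdf_gap += e - q
            wasserstein += abs(cdf_gap)
        results[name] = (wasserstein, kl)
    return results


def sample_nb(r, p, rng):
    """Failures before the r-th success (integer r), as a sum of geometric draws."""
    return sum(int(math.log(1.0 - rng.random()) / math.log(1.0 - p))
               for _ in range(r))


rng = random.Random(0)
counts = [sample_nb(2, 0.2, rng) for _ in range(5000)]  # simulated, overdispersed
for model, (w, kl) in compare_fits(counts).items():
    print(f"{model}: Wasserstein={w:.3f}  KL={kl:.3f}")
```

On overdispersed, right-skewed counts such as these, the symmetric Gaussian places mass below zero and too little in the upper tail, which is one plausible reason for the consistently larger distances in the Gaussian columns.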

Chapter 7

The development of high-throughput sequencing (also called next-generation sequencing, NGS) and its rising availability in the 2000s opened many new possibilities. Over the last twenty years, many experimental protocols based on these methods have been developed, such as ChIP-seq, ATAC-seq, RNA-seq, CUT&Tag or Hi-C. Thanks to their high throughput, they allow various phenomena in living cells, for example the expression of genes or the localisation of proteins bound to DNA, to be examined on a huge scale. Many projects that collect data from such experiments have been launched. Probably the most notable is ENCODE - Encyclopedia of DNA Elements [15] [13], but there are also EnhancerAtlas [20], FANTOM (Functional ANnoTation Of the Mammalian genome) [44] [43], TCGA (The Cancer Genome Atlas) [75], Roadmap Epigenomics [37], ENPG (Encyclopedia of Plant Genome) [17] and countless more. NGS-based experiments tend to produce large amounts of data that require further downstream analyses, which in turn created a need for many new tools designed to work with such data. Many such tools have been created, sometimes as part of the projects mentioned above. Some are designed to work only with a specific type of data; others are meant to work equally well with any type of input, but in practice there are always cases in which these supposedly universal tools fail. As a result, even though more and more tools are created every year, there is still an ongoing need to develop and implement new ones.

In this dissertation, I've described several methods of performing peakcalling - a procedure aiming at discovering enrichment in signal from various NGS-based experiments. I've shown that, especially for very long and poorly enriched domains, the existing tools often give unsatisfactory results. I've presented a new tool called HERON, which uses Hidden Markov Models with continuous emissions drawn from either a Gaussian or a negative binomial distribution. HERON proved to work better than the other tested peakcallers under the conditions described above. I've shown this on experimental data from two different published projects [37] [70], specifically from ChIP-seq experiments against the H3K27me3 modification performed on healthy and cancer human tissues.
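The idea of calling peaks with a Hidden Markov Model can be illustrated with a minimal sketch. It is not HERON's implementation: it assumes a two-state model (background and peak) with Gaussian emissions, uses hand-picked parameters instead of parameters trained on the data, and decodes the most likely state path with the Viterbi algorithm [79]. The function names and all numeric values are hypothetical.

```python
import math


def gaussian_logpdf(x, mu, sigma):
    """Log-density of a Gaussian with mean mu and standard deviation sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def viterbi(signal, means, sigmas, log_start, log_trans):
    """Most likely state path; log_trans[a][b] is the log-probability of a -> b."""
    n_states = len(means)
    # scores[s]: best log-probability of any path ending in state s
    scores = [log_start[s] + gaussian_logpdf(signal[0], means[s], sigmas[s])
              for s in range(n_states)]
    backpointers = []
    for x in signal[1:]:
        new_scores, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda q: scores[q] + log_trans[q][s])
            pointers.append(best_prev)
            new_scores.append(scores[best_prev] + log_trans[best_prev][s]
                              + gaussian_logpdf(x, means[s], sigmas[s]))
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state
    state = max(range(n_states), key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def states_to_peaks(path, peak_state=1):
    """Merge consecutive windows in the peak state into half-open intervals."""
    peaks, start = [], None
    for i, state in enumerate(path):
        if state == peak_state and start is None:
            start = i
        elif state != peak_state and start is not None:
            peaks.append((start, i))
            start = None
    if start is not None:
        peaks.append((start, len(path)))
    return peaks


# Hypothetical per-window coverage with one weakly enriched window (value 2)
# inside an otherwise enriched stretch.
signal = [1, 0, 1, 1, 4, 5, 2, 4, 5, 1, 0, 1]
log = math.log
path = viterbi(signal, means=[1.0, 4.0], sigmas=[1.0, 1.0],
               log_start=[log(0.5), log(0.5)],
               log_trans=[[log(0.9), log(0.1)], [log(0.1), log(0.9)]])
print(states_to_peaks(path))
```

Because leaving and re-entering the peak state costs two unlikely transitions, the weakly enriched window in the middle stays inside the peak, which illustrates why an HMM can report long, contiguous domains where a per-window threshold would fragment them.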

Additionally, I've developed a package for simulating data from NGS-based experiments; it allows assessing in a controlled way how various data features influence peakcalling. In particular, using this package I've shown, for the different tested peakcallers, how the quality of the obtained results depends on the width of the sought peaks and their enrichment. The conclusions drawn from these simulations support the ones drawn from analysing experimental data - HERON outperforms other peakcallers when the data is poorly enriched and the peaks are long. Furthermore, the analyses show the importance of choosing the right tool for different types of data. I've tested four tools on a broad range of parameters regarding data quality and characteristics; as it turned out, each tool gives satisfactory results only within some narrow subspace of parameters. While it is tempting to use just one tool for every kind of data, these simulations show that in many cases such an approach will yield poor and biased results.
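The dependence of peakcalling quality on enrichment can be illustrated with a toy simulation. This is a minimal sketch, not the simulation package described above: it assumes that per-window read counts are Poisson-distributed, that the rate inside a peak is the background rate multiplied by an enrichment factor, and that peaks are called by a naive fixed threshold. All function names and parameter values are hypothetical.

```python
import math
import random


def poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's algorithm (adequate for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_coverage(n_windows, peaks, background, enrichment, rng):
    """Simulate per-window counts; peaks are half-open (start, end) intervals."""
    in_peak = [False] * n_windows
    for start, end in peaks:
        for i in range(start, end):
            in_peak[i] = True
    counts = [poisson(background * (enrichment if inside else 1.0), rng)
              for inside in in_peak]
    return counts, in_peak


def threshold_recall(counts, in_peak, threshold):
    """Fraction of true peak windows whose count reaches the threshold."""
    hits = sum(1 for c, inside in zip(counts, in_peak) if inside and c >= threshold)
    return hits / sum(in_peak)


rng = random.Random(1)
peaks = [(100, 300), (600, 800)]
for enrichment in (1.5, 3.0, 6.0):
    counts, in_peak = simulate_coverage(1000, peaks, background=2.0,
                                        enrichment=enrichment, rng=rng)
    recall = threshold_recall(counts, in_peak, threshold=6)
    print(f"enrichment={enrichment}: recall={recall:.2f}")
```

Even in this simplified setting, a caller tuned for strongly enriched data misses most windows of a weakly enriched domain, which is the regime in which the simulations favoured HERON.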

The peakcaller HERON was published in 2021 in the International Journal of Molecular Sciences [46]. Both HERON and the package for data simulation are publicly available on GitHub under the Beerware licence (see https://github.com/maciosz/HERON and https://github.com/maciosz/NGS_simulation).

Acknowledgements

The results presented in this dissertation were obtained as part of the grant "An atlas of brain regulatory regions and regulatory networks - a novel systems biology approach to pathogenesis of selected neurological disorders", funded by the National Science Centre in Poland (DEC-2015/16/W/NZ2/00314).

Figures 2.2 and 2.3 are reprinted from papers [87] and [69] respectively, where they are published under a Creative Commons license. All other figures were created by me, using Inkscape [28], Integrated Genome Browser [54] and JBrowse [68] (for screenshots from a genome browser) and the ggplot2 library [83] for R (for plots).

Bibliography

[1] Hayato Anzawa, Hitoshi Yamagata, and Kengo Kinoshita. Theoretical characterisation of strand cross-correlation in ChIP-seq. BMC bioinformatics, 21(1):119, 2020.

[2] Artem Barski, Suresh Cuddapah, Kairong Cui, Tae-Young Roh, Dustin E Schones, Zhibin Wang, Gang Wei, Iouri Chepelev, and Keji Zhao. High-resolution profiling of histone methylations in the human genome. Cell, 129(4):823-837, 2007.

[3] David F Bauer. Constructing confidence sets using rank statistics. Journal of the American Statistical Association, 67(339):687-690, 1972.

[4] Leonard E Baum, Ted Petrie, George Soules, and Norman Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The annals of mathematical statistics, 41(1):164-171, 1970.

[5] Bradley E Bernstein, Tarjei S Mikkelsen, Xiaohui Xie, Michael Kamal, Dana J Huebert, James Cuff, Ben Fry, Alex Meissner, Marius Wernig, Kathrin Plath, et al. A bivalent chromatin structure marks key developmental genes in embryonic stem cells. Cell, 125(2):315-326, 2006.

[6] Kimberly R Blahnik, Lei Dou, Henriette O'Geen, Timothy McPhillips, Xiaoqin Xu, Alina R Cao, Sushma Iyengar, Charles M Nicolet, Bertram Ludäscher, Ian Korf, et al. Sole-Search: an integrated analysis program for peak detection and functional annotation using ChIP-seq data. Nucleic acids research, 38(3):e13-e13, 2010.

[7] Alan P Boyle, Sean Davis, Hennady P Shulha, Paul Meltzer, Elliott H Margulies, Zhiping Weng, Terrence S Furey, and Gregory E Crawford. High-resolution mapping and characterization of open chromatin across the genome. Cell, 132(2):311-322, 2008.

[8] Alan P Boyle, Justin Guinney, Gregory E Crawford, and Terrence S Furey. F-Seq: a feature density estimator for high-throughput sequence tags. Bioinformatics, 24(21):2537-2538, 2008.

[9] Jason D Buenrostro, Paul G Giresi, Lisa C Zaba, Howard Y Chang, and William J Greenleaf. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nature methods, 10(12):1213-1218, 2013.

[10] Jason D Buenrostro, Beijing Wu, Howard Y Chang, and William J Greenleaf. ATAC-seq: a method for assaying chromatin accessibility genome-wide. Current protocols in molecular biology, 109(1):21-29, 2015.

[11] Jason D Buenrostro, Beijing Wu, Ulrike M Litzenburger, Dave Ruff, Michael L Gonzales, Michael P Snyder, Howard Y Chang, and William J Greenleaf. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature, 523(7561):486-490, 2015.

[12] Yen-Chun Chen, Tsunglin Liu, Chun-Hui Yu, Tzen-Yuh Chiang, and Chi-Chuan Hwang. Effects of GC bias in next-generation-sequencing data on de novo genome assembly. PloS one, 8(4):e62856, 2013.

[13] ENCODE Project Consortium et al. An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414):57, 2012.

[14] Darren A Cusanovich, Riza Daza, Andrew Adey, Hannah A Pliner, Lena Christiansen, Kevin L Gunderson, Frank J Steemers, Cole Trapnell, and Jay Shendure. Multiplex single-cell profiling of chromatin accessibility by combinatorial cellular indexing. Science, 348(6237):910-914, 2015.

[15] Carrie A Davis, Benjamin C Hitz, Cricket A Sloan, Esther T Chan, Jean M Davidson, Idan Gabdank, Jason A Hilton, Kriti Jain, Ulugbek K Baymuradov, Aditi K Narayanan, et al. The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic acids research, 46(D1):D794-D801, 2018.

[16] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1-22, 1977.

[17] http://www.plantseq.org/.

[18] Jason Ernst, Pouya Kheradpour, Tarjei S Mikkelsen, Noam Shoresh, Lucas D Ward, Charles B Epstein, Xiaolan Zhang, Li Wang, Robbyn Issner, Michael Coyne, et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature, 473(7345):43-49, 2011.

[19] Yair Field, Noam Kaplan, Yvonne Fondufe-Mittendorf, Irene K Moore, Eilon Sharon, Yaniv Lubling, Jonathan Widom, and Eran Segal. Distinct modes of regulation by chromatin encoded through nucleosome positioning signals. PLoS computational biology, 4(11):e1000216, 2008.

[20] Tianshun Gao and Jiang Qian. EnhancerAtlas 2.0: an updated resource with enhancer annotation in 586 tissue/cell types across nine species. Nucleic acids research, 48(D1):D58-D64, 2020.

[21] https://github.com/jsh58/genrich.

[22] Kevin Grosselin, Adeline Durand, Justine Marsolier, Adeline Poitou, Elisabetta Marangoni, Fariba Nemati, Ahmed Dahmani, Sonia Lameiras, Fabien Reyal, Olivia Frenoy, et al. High-throughput single-cell ChIP-seq identifies heterogeneity of chromatin states in breast cancer. Nature genetics, 51(6):1060-1066, 2019.

[23] https://github.com/hmmlearn/hmmlearn.

[24] Radmila Hrdlickova, Masoud Toloue, and Bin Tian. RNA-Seq methods for transcriptome analysis. Wiley Interdisciplinary Reviews: RNA, 8(1):e1364, 2017.

[25] Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, and Raj Reddy. Spoken language processing: A guide to theory, algorithm, and system development. Prentice hall PTR, 2001.

[26] Peter Humburg. ChIPsim: Simulation of ChIP-seq experiments, 2011. R package version 1.32.0.

[27] Mahmoud M Ibrahim, Scott A Lacadie, and Uwe Ohler. JAMM: a peak finder for joint analysis of NGS replicates. Bioinformatics, 31(1):48-55, 2015.

[28] Inkscape Project. (2020). Retrieved from https://inkscape.org.

[29] Hyeongrin Jeon, Hyunji Lee, Byunghee Kang, Insoon Jang, and Tae-Young Roh. Comparative analysis of commonly used peak calling programs for ChIP-Seq analysis. Genomics & informatics, 18(4), 2020.

[30] Hongkai Ji, Hui Jiang, Wenxiu Ma, David S Johnson, Richard M Myers, and Wing H Wong. An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nature biotechnology, 26(11):1293-1300, 2008.

[31] Hongkai Ji and Wing Hung Wong. TileMap: create chromosomal map of tiling array hybridizations. Bioinformatics, 21(18):3629-3636, 2005.

[32] Leonid V Kantorovich. Mathematical methods of organizing and planning production. Management science, 6(4):366-422, 1960.

[33] Hatice S Kaya-Okur, Steven J Wu, Christine A Codomo, Erica S Pledger, Terri D Bryson, Jorja G Henikoff, Kami Ahmad, and Steven Henikoff. CUT&Tag for efficient epigenomic profiling of small samples and single cells. Nature communications, 10(1):1-10, 2019.

[34] Joomyeong Kim and Hana Kim. Recruitment and biological consequences of histone modification of H3K27me3 and H3K9me3. ILAR journal, 53(3-4):232-239, 2012.

[35] Robert M Kuhn, David Haussler, and W James Kent. The UCSC genome browser and associated tools. Briefings in bioinformatics, 14(2):144-161, 2013.

[36] Solomon Kullback and Richard A Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79-86, 1951.

[37] Anshul Kundaje, Wouter Meuleman, Jason Ernst, Misha Bilenky, Angela Yen, Alireza Heravi-Moussavi, Pouya Kheradpour, Zhizhuo Zhang, Jianrong Wang, Michael J Ziller, et al. Integrative analysis of 111 reference human epigenomes. Nature, 518(7539):317-330, 2015.

[38] Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4):357, 2012.

[39] Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997, 2013.

[40] Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094-3100, 2018.

[41] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin. The sequence alignment/map format and SAMtools. Bioinformatics, 25(16):2078-2079, 2009.

[42] Erez Lieberman-Aiden, Nynke L Van Berkum, Louise Williams, Maxim Imakaev, Tobias Ragoczy, Agnes Telling, Ido Amit, Bryan R Lajoie, Peter J Sabo, Michael O Dorschner, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science, 326(5950):289-293, 2009.

[43] Marina Lizio, Imad Abugessaisa, Shuhei Noguchi, Atsushi Kondo, Akira Hasegawa, Chung Chau Hon, Michiel De Hoon, Jessica Severin, Shinya Oki, Yoshihide Hayashizaki, et al. Update of the FANTOM web resource: expansion to provide additional transcriptome atlases. Nucleic acids research, 47(D1):D752-D758, 2019.

[44] Marina Lizio, Jayson Harshbarger, Hisashi Shimoji, Jessica Severin, Takeya Kasukawa, Serkan Sahin, Imad Abugessaisa, Shiro Fukuda, Fumi Hori, Sachi Ishikawa-Kato, et al. Gateways to the FANTOM5 promoter level mammalian expression atlas. Genome biology, 16(1):1-14, 2015.

[45] Shaoqian Ma and Yongyou Zhang. Profiling chromatin regulatory landscape: insights into the development of ChIP-seq and ATAC-seq. Molecular Biomedicine, 1(1):1-13, 2020.

[46] Anna Macioszek and Bartek Wilczynski. HERON: A novel tool enables identification of long, weakly enriched genomic domains in ChIP-seq data. International Journal of Molecular Sciences, 22(15):8123, 2021.

[47] Shaun Mahony and B Franklin Pugh. Protein-DNA binding in high-resolution. Critical reviews in biochemistry and molecular biology, 50(4):269-283, 2015.

[48] Brandon M Malone, Feng Tan, Susan M Bridges, and Zhaohua Peng. Comparison of four ChIP-Seq analytical algorithms using rice endosperm H3K27 trimethylation profiling data. PloS one, 6(9):e25260, 2011.

[49] Henry B Mann and Donald R Whitney. On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics, pages 50-60, 1947.

[50] Mariann Micsinai, Fabio Parisi, Francesco Strino, Patrik Asp, Brian D Dynlacht, and Yuval Kluger. Picking ChIP-seq peak detectors for analyzing chromatin modification experiments. Nucleic acids research, 40(9):e70-e70, 2012.

[51] Ali Mortazavi, Brian A Williams, Kenneth McCue, Lorian Schaeffer, and Barbara Wold. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods, 5(7):621-628, 2008.

[52] Kasper Munch, Paul P Gardner, Peter Arctander, and Anders Krogh. A hidden Markov model approach for determining expression from genomic tiling micro arrays. BMC bioinformatics, 7(1):1-11, 2006.

[53] Ryuichiro Nakato and Katsuhiko Shirahige. Recent advances in ChIP-seq analysis: from quality management to whole-genome annotation. Briefings in bioinformatics, 18(2):279-290, 2017.

[54] John W Nicol, Gregg A Helt, Steven G Blanchard Jr, Archana Raja, and Ann E Loraine. The Integrated Genome Browser: free software for distribution and exploration of genome-scale datasets. Bioinformatics, 25(20):2730-2731, 2009.

[55] Peter J Park. ChIP-seq: advantages and challenges of a maturing technology. Nature reviews genetics, 10(10):669-680, 2009.

[56] Florian M Pauler, Mathew A Sloane, Ru Huang, Kakkad Regha, Martha V Koerner, Ido Tamir, Andreas Sommer, Andras Aszodi, Thomas Jenuwein, and Denise P Barlow. H3K27me3 forms BLOCs over silent genes and intergenic regions and specifies a histone banding pattern on a mouse autosomal chromosome. Genome research, 19(2):221-233, 2009.

[57] Shirley Pepke, Barbara Wold, and Ali Mortazavi. Computation for ChIP-seq and RNA-seq studies. Nature methods, 6(11):S22-S32, 2009.

[58] https://github.com/pysam-developers/pysam.

[59] Zhaohui S Qin, Jianjun Yu, Jincheng Shen, Christopher A Maher, Ming Hu, Shanker Kalyana-Sundaram, Jindan Yu, and Arul M Chinnaiyan. HPeak: an HMM-based algorithm for defining read-enriched regions in ChIP-Seq data. BMC bioinformatics, 11(1):1-13, 2010.

[60] Aaron R Quinlan and Ira M Hall. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6):841-842, 2010.

[61] Naim U Rashid, Paul G Giresi, Joseph G Ibrahim, Wei Sun, and Jason D Lieb. ZINBA integrates local covariates with DNA-seq data to identify broad and narrow regions of enrichment, even within amplified genomic regions. Genome biology, 12(7):1-20, 2011.

[62] Assaf Rotem, Oren Ram, Noam Shoresh, Ralph A Sperling, Alon Goren, David A Weitz, and Bradley E Bernstein. Single-cell ChIP-seq reveals cell subpopulations defined by chromatin state. Nature biotechnology, 33(11):1165-1172, 2015.

[63] Joel Rozowsky, Ghia Euskirchen, Raymond K Auerbach, Zhengdong D Zhang, Theodore Gibson, Robert Bjornson, Nicholas Carriero, Michael Snyder, and Mark B Gerstein. PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nature biotechnology, 27(1):66-75, 2009.

[64] Peter J Sabo, Richard Humbert, Michael Hawrylycz, James C Wallace, Michael O Dorschner, Michael McArthur, and John A Stamatoyannopoulos. Genome-wide identification of DNaseI hypersensitive sites using active chromatin sequence libraries. Proceedings of the National Academy of Sciences, 101(13):4537-4542, 2004.

[65] Valerie A Schneider, Tina Graves-Lindsay, Kerstin Howe, Nathan Bouk, Hsiu-Chuan Chen, Paul A Kitts, Terence D Murphy, Kim D Pruitt, Françoise Thibaud-Nissen, Derek Albracht, et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome research, 27(5):849-864, 2017.

[66] Hideaki Shimazaki and Shigeru Shinomoto. A method for selecting the bin size of a time histogram. Neural computation, 19(6):1503-1527, 2007.

[67] Peter J Skene, Jorja G Henikoff, and Steven Henikoff. Targeted in situ genome-wide profiling with high efficiency for low cell numbers. Nature protocols, 13(5):1006-1019, 2018.

[68] Mitchell E Skinner, Andrew V Uzilov, Lincoln D Stein, Christopher J Mungall, and Ian H Holmes. JBrowse: a next-generation genome browser. Genome research, 19(9):1630-1638, 2009.

[69] Christiana Spyrou, Rory Stark, Andy G Lynch, and Simon Tavaré. BayesPeak: Bayesian analysis of ChIP-seq data. BMC bioinformatics, 10(1):1-17, 2009.

[70] Karolina Stępniak, Magdalena A Machnicka, Jakub Mieczkowski, Anna Macioszek, Bartosz Wojtaś, Bartłomiej Gielniewski, Katarzyna Poleszak, Malgorzata Perycz, Sylwia K Król, Rafał Guzik, et al. Mapping chromatin accessibility and active regulatory elements reveals pathological mechanisms in human gliomas. Nature Communications, 12(1):1-17, 2021.

[71] Student. The probable error of a mean. Biometrika, pages 1-25, 1908.

[72] Amos Tanay and Aviv Regev. Scaling single-cell genomics from phenomenology to mechanism. Nature, 541(7637):331-338, 2017.

[73] Fuchou Tang, Catalin Barbacioru, Yangzhou Wang, Ellen Nordman, Clarence Lee, Nanlan Xu, Xiaohui Wang, John Bodeau, Brian B Tuch, Asim Siddiqui, et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nature methods, 6(5):377-382, 2009.

[74] Fuchou Tang, Kaiqin Lao, and M Azim Surani. Development and applications of single-cell transcriptome analysis. Nature methods, 8(4):S6-S11, 2011.

[75] TCGA Research Network: https://www.cancer.gov/tcga.

[76] Reuben Thomas, Sean Thomas, Alisha K Holloway, and Katherine S Pollard. Features that define the best ChIP-seq peak calling algorithms. Briefings in bioinformatics, 18(3):441-450, 2017.

[77] Leonid Nisonovich Vaserstein. Markov processes over denumerable products of spaces, describing large systems of automata. Problemy Peredachi Informatsii, 5(3):64-72, 1969.

[78] Vinsensius B Vega, Edwin Cheung, Nallasivam Palanisamy, and Wing-Kin Sung. Inherent signals in sequencing-based Chromatin-ImmunoPrecipitation control libraries. PloS one, 4(4):e5241, 2009.

[79] Andrew Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE transactions on Information Theory, 13(2):260-269, 1967.

[80] Günter P Wagner, Koryu Kin, and Vincent J Lynch. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory in biosciences, 131(4):281-285, 2012.

[81] Congmao Wang, Jie Xu, Dasheng Zhang, Zoe A Wilson, and Dabing Zhang. An effective approach for identification of in vivo protein-DNA binding sites from paired-end ChIP-Seq data. BMC bioinformatics, 11(1):1-8, 2010.

[82] Andreas PM Weber. Discovering new biology through sequencing of RNA. Plant physiology, 169(3):1524-1531, 2015.

[83] Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

[84] Elizabeth G Wilbanks and Marc T Facciotti. Evaluation of algorithm performance in ChIP-seq peak detection. PloS one, 5(7):e11471, 2010.

[85] Guanjue Xiang, Cheryl A Keller, Belinda Giardine, Lin An, Qunhua Li, Yu Zhang, and Ross C Hardison. S3norm: simultaneous normalization of sequencing depth and signal-to-noise ratio in epigenomic data. Nucleic acids research, 48(8):e43-e43, 2020.

[86] Chongzhi Zang, Dustin E Schones, Chen Zeng, Kairong Cui, Keji Zhao, and Weiqun Peng. A clustering approach for identification of enriched domains from histone modification ChIP-Seq data. Bioinformatics, 25(15):1952-1958, 2009.
