Workow

Chapter 4

Simulating data

To assess HERON's performance, we need to test it on simulated data. The advantage of such approach is that we know exactly where the peaks we want to discover are; hence we can precisely calculate specicity, sensitivity and other measures of any tested peakcaller.

Additionally, we can control the quality and other characteristics of the simulated data, and therefore assess how they inuence obtained results.

In this chapter I present my approach to simulating data from NGS-based experiment and results obtained on it by various peakcallers. In section 4.1 I describe the whole workow of simulating data. In 4.2 I show what datasets I decided to generate and run peakcalling on. Eventually, in 4.3 I show the results obtained by four tested peakcallers:

HERON, MACS, BayesPeak and SICER; in particular I present that these peakcallers produce results of varying quality basing on characteristics of the given data and I show which peakcaller seems to be best suited for every kind of tested data.

Figure 4.1: Idea behind the simulation package. On the left, all the steps of the

work-ow are shown, together with the tools that perform them. On the right, an example screenshot from genome browser is presented: at the top, there is a single peak used to generate the reads. Two lower tracks, the red and the blue one, are the source les with reads sampled only from peaks and uniformly from the whole genome, respectively. The four bottom tracks are the two source tracks sampled and merged in dierent proportions.

with signal generated from the peak regions (peak le) and one with signal generated from uniform distribution over the whole genome (background le). By default Bowtie2 [38]

is used at this step, but in principle any program capable of mapping short reads, such as minimap2 [40] or BWA [39], could be used for this task as well. Thanks to generating reads and mapping them, instead of simulating a ready signal or already mapped reads, we can incorporate in our data various biases introduced during mapping, mimicking the real data better; for example reads coming from regions with very low mappability (see Section 2.2.4 for the denition of a mappability map) tend to get wrongly mapped. It is important to stress that because of that, the peak le does not necessarily have non-zero signal only in the peak regions; due to errors in read mapping some reads usually are mapped to the regions outside peaks. Similarly the background le is dierent than we would expect if there were no biases and errors introduced after the reads were uniformly sampled; some regions are usually more enriched and some less enriched than others, and these trends are conserved between independent replicates of the simulations.

In the nal step, les with custom signal-to-noise ratio can be generated by sampling proper number of reads from the two source les described above and merging them. The sampling and merging is done using bedtools suite [60]. If we want to have an average signal from noise equal to x, we sample x ∗ genome_size/read_length reads from the background le. To obtain average enrichment of peaks equal to y, i.e. to make peaks higher than background on average by y, we sample y ∗ total_peak_width/read_length reads from the peak le, assuming the true peaks together have total_peak_width width.

Note that this means that after combining both sets of reads the average coverage on

peak regions will be close to x + y. Note that it might not be equal exactly to x + y. As some reads from the peak le are actually mapped outside the peak regions, some reads sampled from the peak le might contribute to the signal outside peaks. Similarly, mean signal outside peaks will be close to x, not necessarily equal x.

The sampling can be done many times on the same source les, obtaining les either with dierent signal-to-noise ratio or with the same parameters, but sampled independently;

such les can be treated as technical replicates of the same experiment, and can be used in assessing how number of provided replicates inuences results of peakcalling. The number of reads simulated at the beginning should be large, so that the obtained mean coverage of the two source les will be much bigger than the desired mean coverage of the nal

les. If the two values were close, every new sampling done on source les would reuse most of the reads and hence every simulation would yield very similar results. High initial enrichment ensures more random sampling and minimises correlation between coverages obtained from the same source les.

The overview of the package together with an example simulated coverages obtained with it is presented in gure 4.1. The package is available at https://github.com/maciosz/

NGS_simulation. Apart from bedtools and ChIP-sim, it also uses samtools suite [41].

4.1.1. ChIP-sim

ChIP-sim [26] is an R package, a part of Bioconductor project. It is intented to simulate reads from ChIP-seq experiments so that they can be used in testing various other pro-grams, algorithms or entire pipelines, like programs that perform mapping reads to the genome. The authors also suggest that the researchers can use ChIP-sim to plan their experiments; i.e. one can run an analyses pipeline on simulated data to estimate the number of samples and their quality needed to obtained statistically signicant results.

ChIP-sim's behaviour can be divided into six stages:

1. generate feature sequence

Using Markov chain, sequence of so-called features covering the whole genome is generated. Features can represent, for example, nucleosome placement or simply peaks. They can have dierent stability and intensity. In my package I modify this step, as we want to provide ready coordinates of peaks instead of simulating them. Therefore, the sequence of features is generated deterministically from the coordinates given by the user.

2. compute feature density

The feature sequence is translated into a feature density; for every nucleotide the probability that a feature (e.g. nucleosome) is centred there is computed. This step is needed especially if one wants to simulate features with dierent stability, like fuzzy or stable nucleosomes.

3. compute read density

The read density is calculated for each strand and for every position. It represents probability that during the sequencing process a read starting at this position was generated.

4. sample read start sites

Using the calculated read density, the desired number of positions for reads' starts is sampled.

5. create read names

By default, read names have a format of "read_[read_number]".

6. obtain read sequence and quality

The generated reads' starts are translated into sequences using the provided genome sequence. The qualities can be either read from a provided le, or in any way generated; in my package we simply generate them randomly. The qualities are then used to introduce some errors in reads, mimicking errors in sequencing.

All the steps are performed by one function simChIP. One can easily modify them by providing one's own functions to simChIP. One can also provide intermediate results to skip some of the steps, e.g. if one wants to run multiple simulations on the same features, there is no need to generate new features or repeat the steps 2 and 3, which are deterministic; one could run them once and then sample reads multiple times using the same read density.

W dokumencie HMM-based Method for Identifying Enrichment in Signal from Sequencing-based Experiments (Stron 69-72)

Workow

Chapter 4

Simulating data