
Materials and Methods


3.3 Inter-omics data integration

The statistical p-value integration methods offered promising results in the combined analysis of datasets from different microarray platforms within the same transcriptomics space. This led to the transfer of these procedures to the investigation of two datasets from different fields, proteomics and transcriptomics, obtained from subgroups of workers from a nuclear production facility. The aim of this study was to deepen the knowledge of the molecular mechanisms related to radiation-induced human heart pathology.

3.3.1 Data sets

Proteomics samples

Left ventricle cardiac samples were extracted post mortem from 29 male individuals, as described in detail in (Azimzadeh et al., 2017). These individuals were exposed to different external doses of ionizing radiation during their lifetime. Non-exposed individuals from the same area were used as the control population. Both controls and exposed workers died of ischemic heart disease. The samples were categorized into four groups according to the total dose of ionizing radiation to which the individuals were exposed. The groups were as follows:

• unexposed controls (3 samples),

• < 100 mGy low dose exposed (6 samples),

• 100 − 500 mGy medium dose exposed (10 samples),

• > 500 mGy high dose exposed (10 samples).

The low number of individuals in each group was caused by difficulties in obtaining human data from workers who died of the very specific ischemic heart disease condition; in addition, the time of tissue collection (maximum 4 h post mortem) was paramount in this case. Nevertheless, small sample sizes limit the power of statistical procedures and require a careful and accurate choice of analysis workflows.

The samples were processed in an LC-MS/MS experiment as reported previously (Azimzadeh et al., 2017). The clinical data available for the samples contained total external dose, age, smoking habits, alcohol consumption, and BMI. However, while the dose and age factors differed between the workers, all of the workers were recorded as smokers and drinkers.

RNA-seq samples

RNA samples were initially collected from 8 male subjects, a subset of the group previously analyzed in the proteomics approach (Azimzadeh et al., 2017). They were divided into two groups: unexposed controls (3 samples) and > 500 mGy high dose (5 samples). The mirVana PARIS Kit (Ambion, ThermoFisher, USA) was used to isolate both native protein and total RNA. Total RNA was isolated from the lysate according to the manufacturer’s protocol (Ambion, ThermoFisher). RNA integrity was assessed on the Agilent 2100 Bioanalyzer. The sequencing validation analysis yielded good-quality data for 4 samples derived from the group of workers used in the proteomics experiment: 2 controls and 2 high-dose samples. The sequencing was performed on the Illumina NextSeq 500 desktop sequencer.

3.3.2 Confounding factor filtration

The first challenge posed by the data was the strong correlation between the age and dose factors in the studied individuals. As the total dose to which the workers were exposed during their lifetime was the main point of interest for the study of ischemic heart disease, measures were introduced to filter out features whose variability was primarily due to age. The dose-age correlation in the proteomics data was measured using Spearman’s rank correlation coefficient. Furthermore, to avoid bias caused by the confounding age factor, a regression analysis with backward stepwise model building was performed. Each protein feature was modeled by gradually excluding the dose and age factors, transformed using the Box-Cox algorithm (Box and Cox, 1964), and model selection was conducted on the basis of the Akaike Information Criterion (Akaike, 1974). Separability of the proteomics samples with the selected features was investigated using hierarchical clustering with Spearman’s rank correlation as the similarity measure.
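The dose-age correlation step can be illustrated with a minimal sketch, assuming per-worker dose and age values are available as plain lists. The function names are hypothetical and the sketch covers only the Spearman coefficient (Pearson correlation of average ranks), not the Box-Cox transformation or the stepwise regression.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

A coefficient close to 1 for the dose and age vectors would indicate the kind of confounding described above.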

3.3.3 Statistical analysis of omics data sets

After the protein features explained mainly by the dose factor were extracted, the primary analysis of the proteomics and transcriptomics data sets was carried out. First, in the case of the MS data, outlier detection was performed within the dose groups using Dixon’s criterion. Then, within-group normality was tested using the Shapiro-Wilk procedure (Shapiro and Wilk, 1965) and, as the normality assumption could not be relied upon, the non-parametric Kruskal-Wallis test (Kruskal and Wallis, 1952) with Storey’s FDR multiple testing correction (Storey, 2002) was used to assess differentiation among the dose groups. As a post-hoc method, Dunnett’s test (Dunnett, 1955) was selected for determining deregulated proteins in the dose groups in relation to the control samples. Significance was assumed at the level of 0.05 in all of the above-mentioned procedures.
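As an illustration of the group comparison step, the following sketch computes the Kruskal-Wallis H statistic with the standard tie correction for a single protein feature. The function name is hypothetical; the p-value (obtained from a χ² distribution with one degree of freedom fewer than the number of groups), Storey’s FDR correction and Dunnett’s test are not covered here.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (tie-corrected) for a list of groups."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    order = sorted(range(n), key=lambda i: pooled[i])
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        # Find the run of observations tied with position i.
        while j + 1 < n and pooled[order[j + 1]] == pooled[order[i]]:
            j += 1
        t = j - i + 1
        tie_term += t ** 3 - t
        for s in range(i, j + 1):
            ranks[order[s]] = (i + j) / 2.0 + 1.0
        i = j + 1
    # Sum of squared rank sums, each divided by its group size.
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum * rank_sum / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)
    # Correction factor equals 1 when there are no ties.
    correction = 1.0 - tie_term / (n ** 3 - n)
    return h / correction
```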

Transcriptomic RNA-seq data preprocessing was performed using state-of-the-art methods. Alignment and mapping were accomplished with STAR software version 2.5.1 (Dobin et al., 2013) against the GRCh38/hg38 human reference genome. Sorting and indexing were executed using SAMtools, version 1.3.1 (Li et al., 2009). The correlation between biological replicates within the control and high-dose groups was assessed. Differential expression analysis was then performed with the R DESeq2 package (Love et al., 2014), with gene expression modeled using the negative binomial distribution.

3.3.4 Functional analysis

Enrichment analysis of deregulated genes and proteins was carried out for Gene Ontology Biological Process terms and KEGG signaling pathways. Overrepresentation was assessed using Fisher’s exact test. Moreover, gene and protein interaction and signaling networks were analyzed with the STRING search tool (http://string-db.org).
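The overrepresentation test can be sketched as the upper tail of a hypergeometric distribution, which is equivalent to a one-sided Fisher’s exact test on the 2×2 contingency table. The one-sided alternative, the function name and the argument names are assumptions made for illustration; the source does not specify them.

```python
from math import comb


def enrichment_pvalue(n_background, n_in_term, n_selected, n_overlap):
    """One-sided Fisher's exact test: P(overlap >= n_overlap).

    n_background: all features considered (hypothetical argument name)
    n_in_term:    background features annotated with the term or pathway
    n_selected:   deregulated features
    n_overlap:    deregulated features annotated with the term
    """
    upper = min(n_in_term, n_selected)
    denominator = comb(n_background, n_selected)
    # Sum hypergeometric probabilities for overlaps at least as large.
    tail = sum(
        comb(n_in_term, x) * comb(n_background - n_in_term, n_selected - x)
        for x in range(n_overlap, upper + 1)
    )
    return tail / denominator
```

For example, if all 3 selected features out of a background of 10 fall into a term that contains 3 features, the p-value is 1/120.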

3.3.5 Statistical integration

Finally, after the separate processing was completed, an integrative multiomics analysis of the data from the workers’ samples was conducted in the form of Fisher’s combined p-value transformation (Fisher, 1992) on the gene and protein features common to both data sets (Figure 3.7). The combination is achieved by summing k log-transformed individual p-values. The negated sum multiplied by 2 is the combined statistic, which under the null hypothesis follows the χ² distribution with 2k degrees of freedom (Eqs. 3.15 and 3.16).

$$X = -2 \sum_{i=1}^{k} \log(p_i) \tag{3.15}$$

$$X \sim \chi^2_{2k} \tag{3.16}$$
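The combination in Eqs. 3.15 and 3.16 can be sketched as follows. The sketch uses the fact that the χ² survival function has a closed form when the number of degrees of freedom is even, so no external library is needed; the function name is hypothetical and independence of the combined p-values is assumed.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Return (X, combined p-value) for Fisher's method (Eq. 3.15-3.16)."""
    k = len(pvalues)
    if k == 0:
        raise ValueError("at least one p-value is required")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    # Combined statistic X = -2 * sum(log(p_i)).
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    # Chi-square survival function with 2k degrees of freedom:
    # exp(-X/2) * sum_{j=0}^{k-1} (X/2)^j / j!
    half = statistic / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return statistic, math.exp(-half) * total
```

For two matching features with p-values of 0.05 each, the combined p-value is approximately 0.0175, which illustrates how two borderline results reinforce each other.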

This method is adequate for sequencing data, as read counts cannot be approximated with a Gaussian distribution. Regarding the proteomics data, only the high-dose group samples and controls were taken into account in the combined analysis to ensure compatibility with the transcriptomics data. Because gene-protein coupling is not one-to-one (multiple genes may correspond to a single protein), the gene with the minimum p-value was used in such cases. The combined p-values were afterwards corrected for multiple testing using the Benjamini-Hochberg method (Benjamini and Hochberg, 1995).
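The two post-processing steps described above, choosing the minimum p-value among genes mapped to the same protein and applying the Benjamini-Hochberg correction, can be sketched as below. The function names, the dictionary-based mapping and the identifiers in the usage note are hypothetical.

```python
def min_pvalue_per_protein(gene_pvalues, gene_to_protein):
    """For each protein keep the smallest p-value among its mapped genes."""
    best = {}
    for gene, p in gene_pvalues.items():
        protein = gene_to_protein.get(gene)
        if protein is None:
            continue  # gene without a matching protein is skipped
        if protein not in best or p < best[protein]:
            best[protein] = p
    return best


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to enforce monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

With a mapping in which hypothetical genes G1 and G2 both correspond to protein P1, the smaller of the two gene p-values would be paired with the P1 p-value before combination.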

Figure 3.7: Illustration of Fisher’s p-value combination method. The values on the axes represent p-values potentially obtained in two experiments for matching features, in this study proteins and transcripts. The color depicts the combined p-value level. The white line illustrates the 0.05 threshold for the resulting combined p-value. The features with a combined p-value below the white line are considered statistically significant.