Algorithmic Aspects of RNA Structure Similarity Analysis


Tomasz Żok. Ph.D. Thesis. Supervisor: Marta Szachniuk, Ph.D., Dr. Habil. Supporting supervisor: Mariusz Popenda, Ph.D. Poznan University of Technology, Institute of Computing Science. Poznań, 2017.

Preface

During my undergraduate and graduate studies, I met many helpful and amazing people. First and foremost, I would like to thank my supervisor, Marta Szachniuk, for her support, knowledge, patience and kindness during all these years. I am also thankful to Professor Ryszard Adamiak, who was the principal investigator of the MAESTRO grant I was lucky to be part of, for his support and help more generally. I also thank all my collaborators, especially Maciej Antczak and Mariusz Popenda, whom I admire for their knowledge and hard work. Special thanks are reserved for Paulina, who understands and supports me more than anyone else, my parents, who believe in me all the time, and my brother, who is always there for me when I need him. I am also grateful to all my friends who motivated me. None of this would have been possible without your support and understanding.

Tomasz Żok
Poznań, December 2017

Abstract

Bioinformatics is an interdisciplinary field in which molecular biology problems are solved with methods from computer science and operations research. Special attention is given to developing methods that help to understand how the structures of molecules influence their function in vivo. This is becoming more and more important for RNAs, which are found to play vital roles in living organisms and take structural forms of high complexity and diversity. In my research, I developed new models and methods to assess RNA structure similarity. Similarity estimation based on structural comparison is needed at various stages of structure processing, e.g. prediction, modelling, clustering or ranking. At the same time, it is challenging because of the complexity of RNA structures and the volume of data to process. The first part of my research was devoted to RNA secondary structure. I developed a new model named the RNA mixed graph and proposed a graph-based algorithm to find similar structure fragments. It proved efficient in finding similar fragments unavailable to other methods. These fragments were shown to positively influence other elements of the RNA structure analysis pipeline. The second part of my research was dedicated to RNA 3D structure. I developed a method to transform it from algebraic to trigonometric representation, upon which I proposed a similarity metric. It is the core of an RNA 3D structure comparison algorithm designed to assess similarity in various settings: 1-vs-1, 1-vs-n and n-vs-n. The algorithm was evaluated on a number of RNA 3D models. Since 2014, it has been one of the major similarity measures used to evaluate predictions in the RNA-Puzzles competition. The similarity measure made it possible to compare, cluster and rank the models, and also to locate fragments of highest and lowest similarity. I also proposed a new type of chart on which various comparison methods were juxtaposed. These charts reveal the importance of the trigonometric measure in the full assessment of RNA structure similarity.

Streszczenie (Abstract)

Bioinformatics is an interdisciplinary field in which problems of molecular biology are solved with methods drawn from computer science and operations research. Particularly important is the development of methods that help to understand how the structures of molecules influence their functions in vivo. This matters more and more for RNAs, which perform important functions in cells and fold into structures of high complexity. As part of this research, new models and methods for assessing RNA structure similarity were developed. Assessing structural similarity with comparative methods is necessary at various stages of RNA structure processing, e.g. during prediction, modelling, clustering or classification. At the same time, it is a difficult task because of the complexity of RNA structures and the amount of data to process. The first part of the research was devoted to RNA 2D structure. A new model, the RNA mixed graph, was created. On this basis, an algorithm for finding structure fragments was proposed. It was verified by finding fragments unavailable to other methods. The thesis also shows that these fragments effectively improved other elements of RNA structural analysis. The second part of the research concerned RNA 3D structure. A method for transforming the algebraic representation into a trigonometric one was developed and an angular similarity metric was proposed. It is the basis of an algorithm for comparing RNA 3D structures that works in the 1-vs-1, 1-vs-n and n-vs-n scenarios. The algorithm was verified on several dozen RNA 3D models. Since 2014, the proposed measure has been used to assess models in the RNA-Puzzles competition. It makes it possible to compare, cluster and rank models, and to indicate the least and most similar fragments. A new type of chart, on which different comparison methods are juxtaposed, was also proposed. These charts reveal the importance of the angular metric in the full assessment of RNA structure similarity.

Contents

1 Introduction
  1.1. RNA Similarity Analysis as Bioinformatics Challenge
  1.2. Scope of the Thesis
2 Notions and Definitions
  2.1. Significance and Roles of RNA
  2.2. RNA Structure
    2.2.1. Primary Structure
    2.2.2. Secondary Structure
    2.2.3. Tertiary Structure
  2.3. RNA Structure Similarity
    2.3.1. Similarity of 2D Structures
    2.3.2. Similarity of 3D Structures
3 New Methods for RNA 2D Structure Similarity Assessment
  3.1. Syntax for Motif Description
  3.2. Dot-Bracket Limitations in Motif Search
  3.3. New Model of Motif Search Problem
  3.4. Algorithms to Search for Similar Motifs
  3.5. Computational Experiment
    3.5.1. Finding Missing RNA Motifs
    3.5.2. Enhancing RNA Structure Prediction
4 New Methods for RNA 3D Structure Similarity Assessment
  4.1. Mean of Circular Quantities
  4.2. Median of Circular Quantities
  4.3. Comparison of Structures with Different Sizes
  4.4. Comparison of Multi-Chain Structures
  4.5. MCQ4Structures and Visualization of Structure Similarity
  4.6. Computational Experiment
5 Summary
List of Figures
List of Tables
List of Algorithms
Bibliography
Major Publications
Appendix

Acronyms

DNA: deoxyribonucleic acid
RNA: ribonucleic acid
mRNA: messenger RNA
tRNA: transfer RNA
rRNA: ribosomal RNA
PDB: Protein Data Bank
3D: tertiary structure
2D: secondary structure
miRNA: micro RNA
ncRNA: non-coding RNA
lncRNA: long non-coding RNA
IUPAC: International Union of Pure and Applied Chemistry
BPSEQ: base-pair-sequence format
CT: connect format
INF: interaction network fidelity
NMR: nuclear magnetic resonance
RMSD: root-mean-square deviation
DI: deformation index
CAD: contact area difference
PMMG: perfect matching mixed graph
IMMG: imperfect matching mixed graph
JPA: Java Persistence API
MCQ: mean of circular quantities
MedCQ: median of circular quantities

1 Introduction

1.1. RNA Similarity Analysis as Bioinformatics Challenge

Bioinformatics, simply speaking, is an interdisciplinary field that combines biology and computer science. In a narrow definition, it is a discipline that primarily deals with biomolecular data in silico. However, it also addresses biologically inspired modelling, in particular the computational modelling and analysis of biopolymer structures, in terms of operations research problems. Initially, classical bioinformatics approaches focused on sequential data. They date back to the 1960s, when the deoxyribonucleic acid (DNA) sequence encoding was established as nature's alphabet and the first computers became publicly available (Dayhoff and Ledley, 1962; Dayhoff, 1965; Needleman and Wunsch, 1970). Nowadays, this area of bioinformatics is more about large-scale genomic studies. However, with the advent of new technologies and inventions, separate subdisciplines emerged, such as

structural bioinformatics, which is about modelling and analysing protein and nucleic acid structures in silico. One of the first papers that can be assigned to structural bioinformatics analysed correlations between the φ and ψ torsion angles in proteins in the form of the now famous Ramachandran plots (Ramachandran et al., 1963). In the following years, many researchers worked to push these results forward. Protein structures were studied extensively at different levels of detail throughout the decades (Richardson, 1981). In parallel, researchers were also approaching the subject of ribonucleic acid (RNA) structures, although the story here is more complex. For many years, it was thought that RNA is needed only to create proteins: the DNA was transcribed to messenger RNA (mRNA), which was read by small adapters called transfer RNA (tRNA), and the information acquired this way was passed on to ribosomal RNA (rRNA), which built the final product, the protein. Of these three types of ribonucleic acid, mRNA was analysed mainly from the angle of the gene it is a transcript of. In the case of tRNA, researchers predicted what it could look like (Holley et al., 1965) and confirmed that prediction ten years later (Ladner et al., 1975). The rRNA, which is generally much larger and occurs in complex with various proteins, was much harder to study. The first work towards establishing its structure started in the 1980s (Yonath et al., 1987; Hope et al., 1989), but due to numerous obstacles the first atomic-resolution structures of rRNA were solved only at the beginning of the 2000s (Wimberly et al., 2000; Ban et al., 2000; Yusupov et al., 2001). Because RNA is difficult to analyse and its significance was underestimated by the scientific community, the gap between our knowledge of proteins and RNAs grew for many years. It is evident when one compares the growth of the Protein Data Bank (PDB) database (Berman et al., 1999; Rose et al., 2017), which is a common location where all experimentally

solved 3D structures of biopolymers are deposited (Figure 1.1). The absolute number of available structures for proteins is two orders of magnitude larger than for RNAs (120k vs 1.2k), and this gap is still widening because the yearly growth shows the same tendency (8k vs 80). The same observation can be made when looking at the number of computational tools and libraries that process either proteins or RNAs. In a database of bioinformatics tools (Ison et al., 2016), there are 480 applications tagged with a protein topic and only 115 dedicated to RNA (as of 2017-06-25).

Figure 1.1: Growth of PDB content by molecule type (2017-06-25): protein structures (left) and RNA structures (right). [Bar charts of yearly and total numbers of structures, 1976 to 2016.]

The classical view of the role of RNA in cells has changed thanks to a very important discovery: RNAs take highly complex, functional 3D shapes. The work of Thomas Cech showed that RNAs can play active, catalytic and regulatory roles in the cell (Cech et al., 1981; Kruger et al., 1982; Cech, 1990). Nowadays, we know many types of RNAs whose sophisticated structures are extremely important for their cellular function. The list includes RNAs responsible for gene expression regulation, viral RNAs and other types with huge potential for application. RNA-based drugs are already available on the market which cure diseases such as spinal muscular

atrophy (Ottesen, 2017). However, these breakthroughs would never have been possible without advances in computational methods for RNA 3D structure prediction. This field of structural bioinformatics has developed significantly to address the growing need for computational models of structures that have not yet been solved experimentally. Notably, a similarity measuring method lies at the core of every structure prediction. Some prediction algorithms have to cluster a large number of models to propose a representative subset at the output. Thus, a procedure based on structure comparison and similarity assessment is needed to decide how to perform the clustering. Other predictors compare structure fragments to assemble the final 3D model, so they need to rank candidate fragments according to criteria that usually refer to similarity. And naturally, final models are assessed with respect to each other and, if possible, against a native 3D structure. Structure evaluation procedures are usually based on various similarity or distance measures. Therefore, it is crucial to understand how the choice of a similarity measure affects the whole process, from computational prediction to selection of the best models. This choice is not easy, because different measures have their own advantages and shortcomings, so a measure may be useful for one kind of objective but meaningless for another. Assessing the similarity of two 3D structures remains a challenge. RNAs can be compared manually by experts who use molecule visualization programs and try to identify similarities and differences just by viewing the structures. In practice, however, we need computational processing to assess structural similarity. Bioinformatics methods make it possible to consider a variety of factors and structure features, the number of which constantly increases as our knowledge and understanding of the RNA molecule grows. These factors include a range of aspects, both global (e.g. shape or volume) and local (e.g. whether particular atoms are positioned accordingly). Only computational processing is capable of analysing them holistically. Additionally,

structural information is slowly but steadily becoming abundant. More and more data are deposited on public servers, and can be pre-processed and used by knowledge-based computational methods. Finally, the problem of assessing similarity is usually hard in the classical sense of computational complexity. Certain problems in structural bioinformatics, including some related to similarity assessment, are known to be NP-hard (Akutsu, 2000). Others, despite their polynomial complexity, are restricted in instance size because their processing time scales as O(n^6) (Rivas and Eddy, 1999). This clearly indicates that the problem of assessing structural similarity requires computational processing and careful design of data structures and algorithms.

1.2. Scope of the Thesis

In my doctoral research I aimed at the design, implementation and validation of methods dedicated to the analysis of RNA secondary (2D) and tertiary (3D) structure. This dissertation focuses on measuring RNA structure similarity at different levels of detail. It is organized as follows. Chapter 2 contains introductory material on the subject of the research. It explains the biological roles of RNAs and outlines aspects of structural analysis: interaction types, pseudoknots, modules and torsion angles. It also describes state-of-the-art similarity measures for both the 2D and the 3D structure of RNA. Chapter 3 describes a new method dedicated to the problem of finding similar fragments at the secondary structure level. The proposed solution is evaluated on artificial and real data. Chapter 4 is about similarity analysis at the tertiary level of RNA structure. It contains the formulation of new metrics based on the trigonometric representation of RNA structures. Several modes of operation are described. Then the new proposition is evaluated on real data. Chapter 5 summarizes the results and points out directions for

further development. The end of the monograph contains lists of figures, tables and algorithms as well as the bibliography. It is followed by the list of major publications I have co-authored and an appendix in which those strictly connected to the results described in this thesis are included in extenso.

2 Notions and Definitions

2.1. Significance and Roles of RNA

RNA is a biopolymer found in every living cell. For a long time, researchers focused only on the multiple roles of RNA in the protein synthesis process. First, the mRNA is a copy of genetic code in reverse direction. Second, the tRNA is a mediator which matches an amino acid, the building block of proteins, with a selected triplet in mRNA called a codon. Finally, the rRNA is part of a huge complex which folds the protein. Recently, the focus has shifted toward non-coding RNA (ncRNA), a class of ribonucleic acids that are not translated into proteins (Cech and Steitz, 2014; Corley et al., 2015; Grosjean and Westhof, 2016; Sanbonmatsu, 2016). The already mentioned tRNA and rRNA are part of this large and diverse class of molecules. The class also contains many other types of RNA, with lengths ranging from a dozen nucleotides in micro RNA (miRNA) to several hundred in those collectively called long non-coding RNA (lncRNA). They serve a multitude of roles in the regulation of molecular processes. More and more RNA sequences are becoming available, and this reveals that we do

not yet know either the number of ncRNAs in the human genome or the biological function of some of them (Huttenhofer et al., 2005). To shed some light on these questions, it is now more important than ever to support life scientists with algorithmic and computational approaches dedicated to the study of RNA.

2.2. RNA Structure

The shape and function of any molecule are closely related. If a special arrangement of nucleotides is present, then the RNA can bind to a small compound called a ligand. This action involves structural changes in the RNA which are recognized by other biomolecules in the cell. Other processes also rely on the dynamic shape which an RNA may take. Therefore, it is crucial to understand the structure-function relationship better and to look for novel and improved ways to analyse the structure.

2.2.1. Primary Structure

RNA molecules are biopolymers made of nucleotides, in which we can distinguish the backbone part, the ribose and the nucleobase. In the most basic way, an RNA is described by the sequence of nucleobases it is made of (Figure 2.1). The sequence constitutes the primary structure of the corresponding RNA molecule. There are four main letters in the alphabet used to encode RNA nucleobases: A, C, G and U, which represent adenine, cytosine, guanine and uracil, respectively. Sometimes it is necessary to describe a set of RNAs by a single primary structure common to all of them. For example, this is useful when analysing sequence variability in RNAs coming from different species. For this purpose, the primary structure can be encoded using letters of the alphabet known as International Union of Pure and Applied Chemistry (IUPAC)

codes (IUPAC-IUB Commission on Biochemical Nomenclature (CBN), 1971), such as R for a purine (A or G) and Y for a pyrimidine (C or U). Additionally, RNA nucleotides may be subject to post-transcriptional modification. They are then represented by yet other symbols in the RNA sequence (Machnicka et al., 2013), e.g. adenine containing an additional methyl group is called m1A and is represented by the quote character ” in the primary structure.

Figure 2.1: The structure of RNA showing hydrogen bonds each nucleobase may form in a canonical base pair. [Chemical diagram of a four-nucleotide chain A, G, U, C drawn in the 5' to 3' direction.]

2.2.2. Secondary Structure

When sequence information is supplemented with the network of interactions between nucleotides (i.e. base pairs), it becomes what is known as the secondary structure. The secondary structure is essential for understanding the role RNA plays in a cell, because it reveals the arrangement of nucleotides with respect to each other.

Figure 2.2: Interactions between nucleotides: (a) base-phosphate, (b) base-ribose, (c) base-base and (d) stacking.

RNA is a flexible and dynamic molecule. Under certain conditions, its strands fold or unfold and its fragments bend. These movements allow stacking interactions and stabilizing hydrogen bonds to form between nucleotides, even ones very distant in the chain. Atoms involved in such bonds may be part of the phosphate group, the ribose or the base (Figure 2.2), although the first two types are uncommon and play a rather auxiliary role. In the description of RNA shape, the most important role belongs to base-base interactions. For every base, three edges have been defined along which hydrogen bonds may form: the Watson-Crick edge, the Hoogsteen edge and the sugar edge (Figure 2.3). In RNAs, every base may interact with another one on any edge. Furthermore, we distinguish two modes of pairing, called cis and trans, which depend on the position of ribose atoms with respect to base atoms in

both nucleotides (Figure 2.4). In total, this gives 12 families of interactions that may be formed by two RNA nucleotides: cWW, cWH, cWS, cHH, cHS, cSS, tWW, tWH, tWS, tHH, tHS and tSS, where W denotes the Watson-Crick edge, H the Hoogsteen edge, S the sugar edge, t stands for trans and c for cis. For example, cWH corresponds to an interaction along the Watson-Crick edge of one base and the Hoogsteen edge of the second base in cis orientation. In secondary structure visualizations the families are depicted with different symbols, which are shown in Figure 2.5. Base-base interactions in RNA can be classified as canonical or non-canonical. Canonical base pairs are found within the cWW family (cis Watson-Crick / Watson-Crick). These are the pairs A-U (formed by 2 hydrogen bonds), G-C (formed by 3 hydrogen bonds) and G-U (formed by 2 hydrogen bonds). All other types of base pairs in cWW and in the remaining 11 families of interactions are collectively called non-canonical. Canonical interactions are energetically favourable and stable, and therefore they have the largest impact on the overall shape of an RNA. In fact, the term secondary structure is often used to describe a set of canonical interactions only. When looking at a canonical secondary structure (i.e. one containing only canonical base pairs), one may notice recurring motifs: stem, hairpin loop, bulge, internal loop and n-way junction (Figure 2.6). A stem is made of consecutive base pairs. A hairpin loop is a motif in which unpaired nucleotides are enclosed by a single base pair. A sequential mismatch on one strand of a stem is a bulge, and on both strands it is an internal loop. A branching point for several stems is called an n-way junction or a multi-branch loop.

2.2.2.1. Secondary Structure Representation

Secondary structure is often represented in visual form. The image consists of nodes that represent nucleotides and connecting lines of two types. The first type goes along the sequence from the beginning of the RNA chain, denoted

Figure 2.3: Three edges on which base-base interaction may take place.

Figure 2.4: Example A-U pairing on Watson-Crick edges in cis and trans mode.

Figure 2.5: 12 families of base-base interactions and their icons (Leontis and Westhof, 2001). [Icons are shown for the cis variants cWW, cWH, cHW, cWS, cSW, cHH, cHS, cSH and cSS, and for the trans variants tWW, tWH, tHW, tWS, tSW, tHH, tHS, tSH and tSS.]

as the 5' end, to its 3' end. The other type connects nucleotides interacting with each other. It may incorporate icons to distinguish non-canonical pairs (Figure 2.5).

Figure 2.6: Example secondary structure motifs found in RNAs.

Nodes and lines are placed on the image according to a certain layout. The classical one draws stems along a line and spreads other motifs out on circles (Figure 2.7a). In the linear layout, also known as the arc diagram, the whole sequence is drawn along a line and base pairs are visualized as arcs on top of it (Figure 2.7b). Its reverse version, called the circular layout,

is based on drawing the sequence on a circle and the pairs as chords across the circle (Figure 2.7c).

Figure 2.7: Secondary structure visualization layouts: (a) classical, (b) linear and (c) circular.

Figure 2.8: Secondary structure: (a) nested and (b) pseudoknotted.

Different drawing methods are useful if one wants to distinguish between the two types of RNA secondary structure: nested and pseudoknotted. In the first type, for every two base pairs (i, j) and (i', j'), one of the following applies: i < i' < j' < j or i < j < i' < j' (Figure 2.8a). In contrast, a pseudoknotted structure contains a pseudoknot, i.e. at least two base pairs (i, j) and (i', j') that satisfy the condition i < i' < j < j' (Figure 2.8b). The arc diagram provides an unambiguous visual distinction between the two types: in a pseudoknotted structure at least two arcs cross, while in a structure without pseudoknots the arcs are either nested or disjoint.
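The crossing condition i < i' < j < j' can be checked directly on a list of base pairs. The following is a minimal sketch, not the method used in the thesis; the function names `crosses` and `is_pseudoknotted` and the example pair lists are hypothetical, and each nucleotide is assumed to take part in at most one pair.

```python
from itertools import combinations


def crosses(pair_a, pair_b):
    """Return True if two base pairs satisfy i < i' < j < j' (their arcs cross)."""
    # Normalise each pair so that i < j, then order the pairs by opening index.
    first, second = sorted((tuple(sorted(pair_a)), tuple(sorted(pair_b))))
    i, j = first
    i_prime, j_prime = second
    return i < i_prime < j < j_prime


def is_pseudoknotted(pairs):
    """Return True if at least two base pairs in the structure cross."""
    return any(crosses(a, b) for a, b in combinations(pairs, 2))


# Hypothetical examples: nested/disjoint arcs versus two crossing arcs.
nested_pairs = [(1, 10), (2, 9), (12, 20)]
knotted_pairs = [(1, 10), (5, 15)]
print(is_pseudoknotted(nested_pairs), is_pseudoknotted(knotted_pairs))
```

The sketch compares every two pairs, so it needs quadratic time in the number of base pairs; this keeps it close to the definition rather than efficient.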

Graphical representation of RNA secondary structure is very useful for a human, but computational processing of structural data needs a textual format. The basic one, called base-pair-sequence format (BPSEQ), is a direct translation of the secondary structure to text. Each row in BPSEQ describes one nucleotide in three columns: its index, the character (letter) representing the nucleotide in the sequence, and the index of the pairing nucleotide (0 if unpaired) (Figure 2.9a). This format is sufficient to describe a simple RNA secondary structure, but when multiple strands are involved, a more complex format, called connect format (CT), is required. CT data begin with the number of nucleotides in the whole structure. After that, each line describes one nucleotide in the following columns: global index, letter from the sequence, index of the predecessor in the chain, index of the successor in the chain, global index of the paired nucleotide (0 if unpaired) and original index (Figure 2.9b). The number of indices may seem redundant, but each plays an important role. Global indexing of nucleotides starts at 1 and is incremented in each following line, which allows base pairs to be indicated unambiguously. Strand indices, in turn, make it possible to distinguish the beginning (predecessor equal to 0) or the end (successor equal to 0) of a strand. The original index makes it possible to map the secondary structure to some other data source, e.g. an extracted fragment may retain the numbering of the structure it was taken from. A different approach is taken in dot-bracket notation (Hofacker et al., 1994), where the secondary structure is stored as a string of the same length as the sequence of the RNA molecule. In this notation, each unpaired nucleotide is represented by a dot, and each base pair is stored as an opening and a closing bracket (Figure 2.9c). This format is not only compact, but it is also the only one that allows explicit indication of a pseudoknotted structure by assigning different symbols to base pairs, e.g. ( ) and [ ].
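The relation between dot-bracket and BPSEQ can be illustrated by converting one into the other. This is a minimal sketch under stated assumptions: the function name `dot_bracket_to_bpseq` is hypothetical, only the two bracket types mentioned above, ( ) and [ ], are handled, and a single strand is assumed. The input is the 17-nucleotide example from Figure 2.9.

```python
def dot_bracket_to_bpseq(sequence, structure):
    """Convert a sequence and its dot-bracket string to BPSEQ rows.

    Each row is (index, letter, partner index), with indices starting at 1
    and partner index 0 for an unpaired nucleotide.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    closing_to_opening = {")": "(", "]": "["}
    # One stack per bracket type, so that ( ) and [ ] may cross each other.
    stacks = {"(": [], "[": []}
    partner = [0] * len(structure)
    for index, symbol in enumerate(structure, start=1):
        if symbol in stacks:
            stacks[symbol].append(index)
        elif symbol in closing_to_opening:
            stack = stacks[closing_to_opening[symbol]]
            if not stack:
                raise ValueError(f"unmatched '{symbol}' at position {index}")
            opening = stack.pop()
            partner[opening - 1] = index
            partner[index - 1] = opening
        elif symbol != ".":
            raise ValueError(f"unsupported symbol '{symbol}' at position {index}")
    if any(stacks.values()):
        raise ValueError("unmatched opening bracket")
    return [(i, base, partner[i - 1]) for i, base in enumerate(sequence, start=1)]


rows = dot_bracket_to_bpseq("GGGACCUUCCCGGUCUC", ".(((((.....))))).")
for index, base, paired_with in rows:
    print(index, base, paired_with)
```

Keeping a separate stack for each bracket type is what lets the two symbol sets describe crossing base pairs; with a single stack, a pseudoknot written with ( ) and [ ] would be paired incorrectly.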

Figure 2.9: Example RNA secondary structure represented in text formats: (a) BPSEQ, (b) CT and (c) dot-bracket.

(a)
1 G 0
2 G 16
3 G 15
4 A 14
5 C 13
6 C 12
7 U 0
8 U 0
9 C 0
10 C 0
11 C 0
12 G 6
13 G 5
14 U 4
15 C 3
16 U 2
17 C 0

(b)
17
1 G 0 2 0 1
2 G 1 3 16 2
3 G 2 4 15 3
4 A 3 5 14 4
5 C 4 6 13 5
6 C 5 7 12 6
7 U 6 8 0 7
8 U 7 9 0 8
9 C 8 10 0 9
10 C 9 11 0 10
11 C 10 12 0 11
12 G 11 13 6 12
13 G 12 14 5 13
14 U 13 15 4 14
15 C 14 16 3 15
16 U 15 17 2 16
17 C 16 0 0 17

(c)
GGGACCUUCCCGGUCUC
.(((((.....))))).

2.2.3. Tertiary Structure

The tertiary structure is a full-atomic representation of an RNA molecule. It covers all aspects of the previous two layers (sequence and interaction network) and additionally allows one to look at the structure in full detail. As mentioned earlier, nucleotides contact each other in a variety of non-canonical, yet thermodynamically and evolutionarily stable ways. If a group of nucleotides connected non-canonically can be found in different RNA structures, then it is called a 3D module or motif (Westhof et al., 1996). For example, the kink-turn (Klein et al., 2001) is a 3D motif in which two strands are connected by a sophisticated set of non-canonical interactions which cause the structure to bend. The pattern has been repeatedly found in structures coming from different species, which proves it to be an evolutionarily stable

building block. In Figure 2.10 the motif is depicted both in 2D and 3D view (Petrov et al., 2013).

Figure 2.10: A kink-turn motif shown as a 2D diagram (left) and 3D visualization (right).

2.2.3.1. Tertiary Structure Representation

The 3D structure of an RNA molecule can be described in several ways. The basic manner is to enumerate the 3D Cartesian coordinates of every atom that constitutes the three-dimensional structure. A set of atoms with their coordinates is called an algebraic representation of the 3D structure. A commonly used format to store the algebraic representation is called PDB, since it has been defined for data stored in the Protein Data Bank (PDB). A fragment of the kink-turn 3D motif in PDB format is shown in Figure 2.11a. There are also other ways of presenting and storing structural information which are useful in specific contexts. One of them, the trigonometric representation, describes RNAs with values of their torsion angles (Figure 2.12). An example for the same kink-turn 3D motif in this representation is shown in Figure 2.11b. To analyse the RNA structure in trigonometric representation, one needs to compute torsion angles from atom coordinates. A torsion angle is a dihedral, i.e. an angle between two planes or, equivalently, between normals to those planes. In biomolecular structures it is used to quantify the rotation around the bond between atoms B-C in the A-B-C-D tuple, as shown in Figure 2.13. A dihedral takes values in the range [0, π]; however, with torsion angles one must take into account the order of atoms to obtain a value within the range

(a)
ATOM   1  P    U A  17   6.762  14.336  31.385  1.00  0.00  A  P
ATOM   2  C5'  U A  17   4.360  14.634  30.424  1.00  0.00  A  C
ATOM   3  O5'  U A  17   5.388  13.744  30.814  1.00  0.00  A  O
ATOM   4  C4'  U A  17   3.744  14.240  29.093  1.00  0.00  A  C
ATOM   5  O4'  U A  17   4.732  13.676  28.192  1.00  0.00  A  O
ATOM   6  C3'  U A  17   2.692  13.144  29.141  1.00  0.00  A  C
ATOM   7  O3'  U A  17   1.481  13.613  29.710  1.00  0.00  A  O
ATOM   8  C2'  U A  17   2.587  12.857  27.649  1.00  0.00  A  C
ATOM   9  O2'  U A  17   1.982  13.899  26.912  1.00  0.00  A  O
ATOM  10  C1'  U A  17   4.077  12.758  27.327  1.00  0.00  A  C
ATOM  11  N1   U A  17   4.580  11.373  27.576  1.00  0.00  A  N
ATOM  12  C2   U A  17   4.026  10.329  26.865  1.00  0.00  A  C
ATOM  13  O2   U A  17   3.150  10.482  26.030  1.00  0.00  A  O
ATOM  14  N3   U A  17   4.537   9.090  27.160  1.00  0.00  A  N
ATOM  15  C4   U A  17   5.524   8.791  28.081  1.00  0.00  A  C
ATOM  16  O4   U A  17   5.880   7.625  28.239  1.00  0.00  A  O
ATOM  17  C5   U A  17   6.049   9.931  28.787  1.00  0.00  A  C
ATOM  18  C6   U A  17   5.566  11.152  28.516  1.00  0.00  A  C
ATOM  19  OP1  U A  17   6.422  15.449  32.296  1.00  0.00  A  O
ATOM  20  OP2  U A  17   7.623  13.230  31.864  1.00  0.00  A  O1-

(b) [Table of torsion angle values, in degrees, for the same kink-turn motif; columns: α, β, γ, δ, ε, ζ, χ, P.]

Figure 2.11: Tertiary structure formats: (a) PDB and (b) torsion angle values.

Angle   Atoms
α       O3'-P-O5'-C5'
β       P-O5'-C5'-C4'
γ       O5'-C5'-C4'-C3'
δ       C5'-C4'-C3'-O3'
ε       C4'-C3'-O3'-P
ζ       C3'-O3'-P-O5'
χ       O4'-C1'-N1-C2
ν0      C4'-O4'-C1'-C2'
ν1      O4'-C1'-C2'-C3'
ν2      C1'-C2'-C3'-C4'
ν3      C2'-C3'-C4'-O4'
ν4      C3'-C4'-O4'-C1'

Figure 2.12: Torsion angles defined for RNA structure.
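Each angle listed in Figure 2.12 is defined by four atoms, so converting from the algebraic to the trigonometric representation reduces to one computation per atom quadruple. The sketch below shows how such an angle could be computed from Cartesian coordinates with the two-argument arctangent form of Blondel and Karplus (1996), given in this section as Formula 2.1. It is a minimal illustration, not the implementation used in this thesis, and all function names are hypothetical.

```python
import math


def _sub(p, q):
    return (p[0] - q[0], p[1] - q[1], p[2] - q[2])


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def _dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def torsion_angle(a, b, c, d):
    """Torsion angle (radians) around the B-C bond of the A-B-C-D tuple."""
    v_ab, v_bc, v_cd = _sub(b, a), _sub(c, b), _sub(d, c)
    n_abc = _cross(v_ab, v_bc)   # normal to the A-B-C plane
    n_bcd = _cross(v_bc, v_cd)   # normal to the B-C-D plane
    y = math.sqrt(_dot(v_bc, v_bc)) * _dot(v_ab, n_bcd)
    x = _dot(n_abc, n_bcd)
    # atan2 selects the quadrant from the signs of y and x, so the result is
    # signed, unlike the plain angle between the two normals.
    return math.atan2(y, x)


if __name__ == "__main__":
    a, b, c = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)
    for d in [(1.0, 1.0, 0.0), (0.0, 1.0, -1.0), (0.0, 1.0, 1.0)]:
        print(d, round(math.degrees(torsion_angle(a, b, c, d)), 1))
```

For an RNA backbone angle such as γ, the four arguments would be the coordinates of O5', C5', C4' and C3' of one nucleotide.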

[−π, π). This is possible by observing that if the projection of a normal to the A-B-C plane has the opposite direction to the C-D vector, then the torsion angle value is negative (Burkowski, 2014). Taking this into account, the value of the torsion angle is computed according to the following formula (Blondel and Karplus, 1996):

$$\varphi = \arctan\left( |\vec{v}_{BC}| \, \vec{v}_{AB} \cdot [\vec{v}_{BC} \times \vec{v}_{CD}],\ [\vec{v}_{AB} \times \vec{v}_{BC}] \cdot [\vec{v}_{BC} \times \vec{v}_{CD}] \right) \tag{2.1}$$

The two-argument version of the arctangent function is derived from programming languages and proves to be useful in this context. The result of $\arctan(y, x)$ can be interpreted as $\arctan(\frac{y}{x})$ with special considerations. The two-argument version avoids division by zero when $x = 0$ and correctly adjusts the result to a valid quadrant, i.e. $\arctan(\frac{y}{x}) = \arctan(\frac{-y}{-x})$ but $\arctan(y, x) \neq \arctan(-y, -x)$.

Figure 2.13: A dihedral (left) and torsion angle (right).

In structural biology, the values of torsion angles are grouped into meaningful ranges (Figure 2.14). Initially the -gauche, +gauche and trans notation was used. Later it was replaced with a six term notation of syn periplanar,

anti periplanar, +syn clinal, -syn clinal, +anti clinal and -anti clinal or shortly sp, ap, +sc, -sc, +ac and -ac (Saenger, 1984). This general grouping is superseded for specific angle types. The χ angle, which describes the nucleobase position relative to the ribose ring, uses only a two-term classification of either syn or anti, but the ranges are slightly altered. The so-called high-anti (≈ −90°) is still syn and analogously high-syn (≈ 90°) is anti.

Figure 2.14: Ranges of torsion angle values with their corresponding symbolic names (Saenger, 1984).

Trigonometric representation has its advantages over the algebraic one. It is more concise and does not change when translating or rotating the whole structure. Additionally, the values of torsion angles are correlated with conformational features of the molecule, which makes it possible to compare them

directly or incorporate them in a machine learning process. They also play a key role in all modelling approaches with classical potential energies. Nevertheless, prior to using 3D structure representation in torsion angle space, one more thing needs to be checked. In the interest of simplicity, let us define that the 3D structure $S$ consists of a sequence of $n$ atoms and is depicted as:

$$S = \{A_i = (x, y, z) \mid i = 1, 2, \ldots, n,\ x \in \mathbb{R},\ y \in \mathbb{R},\ z \in \mathbb{R}\}$$

For an arbitrary index $k < n$ we can divide the atoms into subsequences $S_{<k} = \{A_i \mid i < k\}$ and $S_{\geq k} = \{A_i \mid i \geq k\}$. If all atoms in $S_{\geq k}$ are shifted along the vector $\vec{v} = A_k - A_{k-1}$, then all torsion angles retain exactly the same values. This is because in both $S_{<k}$ and $S_{\geq k}$ torsion angles remain the same, and rotation along the $A_{k-1} \leftrightarrow A_k$ bond is also unchanged.

At first glance, this looks like a disadvantage of the trigonometric representation, which seemingly cannot take into account such a straightforward change in the 3D structure. On the other hand, the change itself is peculiar and highly unlikely to be encountered in real scenarios. This is because it requires shifting a whole subset of atoms, and only along a very precise vector. If the vector is even slightly different, or if the shift concerns some atoms but not consecutive ones in the sequence $S$, then torsion angles will differ and it will be easy to spot. Also, if bond lengths are checked for correctness prior to using the trigonometric representation, then the situation above will not take place. In a way, using any kind of 3D structure representation requires checking whether its basic assumptions are met to start with.

2.3 RNA Structure Similarity

2.3.1. Similarity of 2D Structures

The first approaches to compare RNA secondary structures date to the late 1980s. At that time, the RNA tree data structure was proposed and used for comparison of two RNAs (Shapiro, 1988). Nodes of the tree represented

various motifs of RNA secondary structure. They were labelled according to Shapiro notation: H (hairpin loop), I (internal loop), B (bulge), M (multibranch loop), R (stem) or N (a special node to have the tree connected). The algorithm searched for a sequence of edit operations that transformed one tree into the other with a minimum cost (i.e. in a minimum number of operations). The following edit operations were allowed: changing the label, deleting or inserting a node. The minimum number of such operations was called the tree edit distance between two RNA secondary structures and served as a measure of their similarity.

Later, an alternative similarity measure was defined and named the RNA tree alignment score (Jiang et al., 1995). In this approach, the RNA secondary structure is also represented as a tree. To compare two structures, the algorithm inserts new, empty nodes into the corresponding trees until two trees with equal topology are obtained. Then, labels of the corresponding nodes in both trees are compared. The number of differing labels is called the alignment distance. The sequence of tree edit operations which leads to the minimum distance is called the optimal alignment.

Yet another approach, which compares a secondary structure model to the reference structure, was proposed by Parisien et al. (2009). It is based on canonical base pairs, but may be used to compare stacking or non-canonical interactions as well. Having the base pair lists of the model and of the reference structure, one can find which of the model base pairs are correctly or incorrectly predicted. Thus, subsets of true positives, false positives and false negatives are identified. Next, positive predictive value (precision), true positive rate (sensitivity) and Matthews correlation coefficient, also known as interaction network fidelity (INF), are calculated (Parisien et al., 2009). The INF score between 0 and 1 describes how compliant the sets of interactions are for the two input structures (0 means

lack of even a single compliant interaction, 1 means that all interactions are the same). This similarity measure imposes a penalty on either surplus or absence of base pairs in one structure compared to the other one.

2.3.2. Similarity of 3D Structures

In the literature, the term 3D structure similarity is often used to name different concepts. It may mean a quantitative score showing how closely two structures resemble each other. The score itself can be expressed as a single value, a matrix or in a more complex form. In other applications, the term refers to an alignment of a nucleotide residue subset with respect to some predefined criteria. Sometimes the two approaches are combined, i.e. the score depends on the computation of alignment or vice versa.

3D structure similarity measures can be divided into general and domain-specific ones. The former make no assumption about the type of molecules being compared, while the latter are based on features specific to either proteins or RNAs. A different way to look at similarity measures is by their granularity. A global comparison method takes into account the whole structures. In contrast, a local approach emphasizes the influence of smaller fragments. Both approaches are necessary. A global score provides an overview and is especially useful where many structures are considered. For example, computational tools sampling the 3D conformation space often produce tens of thousands of structures which are later clustered automatically based on the global similarity score (Cheng et al., 2015). A local measure makes it possible to find regions demanding special interest. For example, nuclear magnetic resonance (NMR) structures in an ensemble may vary more in certain regions than in others, which gives an important biological insight.

The most common 3D structure dissimilarity measure is root-mean-square deviation (RMSD). Having two 3D structures, $S_1$ and $S_2$, each of them composed of $n$ atoms, we compute the RMSD between them according to the following formula:

$$\mathrm{RMSD} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \delta_i^2} \tag{2.2}$$

where $\delta_i$ is the Euclidean distance between the $i$-th pair of corresponding atoms (one atom in $S_1$, the second atom in $S_2$), whose position in space is defined by three coordinates. The computed RMSD value is a measure of mean distance between all corresponding atoms. The result of Formula 2.2 depends on the mutual positioning of the compared structures. This means that rigid transformations (i.e. translations and rotations), although leaving the RNA structures intact, influence the result of calculations according to Formula 2.2. In 1976 Kabsch proposed an algorithm to derive the optimum transformation leading to the least RMSD value. Given atom coordinates of structures $S_1$ and $S_2$, named fixed and mobile respectively, the algorithm finds a translation vector and a rotation matrix. Applying them to the mobile structure, the algorithm positions it with respect to the fixed one in such a way that the result of Formula 2.2 has a minimum value. Since in the bioinformatics context one usually needs to assess RNA structure dissimilarity irrespective of the original positioning, the term RMSD is now used to denote both finding the optimum transformation and calculating the mean distance afterwards. According to its basic assumptions, RMSD is a global and general dissimilarity measure, although the set of atoms used in calculations may be parametrized, which renders RMSD useful in local comparison. Several approaches incorporate RMSD to achieve high-level goals. For example, RNAssess (Lukasiak et al., 2015) uses RMSD internally multiple times for atoms belonging to a sliding window in the form of a sphere.
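The dependence of Formula 2.2 on the mutual positioning of the structures can be seen in a few lines of code. The sketch below implements only Formula 2.2 and a centroid alignment, i.e. the translational part of a superposition. The optimum rotation found by the Kabsch algorithm is deliberately not implemented here. Function names are hypothetical.

```python
import math


def rmsd(s1, s2):
    """Formula 2.2: root-mean-square deviation of corresponding atoms."""
    if len(s1) != len(s2) or not s1:
        raise ValueError("structures must be non-empty and of equal size")
    total = sum((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
                for (x1, y1, z1), (x2, y2, z2) in zip(s1, s2))
    return math.sqrt(total / len(s1))


def centered(structure):
    """Translate a structure so that its centroid lies at the origin."""
    n = len(structure)
    cx = sum(p[0] for p in structure) / n
    cy = sum(p[1] for p in structure) / n
    cz = sum(p[2] for p in structure) / n
    return [(x - cx, y - cy, z - cz) for x, y, z in structure]


if __name__ == "__main__":
    fixed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    # The same three atoms translated by the vector (3, 4, 0).
    mobile = [(x + 3.0, y + 4.0, z) for x, y, z in fixed]
    print(rmsd(fixed, mobile))                      # affected by the translation
    print(rmsd(centered(fixed), centered(mobile)))  # translation removed
```

Two identical structures that differ only by a translation yield a non-zero value before centering, which is the situation that motivates finding the optimum transformation first.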

RMSD is a classical and well-known measure. Nevertheless, it has its disadvantages, such as context-specific interpretation. Depending on the size of the structures being compared, the value of RMSD has a different meaning. To deal with this problem, Hajdin et al. performed computational experiments by predicting thousands of decoys for non-trivial RNA targets of varying sizes. Then, they analysed the distribution of RMSD values and proposed a p-value score for significance checking (Hajdin et al., 2010). This way, knowing the length of an RNA 3D model and its RMSD to the native structure, one can tell if the prediction was significant (i.e. better than random) or not.

RMSD alone cannot help to detect local structural deviations, like badly predicted planarity of base pairs. When the global fold is correct, this measure alone would be misleading by ranking the 3D model high. Thus, a new measure called deformation index (DI) has been proposed, which scales RMSD by the inverse of INF (Parisien et al., 2009). This concept is very practical in the context of in silico 3D model comparison. Based on a combination of RMSD and INF, one can properly assess the prediction of RNA 3D structure. Conversely, a 3D model may contain helices made out of correct bases paired with each other (i.e. INF close to 1) but still be invalid (very high RMSD). This scenario is possible if the unpaired bases were incorrectly predicted, causing invalid kinks and other structural artefacts. Here, the DI measure also combines the two basic measures and distinguishes badly predicted models.

Another score based on evaluation of interactions within the RNA structure is the contact area difference (CAD) score (Olechnovič and Venclovas, 2014). In INF, the information about interactions is binary: they are either present or absent. The INF score does not take into account the strengths of interactions. To address this, the CAD similarity measure has been proposed. It compares structures by looking at their corresponding contact areas computed upon

distances from atoms' spheres with radii equal to the van der Waals radii. This quantification of interaction strengths provides several advantages over INF, RMSD and DI. For example, stereochemical errors in some cases may paradoxically improve RMSD, but will always worsen the similarity assessment as measured by the CAD score.
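The INF and DI measures discussed in this section can be sketched as follows. The code assumes the formulation commonly attributed to Parisien et al. (2009), in which the Matthews correlation coefficient is approximated by the geometric mean of precision and sensitivity, and DI is RMSD divided by INF. The function names and the handling of the edge cases (no true positives, INF equal to zero) are choices made for this illustration only.

```python
import math


def _normalise(pairs):
    # A base pair (i, j) is the same interaction as (j, i).
    return {tuple(sorted(pair)) for pair in pairs}


def inf_score(reference_pairs, model_pairs):
    """Interaction network fidelity of a model against a reference."""
    reference = _normalise(reference_pairs)
    model = _normalise(model_pairs)
    tp = len(reference & model)   # interactions present in both
    fp = len(model - reference)   # surplus interactions in the model
    fn = len(reference - model)   # interactions missing from the model
    if tp == 0:
        return 0.0
    ppv = tp / (tp + fp)          # positive predictive value (precision)
    tpr = tp / (tp + fn)          # true positive rate (sensitivity)
    return math.sqrt(ppv * tpr)


def deformation_index(rmsd_value, inf_value):
    """RMSD scaled by the inverse of INF; infinite when INF is zero."""
    if inf_value == 0:
        return math.inf
    return rmsd_value / inf_value


if __name__ == "__main__":
    reference = [(2, 16), (3, 15), (4, 14), (5, 13)]
    model = [(2, 16), (3, 15), (4, 14), (6, 12)]
    score = inf_score(reference, model)
    print(score, deformation_index(3.0, score))
```

With three of four reference pairs reproduced and one surplus pair, both precision and sensitivity equal 0.75, so the model is penalised for the missing pair and for the extra one alike.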

3. New Methods for RNA 2D Structure Similarity Assessment

In this chapter, a new method dedicated to similarity of RNA secondary structures is presented. Initially, it was proposed in response to the demand for new functionality of the RNA FRABASE search engine (Popenda et al., 2008). However, the method has a much broader scope. Its primary purpose is to search for similar motifs within RNA structures on the secondary level.

The occurrence of secondary structure motifs is typical for RNA molecules. Motifs are repeatable fragments often found in the structure. They share some common characteristics while retaining individual properties. Let us consider a stem as an example motif. All stems are made of consecutive pairs of nucleotides, and this is their common feature. Each stem also has a set of individual features, such as its length (i.e. the number of base pairs) and the sequences of the two strands that make up the stem. Additional properties of the motif can be considered at the 3D structure level. For example, if we take into account the 3D coordinates of stem atoms, we can measure a helical bend, which can be treated as another feature of the motif. The complexity

of motif description grows when we consider motifs incorporating unpaired nucleotides, like bulges, internal loops or n-way junctions.

Motifs are analysed from many perspectives. They are stored and annotated in specialized databases. We search for them in newly determined structures. We try to identify new ones, since they are very useful in understanding RNA structures and discovering their functions. Motifs play a key role in fragment-based structure prediction methods, which reuse instances of motifs from native RNA 3D structures to construct new models in silico. These methods rely on finding fragments that match some predefined structure patterns. However, certain complex motifs, such as n-way junctions, are naturally underrepresented in the repositories of known RNA 3D structures. For example, as of April 2009 there were only 62 4-way junctions among RNA 3D structures solved experimentally (Laing and Schlick, 2009), and they were quite diverse in terms of sequence and topological details. This means that a search for a 4-way junction that perfectly matches a user-defined pattern would likely provide few results, if any. A solution to that problem is to apply an imperfect matching procedure. Thus, it has been strongly recommended to improve fragment matching algorithms by introducing measure(s) to assess fragments' similarity. To address this issue, a new method has been proposed that is described in the following paragraphs of this chapter.

3.1. Syntax for Motif Description

The secondary structure of an RNA molecule can be encoded in various machine representations. A commonly used one is a string of characters known as the dot-bracket (or parenthesis) notation (Hofacker et al., 1994). Its basic version applies a 3-character alphabet DB = ".()", where a dot symbol is used to represent an unpaired nucleotide and brackets are for a base pair

(opening bracket corresponds to the first nucleotide in the pair, closing bracket to the second one). Dot-bracket notation is a compact and human-readable format. The whole secondary structure of RNA is described by a string of the same length as the RNA sequence, thus finding a correlation between them is simple and natural. In the extended version of dot-bracket notation (dot-bracket-letter) the alphabet is wider and also includes letters, DBL = ".(){}[]<>AaBbCcDd...". Here, the first three characters are used in the same way as in the basic version. The remaining pairs of succeeding characters encode pseudoknot-involved base pairs. Extended dot-bracket is the only format to directly store information about pseudoknots by using a hierarchy of brackets and letters for different pseudoknot orders (Antczak et al., 2014, 2017).

For these reasons, dot-bracket was selected to be used in RNA FRABASE (Popenda et al., 2008, 2010), a database with a specialized search engine. It contains all 3D structures of RNAs deposited originally in the PDB. Each entry in the RNA FRABASE database is supplemented with the 2D structure in dot-bracket notation, torsion angle values and some metadata. The most innovative and powerful part of RNA FRABASE is its search engine. One of its advantages is support for a syntax to look for similar secondary structure motifs. The search results contain matches and their correspondence to fragments from real RNA 3D structures. The search engine and database are fundamental components of RNAComposer (Popenda et al., 2012), the fragment-based RNA 3D structure prediction method. A part of my doctoral study concerned the RNA FRABASE and RNAComposer systems. The results in this chapter present improvements in the motif similarity search and their impact on the quality of RNA 3D structure prediction.

The search engine of RNA FRABASE can look for RNA structure motifs defined by the sequence, the secondary structure pattern, or both. The query pattern can have any length. The search engine locates all fragments

matching the user-provided pattern. The queried sequence should be defined using IUPAC codes (IUPAC-IUB Commission on Biochemical Nomenclature (CBN), 1971). The secondary structure pattern should be input as a string in dot-bracket notation. The search engine finds all matches exhibiting exactly the same pattern. What makes RNA FRABASE special is the ability to use that syntax for multiple strands at once with inter-strand connections. Only thanks to this functionality can the user look for motifs, because almost all of them are defined by a number of separate strands.

RNAComposer queries the RNA FRABASE database while performing the RNA 3D structure prediction. These queries are built internally, without user interaction. Their syntax is based on that of RNA FRABASE; however, the specifics of fragment-based prediction dictate a few differences. First of all, each nucleotide in an RNAComposer fragment needs to be numbered accordingly, therefore query patterns contain explicit numbering. Additionally, each nucleotide in the sequence is precisely defined to be either adenine, guanine, cytosine or uracil, whereas in RNA FRABASE the user can provide any IUPAC code. RNAComposer also explicitly translates the sequence using the 2-letter alphabet {Y, R}, where Y denotes pyrimidine and R denotes purine, thus obtaining a YR-encoded sequence pattern. Finally, the most important distinction lies in the fact that RNAComposer strictly defines a fragment as ending with a single base pair (apart from a stem, which is a set of base pairs as a whole), while in RNA FRABASE this rule does not apply. Such strict handling of fragments is required by the RNAComposer engine to know where and how to join fragments together in the fragment assembly step. The comparison of RNA FRABASE and RNAComposer syntaxes is provided in Table 3.1.
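The YR encoding mentioned above amounts to a per-nucleotide mapping of purines (A, G) to R and pyrimidines (C, U) to Y. The following sketch illustrates the idea only; it is not RNAComposer code, and the name `yr_encode` is hypothetical.

```python
# Purines (adenine, guanine) map to R; pyrimidines (cytosine, uracil) map to Y.
YR_CODE = {"A": "R", "G": "R", "C": "Y", "U": "Y"}


def yr_encode(sequence):
    """Translate an A/C/G/U sequence into the 2-letter {Y, R} alphabet."""
    encoded = []
    for position, nucleotide in enumerate(sequence.upper(), start=1):
        if nucleotide not in YR_CODE:
            raise ValueError(f"unexpected symbol {nucleotide!r} at position {position}")
        encoded.append(YR_CODE[nucleotide])
    return "".join(encoded)


if __name__ == "__main__":
    print(yr_encode("GGGACCUUCCCGGUCUC"))
```

Rejecting anything other than A, C, G and U mirrors the statement that each nucleotide in such a query is precisely defined, in contrast to the free use of IUPAC codes in RNA FRABASE.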

Table 3.1: Example query patterns in RNA FRABASE and RNAComposer syntax. Green and orange refer to pattern description specific to RNA FRABASE and RNAComposer, respectively.

3.2. Dot-Bracket Limitations in Motif Search

Thanks to the transparency of encoding, intelligibility, and ease of processing by humans and machines, dot-bracket notation has been used in many bioinformatics systems. Both RNA FRABASE and RNAComposer, which were developed in our Laboratory of RNA Bioinformatics (European Centre for Bioinformatics and Genomics), are based on dot-bracket notation in terms of data encoding in the repository (RNA FRABASE) and search for secondary structure motifs (RNA FRABASE, RNAComposer).

In RNA FRABASE and RNAComposer, dot-bracket is essentially used to look (in the database of structures encoded in dot-bracket) for substring(s) that define the queried structure motif(s). It must be ensured that matching strings have the desired properties. Namely, every pair of brackets (opening and closing bracket) from the query pattern must have an exact match (the corresponding pair of brackets) in the resulting fragment derived from some real 3D structure. The same matching rule applies to dots that represent unpaired nucleotide residues. This powerful idea yields great results, which is evident from how well RNA FRABASE and RNAComposer are established in the scientific community. However, during my research, I made the observation that dot-bracket notation combined with the exact matching rule imposes certain implicit assumptions and limitations. Their elimination could significantly improve the search by extending the motif search space, among others by allowing instances that are differently oriented or merely similar to a given motif.

The limitations of dot-bracket notation are due to various factors. First of all, the RNA sequence is oriented, i.e. it is always specified in the direction from the 5'- to the 3'-end. In a dot-bracket string this is reflected by the fact that in a base pair the first nucleotide (closer to the 5'-end) is always encoded by the opening bracket and its counterpart (closer to the 3'-end) by the closing bracket. While being necessary to describe the whole RNA secondary structure, this distinction loses meaning if only a fragment (motif) of the structure is considered. For example, if we want to find a 3-way junction motif in a set of structures, we do not care how it is oriented within each structure (i.e. which one of its ends is closer to the 5'-end of its parent structure). Practically, this means that for any secondary structure motif involving n strands and encoded as a query pattern in dot-bracket notation, we should consider n different search queries: the first one starting with strand 1, the next one starting with strand 2, etc. (Figure 3.1). A solution

to this problem was introduced in RNA FRABASE 2.0 as an option named strand shift operation (Popenda et al., 2010).

(a) [2D diagram of the structure, 5' to 3']
(b) (((..((((....))))...(((.....))).)))
(c) ✗  (.(    )..(   )...)
    ✓  (..(   )...(  ).)
    ✗  (...(  ).(    )..)

Figure 3.1: Secondary structure with 3-way junction motif in (a) graphical and (b) dot-bracket (motif on blue background) formats. To find the motif regardless of its orientation in the structure, (c) three queries generated by strand shift operation should be checked. If this option is not used, only the middle query (green checkmark) finds the motif.

The second implicit assumption in dot-bracket notation concerns pseudoknots. These tertiary structure motifs are represented in dot-bracket-encoded secondary structure by symbols other than regular parentheses, i.e. square or curly brackets, or even letters. In fact, all of these symbols represent the same class of base pairs, and the usage of different characters is required only when the whole secondary structure is encoded (Antczak et al., 2014). If only a motif or structure fragment is considered, then this is not necessarily the case, unless the user explicitly looks for motifs involving pseudoknots. In practice, this means that a dot-bracket-based search method should look for patterns with no differentiation between bracket symbols (Figure 3.2). This

is only partially handled by the RNA FRABASE search engine, and ignored by RNAComposer.

(a) .(((...[[[..)))....((.]]]..))...
(b) [graphical representation of the structure]
(c) >strand1
    (...(
    >strand2
    )..(
    >strand3
    ).(
    >strand4
    )..)

Figure 3.2: RNA secondary structure with pseudoknotted base pairs as (a) arc diagram with dot-bracket encoding, and (b) graphical representation. Analysis of its base pairs, regardless of their involvement in pseudoknot formation, reveals a 4-way junction motif. Its encoding in RNA FRABASE format is shown in (c). In (b), distinct strands of the motif are coloured in red, purple, green, and blue.

3.3. New Model of Motif Search Problem

To overcome the limitations of RNA secondary structure notations (including dot-bracket), I have proposed an alternative representation which can be efficiently used in motif search. It draws from graph theory and represents the secondary structure as a mixed graph.

Definition 3.3.1 RNA mixed graph
Let G = (V, E, A), where V is a set of vertices, E is a set of edges and A is a set of arcs, be a mixed graph. Let S denote a secondary structure of an RNA molecule. We will call G an RNA mixed graph representing structure S if the following conditions are satisfied:

1. Every vertex v ∈ V represents a nucleotide from S and is labelled with a letter from the IUPAC code.
2. The number |V| of vertices in graph G equals the number of nucleotides in S.
3. Every arc (v_i, v_{i+1}) ∈ A represents a connection between consecutive nucleotides in S.
4. Every edge [v_i, v_j] ∈ E represents a base pair in structure S.

Such a graph can be constructed from an RNA secondary structure given in dot-bracket notation or in some other format. The RNA sequence is translated directly into a set of vertices and a set of arcs. The base pairing information is translated into a set of edges. An example graph is presented in Figure 3.3.

An RNA mixed graph can also be constructed from a query pattern encoded in RNA FRABASE or RNAComposer format. If the queried motif is multi-stranded, then the vertex set and the arc set of the RNA mixed graph are unions of the sets generated for each strand separately. The set of edges is created to represent base pairs between vertices standing for strands' ends. An example graph is presented in Figure 3.4.

Let us now formulate the problem of motif inclusion in an RNA secondary structure when both the motif and the entire structure are represented as RNA mixed graphs.

Figure 3.3: (a) Secondary structure of miRNA 20b pre-element (PDB id: 2N7X) and (b) its RNA mixed graph.
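The construction described by Definition 3.3.1 can be sketched for a single-stranded structure given in dot-bracket notation. The class and function names below are hypothetical, vertices are numbered from 0, and the input is assumed to be a balanced dot-bracket string. All bracket types are treated alike, since every edge simply represents a base pair. This is an illustration of the model, not the data structure used by the algorithm in Section 3.4.

```python
from dataclasses import dataclass


@dataclass
class RnaMixedGraph:
    labels: list   # labels[i] is the letter assigned to vertex i (0-based)
    arcs: set      # directed pairs (i, i + 1) between consecutive nucleotides
    edges: set     # undirected base pairs stored as frozenset({i, j})


def build_mixed_graph(sequence, structure):
    """Build an RNA mixed graph from a sequence and its dot-bracket string."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure differ in length")
    closers = {")": "(", "]": "[", "}": "{", ">": "<"}
    stacks = {opener: [] for opener in closers.values()}
    edges = set()
    for i, symbol in enumerate(structure):
        if symbol in stacks:
            stacks[symbol].append(i)
        elif symbol in closers:
            # Pseudoknot brackets yield ordinary edges, like parentheses do.
            edges.add(frozenset((stacks[closers[symbol]].pop(), i)))
        elif symbol != ".":
            raise ValueError(f"unsupported symbol {symbol!r}")
    arcs = {(i, i + 1) for i in range(len(sequence) - 1)}
    return RnaMixedGraph(list(sequence), arcs, edges)


if __name__ == "__main__":
    graph = build_mixed_graph("GGGACCUUCCCGGUCUC", ".(((((.....))))).")
    print(len(graph.labels), len(graph.arcs), len(graph.edges))
```

For a multi-stranded motif, the same routine would be applied per strand with a vertex offset, so that no arc connects the last vertex of one strand with the first vertex of the next.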

Figure 3.4: (a) Secondary structure of a 3-way junction motif and (b) its RNA mixed graph.

Theorem 3.3.1 Let G be an RNA mixed graph of RNA structure S, and let Q denote an RNA mixed graph of secondary structure motif M. If there exists a subgraph $G' \subset G$ isomorphic to Q, then RNA structure S contains motif M.

Proof 3.3.1 If $G' = (V_{G'}, E_{G'}, A_{G'})$ is isomorphic to $Q = (V_Q, E_Q, A_Q)$, then by definition the following is true:

1. $V_{G'} = V_Q$
2. $A_{G'} = A_Q$
3. $E_{G'} = E_Q$

Point 1 corresponds to the fact that every vertex in $G'$ has a matching vertex in Q with the same label. Point 2 represents the fact that these equally labelled vertices can be connected in the same order in both $G'$ and Q. From these two points together it follows that there is a sequential match between M and a fragment of S.

Point 3 says that every edge present in $G'$ has a corresponding edge in Q. This states that the structure of RNA (i.e. the base pairs it is made of) is matched between M and a fragment of S.

A motif is defined by its secondary structure, i.e. its sequence and base pairing topology. Thus, when both of these match, we can conclude that RNA structure S contains motif M.

The formulation of the motif search problem using RNA mixed graphs is an answer to the limitations of dot-bracket-based approaches. Nucleotides forming a single strand are ordered from the 5′- to the 3′-end, which is represented by directed arcs in the graph. However, each strand also initiates a separate subset of vertices, which are linked using undirected edges. The order in which all strands are encoded with respect to one another matters in the dot-bracket notation. However, this is not the case when an RNA mixed graph is used to represent a multi-stranded fragment. In practice, this means that a strand shift operation is not needed in this representation of RNA structure. Since each edge represents any type of canonical base pair, there is no problem with pseudoknot handling either. The proposed model can be generalized to look for partial matching of motifs.

Theorem 3.3.2 Let G be an RNA mixed graph of RNA secondary structure S, and let Q denote an RNA mixed graph of secondary structure motif M. Any RNA mixed graph H which is isomorphic to both a subgraph of G and a subgraph of Q is a partial matching of motif M in RNA structure S.

Proof 3.3.2 H is isomorphic to a subgraph of Q, so we can state that it covers at least a part of motif M. Knowing also that it is isomorphic to a subgraph of G, we can repeat the reasoning stated in Proof 3.3.1 to conclude that it represents a fragment of RNA structure S. This concludes the proof.

Proposition 3.3.3 Score for a partial matching. A partial matching represented by RNA mixed graph H can be scored with respect to the RNA mixed graph of query pattern Q using the following formula:

$$
f(H, Q) = |\{v_i \mid v_i \in V_Q \land v_i \notin V_H\}| + |\{a_i \mid a_i \in A_Q \land a_i \notin A_H\}| + |\{e_i \mid e_i \in E_Q \land e_i \notin E_H\}| \tag{3.1}
$$
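The scoring formula in Equation 3.1 counts the vertices, arcs and edges of Q that are missing from H. A minimal Python sketch of this count is given below. It assumes that the elements of H have already been renamed to the identifiers used in Q (that is, the isomorphism mapping has been applied beforehand), and the function name `partial_match_score` and the tuple layout `(vertices, arcs, edges)` are hypothetical choices for this example.

```python
def partial_match_score(h_graph, q_graph):
    """Count the elements of Q that are absent from H (Equation 3.1).

    Both graphs are (vertices, arcs, edges) tuples of sets. Arcs are
    ordered pairs, edges are frozensets because they are undirected.
    """
    h_vertices, h_arcs, h_edges = h_graph
    q_vertices, q_arcs, q_edges = q_graph
    missing_vertices = len(q_vertices - h_vertices)
    missing_arcs = len(q_arcs - h_arcs)
    missing_edges = len(q_edges - h_edges)
    return missing_vertices + missing_arcs + missing_edges


# A four-nucleotide query whose ends are base paired.
query = (
    {0, 1, 2, 3},
    {(0, 1), (1, 2), (2, 3)},
    {frozenset({0, 3})},
)
# A partial match that lacks the last nucleotide, its arc and the base pair.
partial = ({0, 1, 2}, {(0, 1), (1, 2)}, set())

score = partial_match_score(partial, query)
```

In this example one vertex, one arc and one edge of the query are missing from the partial match, so the score is 3, while scoring the query against itself gives 0.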

Definition 3.3.2 Perfect motif matching. An RNA mixed graph H*:

$$
H^{*} = \arg\min f(H, Q) \tag{3.2}
$$

is a perfect motif matching iff f(H*, Q) = 0. Otherwise, it is an imperfect motif matching.

Theorem 3.3.1 forms the basis of the decision problem of determining whether a given motif exists in an RNA secondary structure. The optimization version of the problem is based on Theorem 3.3.2 and the scoring function defined by Equation 3.1.

3.4. Algorithms to Search for Similar Motifs

Based on the RNA mixed graph G, a new algorithm to search for similar secondary structure motifs has been developed. It uses two data structures to store information about the graph: one for vertices and arcs, another for edges. Vertices from set V of G are kept in an ordered, indexed list. Each strand of a multi-stranded RNA structure motif is associated with a separate list. This allows arcs from set A of G to be handled automatically, because they are defined on sequentially consecutive nucleotides. Finally, edges from set E of G are kept in a doubly linked list, which makes it easy to find connections between RNA strands.

Thanks to this arrangement of data structures, the algorithm to find a perfect (exact) matching is streamlined. It applies a sliding window procedure to go along the entire RNA structure S and checks for every consecutive nucleotide whether it can be the first nucleotide of the queried motif M. The verification process repeats the same procedure for each strand separately. The validation of a strand succeeds if sequence and structure match perfectly, and if the ending nucleotide of the candidate strand forms a base pair with the nucleotide belonging to the next strand (this must be true for all strands of a motif). The final validation rests on checking whether the motif was fully looped and returned to the initial starting point. The idea of this perfect matching mixed graph (PMMG) algorithm is presented as Algorithm 1.

The algorithm can be adjusted to find imperfect matching as well. This can be achieved by relaxing the checks for sequence validity. Such an operation is justified from a biological point of view, since RNA molecules exhibit sequential variability even if they belong to the same family. It is also understandable from a utility point of view, because in silico mutation of nucleotides in the RNA sequence is rather straightforward. Thus, even imperfect sequential matches can be easily transformed into the requested ones. This is especially the case when the mutation involves a purine-purine or pyrimidine-pyrimidine exchange. RNAComposer also allows the 3D structure fragments to have different sequences and it makes use of in silico mutations if needed. That is why its syntax explicitly uses the YR-encoded sequence to score fragments of RNAs.

Another way of modifying the algorithm to find imperfect matching is by allowing more base pairs in the motif than are present in the query pattern. This proposition has two variants. In the first one, additional base pairs are allowed if the paired nucleotides belong to different strands of the motif, or if one of them belongs to the motif while the other does not. In this variant, matching fragments might include disrupted parts, because in the motif they are expected to be unstructured but in the solution they come from a conformation restricted by base pairing. In the second variant, any additional base pair is allowed. However, this approach should rather be avoided, because it allows intra-strand base pairing, which significantly changes the structure topology.
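The relaxed sequence check based on purine and pyrimidine classes can be sketched in Python as follows. This is only an illustration: the helper names `to_yr`, `identity` and `sequence_scores` are hypothetical, and the fraction of identical positions is used as an assumed stand-in for a homology measure, which this section does not define precisely.

```python
PURINES = frozenset("AG")      # encoded as R
PYRIMIDINES = frozenset("CU")  # encoded as Y


def to_yr(sequence):
    """Encode an RNA sequence as purines (R) and pyrimidines (Y)."""
    encoded = []
    for nucleotide in sequence.upper():
        if nucleotide in PURINES:
            encoded.append("R")
        elif nucleotide in PYRIMIDINES:
            encoded.append("Y")
        else:
            encoded.append("N")  # any other symbol is treated as unknown
    return "".join(encoded)


def identity(first, second):
    """Fraction of positions at which two equal-length strings agree."""
    if len(first) != len(second):
        raise ValueError("sequences must have equal length")
    if not first:
        return 1.0
    return sum(a == b for a, b in zip(first, second)) / len(first)


def sequence_scores(candidate, query):
    """Return (YR-level identity, nucleotide-level identity)."""
    return identity(to_yr(candidate), to_yr(query)), identity(candidate, query)
```

For the candidate `GACU` and the query `AACU`, the only difference is a purine-purine exchange, so the YR-level identity is 1.0 while the nucleotide-level identity is 0.75.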
Yet another way of including imperfect matches in the result set is to allow strands to grow (by accepting nucleotide insertions) or to shrink (by accepting nucleotide deletions). However, these modifications are discouraged, because they complicate the scoring of matching fragments and, more importantly, they yield results which are difficult to use. When a user searches for a given motif and obtains a bigger or smaller result, manual intervention is required to fix the differences.

The aforementioned modifications constitute the imperfect matching mixed graph (IMMG) algorithm presented as Algorithm 2. It follows the general idea of PMMG, but initially it gathers a result set ignoring sequence constraints and allowing any number of additional base pairs (excluding intra-strand ones). Each entry of the result set is assigned a similarity score, which is an ordered tuple that contains: (i) the number of inter-strand base pairs in the result, (ii) structure homology against the query pattern, (iii) YR-sequence homology (vide supra), (iv) sequence homology (vide supra) and (v) experimental resolution of the parent 3D structure of the fragment.

Entries in the result set are sorted by similarity, which is expressed as a tuple. The sorting procedure is applied in n iterations, where n is the number of values in the tuple. In the first iteration, tuples are sorted according to the first parameter. Then, in every i-th iteration, the procedure sorts according to the i-th parameter and preserves the order determined for all previous parameters. This means that, e.g., a fragment without inter-strand base pairs but with poor sequence homology will still be considered better than another one with a perfect sequence match and a positive number of inter-strand base pairs. The sorting order reflects a preference for structural features over sequential matching, although that order can be changed in the algorithm or a more sophisticated scoring function can be applied. All criteria, except the last one, refer to the sequence and the secondary structure.
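The iterative sorting described above is equivalent to a lexicographic ordering of the score tuples, which can be sketched in Python with a single composite sort key. The class `Match`, its field names and the sample values are hypothetical and serve only this illustration. The sketch assumes that fewer inter-strand base pairs, higher homology values and a lower resolution value in ångströms are preferred.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    name: str
    inter_strand_pairs: int    # (i) fewer is assumed better
    structure_homology: float  # (ii) higher is assumed better
    yr_homology: float         # (iii) higher is assumed better
    sequence_homology: float   # (iv) higher is assumed better
    resolution: float          # (v) in angstroms, lower is assumed better


def ranking_key(match):
    """Composite key: earlier criteria dominate, later ones break ties."""
    return (
        match.inter_strand_pairs,
        -match.structure_homology,
        -match.yr_homology,
        -match.sequence_homology,
        match.resolution,
    )


def rank(matches):
    """Return matches ordered from best to worst."""
    return sorted(matches, key=ranking_key)


candidates = [
    Match("A", 0, 1.0, 0.5, 0.25, 3.0),
    Match("B", 1, 1.0, 1.0, 1.0, 1.5),
    Match("C", 0, 1.0, 0.5, 0.25, 2.0),
]
ranked_names = [match.name for match in rank(candidates)]
```

In this example `B` has a perfect sequence match but one inter-strand base pair, so it is ranked last, while `C` precedes `A` only because of its better resolution.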
The last criterion is used to break ties. It means that among matches having the same sequence and structure, those originating from 3D structures of higher quality are preferred. One can observe that the sorting criteria reflect penalties for imperfect matching. By reversing this reasoning, one can see that the IMMG algorithm implicitly finds perfect matches by assigning them zero penalty and, in consequence, placing them at the top of the ranking.

The IMMG algorithm also covers the case when the result set is empty. This is handled by a heuristic that finds fragments with insertions. The heuristic is based on a greedy algorithm, which elongates the shortest strand in the query with a new nucleotide. After elongation of the shortest strand, the search is repeated. The heuristic is extremely simple and could potentially be improved by also allowing deletions or by making the insertion decisions in a more rational manner. However, the experiments show that the base IMMG algorithm without insertions can find almost any fragment not recognized by RNAComposer, and if it cannot, a single insertion suffices to find a candidate in the next round of the search procedure.

The IMMG algorithm has been implemented in the Java programming language. It is directly linked to the RNA FRABASE database using the Java Persistence API (JPA) technology. A query pattern should be given in RNAComposer syntax. The program fetches all sequences and secondary structures from RNA FRABASE. Next, they are divided evenly into groups. Each group is processed in parallel on a separate CPU core. Match searching is relaxed to allow suboptimal solutions. However, in the end, a ranking of the results is made to select the best matching fragment (possibly a perfect match). The relaxation involves the sequence check and the first variant of structure checking. In other words, the program mutates bases in silico if needed and allows additional base pairs, but not intra-strand ones.
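The greedy fallback for an empty result set can be outlined in Python as below. This is a loose sketch under several assumptions that the text does not specify: the new nucleotide is an unpaired wildcard `N` inserted in the middle of the shortest strand, the number of rounds is bounded by a hypothetical parameter `max_insertions`, and the actual search is passed in as a callable `search`. None of these details should be read as the behaviour of the original Java implementation.

```python
def search_with_insertions(query_strands, search, max_insertions=3):
    """Repeat the search, elongating the shortest strand after each failure.

    query_strands: list of (sequence, dot_bracket) tuples
    search:        callable taking such a list and returning a list of matches
    Returns (matches, number_of_insertions_made).
    """
    strands = list(query_strands)
    for inserted in range(max_insertions + 1):
        results = search(strands)
        if results:
            return results, inserted
        if inserted == max_insertions:
            break
        # Greedy step: pick the shortest strand (the first one on ties).
        shortest = min(range(len(strands)), key=lambda i: len(strands[i][0]))
        sequence, structure = strands[shortest]
        middle = len(sequence) // 2
        # Assumed form of the insertion: an unpaired wildcard nucleotide.
        strands[shortest] = (
            sequence[:middle] + "N" + sequence[middle:],
            structure[:middle] + "." + structure[middle:],
        )
    return [], max_insertions
```

The search routine is injected as a parameter so that the elongation loop stays independent of how matches are found, which keeps the sketch testable with a stub.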

Algorithm 1 PMMG: Finding perfect matches of motifs in RNA structure
Input: S is an RNA sequence and secondary structure, M is a motif encoded by a query pattern.
Output: A set of vertices in graph G representing structure S from which a perfect match of motif M begins.

G ← TransformIntoRnaMixedGraph(S)
Q ← TransformIntoRnaMixedGraph(M)
found ← ∅
for vertex ∈ V_G do
    begin ← vertex
    valid ← True
    for strand ∈ strands of Q do
        end ← begin + |strand|    ▷ Implicit usage of A_G
        if sequence of strand ≠ sequence in [begin, end] then
            valid ← False
            break
        end if
        if structure of strand ≠ structure in [begin, end] then
            valid ← False
            break
        end if
        if begin ∈ E_G ∧ end ∈ E_G then
            begin ← end
        else
            valid ← False
            break
        end if
    end for
    if valid = True ∧ begin = vertex then
        found ← found ∪ {vertex}
    end if
end for
return found
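One possible reading of Algorithm 1 is sketched below in Python. It is not the author's implementation, and it makes three interpretive assumptions: the window of a strand covers indices `begin` to `begin + |strand| − 1` inclusive, the step that moves to the next strand jumps to the base-pairing partner of the strand's last nucleotide, and structures are compared as plain dot-bracket substrings with only the `(` / `)` bracket type. The names `pair_table` and `find_perfect_matches` are hypothetical.

```python
def pair_table(dot_bracket):
    """Map every paired index to its partner for a nested structure."""
    opened, partner = [], {}
    for index, symbol in enumerate(dot_bracket):
        if symbol == "(":
            opened.append(index)
        elif symbol == ")":
            other = opened.pop()
            partner[index], partner[other] = other, index
    return partner


def find_perfect_matches(sequence, structure, motif_strands):
    """Return start indices from which the whole motif matches exactly.

    motif_strands is a list of (sequence, dot_bracket) tuples, one per strand.
    """
    partner = pair_table(structure)
    found = []
    for start in range(len(sequence)):
        begin, valid = start, True
        for strand_sequence, strand_structure in motif_strands:
            end = begin + len(strand_sequence) - 1  # inclusive window end
            if end >= len(sequence):
                valid = False
                break
            if sequence[begin:end + 1] != strand_sequence:
                valid = False
                break
            if structure[begin:end + 1] != strand_structure:
                valid = False
                break
            if end not in partner:
                valid = False
                break
            # The strand's last nucleotide must pair with the next strand's first.
            begin = partner[end]
        # A valid motif closes the loop and returns to the starting vertex.
        if valid and begin == start:
            found.append(start)
    return found


hairpin_hits = find_perfect_matches(
    "GGCGAAAGCC", "(((....)))", [("CGAAAG", "(....)")]
)
stem_hits = find_perfect_matches(
    "GGCGAAAGCC", "(((....)))", [("GG", "(("), ("CC", "))")]
)
```

With these inputs, the single-stranded hairpin loop is found starting at index 2 and the two-stranded stem fragment starting at index 0. In both cases the partner of the last strand's final nucleotide is the starting vertex, which corresponds to the final check in Algorithm 1.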