
The mechanism of protein folding – in silico model


Academic year: 2021


Jagiellonian University

Faculty of Physics, Astronomy and Applied Computer Science

Barbara Kalinowska

The mechanism of protein folding - in silico model

A thesis submitted for the degree of

Doctor of Philosophy

supervised by

prof. dr hab. Irena Roterman-Konieczna


Faculty of Physics, Astronomy and Applied Computer Science, Jagiellonian University

Declaration

I, the undersigned, Barbara Kalinowska (student record no. 480), a doctoral student at the Faculty of Physics, Astronomy and Applied Computer Science of the Jagiellonian University, declare that the doctoral dissertation I have submitted, entitled The mechanism of protein folding - in silico model, is original and presents the results of research carried out by me personally under the supervision of prof. dr hab. Irena Roterman-Konieczna. I wrote the dissertation myself. I declare that my doctoral dissertation was prepared in accordance with the Act on Copyright and Related Rights of 4 February 1994 (Dziennik Ustaw 1994 No. 24, item 83, as amended). I am aware that any inconsistency of this declaration with the truth, revealed at any time and irrespective of the legal consequences under the above-mentioned Act, may result in the annulment of the degree obtained on the basis of this dissertation.

Kraków, date ... ...


Summary

A still unsolved problem in biology is the mechanism by which proteins adopt their specific and functional tertiary structure. Despite knowledge of the interactions occurring in proteins and experimental reports on the course of folding, the exceptional complexity of protein structure makes it impossible to trace or computationally reproduce the entire process by which a protein attains its native structure. Several models of folding are currently proposed; they rest on different experimental premises and emphasize different physicochemical factors dominating during folding. Reproducing the folding process in silico is one of the approaches used in methods for predicting protein structure from the amino acid sequence. Both ab initio methods and methods that use data on known native structures are still insufficient to provide accurate structural information. The model presented in this thesis attempts to create a protein structure prediction method that both uses data on available native structures and relies on known folding mechanisms. The model is additionally used to explain phenomena related to the stabilization of protein structure. The main assumption of the model is the multistage character of folding, that is, the occurrence of experimentally observed intermediate states. For this reason two main stages were distinguished: the Early Stage structure and the Late Stage structure.

The Early Stage uses information about the local conformational preferences of the main chain, which, as has been shown, lead to the formation of secondary structures in the first stages of folding. At this stage it is important to gather information about the sequence-to-structure relation in a form that allows the most accurate local conformation to be determined. To simplify the definition of local structure, a seven-letter classification of main-chain conformations was introduced. This classification is based on a limited conformational subspace in the form of an elliptical path on the Ramachandran map and on the division of this map into seven zones defining seven structural codes. The structural codes are used to build libraries of structural motifs, whose probability of occurrence in known structures is used to determine the most probable structure of the early intermediate, which is the starting point for the subsequent stages of native structure determination. The thesis presents two methods for predicting this structure, one based on a contingency table of probabilities for tetrapeptides and the other, more effective, on statistical dictionaries. A tool for visualizing the sequence-to-structure relation in the form of a contingency table was also built.

The next stage, the Late Stage, assumes a dominant influence of the hydrophobic effect on the arrangement of amino acid residues in proteins according to their properties. The ideal case is described by the Fuzzy Oil Drop (FOD) model, which assumes a theoretical distribution of hydrophobicity in the protein molecule in the form of a three-dimensional Gaussian function. This distribution is the reference against which the protein structure is modified during the simulation of folding. By determining the distance between the distribution observed for a known protein structure and the FOD distribution, calculated as relative entropy, the degree of hydrophobic core formation in a molecule can be assessed quantitatively. This procedure makes it possible to analyze the factors stabilizing native protein structures. Using the FOD model, the importance of unstructured chain fragments in folding and in complex formation by DNA-binding proteins was shown. The cooperation of the hydrophobic effect and disulfide bonds as factors stabilizing the functional structure of proteins was also analyzed.


Abstract

One remarkable unsolved problem in biology is the mechanism by which proteins obtain their specific functional three-dimensional structure. Despite the knowledge about intra-protein interactions and the experimental results in the field of protein folding, detailed observation or computational reconstruction of the process is still not possible. Several models of folding have been proposed so far. Each of them is based on different experimental premises and emphasizes different physicochemical factors which dominate during the process. Recreating the folding process by means of in silico methods is one of the approaches applied to the prediction of protein structures from amino acid sequences. Neither ab initio nor knowledge-based methods are currently sufficiently successful in providing the correct structure of a protein. This doctoral thesis presents a model which was implemented as a method of protein structure prediction and which uses data on experimentally solved native structures as well as the known folding mechanisms. The model is also applied to explain some effects related to tertiary structure stabilization. The main assumption behind the model is the occurrence of several intermediate stages during the folding process, which has been confirmed in numerous experiments. For that reason, two main types of intermediate structure are distinguished: the Early Stage (ES) and the Late Stage (LS) structures.

The ES model is based on the local conformational preferences of the backbone, which, as has been shown, are an important factor in secondary structure formation in the initial phase of the folding process. At this stage, information about the sequence-to-structure relation has to be gathered. A seven-letter representation of the backbone conformation was introduced as a simple local structure classification. The definition of these structural codes is related to seven zones in the limited conformational subspace forming an elliptical pathway in the Ramachandran plot. These seven codes are applied to create libraries of structural motifs, whose occurrence frequencies are used to obtain the most probable structure of the early intermediates. Two methods of such structure determination are presented in this work. One of them is based on the contingency table of occurrence frequencies for tetrapeptides and the other, which is more effective, is based on statistical dictionaries. A tool for visualization and analysis of the sequence-to-structure relation in the form of a contingency table was also built and made freely accessible.

The assumption underlying the second type of intermediate model (LS) is that the hydrophobic effect is one of the strongest factors affecting the location of amino acid residues. The idealized effect of such interactions is described here with the Fuzzy Oil Drop (FOD) model, in which the theoretical hydrophobicity distribution in a protein molecule is modeled by a three-dimensional Gaussian function. This expected distribution provides a reference point for the energy minimization procedure during structure prediction. Estimating the distance between the observed hydrophobicity distribution and the theoretical one consistent with the FOD, by means of a divergence entropy calculation, enabled the quantitative evaluation of the hydrophobic core formation status in a given protein molecule. By this application of the FOD model the role of disordered chain fragments in folding and in DNA complexation was demonstrated. Another analyzed effect is the supplementary participation of hydrophobic interactions and disulfide bonds in protein structure stabilization.


Contents

1 Introduction
1.1 Protein folding in vivo
1.2 Protein structure principles
1.3 Protein folding models
1.4 Folding simulation in silico
1.5 Protein structure prediction

2 The model of protein folding based on two intermediate state types
2.1 Early Stage
2.2 Late Stage
2.3 Folding simulation

3 Included papers: summary and comments
3.1 Simulation of the protein folding process
3.2 Hypothetical in silico model of the early-stage intermediate in protein folding
3.3 Statistical dictionaries for hypothetical in silico model of the early-stage intermediate in protein folding
3.4 Contingency Table Browser – prediction of early stage protein structure
3.5 Intrinsically disordered proteins - relation to general model expressing the active role of the water environment
3.6 Application of divergence entropy to characterize the structure of the hydrophobic core in DNA interacting proteins
3.7 Role of disulfide bonds in stabilizing the conformation of selected enzymes – an approach based on divergence entropy applied to the structure of hydrophobic core in proteins

4 Final remarks

5 Bibliography


Content and reading

The main part of this doctoral thesis consists of seven articles which I coauthored. These publications are preceded by an introduction, which contains an outline of current knowledge about the protein folding process and an explanation of the in silico model which is the main subject of this work. The introduction also includes summaries of the articles’ contents. The papers are as follows:

[A1] Roterman, I., Konieczny, L., Banach, M., Marchewka, D., Kalinowska, B., Baster, Z., Tomanek, M., Piwowar, M. (2014) Simulation of the Protein Folding Process In Liwo A. (Ed.) Computational Methods to Study the Structure and Dynamics of Biomolecules and Biomolecular Processes (Vol. 1, pp. 599-638) Berlin, Heidelberg: Springer Berlin Heidelberg.

[A2] Kalinowska, B., Alejster, P., Sałapa, K., Baster, Z., Roterman, I. (2013) Hypothetical in silico model of the early-stage intermediate in protein folding. Journal of Molecular Modeling, 19(10), 4259-69,

[A3] Kalinowska, B., Fabian, P., Stąpor, K., Roterman, I. (2015) Statistical dictionaries for hypothetical in silico model of the early-stage intermediate in protein folding. Journal of Computer Aided Molecular Design, 29, 609-618,

[A4] Kalinowska, B., Krzykalski, A., Roterman, I. (2015) Contingency Table Browser - prediction of early stage protein structure. Bioinformation, 11(10), 486-8,

[A5] Kalinowska, B., Banach, M., Konieczny, L., Marchewka, D., Roterman, I. (2014) Intrinsically disordered proteins - relation to general model expressing the active role of the water environment. Advances in Protein Chemistry and Structural Biology, 94, 315-342,

[A6] Kalinowska, B., Banach, M., Konieczny, L., Roterman, I. (2015) Application of divergence entropy to characterize the structure of the hydrophobic core in DNA interacting proteins. Entropy, 17(3), 1477-1507,

[A7] Banach, M., Kalinowska, B., Konieczny, L., Roterman, I. (2016) Role of disulfide bonds in stabilizing the conformation of selected enzymes – an approach based on divergence entropy applied to the structure of hydrophobic core in proteins. Entropy, 18(3), 67.

In relation to the two stages considered in the folding model presented in the thesis, the articles are grouped according to the stage they focus on. Therefore, their order is not chronological. The first publication [A1] introduces both models in the context of the folding simulation. [A2]-[A4] concentrate on Early Stage modelling. While [A2] and [A3] describe the two proposed methods of first intermediate structure prediction, [A4] presents a tool built to facilitate the analysis of data on the sequence-to-structure relation gathered from known protein structures. The subject of [A5]-[A7] is the Fuzzy Oil Drop model, which is an important assumption for the Late Stage intermediate structure prediction, and its application to the analysis of the mechanisms stabilizing the tertiary protein structure. The articles focus respectively on intrinsically disordered proteins, DNA-binding proteins and disulfide bonds. All assumptions behind the model are described in the introduction (Section 2). Section 2.1 explains the Early Stage structure determination and Section 2.2 presents the Fuzzy Oil Drop model. These sections provide a reference for the article summaries in Section 3, where the models are mentioned without their details.

1 Introduction

A strong dependence between protein function and native three-dimensional structure is one of the main dogmas of current biology. Since Anfinsen demonstrated that a ribonuclease molecule can fold spontaneously in vitro [1], it has been claimed that the information included in an amino acid sequence is sufficient to determine the functional structure of a protein. Theoretical and experimental attempts to answer the question of how a protein molecule obtains its structure after translation have been undertaken for over sixty years. Nevertheless, the folding process remains only partially understood [2, 3]. What is certain is that the final structure cannot be a result of random sampling of the conformational space. This problem was posed by Levinthal, who calculated that for a protein a hundred amino acids long the process would require at least $10^{10}$ seconds [4]. In cells proteins need between microseconds and seconds to fold [5, 6], which indicates the existence of mechanisms directing a molecule towards its final structure.

Explaining the protein folding process is crucial for our understanding of diseases associated with protein aggregation, misfolding or decreased structural stability. Protein aggregation accompanies many diseases, such as type II diabetes, cystic fibrosis, neurodegenerative diseases (such as Parkinson’s disease or Alzheimer’s disease) and prion diseases [7–11]. The understanding of protein folding can also support the methodology of protein structure prediction tools.

1.1 Protein folding in vivo

Proteins are folded in the cytoplasm immediately after synthesis of a new peptide chain on the ribosome. One of the most important conditions for this process is the aqueous environment, which by interaction with amino acid residues promotes certain specific structures. Another aspect of the cytoplasmic environment is the significant concentration of other proteins, which can reach 200-400 mg/ml [12]. This results in a high probability of interaction between a newly synthesized peptide chain and cytoplasmic proteins. These interactions can support the formation of the native structure but may also disturb the folding process [13, 14]. Obtaining the correct molecular structure is necessary to avoid aggregation and dysfunctional (or even harmful) interactions. For that reason, mechanisms supporting proper protein folding have evolved. It was shown that small, single-domain proteins can fold especially fast and therefore do not require any additional controlling mechanisms [15, 16]. The folding of longer proteins involves interaction with chaperone proteins (proteins from the Hsp60 and Hsp70 groups), whose participation was demonstrated for both eukaryotic and prokaryotic cells. Although the detailed mechanism of chaperone function is not fully understood, it was shown that a synthesized amino acid chain is bound in a tunnel that provides a favorable environment, mainly because of its hydrophilic properties [17–20].

Another factor important for the understanding of protein folding is the mechanism of translation on ribosomes. In eukaryotic cells a protein chain is synthesized at an average speed of 5 aa/sec [21]. Therefore, the folding of the earlier part of the chain can significantly precede the completion of chain synthesis. This effect is called co-translational folding [22–24]. It was also demonstrated that the structure of tRNA molecules participating in translation can affect the local structure of a newly synthesized protein chain [25–27]. Studying structural changes of a protein in vivo faces enormous experimental difficulties. Despite the increasing number of publications on NMR [29,30] or fluorescence [31–34] investigations of protein stability in the cytoplasm, most of the current knowledge is provided by in vitro experiments and computational methods.

1.2 Protein structure principles

There are four basic mechanisms affecting protein structure stability: (1) the hydrophobic effect, according to which hydrophobic residues are directed toward the inside of the molecule and hydrophilic ones are exposed to the water environment; (2) hydrogen bonds, both intramolecular and between amino acid residues and the water environment; (3) electrostatic interactions and (4) entropic effects (interactions). Entropic effects, however, play a dual role in molecule stabilization: they promote the unfolded form of the molecule but also the decrease in the number of hydrogen bonds between the protein molecule and water [35]. Protein structure also cannot be analyzed without taking into consideration covalent bond geometry, torsional and van der Waals interactions. Notwithstanding the knowledge about the physical principles which determine protein structure formation, tracing conformational changes during the folding process poses a considerable challenge to experimental and computational methods.

The basic model of the protein folding process assumes that the protein conformation changes with decreasing internal energy of the molecule towards its global minimum. The conformations available to the molecule form the "energy landscape", which is presented graphically as the "energy funnel" built of all possible pathways from unfolded states of high energy on the edge towards low energy states in the local minima [35]. Because of the complexity of the system, identifying the global energy minimum, which is claimed to represent the expected native state, remains an unsolved problem. Another open question is whether specific transition states are formed during the folding process, or whether there are several possible pathways of conformational change leading to a single functional structure [36, 37].

Currently, there are no experimental methods which can be applied to observe the structural changes during the folding process. However, spectroscopic methods, mainly based on fluorescence (for example FRET or SCS) [31–34], NMR (especially proton exchange) [28, 38–40] and circular dichroism [41], when accompanied by molecular engineering, provide insight into folding kinetics, changes in molecule compaction and transition states. For example, it was demonstrated that small domains fold in a highly cooperative way, although for some of them transition states were also observed [41].

Besides the unfolded and native states, other intermediate protein forms have been recorded experimentally. Their structure is more compact than the unfolded state but not as fully structured as the native form [42]. Such structures are called Molten Globule (MG) and some researchers distinguish two forms of them: a hydrated Wet Molten Globule (WMG) and a form with a dehydrated hydrophobic core, the Dry Molten Globule (DMG) [43–45]. It was even demonstrated for a chorismate synthase that a protein of such structure can be active [46]. The MG is characterized by the occurrence of secondary structure. In the early stages of folding, α-helices and β-strands were observed. This indicates a significant impact of local interactions on the folding process, especially in the early stages. Although the early α-helices are highly unstable and disappear quickly, they can promote the formation of the native secondary structures [47].

1.3 Protein folding models

Numerous models of protein folding have been postulated, each assuming different dominant factors. First, "the nucleation-condensation model" was proposed. It assumes the occurrence of a folding nucleus consisting of a small set of amino acid residues whose interactions direct the molecule structure towards its correct conformation [48]. In 1976, "the diffusion-collision-adhesion model" was published. The model assumes that folding involves movements of chain fragments which cause the domains to meet each other and interact [49,50]. Then, in 1982, Baldwin presented "the framework model", based on the early emergence of secondary structure as a result of local interactions and backbone properties, which leads to the final structure formation [51]. Another, "the hydrophobic collapse model", emphasizes the dominant role of hydrophobic interactions in the early stages of the protein folding process [52, 53]. Other attempts to explain protein folding resulted in "the zipping-and-assembly model" [54], "the jigsaw puzzle model" [55] and the newest one, "the stoichiometry-driven protein folding hypothesis" [56]. Among the variety of mechanisms which can be taken into consideration, the occurrence of intermediate states, confirmed in numerous articles, is especially important [57].

1.4 Folding simulation in silico

Because of the experimental difficulties, computational approaches have been developed since the beginning of protein folding studies. All-atom simulations of small proteins (about 100 aa) on the millisecond time scale were not available until the last several years because of the significant computational power needed. The methods that allow folding to the native structure to be observed during a molecular simulation required the application of parallel programming, Markov state methods [58] and even supercomputers like ANTON, built especially for protein dynamics simulations [59,60]. Still, the methods have to simplify the simulated system, for example by limiting the properties of the biological environment. As a result, only a part of the performed simulations end with structures which are significantly close to the expected native structure [61].

To limit the necessary computing power, lattice and "coarse-grained" models are also applied. In spite of their lower accuracy, such models take into consideration numerous parameters describing amino acid residues and the environment, and they increase the speed of conformational sampling [62,63]. Despite all the applied limitations, such simulations can provide reliable information about the dynamics and kinetics of the protein folding process.

1.5 Protein structure prediction

The lack of understanding of the relation between an amino acid sequence and a three-dimensional structure is a central problem in the development of protein structure prediction methods. Because the folding process remains unsolved and all-atom simulation capabilities are still limited, a considerable part of structure prediction methods is based on information derived from known native protein structures [64]. However, despite the huge number of structures deposited in the Protein Data Bank (more than 35 000) [65, 66], they cover only a limited number of the possible structures occurring in nature.

The most accurate results have been obtained by homology modeling, which means searching for similar sequence patterns among proteins of known native structure. This approach, however, is quite unsuccessful for proteins without homologs or evolutionarily closely related proteins of known structure. Ab initio methods, which aim at folding process simulation, are still unable to overcome the problem of searching for the global energy minimum. In addition, the results strongly depend on starting and intermediate structures, whose determination is also difficult [67]. The most successful methods, like Rosetta [68, 69] or I-Tasser [70–72], combine the knowledge-based approach with molecular dynamics simulations. Therefore, these methods require information about the sequence-to-structure relation gathered in the form of libraries of known local backbone conformations. The local structure is defined by fragments whose length varies between 3 and 9 amino acid residues depending on the method [73]. Local conformation in a native structure depends not only on the sequentially surrounding amino acids but also on interactions with residues significantly distant in the sequence [74]. For that reason, the local conformation is only partially determined by the local amino acid sequence and the information about the sequence-to-structure relation for short peptides is considerably ambiguous. Nevertheless, such libraries are the basis for prediction methods applying statistical potentials [70], Monte Carlo simulations [68] or artificial neural networks [75].

The structure prediction method proposed by Roterman [76, 77, A1] also combines statistical analysis of known backbone conformations with molecule energy minimization based on the Fuzzy Oil Drop (FOD) model. The method allows known facts about the protein folding process to be incorporated into the molecular simulation and, in contrast to other prediction tools, it aims at limiting the number of starting structures. The FOD model can also be applied to the analysis of known native structures in order to increase the understanding of protein folding, which can further improve the prediction methodology.

2 The model of protein folding based on two intermediate state types

This thesis presents the theoretical model of the protein folding process proposed by Roterman. The model is based on the observation of intermediate states during the process. It defines two types of intermediate structures: the Early Stage (ES) structure and the Late Stage (LS) structure. While the standard approach includes an unfolded state ($U$), intermediate states ($I_i$) and a native state ($N$):

$$U \rightarrow I_1 \rightarrow \dots \rightarrow I_n \rightarrow N$$

the model described here introduces intermediates as follows:

$$U \rightarrow ES \rightarrow LS_1 \rightarrow \dots \rightarrow LS_n \rightarrow N.$$

2.1 Early Stage

The main purpose of the early structure definition is to limit the conformational space at the beginning in order to select the most probable starting structure for further calculations. It was demonstrated that the result of a molecular simulation strongly depends on the proposed starting structures [67]. The need for conformational space limitation was postulated in [78]. While many protein prediction methods solve the problem by performing the calculations for a large set of initial structures, this method aims at replacing them with only one theoretically validated structure.

The ES model was built on the assumption of the dominant role of the backbone characteristics and its interactions in the earliest stages of protein folding, when a peptide chain remains extended [79, 80]. In order to describe the backbone behavior, the alanine heptapeptide geometry was analyzed [81, 82]. The authors defined two parameters: $V$, the dihedral angle between the planes of subsequent peptide bonds, and $R$, the radius of curvature formed by the mass centers of five subsequent peptide bonds. The relation between $V$ and $\ln R$ revealed a quadratic polynomial dependence, described by the following equation [82]:

$$\ln R = 0.00034\,V^{2} - 0.02009\,V + 0.848 \tag{1}$$
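As a small illustration, Eq. (1) can be evaluated directly. The sketch below is not taken from the cited work; the function names are hypothetical, and it assumes that $V$ is expressed in degrees, which the text above does not state explicitly.

```python
import math


def ln_radius_of_curvature(v_deg: float) -> float:
    """Evaluate Eq. (1): ln R as a quadratic polynomial of the angle V.

    Assumption: V is given in degrees (not stated explicitly in the text).
    """
    return 0.00034 * v_deg ** 2 - 0.02009 * v_deg + 0.848


def radius_of_curvature(v_deg: float) -> float:
    """Return R itself by exponentiating ln R from Eq. (1)."""
    return math.exp(ln_radius_of_curvature(v_deg))


if __name__ == "__main__":
    # Print ln R and R for a few illustrative V values.
    for v in (0.0, 60.0, 120.0, 180.0):
        print(v, ln_radius_of_curvature(v), radius_of_curvature(v))
```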

The projection of this function onto the Ramachandran plot (in φ and ψ angle coordinates) and its comparison with low energy areas resulted in the selection of an elliptical shape which conforms to the assumed conditions (Figures 4 and 5 from [82]). A curve fitted to this shape is described by the following system of parametric equations [82]:

$$\begin{cases}
\varphi = -125\cos(45^\circ)\cos(t) - 84\sin(45^\circ)\sin(t) \\
\psi = 125\sin(45^\circ)\cos(t) - 84\cos(45^\circ)\sin(t)
\end{cases} \tag{2}$$
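The parametric form of Eq. (2) can be turned into a short helper that returns a point of the ellipse for a given parameter value. This is a minimal sketch; the function name `ellipse_point` is hypothetical and the parameter $t$ is assumed to be given in degrees.

```python
import math


def ellipse_point(t_deg: float) -> tuple[float, float]:
    """Return (phi, psi) in degrees on the ellipse of Eq. (2).

    Assumption: the parameter t is supplied in degrees.
    """
    t = math.radians(t_deg)
    c45 = math.cos(math.radians(45.0))
    s45 = math.sin(math.radians(45.0))
    phi = -125.0 * c45 * math.cos(t) - 84.0 * s45 * math.sin(t)
    psi = 125.0 * s45 * math.cos(t) - 84.0 * c45 * math.sin(t)
    return phi, psi


if __name__ == "__main__":
    # Sample the ellipse every 90 degrees of the parameter t.
    for t in (0.0, 90.0, 180.0, 270.0):
        print(t, ellipse_point(t))
```

With Eq. (2) as written, t = 90° gives approximately (−59.4°, −59.4°), a point in the right-handed helical region of the Ramachandran plot, and t = 270° gives the mirrored point in the left-handed helical region.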

This ellipse crosses the areas of the Ramachandran plot which are typical for the main secondary structures: left- and right-handed helices and β-strands (Fig. 1a). Therefore, all such structures can be approximated by means of the limited conformational subspace defined by this elliptical pathway. The proposed ES model utilizes the ellipse as a selected set of backbone conformations available for early structures, providing initial structures for further simulations.

For a given three-dimensional structure its conformation, defined by a sequence of $(\varphi_i, \psi_i)$ angle pairs, can be projected onto the ellipse to the related $(\varphi_{ei}, \psi_{ei})$ points. These points are defined by the shortest distance between $(\varphi_i, \psi_i)$ and the ellipse. As a result of such projection a hypothetical early intermediate structure is generated. The early structure is characterized by an unfolded shape but with secondary structure characteristics remaining: α-helices are still present and β-sheets are transformed into separated extended fragments.
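The shortest-distance projection described above can be sketched numerically. The code below is only an illustration and not the procedure used in the cited papers: it samples the ellipse of Eq. (2) on a regular grid of the parameter t and keeps the nearest point. The function names, the grid step and the periodic wrapping of angle differences are all assumptions made for this example.

```python
import math


def ellipse_point(t_deg: float) -> tuple[float, float]:
    """Point (phi, psi) in degrees on the ellipse of Eq. (2); t in degrees."""
    t = math.radians(t_deg)
    c45 = math.cos(math.radians(45.0))
    s45 = math.sin(math.radians(45.0))
    phi = -125.0 * c45 * math.cos(t) - 84.0 * s45 * math.sin(t)
    psi = 125.0 * s45 * math.cos(t) - 84.0 * c45 * math.sin(t)
    return phi, psi


def angular_difference(a: float, b: float) -> float:
    """Signed difference a - b in degrees, wrapped to [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def project_onto_ellipse(phi: float, psi: float, step_deg: float = 0.1):
    """Return (t, (phi_e, psi_e)) for the sampled ellipse point closest to (phi, psi).

    Brute-force grid search over t; the wrapped angular distance is an
    assumption of this sketch, not a detail given in the text.
    """
    best_t, best_d2 = 0.0, float("inf")
    n_steps = int(round(360.0 / step_deg))
    for k in range(n_steps):
        t = k * step_deg
        phi_e, psi_e = ellipse_point(t)
        d2 = angular_difference(phi, phi_e) ** 2 + angular_difference(psi, psi_e) ** 2
        if d2 < best_d2:
            best_t, best_d2 = t, d2
    return best_t, ellipse_point(best_t)


if __name__ == "__main__":
    # Project an illustrative helical conformation onto the ellipse.
    print(project_onto_ellipse(-60.0, -45.0))
```

A grid search is used here for transparency; a finer step or a local refinement around the best grid point would improve precision at modest cost.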

The occurrence of local conformations in native structures was analyzed for a large nonredundant set of proteins available in the PDB by the projection of $(\varphi_i, \psi_i)$ pairs onto the ellipse to $(\varphi_{ei}, \psi_{ei})$ pairs. This procedure resulted in probability profiles of $(\varphi_e, \psi_e)$ occurrence for each of the twenty amino acids (Fig. 2a). A common feature of all twenty profiles is the presence of seven local maxima. According to the maxima each profile was divided into seven zones, each containing one of the maxima (Fig. 2b). With these seven fragments of the ellipse selected, the whole Ramachandran plot was divided into seven zones, labeled with the letters A-G and defining seven structural codes (Fig. 1b). These codes can be applied to describe the local backbone conformation beyond the standard secondary structure assignment.

To some extent, the structural codes (A-G) can be related to typical secondary structures: the C code to the α-helix, the E and F codes to β-strands. However, while the E code describes flat β structures, the F code is assigned to a more specific type of structure: tilted and ending parts of β-strands. Additionally, the structures described by the D code include closing fragments of helices and the G zone covers left-handed helices. Loops are represented by the codes A and B but they can sometimes also be found in all other zones. Describing a protein structure by a sequence of structural codes may be useful for structure similarity estimation, because the tools for amino acid sequence comparison can then be applied.
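Because a structure is reduced to a string over the alphabet A-G, even the simplest string comparison gives a rough similarity measure. The sketch below computes the fraction of identical positions between two equally long code strings; the function name and the example strings are hypothetical, and real sequence comparison tools would additionally handle gaps and substitution scores.

```python
def code_identity(codes_a: str, codes_b: str) -> float:
    """Fraction of positions with identical structural codes (A-G).

    Both strings must have the same non-zero length; no alignment is performed.
    """
    if len(codes_a) != len(codes_b):
        raise ValueError("code strings must have equal length")
    if not codes_a:
        raise ValueError("code strings must not be empty")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)


if __name__ == "__main__":
    # Two made-up code strings differing at one position.
    print(code_identity("CCCCEE", "CCCDEE"))
```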

In order to build a library including information about the sequence-to-structure relation, the amino acid sequences of a set of known protein structures were divided into tetrapeptides, which are considered the smallest chain units defining the local secondary structure. By assigning the structural codes to subsequent amino acids, structural motifs of four-letter codes are created. For the nonredundant set of proteins available in the PDB, the number of occurrences of each structural motif was calculated for all possible tetrapeptides. Then, the contingency table gathering the probabilities of structural motif occurrence for all possible tetrapeptides was built. The table size is 160000 (the number of tetrapeptides, $20^4$) x 2401 (the number of four-letter structural motifs, $7^4$).
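The counting step described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' implementation: the function name and the toy input are hypothetical, the table is stored sparsely as nested dictionaries, and the counts are normalized per tetrapeptide, which is an assumption, since the normalization is not specified in this section.

```python
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 letters -> 20**4 tetrapeptides
STRUCTURAL_CODES = "ABCDEFG"           # 7 codes -> 7**4 structural motifs


def build_contingency_table(chains):
    """Build a sparse tetrapeptide -> {motif: probability} table.

    `chains` is an iterable of (sequence, codes) pairs of equal length, where
    `codes` holds one structural code (A-G) per residue.
    Assumption: probabilities are normalized within each tetrapeptide.
    """
    counts = defaultdict(Counter)
    for sequence, codes in chains:
        if len(sequence) != len(codes):
            raise ValueError("sequence and codes must have equal length")
        # Slide a window of four residues along the chain.
        for i in range(len(sequence) - 3):
            counts[sequence[i:i + 4]][codes[i:i + 4]] += 1
    table = {}
    for tetrapeptide, motif_counts in counts.items():
        total = sum(motif_counts.values())
        table[tetrapeptide] = {m: n / total for m, n in motif_counts.items()}
    return table


if __name__ == "__main__":
    # Toy data only; real input would come from a nonredundant PDB subset.
    toy = [("ACDEF", "CCCCE"), ("ACDE", "CCCD")]
    print(build_contingency_table(toy))
    print(len(AMINO_ACIDS) ** 4, len(STRUCTURAL_CODES) ** 4)
```

A sparse dictionary is used because most of the 160000 x 2401 cells are expected to be empty for any realistic data set.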

This form of representing the sequence-to-structure relation was applied in the first method of early stage structure prediction. The method involves dividing the amino acid sequence whose structure is to be predicted into overlapping tetrapeptides. For each of them, the most probable structural motif is selected from the contingency table. As a result, four structural codes and their occurrence


Figure 1: a. The Ramachandran plot with the elliptical limited conformational subspace depicted, (φi, ψi) angles projection onto the ellipse. b. (φei, ψei) angles resulting from the projection of the initial conformations onto the ellipse.

Table 1: The structure of the contingency table for tetrapeptides (columns). Structural motifs defined by four-letter sequences of structural codes (A-G) are presented in rows.

                     Amino acid sequence - tetrapeptides
Structural motifs    AAAA            AAAC            ...    YYYY
AAAA                 p(AAAA|AAAA)    p(AAAC|AAAA)    ...    p(YYYY|AAAA)
AAAB                 p(AAAA|AAAB)    p(AAAC|AAAB)    ...    p(YYYY|AAAB)
...                  ...             ...             ...    ...


Figure 2: a. Probability profiles of (φei, ψei) occurrence for individual amino acids. (φei, ψei) is represented by the parameter t related to the ellipse definition; t increases clockwise along the ellipse from 0 degrees at the lower right end of its major axis to 360 degrees. b. The probability profile of (φei, ψei) occurrence for aspartate with the seven zones depicted. c. The


probabilities (except for the first and last three positions), each from a different tetrapeptide, are assigned to a single position in the amino acid sequence. Then, the structural code with the highest sum of probabilities is chosen. The method and its evaluation are described in detail in the article [A2]. In order to improve the ES prediction accuracy, the procedure was modified by replacing the tetrapeptides with fragments 1-13 amino acids long, gathered in the form of statistical dictionaries. Section 3.3 provides a summary of the publication [A3], which presents the latter prediction method.
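The selection of one code per position from overlapping tetrapeptides can be illustrated with a short sketch. It assumes a table in the form of a nested dictionary (tetrapeptide to motif to probability); the function name, the "?" marker for uncovered positions and the toy probabilities are hypothetical and serve only as an example.

```python
from collections import defaultdict


def predict_codes(sequence, table):
    """Choose one structural code per residue from overlapping tetrapeptides.

    `table` maps tetrapeptide -> {motif: probability}. For each tetrapeptide
    the most probable motif is taken; its probability is added to the score
    of the corresponding code at each of the four covered positions. The
    code with the highest summed probability wins. Positions not covered by
    any known tetrapeptide are marked '?'.
    """
    scores = [defaultdict(float) for _ in sequence]
    for start in range(len(sequence) - 3):
        motifs = table.get(sequence[start:start + 4])
        if not motifs:
            continue                      # tetrapeptide absent from the table
        best_motif, probability = max(motifs.items(), key=lambda item: item[1])
        for offset, code in enumerate(best_motif):
            scores[start + offset][code] += probability
    return "".join(max(s, key=s.get) if s else "?" for s in scores)


# Toy probabilities invented for illustration.
toy_table = {
    "ACDE": {"CCCC": 0.7, "EEEE": 0.3},
    "CDEF": {"CCCE": 0.6, "CCCC": 0.4},
    "DEFG": {"EEEE": 0.9, "CCCC": 0.1},
}
predicted = predict_codes("ACDEFG", toy_table)
print(predicted)   # CCCCEE
```

In the toy run, position 3 receives 1.3 for code C (0.7 + 0.6) and 0.9 for code E, so C is chosen even though the last tetrapeptide strongly favours E.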

Regardless of the procedure used to find the structural codes, a sequence of structural codes is created to represent the proposed early stage structure. The prediction is limited to choosing one of the seven conformational zones in the Ramachandran plot for each amino acid residue. In order to select one particular structure, a pair of (φmax, ψmax) angles equal to the appropriate maximum of the (φe, ψe) profile for a given amino acid and a given zone (A-G) is assigned. Each zone is represented by one pair of (φmax, ψmax) angles (Fig. 2c), but it should be borne in mind that such a conformation is only a starting point for further structure modelling. The general aim of the ES prediction is to suggest a probable secondary structure. The method is depicted schematically in Fig. 3.


Figure 3: The scheme of building the sequence-to-structure relation libraries (the gray part) and the procedure of ES structure prediction from an amino acid sequence of unknown 3D structure (the white part).

2.2 Late Stage

Hydrophobic interactions are assumed to be the dominant factor in the second stage of the folding simulation. Because a protein folds in an aqueous environment, hydrophobic amino acid residues are directed towards the center of the molecule, while polar residues are exposed on the surface. In order to obtain such properties, a force field with a three-dimensional Gaussian hydrophobicity function is applied. Because of the shape of this function, the model was called the Fuzzy Oil Drop (FOD), in reference to Kauzmann's Oil Drop model [84]. The difference is that, instead of the sharp, two-state hydrophobicity definition introduced by Kauzmann, in the FOD model the hydrophobicity increases continuously from the surface to the center of the molecule (Fig. 4). The theoretical hydrophobicity function is defined as follows:

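$$Ht_i = \frac{1}{Ht_{sum}} \exp\left(\frac{-(x_i - \bar{x})^2}{2\sigma_x^2}\right) \exp\left(\frac{-(y_i - \bar{y})^2}{2\sigma_y^2}\right) \exp\left(\frac{-(z_i - \bar{z})^2}{2\sigma_z^2}\right) \tag{3}$$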


Figure 4: The representation of the theoretical hydrophobicity distribution for an exemplary protein structure. The hydrophobicity reaches its maximum value in the center of the molecule (red color) and decreases to zero on the surface (blue color). The panel plots depict the X and Y dimensions of the Ht Gaussian function. The distances defined by the standard deviations (σ) are also shown. The origin of the coordinate system was located at the geometric center of the molecule, which also defines x̄ and ȳ.


The Ht_i value is calculated for each amino acid residue, represented by a point at the geometric center of its side chain. In the above equation, x̄, ȳ, z̄ are defined by the geometric center of the whole molecule, σx, σy, σz denote the standard deviations determined as described below, and Ht_sum is a normalizing constant. The parameters of the Ht function are determined by the size of the molecule. The maximum value should be assigned to the mass center of the molecule. This point defines the origin of the coordinate system, in which the X axis lies along the longest diameter of the molecule. Next, the longest diameter of the molecule projected onto a plane orthogonal to the X axis defines the Y axis. The Z axis must then be orthogonal to the other two axes. The standard deviations σx, σy, σz are chosen as the lowest values for which the whole molecule, with a 9 Å margin, is enclosed in the region defined by [x̄ ± 3σx, ȳ ± 3σy, z̄ ± 3σz]. The construction is based on the three-sigma rule, which states that 99.7% of values lie within three standard deviations of the mean.
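A simplified numerical sketch of equation (3) is given below. It assumes that the coordinates have already been rotated so that the axes follow the longest extents of the molecule (the orientation step itself is omitted), and it uses the mean of the points as the center. The function name and the toy coordinates are hypothetical.

```python
import math

MARGIN = 9.0  # Å, margin added around the molecule (value from the text)


def theoretical_hydrophobicity(points):
    """Normalised 3D Gaussian value (Ht) for each point.

    `points` are (x, y, z) positions of the side-chain geometric centers,
    assumed to be expressed in the molecule-aligned coordinate system.
    """
    center = [sum(p[k] for p in points) / len(points) for k in range(3)]
    # Smallest sigma per axis such that center ± 3*sigma encloses the
    # farthest point plus the margin (three-sigma rule).
    sigmas = [
        (max(abs(p[k] - center[k]) for p in points) + MARGIN) / 3.0
        for k in range(3)
    ]
    raw = [
        math.exp(-sum((p[k] - center[k]) ** 2 / (2.0 * sigmas[k] ** 2)
                      for k in range(3)))
        for p in points
    ]
    total = sum(raw)                      # plays the role of Ht_sum
    return [value / total for value in raw]


# Toy coordinates invented for illustration.
toy_points = [(0.0, 0.0, 0.0), (6.0, 0.0, 0.0), (-6.0, 0.0, 0.0),
              (0.0, 4.0, 0.0), (0.0, -4.0, 0.0)]
ht = theoretical_hydrophobicity(toy_points)
print(ht.index(max(ht)))   # 0: the central point has the highest Ht
```

The product of the three exponentials in equation (3) is computed here as a single exponential of the summed exponents, which is mathematically equivalent.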

Once the theoretical hydrophobicity has been calculated, it can be compared to the hydrophobicity observed in a given molecule. The observed hydrophobicity results from the properties of the amino acid residues and the pairwise interactions between them [85]. It is described as follows:

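$$Ho_i = \frac{1}{Ho_{sum}} \sum_{j=1}^{N} \left(H^r_i + H^r_j\right) \begin{cases} 1 - \frac{1}{2}\left[7\left(\frac{r_{ij}}{c}\right)^2 - 9\left(\frac{r_{ij}}{c}\right)^4 + 5\left(\frac{r_{ij}}{c}\right)^6 - \left(\frac{r_{ij}}{c}\right)^8\right] & \text{for } r_{ij} \le c, \\ 0 & \text{for } r_{ij} > c \end{cases} \tag{4}$$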

where r_ij denotes the distance between the geometric centers of the i-th and j-th amino acid residues, H^r_i is the amino acid hydrophobicity taken from a selected hydrophobicity scale [86], and Ho_sum is a normalization factor. c is the cut-off radius for hydrophobic interactions, defined as 9 Å (according to [85]).
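The observed hydrophobicity of equation (4) can be sketched in a few lines. The sketch follows the formula literally, so the sum runs over all j including j = i; whether the self term is included in the original implementation is an assumption. The function names, coordinates and scale values below are hypothetical.

```python
import math

CUTOFF = 9.0  # Å, cut-off radius c from the text


def distance_weight(distance, cutoff=CUTOFF):
    """Distance-dependent factor of equation (4): 1 at r = 0, 0 at r >= c."""
    if distance > cutoff:
        return 0.0
    q = (distance / cutoff) ** 2
    return 1.0 - 0.5 * (7.0 * q - 9.0 * q ** 2 + 5.0 * q ** 3 - q ** 4)


def observed_hydrophobicity(points, scale_values):
    """Normalised observed hydrophobicity (Ho) for each residue.

    `points` are (x, y, z) side-chain centers; `scale_values[i]` is the
    hydrophobicity of residue i taken from some hydrophobicity scale.
    """
    raw = []
    for i, point_i in enumerate(points):
        total = 0.0
        for j, point_j in enumerate(points):
            weight = distance_weight(math.dist(point_i, point_j))
            total += (scale_values[i] + scale_values[j]) * weight
        raw.append(total)
    norm = sum(raw)                       # plays the role of Ho_sum
    return [value / norm for value in raw]


# Toy coordinates and scale values invented for illustration.
toy_points = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_scale = [0.9, 0.7, 0.2]
ho = observed_hydrophobicity(toy_points, toy_scale)
```

The third toy residue lies beyond the cut-off from both others, so it contributes only through its own term and receives the smallest Ho value.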

The comparison of the Ht and Ho distributions throughout a protein molecule is the basis for evaluating how well the molecule agrees with the assumed theoretical hydrophobicity. During the folding simulation, the following quantity, which measures the divergence from the FOD model, is minimized:

$$\Delta H = \sum_{i=1}^{N} (Ht_i - Ho_i)^2 \tag{5}$$

The FOD model represents an ideal case of the hydrophobic effect for a protein chain. Real protein structures show widely varying levels of discrepancy between the observed hydrophobicity distribution and the FOD model. High accordance was observed for antifreeze [87] and downhill proteins [88]. In the first case, the agreement is related to their function, which is protection against low temperatures. Their polar surface promotes the formation of a considerable number of hydrogen bonds with water molecules, which prevents the surrounding water from freezing. Downhill proteins are rather small and fold especially fast. They fold spontaneously in water, without any additional support (such as chaperones), which indicates a dominant role of hydrophobic collapse in the process. The FOD model cannot currently be applied to membrane proteins because it assumes an aqueous environment around the molecule.

All differences between the observed and expected hydrophobicity are crucial for protein function. Protein complexation and ligand binding sites require such discrepancies to be functionally active. It is claimed that the tendency to conform to an idealized structure, close to the FOD model, plays an important role not only in protein folding but also in protein interactions with external molecules. A quantitative description of the hydrophobicity distribution in a molecule can be a useful tool for predicting ligand binding and protein complexation sites [89–91]. On the basis of the definition of divergence entropy introduced by Kullback and Leibler [92]:

$$D_{KL}(p|p^0) = \sum_{i=1}^{N} p_i \log_2\left(\frac{p_i}{p^0_i}\right), \tag{6}$$


a quantitative measure of accordance between the observed hydrophobicity and the FOD model was proposed. By analogy to the D_KL parameter, which determines the distance between two probability distributions (p and p^0), the distance between the observed and theoretical hydrophobicity can be evaluated by means of the O/T parameter:

$$O/T = \sum_{i=1}^{N} Ho_i \log_2\left(\frac{Ho_i}{Ht_i}\right) \tag{7}$$

The distance is calculated between the Ht and Ho distributions, which are defined by equations (3) and (4). The O/T value alone cannot be interpreted, because it depends on the length of the protein chain. Therefore, another value was defined:

$$O/R = \sum_{i=1}^{N} Ho_i \log_2\left(\frac{Ho_i}{Hr_i}\right) \tag{8}$$

O/R expresses the distance between the observed hydrophobicity distribution and the uniform distribution, in which the hydrophobicity of each residue is the same. The uniform hydrophobicity value equals Hr_i = 1/N, where N is the number of residues. The comparison of O/T and O/R indicates whether the observed hydrophobicity distribution is more similar to the theoretical or to the uniform (random) one. Molecules for which O/T < O/R are regarded as accordant with the FOD model. The RD parameter was also introduced in order to facilitate the estimation of this agreement:

$$RD = \frac{O/T}{O/R + O/T} \tag{9}$$

By means of the RD value, protein structures with RD < 0.5 are characterized as consistent with the expected Gaussian hydrophobicity distribution. The divergence entropy, defined as described above, can be calculated for an entire molecule but also for its parts, such as a domain, an exon, a selected secondary structure, or a whole protein with some residues excluded (for example, catalytic ones). Such a procedure aims at analyzing the involvement of the chosen fragments in structure stabilization.
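Equations (7)-(9) translate directly into code. The sketch below assumes that both distributions are already normalised to sum to 1 and that Ht has no zero entries; the function names and the toy distributions are hypothetical. Terms with Ho_i = 0 are skipped, following the usual convention that 0 log 0 = 0.

```python
import math


def divergence(observed, reference):
    """Kullback-Leibler-style distance of `observed` from `reference` (bits)."""
    return sum(o * math.log2(o / r)
               for o, r in zip(observed, reference) if o > 0)


def relative_distance(ho, ht):
    """RD = O/T / (O/R + O/T); values below 0.5 indicate FOD accordance."""
    n = len(ho)
    uniform = [1.0 / n] * n               # Hr_i = 1/N for every residue
    o_t = divergence(ho, ht)              # equation (7)
    o_r = divergence(ho, uniform)         # equation (8)
    return o_t / (o_r + o_t)              # equation (9)


# Toy distributions invented for illustration.
ht = [0.1, 0.2, 0.4, 0.2, 0.1]
ho_close = [0.12, 0.18, 0.4, 0.2, 0.1]
rd = relative_distance(ho_close, ht)
print(rd < 0.5)   # True: the observed profile is closer to Ht than to uniform
```

Two limiting cases follow from the definition: RD = 0 when Ho equals Ht, and RD = 1 when Ho is exactly uniform. The ratio is undefined when Ho, Ht and the uniform distribution all coincide, which this sketch does not guard against.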


2.3 Folding simulation

The ES and LS models are intended to be applied in a protein structure prediction tool, which is being built at Cyfronet by means of the PL-Grid infrastructure. The program predicts the tertiary protein structure for any amino acid sequence. The software is currently being tested and the simulation parameters are being estimated. The prediction procedure is as follows:

1. For a given amino acid sequence, a sequence of structural codes is determined on the basis of the amino acid sequence and the sequence-to-structure relation library in the form of the contingency table or the statistical dictionaries.

2. Having the structural codes, a three dimensional model of the early stage structure is built. Occurring collisions are removed by applying algorithms presented in [93].

3. A three dimensional Gaussian hydrophobicity function is introduced in a way described in Section 2.2.

4. The internal energy of the molecule, which includes pairwise atomic interactions such as electrostatic, van der Waals and torsional interactions, is minimized.

5. The structure optimization in relation to the theoretical hydrophobicity is performed. The parameter ∆H (equation (5)) evaluating the difference between observed and expected hydrophobicity is minimized.

6. The internal energy minimization is carried out again as in the fourth step.

7. The size and shape of the theoretical hydrophobicity distribution are adjusted to the new features of the molecule.

8. The size of the Fuzzy Oil Drop is estimated, in order to stop the procedure when it is as small as predicted from its dependence on the chain length, on the basis of the results presented in [94]. Unless the expected size is obtained, steps 5-8 are repeated.
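The control flow of steps 4-8 above can be summarised as a loop. The sketch below shows only the iteration logic: every callable in `steps` is a hypothetical placeholder for the corresponding procedure, and the `max_cycles` safeguard is an added assumption, not part of the described tool.

```python
def fold(structure, steps, target_size, max_cycles=100):
    """Iterate steps 5-8 until the oil drop reaches the expected size.

    `steps` is a dict of hypothetical callables:
      minimize_energy(structure)  -> structure   (steps 4 and 6)
      minimize_delta_h(structure) -> structure   (step 5, equation (5))
      refit_drop(structure)       -> drop        (step 7)
      drop_size(drop)             -> float       (step 8)
    """
    structure = steps["minimize_energy"](structure)          # step 4
    for _ in range(max_cycles):
        structure = steps["minimize_delta_h"](structure)     # step 5
        structure = steps["minimize_energy"](structure)      # step 6
        drop = steps["refit_drop"](structure)                # step 7
        if steps["drop_size"](drop) <= target_size:          # step 8
            break
    return structure


# Dummy stand-ins: the "structure" is just a number that shrinks by half
# in every hydrophobic optimisation step.
toy_steps = {
    "minimize_energy": lambda s: s,
    "minimize_delta_h": lambda s: s * 0.5,
    "refit_drop": lambda s: s,
    "drop_size": lambda d: d,
}
final = fold(100.0, toy_steps, target_size=30.0)
print(final)   # 25.0
```

Passing the procedures in as callables keeps the loop independent of any particular force field or minimizer.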

3 Included papers: summary and comments

During my PhD studies I focused on building and testing a tool for the Early Stage intermediate structure prediction. I also participated in the application of the FOD model to the analysis of hydrophobicity distribution in native protein structures. The analysis covered the characterization of intrinsically disordered proteins, DNA binding proteins and the protein structures with the substantial engagement of disulfide bonds.

3.1 Simulation of the protein folding process

The chapter in the book "Computational Methods to Study the Structure and Dynamics of Biomolecules and Biomolecular Processes" introduces the procedure of protein folding simulation, which is based on two types of intermediates – the Early Stage and the Late Stage. The assumptions of these two models are explained and some aspects of their application are considered. Also, the introduction of the two stages in the simulation is validated from the point of view of information theory.

The publication describes the representation of the early intermediate structure by means of the seven structural codes and presents the ES prediction method based on the contingency table for tetrapeptides. The information required to predict the ES was evaluated by dividing the Ramachandran plot into squares of 5 x 5 degrees and calculating the probability of finding the amino acids in a given square for a nonredundant set of protein structures from the PDB. The calculation revealed that the amount of information carried by an amino acid sequence is comparable to (and therefore sufficient for) the information necessary to predict the ES. The detailed studies showed that the amount of information required by specific residues increases for residues which are conserved or


engaged in external interactions. Additionally, some proteins appeared to have a native structure close to the ES structure, which therefore does not require further simulation.
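The binning of the Ramachandran plot described above can be illustrated with a small sketch. The exact information measure used in the publication is not given here, so the Shannon entropy of the bin occupancy is used as an assumed stand-in; the function names and the toy angles are hypothetical.

```python
import math
from collections import Counter

BIN_SIZE = 5.0  # degrees, side of one grid square


def bin_index(phi, psi, bin_size=BIN_SIZE):
    """Map angles in [-180, 180) to the (column, row) of a grid square."""
    return (int((phi + 180.0) // bin_size), int((psi + 180.0) // bin_size))


def shannon_entropy(angle_pairs):
    """Entropy (in bits) of the distribution of (phi, psi) pairs over squares."""
    counts = Counter(bin_index(phi, psi) for phi, psi in angle_pairs)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Toy angles invented for illustration: two pairs share one square.
toy_angles = [(-60.0, -45.0), (-58.0, -43.0), (-120.0, 130.0), (60.0, 45.0)]
entropy = shannon_entropy(toy_angles)
print(entropy)   # 1.5
```

With occupancies of 0.5, 0.25 and 0.25 the entropy is 0.5 * 1 + 2 * (0.25 * 2) = 1.5 bits; applied per amino acid type to a real data set, the same calculation would give the average information needed to specify a square.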

In the publication, the Fuzzy Oil Drop model was described in the context of the folding simulation. The model is applied to provide the external force field which forces the protein molecule to form a well-defined hydrophobic core. Because most protein structures do not attain a perfect hydrophobic core, several factors disturbing it were investigated by means of the divergence entropy. For a set of exemplary structures it was shown that positions characterized by a difference between Ht and Ho coincide with residues engaged in ligand binding (guanidinoacetate methyltransferase), active site formation (3-oxoacyl-(acyl-carrier protein) synthase II) and protein-protein complexation (hemoglobin). After removal of such residues from the FOD calculation, the accordance with the model increases for these structures. This procedure was also applied in further works in order to determine whether these functional residues contribute to a larger distance between the observed and theoretical hydrophobicity distributions. It was also shown that the fragments of a protein chain coded by different exons can differ significantly in their accordance with the FOD model. The potential application of the FOD model to determining the influence of a point mutation on the hydrophobic properties of a protein was described. For a set of antifreeze protein mutants, the observed hydrophobicity profile revealed how a point substitution affects not only its surroundings but also distant sequence fragments.

The amount of information required to determine the native structure can be decreased by introducing the ES intermediate. The remaining information is provided by applying the internal and external (defined by the FOD model) force fields. The formation of the final structure can be facilitated by the participation of additional molecules, such as ligands or chaperones, in the process of protein folding. Their presence is a factor modifying the shape of the hydrophobicity distribution in a way that leads to the formation of properties enabling the protein


biological function. Such changes in external force field should be considered in the procedure of folding simulation.

3.2 Hypothetical in silico model of the early-stage intermediate in protein folding

This publication presents the first approach to determining the initial structure for the process of three-dimensional structure prediction on the basis of an amino acid sequence. The ES model involving the seven structural codes (A-G) related to the seven zones in the Ramachandran plot (explained in the previous section) was applied. The ES structure prediction was performed by means of the contingency table, which contains information about the sequence-to-structure relation for tetrapeptides.

The amino acid sequence of unknown three-dimensional structure is divided into consecutive, overlapping tetrapeptides. For each of them, the structural motif (a sequence of four structural codes) with the highest probability of occurrence is chosen from the contingency table. Whenever a certain tetrapeptide is missing from the table, its position is skipped. As a result, at most four structural codes are assigned to each amino acid position in the sequence. Then, one consensus sequence of structural codes is selected by choosing the most frequent structural code for each position. Finally, the ES structure is obtained in the form of a sequence of structural codes with the relevant pairs of (φi, ψi) angles.

The main aim of the paper was to evaluate the accuracy of the prediction method. For this purpose, two nonredundant (at most 95% sequence identity) and disjoint sets of protein structures were selected from the PDB - a training set of ca. 25 000 protein structures and a randomly chosen test set of 250 proteins. On the basis of the training set, a contingency table for tetrapeptides and four-letter structural motifs (of A-G structural codes) was built (as in Table 1). By means of this table, ES structures were determined for the whole test set as described above (the step-forward procedure). The known


native structure of each protein from the test set was also presented in the form of a structural code sequence, defined by the Ramachandran plot zones (A-G) related to consecutive (φi, ψi) angles (the step-back procedure). The prediction accuracy was evaluated by comparing the two structural code sequences determined for each test protein – the predicted one (from the step-forward procedure) and the expected one (from the step-back procedure).

The predictive power of the method was expressed as the percentage of correctly determined structural codes. The overall accuracy was estimated at 48%. The method was also evaluated for each of the seven structural codes separately. The results showed that the overprediction of the C code (representing α-helices) was one of the major problems and led to the underprediction of the E and F codes (β-structures). Additionally, the less frequent structures (A, B, D and G codes) were too often predicted incorrectly. Nevertheless, almost all codes (except for A) were determined most frequently for accordant native conformations. The analysis of the most and the least accurately predicted structures revealed that the best results were obtained for small (about 100 amino acids) and mainly helical proteins. Large proteins with diverse secondary structures and a substantial content of β-strands were found in the group with rather low prediction accuracy, but also amongst proteins whose ES structures were in major part correctly determined. An example of a fairly accurately predicted ES structure (PDB ID: 1PCZ), together with its native structure and the structure obtained in the step-back procedure, is depicted in Fig. 5.
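The accuracy measures described above amount to a position-by-position comparison of two code strings. A minimal sketch is shown below; the function names and the two toy strings are hypothetical and are not results from the publication.

```python
from collections import Counter


def overall_accuracy(predicted, expected):
    """Percentage of positions whose predicted code equals the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("sequences must have equal length")
    hits = sum(p == e for p, e in zip(predicted, expected))
    return 100.0 * hits / len(expected)


def per_code_accuracy(predicted, expected):
    """For each expected code, the percentage of its positions predicted correctly."""
    totals = Counter(expected)
    hits = Counter(e for p, e in zip(predicted, expected) if p == e)
    return {code: 100.0 * hits[code] / totals[code] for code in sorted(totals)}


# Toy strings invented for illustration.
step_forward = "CCCCEE"   # predicted codes
step_back = "CCCEEA"      # codes derived from the native structure
print(per_code_accuracy(step_forward, step_back))
# {'A': 0.0, 'C': 100.0, 'E': 50.0}
```

In the toy example the overall accuracy is 4 of 6 positions, and the per-code breakdown shows the pattern discussed in the text: the frequent C code is recovered well, while a rare code such as A is missed.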

We also analyzed how amino acid residues engaged in interactions with ligands, nucleic acids or other proteins affect the results of ES prediction. It was shown that such interactions (except for interactions with nucleic acids) are a significant factor decreasing the prediction accuracy. This result suggested a need to modify the contingency table by excluding amino acid residues participating in external interactions. This observation is consistent with the hypothesis that additional external interactions, for example with ligands, can be important for correct protein folding because they enforce the protein features required for its proper activity.


Figure 5: The comparison between the native, step-back and step-forward structures of an exemplary protein (PDB ID: 1PCZ). a. (φi, ψi) distribution for the native structure. b. The 3D native structure. c. Results of the (φi, ψi) projection onto the elliptical pathway. d. The structure resulting from the step-back procedure - a reference point for accuracy evaluation. e. (φi, ψi) angles proposed for the predicted ES structure. Because of the limited resolution along the elliptical pathway, the details of the prediction of the C and E codes for the expected secondary structures are depicted below in the form of bar charts. f. The predicted ES structure, which is the


3.3 Statistical dictionaries for hypothetical in silico model of the early-stage intermediate in protein folding

The analysis of the ES structure prediction method based on information about tetrapeptide conformations, which was presented in the previous article, revealed that such information is insufficient and that longer peptides need to be included. Another problem was the absence of some tetrapeptides from the sequences in the training set, which suggested the use of shorter peptides (even single amino acids) when required. This publication presents the second approach to ES structure determination, based on statistical dictionaries, which are a new form of representing the sequence-to-structure relation. Statistical dictionaries assign to every peptide of odd length between 1 and 13 amino acid residues found in the training set the most probable structural code (A-G) for its middle position. The dictionaries were built by dividing the sequences from the training set into peptides of odd lengths (1, 3, ..., 13) and counting the structural codes related to the known conformations of the central amino acid residues. Then, the most probable central structural code is selected and ascribed to a given peptide.

During the ES structure prediction, for each position in the amino acid sequence the widest possible surroundings – between 0 and 6 amino acid residues to the right and to the left – present in the statistical dictionaries (and thus also in the training set) are chosen. The conformation of an amino acid is strongly influenced by the characteristics of its neighboring residues. It is claimed that the longer the fragments sharing the same amino acid sequence, the higher the probability of their structural similarity. Therefore, the longest possible fragment is sought in the dictionary first, and if it is not found, progressively shorter peptides are looked for. With the growing length of the fragment, however, the chance of finding it in the dictionary decreases. The worst-case scenario is the use of information about the structural preferences of a single amino acid residue.
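The dictionary construction and the longest-window lookup can be sketched as follows. The sketch assumes symmetric windows around the central residue (consistent with the odd peptide lengths), a single merged dictionary for all lengths, and a "?" marker for residues never seen in training; the function names and toy chains are hypothetical.

```python
from collections import Counter, defaultdict

MAX_FLANK = 6  # residues on each side, giving peptide lengths 1, 3, ..., 13


def build_dictionary(chains, max_flank=MAX_FLANK):
    """Map each odd-length peptide to the most frequent code of its center."""
    counts = defaultdict(Counter)
    for sequence, codes in chains:
        for center in range(len(sequence)):
            for flank in range(max_flank + 1):
                lo, hi = center - flank, center + flank + 1
                if lo < 0 or hi > len(sequence):
                    break                 # wider windows would not fit either
                counts[sequence[lo:hi]][codes[center]] += 1
    return {pep: c.most_common(1)[0][0] for pep, c in counts.items()}


def predict(sequence, dictionary, max_flank=MAX_FLANK):
    """For each residue use the widest window found in the dictionary."""
    result = []
    for center in range(len(sequence)):
        code = "?"                        # residue never seen in training
        for flank in range(max_flank, -1, -1):
            lo, hi = center - flank, center + flank + 1
            if lo < 0 or hi > len(sequence):
                continue                  # window does not fit, try a shorter one
            hit = dictionary.get(sequence[lo:hi])
            if hit is not None:
                code = hit
                break
        result.append(code)
    return "".join(result)


# Toy training chains invented for illustration.
toy_chains = [("MAVLK", "ACCCE"), ("GAVIS", "AEEEB")]
dictionary = build_dictionary(toy_chains)
print(predict("MAVIS", dictionary))   # ACEEB
```

In the toy query, the full pentapeptide MAVIS is absent from the dictionary, so the central V falls back to the tripeptide AVI, which was seen with code E.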


For the purpose of comparison between this method and the one based on tetrapeptides, the same training and test sets of proteins were used. The level of predictability was also estimated by means of the same method, which compares the predicted structural codes with the codes determined in the step-back procedure, that is, by identifying which of the seven zones in the Ramachandran plot includes the local native amino acid conformation. As in the Q3 method, the percentage of correctly predicted structural codes in relation to the whole test set size was calculated. The results showed that the proper structural code is proposed for 57% of the positions, which is a significantly better result than for the previous method. A more detailed analysis revealed a lower frequency of C code (α-helix) prediction but also increased predictability of the D and F structures. An increase in the correct determination of the other codes was observed for some amino acids. Several examples of proteins with diverse secondary structures were also presented in the publication, and it was shown that a high fraction of correctly predicted structural codes (above 90%) can be obtained both for small α-helical proteins and for large proteins (> 400 aa) with a dominating β structure.

Additionally, a comparison with a common method of secondary structure prediction – SPINE-X [95] – was performed. This analysis showed that the ES method is more accurate in the prediction of α-helices and β-strands, but the problem of overlooking random coils remains. A possible reason is that part of the random coil (φ, ψ) representation appears in areas characteristic of α-helices and β-strands (an example is shown in Fig. 6).

The main consequence of the analysis was the incorporation of the presented method into the complex tool for three-dimensional structure determination, where the proposed ES structure is the first intermediate state and is used as a starting point for the subsequent procedures, such as the Fuzzy Oil Drop application and energy minimization.


Figure 6: The (φi, ψi) locations in the Ramachandran plot for the exemplary protein (PDB ID: 1PCZ) and their relation to the occurrence of secondary structures - H (cyan) - α-helices, S (purple) - β-strands, L (orange) - loops.

3.4 Contingency Table Browser – prediction of early stage protein structure

The limited accuracy of the ES structure prediction motivated us to build software to facilitate the analysis of the contingency table containing information about the sequence-to-structure relation for tetrapeptides. The Contingency Table Browser (CTB) is available on the website: https://tool.u-q.pl:8445/display/CTB/Home. Its main function is the visualization of a contingency table, which can be any data supplied by the user in table form (of any size), the only condition being that it is in a specific text format. The main idea was to analyze the table of occurrence frequencies of four-letter structural motifs based on the seven structural codes (A-G) for tetrapeptides. However, the user can apply any four-letter definition of the data. The data can also be sorted by columns and by rows, or a chosen part of the table can be selected (Fig. 7a).


Figure 7: a) A fragment of the contingency table for a range of consecutive tetrapeptides of protein 2QRW (PDB code) visualised by means of the Contingency Table Browser. Columns correspond to tetrapeptide sequences while rows correspond to four-letter structural motifs. b) The number of occurrences of structural motifs in the training protein set for a specific tetrapeptide (VVSI), presented as a bar chart generated by the CTB program.


The table is visualized as a picture divided into columns and rows, where each cell is one pixel whose brightness represents the frequency (or probability, if the user prefers) of occurrence of a four-letter structural motif for a selected tetrapeptide. The user can customize the pixel display by gamma correction, by boosting high or low values, or by a binary division of pixels into zero and non-zero values. Because of the considerably large size of the contingency table, the picture is sometimes reduced, and in such cases one pixel represents more than one value. However, the image can always be zoomed to enable the detailed display of values. In addition, the program allows the user to generate bar charts of occurrence frequency for both the tetrapeptides and the four-letter structural motifs (Fig. 7b).

Although the CTB program does not perform any statistical analysis, it can facilitate work with the table, which is especially difficult because of its large size. The software makes it possible to observe some dependencies for selected amino acids or specific structural codes. It can also be useful in ES structure determination for a specific protein by visualizing fragments of highly ambiguous conformation, where, for example, two secondary structures are equally probable.

3.5 Intrinsically disordered proteins - relation to general model expressing the active role of the water environment

The analysis of the hydrophobicity distribution based on the FOD model was applied to a group of intrinsically disordered proteins. Such proteins are characterized by the presence of fragments which do not attain any stable three-dimensional structure during the folding process. These fragments are susceptible to structural changes and therefore perform specific functions, such as creating interdomain linkages or participating in complexation. Structures with such disordered fragments are gathered in the DisProt database [96]. The analysis covered the proteins from this database whose structures, including the unstructured fragment, have been deposited in the PDB. The RD parameter – the


distance between the theoretical and observed hydrophobicity distributions (explained in Section 2.2), based on the divergence entropy definition, was calculated not only for whole protein molecules but also for selected structural units such as chains, domains or disordered fragments. This procedure made it possible to check whether the units form a hydrophobic core independently of the rest of the molecule, and how closely the hydrophobicity distribution of a selected fragment agrees with the FOD model.

Amongst the analyzed proteins, two main tendencies were observed. In one group, both the whole chain and the disordered fragments were consistent with the expected (FOD-like) hydrophobicity distribution, while in the other group both units diverged from the FOD model. This suggests that such unstructured fragments either adopt their conformation in line with the expected hydrophobicity distribution or, lacking a well-formed hydrophobic core, are unable to attain a stable structure. In order to better understand the role which the disordered fragments play in hydrophobic core formation, several exemplary structures of different characteristics were analyzed in detail.

Gamma subunit domain cGMP phosphodiesterase was an example of a protein with a distinct hydrophobic core. It contains two disordered fragments, one of which was shown to be discordant with the FOD model. To present the case of a structure far from the hydrophobicity distribution assumed by the FOD model, the homodimer of triosephosphate isomerase was chosen. Both monomers include one unstructured fragment, which is discordant with the expected hydrophobicity. The RD calculation for the isolated monomers revealed that they are characterized by the presence of hydrophobic cores, and that the status of the disordered fragments changes towards greater accordance with the FOD model when the hydrophobicity of a single monomer is considered. This observation suggests a possible sequence of events during the folding process, namely that the monomers folded individually before forming the dimer.

The most complicated analyzed protein was the complex of hydrolases and its inhibitor – a structure made of two different chains, containing a ligand and a mercury atom, with seven disordered fragments observed. The detailed


RD estimation for different structural units allowed us to infer that, in spite of a shared hydrophobic core for the complex domains, one of the chains exhibits a high level of FOD accordance. Additionally, because almost all disordered fragments (all except one) are also consistent with the model, it can be claimed that they stabilize the complex structure. These facts can be applied to the reconstruction of the folding process (from domains to chains) and of complex formation, which is, besides structure prediction, another application of the FOD model.

3.6 Application of divergence entropy to characterize the structure of the hydrophobic core in DNA interacting proteins

The next article continues the analysis of proteins containing disordered fragments using the FOD model. A group of protein structures from the Jumonji family was chosen. Proteins from this family participate in the regulation of gene transcription through histone methylation (lysine-specific demethylase 5A). They consist of several domains performing different functions. The chosen structures belong to three domain groups: the AT-rich interactive domain (ARID), the plant homeodomain (PHD) type zinc finger and the C-terminal PHD finger. To quantitatively evaluate the accordance of each structure with the theoretical model of hydrophobicity distribution, the RD parameter (explained in Section 2.2) was calculated. RD was determined for whole molecules and for complexes with ligands, but also for parts of structures such as disordered fragments and secondary structure elements. Special attention was paid to the β-hairpin present in all domains, which takes part in DNA and histone binding (in ARID and PHD domains, respectively).
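The RD calculation for a whole molecule and for a selected fragment can be illustrated with a minimal sketch. It assumes the definition commonly used with the FOD model (the thesis defines RD in Section 2.2, which is not reproduced here): RD is the Kullback-Leibler divergence of the observed profile O from the theoretical profile T, divided by the sum of that divergence and the divergence of O from a uniform reference profile R. Under this assumed convention, RD below 0.5 is read as accordance with the theoretical model. The function names and the `indices` parameter are hypothetical and serve illustration only.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p||q) in bits.

    Terms with p_i == 0 contribute nothing; q is assumed strictly positive.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def normalize(values):
    """Rescale non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def relative_distance(observed, theoretical, indices=None):
    """Assumed RD definition: D(O||T) / (D(O||T) + D(O||R)), with R uniform.

    If `indices` is given, only those residues are kept and both profiles
    are renormalized, which mimics the evaluation of a chain fragment.
    """
    if indices is not None:
        observed = [observed[i] for i in indices]
        theoretical = [theoretical[i] for i in indices]
    o = normalize(observed)
    t = normalize(theoretical)
    r = [1.0 / len(o)] * len(o)  # uniform reference profile
    d_ot = kl_divergence(o, t)
    d_or = kl_divergence(o, r)
    if d_ot + d_or == 0:
        raise ValueError("RD is undefined when O equals both T and R")
    return d_ot / (d_ot + d_or)
```

Calling `relative_distance(O, T)` gives the value for the whole unit, while `relative_distance(O, T, indices=fragment)` gives the status of a fragment after renormalization over its residues.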

RD calculations showed consistency between the observed and theoretical hydrophobicity distributions for almost all structures, which indicates the presence of stable hydrophobic cores. In the ARID domains, all the β-hairpins were accordant with the FOD model (except in the 2RQ5 protein), and similar hydrophobicity distributions for secondary structures were observed across all these molecules. In the PHD domains, the β-hairpin motifs partially agree with the assumed model of hydrophobicity in structures without ligands, but in complexes with ligands, where they play a significant role, their characteristics drift away from the expected hydrophobicity distribution. The results show that the complexation process for these domains stabilizes the hydrophobic core and therefore the whole complex structure. The effect exerted by the disordered fragments on structure stability was not unambiguous: they diverged significantly from the expected hydrophobicity in some structures (as in 2E6R, which includes a well-defined hydrophobic core) and strongly stabilized the tertiary structure in others (as in 2MA5, when compared with the structure analyzed without the unstructured fragment).

The results presented in the work suggest a possible interpretation of the discordance with the FOD model observed for disordered fragments. It was shown that groups of domains, for example ARID, share similar hydrophobicity characteristics. Through the differences between functionally and structurally close domains, this approach can also explain conformational changes and stabilizing factors resulting from complexation with other molecules.

3.7 Role of disulfide bonds in stabilizing the conformation of selected enzymes – an approach based on divergence entropy applied to the structure of hydrophobic core in proteins

Hydrophobic interactions, especially the formation of a hydrophobic core, are among the most important factors stabilizing the tertiary structure of proteins. Some proteins also require additional strong stabilization mechanisms, one of which is the formation of disulfide bonds. Such bonds are, for example, frequently observed in extracellular proteins. The publication analyzes the influence of both effects on protein structure stabilization. The work focuses on enzymes with a well-known function and known location of catalytic residues. For this group, specific structural flexibility is claimed to be crucial for the biological role. The chosen group of enzymes included human lysozyme, phospholipase, disulfide isomerase, neurotoxin phospholipase, carboxylic esterase and multi-domain transferase. These proteins contain between 2 and 7 disulfide bonds per domain.

The hydrophobicity distributions in the analyzed molecules were evaluated by means of the FOD model. We quantified the difference between the observed hydrophobicity distribution and the theoretically assumed one, defined by the Gaussian function. The parameter that most directly expresses the accordance between the model and the observed protein characteristics is RD (explained in Section 2.2). This work involved the calculation of RD for whole molecules, selected domains and chain fragments defined by the location of cysteines participating in disulfide bonds. Such an approach enables us to answer the questions of whether and how disulfide bonds stabilize structures which are unstable with respect to hydrophobic core formation.
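The theoretical distribution defined by the Gaussian function can be sketched as follows. This is a simplified illustration, not the procedure of Section 2.2: it assumes that each residue is represented by one point (an effective atom), that the molecule is already oriented along the coordinate axes, and that the three dispersion parameters `sigmas` are supplied by the caller. The function name and its parameters are hypothetical.

```python
import math


def gaussian_profile(coords, sigmas):
    """Idealized hydrophobicity per residue from a 3D Gaussian.

    coords: list of (x, y, z) positions, one per residue (assumption).
    sigmas: (sigma_x, sigma_y, sigma_z), chosen by the caller (assumption).
    Returns values normalized to sum to 1, highest near the geometric center.
    """
    n = len(coords)
    # Place the Gaussian maximum at the geometric center of the points.
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    raw = []
    for c in coords:
        exponent = sum(
            (c[k] - center[k]) ** 2 / (2.0 * sigmas[k] ** 2) for k in range(3)
        )
        raw.append(math.exp(-exponent))
    total = sum(raw)
    return [v / total for v in raw]
```

In this sketch, residues close to the center receive the highest expected hydrophobicity and the values decay towards the surface, which is the idealized hydrophobic core against which the observed distribution is compared.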

For most enzymes, the results revealed the lack of a clear hydrophobic core, meaning a significant discrepancy between the observed hydrophobic properties and the FOD model. The fragments into which the proteins were divided by disulfide bonds also showed a strong tendency to disagree with the model. Therefore, the main effect exerted by disulfide bonds was to support a structure divergent from the FOD model. The local divergence also appeared to be collocated with the active site.

Discordance with the assumed model suggests flexibility of the selected fragment, which is often required for effective substrate binding. An example of such an effect is the carboxylic esterase (1THG protein), in which the active cavity is not located on the surface. Its activity is supported by the combined influence of the two analyzed factors, hydrophobic interactions and disulfide bonds, which together provide a stable but also flexible conformation.

One example of an enzyme which is accordant with the FOD model is neurotoxin phospholipase (1QLL structure). Despite the stable hydrophobic core, which is also supported by accordant fragments (between disulfide bonds), the active site is characterized by excess hydrophobicity on the protein surface. This feature enables the binding of hydrophobic ligands and interactions with membranes, which are observed for this protein.

To support the results, RD values were compared with the solvent-accessible area calculated for amino acid residues. The comparison revealed a significant dependency between these two parameters (in the form of a hyperbolic function). The main conclusion of the work was that the two stabilizing factors considered, hydrophobic core formation and disulfide bonds, interact in a way that enables the biological function of the protein.

4 Final remarks

The in silico model presented in this work merges two common approaches to protein structure prediction: the knowledge-based approach, in the form of the starting Early Stage structure, and the ab initio methodology, which applies the hydrophobic effect as one of the dominant factors in the process of protein folding. While the main purpose of the ES model is to determine the starting structure for the subsequent structure optimization procedure, the FOD model can also be widely used in the analysis of protein structure stabilization. A deep understanding of the mechanisms that drive structure formation contributes significantly to the development of methods for predicting protein structure from an amino acid sequence. Results such as those obtained in the Critical Assessment of protein Structure Prediction (CASP) experiments [99] indicate that there is still a need for a solution comprehensive enough to predict different types of protein structures. As mentioned before, the protein folding model presented here, based on two intermediates, is being applied to build a tool for protein structure prediction. The method is currently being optimized, so no complete results are available yet.

The FOD model as a tool for the examination of the hydrophobicity distribution in known native structures has also been applied in other studies, whose
