Application of computational biophysics and bioinformatics to multiscale biological problems

University of Warsaw, Faculty of Biology

Tomasz Włodarski

Application of computational biophysics and bioinformatics to multiscale biological problems

(Polish title: Zastosowanie biofizyki obliczeniowej i bioinformatyki w wielkoskalowych problemach biologicznych)

Doctoral dissertation

Supervisors: Prof. Paweł Golik, Faculty of Biology, University of Warsaw; Prof. Marek Niezgódka, Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw

December 2012

Author's declaration:

I declare that this dissertation was written by me independently.

date, author's signature

Supervisor's declaration:

This dissertation is ready to be evaluated by the reviewers.

date, supervisor's signature

Supervisor's declaration:

This dissertation is ready to be evaluated by the reviewers.

date, supervisor's signature

Dedication

To my mother’s memory, for always supporting, helping, and standing by me.

Acknowledgements

I would like to thank my advisors, Prof. Paweł Golik and Prof. Marek Niezgódka, for supporting me during these past four years of my scientific journey. I am also very grateful to Dr Krzysztof Ginalski for giving me the opportunity to work in his group and for his scientific advice and knowledge. Moreover, I would like to thank Dr Bojan Zagrovic for sharing his knowledge with me and for many insightful discussions and suggestions. I am very grateful to my friends, Mateusz Sikora, Kamil Steczkiewicz and Łukasz Kniżewski, for their friendship, help and support, and for many great scientific and non-scientific discussions.

I especially thank my mom, dad, and sister, for their love, care and support.

Preface

This thesis describes my work conducted in the group of Dr Krzysztof Ginalski at the Interdisciplinary Centre for Mathematical and Computational Modelling at the University of Warsaw (A comprehensive classification of S. cerevisiae methyltransferases) and in the group of Dr Bojan Zagrovic at the Max F. Perutz Laboratories in Vienna (Application of computational biophysics in the study of protein-protein interactions). The results of my work have been published in:

[1] Wlodarski, T. and Zagrovic, B. (2009). Conformational selection and induced fit mechanism underlie specificity in noncovalent interactions with ubiquitin. Proceedings of the National Academy of Sciences of the United States of America, 106(46):19346–19351.

[2] Wlodarski, T., Kutner, J., Towpik, J., Knizewski, L., Rychlewski, L., Kudlicki, A., Rowicka, M., Dziembowski, A., and Ginalski, K. (2011). Comprehensive Structural and Substrate Specificity Classification of the Saccharomyces cerevisiae Methyltransferome. PLoS ONE, 6(8):e23168.

Introduction

The last decades have shown the importance of incorporating theoretical and computational methods into the biological sciences. Even though these methods are burdened with approximations and simplifications, they can extend our understanding of many biological processes, especially in the description of protein structure and dynamics at the atomic level, the prediction of protein function, and the modelling of metabolic pathways or cell signalling.

Nowadays, we cannot study protein structure or dynamics without resorting to the methods of computational biophysics. They are crucial for resolving protein structures from X-ray electron density maps or from NMR distance restraints. Furthermore, we can apply these methods to study the dynamics of the obtained protein structure in molecular dynamics or Monte Carlo simulations, which can describe protein motions and interactions in atomic detail up to microsecond time scales. We can also obtain a less detailed description with the use of pseudo-atoms, where one bead represents a group of atoms. At this coarse-grained level, we can depict the dynamics of huge complexes, or even more complicated structures like the bacterial flagellum. Computational biophysics methods can also be used to derive time-dependent properties of proteins, or to analyze in detail protein interactions at various stages of binding and folding.

On the other hand, bioinformatics emerged to address another computational challenge in biology, associated with the excess of sequence data. This situation has recently reached incredible proportions with the development of Next Generation Sequencing methods, which flood us with enormous amounts of sequence data from various whole-genome experiments. Bioinformatics can provide us with a model of protein structure, which can be obtained by de novo or homology modelling and can be further used to infer the function of previously uncharacterised proteins. It can be applied in high-throughput screening methods to study whole families of proteins or even whole proteomes. Moreover, the bioinformatic approach forms the basis for the emerging field of systems biology, which tries to describe complex biological systems from a more holistic point of view.

In my PhD thesis, I used both of those approaches to study multiscale biological problems. In the first part, I used the bioinformatics approach to describe and classify all the methyltransferases encoded in the genome of the yeast S. cerevisiae, along with their domain composition. In addition to this description, I developed a novel method for substrate specificity prediction, based mainly on the expression pattern in the Yeast Metabolic Cycle. Furthermore, I applied this method to predict the substrate specificity of a previously uncharacterised putative MTase, completing the picture of the methyltransferome. In the second part, I employed computational biophysics to analyze the details of protein-protein interactions, using ubiquitin recognition as an example. I revised a well-established model for ubiquitin interaction, the conformational selection model, and proposed a more detailed local description combined with statistical analysis, which led to the proposal of a novel model for ubiquitin recognition.

Both of these projects demonstrate the power of applying computational and theoretical methods to problems in modern biology.

Chapter 1 Protein structure and dynamics

In this chapter, I introduce all concepts crucial for my thesis. I start with a review of fundamental protein properties and then proceed to the description of protein structure and dynamics. In describing protein structure, I also cover protein folding theory. In the next part, I focus on the experimental and computational techniques used to obtain protein structural and dynamical data. Finally, I introduce the models used to describe protein-protein interactions, which were the starting point for the research presented in the second part of my thesis.

1.1 Proteins

Proteins are molecular nanomachines conducting a variety of functions in the cell [Alberts et al., 2008]. They are crucial for processing genomic information (e.g. polymerases) and for retrieving and transferring signals in and out of the cell (e.g. G protein-coupled receptors, kinases); they work as parts of metabolic pathways (e.g. synthetases) or as structural elements (e.g. nuclear lamina). They can work in an aqueous environment (globular proteins) or embedded in the lipid bilayer (transmembrane proteins), or can form long fibres or fibrous aggregates (amyloids). Protein versatility and ubiquity are based on their physicochemical properties. Generally, proteins are built from the 20 naturally occurring α-amino acids connected via peptide bonds in the order specified by the genomic sequence encoding each protein. Interestingly, even though α-amino acids exist as two stereoisomers (L or D), Nature builds proteins only from L-amino acids. These 20 amino acids differ from each other in size, charge and hydrophobicity, and their properties can be further modified by posttranslational modifications, like methylation, phosphorylation or glycosylation [Walsh, 2005]. Due to this variation in properties, they can build thousands of different structures. This variety is important because structure generally determines protein function, thus many possible structures enable many different functions to occur. In the cell, proteins are synthesised by a sophisticated complex of both RNA and proteins - the ribosome [Ben-Shem et al., 2011]. It uses the nucleotide sequence in mRNA as a template for protein synthesis, extending the peptide chain by one amino acid at a time. The newly synthesized protein chain starts to fold into a distinct three-dimensional structure immediately after leaving the ribosome (or even in the ribosomal exit tunnel). This crucial process occurs spontaneously or with the help of special proteins - chaperones [Fink, 1999]. Degradation of proteins in the cell is also carried out by a protein complex - the proteasome [Groll et al., 1997]. Proteins chosen for degradation are generally tagged by ubiquitin attached to one of the protein's lysines [Hochstrasser, 1996]. Both synthesis and degradation are highly regulated and are crucial for the proper functioning of the cell.

Figure 1.1 Four levels of protein structure: I) primary structure (amino acid sequence), II) regular secondary structure (one strand of β-structure and an α-helix are shown), III) tertiary structure (myoglobin), IV) quaternary structure (hemoglobin)

1.2 Protein structure

Protein structure can be described on four different levels (Figure 1.1).

Primary structure is the amino acid sequence of the polypeptide chain, which can be used to distinguish proteins [Sanger and Tuppy, 1951; Sanger and Thompson, 1953].

The next, secondary level of protein structure arises mainly from the planarity and rigidity of the peptide bond, as well as from hydrogen bonding between backbone atoms. Planarity restricts rotation to only two backbone angles, φ and ψ, with a very small allowed rotation space, first predicted by Ramachandran in 1965 [Ramakrishnan and Ramachandran, 1965]. These structural restraints combined with hydrogen bonds define local conformations of the polypeptide chain with two main protein secondary structures: α-helices and β-strands (Figure 1.1). Both were theoretically predicted by Linus Pauling in 1951 [Pauling and Corey, 1951]. The α-helices are right-handed helices with 3.6 residues per turn. They are stabilized by backbone hydrogen bonds that connect every fourth residue and run parallel to the helix axis. Moreover, due to the alignment of peptide bond dipoles, an α-helix can be treated as a single long dipole, with a positively charged N-terminus and a negatively charged C-terminus. Other helical secondary structures observed in proteins are the 3₁₀-helix, the π-helix and the polyproline helix (Figure 1.2). The 3₁₀-helix has three residues per turn, so it is much tighter than the α-helix, and every third residue is connected via a hydrogen bond. It is observed mainly as a short segment of 3-4 residues, e.g. at the end of an α-helix or as a single-turn transition. The π-helix, on the other hand, is wider than the α-helix; it has 4.4 residues per turn and every fifth residue is connected via a hydrogen bond. Moreover, it is often flanked by α-helices on either end. The last example of a helical element is the polyproline helix, which can exist in two conformations: a left-handed helix (PPII), when the prolyl bonds are in the trans-isomer conformation, with 3 residues per turn; and a more compact right-handed helix (PPI), when the prolyl bonds are in the cis-isomer conformation, with 3.3 residues per turn. In an aqueous environment the PPII form dominates.

Figure 1.2 Helical secondary structures (α, 3₁₀, π, PPI and PPII). Hydrogen bonds are presented as yellow dashed lines.

Figure 1.3 The beta secondary structure elements (parallel, antiparallel and β-bulge). Hydrogen bonds are presented as yellow dashed lines.

The β-sheets are protein secondary structures formed by hydrogen bonds between extended segments of the polypeptide chain - β-strands. Since each separate β-strand is slightly twisted, the nominally flat β-sheet also has a right-handed twist when viewed edge on. Moreover, β-sheets can bend into a fully or partially closed β-barrel. We can distinguish three kinds of β-sheets: parallel, antiparallel and mixed (Figure 1.3). The parallel β-sheet is formed by β-strands pointing in the same direction, whereas the antiparallel one is formed by strands pointing in opposite directions. The mixed β-sheet is a combination of parallel and antiparallel β-strands. Antiparallel β-sheets seem to be more stable because their hydrogen bonds are aligned directly opposite each other, whereas in parallel β-sheets the hydrogen bonds are less well aligned and longer. In addition to these two main secondary structure elements, we also observe others in proteins. The β-hairpin turn (types I, II, I', II', III and VI) joins two adjacent antiparallel β-strands. The β-bulge is a structure observed in β-sheets, especially antiparallel ones, when two hydrogen bonds from two neighbouring residues in one β-strand point to the same residue in the second β-strand (Figure 1.3). Loops or coils are defined as transition regions between other secondary structures. Examples include the β-hairpins described above and longer loops, also called Ω-loops. Loops are very often quite ordered in the protein structure, and they frequently form part of the active site or carry metal-binding residues. Those that are not well ordered are described as random coils.

The tertiary level of protein structure describes the spatial arrangement of the secondary structure elements into a single three-dimensional structure of the polypeptide chain. The concept of "fold", which additionally takes into account the connectivity between secondary structures, is related to the tertiary structure. Proteins can be structurally grouped into classes depending on their secondary structure content [Levitt and Chothia, 1976]: all α - proteins built only from α-helices; all β - built only from β structures; α/β - having α and β segments (mainly parallel) that tend to mix; and α + β - having distinct α and β segments (mainly antiparallel) that do not mix. Moreover, there are proteins or regions of proteins which do not have any distinguishable three-dimensional structure; they are described as intrinsically unstructured proteins. Their characteristic features are a combination of low overall hydrophobicity and a large net charge. They often take part in cellular signalling and regulation, and can adopt a defined structure upon binding.

The highest, quaternary level of protein structure is described as the non-covalent association of two or more independently folded polypeptides. This structural association has important consequences, such as functional cooperativity and colocalization.

1.3 Protein folding

The protein tertiary structure, described above, is adopted during protein folding. A first glimpse of this process was presented in the groundbreaking study of Christian Anfinsen [Anfinsen and Haber, 1961]. In the early 1960s Anfinsen showed, in his experiments on ribonuclease A, that after denaturation the protein can be refolded while still preserving its catalytic function. This observation implies that the primary structure of the protein contains all the information required for proper folding into the correct native state, which corresponds to the protein in its energy minimum. In 1969 Cyrus Levinthal conducted a thought experiment, in which he estimated that it would take a protein longer than the lifetime of the universe to randomly try all its possible conformations [Levinthal, 1969]. On the other hand, from experimental measurements we know that protein folding can take a fraction of a second [Kubelka et al., 2004]. This discrepancy is known as Levinthal's paradox. These observations helped us understand that protein folding cannot be a random process, but is rather a directed path through a high-dimensional space. In 1991 Hans Frauenfelder and colleagues described this space as a funnel-shaped folding energy landscape and depicted it as very rugged, with hills corresponding to high-energy conformations (transition states) and valleys to more favourable conformations (local minima) [Frauenfelder et al., 1991]. Within this model, folding is viewed as a parallel and cooperative process, in which an ensemble of proteins goes downhill through the energy funnel, with a higher probability of going through some obligatory paths. This process can be spontaneous, generally for small proteins, or can involve special proteins (chaperones). The chaperones keep proteins from getting stuck in kinetic traps through two different mechanisms: assisting in the uphill climb out of a local minimum, or preventing the protein from entering the trap in the first place.
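The scale of Levinthal's argument can be illustrated with a back-of-the-envelope calculation. The sketch below is not Levinthal's original derivation: the chain length, the number of conformations per residue and the sampling rate are illustrative assumptions chosen only to show the order of magnitude involved.

```python
# Back-of-the-envelope Levinthal estimate (all inputs are illustrative assumptions).
N_RESIDUES = 100                 # assumed chain length
STATES_PER_RESIDUE = 3           # assumed number of backbone conformations per residue
SAMPLING_RATE = 1e13             # assumed conformations tried per second
SECONDS_PER_YEAR = 3.156e7
AGE_OF_UNIVERSE_YEARS = 1.38e10  # approximate


def exhaustive_search_years(n_residues, states_per_residue, rate_per_second):
    """Years needed to visit every conformation once at a fixed sampling rate."""
    n_conformations = states_per_residue ** n_residues
    return n_conformations / rate_per_second / SECONDS_PER_YEAR


years = exhaustive_search_years(N_RESIDUES, STATES_PER_RESIDUE, SAMPLING_RATE)
print(f"exhaustive search time: {years:.1e} years")
print(f"ratio to the age of the universe: {years / AGE_OF_UNIVERSE_YEARS:.1e}")
```

Under these assumptions the exhaustive search exceeds the age of the universe by many orders of magnitude, which is why folding must follow a directed path rather than a random search.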

An interesting question is how many distinct three-dimensional structures proteins can achieve. In 1992 Cyrus Chothia estimated this number to be around 1000, based on similarity across the available genomic sequences and the only 120 protein structures solved at the time [Chothia, 1992]. However, more recent studies show that this number is more likely to be between 1000 and 10 000 distinct folds [Wolf et al., 2000; Grant et al., 2004]. Two mechanisms are responsible for this limitation. The first is evolution: because there is a direct link between protein structure and function, proteins evolved to have the particular structure that corresponds to the needed function. Evolution can be divergent, in which case the factor limiting the number of possible folds is the size of the group of common ancestor proteins from which the others were derived. On the other hand, it can be convergent, where the limiting factor is that certain folds are biophysically more favourable, so they were achieved independently more than once during evolution. The second mechanism responsible for fold limitation is the physics of protein folding. Proteins are forced to assume structures which have a smooth energy landscape, which makes fast folding possible without staying too long in local minima, which could result in aggregation. Because of both mechanisms, modification of an existing fold is more likely to occur than spontaneous generation of a new one. Interestingly, a random polypeptide sequence is generally not able to fold into a stable three-dimensional structure [Richardson et al., 1992]. Moreover, it is possible that some feasible folds will never be seen, because they are not energetically favourable or because organisms have never required the corresponding function.

1.4 Protein dynamics

Proteins are not rigid bodies; they are inherently dynamic over a broad spectrum of time scales, ranging from picoseconds to seconds and even beyond [Wüthrich and Wagner, 1978; Frauenfelder et al., 1991] (Figure 1.4).

Moreover, conformational changes are a necessary prerequisite for a protein to perform its functions. Protein dynamics plays a crucial role in catalysis, ligand binding, molecular recognition, allostery and many other processes. Fast protein motions, on the picosecond to nanosecond time scale, have a stabilizing effect on the folded state of the protein, and are necessary for slower time-scale dynamics involving large-scale collective motions [Henzler-Wildman et al., 2007a]. Motions on the micro- to millisecond time scale are crucial for interconversion of protein subconformations [Volkman et al., 2001; Kern and Zuiderweg, 2003] and for allosteric communication [Brüschweiler et al., 2009]. Slower time scales (milliseconds to seconds) involve large domain motions (Figure 1.4).

Figure 1.4 Time scales of biological motion and NMR methods used to study them. T1, T2 and T1ρ - relaxation times; HetNOE - heteronuclear NOE; EXSY - EXchange SpectroscopY. τ_c is the correlation time.

[Figure 1.4 labels: time axis from ps to s; motions shown: side-chain rotation, loop motions, diffusion, domain motions, local unfolding, allosteric transition, folding and binding; methods shown: T1, T2, HetNOE; T2, T1ρ; EXSY; real time; residual dipolar couplings (RDCs).]

The amplitude of the dynamics observed in proteins ranges from small side-chain fluctuations to movements of loops, secondary structures, and even whole domains [Gerstein and Krebs, 1998; Goh et al., 2004]. Hinge-bending movements may be the main mechanism responsible for these movements [Ma et al., 2002]. Moreover, dynamics, like protein structure, may be an evolutionarily determined property of proteins. From a kinetic point of view, internal motions may have evolved to actively promote catalytic activity in enzyme families or to enable binding to a range of interaction partners.

In my thesis I analyze the dynamic aspect of ubiquitin binding in order to better understand protein-protein interactions in ubiquitin recognition.

1.5 Structural and dynamic studies

Introduction

Protein tertiary structure and its dynamics, described in the previous sections, are a guide to protein function. The first attempts to understand protein structure were undertaken in 1934, when the first X-ray diffraction pattern of a protein crystal was presented [Bernal and Crowfoot, 1934]. Efforts increased after James Watson, Francis Crick and Rosalind Franklin obtained the first structure of DNA in 1953 [Watson and Crick, 1953], and protein structure determination finally became possible with the invention of the isomorphous replacement method by Max Perutz in 1954 [Green et al., 1954]. At the same time Kaj Linderstrøm-Lang, based on early studies of deuterium/hydrogen exchange in proteins, realized that proteins are dynamic and cannot be represented as a single structure [Linderstrøm-Lang and Schellman, 1959].

The first structure of a protein, sperm whale myoglobin, was solved by Kendrew and his collaborators in 1958 at a very low resolution of 6 Å [Kendrew et al., 1958]. After two years they obtained higher quality data and recalculated the structure at 2 Å resolution [Kendrew et al., 1960]. Scientists were amazed by the lack of symmetry and of the anticipated regularities in the obtained structures. Moreover, the structures were far more complicated than predicted by earlier theories of protein physics. Following Kendrew's work, Max Perutz and collaborators solved in 1960 the structure of horse haemoglobin at 5.5 Å resolution [Perutz et al., 1960]. They showed that it consists of four subunits, each structurally almost identical to Kendrew's myoglobin. This was the first observation that proteins from different species with similar sequence and function tend to have similar structure, which was important for the future development of homology modelling. Perutz also found the first structural manifestation of protein dynamics, finding that side-chain movement is needed to permit oxygen to reach the binding site at the heme iron [Perutz and Mathews, 1966]. This initiated studies of protein dynamics, which employed local optical probes, exchange of labile protons and 1D NMR spectroscopy [Frauenfelder et al., 1988].

The most common experimental methods used to derive protein structures are X-ray crystallography and NMR spectroscopy, which I briefly describe below; small-angle X-ray scattering and cryo-electron microscopy are also used. Thanks to developments in theory and methodology, the number of structures is increasing rapidly, and currently more than 85 000 structures are deposited in the Protein Data Bank (PDB) [Berman et al., 2000], which is the worldwide database for protein structures. Deposited structures and structural ensembles which portray protein dynamics are invaluable sources of information. In my thesis I used them both in homology modelling and in the statistical analysis of protein interactions.

1.5.1 Crystallography

Crystallography was the first and is still the most widely used experimental technique for solving protein structures [Wlodawer et al., 2008; Jaskolski, 2010]. Of more than 85 000 structures stored in the PDB [Bernstein et al., 1977; Berman et al., 2000], more than 75 000 were obtained with the use of X-ray crystallography. This method is based on the phenomenon of elastic X-ray scattering by the molecule's electrons. In a crystallographic study, proteins are in crystal form, where each molecule is embedded in a crystal lattice, creating a 3D diffraction grating. Moreover, the X-ray wavelength is of the same order as the length of a chemical bond (≈ 10⁻¹⁰ m). X-ray diffraction encodes the protein structure into a 3D diffraction image, which is then projected onto the 2D detector surface as regularly spaced spots known as reflections. They are a reciprocal-space representation of the crystal lattice. The symmetry of the obtained diffraction pattern contains information about the size and shape of the unit cell and the inherent symmetry within the crystal, i.e. its space group, whereas the positions of atoms are encoded in the intensities of the reflections. Reflections themselves are characterized by their amplitude and phase, yet we can only obtain intensities directly from the experiment, which are the squared amplitudes of the reflections. Phase information is lost because the detector records only these intensities. Unfortunately, to obtain the electron density we need to calculate the Fourier transform of the structure factors, which consist of both the reflections' amplitudes and the unmeasurable phases. This makes phase determination a crucial and challenging step in crystallography, alongside obtaining suitable protein crystals. Nowadays, various approaches to solving the phase problem exist, like molecular replacement, anomalous X-ray scattering or heavy atom methods. Generally, all of them assume some prior knowledge of the electron density or structure, leading to initial values for the phases, which can then be improved in an iterative fashion [Taylor, 2003]. For molecular replacement, the starting point is a homology model; for anomalous X-ray scattering it is an anomalous atom substructure; and for heavy atom replacement it is a heavy atom substructure. With proper phases we can apply the Fourier transform to the structure factors and obtain the electron density, which is used for fitting and interpretation of the protein structure. With high-resolution electron density maps (< 1.5 Å) we can obtain structural details at the atomic level.
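The phase problem described above can be made concrete with a toy calculation. The sketch below is a one-dimensional illustration only, not a crystallographic program: the 16-point "unit cell", the two point atoms and their positions are assumptions, and a discrete Fourier transform stands in for the structure factors.

```python
import cmath


def dft(values):
    """Forward discrete Fourier transform (plays the role of the structure factors)."""
    n = len(values)
    return [sum(v * cmath.exp(-2j * cmath.pi * k * x / n) for x, v in enumerate(values))
            for k in range(n)]


def inverse_dft(coeffs):
    """Inverse transform; returns the real part as the reconstructed density."""
    n = len(coeffs)
    return [sum(c * cmath.exp(2j * cmath.pi * k * x / n) for k, c in enumerate(coeffs)).real / n
            for x in range(n)]


# Toy 1D "unit cell" with two point atoms (6 and 8 electrons; positions are arbitrary).
density = [0.0] * 16
density[3] = 6.0
density[9] = 8.0

structure_factors = dft(density)
intensities = [abs(f) ** 2 for f in structure_factors]  # what the detector records
amplitudes = [i ** 0.5 for i in intensities]            # recoverable from intensities

with_phases = inverse_dft(structure_factors)            # amplitudes and true phases
without_phases = inverse_dft([complex(a, 0.0) for a in amplitudes])  # phases set to zero

error_with = max(abs(a - b) for a, b in zip(with_phases, density))
error_without = max(abs(a - b) for a, b in zip(without_phases, density))
print(f"max error with true phases: {error_with:.2e}")
print(f"max error with zero phases: {error_without:.2f}")
```

With the true phases the density is recovered to numerical precision, whereas the measured amplitudes alone give a map that does not resemble the original, which is why an initial phase estimate is needed.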

We need to remember that the diffraction pattern obtained in the first place is averaged across the ensemble of protein structures in the crystal (static averaging) and over the time of the experiment (dynamic averaging). Static averaging is due to the fact that proteins can adopt different conformations in different unit cells, which, apart from crystal contacts, are also influenced by the crystallization medium. On the other hand, proteins can have very flexible structures with motions on time scales shorter than the experiment, which generates dynamic averaging. Both these effects can give rise to the observation that the electron density for some fragments of the protein is smeared over multiple conformational states. This smearing reflects the amplitude of the atoms' fluctuations about their average positions and is described by the B-factor ("temperature factor", in Å²). Even with this in mind, the protein conformational ensemble present in protein crystals is usually treated as a single static structure.
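To give a feeling for the magnitudes involved, an isotropic B-factor can be converted into a root-mean-square atomic displacement using the standard relation B = 8π²⟨u²⟩. This relation is not stated in the text above and is added here as background; the B values in the sketch are illustrative and not taken from any particular structure.

```python
import math


def rms_displacement(b_factor):
    """RMS atomic displacement (in Å) from an isotropic B-factor (in Å^2), via B = 8*pi^2*<u^2>."""
    return math.sqrt(b_factor / (8.0 * math.pi ** 2))


for b in (10.0, 20.0, 60.0):  # illustrative values only
    print(f"B = {b:5.1f} Å^2 -> rms displacement = {rms_displacement(b):.2f} Å")
```

For example, a B-factor of 20 Å² corresponds to a displacement of roughly half an ångström.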

Figure 1.5 NMR spectroscopy. A) The static magnetic field (B₀) aligns the nuclear spin (μ) and causes its precession with the Larmor frequency. B) An excess of spins in the lower energy level gives rise to the magnetization M₀. C) The magnetization flips by an angle Θ due to a radio frequency pulse (B₁).

1.5.2 NMR

NMR spectroscopy is the second most widely used experimental technique for structure determination and the most common one for dynamical studies [Cavanagh et al., 2006]. Generally, it is a solution-state spectroscopy, but solid-state methods have recently been developed as well [Castellani et al., 2002]; these are especially useful for transmembrane proteins. More than 9 500 structures obtained by NMR are deposited in the PDB database and in the Biological Magnetic Resonance Data Bank (BMRB) [Ulrich et al., 2008]. The first NMR protein structure, that of proteinase inhibitor IIA, was obtained in 1985 [Williamson et al., 1985].

NMR is based on transitions, in a magnetic field, between spin states of magnetically active nuclei (those with non-zero nuclear spin, like ¹H, ¹³C and ¹⁵N). If the natural abundance of those isotopes is not sufficient for an NMR experiment, we can use efficient stable isotope labelling techniques [Tugarinov et al., 2006]. Without an external magnetic field, the nuclear spin is randomly oriented and has one degenerate energy level. When we introduce an external static magnetic field, the energy level splits due to symmetry breaking into 2I+1 states, where I is the nuclear spin quantum number; for example, for a nucleus with I = 1/2 (hydrogen) it splits into two spin states, +1/2 (α state) and −1/2 (β state). Moreover, the nuclear spin starts to precess about the direction of the magnetic field (Figure 1.5 A). The precession frequency (Larmor frequency) depends mainly on the type of nucleus and the magnetic field, ω = −γB₀, where γ is the magnetogyric ratio and B₀ is the static magnetic field, but it is also modulated by the chemical environment. This influence of the chemical environment is encoded in the chemical shift value (δ), which describes the difference between the resonant frequency of the nucleus in the sample and a reference value for this type of nucleus. Next, we apply a disturbance - a second external magnetic field (B₁), now as a radio frequency pulse, which is perpendicular to the static one. When the frequency of this pulse and that of the nuclear spin precession become equal, we have a resonant absorption of energy and the spin flips into the excited state.
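As a numerical illustration of the relation ω = −γB₀, the sketch below evaluates the magnitude of the Larmor frequency for the three nuclei mentioned above. The magnetogyric ratios are approximate literature values and the field of 14.1 T is an assumed example (a spectrometer usually referred to as "600 MHz"); neither is taken from the text.

```python
import math

# Magnetogyric ratios in rad s^-1 T^-1 (approximate literature values).
GAMMA = {"1H": 2.6752e8, "13C": 6.7283e7, "15N": -2.7126e7}


def larmor_frequency_mhz(nucleus, b0_tesla):
    """Magnitude of the Larmor frequency |gamma * B0| / (2*pi), in MHz."""
    return abs(GAMMA[nucleus] * b0_tesla) / (2.0 * math.pi) / 1e6


B0 = 14.1  # tesla; an assumed field strength
for nucleus in GAMMA:
    print(f"{nucleus}: {larmor_frequency_mhz(nucleus, B0):.1f} MHz")
```

The sign of γ (negative for ¹⁵N) determines only the sense of precession, so the function reports the magnitude.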

In the sample we have an ensemble of nuclear spins, with the populations of spin states defined by the Boltzmann distribution:

\[ \frac{N_\alpha}{N_\beta} = e^{\Delta E / RT}, \]

where N_α and N_β are the numbers of nuclei in the α and β spin orientations, R is the gas constant, T the temperature and ΔE the energy difference between the two spin states. This energy difference is very small, which corresponds to almost identical populations of the α and β states. For hydrogen nuclei, even at the highest field strengths, the populations of the α and β states differ by only about 100 in every 2 million protons. This makes NMR a very insensitive method compared to other spectroscopic techniques. This very small excess of nuclear spins gives rise to the observed magnetization vector M₀ along the z-axis (Figure 1.5 B). When we apply a radio frequency pulse to the sample, it imposes a torque on the magnetization vector M₀ in the direction perpendicular to the B₁ field. Depending on the length of this pulse, the magnetization can be flipped by a different angle Θ (Figure 1.5 C). We can apply a 90° pulse, which causes the magnetization to flip from being aligned with the static magnetic field into the plane perpendicular to the field (the x-y plane). Rotation of the magnetization in the x-y plane generates a current in the electromagnetic coils. When the radio frequency pulse ends, the spins relax to their unperturbed low-energy state and emit the excess energy. As a consequence, the magnetization rises again along the z-axis and at the same time decays in the x-y plane. This decay signal (known as the Free Induction Decay, FID) can be measured in the electromagnetic coils as an oscillating current. This time-domain signal needs to be further Fourier transformed into a frequency-domain signal ("NMR peak"). In addition to this single-pulse experiment, specially designed pulse sequences are used to create two- or three-dimensional NMR spectra, generally used for structure determination. With these pulse sequences magnetization can be transferred between specified types of nuclei through chemical bonds (scalar couplings) or through space (Nuclear Overhauser Effect, NOE), giving many more details about the protein structure.
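The size of the population excess given by the Boltzmann distribution above can be checked numerically. The sketch below uses the per-spin form of the exponent, hν/kT, which is equivalent to ΔE/RT per mole; the proton frequency of 600 MHz and the temperature of 298 K are assumed conditions, not values stated in the text.

```python
import math

H_PLANCK = 6.62607015e-34   # J s
K_BOLTZMANN = 1.380649e-23  # J / K


def excess_spins(frequency_hz, temperature_k, n_spins):
    """Excess of alpha over beta spins among n_spins spin-1/2 nuclei."""
    # N_alpha / N_beta for an energy gap of h * frequency
    ratio = math.exp(H_PLANCK * frequency_hz / (K_BOLTZMANN * temperature_k))
    return n_spins * (ratio - 1.0) / (ratio + 1.0)


# Assumed conditions: 1H at 600 MHz and 298 K.
excess = excess_spins(600e6, 298.0, 2_000_000)
print(f"{excess:.0f} excess spins per 2 million protons")
```

Under these assumptions the result is close to the figure of about 100 per 2 million protons quoted above.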

Generally, the first step in structure determination is to use multidimensional NMR methods to assign the observed resonances (NMR peaks) to specific nuclei in the covalent structure of the molecule; in this step magnetization is transferred through chemical bonds. This assignment tells us which chemical shift corresponds to which atom. More recently, chemical shifts alone have also been used for structure determination, as restraints in molecular dynamics simulations [Wishart and Case, 2001]. Structural restraints are then obtained from experiments that use magnetization transfer through space, such as Nuclear Overhauser Effect spectroscopy (NOESY) [Neuhaus and Williamson, 2000]. ¹H-¹H distances (< 5 Å) obtained with NOESY are used as restraints in a protein structure determination protocol (e.g. with molecular dynamics simulations). This distance information is supplemented by torsion angles measured from scalar J-couplings [Hahn and Maxwell, 1952] or by the orientation of inter-atomic vectors with respect to a reference frame, obtained from Residual Dipolar Couplings (RDCs) [Saupe and Englert, 1963]. With the standard methodology (double-labelled samples + NOESY) we can obtain structures of protein domains of up to roughly 25 kDa. NMR spectra of larger proteins are characterized by faster decay of the signal, which manifests itself as increased line width and lowered sensitivity. With triple-labelled proteins (¹³C, ¹⁵N and ²H) this limit can be extended to 40-50 kDa.

Moreover, with transverse relaxation-optimized spectroscopy (TROSY) [Pervushin et al., 1997; Fernandez, 2003], which is based on interference between different relaxation mechanisms, and with selective labelling we can go beyond this limit, in some special cases even up to 900 kDa, as in the case of the GroES-GroEL chaperone [Fiaux et al., 2002].

Apart from structure determination, NMR spectroscopy, in contrast to X-ray crystallography, is uniquely suited to investigating the kinetics and thermodynamics of molecular motions [Mittermaier and Kay, 2006; Torchia, 2011]. It is particularly valuable when a statically or dynamically disordered region of the protein fails to produce electron density in X-ray crystallography. In an NMR experiment, dynamic and structural data are intricately mixed, and to extract dynamical properties one has to take advantage of MD simulations. Nevertheless, NMR spectroscopy can cover dynamics on time scales from picoseconds to seconds (Figure 1.4). Protein motions on the picosecond to nanosecond time scale are accessible through measurements of ¹⁵N longitudinal (spin-lattice, T₁) and transverse (spin-spin, T₂) relaxation rates [Palmer, 1997]. These relaxation rates are then encoded in the squared generalized order parameter (S²), which measures the amplitude of motions and ranges from 0 to 1, corresponding to unrestricted isotropic motion and complete rigidity, respectively. The order parameters are then used as restraints in a molecular dynamics calculation to produce an ensemble of structures that describes the dynamic behaviour of the protein [Lindorff-Larsen et al., 2005].

Additionally, relaxation dispersion methods can be used to study protein dynamics on the microsecond to millisecond time scale [Akke and Palmer, 1996]. These motions cause dephasing of coherence, resulting in additional broadening of NMR signals by an amount R_ex. From a relaxation dispersion experiment, three physical parameters can be obtained: the rate of interconversion (k_ex), the relative populations of the exchanging species (p_A and p_B) and the chemical shift difference between the exchanging species (∆ω). Microsecond and millisecond time scales are crucial for many biologically important processes, such as enzyme catalysis, signal transduction, ligand binding or allosteric regulation [Kern and Zuiderweg, 2003; Henzler-Wildman et al., 2007b; Brüschweiler et al., 2009; Teilum et al., 2009]. Residual dipolar couplings (RDCs) can also be used to sample this range of dynamics.

The RDC technique is based on the observation that under standard isotropic conditions dipolar couplings average to zero due to Brownian rotational diffusion [Bax and Grishaev, 2005]. This averaging of anisotropic interactions is useful in standard NMR experiments, because it is the key to the sharpness of the resonance lines. However, with this averaging we lose important structural and dynamical information. In contrast to NOEs, RDCs do not depend on nearest neighbours, but define orientations relative to an external coordinate frame (the magnetic field) and therefore have a global character. To recover and use this information, we need to force a partial orientation of the sample in the magnetic field (a deviation on the order of 0.1% from the uniform distribution). This can be achieved by dissolving the protein in an anisotropic medium, such as phospholipid bicelles, phage particles or stretched polyacrylamide gels. With this partial alignment of the protein, the orientation-dependent dipolar couplings are scaled to non-zero values. The resulting RDCs can be used as restraints in structure calculation or refinement, as well as in the modelling of protein complexes.
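Why dipolar couplings vanish under isotropic tumbling, and reappear under weak alignment, can be seen from the orientational average of the angular factor (3cos²θ − 1)/2 that governs the dipolar interaction. The sketch below is a toy model: the weighting function w(u) = 1 + order · P₂(u) and the value order = 0.001 are assumptions introduced only to mimic a deviation of about 0.1% from the uniform distribution.

```python
def average_p2(order=0.0, n_bins=20000):
    """Average of P2(cos θ) = (3cos²θ − 1)/2 over molecular orientations.

    Orientations are weighted by w(u) = 1 + order * P2(u) with u = cos θ;
    order = 0 corresponds to the isotropic (uniform) distribution.
    The integral over u in [-1, 1] is evaluated with the midpoint rule.
    """
    du = 2.0 / n_bins
    weighted_sum = 0.0
    norm = 0.0
    for i in range(n_bins):
        u = -1.0 + (i + 0.5) * du
        p2 = 0.5 * (3.0 * u * u - 1.0)
        weight = 1.0 + order * p2
        weighted_sum += weight * p2
        norm += weight
    return weighted_sum / norm


print(average_p2(0.0))    # isotropic tumbling: average is numerically zero
print(average_p2(1e-3))   # weak alignment: small non-zero average
```

In this toy model the residual average equals order/5, so even a very weak alignment leaves a measurable fraction of the full dipolar coupling.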

In fact, RDCs seem to be able to sample the whole spectrum of dynamics, from picoseconds to milliseconds [Lakomek et al., 2008]. In the second part of my thesis I examine this range of dynamics in ubiquitin, as described by a structural ensemble obtained with RDC restraints [Lange et al., 2008], in order to understand protein-protein interactions in ubiquitin recognition.

1.6 Distant homology detection and comparative modelling

In the previous sections I described experimental methods used to obtain protein structural and dynamical data. Despite huge progress in these approaches, we are still unable to apply them to all proteins of interest, owing to limitations of time and money and, in some cases, of the techniques themselves. This is particularly pronounced with the development of Next Generation Sequencing [Shendure and Ji, 2008], which floods us with an enormous amount of sequence data and widens the already huge gap between the number of protein sequences (26,000,000 known and predicted proteins in the UniProtKB database, October 2012) and the number of protein structures (85,000 structures in the PDB, of which only 46,000 are unique).

In order to meet the increasing needs of structural biology, computational methods capable of modelling protein structures have been developed [Gu and Bourne, 2009]. They are based on the previously described Anfinsen’s hypothesis that protein structure is determined by protein sequence. With this notion and accurate knowledge of the laws of physics, a protein structure could in theory be obtained by simulating protein folding [Levitt and Warshel, 1975]. However, owing to limited computational power and the approximations in the physics behind the force fields, we are still not able to solve the folding problem, except for short peptides or small fast-folding proteins [Lindorff-Larsen et al., 2011].

Fortunately, we can use another approach known as homology modelling, based mainly on the observation that evolutionarily related proteins have similar sequences and tend to have similar structures as well [Kendrew et al., 1960; Perutz et al., 1960; Chothia and Lesk, 1986]. The main idea behind this method is to find proteins of known structure that are evolutionarily related to the target protein and to use them as templates for building a new model (Figure 1.6).

To find such templates, many sequence-based methods have been developed, with various ranges of application depending on sequence similarity [Ginalski et al., 2005]. These methods originate from the well-known algorithms for global (Needleman-Wunsch [Needleman and Wunsch, 1970]) and local (Smith-Waterman [Smith and Waterman, 1981]) sequence alignment. Both use dynamic programming combined with a substitution matrix and gap penalties to find the optimal global or local alignment. Their use is, however, limited by the computational cost of dynamic programming.
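The dynamic programming idea behind global alignment can be sketched in a few lines. The example below is a simplified illustration: instead of a substitution matrix it assumes a flat match/mismatch score, and it uses a single linear gap penalty instead of separate gap initiation and extension penalties. The function name and default scores are hypothetical choices.

```python
def needleman_wunsch_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score with a linear gap penalty.

    Only two rows of the dynamic programming table are kept in memory.
    """
    n_cols = len(seq_b) + 1
    # previous[j]: best score for the processed prefix of seq_a against seq_b[:j]
    previous = [j * gap for j in range(n_cols)]
    for i, a in enumerate(seq_a, start=1):
        current = [i * gap] + [0] * len(seq_b)
        for j, b in enumerate(seq_b, start=1):
            diagonal = previous[j - 1] + (match if a == b else mismatch)
            current[j] = max(diagonal, previous[j] + gap, current[j - 1] + gap)
        previous = current
    return previous[-1]


print(needleman_wunsch_score("ACGT", "AGT"))  # three matches and one gap
print(needleman_wunsch_score("FKDKIVLDVGCGTGILSM", "LKDKVVMDVGAGTGILSA"))
```

The nested loops visit every cell of a table whose size is the product of the two sequence lengths, which is the computational cost referred to above.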

The next group of methods, such as FASTA [Pearson and Lipman, 1988] or BLAST [Altschul et al., 1990], use heuristics instead of full dynamic programming. Because of that, they are much faster, but are not always able to find the optimal solution. With these methods we obtain a local gapped alignment between a pair of sequences, again scored with a substitution matrix and gap penalties (gap initiation and gap extension). They are, however, useful only for close homologs, because each position in the alignment is treated with the same weight (in both conserved and variable regions). Moreover, by comparing only pairs of sequences we miss the evolutionary variability within the protein family, which is exploited in more advanced methods, such as those based on profile-to-sequence comparisons.
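The speed of heuristic methods comes largely from restricting the search to regions that share short words. The sketch below shows only this seeding idea with exact word matches. It is not the BLAST algorithm itself, which for proteins also considers similar (neighbourhood) words and then extends and scores the hits. The function name and the word size of 3 are assumptions made for illustration.

```python
from collections import defaultdict


def word_hits(query, subject, word_size=3):
    """Return sorted (query_pos, subject_pos) pairs where identical words start."""
    # Index every word of the query by its starting position.
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)
    # Scan the subject once and look each word up in the index.
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], ()):
            hits.append((i, j))
    return sorted(hits)


hits = word_hits("FKDKIVLDVGCGTGILSM", "LKDKVVMDVGAGTGILSA")
print(hits)  # all hits share the diagonal query_pos - subject_pos = 0
```

Hits that cluster on one diagonal mark a candidate region, and only such regions would be passed on to the more expensive alignment step.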

The idea behind them is to vary the weight of each alignment position depending on its conservation across the sequences belonging to the same protein family, which groups proteins with a common evolutionary history. This conservation is derived from a multiple sequence alignment (MSA) of the query sequence and its homologs, and is encoded in position-specific substitution matrices (PSSMs) [Bork and Gibson, 1996], also known as profiles. They are N × 20 matrices describing the score for substituting any of the 20 amino acids at each of the N residues of the protein. The resulting sequence profile can be used to screen for homologous sequences of known structure. The most popular algorithm for profile-to-sequence comparison is Position-Specific Iterated BLAST (PSI-BLAST [Altschul et al., 1997]), whose initial search is based on the BLAST algorithm. Profiles and sequences can also be compared the other way around (sequence-to-profile): the sequence is compared with previously calculated sequence profiles, as in RPS-BLAST [Marchler-Bauer et al., 2003]. This method is much faster than the original one, because profiles do not need to be calculated during the search; it is, however, much more demanding in terms of memory (RAM) usage.

[Figure 1.6: schematic with panels 1a-1d (query sequence, multiple sequence alignment, sequence profile and predicted secondary structure combined into a meta profile, comparison of the target meta profile with the set of meta profiles for proteins with known structure), 2a-2b (structural alignment of templates, structure-based sequence alignment, sequence-to-structure alignment) and 3 (model building, with insertions, deletions and conserved regions marked).]

Figure 1.6 Pipeline for homology modelling: 1) Search for protein templates. 2) Generation of the sequence-to-structure alignment between target and template, which is used in 3D model generation. 3) Model building and validation.
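A sequence profile of the kind described above can be sketched as a log-odds matrix built from column counts. The example below is a minimal illustration, not the PSI-BLAST procedure: it assumes a uniform background frequency of 1/20, a pseudocount of 1 per amino acid, no sequence weighting, and it simply ignores gaps. The aligned fragments are the ones that appear in the multiple sequence alignment panel of Figure 1.6; the function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Aligned fragments taken from the MSA panel of Figure 1.6.
MSA = [
    "FKDKIVLDVGCGTGILSM",
    "VKDKVVLDVGCGTAI-CL",
    "FKDKVVMDVGAGTGILSA",
    "-RGKTLLHLGCGMGLYTM",
]


def build_pssm(alignment, pseudocount=1.0):
    """Build an N x 20 log-odds matrix from equal-length aligned sequences.

    Returns a list of N dicts mapping amino acid -> log2(frequency / background).
    Gap characters are skipped; the background is assumed uniform (1/20).
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(len(alignment[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in alignment:
            if seq[col] in counts:
                counts[seq[col]] += 1.0
        total = sum(counts.values())
        pssm.append({aa: math.log2(counts[aa] / total / background) for aa in AMINO_ACIDS})
    return pssm


def score_sequence(pssm, sequence):
    """Sum of position-specific scores for an ungapped sequence of length N."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))


pssm = build_pssm(MSA)
print(score_sequence(pssm, "FKDKIVLDVGCGTGILSM"))
```

Strictly conserved columns, such as the glycines of the GxGxG motif in these fragments, receive the highest scores for the conserved residue, which is how the profile weights conserved positions more strongly than variable ones.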


More sensitive homology detection methods use profile-to-profile comparisons [Rychlewski et al., 2000], where the score at each position in the alignment can be calculated as the dot product of the two corresponding columns of the PSSMs. Moreover, additional information can be added to the sequence profiles, such as the predicted secondary structure encoded as one of three states (H - α-helix, E - β-strand and C - coil). This creates meta profiles, which can also be compared with each other [Ginalski, 2003] and are able to detect even distant homology (Figure 1.6).
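The column-by-column dot product mentioned above can be written down directly. In the sketch below a profile is assumed to be a list of columns, each a dictionary mapping an amino acid to its value; this representation and the function names are illustrative assumptions, and published methods differ in how they normalize and weight the columns.

```python
def column_score(col_a, col_b):
    """Dot product of two profile columns given as dicts: amino acid -> value."""
    return sum(value * col_b.get(aa, 0.0) for aa, value in col_a.items())


def profile_profile_matrix(profile_a, profile_b):
    """Score matrix S[i][j] comparing every column of one profile with every column of the other."""
    return [[column_score(a, b) for b in profile_b] for a in profile_a]


# Two tiny hypothetical profiles with frequency-like column values.
profile_1 = [{"G": 1.0}, {"A": 0.5, "V": 0.5}]
profile_2 = [{"G": 0.8, "A": 0.2}, {"V": 1.0}, {"L": 1.0}]
print(profile_profile_matrix(profile_1, profile_2))
```

The resulting matrix plays the role that the substitution matrix lookup plays in sequence-to-sequence alignment and can be passed to the same dynamic programming scheme.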

With the notion of convergent evolution, which can drive two unrelated proteins to adopt a similar 3D structure, and with the development of residue contact potentials, a new structure-based approach known as threading emerged [Jones and Thornton, 1996]. Essentially, it is an almost literal threading of the protein sequence through the structures of various possible templates. Based on the residue contact potential, we can obtain for each template an N × 20 matrix describing the amino acid preferences at each position in the structure. This matrix can then be used to evaluate how well a sequence fits a particular structure, as in the sequence-to-profile methods mentioned previously.

Finally, meta-search methods for template selection have been developed. In this approach, the query sequence is submitted to various template search methods, such as the sequence- or structure-based ones described above. The results are then analysed for consistency and a consensus is generated, which has more statistical power than the result of any single method [Lundström et al., 2001]. In addition to finding the best template, template search methods often automatically generate a model of the protein structure. Models obtained with different methods can then be structurally clustered, and the most common one can be used to verify the template selection [Ginalski et al., 2003].

The next step in homology modelling is the generation of a sequence-to-structure alignment. First, the template structures are superimposed to generate a structural alignment, which is then translated into a sequence alignment that incorporates information about structure conservation. This structure-based sequence alignment is then combined with the MSA of the target protein family to generate, after manual verification, the final sequence-to-structure alignment (Figure 1.6).

The resulting sequence-to-structure alignment is used to direct the building of the model from the template structures. Regions that are well aligned can be taken directly from the template structures. Insertions in the target sequence are modelled with loops or secondary structure elements derived, respectively, from a loop library or from known fragments of protein structures with a similar sequence. Deletions are handled by bypassing fragments that exist only in the templates. The most widely used modelling program is the MODELLER package [Sali and Blundell, 1993]. It derives various restraints from the template structure and maps them onto the corresponding residues of the target, guided by the template-to-target alignment. The restraints (bond lengths and angles, dihedral angles, van der Waals contact distances, solvent accessibility, etc.) are used to derive, separately for each feature, conditional probability density functions (PDFs). These PDFs can also be analytically derived from a database of known structures. To obtain the final model, MODELLER minimizes a function combining those PDFs, using conjugate gradients and molecular dynamics with simulated annealing.

Finally, the model should be validated with software for X-ray and NMR structure evaluation, in order to estimate the model quality [Chen et al., 2010; Wiederstein and Sippl, 2007].

1.7 Molecular dynamics simulations

Homology modelling, described in the previous section, is a computational method used to obtain a 3D model of a protein structure. Protein dynamics can also be studied with computational methods, such as classical molecular dynamics (MD) simulation, which is based on the numerical integration of Newton’s equations of motion [Newton, 1687].

The first computational approach to proteins was an energy minimization of lysozyme and myoglobin conducted by Levitt in 1969 [Levitt and Lifson, 1969]. Eight years later, Karplus and coworkers conducted the first classical molecular dynamics simulation [McCammon et al., 1977]: they simulated bovine pancreatic trypsin inhibitor (BPTI) in vacuo for 9.2 ps. It took another eleven years to obtain accurate simulations in a water environment [Levitt and Sharon, 1988]. Nowadays, with advances in computer hardware and algorithms, we can run millisecond time scale simulations of protein dynamics in all-atom solvent [Shaw et al., 2009].

In order to run a molecular dynamics simulation we need a system (usually an atomic protein structure immersed in water molecules) and a force field (FF), which comprises the parameters for the system together with the energy function used to compute the system’s energy. The energy function is a pseudo-empirical potential energy function with additive knowledge-based terms:

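\[
V(R) = \sum_{\text{bonds}} K_b (b - b_0)^2
     + \sum_{\text{angles}} K_\theta (\theta - \theta_0)^2
     + \sum_{\text{dihedral}} K_\chi \left[ 1 + \cos(n\chi - \sigma) \right]
     + \sum_{\text{nonbonded}} \epsilon_{ij} \left[ \left( \frac{R^{\min}_{ij}}{r_{ij}} \right)^{12} - \left( \frac{R^{\min}_{ij}}{r_{ij}} \right)^{6} \right]
     + \sum_{\text{nonbonded}} \frac{q_i q_j}{\epsilon_d r_{ij}} .
\]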


All the parameters are determined from experimental or quantum chemical data. The first three terms describe the energy of covalently bonded atoms. The bond-length and bond-angle potentials are modelled by harmonic functions, where K_b and K_θ are spring constants and b₀ and θ₀ are the corresponding equilibrium distance and angle, respectively. The dihedral angle term is represented by a trigonometric function, where K_χ is the dihedral force constant, n is the multiplicity of the function (the number of minima, and of maxima, in the [0, 2π] interval) and σ is the phase. The last two terms are the nonbonded potentials, which comprise van der Waals and electrostatic (Coulomb) interactions. The van der Waals interaction is represented by the commonly used Lennard-Jones 12-6 potential, where R^min_ij is the distance r_ij at which the Lennard-Jones potential is equal to zero and ε_ij sets the depth of the energy minimum. In the Coulomb potential, q_i and q_j are partial charges and ε_d is the effective dielectric constant of the medium.
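The individual terms of such an energy function are simple enough to write out explicitly. The sketch below follows the functional forms given in this section, including the 12-6 term in the form ε[(R^min/r)¹² − (R^min/r)⁶]. The function names are hypothetical, all quantities are in arbitrary consistent units, and the Coulomb term omits the unit conversion factor that a real force field implementation would include.

```python
import math


def bond_energy(b, b0, k_b):
    """Harmonic bond term K_b (b - b0)^2."""
    return k_b * (b - b0) ** 2


def angle_energy(theta, theta0, k_theta):
    """Harmonic angle term K_theta (theta - theta0)^2, angles in radians."""
    return k_theta * (theta - theta0) ** 2


def dihedral_energy(chi, k_chi, n, sigma):
    """Periodic dihedral term K_chi [1 + cos(n chi - sigma)]."""
    return k_chi * (1.0 + math.cos(n * chi - sigma))


def lennard_jones_energy(r, r_min, epsilon):
    """12-6 term in the form used in this section; zero at r = r_min."""
    ratio6 = (r_min / r) ** 6
    return epsilon * (ratio6 * ratio6 - ratio6)


def coulomb_energy(r, q_i, q_j, eps_d=1.0):
    """Coulomb term q_i q_j / (eps_d r), without a unit conversion factor."""
    return q_i * q_j / (eps_d * r)


# With this form the minimum lies at r = 2^(1/6) * r_min and has depth epsilon / 4.
r_minimum = 2.0 ** (1.0 / 6.0) * 3.5
print(lennard_jones_energy(r_minimum, 3.5, 0.2))
```

The total potential energy is the sum of these terms over all bonds, angles, dihedrals and nonbonded pairs of the system.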

Equipped with the force field, we can solve the classical Newton equation of motion in order to obtain the molecular trajectory:

\[
F_i = -\frac{\partial V(R)}{\partial R_i} = m_i \frac{d^2 R_i}{dt^2}
\]

Many algorithms have been developed to solve this equation numerically; Velocity Verlet, which is based on Verlet integration, and the Leapfrog method are the most commonly used. The starting positions of the atoms are taken from an experimental structure, whereas the starting velocities are drawn from the Maxwell-Boltzmann distribution.
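The Velocity Verlet scheme can be illustrated on the simplest possible system. The sketch below integrates a one-dimensional harmonic oscillator, not a protein; the function names, the unit mass and force constant, and the time step of 0.01 are assumptions chosen so that the result is easy to check against the analytical solution.

```python
def velocity_verlet(position, velocity, force, mass, dt, n_steps):
    """Integrate m * d2x/dt2 = F(x) in one dimension; returns the final (x, v)."""
    acceleration = force(position) / mass
    for _ in range(n_steps):
        # Advance the position using the current velocity and acceleration.
        position += velocity * dt + 0.5 * acceleration * dt * dt
        # Recompute the force at the new position, then advance the velocity
        # with the average of the old and new accelerations.
        new_acceleration = force(position) / mass
        velocity += 0.5 * (acceleration + new_acceleration) * dt
        acceleration = new_acceleration
    return position, velocity


def harmonic_force(x, k=1.0):
    """F = -dV/dx for V(x) = 0.5 * k * x^2."""
    return -k * x


# Roughly one oscillation period (2π ≈ 6.28 time units) in 628 steps.
x, v = velocity_verlet(1.0, 0.0, harmonic_force, mass=1.0, dt=0.01, n_steps=628)
print(x, v, 0.5 * v * v + 0.5 * x * x)
```

After one period the particle returns close to its starting point and the total energy stays close to its initial value of 0.5, which is the kind of long-term stability that makes this family of integrators popular in MD.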

1.8 Protein interaction

1.8.1 Introduction

Proteins carry out biological functions through interactions with other proteins, DNA, RNA or small molecules. In my research, I looked into protein-protein interactions, specifically the interactions of ubiquitin with its binding partners. The first description of protein interaction was the "key-lock" model, used by Emil Fischer in 1894 to explain enzyme specificity [Fischer, 1894]. In his theory, the enzyme was described as a rigid "negative" of the substrate, which had to fit the enzyme’s shape in order to interact (Figure 1.7). Moreover, any change to the substrate shape would prevent the reaction. Because of these limitations, the model could not explain many experimental observations, for example why smaller analogous compounds react extremely slowly or not at all, or why some enzymes are highly selective whereas others can bind many different ligands.


Figure 1.7 Three main models of protein-protein interactions: the key-lock model, induced fit and conformational selection. In conformational selection, the relative populations of conformations are indicated by size.

1.8.2 Induced fit model

To explain these and many other discrepancies, Daniel Koshland introduced a new concept in 1958: the "induced fit" (IF) theory [Koshland, 1958]. He proposed three main ideas: (1) a precise orientation of catalytic groups is required for enzyme activity; (2) interaction with the substrate can change the local three-dimensional structure of the protein active site; (3) the structural change caused by a substrate brings the catalytic groups into the proper orientation for reaction, whereas a non-substrate does not. Induced fit has been the general textbook model for describing protein interactions ever since. In this model, proteins exist in a single native conformation and binding requires the formation of an initial complex. Better complementarity between the binding partners is then achieved through conformational changes induced by ligand binding (Figure 1.7). The kinetics of this mechanism can be expressed as:

\[
P + L \rightleftharpoons PL \rightleftharpoons (PL)^{*}
\]

where the ligand (L) and the protein (P) first approach each other to form an initial complex (PL), which is followed by structural adjustments that yield the bound state (PL)*.


In the induced fit mechanism, the reaction rate as a function of ligand concentration shows biphasic behaviour. The first phase corresponds to the formation of the initial complex (bimolecular association), and its rate depends linearly on the ligand concentration. The second phase is a concentration-independent conformational rearrangement. Additionally, Bosshard and coworkers, studying the kinetics of an antibody-antigen reaction, proposed that binding via an induced fit path makes sense only if there is a preexisting complementarity between P and L; otherwise the initial complex PL is too short-lived [Berger et al., 1999].
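The two phases of the induced fit mechanism can be made explicit by writing the forward rates of both steps. The rate constants k₁ and k₂ below are labels introduced here for illustration only; they do not appear in the original scheme, and the reverse reactions are omitted for brevity.

\[
v_{\text{association}} = k_{1}\,[P]\,[L], \qquad v_{\text{rearrangement}} = k_{2}\,[PL]
\]

With the ligand in large excess over the protein, the first step behaves as a pseudo-first-order process with an apparent rate constant k₁[L], which grows linearly with the ligand concentration, whereas the rate constant k₂ of the second step does not depend on [L].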

Most evidence for the induced fit mechanism comes from crystallographic studies, especially from structural comparisons between protein conformations in the bound and free states. These studies lack, however, a comprehensive description of the conformational variability of the protein in its native state, which is crucial in the next model, conformational selection (CS) [Tsai et al., 1999a,b; Ma et al., 1999; James and Tawfik, 2003]. Because of that, in the absence of solution kinetics data, structural differences between bound and free proteins are quite often wrongly ascribed to an induced fit mechanism.

1.8.3 Conformational selection model

As described in the previous sections, proteins are intrinsically dynamic: in the native state they can adopt different conformations with different lifetimes. The first such hypothesis was presented in 1940 by Linus Pauling to explain the multispecificity of antibodies [Pauling, 1940]. For a long time scientists had tried to explain how a limited repertoire of antibodies can bind an almost infinite variety of antigens. Pauling proposed that specific antigen binding sites were selected from an ensemble of preexisting conformations: "It is assumed that antibodies differ from normal serum globulin only in the way in which the two end parts of the globulin polypeptide chain are coiled, these parts, as a result of their amino acid composition and order, having accessible a very great many configurations with nearly the same stability; under the influence of an antigen molecule they assume configurations complementary to surface regions of the antigen, thus forming two active ends." [Pauling, 1940].

Conformational selection also has its origin in "fluctuation fit", a hypothesis presented by the Hungarian biochemist Brunó Ferenc Straub in 1964 [Straub, 1964]. He proposed that: "Instead of a fit induced by the substrate, I would suggest a fluctuating enzyme molecule, one particular form of which is able to bind the substrate and other forms another ligand" [Vértessy and Orosz, 2011].

The next big step was the proposal of the well-known Monod-Wyman-Changeux (MWC) model of allosteric regulation in 1965 [Monod et al., 1965]. In this model, proteins exist in two different conformational states, named R (for relaxed) and T (for tense), which are characterized by different affinities for ligands (substrate, activator or inhibitor). In the absence of ligands these states are in equilibrium. Upon ligand binding, however, the affinity of one of the sites towards the corresponding ligand is altered, which shifts the equilibrium towards one state or the other.

Figure 1.8 Population shift connected with conformational selection.
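The shift of the R/T equilibrium with ligand concentration can be illustrated with the state function of the MWC model. The sketch below uses the standard two-state expression for a protein with n equivalent binding sites. The parameter values (an allosteric constant of 1000, an affinity ratio c of 0.01 and four sites) are arbitrary assumptions chosen for illustration and are not taken from the text.

```python
def mwc_r_fraction(alpha, allosteric_constant, c, n_sites):
    """Fraction of protein in the R state according to the MWC model.

    alpha: ligand concentration divided by the R-state dissociation constant
    allosteric_constant: [T]/[R] ratio in the absence of ligand
    c: ratio of the R-state to the T-state dissociation constant (K_R / K_T)
    n_sites: number of equivalent ligand binding sites
    """
    r_weight = (1.0 + alpha) ** n_sites
    t_weight = allosteric_constant * (1.0 + c * alpha) ** n_sites
    return r_weight / (r_weight + t_weight)


# Hypothetical parameters: the T state dominates without ligand.
for alpha in (0.0, 1.0, 10.0, 100.0):
    print(alpha, mwc_r_fraction(alpha, allosteric_constant=1000.0, c=0.01, n_sites=4))
```

With these parameters the protein is almost entirely in the T state without ligand and almost entirely in the R state at saturating ligand concentration, which is the population shift the model was introduced to describe.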

The first experimental data indicating preexisting structural variability were provided by pre-steady-state binding kinetics. A particularly well-studied example of conformational variability is the already mentioned antibodies. Kinetic studies showed that they do not have a single conformation, but rather exist in an equilibrium between isomeric states of the combining site (conformational isomerism), and that only one conformation binds the antigen with high affinity [Lancet and Pecht, 1976; Foote and Milstein, 1994; Berger et al., 1999; James et al., 2003]. This is particularly evident for germline antibodies, which exist in a range of conformations, of which the one that binds the invading antigen is the one whose structure is complementary to that antigen [Wedemayer, 1997]. During antibody maturation, mutation events take place that rigidify some conformers with favourable geometries.

All those theoretical models and observations were not fully appreciated until the introduction of the already mentioned folding funnel concept [Frauenfelder et al., 1991; Bryngelson et al., 1995; Karplus, 1997; Dill and Chan, 1997]. To recapitulate, the idea behind this concept is that protein folding progresses via multiple routes going down the energy funnel, with a higher probability of passing through some obligatory steps. The energy landscape describing the folding funnel is quite complicated, with hills corresponding to high-energy conformations and valleys corresponding to more favourable states. A similar funnel model can be derived for binding, approximated as a fusion of two individual folding funnels, and can be used to describe protein-protein interactions [Tsai et al., 1999b]. Within these theoretical frameworks a new model of protein interaction has been proposed: conformational selection (CS) (Figure 1.7) [Ma et al., 1999, 2002; Tsai et al., 1999b; James and Tawfik, 2003]. As we already know, the energy landscape around the bottom of the folding funnel is rugged, and thus proteins usually do not exist in a single native conformation, but rather in an ensemble of conformations separated by energy barriers of various heights (Figure 1.8). Owing to their intrinsic dynamics, proteins sample these conformations with different probabilities, so that each state is populated for a different fraction of time. Proteins with greater flexibility have a more rugged energy landscape and thus more conformations available. Some of these sub-populations can be structurally identical to the bound state, so they can bind the corresponding ligand without the need for an induced fit mechanism. The conformational selection mechanism can be expressed as

\[
\{P_1, P_2, P_3, \ldots, P_N\} + L \rightleftharpoons \{P_i\} + L \rightleftharpoons P_i L
\]

where P₁, P₂, P₃, ..., P_N are different conformers of the protein P. In the conformational selection model, the rate of formation of P_iL is linearly proportional to the concentration of the proper conformer P_i and depends non-linearly on the total concentration of {P₁, P₂, P₃, ..., P_N} [Bosshard, 2001].

The energy landscape described above is not a static structure; rather, it responds dynamically to changes in the protein’s environment, redistributing the populations of conformations accordingly [Kumar et al., 2000]. A shift in the energy landscape of an individual folding funnel can be caused by a binding event [Freire, 1999] or, more generally, by any change in the environment [Tsai et al., 1999b]. This shift alters the energy landscape so that it favours the functional conformation instead of the native structure (Figure 1.8).
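The redistribution of populations can be illustrated with Boltzmann weights for a handful of conformers. The sketch below is a toy calculation under stated assumptions: three conformers with free energies of 0, 1 and 2 kcal/mol, a temperature of 298.15 K, and a hypothetical stabilization of the third conformer by 3 kcal/mol upon ligand binding. None of these numbers comes from the text.

```python
import math

GAS_CONSTANT = 1.987204e-3  # kcal / (mol K)


def boltzmann_populations(free_energies, temperature_k=298.15):
    """Equilibrium populations p_i = exp(-G_i / RT) / Z for energies in kcal/mol."""
    rt = GAS_CONSTANT * temperature_k
    lowest = min(free_energies)  # shift by the minimum for numerical stability
    weights = [math.exp(-(g - lowest) / rt) for g in free_energies]
    partition = sum(weights)
    return [w / partition for w in weights]


free_state = [0.0, 1.0, 2.0]     # hypothetical conformer free energies
bound_state = [0.0, 1.0, -1.0]   # third conformer stabilized by the ligand
print(boltzmann_populations(free_state))
print(boltzmann_populations(bound_state))
```

In the free state the third conformer is only a minor species, yet it is present; once it is stabilized it becomes the dominant one, without any new conformation having to be created.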

Examples of conformational selection started to appear more quickly with the development of new NMR techniques [Mittermaier and Kay, 2006]. These techniques made it possible to obtain structural information about sparsely populated, higher-energy conformations that are invisible to other methods. Apart from the previously discussed antibodies, conformational variability and conformational selection are observed in many other proteins [Boehr et al., 2009]. One example is the study by Volkman and coworkers on conformational selection during phosphorylation. They showed that phosphorylation does not induce conformational changes, but rather shifts the populations of preexisting conformations, causing their dynamic redistribution [Volkman et al., 2001]. Another example of conformational selection obtained from NMR studies is the dihydrofolate reductase complex, where NMR relaxation experiments revealed excited-state conformations that resemble conformations of the bound and free enzyme [Osborne et al., 2001]. Similar conformational selection was also observed in the maltose-binding protein, which in the apo form exists as two species: open and partially closed [Tang et al., 2007]. Other proteins for which the conformational selection model has been confirmed are cyclophilin A [Eisenmesser et al., 2005], ribonuclease A [Beach et al., 2005], adenylate kinase (initially regarded as solid proof of IF) [Henzler-Wildman et al., 2007b], calmodulin [Gsponer et al., 2008], peptide antibiotic synthetase [Koglin et al., 2008], U2AF [Mackereth et al., 2011] and many more. Moreover, conformational selection can also be observed in protein-nucleic acid interactions [Kalodimos et al., 2004; Zhang et al., 2007; Al-Hashimi and Walter, 2008].

In conclusion, the conformational selection model describes proteins as a dynamic distribution of conformations, with a range of binding site shapes available to incoming ligands (Figure 1.7). Moreover, these binding site shapes are already in the proper conformation, without the need for additional induced fit. This structural variability can include alternative side-chain rotamers and loop conformations. However, larger global structural rearrangements, or even fold transitions, have also been observed [Andreeva and Murzin, 2006]. Examples are lymphotactin, which exists in equilibrium between two different folds [Tuinstra et al., 2008], and Mad2, a homodimer that adopts two different β-sheet organizations [Mapelli et al., 2007].

1.8.4 Conformational selection vs induced fit model

In a way, induced fit and conformational selection are two extremes of the possible mechanisms underlying protein interaction [Boehr and Wright, 2008]: in the former, optimal binding is achieved by specific structural changes, whereas in the latter it is brought about through selection from the already present unbound ensemble. In the past, whenever crystal structures of the free and bound protein were available, induced fit was always used to describe the structural changes. In many structural studies we can see examples where the authors claim that they observed IF rather than CS [Goh et al., 2004; Sullivan and Holyoak, 2008]. However, I think that after comprehensive studies of the structural variability of the native state, many of these examples would be attributed mainly to the CS mechanism, as in the case of adenylate kinase [Henzler-Wildman et al., 2007b]. This is particularly apparent when NMR combined with MD simulation reveals how dynamic proteins are [Lindorff-Larsen et al., 2005]. Moreover, a structural difference between the bound and unbound protein conformations obtained from crystallography only implies that, under the given crystallization conditions, those states are the most populated ones.

Generally, a lot of effort is put into understanding the differences and distinguishing between conformational selection and induced fit on different levels. From the kinetic point of view, this comparison was carried out by Thomas Weikl and coworkers [Weikl and von Deuster, 2009]. They found a characteristic difference in kinetics between IF
