University of Warsaw, Faculty of Biology
Tomasz Włodarski
Application of computational biophysics and bioinformatics to multiscale biological problems
(Polish title: Zastosowanie biofizyki obliczeniowej i bioinformatyki w wielkoskalowych problemach biologicznych)
Doctoral dissertation
Supervisors: Prof. Paweł Golik, Faculty of Biology, University of Warsaw; Prof. Marek Niezgódka, Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw
December 2012
Declaration of the author of the dissertation:
I declare that this dissertation was written by me independently.
date, signature of the author of the dissertation
Declaration of the supervisor of the dissertation:
this work is ready for evaluation by the reviewers
date, signature of the supervisor of the dissertation
Declaration of the supervisor of the dissertation:
this work is ready for evaluation by the reviewers
date, signature of the supervisor of the dissertation
Dedication
To my mother’s memory, for always supporting, helping, and standing by me.
Acknowledgements
I would like to thank my advisors, Prof. Paweł Golik and Prof. Marek Niezgódka, for supporting me during these past four years of my scientific journey. I am also very grateful to Dr Krzysztof Ginalski for giving me the opportunity to work in his group and for his scientific advice and knowledge. Moreover, I would like to thank Dr Bojan Zagrovic for sharing his knowledge with me and for many insightful discussions and suggestions. I am very grateful to my friends, Mateusz Sikora, Kamil Steczkiewicz and Łukasz Kniżewski, for their friendship, help and support, and for many great scientific and non-scientific discussions.
I especially thank my mom, dad, and sister, for their love, care and support.
Preface
This thesis describes my work conducted in the group of Dr Krzysztof Ginalski at the Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw (A comprehensive classification of S. cerevisiae methyltransferases), and in the group of Dr Bojan Zagrovic at the Max F. Perutz Laboratories in Vienna (Application of computational biophysics in the study of protein-protein interactions). The results of my work have been published in:
[1] Wlodarski, T. and Zagrovic, B. (2009). Conformational selection and induced fit mechanism underlie specificity in noncovalent interactions with ubiquitin. Proceedings of the National Academy of Sciences of the United States of America, 106(46):19346–19351.
[2] Wlodarski, T., Kutner, J., Towpik, J., Knizewski, L., Rychlewski, L., Kudlicki, A., Rowicka, M., Dziembowski, A., and Ginalski, K. (2011). Comprehensive Structural and Substrate Specificity Classification of the Saccharomyces cerevisiae Methyltransferome. PloS One, 6(8):e23168.
Contents
1 Protein structure and dynamics
1.1 Proteins
1.2 Protein structure
1.3 Protein folding
1.4 Protein dynamics
1.5 Structural and dynamic studies
1.5.1 Crystallography
1.5.2 NMR
1.6 Distant homology detection and comparative modelling
1.7 Molecular dynamics simulations
1.8 Protein interaction
1.8.1 Introduction
1.8.2 Induced fit model
1.8.3 Conformational selection model
1.8.4 Conformational selection vs induced fit model
1.8.5 Flexible protein recognition model
2 Application of structural bioinformatics in the study of methylation
2.1 Introduction
2.2 A comprehensive classification of S. cerevisiae methyltransferases
2.2.1 Background
2.2.2 Methods
2.2.3 Results
2.2.4 Discussion
2.2.5 Conclusions
3 Application of computational biophysics in study of protein-protein interactions
3.1 Introduction
3.2 Methods
3.2.1 Structural dataset
3.2.2 Structural analysis
3.2.3 Statistical significance of structural differences
3.2.4 Quantitative comparison of conformational selection and induced fit
3.2.5 Analysis of correlations
3.3 Results
3.3.1 Global structural analysis
3.3.2 Local structural analysis
3.3.3 The significance of structural changes
3.3.4 Backrub and ERNST ensembles
3.3.5 Magnitude of conformational selection versus induced fit
3.3.6 Correlation
3.4 Discussion
3.5 Conclusions
Introduction
The last decades have shown the importance of incorporating theoretical and computational methods into the biological sciences. Even though these methods are burdened with approximations and simplifications, they can extend our understanding of many biological processes, especially in describing protein structure and dynamics at the atomic level, predicting protein function, and modelling metabolic pathways or cell signalling.
Nowadays, we cannot study protein structure or dynamics without resorting to the methods of computational biophysics. They are crucial for resolving protein structures from X-ray electron density maps or from NMR distance restraints. Furthermore, we can apply them to study the dynamics of the resulting structures in molecular dynamics or Monte Carlo simulations, which describe protein motions and interactions in atomic detail on time scales up to microseconds. We can also obtain a less detailed description by using pseudo-atoms, where one bead represents a group of atoms. At this coarse-grained level, we can depict the dynamics of huge complexes, or even of more complicated structures such as the bacterial flagellum. Computational biophysics methods can also be used to derive time-dependent properties of proteins, or to analyze in detail protein interactions at various stages of binding and folding.
Bioinformatics, on the other hand, emerged to address another computational challenge in biology: the excess of sequence data. This situation has recently reached incredible proportions with the development of next-generation sequencing methods, which flood us with enormous amounts of sequence data from whole-genome experiments. Bioinformatics can provide a model of a protein structure, obtained by de novo or homology modelling, which can then be used to infer the function of previously uncharacterised proteins. It can be applied in high-throughput screening to study whole protein families or even whole proteomes. Moreover, the bioinformatic approach forms the basis for the emerging field of systems biology, which tries to describe complex biological systems from a more holistic point of view.
In my PhD thesis, I used both of these approaches to study multiscale biological problems. In the first part, I used the bioinformatics approach to describe and classify all the methyltransferases encoded in the genome of the yeast S. cerevisiae, along with their domain composition. In addition, I developed a novel method for substrate specificity prediction, based mainly on expression patterns in the Yeast Metabolic Cycle. I then applied this method to predict the substrate specificity of a previously uncharacterised putative MTase, completing the picture of the methyltransferome. In the second part, I employed computational biophysics to analyze the details of protein-protein interactions, using ubiquitin recognition as an example. I revised a well-established model of ubiquitin interaction, the conformational selection model, and proposed a more detailed local description combined with statistical analysis, which led to a novel model of ubiquitin recognition.
Both of these projects demonstrate the power of computational and theoretical methods applied to problems in modern biology.
Chapter 1
Protein structure and dynamics
In this chapter, I introduce the concepts crucial for my thesis. I start with a review of fundamental protein properties and then proceed to the description of protein structure and dynamics. In describing protein structure, I also cover protein folding theory. In the next part, I focus on the experimental and computational techniques used to obtain structural and dynamical data on proteins. Finally, I introduce the models used to describe protein-protein interactions, which were the starting point for the research presented in the second part of my thesis.
1.1 Proteins
Proteins are molecular nanomachines that carry out a variety of functions in the cell [Alberts et al., 2008]. They are crucial for processing genomic information (e.g. polymerases) and for receiving and transferring signals into and out of the cell (e.g. G protein-coupled receptors, kinases); they work as parts of metabolic pathways (e.g. synthetases) or as structural elements (e.g. the nuclear lamina). They can work in an aqueous environment (globular proteins) or embedded in the lipid bilayer (transmembrane proteins), or they can form long fibres or fibrous aggregates (amyloids). The versatility and ubiquity of proteins are based on their physicochemical properties. Generally, proteins are built from the 20 naturally occurring α-amino acids, connected via peptide bonds in the order specified by the genomic sequence encoding each protein. Interestingly, even though α-amino acids exist as two stereoisomers (L and D), Nature builds proteins only from L-amino acids. The 20 amino acids differ from each other in size, charge and hydrophobicity, and their properties can be further altered by posttranslational modifications such as methylation, phosphorylation or glycosylation [Walsh, 2005]. Owing to this variation in properties, they can build thousands of different structures. This variety is important because structure generally determines protein function, so many possible structures enable many different functions. In the cell, proteins are synthesised by the ribosome, a sophisticated complex of RNA and proteins [Ben-Shem et al., 2011]. It uses the nucleotide sequence of the mRNA as a template for protein synthesis, extending the peptide chain by one amino acid at a time. The newly synthesised chain starts to fold into a distinct three-dimensional structure immediately after leaving the ribosome (or even in the ribosomal exit tunnel). This crucial process occurs spontaneously or with the help of special proteins, the chaperones [Fink, 1999]. Degradation of proteins in the cell is also carried out by a protein complex, the proteasome [Groll et al., 1997]. Proteins chosen for degradation are generally tagged by ubiquitin attached to one of their lysines [Hochstrasser, 1996]. Both synthesis and degradation are highly regulated and are crucial for the proper functioning of the cell.

Figure 1.1 Four levels of protein structure: I) primary structure (amino acid sequence), II) regular secondary structure (one strand of β-structure and an α-helix are shown), III) tertiary structure (myoglobin), IV) quaternary structure (hemoglobin).
1.2 Protein structure
Protein structure can be described on four different levels (Figure 1.1).
Primary structure is the amino acid sequence of the polypeptide chain, which can be used to distinguish proteins [Sanger and Tuppy, 1951; Sanger and Thompson, 1953].
The next, secondary level of protein structure arises mainly from the planarity and rigidity of the peptide bond, as well as from hydrogen bonding between backbone atoms. Because of this planarity, rotation is possible only around two backbone angles, φ and ψ, and only a small region of the (φ, ψ) space is allowed, as described by Ramachandran and co-workers [Ramakrishnan and Ramachandran, 1965]. These structural restraints, combined with hydrogen bonds, define the local conformations of the polypeptide chain, with two main secondary structures: α-helices and β-strands (Figure 1.1). Both were theoretically predicted by Linus Pauling in 1951 [Pauling and Corey, 1951]. α-Helices are right-handed helices with 3.6 residues per turn. They are stabilized by backbone hydrogen bonds that connect every fourth residue and run parallel to the helix axis. Moreover, owing to the alignment of the peptide bond dipoles, an α-helix can be treated as a single long dipole, with a positively charged N-terminus and a negatively charged C-terminus. Other helical secondary structures observed in proteins are the 3₁₀-helix, the π-helix and the polyproline helix (Figure 1.2). The 3₁₀-helix has three residues per turn, so it is much tighter than the α-helix, and its hydrogen bonds connect every third residue. It is observed mainly as a short segment of 3-4 residues, e.g. at the end of an α-helix or as a single-turn transition. The π-helix, on the other hand, is wider than the α-helix: it has 4.4 residues per turn and its hydrogen bonds connect every fifth residue. It is often flanked by α-helices on either end. The last example of a helical element is the polyproline helix, which can exist in two conformations: a left-handed helix (PPII), with 3 residues per turn, when the prolyl peptide bonds are in the trans conformation; and a more compact right-handed helix (PPI), with 3.3 residues per turn, when the prolyl peptide bonds are in the cis conformation. In an aqueous environment the PPII form dominates.

Figure 1.2 Helical secondary structures (α, 3₁₀, π, PPI and PPII). Hydrogen bonds are shown as yellow dashed lines.

Figure 1.3 β secondary structure elements (parallel, antiparallel and β-bulge). Hydrogen bonds are shown as yellow dashed lines.
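The backbone angles φ and ψ mentioned above are ordinary torsion angles, so they can be computed directly from atomic coordinates. The following is a minimal sketch, not taken from the thesis: the function names `dihedral` and `phi_psi` are illustrative, coordinates are assumed to be (x, y, z) tuples in Å, and the atom ordering follows the usual convention (φ: C(i-1), N, Cα, C; ψ: N, Cα, C, N(i+1)).

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(c / norm for c in b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


def phi_psi(c_prev, n, ca, c, n_next):
    """Backbone torsions of residue i: phi = C(i-1)-N-CA-C, psi = N-CA-C-N(i+1)."""
    return dihedral(c_prev, n, ca, c), dihedral(n, ca, c, n_next)
```

Applied to every residue of a structure, pairs of (φ, ψ) values computed this way are what is plotted in a Ramachandran diagram.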
β-Sheets are secondary structures formed by hydrogen bonds between stretches of the polypeptide chain, the β-strands. Since each β-strand is slightly twisted, the otherwise flat β-sheet also has a right-handed twist when viewed edge on. Moreover, sheets can bend into a fully or partially closed β-barrel. We can distinguish three kinds of β-sheets: parallel, antiparallel and mixed (Figure 1.3). A parallel β-sheet is formed by β-strands pointing in the same direction, whereas an antiparallel one is formed by strands pointing in opposite directions. A mixed β-sheet combines parallel and antiparallel β-strands. Antiparallel β-sheets seem to be more stable because their hydrogen bonds are aligned directly opposite each other, whereas in parallel β-sheets the hydrogen bonds are less well aligned and longer. In addition to these two main secondary structure elements, we also observe others in proteins. The β-hairpin turn (types I, II, I', II', III and VI) joins two adjacent antiparallel β-strands. The β-bulge is a structure observed in β-sheets, especially antiparallel ones, in which two hydrogen bonds from two neighbouring residues in one β-strand point to the same residue in the second β-strand (Figure 1.3). Loops or coils are defined as transition regions between other secondary structures. Examples include the β-hairpins described above and longer loops, also called Ω-loops. Loops are very often well ordered in the protein structure and contribute active-site or metal-binding residues. Those that are not well ordered are described as random coils.
The tertiary level of protein structure describes the spatial arrangement of the secondary structure elements into a single three-dimensional structure of the polypeptide chain. Related to the tertiary structure is the concept of a "fold", which additionally takes into account the connectivity between secondary structure elements. Proteins can be grouped structurally into classes depending on their secondary structure content [Levitt and Chothia, 1976]: all-α, built only from α-helices; all-β, built only from β structures; α/β, having α and β segments that tend to mix (mainly parallel β-strands); and α + β, having distinct α and β segments that do not mix (mainly antiparallel β-strands). Moreover, there are proteins, or regions of proteins, that have no distinguishable three-dimensional structure; they are described as intrinsically unstructured proteins. Their characteristic features are a low overall hydrophobicity combined with a large net charge. They often take part in cellular signalling and regulation, and can adopt a defined structure upon binding.
The highest, quaternary level of protein structure is the non-covalent association of two or more independently folded polypeptides. This association has important functional consequences, such as cooperativity or colocalization.
1.3 Protein folding
The protein tertiary structure described above is adopted during protein folding. A first glimpse of this process was provided by the groundbreaking study of Christian Anfinsen [Anfinsen and Haber, 1961]. At the beginning of the 1960s, Anfinsen showed in his experiments on ribonuclease A that, after denaturation, the protein can be refolded with its catalytic function preserved. This observation implies that the primary structure of a protein contains all the information required for proper folding into the correct native state, which corresponds to the protein's energy minimum. In 1969 Cyrus Levinthal conducted a thought experiment in which he estimated that it would take a protein longer than the lifetime of the universe to randomly try all its possible conformations [Levinthal, 1969]. On the other hand, we know from experimental measurements that protein folding takes a fraction of a second [Kubelka et al., 2004]. This discrepancy is known as Levinthal's paradox. These observations helped us understand that protein folding cannot be a random process, but is rather a directed path through a high-dimensional space. In 1991 Hans Frauenfelder and colleagues described this space as a funnel-shaped folding energy landscape and depicted it as very rugged, with hills corresponding to high-energy conformations (transition states) and valleys corresponding to more favourable conformations (local minima) [Frauenfelder et al., 1991]. Within this model, folding is viewed as a parallel and cooperative process, in which an ensemble of protein molecules goes downhill through the energy funnel, with a higher probability of passing through some obligatory paths. This process can be spontaneous, generally for small proteins, or can involve special proteins (chaperones). Chaperones keep proteins from getting stuck in kinetic traps through two different mechanisms: assisting the climb up the slope out of a local minimum, or preventing the protein from entering the trap in the first place.
An interesting question is how many distinct three-dimensional structures proteins can adopt. In 1992 Cyrus Chothia estimated this number to be around 1000, based on similarity across the available genomic sequences and the mere 120 protein structures solved at the time [Chothia, 1992]. However, more recent studies suggest that the number is more likely to lie between 1000 and 10 000 distinct folds [Wolf et al., 2000; Grant et al., 2004]. Two mechanisms are responsible for this limitation. The first is evolution: because there is a direct link between a protein's structure and its function, proteins evolved to have the particular structure that corresponds to the required function. Evolution can be divergent, in which case the factor limiting the number of possible folds is the size of the group of common ancestor proteins from which the others were derived. Alternatively, it can be convergent, in which case the limiting factor is that certain folds are biophysically more favourable, so that they were reached independently more than once during evolution. The second mechanism responsible for fold limitation is the physics of protein folding. Proteins are forced to assume structures with a smooth energy landscape, which makes fast folding possible without staying too long in local minima, which could result in aggregation. As a result of both mechanisms, modification of an existing fold is more likely than the spontaneous generation of a new one. Interestingly, a random polypeptide sequence is not able to fold into a stable three-dimensional structure [Richardson et al., 1992]. Moreover, it is possible that some feasible folds will never be seen, because they are not energetically favourable or because organisms have not needed to develop them (or their corresponding functions).
1.4 Protein dynamics
Proteins are not rigid bodies; they are inherently dynamic over a broad spectrum of time scales, ranging from picoseconds to seconds and even beyond [Wüthrich and Wagner, 1978; Frauenfelder et al., 1991] (Figure 1.4).
Moreover, conformational changes are a necessary prerequisite for a protein to perform its functions. Protein dynamics plays a crucial role in catalysis, ligand binding, molecular recognition, allostery and many other processes. Fast protein motions, on the picosecond to nanosecond time scale, have a stabilizing effect on the folded state of the protein and are necessary for slower dynamics involving large-scale collective motions [Henzler-Wildman et al., 2007a]. Motions on the micro- to millisecond time scale are crucial for the interconversion of protein subconformations [Volkman et al., 2001; Kern and Zuiderweg, 2003] and for allosteric communication [Brüschweiler et al., 2009]. Slower time scales (milliseconds to seconds) involve large domain motions (Figure 1.4).

Figure 1.4 Time scales of biological motions (side-chain rotation, loop motions, diffusion, domain motions, local unfolding, allosteric transitions, folding and binding) and the NMR methods used to study them. T1, T2 and T1ρ: relaxation times; HetNOE: heteronuclear NOE; EXSY: EXchange SpectroscopY; RDCs: residual dipolar couplings. τc is the correlation time.
The amplitude of the motions observed in proteins ranges from small side-chain fluctuations to movements of loops, secondary structure elements and even whole domains [Gerstein and Krebs, 1998; Goh et al., 2004]. Hinge-bending movements may be the main mechanism responsible [Ma et al., 2002]. Moreover, dynamics, like protein structure, may be an evolutionarily determined property of proteins. From a kinetic point of view, internal motions may have evolved to actively promote catalytic activity in enzyme families or to enable binding to a range of interaction partners.
In my thesis I analyze the dynamic aspect of ubiquitin binding in order to better understand protein-protein interactions in ubiquitin recognition.
1.5 Structural and dynamic studies
Introduction
Protein tertiary structure and dynamics, described in the previous sections, are a guide to protein function. The first attempts to understand protein structure were undertaken in 1934, when the first X-ray diffraction pattern of a protein crystal was presented [Bernal and Crowfoot, 1934]. Efforts increased after James Watson, Francis Crick and Rosalind Franklin obtained the first structure of DNA in 1953 [Watson and Crick, 1953], and solving protein structures finally became possible with the invention of the isomorphous replacement method by Max Perutz in 1954 [Green et al., 1954]. At the same time Kaj Linderstrøm-Lang, on the basis of early studies of deuterium/hydrogen exchange in proteins, realized that proteins are dynamic and cannot be represented as a single structure [Linderstrøm-Lang and Schellman, 1959].
The first protein structure, that of sperm whale myoglobin, was solved by Kendrew and his collaborators in 1958 at a very low resolution of 6 Å [Kendrew et al., 1958]. Two years later they obtained higher-quality data and recalculated the structure at 2 Å resolution [Kendrew et al., 1960]. Scientists were amazed by the lack of symmetry and of the anticipated regularities in the structure, which was far more complicated than earlier theories of protein physics had predicted. Following Kendrew's work, Max Perutz and collaborators solved the structure of horse haemoglobin at 5.5 Å resolution in 1960 [Perutz et al., 1960]. They showed that it consists of four subunits, each structurally almost identical to Kendrew's myoglobin. This was the first observation that proteins from different species with similar sequence and function tend to have similar structures, an observation that proved important for the later development of homology modelling. Perutz also found the first structural manifestation of protein dynamics: side-chain movement is needed to allow oxygen to reach the binding site at the heme iron [Perutz and Mathews, 1966]. This initiated studies of protein dynamics, which employed local optical probes, exchange of labile protons and 1D NMR spectroscopy [Frauenfelder et al., 1988].
The most common experimental methods used to determine protein structures are X-ray crystallography and NMR spectroscopy, which I briefly describe below; small-angle X-ray scattering and cryo-electron microscopy are also used. Thanks to developments in theory and methodology, the number of structures is increasing rapidly, and more than 85 000 structures are currently deposited in the Protein Data Bank (PDB) [Berman et al., 2000], the worldwide database of protein structures. The deposited structures, and the structural ensembles that portray protein dynamics, are invaluable sources of information. In my thesis I used them both in homology modelling and in the statistical analysis of protein interactions.
1.5.1 Crystallography
Crystallography was the first, and is still the most widely used, experimental technique for solving protein structures [Wlodawer et al., 2008; Jaskolski, 2010]. Of the more than 85 000 structures stored in the PDB [Bernstein et al., 1977; Berman et al., 2000], more than 75 000 were obtained by X-ray crystallography. The method is based on the elastic scattering of X-rays by the electrons of the molecule. In a crystallographic study the protein is in crystalline form: each molecule is embedded in a crystal lattice, which acts as a 3D diffraction grating. Moreover, the X-ray wavelength is comparable to the length of a chemical bond (≈ 10⁻¹⁰ m). X-ray diffraction encodes the protein structure in a 3D diffraction pattern, which is recorded on the 2D detector surface as regularly spaced spots known as reflections. They are a reciprocal-space representation of the crystal lattice. The symmetry of the diffraction pattern contains information about the size and shape of the unit cell and the inherent symmetry of the crystal, i.e. its space group, whereas the positions of the atoms are encoded in the intensities of the reflections. The reflections themselves are characterized by an amplitude and a phase, yet the experiment yields only intensities, which are the squared amplitudes. The phase information is lost in the measurement, because the detector records only intensities. Unfortunately, to obtain the electron density we need to calculate the Fourier transform of the structure factors, which comprise both the amplitudes and the unmeasurable phases of the reflections. This makes phase determination a crucial and challenging step in crystallography, alongside obtaining suitable protein crystals. Nowadays there are various approaches to solving the phase problem, such as molecular replacement, anomalous X-ray scattering and heavy-atom methods. Generally, all of them assume some prior knowledge of the electron density or structure, which leads to initial values for the phases that can then be improved iteratively [Taylor, 2003]. For molecular replacement the starting point is a homology model, for anomalous X-ray scattering it is an anomalous-atom substructure, and for heavy-atom replacement it is a heavy-atom substructure. With proper phases we can Fourier transform the structure factors and obtain the electron density, which is used for fitting and interpreting the protein structure. With high-resolution electron density maps (< 1.5 Å) we can obtain structural details at the atomic level.
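The phase problem can be illustrated with a one-dimensional toy model. The sketch below is only an illustration under simplifying assumptions (a discrete 1D "unit cell" with two point-like atoms and a naive discrete Fourier transform written with the standard library); the names `structure_factors` and `density_from` are hypothetical and do not correspond to any crystallographic software.

```python
import cmath
import math


def structure_factors(density):
    """Discrete Fourier transform of a 1D density: one complex factor per index h."""
    n = len(density)
    return [
        sum(rho * cmath.exp(-2j * math.pi * h * x / n) for x, rho in enumerate(density))
        for h in range(n)
    ]


def density_from(factors):
    """Inverse transform; returns the real part of the reconstructed density."""
    n = len(factors)
    return [
        (sum(f * cmath.exp(2j * math.pi * h * x / n) for h, f in enumerate(factors)) / n).real
        for x in range(n)
    ]


# Toy unit cell with two "atoms" of different weight (illustrative values).
density = [0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 2.0, 0.0]

factors = structure_factors(density)
intensities = [abs(f) ** 2 for f in factors]      # what the detector measures
amplitudes = [math.sqrt(i) for i in intensities]  # recoverable from the intensities

# Amplitudes combined with the true phases reproduce the density ...
with_phases = density_from(
    [cmath.rect(a, cmath.phase(f)) for a, f in zip(amplitudes, factors)]
)
# ... whereas amplitudes alone (all phases set to zero) do not.
without_phases = density_from(amplitudes)
```

In this toy case the map computed with zero phases still contains two peaks, but at the wrong positions, which is why initial phase estimates from molecular replacement or heavy-atom methods are needed before a model can be fitted.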
We need to remember that the diffraction pattern is averaged over the ensemble of protein molecules in the crystal (static averaging) and over the duration of the experiment (dynamic averaging). Static averaging arises because proteins can adopt different conformations in different unit cells; these conformations are influenced not only by crystal contacts but also by the crystallization medium. Dynamic averaging arises because proteins can have very flexible structures, with motions on time scales shorter than the experiment. Both effects can cause the electron density of some fragments of the protein to be smeared over multiple conformational states. This smearing reflects the amplitude of the fluctuations of atoms about their average positions and is described by the B-factor ("temperature factor", in Å²). Even with this in mind, the conformational ensemble present in a protein crystal is usually treated as a single static structure.
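To give the B-factor a more intuitive meaning, it can be converted into a root-mean-square displacement. The sketch below relies on the standard isotropic relation B = 8π²⟨u²⟩, which is not stated in the text above and is added here as background; the function names are illustrative.

```python
import math


def b_factor_from_msd(mean_square_displacement):
    """Isotropic B-factor (in Å^2) from the mean-square displacement <u^2> (in Å^2)."""
    return 8.0 * math.pi ** 2 * mean_square_displacement


def rms_displacement(b_factor):
    """Root-mean-square displacement (in Å) implied by an isotropic B-factor (in Å^2)."""
    return math.sqrt(b_factor / (8.0 * math.pi ** 2))


if __name__ == "__main__":
    for b in (10.0, 20.0, 60.0):  # example B-factors, chosen only for illustration
        print(f"B = {b:5.1f} Å^2 -> rms displacement ≈ {rms_displacement(b):.2f} Å")
```

Under this relation a B-factor of 20 Å² corresponds to an rms displacement of roughly 0.5 Å; note that in a real crystal the value mixes static and dynamic disorder, as described above.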
Figure 1.5 NMR spectroscopy. A) The static magnetic field ($B_0$) aligns the nuclear spin and causes it to precess at the Larmor frequency. B) The excess of spins in the lower energy level gives rise to the magnetization $M_0$. C) The magnetization is flipped by an angle Θ by a radio-frequency pulse ($B_1$).
1.5.2 NMR
NMR spectroscopy is the second most widely used experimental technique for structure determination and the most common one for studies of dynamics [Cavanagh et al., 2006]. It is generally a solution-state spectroscopy, but solid-state methods, which are especially useful for transmembrane proteins, have also been developed recently [Castellani et al., 2002]. More than 9 500 structures obtained by NMR are deposited in the PDB and in the Biological Magnetic Resonance Data Bank (BMRB) [Ulrich et al., 2008]. The first NMR protein structure, that of proteinase inhibitor IIA, was obtained in 1985 [Williamson et al., 1985].
NMR is based on transitions, in a magnetic field, between the spin states of magnetically active nuclei (those with non-zero nuclear spin, such as ¹H, ¹³C and ¹⁵N). If the natural abundance of these isotopes is not sufficient for an NMR experiment, we can use efficient stable-isotope labelling techniques [Tugarinov et al., 2006]. Without an external magnetic field, the nuclear spins are randomly oriented and their spin states are degenerate in energy. When we introduce an external static magnetic field, the degeneracy is lifted and the energy level splits into $2I+1$ states, where $I$ is the nuclear spin quantum number; for example, for a nucleus with $I = 1/2$ (hydrogen) it splits into two spin states, $+1/2$ (α state) and $-1/2$ (β state). Moreover, the nuclear spin starts to precess about the direction of the magnetic field (Figure 1.5 A). The precession frequency (Larmor frequency) is $\omega = -\gamma B_0$, where $\gamma$ is the magnetogyric ratio and $B_0$ is the static magnetic field. It depends mainly on the type of nucleus and on the magnetic field, but it is also modulated by the chemical environment. This influence of the chemical environment is encoded in the chemical shift ($\delta$), which describes the difference between the resonance frequency of the nucleus in the sample and a reference value for this type of nucleus. Next, we apply a perturbation: a second external magnetic field ($B_1$), perpendicular to the static one, applied as a radio-frequency pulse. When the frequency of this pulse equals the frequency of the nuclear spin precession, energy is absorbed resonantly and the spin flips into the excited state.
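As a numerical illustration of the Larmor relation and of the chemical shift, the following sketch evaluates the precession frequency for a given field and expresses a frequency offset in parts per million. The magnetogyric ratios (divided by 2π, in MHz/T) are approximate literature values, the ppm scale is the conventional one rather than something defined in the text above, and the function names are hypothetical.

```python
# Approximate magnetogyric ratios divided by 2*pi, in MHz per tesla.
GAMMA_OVER_2PI_MHZ_PER_T = {"1H": 42.577, "13C": 10.708, "15N": -4.316}


def larmor_frequency_mhz(nucleus, b0_tesla):
    """Signed Larmor frequency nu = -gamma * B0 / (2*pi); the sign gives the sense of precession."""
    return -GAMMA_OVER_2PI_MHZ_PER_T[nucleus] * b0_tesla


def chemical_shift_ppm(nu_sample_hz, nu_reference_hz):
    """Chemical shift: relative offset from the reference frequency, in parts per million."""
    return (nu_sample_hz - nu_reference_hz) / nu_reference_hz * 1e6


if __name__ == "__main__":
    b0 = 14.1  # tesla; an assumed field strength, i.e. a "600 MHz" spectrometer
    for nucleus in GAMMA_OVER_2PI_MHZ_PER_T:
        print(nucleus, f"{abs(larmor_frequency_mhz(nucleus, b0)):.1f} MHz")
```

Because the chemical shift is reported as a relative quantity, the same nucleus gives the same δ on spectrometers with different field strengths, even though its absolute resonance frequency changes.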
In the sample we have an ensemble of nuclear spins, with the populations of the spin states given by the Boltzmann distribution:

$$\frac{N_\alpha}{N_\beta} = e^{\Delta E / RT},$$

where $N_\alpha$ and $N_\beta$ are the numbers of nuclei in the α and β spin states, $R$ is the gas constant, $T$ is the temperature and $\Delta E$ is the (molar) energy difference between the two spin states. This energy difference is very small, so the populations of the α and β states are almost identical. For hydrogen nuclei at the highest field strengths, the populations of the α and β states differ by only about 100 nuclei in every 2 million. This makes NMR a very insensitive method compared with other spectroscopic techniques. This very small excess of nuclear spins gives rise to the observed magnetization vector $M_0$ along the z-axis (Figure 1.5 B). When we apply a radio-frequency pulse to the sample, it exerts a torque on the magnetization vector $M_0$ in the direction perpendicular to the $B_1$ field.
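The size of the population excess quoted above can be checked by evaluating the Boltzmann ratio numerically. This is a rough sketch under stated assumptions: the energy gap is taken as ΔE = hν per nucleus (converted to a molar quantity so that it can be used with R), and the example assumes a 600 MHz proton frequency at 298 K, values that are not specified in the text. The function names are illustrative.

```python
import math

PLANCK = 6.62607015e-34     # J s
AVOGADRO = 6.02214076e23    # 1/mol
GAS_CONSTANT = 8.314462618  # J/(mol K)


def population_ratio(frequency_hz, temperature_k):
    """N_alpha / N_beta = exp(dE / RT), with dE = h * nu expressed per mole."""
    delta_e = PLANCK * frequency_hz * AVOGADRO
    return math.exp(delta_e / (GAS_CONSTANT * temperature_k))


def excess_per_million(frequency_hz, temperature_k):
    """Excess of alpha over beta spins per million nuclei in total."""
    ratio = population_ratio(frequency_hz, temperature_k)
    return (ratio - 1.0) / (ratio + 1.0) * 1e6


if __name__ == "__main__":
    # Assumed conditions: 1H at 600 MHz and 298 K.
    print(f"excess per million spins: {excess_per_million(600e6, 298.0):.1f}")
```

With these assumed inputs the excess is on the order of 50 spins per million, consistent with the figure of about 100 per 2 million given in the text.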
Depending on the length of this pulse, the magnetization can be flipped by a different angle $\Theta$ (Figure 1.5 C). We can apply a 90° pulse, which flips the magnetization from being aligned with the static magnetic field into the plane perpendicular to the field (x-y plane).
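To get a feeling for how small the population difference discussed above is, the Boltzmann ratio can be evaluated numerically. The sketch below works per nucleus, with $\Delta E = h\nu$ and the Boltzmann constant, which is equivalent to the molar form with $R$ used in the text. The proton frequency of 900 MHz and the temperature of 298 K are assumptions chosen for illustration.

```python
import math

H_PLANCK = 6.62607015e-34   # Planck constant, J s
K_BOLTZMANN = 1.380649e-23  # Boltzmann constant, J / K


def population_ratio(nu_hz, temperature_k):
    """N_alpha / N_beta = exp(dE / kT), with dE = h * nu per nucleus."""
    return math.exp(H_PLANCK * nu_hz / (K_BOLTZMANN * temperature_k))


def excess_alpha_spins(nu_hz, temperature_k, n_total):
    """Number of surplus alpha spins, N_alpha - N_beta, among n_total nuclei."""
    ratio = population_ratio(nu_hz, temperature_k)
    return n_total * (ratio - 1.0) / (ratio + 1.0)


# Assumed conditions: 900 MHz proton frequency, 298 K, two million protons.
excess = excess_alpha_spins(900e6, 298.0, 2_000_000)
print(f"Excess alpha spins per 2 million protons: {excess:.0f}")
```

With these assumed conditions the excess is on the order of a hundred spins per two million protons, consistent with the estimate given above.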
Rotation of the magnetization in the x-y plane induces a current in the electromagnetic coils. When the radio frequency pulse ends, the spins relax to their unperturbed low-energy state and emit the excess energy. As a consequence, the magnetization recovers along the z-axis and at the same time decays in the x-y plane. This decaying signal (known as the Free Induction Decay, FID) is measured in the coils as an oscillating current. This time-domain signal then needs to be Fourier transformed into a frequency-domain signal (an "NMR peak"). In addition to this simple single-pulse experiment, specially designed pulse sequences are used to create two- or three-dimensional NMR spectra, generally used for structure determination. With these pulse sequences, magnetization can be transferred between specified types of nuclei through chemical bonds (scalar couplings) or through space (Nuclear Overhauser Effect, NOE), giving much more detail about the protein structure.
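The step from the time-domain FID to a frequency-domain peak can be illustrated with a minimal Python sketch. It simulates the FID of a single resonance as a decaying complex exponential and locates the peak with a naive discrete Fourier transform. The offset frequency, decay constant, sampling interval and all function names are hypothetical choices for illustration; real spectra are processed with fast Fourier transforms plus additional steps such as apodization and phasing.

```python
import cmath
import math


def simulate_fid(freq_hz, t2_s, dwell_s, n_points):
    """Complex FID of one resonance: exp(2*pi*i*f*t) * exp(-t / T2)."""
    return [
        cmath.exp(2j * math.pi * freq_hz * k * dwell_s) * math.exp(-k * dwell_s / t2_s)
        for k in range(n_points)
    ]


def dft(signal):
    """Naive O(N^2) discrete Fourier transform, adequate for a few hundred points."""
    n = len(signal)
    return [
        sum(signal[k] * cmath.exp(-2j * math.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def peak_frequency_hz(signal, dwell_s):
    """Frequency (Hz) of the strongest point in the magnitude spectrum."""
    spectrum = dft(signal)
    n = len(signal)
    j_max = max(range(n), key=lambda j: abs(spectrum[j]))
    if j_max >= n // 2:  # upper half of the bins holds negative frequencies
        j_max -= n
    return j_max / (n * dwell_s)


# Assumed example: resonance offset of 125 Hz, T2 = 0.1 s, 1 ms sampling, 256 points.
fid = simulate_fid(freq_hz=125.0, t2_s=0.1, dwell_s=0.001, n_points=256)
print(f"Peak found at {peak_frequency_hz(fid, 0.001):.1f} Hz")
```

The faster the FID decays (shorter $T_2$), the broader the resulting peak becomes, which is the origin of the increased line widths observed for large proteins.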
Generally, the first step in structure determination is to use multidimensional NMR methods to assign the observed resonances (NMR peaks) to specific nuclei in the covalent structure of the molecule; in this step magnetization is transferred through chemical bonds. This assignment tells us which chemical shift corresponds to which atom. More recently, chemical shifts alone have also been used for structure determination, as restraints in molecular dynamics simulations [Wishart and Case, 2001]. Structural restraints are then obtained in experiments which use magnetization transfer through space, like the Nuclear Overhauser Effect (NOE) spectroscopy
[Neuhaus and Williamson, 2000]. $^{1}$H-$^{1}$H distances (< 5 Å) obtained with NOESY are used as restraints in a protein structure determination protocol (e.g. with molecular dynamics simulations). Additionally, the distance information is supplemented by torsion angles measured from scalar J-couplings [Hahn and Maxwell, 1952] or by the orientation of inter-atomic vectors with respect to a reference frame, obtained from Residual Dipolar Couplings (RDCs) [Saupe and Englert, 1963]. With the standard methodology (double-labelled samples + NOESY) we can obtain structures of protein domains up to roughly 25 kDa. NMR spectra of larger proteins are characterized by faster decay of the signal, which manifests itself as increased line width and lowered sensitivity. With triple-labelled proteins ($^{13}$C, $^{15}$N and $^{2}$H) we can extend this limit up to 40-50 kDa.
Moreover, with the use of transverse relaxation-optimized spectroscopy (TROSY) [Pervushin et al., 1997; Fernandez, 2003], which is based on interference between different relaxation mechanisms, and selective labelling we can go beyond this limit, in some special cases even up to 900 kDa, as for the GroES-GroEL chaperone complex [Fiaux et al., 2002].
Apart from structure determination, NMR spectroscopy, in contrast to X-ray crystallography, is uniquely suited for investigating the kinetics and thermodynamics of molecular motions [Mittermaier and Kay, 2006; Torchia, 2011]. It is particularly valuable when a statically or dynamically disordered region of the protein fails to produce electron density in X-ray crystallography. In an NMR experiment, dynamic and structural data are intricately mixed, and to obtain dynamical properties one has to take advantage of MD simulations. Nevertheless, NMR spectroscopy can cover dynamics on time scales from picoseconds to seconds (Figure 1.4). Protein motions in the picosecond to nanosecond range are accessible through measurements of the $^{15}$N longitudinal (spin-lattice, $T_1$) and transverse (spin-spin, $T_2$) relaxation time constants [Palmer, 1997]. These relaxation data are then encoded into the squared generalized order parameter ($S^2$), which measures the amplitude of motions and ranges from 0 to 1, describing unrestricted isotropic motion and complete rigidity, respectively. The order parameters are then used as restraints in a molecular dynamics calculation to produce an ensemble of structures which describes the dynamic behaviour of the protein [Lindorff-Larsen et al., 2005].
Additionally, to study protein dynamics in the microsecond to millisecond range, relaxation dispersion methods can be used [Akke and Palmer, 1996]. Motions on this time scale cause dephasing of coherence, resulting in additional broadening of NMR signals by an amount $R_{ex}$. From a relaxation dispersion experiment, three physical parameters can be obtained: the rate of interconversion ($k_{ex}$), the relative populations of the exchanging species ($p_A$ and $p_B$) and the chemical shift difference between the exchanging species ($\Delta\omega$). Microsecond and millisecond time scales are crucial for many biologically important processes, like enzyme catalysis, signal transduction, ligand binding or allosteric regulation [Kern
and Zuiderweg, 2003; Henzler-Wildman et al., 2007b; Brüschweiler et al., 2009; Teilum et al., 2009]. Additionally, residual dipolar couplings (RDCs) can be used to sample this range of dynamics.
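How the three exchange parameters combine into the line broadening $R_{ex}$ can be sketched for the simplest case of two-site exchange in the fast-exchange limit ($k_{ex} \gg \Delta\omega$), where $R_{ex} = p_A p_B \Delta\omega^2 / k_{ex}$, together with the Luz-Meiboom expression for its dependence on the CPMG pulsing frequency. These are standard textbook approximations, not the fitting procedure of the studies cited above, and all parameter values and function names below are hypothetical.

```python
import math


def rex_fast_exchange(p_a, delta_omega, k_ex):
    """Fast-exchange limit: R_ex = p_A * p_B * delta_omega**2 / k_ex.

    delta_omega is in rad/s and k_ex in s^-1, so R_ex is returned in s^-1.
    """
    p_b = 1.0 - p_a
    return p_a * p_b * delta_omega ** 2 / k_ex


def rex_cpmg(p_a, delta_omega, k_ex, nu_cpmg):
    """Luz-Meiboom estimate of R_ex at CPMG frequency nu_cpmg (Hz).

    Fast pulsing refocuses the exchange contribution, so R_ex decreases
    as nu_cpmg grows; this dependence is the dispersion profile.
    """
    x = k_ex / (4.0 * nu_cpmg)
    return rex_fast_exchange(p_a, delta_omega, k_ex) * (1.0 - math.tanh(x) / x)


# Hypothetical parameters: 5% minor state, 100 Hz shift difference, k_ex = 5000 s^-1.
p_a, d_omega, k_ex = 0.95, 2.0 * math.pi * 100.0, 5000.0
print(f"R_ex without refocusing: {rex_fast_exchange(p_a, d_omega, k_ex):.2f} s^-1")
for nu in (50.0, 200.0, 1000.0):
    print(f"nu_CPMG = {nu:6.0f} Hz -> R_ex = {rex_cpmg(p_a, d_omega, k_ex, nu):.2f} s^-1")
```

Fitting such a profile to measured relaxation rates is, in essence, how $k_{ex}$, the populations and $\Delta\omega$ are extracted, although in the fast-exchange limit only the product $p_A p_B \Delta\omega^2$ is determined rather than its individual factors.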
The RDC technique is based on the observation that under standard isotropic conditions dipolar couplings average to zero due to Brownian rotational diffusion [Bax and Grishaev, 2005]. This averaging of anisotropic interactions to zero is useful in standard NMR experiments, because it is the key to the sharpness of the resonance spectrum.
However, with this averaging we lose important structural and dynamical information. In contrast to NOEs, RDCs do not depend on nearest neighbours, but define orientations relative to an external coordinate frame (the magnetic field) and therefore have a global character. To recover and use this information, we need to force a partial orientation of the sample in the magnetic field (a deviation on the order of 0.1% from the uniform distribution). This orientation can be achieved by dissolving the protein in an anisotropic medium, such as phospholipid bicelles, phage particles or stretched polyacrylamide gels. With this partial protein alignment, the orientation-dependent dipolar couplings are scaled to non-zero values. The obtained RDCs can be used as restraints in structure calculation or refinement, as well as in modelling of protein complexes.
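The averaging argument can be checked numerically. The dipolar coupling scales with the second Legendre polynomial $P_2(\cos\theta) = (3\cos^2\theta - 1)/2$ of the angle $\theta$ between the inter-nuclear vector and the magnetic field. The Python sketch below averages this term over a uniform orientation distribution and over a slightly biased one. The weighting function $1 + a P_2(\cos\theta)$ is a toy model assumed here for illustration (real alignment is described by an alignment tensor), and the maximal coupling of 20 kHz is only an illustrative order of magnitude.

```python
def p2(cos_theta):
    """Second Legendre polynomial, the angular part of the dipolar coupling."""
    return 0.5 * (3.0 * cos_theta ** 2 - 1.0)


def average_p2(alignment, n_steps=20000):
    """Average of P2(cos theta) with weight w = 1 + alignment * P2(cos theta).

    alignment = 0 corresponds to the isotropic (uniform) distribution.
    Uses midpoint integration over x = cos(theta) in [-1, 1].
    """
    dx = 2.0 / n_steps
    numerator = 0.0
    denominator = 0.0
    for i in range(n_steps):
        x = -1.0 + (i + 0.5) * dx
        weight = 1.0 + alignment * p2(x)
        numerator += weight * p2(x) * dx
        denominator += weight * dx
    return numerator / denominator


D_MAX_HZ = 20000.0  # hypothetical maximal coupling, illustrative value only

for a in (0.0, 0.005):
    mean = average_p2(a)
    print(f"alignment = {a}: <P2> = {mean:.6f}, residual coupling = {D_MAX_HZ * mean:.1f} Hz")
```

In this toy model the isotropic average vanishes, while a bias of 0.005 leaves an average of about 0.001, that is, a residual coupling reduced by roughly three orders of magnitude relative to the static value.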
In fact, RDCs seem to be able to sample the whole spectrum of dynamics, from picoseconds to milliseconds [Lakomek et al., 2008]. In the second part of my thesis I look into this range of dynamics of ubiquitin, described by a structural ensemble obtained with RDC restraints [Lange et al., 2008], to understand protein-protein interactions in ubiquitin recognition.
1.6 Distant homology detection and comparative modelling
In the previous sections I have described experimental methods used to obtain protein structural and dynamical data. Even though these approaches have made huge progress, we are still not able to use them for all proteins of interest, due to limitations of time and money, and sometimes still of the techniques themselves. This is particularly pronounced with the development of Next Generation Sequencing [Shendure and Ji, 2008], which floods us with an enormous amount of sequence data, widening the already huge gap between the number of protein sequences (26,000,000 known and predicted proteins in the UniProtKB database, October 2012) and the number of protein structures (85,000 structures in the PDB, of which only 46,000 are unique).
In order to meet increasing needs of structural biology, computational methods
capable of modelling protein structures have been developed [Gu and Bourne, 2009].
They are based on the previously described Anfinsen's hypothesis that protein structure is determined by protein sequence. With this notion and accurate knowledge of the laws of physics, a protein structure could in theory be obtained by simulating protein folding [Levitt and Warshel, 1975]. However, due to limited computational power and the approximations in the physics behind the force fields, we are still not able to solve the folding problem, except for short peptides or small fast-folding proteins [Lindorff-Larsen et al., 2011].
Fortunately, we can use another approach known as homology modelling, based mainly on the observation that evolutionarily related proteins have similar sequences and tend to have similar structures as well [Kendrew et al., 1960; Perutz et al., 1960; Chothia and Lesk, 1986]. The main idea behind this method is to find proteins of known structure which are evolutionarily related to the target protein and use them as templates for building a new model (Figure 1.6).
To find such related proteins, many sequence-based methods have been developed, with various ranges of application depending on sequence similarity [Ginalski et al., 2005]. These methods originate from the well-known algorithms for global (Needleman-Wunsch [Needleman and Wunsch, 1970]) and local (Smith-Waterman [Smith and Waterman, 1981]) sequence alignment. Both use dynamic programming combined with a substitution matrix and gap penalties to find the optimal global or local sequence alignment. Their use is, however, limited by the computational cost of dynamic programming.
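The dynamic programming idea shared by both algorithms can be sketched in a few lines of Python. For brevity, the sketch below returns only the optimal score (no traceback) and makes two simplifying assumptions: a flat match/mismatch score stands in for a real substitution matrix, and a single linear gap penalty is used instead of separate gap opening and extension penalties. The scoring values and function names are illustrative.

```python
def needleman_wunsch_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score (Needleman-Wunsch, linear gap penalty)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):      # leading gaps in seq_b
        score[i][0] = i * gap
    for j in range(1, cols):      # leading gaps in seq_a
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in seq_b
                              score[i][j - 1] + gap)      # gap in seq_a
    return score[-1][-1]


def smith_waterman_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Optimal local alignment score (Smith-Waterman): cells are floored at 0."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            best = max(best, score[i][j])
    return best


print(needleman_wunsch_score("ACGT", "AGT"))
print(smith_waterman_score("XXXGATTACAYYY", "GATTACA"))
```

Both functions fill a table with one cell per pair of positions, so time and memory grow with the product of the two sequence lengths, which is the computational cost referred to above when a query is compared against an entire sequence database.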
The next group of methods, such as FASTA [Pearson and Lipman, 1988] or BLAST [Altschul et al., 1990], uses heuristic methods instead of dynamic programming. Because of that, they are much faster but are not always able to find the optimal solution.
With these methods, we can obtain a local gapped alignment between a pair of sequences, also scored with a substitution matrix and gap penalties (gap initiation and gap extension). They are, however, useful only for close homologs, because each position in the alignment is given the same weight (in both conserved and variable regions). Moreover, by comparing only pairs of sequences we miss the information about evolutionary variability, which is exploited by more advanced methods, like those based on profile-to-sequence comparisons.
The idea behind them is to vary the weight of each alignment position depending on its conservation across the protein sequences belonging to the same protein family, which groups proteins with a common evolutionary history. This conservation is derived from a multiple sequence alignment (MSA) of the query sequence and its homologs, and is encoded in position-specific scoring matrices (PSSMs) [Bork and
[Figure 1.6 (image not reproduced): homology modelling workflow with panels 1a-1d, 2a, 2b and 3. Recoverable labels: query sequence and multiple sequence alignment; meta profile (sequence profile plus predicted secondary structure) of the target protein, compared against a set of meta profiles for proteins with known structure; structural alignment and structure-based sequence alignment of template proteins 1 and 2; sequence-to-structure alignment of the target protein with both templates; model building. Legend: insertion, deletion, conserved.]