Molecular Modeling of Proteins
Second Edition
METHODS IN MOLECULAR BIOLOGY
Series Editor
John M. Walker
School of Life Sciences
University of Hertfordshire
Hatfield, Hertfordshire, AL10 9AB, UK
Edited by
Andreas Kukol
University of Hertfordshire, Hatfield, Hertfordshire, UK
Preface
Over the years, molecular modeling and simulation of biomolecules has become an important tool in the molecular biosciences. Initially situated in the realm of specialists with in-depth knowledge of physics and computer science and access to supercomputers, molecular
modeling is used increasingly by bioscientists who are mainly interested in investigating
biological problems. This development has been supported by improved hardware, such as
multi-core processors or graphic processing units, on the one hand, and accelerated sampling algorithms on the other hand that increase the timescale without increasing the
demands on the hardware or the calculation time. The purpose of Molecular Modeling of
Proteins is to provide a theoretical background of various methods available and to enable
nonspecialists to apply methods to their problems. Most chapters contain, in addition to a
thorough introduction, step-by-step instructions and notes on troubleshooting and how to
avoid common pitfalls.
The current second edition of Molecular Modeling of Proteins provides some updated
chapters and new material not covered in the first edition. The first part describes classical
and advanced simulation methods as well as methods to set up complex systems such as
lipid membranes and membrane proteins. The second part is devoted to the simulation and
analysis of conformational changes of proteins, while Part III covers computational methods for protein structure prediction as well as using experimental data in combination with
computational techniques. The final part contains chapters concerning protein-ligand
interactions, which are relevant in the drug design process.
The topics cover some long established methods together with the latest developments
in the field. The chapters are written by internationally renowned investigators: they include
leading developers of popular simulation packages or force fields.
The second edition of Molecular Modeling of Proteins is directed at researchers in the
physical-, chemical-, and biosciences working in industry and academia, who are interested
in applying the methods in their own research. Additionally, the book forms a valuable
resource for educators who wish to teach courses about molecular modeling.
Hertfordshire, UK
Andreas Kukol
Contents
Preface
Contributors
Part I Simulation Methods
Part II Conformational Change
Part III
Part IV Protein-Ligand Interactions
Index
Contributors
ROMMIE E. AMARO Department of Chemistry and Biochemistry, University of California,
San Diego, CA, USA
ALESSANDRO BARDUCCI Laboratory of Statistical Biophysics, Ecole Polytechnique Fédérale
de Lausanne, Lausanne, Switzerland
JONATHAN BARNOUD INSERM, Lyon, France
PHILIP C. BIGGIN Department of Biochemistry, University of Oxford, Oxford, UK
PETER J. BOND Department of Chemistry, The Unilever Centre for Molecular Science
Informatics, Cambridge, USA; Department of Biological Sciences, National University of
Singapore, Singapore
MASSIMILIANO BONOMI Department of Bioengineering and Therapeutic Sciences
and California Institute of Quantitative Biosciences, University of California,
San Francisco, CA, USA
ALEXANDRE M.J.J. BONVIN Computational Structural Biology Group, Bijvoet Center
for Biomolecular Research, Faculty of Science, Utrecht University, Utrecht,
The Netherlands
ÖZLEM DEMIR Department of Chemistry and Biochemistry, University of California,
San Diego, CA, USA
VICTORIA A. FEHER Department of Chemistry and Biochemistry, University of California,
San Diego, CA, USA
VYTAUTAS GAPSYS Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
PATRICK C. GEDEON Department of Biomedical Engineering, Duke University, Durham,
NC, USA
FRAUKE GRÄTER Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
BERT L. DE GROOT Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
OLGUN GUVENCH Department of Pharmaceutical Sciences, University of New England,
Portland, ME, USA
MING-JING HWANG Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan
ROBERT L. JERNIGAN National Cancer Institute, National Institute of Health, Bethesda,
MD, USA; Interdepartmental Program for Bioinformatics and Computational Biology,
L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University,
Ames, IA, USA
KEJUE JIA National Cancer Institute, National Institute of Health, Bethesda, MD, USA;
Interdepartmental Program for Bioinformatics and Computational Biology, L.H. Baker
Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA, USA
EZGI KARACA Computational Structural Biology Group, Bijvoet Center for Biomolecular
Research, Faculty of Science, Utrecht University, Utrecht, The Netherlands
ATAUR R. KATEBI National Cancer Institute, National Institute of Health, Bethesda, MD,
USA; Interdepartmental Program for Bioinformatics and Computational Biology, L.H.
Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames,
IA, USA
ANDREAS KUKOL School of Life and Medical Sciences, University of Hertfordshire,
Hatfield, UK
Part I
Simulation Methods
Chapter 1
Molecular Dynamics Simulations
Erik Lindahl
Abstract
Molecular dynamics has evolved from a niche method mainly applicable to model systems into a
cornerstone of molecular biology. It provides us with a powerful toolbox that enables us to follow and
understand structure and dynamics in extreme detail, literally on scales where individual atoms can
be tracked. However, with great power comes great responsibility: simulations will not magically
provide valid results; that requires a skilled researcher. This chapter introduces the skills needed and
makes you aware of some potential pitfalls. We focus on the two basic and most used methods: optimizing a structure with energy minimization and simulating motion with molecular dynamics. The
statistical mechanics theory is covered briefly, as are limitations such as the lack of quantum
effects and short timescales. As a practical example, we show each step of a simulation of a small protein, including examples of hardware and software, how to obtain a starting structure, immersing it in
water, and choosing good simulation parameters. You will learn how to analyze simulations in terms
of structure, fluctuations, and geometrical features, and how to create ray-traced movies for presentations.
With modern GPU acceleration, a desktop can perform μs-scale simulations of small proteins in a
day; only 15 years ago this took months on the largest supercomputer in the world. As a final exercise, we show you how to set up, perform, and interpret such a folding simulation.
Key words Molecular dynamics, Simulation, Force field, Protein, Solvent, Energy minimization,
Position restraints, Equilibration, Trajectory analysis, Secondary structure
1 Introduction
Biomolecular dynamics occur over a wide range of scales in both
time and space, and the choice of approach to study them depends
on the question asked. In many cases the best alternative is an
experimental technique, for instance spectroscopy to study bond
vibrations or electrophysiology to study ion channels opening and
closing. However, theoretical methods have made huge advances
over the last few decades, and there are now large domains where modeling and simulation either provide more detail or are more efficient than setting up a new experiment.
Molecular dynamics simulation is far from the only theoretical
method; when the aim is to predict for example the structure
Andreas Kukol (ed.), Molecular Modeling of Proteins, Methods in Molecular Biology, vol. 1215,
DOI 10.1007/978-1-4939-1465-4_1, Springer Science+Business Media New York 2015
Fig. 1 Range of time scales for dynamics in biomolecular systems, on a time axis running from 10^-15 s to 10^3 s. While the individual time steps of molecular
dynamics are 1-2 fs, parallel computers make it possible to simulate on the microsecond scale, and distributed
computing techniques can sample even slower processes, almost reaching milliseconds
2 Theory
Macroscopic properties measured in an experiment are not direct
observations, but averages over billions of molecules representing
a statistical mechanics ensemble. This has deep theoretical implications that are covered in great detail in the literature [4, 5], but
even from a practical point of view there are important consequences: (1) It is not sufficient to work with individual structures;
systems have to be expanded to generate a representative
ensemble of structures (see Note 1) at the given experimental conditions, e.g., temperature and pressure. This is one thing that sets
classical molecular dynamics apart from quantum chemistry. (2)
Thermodynamic equilibrium properties related to free energy,
such as binding constants, solubilities, and relative stabilities, cannot
be calculated directly from individual simulations, but require
more elaborate techniques covered in later chapters, because they all rely
on entropy. (3) For equilibrium properties (in contrast to kinetic ones)
the aim is to examine the ensemble of structures, and not necessarily to reproduce individual atomic trajectories.
The two most common ways to generate statistically faithful
equilibrium ensembles are Monte Carlo and Molecular Dynamics
simulations, where the latter also has the advantage of accurately
reproducing kinetics of non-equilibrium properties such as diffusion or folding times. However, these methods cannot handle the
case where a structure is very far from equilibrium, for instance if
two atoms are almost overlapping after building a new side chain.
To remove such clashes prior to simulation, we typically start
with an Energy Minimization. This type of minimization is also
commonly used to refine low-resolution experimental structures.
All classical simulation methods rely on more or less empirical
sets of parameters called force fields [6-9] to calculate interactions
and evaluate the potential energy of the system as a function of
pointlike atomic coordinates. A force field consists of both the set
of equations used to calculate the potential energy and forces from
particle coordinates, as well as a collection of parameters used in
Fig. 2 Examples of interaction functions in modern force fields. Bonded interactions include covalent bond-stretching, angle-bending, torsion rotation around
bonds, and out-of-plane or improper torsions (not shown). Nonbonded interactions are based on neighborlists and consist of Lennard-Jones attraction and
repulsion, as well as Coulomb electrostatics. Even a small amino acid residue
contains a large number of interactions, and for a protein there are thousands
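$$ m_i \frac{\partial^2 \mathbf{r}_i}{\partial t^2} = \mathbf{F}_i = -\frac{\partial V(\mathbf{r}_1, \ldots, \mathbf{r}_N)}{\partial \mathbf{r}_i} $$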
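To make the equation of motion concrete, the following minimal sketch integrates it for a single particle in one dimension. The harmonic toy potential, the unit mass, the time step, and the velocity Verlet update are all assumptions chosen for illustration; they are not taken from the chapter, and production codes use far more elaborate force fields and integrators.

```python
import math

K_SPRING = 1.0  # hypothetical force constant of the toy potential (reduced units)


def potential(r):
    """Toy potential V(r) = k r^2 / 2 (illustrative assumption)."""
    return 0.5 * K_SPRING * r * r


def force(r):
    """Force from the potential, F = -dV/dr."""
    return -K_SPRING * r


def velocity_verlet(r, v, mass, dt, n_steps):
    """Advance position r and velocity v by n_steps of size dt using m d2r/dt2 = F."""
    f = force(r)
    for _ in range(n_steps):
        v += 0.5 * dt * f / mass   # half-step velocity update with the old force
        r += dt * v                # full-step position update
        f = force(r)               # force at the new position
        v += 0.5 * dt * f / mass   # second half-step velocity update
    return r, v


def total_energy(r, v, mass):
    """Kinetic plus potential energy; it should stay nearly constant."""
    return 0.5 * mass * v * v + potential(r)


# One oscillation period (2*pi in these reduced units) with a small time step.
r_end, v_end = velocity_verlet(r=1.0, v=0.0, mass=1.0, dt=0.01,
                               n_steps=round(2 * math.pi / 0.01))
energy_drift = abs(total_energy(r_end, v_end, 1.0) - total_energy(1.0, 0.0, 1.0))
```

After one period the particle should be back close to its starting position, and the small value of `energy_drift` illustrates why a sufficiently short time step matters.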
Fig. 4 Alternatives to a sharp cutoff for nonbonded Coulomb interactions (relative potential plotted against distance r in nm). Top: By switching off the interaction
(dashed) before the cutoff the force will be the exact derivative of the potential, but the derivative (and thus the force)
will unnaturally increase just before the cutoff. Bottom: Particle-Mesh-Ewald is an amazing algorithm where
the Coulomb interaction (solid) is divided into a short-range term that is evaluated within a cutoff (dashed) and
a long-range term which can be solved exactly in reciprocal space with Fourier transforms (dot-dash)
For PME, the cutoff is not really a cutoff; it only determines the
balance between the two parts, and the long-range part is treated
by assigning charges to a grid that is solved in reciprocal space
through Fourier transforms.
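The split behind Ewald-type methods can be checked numerically. The sketch below uses the standard error-function decomposition 1/r = erfc(βr)/r + erf(βr)/r in reduced units (charges and prefactors omitted); the values of the splitting parameter `beta` and of the cutoff are hypothetical and only meant to show how the cutoff sets the balance between the two parts.

```python
import math


def coulomb_split(r, beta):
    """Split the reduced Coulomb potential 1/r into (short_range, long_range)."""
    short_range = math.erfc(beta * r) / r   # decays fast: summed directly within the cutoff
    long_range = math.erf(beta * r) / r     # smooth: suited to a reciprocal-space grid
    return short_range, long_range


beta = 3.0     # hypothetical splitting parameter in 1/nm
cutoff = 1.0   # hypothetical real-space cutoff in nm
short_at_cutoff, long_at_cutoff = coulomb_split(cutoff, beta)
```

With these illustrative numbers the short-range term is already negligible at the cutoff, while the two terms together still add up to the full 1/r interaction.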
Cutoffs and rounding errors can lead to drifts in energy, which
will cause the system to heat up during the simulation. Even with
a theoretically perfect simulation we would run into problems
since we typically start from an imperfect structure. As the potential energy of this structure decreases during the simulation, the
kinetic energy (i.e., temperature) would increase if the total system
energy was constant. To control this, the system is normally coupled to a thermostat that scales velocities during the integration to
maintain room temperature. Similarly, the total pressure in the system can be adjusted through scaling the simulation box size, either
isotropically or separately in x/y/z dimensions.
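As an illustration of velocity scaling, the sketch below implements the Berendsen weak-coupling rule, λ = sqrt(1 + (Δt/τ)(T_ref/T − 1)). The chapter does not say which thermostat is meant, so this choice is an assumption, as are the masses, velocities, and coupling time; the toy system is one-dimensional with one degree of freedom per particle.

```python
import math

K_B = 0.0083144626  # Boltzmann constant in kJ/(mol K)


def temperature(masses, velocities):
    """Instantaneous temperature from 1D velocities (masses in u, velocities in nm/ps)."""
    kinetic = sum(0.5 * m * v * v for m, v in zip(masses, velocities))
    return 2.0 * kinetic / (len(masses) * K_B)


def berendsen_scale(masses, velocities, t_ref, dt, tau):
    """Scale all velocities by lambda = sqrt(1 + dt/tau * (T_ref/T - 1))."""
    t_now = temperature(masses, velocities)
    lam = math.sqrt(1.0 + (dt / tau) * (t_ref / t_now - 1.0))
    return [lam * v for v in velocities]


masses = [12.0, 16.0, 1.0]        # hypothetical masses
velocities = [0.9, -0.6, 2.5]     # hypothetical velocities, deliberately too hot
hot = temperature(masses, velocities)
cooled = berendsen_scale(masses, velocities, t_ref=300.0, dt=0.002, tau=0.1)
```

A coupling time `tau` much longer than the time step nudges the temperature gently toward the reference; setting `tau` equal to `dt` rescales to the reference temperature in a single step.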
The single most demanding part of simulations is the computation of nonbonded interactions, since millions of pairs have to be
evaluated for each time step. Extending the time step is thus an
important way to improve simulation performance, but unfortunately errors are introduced in bond vibrations already at 1 fs.
However, in most simulations these bond vibrations are not of
interest per se, and can be removed entirely by introducing bond
constraint algorithms such as SHAKE [12] or LINCS [13].
Constraints make it possible to extend time steps to 2 fs, and fixed-length
bonds are likely better approximations of the quantum
mechanical oscillators than harmonic springs (see Note 3). In
the final section we will show you how to go even further.
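A quick back-of-the-envelope calculation shows why bond vibrations limit the time step. The sketch converts a spectroscopic wavenumber to a vibrational period; the value of 3000 cm^-1 is a typical textbook figure for a stretch involving hydrogen and is an assumption, not a number from the chapter.

```python
SPEED_OF_LIGHT_CM_PER_S = 2.99792458e10


def vibration_period_fs(wavenumber_per_cm):
    """Period (in fs) of a vibration given its wavenumber in 1/cm."""
    frequency_hz = SPEED_OF_LIGHT_CM_PER_S * wavenumber_per_cm
    return 1.0e15 / frequency_hz


period = vibration_period_fs(3000.0)   # assumed wavenumber for a bond to hydrogen
steps_per_period_1fs = period / 1.0    # integration steps per oscillation at 1 fs
steps_per_period_2fs = period / 2.0    # integration steps per oscillation at 2 fs
```

With a period of roughly 11 fs, a 2 fs step would sample each oscillation fewer than six times, which is why such bonds are constrained rather than integrated.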
3 Methods
With the basic theory covered, this section will describe how to (1)
choose and obtain a starting structure, (2) prepare it for a simulation, (3) create a simulation box, (4) add solvent water, (5) perform
energy minimization, (6) equilibrate the structure with simulation,
(7) perform the production simulation, and (8) analyze the trajectory data. To reproduce it, you will need access to a Unix/Linux
machine (see Note 4) with a molecular dynamics package installed.
While the options and files below refer to the GROMACS program [14], the description should be reasonably straightforward to
follow with other programs like AMBER [15], CHARMM [16],
or NAMD [17]. It will also be useful to have the molecular viewer
PyMOL [18] and the Unix graph program Grace installed (see Note 5).
3.1 Obtaining a Starting Structure
3.2 Preparation of Input Data
Fig. 5 Cartoon representation of the BPTI structure 6PTI from Protein Data Bank,
with side chains shown as sticks. Including hydrogens, the protein contains
roughly 800 atoms. Ray-traced image generated with PyMOL
The default box is taken from the PDB crystal cell, but a simulation in water requires something larger. The box size is a trade-off,
though: volume is proportional to the box side cubed, and more
water means the simulation is slower. The easiest option is to place
the solute in the center of a cube, with for example 0.75 nm to the
box sides. We will show some more advanced alternatives later,
but for now this will suffice:
editconf -f conf.gro -d 0.75 -o box.gro
where the distance (-d) flag automatically centers the protein in
the box, and the new conformation is written to the file box.gro
(see Note 10).
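The cubic growth of the box volume is easy to quantify. The sketch below estimates how many water molecules fit in a cubic box for a given margin; the solute extent of 3.0 nm is a hypothetical value, the number density of about 33.4 molecules per nm^3 is the approximate value for liquid water at ambient conditions, and the volume taken up by the solute is ignored by default.

```python
WATER_PER_NM3 = 33.4  # approximate number density of liquid water (molecules per nm^3)


def cubic_box_side(solute_extent_nm, margin_nm):
    """Box side when the solute keeps margin_nm of distance to every box face."""
    return solute_extent_nm + 2.0 * margin_nm


def approx_water_count(side_nm, solute_volume_nm3=0.0):
    """Rough number of waters in a cubic box, optionally excluding the solute volume."""
    free_volume = max(side_nm ** 3 - solute_volume_nm3, 0.0)
    return round(WATER_PER_NM3 * free_volume)


side_small = cubic_box_side(3.0, 0.75)   # hypothetical 3.0 nm solute, 0.75 nm margin
side_large = cubic_box_side(3.0, 1.0)    # same solute, 1.0 nm margin
waters_small = approx_water_count(side_small)
waters_large = approx_water_count(side_large)
```

In this example, widening the margin by only 0.25 nm adds more than a thousand water molecules, which translates directly into a slower simulation.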
3.4 Adding Solvent Water
The last step before the simulation is to add water in the box to
solvate the protein. This is done by using a small pre-equilibrated
system of water coordinates that is repeated over the box, and
Fig. 6 BPTI solvated in water in a cubic box. Note that there is quite a lot of water,
in particular in the box corners
In principle you could use the system as is, but the net charge on
the protein is unphysical in an infinite system, and many proteins
interact with counterions. There is a GROMACS program to help
us with this, but we first need an input file. GROMACS uses a
separate preprocessing program grompp to collect parameters,
topology, and coordinates into a single run input file (em.tpr)
from which the simulation is then started (this makes it easier to
move it to another computer). Here we are not really going to
run anything, so just create an empty file called ions.mdp and
prepare an input file as:
3.7 Position Restrained Equilibration
The difference between equilibration and production run is minimal: the position restraints and pressure coupling are turned off
(see Note 15), we decide how often to write output coordinates to
analyze (say, every 5,000 steps), and start a significantly longer
simulation. How long depends on what you are studying, and that
should be decided before starting any simulations. For decent sampling the simulation should be at least ten times longer than the
phenomena you are studying, which unfortunately sometimes
Fig. 7 Instantaneous root-mean-square displacement (RMSD, in nm) of all heavy atoms in Lysozyme as a function of time (ns) during the
simulation (solid), relative to the crystal structure. To a large extent atoms are vibrating around an equilibrium,
so the RMSD of a 1-ns running average structure (dashed gray) is a better measure
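The quantity plotted in Fig. 7 can be sketched in a few lines. This minimal version computes the RMSD between a frame and a reference for equally ordered atoms, without the least-squares superposition that analysis tools normally perform first; the coordinates are made up for illustration.

```python
import math


def rmsd(coords, reference):
    """Root-mean-square displacement (nm) between two equally ordered coordinate lists."""
    if len(coords) != len(reference) or not coords:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((x - xr) ** 2 + (y - yr) ** 2 + (z - zr) ** 2
                  for (x, y, z), (xr, yr, zr) in zip(coords, reference))
    return math.sqrt(squared / len(coords))


# Made-up reference positions (nm) and a rigidly shifted copy of them.
reference = [(0.0, 0.0, 0.0), (0.15, 0.0, 0.0), (0.15, 0.15, 0.0)]
shifted = [(x + 0.03, y, z + 0.04) for x, y, z in reference]
shift_rmsd = rmsd(shifted, reference)
```

The rigid shift in this example gives a nonzero RMSD although the structure itself is unchanged, which is exactly why real analyses fit each frame to the reference before measuring.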
3.9.3 Secondary Structure
Fig. 8 Top: Root-mean-square fluctuations (RMSF, in nm) of residue coordinates in the simulation, plotted per residue. Bottom: The fluctuations
can be converted to X-ray temperature factors (solid), which agree quite well with the experimental B-factors
from the PDB file (dashed)
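The conversion mentioned in the caption of Fig. 8 follows the standard isotropic relation B = (8π²/3)⟨Δr²⟩. A minimal sketch, assuming the RMSF is given in nm and the B-factor is wanted in Å² (the usual PDB unit):

```python
import math


def rmsf_to_bfactor(rmsf_nm):
    """Convert an RMSF in nm to an isotropic B-factor in square angstrom."""
    rmsf_angstrom = 10.0 * rmsf_nm
    return (8.0 * math.pi ** 2 / 3.0) * rmsf_angstrom ** 2


b_example = rmsf_to_bfactor(0.05)   # an RMSF of 0.05 nm, i.e. 0.5 angstrom
```

Because the B-factor depends on the square of the fluctuation, doubling the RMSF quadruples the B-factor.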
Fig. 9 Local secondary structure in BPTI (residue number against time in ps) during the simulation, according to the DSSP
definition; the legend distinguishes coil, B-sheet, B-bridge, bend, turn, A-helix, and 3-helix. Note how some elements periodically lose a bit of structure, but it rapidly reforms and the overall
structure is quite stable over 10 ns
There are two more very basic properties that are useful to analyze:
the size of the protein, defined by the radius of gyration, and the
number of hydrogen bonds. To calculate the radius of gyration,
use the command:
g_gyrate -s run.tpr -f run.xtc
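For reference, the mass-weighted radius of gyration is the square root of Σ m_i |r_i − r_com|² / Σ m_i. The sketch below computes it for a single frame; it is an independent illustration of the definition, not a reimplementation of the GROMACS tool, and the two-atom test system is made up.

```python
import math


def radius_of_gyration(masses, coords):
    """Mass-weighted radius of gyration of one frame (same length unit as coords)."""
    total_mass = sum(masses)
    # Center of mass, one component at a time.
    center = [sum(m * c[k] for m, c in zip(masses, coords)) / total_mass
              for k in range(3)]
    # Mass-weighted sum of squared distances to the center of mass.
    weighted = sum(m * sum((c[k] - center[k]) ** 2 for k in range(3))
                   for m, c in zip(masses, coords))
    return math.sqrt(weighted / total_mass)


# Two equal masses 2 nm apart: each is 1 nm from the center of mass.
rg_pair = radius_of_gyration([12.0, 12.0], [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)])
```

A compact, folded protein gives a small and stable value, while unfolding shows up as a steady increase.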
Fig. 10 Top: Radius of gyration (Rgyr, in nm) of BPTI during the 10 ns simulation, as a function of time (ps). This is a good measure of how compact a
structure is. Bottom: Number of hydrogen bonds inside the protein
5 Conclusions
This chapter should hopefully provide a basic introduction to general simulations. An important lesson is that high-quality simulations require a lot of care from the user; just as with experimental
techniques, the entire result can be ruined by a single sloppy step.
Further, recent techniques based on distributed computing and
Markov state models have been able to probe dynamics in the
millisecond range without extending individual simulations to
those scales [31]; this will be covered in much more detail in subsequent chapters presenting metadynamics (Chapter 8) and
accelerated MD (Chapter 12). While simulations are advancing
rapidly due to the continuous development of faster computers,
the field has also been plagued by (published) simulations that
6 Notes
1. Most simulations rely on systems being ergodic, that is, the
time average of the properties of a single molecule on a long
simulation should be the same as the instantaneous ensemble
average over all molecules in an experimental measurement.
This is often (but not always) true, although it assumes our
single simulation is sufficiently long, which can be very inefficient to achieve.
2. The standard harmonic bond potentials in molecular simulations will never allow atoms to separate. However, the alternative Morse potential is supported in many programs (including
GROMACS) and will allow atoms to separate. Still, this is not
used very frequently; if your problem involves breaking and
forming bonds it is likely a better solution to use a QM/MM
simulation.
3. The classical representations can be corrected in a number of
ways to make sure that they are faithful representations of the
real system. This is discussed in great detail in the first chapter
of the GROMACS manual, to which we refer the interested
reader. However, the really important thing in modeling is to
understand your system and decide in each case what approximations are reasonable. It is easy to add more detail (e.g., by
using quantum chemistry), but that automatically means you
lose in the other end by not getting as much sampling. The
challenge is to strike the right balance for each problem!
4. In general, most computational chemistry programs behave best
with the Linux operating system, although it is possible to run
GROMACS on Windows. When starting out, you want a standard AMD or Intel desktop. Currently (2013), you will get the
best price-performance ratio by investing in a single-socket
machine with the fastest consumer processor you can buy, for
instance Intel Core i7 4770. You can get this for well under
$1000. GROMACS and some other codes support GPU acceleration for NVIDIA cards, so to improve performance significantly it is a good idea to add a high-end graphics card such as
16. In this particular case we just used pressure coupling to get the
right density, while the production simulation is performed in
a so-called NVT ensemble (constant number of particles,
volume, and temperature). For some systems, in particular
References
1. Alder BJ, Wainwright TE (1957) Phase transition for a hard sphere system. J Chem Phys 27:1208-1209
2. Rahman A, Stillinger FH (1971) Molecular dynamics study of liquid water. J Chem Phys 55:3336-3359
3. McCammon JA, Gelin BR, Karplus M (1977) Dynamics of folded proteins. Nature 267:585-590
4. Allen MP, Tildesley DJ (1989) Computer simulation of liquids. Clarendon, New York, NY
5. Frenkel D, Smit B (2001) Understanding molecular simulation. Academic, New York, NY
6. Kaminski GA, Friesner RA, Tirado-Rives J, Jorgensen WL (2001) Evaluation and reparametrization of the OPLS-AA force field for proteins via comparison with accurate quantum chemical calculations on peptides. J Phys Chem B 105:6474-6487
7. MacKerell AD Jr et al (1998) All-atom empirical potential for molecular modeling and dynamics studies of proteins. J Phys Chem B 102:3586-3616
8. Oostenbrink C, Villa A, Mark AE, van Gunsteren WF (2004) A biomolecular force field based on the free enthalpy of hydration and solvation: the GROMOS force-field parameter sets 53A5 and 53A6. J Comput Chem 25:1656-1676
9. Wang J, Cieplak P, Kollman PA (2000) How well does a restrained electrostatic potential (RESP) model perform in calculating conformational energies of organic and biological molecules? J Comput Chem 21:1049-1074
10. Chandler D (1987) Introduction to modern statistical mechanics. Oxford University Press, New York, NY
11. Essmann U, Perera L, Berkowitz M, Darden T, Lee H, Pedersen LG (1995) A smooth particle mesh Ewald method. J Chem Phys 103:8577-8593
12. Ryckaert JP, Ciccotti G, Berendsen HJC (1977) Numerical integration of the cartesian equations of motion of a system with constraints; molecular dynamics of n-alkanes. J Comp Phys 23:327-341
13. Hess B, Bekker H, Berendsen HJC, Fraaije JGEM (1998) LINCS: a linear constraint solver for molecular simulation. J Comput Chem 18:1463-1472
14. Lindahl E, Hess B, van der Spoel D (2001) GROMACS 3.0: A package for molecular simulation and trajectory analysis. J Mol Model 7:306-317
30. Ensign DL, Kasson P, Pande V (2007) Heterogeneity even at the speed limit of folding: large-scale molecular dynamics study of a fast-folding variant of the villin headpiece. J Mol Biol 374:806-816
Chapter 2
Transition Path Sampling with Quantum/Classical Mechanics for Reaction Rates
Frauke Gräter and Wenjin Li
Abstract
Predicting rates of biochemical reactions through molecular simulations poses a particular challenge for
two reasons. First, the process involves bond formation and/or cleavage and thus requires a quantum
mechanical (QM) treatment of the reaction center, which can be combined with a more efficient molecular
mechanical (MM) description for the remainder of the system, resulting in a QM/MM approach. Second,
reaction time scales are typically many orders of magnitude larger than the (sub-)nanosecond scale accessible by QM/MM simulations. Transition path sampling (TPS) allows efficient sampling of the space of
dynamic trajectories from the reactant to the product state without an additional biasing potential. We
outline here the application of TPS and QM/MM to calculate rates for biochemical reactions, by means
of a simple toy system. In a step-by-step protocol, we specifically refer to our implementation within the
MD suite Gromacs, which we have made available to the research community, and include practical advice
on the choice of parameters.
Key words Protein folding, Biochemical reactions, QM/MM, Reactive paths, Rate calculations
1 Introduction
Processes such as biochemical reactions or conformational changes of
biomolecules typically occur on timescales beyond those accessible by
Molecular Dynamics (MD) simulations at atomistic detail. In many
cases, reducing the resolution of the simulation by coarse-graining
the biomolecule is not an option, as critical players such as hydrogen
bonds or hydrophobic effects involved in the reaction under investigation might be lost or are described at insufficient accuracy.
Purely classical MD simulations at atomistic resolution routinely can reach microsecond time scales. In a few recent cases,
millisecond scales were achieved, which allowed the prediction of
quantitative rates for the folding of proteins, either by highly parallel distributed computing of many short trajectories or by special
purpose high-performance computing to obtain a small number of
ultralong trajectories [1-3]. However, the conventionally reached
Andreas Kukol (ed.), Molecular Modeling of Proteins, Methods in Molecular Biology, vol. 1215,
DOI 10.1007/978-1-4939-1465-4_2, Springer Science+Business Media New York 2015
2 Theory
Many processes such as chemical reactions or protein folding can
be simplified to processes with two stable states that are separated
by a single high energy barrier. In Fig. 1a, regions A and B are the
two stable states, and the energy barrier is highlighted in between.
For chemical reactions, regions A and B represent the reactant and
product states, respectively. In this example, the multidimensional
space of the system is projected onto two order parameters, R1 and
R2, both of which change during the reaction. Examples of order
parameters, often distances, angles, or collective coordinates, are
given further below. A reactive trajectory (shown as a black solid
line) leads to the rare but crucial transition between A and B. The
system spends considerably longer times in the two free energy
wells of the reactant and product than in the high free energy states
between the two. Thus, while the transition of interest might only
take a few 100 fs, the dwell time of the system in A or B might be
on the microsecond to second time scale. Transition path sampling
(TPS) has been developed to enhance the sampling of the rare
reactive trajectories, which are otherwise hardly harvested by conventional simulations [6, 7, 16-18].
2.1 Sampling the Transition Path Ensemble
The idea of transition path sampling (TPS) is to generate a new transition path from an existing (old) one with a Monte Carlo procedure (a transition path refers
to a reactive trajectory), while ensuring that the
new path is weighted correctly relative to the old one in
the transition path ensemble. In principle, there are many strategies to do this. To illustrate the concept of TPS, we here use the
shooting move in a deterministic simulation as an example.
(a) Defining the probability of a reactive path. In molecular simulations, the time evolution of a system is represented by an
ordered sequence of states, $X(T) \equiv \{X_0, X_{\Delta t}, X_{2\Delta t}, \ldots, X_T\}$ (see
Fig. 1a, black solid line). Here, $\Delta t$ is the time increment. $X(T)$
consists of $L = T/\Delta t + 1$ states, and its starting point is $X_0$.
Fig. 1 Schematic description of the free energy landscape of a system and the
shooting and shifting moves in TPS. (a) A typical free energy landscape of a process is shown with two stable states (labelled A and B) and a barrier in the
middle. R1 and R2 are two arbitrary coordinates. A transition pathway (black solid
line) connecting states A and B is given as well. The transition path is represented
by an ordered sequence of states $X(T) \equiv \{X_0, X_{\Delta t}, X_{2\Delta t}, \ldots, X_T\}$. (b) An example of
shooting moves. The two filled grey areas represent the states A and B mentioned
above. A state $\{q^o_{i\Delta t}, p^o_{i\Delta t}\}$ is randomly chosen from an old transition path (solid line).
The momentum $p^o_{i\Delta t}$ is perturbed to $p^n_{i\Delta t} = p^o_{i\Delta t} + \delta p$, while the
coordinate is unchanged, $q^o_{i\Delta t} = q^n_{i\Delta t}$. From the newly generated state
$\{q^n_{i\Delta t}, p^n_{i\Delta t}\}$, a new transition path (dashed line) is obtained by evolving the system
backward in time to zero and forward in time to T. (c) An example of forward shifting moves. A new path is generated by removing a small segment from the beginning of the old path (the starting frame, shown as a black dot, moves forward to a
new start) and evolving the system forward from the last frame to create a new
part with the same length as the removed one (the dashed line). Figure adapted
from Hierarchical Methods for Dynamics in Complex Molecular Systems, Lecture
Notes, Eds. Grotendorst et al., Juelich, 2012, with permission
$$ P_{AB}(X(T)) = h_A(X_0)\, \rho(X_0)\, h_B(X_T) / Z_{AB}(T) \qquad (1) $$
$$ Z_{AB}(T) \equiv \int dX_0\, h_A(X_0)\, \rho(X_0)\, h_B(X_T) \qquad (2) $$
(b) Sampling the transition path ensemble by shooting. In a transition path ensemble, the distribution of transition paths is given
by Eq. 1. To make sure that correctly weighted transition
paths are sampled, the following two probabilities should
be equal: the probability to generate a new transition path from an
old one, $P_{gen}(X^o(T) \to X^n(T))$, and the probability to generate
the old transition path from the new one, $P_{gen}(X^n(T) \to X^o(T))$.
In a shooting move, a state $X^o_{i\Delta t}$, $i \in [0, L]$, is randomly chosen.
Then, a new state $X^n_{i\Delta t}$ is generated by adding a small perturbation to $X^o_{i\Delta t}$. Here, the superscripts o and n refer to the old
path and the new path, respectively. Since a state X consists of the coordinate q and the momentum p, $X = \{q, p\}$,
the perturbation can be added to q and/or p. In practice, it is convenient to keep q untouched and change p by $\delta p$.
As illustrated in Fig. 1b, the selected state $X^o_{i\Delta t} = \{q^o_{i\Delta t}, p^o_{i\Delta t}\}$ in
an old transition path (the solid line in Fig. 1b) is changed to
$X^n_{i\Delta t} = \{q^n_{i\Delta t}, p^n_{i\Delta t}\}$, where $p^n_{i\Delta t} = p^o_{i\Delta t} + \delta p$. Starting with $X^n_{i\Delta t}$,
one can evolve the system backward in time to 0 and forward
in time to T; a new transition path is generated if it starts
in region A and ends in region B (the dashed line in Fig. 1b).
The probability to generate a new transition path from an old
one is the product of four parts: the probability of the old path
in the given ensemble, the probability to generate $X^n_{i\Delta t}$ from
$X^o_{i\Delta t}$, $P_{gen}(X^o_{i\Delta t} \to X^n_{i\Delta t})$, the probability that the new path
is reactive, and the probability to accept the new transition
path, $P_{acc}(X^o(T) \to X^n(T))$.
$$ P_{gen}(X^o(T) \to X^n(T)) = P_{AB}(X^o(T))\, P_{gen}(X^o_{i\Delta t} \to X^n_{i\Delta t})\, h_A(X^n_0)\, h_B(X^n_T)\, P_{acc}(X^o(T) \to X^n(T)) \qquad (3) $$
Similarly, for generating the old path from the new one, we have
$$ P_{gen}(X^n(T) \to X^o(T)) = P_{AB}(X^n(T))\, P_{gen}(X^n_{i\Delta t} \to X^o_{i\Delta t})\, h_A(X^o_0)\, h_B(X^o_T)\, P_{acc}(X^n(T) \to X^o(T)) \qquad (4) $$
Frauke Gräter and Wenjin Li
$$\frac{P_{acc}(X^o(T) \to X^n(T))}{P_{acc}(X^n(T) \to X^o(T))} \tag{5}$$
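Note that the old path is reactive, i.e., $h_A(X^o_0) = 1$ and $h_B(X^o_T) = 1$.
Equation 6 can be simplified as
$$P_{acc}(X^o(T) \to X^n(T)) = h_A(X^n_0)\, h_B(X^n_T)\, \min\left[1, \frac{\rho(X^n_{i\Delta t})\, P_{gen}(X^n_{i\Delta t} \to X^o_{i\Delta t})}{\rho(X^o_{i\Delta t})\, P_{gen}(X^o_{i\Delta t} \to X^n_{i\Delta t})}\right] \tag{7}$$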
Here, we apply Eq. 1 and the fact that the probabilities of the states
on the same path in deterministic dynamics are the same. Although
Eq. 7 is obtained for deterministic dynamics, it can also be
derived for general dynamics [18]. In the implementation
of shooting moves, a symmetric generation probability is normally
ensured, and thus $P_{gen}(X^o_{i\Delta t} \to X^n_{i\Delta t}) = P_{gen}(X^n_{i\Delta t} \to X^o_{i\Delta t})$. Specific
strategies are always applied to ensure that the states $X^o_{i\Delta t}$ and $X^n_{i\Delta t}$ are
within the same microcanonical ensemble, i.e., $\rho(X^o_{i\Delta t}) = \rho(X^n_{i\Delta t})$.
Thus, the acceptance probability becomes
$$P_{acc}(X^o(T) \to X^n(T)) = h_A(X^n_0)\, h_B(X^n_T) \tag{8}$$
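The shooting move and its acceptance rule can be illustrated with a minimal sketch. It is not the GROMACS implementation: it assumes a one-dimensional double-well potential, unit mass, velocity Verlet integration, a symmetric uniform momentum perturbation, and canonical weights $\rho \propto e^{-\beta H}$, so the acceptance follows Eq. 7 with the generation probabilities cancelling. All names (`shooting_move`, `BETA`, the basin boundaries) are hypothetical.

```python
import math
import random

BETA = 3.0  # assumed inverse temperature in reduced units

def potential(q):
    """Toy double well with minima at q = -1 and q = +1."""
    return (q * q - 1.0) ** 2

def force(q):
    return -4.0 * q * (q * q - 1.0)

def energy(q, p):
    return 0.5 * p * p + potential(q)

def in_A(q):
    return q < -0.8

def in_B(q):
    return q > 0.8

def propagate(q, p, n_steps, dt):
    """Velocity Verlet for unit mass; a negative dt integrates backward in time."""
    states = [(q, p)]
    f = force(q)
    for _ in range(n_steps):
        p += 0.5 * dt * f
        q += dt * p
        f = force(q)
        p += 0.5 * dt * f
        states.append((q, p))
    return states

def initial_path(dt=0.01, p0=1.6, max_steps=100000):
    """Launch from the bottom of basin A with enough energy to reach basin B."""
    path = [(-1.0, p0)]
    for _ in range(max_steps):
        path.append(propagate(*path[-1], 1, dt)[-1])
        if in_B(path[-1][0]):
            return path
    raise RuntimeError("no reactive path found")

def shooting_move(path, dt, dp_max, rng):
    """Return (path, accepted) after one shooting trial."""
    i = rng.randrange(len(path))
    q, p_old = path[i]
    p_new = p_old + rng.uniform(-dp_max, dp_max)  # symmetric perturbation
    # Ratio rho(new) / rho(old) at the shooting point (canonical weights).
    ratio = math.exp(-BETA * (energy(q, p_new) - energy(q, p_old)))
    if rng.random() >= min(1.0, ratio):
        return path, False
    backward = propagate(q, p_new, i, -dt)
    forward = propagate(q, p_new, len(path) - 1 - i, dt)
    new_path = backward[::-1] + forward[1:]
    # h_A(X_0^n) * h_B(X_T^n): keep the trial only if it is reactive.
    if in_A(new_path[0][0]) and in_B(new_path[-1][0]):
        return new_path, True
    return path, False
```

When the perturbation conserves the energy of the shooting point, the ratio equals one and the rule reduces to the reactivity test of Eq. 8.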
(9)
If the time required for a system to cross the energy barrier and
commit to the other stable state ($\tau_{mol}$) is far smaller than the reaction time of the system (i.e., $\tau_{mol} \ll \tau_{rxn}$), C(t) scales linearly in the
intermediate time region, and we have
$$C(t) \approx k_{AB}\, t, \tag{10}$$
where
$$C(t) = \frac{\langle h_A(X_0)\, h_B(X_t) \rangle}{\langle h_A(X_0) \rangle} \tag{11}$$
Here, $\langle \cdots \rangle$ is the ensemble average over all initial states. In deterministic dynamics, C(t) can be written in terms of the probability of all
initial states $\rho(X_0)$:
$$C(t) = \frac{\int dX_0\, \rho(X_0)\, h_A(X_0)\, h_B(X_t)}{\int dX_0\, \rho(X_0)\, h_A(X_0)} \tag{12}$$
Equations 10 and 12 together provide a way to calculate the forward reaction rate constant $k_{AB}$ by molecular simulations. One can
simply run a large set of simulations that start in region A and are
of the same time length t, and then count the fraction of
end states that lie in region B, which gives the value of C(t). The
derivative of C(t) with respect to time gives the rate constant. However, this
brute-force approach is computationally very expensive, because transitions are rare.
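The brute-force estimate described above can be sketched as follows. This is only an illustration under assumed inputs: the trajectory outcomes are made-up flags, not simulation results, and the helper names are hypothetical.

```python
def correlation(end_in_B):
    """C(t): fraction of trajectories started in A that end in B at time t."""
    return sum(1 for flag in end_in_B if flag) / len(end_in_B)

def linear_slope(times, values):
    """Least-squares slope of values versus times (estimate of k_AB)."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    num = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    den = sum((t - t_mean) ** 2 for t in times)
    return num / den

# Hypothetical outcome: 1000 trajectories per time length, k of them ending in B.
times_ps = [2.0, 4.0, 6.0, 8.0]
c_values = [correlation([True] * k + [False] * (1000 - k)) for k in (2, 4, 6, 8)]
k_ab = linear_slope(times_ps, c_values)  # in 1/ps, valid in the linear regime
```

With so few reactive trajectories per thousand, the statistical error of such an estimate is large, which is the motivation for the path sampling approach.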
If region B can be defined by an order parameter $\lambda(X)$, and the
distribution of the end states, i.e., X(t), along the order parameter,
$P(\lambda, t)$, is known, C(t) is simply the integral of $P(\lambda, t)$ along $\lambda$ over
region B.
$$C(t) = \int_{\lambda_{min}}^{\lambda_{max}} d\lambda\, P(\lambda, t). \tag{13}$$
Here, $\lambda_{min}$ and $\lambda_{max}$ are the lower and upper bounds of region
B along $\lambda$. $P(\lambda, t)$ is given by
$$P(\lambda, t) = \frac{\int dX_0\, \rho(X_0)\, h_A(X_0)\, \delta[\lambda - \lambda(X_t)]}{\int dX_0\, \rho(X_0)\, h_A(X_0)} \tag{14}$$
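As a minimal sketch of Eq. 13, the integral over region B can be evaluated from a normalized histogram of endpoint values. The endpoint values, bin settings, and function names below are illustrative assumptions; in practice the distribution is assembled from window sampling rather than from a single histogram.

```python
def histogram_density(values, lo, hi, n_bins):
    """Normalized histogram: density[i] approximates P(lambda) in bin i."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) / width)] += 1
    total = len(values)
    return [c / (total * width) for c in counts], width

def integrate_region(density, width, lo, lam_min, lam_max):
    """Sum P(lambda) * d(lambda) over bins whose centers lie in [lam_min, lam_max]."""
    result = 0.0
    for i, p in enumerate(density):
        center = lo + (i + 0.5) * width
        if lam_min <= center <= lam_max:
            result += p * width
    return result

# Hypothetical endpoint values of the order parameter for paths started in A.
endpoints = [-0.93, -0.72, -0.61, -0.18, 0.12, 0.55, 0.62, 0.93]
density, width = histogram_density(endpoints, -1.0, 1.0, 20)
c_t = integrate_region(density, width, -1.0, 0.5, 1.0)  # region B: 0.5 to 1.0
```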
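$$C(t) = \frac{\langle h_B(t) \rangle_{AB}}{\langle h_B(t') \rangle_{AB}}\, C(t'), \qquad 0 < t < T \tag{15}$$
$$k_{AB} = \frac{d\langle h_B(t) \rangle_{AB} / dt}{\langle h_B(t') \rangle_{AB}}\, C(t') \tag{16}$$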
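The step from Eq. 15 to Eq. 16 can be written out explicitly. The following is a short worked derivation, assuming the factorization $C(t) = C(t')\, \langle h_B(t) \rangle_{AB} / \langle h_B(t') \rangle_{AB}$ and the linear regime of Eq. 10, in which $dC(t)/dt \approx k_{AB}$; only the first factor depends on t, so the derivative acts on $\langle h_B(t) \rangle_{AB}$ alone.

```latex
\begin{aligned}
k_{AB} \approx \frac{dC(t)}{dt}
  &= \frac{d}{dt}\left[\frac{\langle h_B(t) \rangle_{AB}}{\langle h_B(t') \rangle_{AB}}\, C(t')\right] \\
  &= \frac{C(t')}{\langle h_B(t') \rangle_{AB}}\, \frac{d\langle h_B(t) \rangle_{AB}}{dt}
\end{aligned}
```

The plateau of $d\langle h_B(t) \rangle_{AB}/dt$ therefore fixes the time dependence, while a single value $C(t')$ sets the absolute scale.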
3 Materials
A GROMACS-4.0.7 package [15] with a TPS implementation, written and currently maintained by Dr. Wenjin Li, can
be downloaded from http://wenjin.people.uic.edu/download/Gromacs4_tps_patch.tar.gz (see Note 1). The package can be
installed by following the installation instructions of the
original GROMACS-4.0.7 version at http://www.gromacs.org. A
Linux or Unix system and the FFTW libraries are required for compilation.
4 Methods
In this section, we will describe how to (1) establish a toy system, (2)
define the stable basins, (3) obtain the $\langle h_B(t) \rangle_{AB}$ curve, (4) obtain the
$P(\lambda, t)$ distribution, (5) calculate rate constants, and (6) monitor
TPS. All the files necessary to complete this tutorial are available at
http://wenjin.people.uic.edu/download/example_3_Ar.tar.gz.
Fig. 2 Simulation setup of the toy system. Black spheres: Ar atoms. Grey lines: water molecules
Here we illustrate how to use TPS to calculate the rate constant
of a rare event with a toy system consisting of three Ar atoms
in a water box (see Fig. 2). All three Ar atoms lie on a line
along the Z-axis. Atoms 1 and 3 are held by position restraints
along the X-, Y-, and Z-axes, while atom 2 is restrained along the
X- and Y-axes but free to move along the Z-axis. Position restraints
were switched on by setting define = -DPOSRES in the .mdp file,
with the restraint parameters given in posre.itp. Atoms 1
and 3 are separated by approximately 1.0 nm. Due to the van der
Waals interaction with the other two Ar atoms, atom 2 has two
preferred positions (or stable basins): one about 0.2 nm and
the other about 0.8 nm away from atom 1. There is a relatively high barrier between the two minima. Atom 2 can overcome the attraction
of one Ar atom and transit from one stable basin to the other.
Here, we will estimate the rate of these transitions with TPS. The
parameters for the van der Waals interaction between two Ar atoms
have been modified to unrealistic values (see file ffoplsaanb.itp) to
raise the barrier between the two minima and make sure that the
transition is a rare event (see Note 2). We are therefore
looking at an unphysical toy model, chosen solely to focus on the procedure for running TPS with the modified GROMACS package.
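The origin of the two basins can be sketched with a one-dimensional model in which atom 2 feels a Lennard-Jones potential from each fixed neighbor. The parameters below are illustrative assumptions chosen so that each pair minimum lies at 0.2 nm; they are not the modified values in ffoplsaanb.itp, and water is ignored.

```python
def lennard_jones(r, epsilon, sigma):
    """Standard 12-6 Lennard-Jones pair energy."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def potential(z, separation=1.0, epsilon=1.0, sigma=0.2 / 2.0 ** (1.0 / 6.0)):
    """Energy of atom 2 at position z (nm) between atoms 1 (z=0) and 3 (z=separation)."""
    return (lennard_jones(z, epsilon, sigma)
            + lennard_jones(separation - z, epsilon, sigma))

def local_minima(zs, us):
    """Grid points that are strictly lower than both neighbors."""
    return [zs[i] for i in range(1, len(zs) - 1)
            if us[i] < us[i - 1] and us[i] < us[i + 1]]

zs = [0.15 + 0.001 * i for i in range(701)]  # scan 0.15 nm to 0.85 nm
us = [potential(z) for z in zs]
minima = local_minima(zs, us)  # two basins, near 0.2 nm and 0.8 nm
```

In this sketch the barrier between the basins scales with the assumed well depth `epsilon`, which is the quantity one would increase to make the transition rare.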
4.1.1 Definition of Stable Basins
tps_grps1            = a_1 a_2
tps_grps2            = a_2 a_3
tps_dimension        = one
tps_weight_dim       = 1 -1
tps_initial_max      = -0.5
tps_initial_min      = -1
tps_final_max        = 1
tps_final_min        = 0.5
=========================
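One way to read these settings is sketched below. The interpretation is an assumption, not taken from the package documentation: the order parameter d is taken to be the weighted sum of the two group distances (weights 1 and -1 from tps_weight_dim), and the basins are the intervals given by the tps_initial_* and tps_final_* values. The function names are hypothetical.

```python
def order_parameter(z1, z2, z3, weights=(1.0, -1.0)):
    """d = w1 * |z2 - z1| + w2 * |z3 - z2| for three atoms on the Z-axis (nm)."""
    d12 = abs(z2 - z1)
    d23 = abs(z3 - z2)
    return weights[0] * d12 + weights[1] * d23

def classify(d, initial=(-1.0, -0.5), final=(0.5, 1.0)):
    """Assign a value of d to basin A, basin B, or the transition region."""
    if initial[0] <= d <= initial[1]:
        return "A"
    if final[0] <= d <= final[1]:
        return "B"
    return "transition"

# Atom 2 at 0.2 nm from atom 1 gives d = 0.2 - 0.8 = -0.6 (basin A);
# atom 2 at 0.8 nm from atom 1 gives d = 0.8 - 0.2 = 0.6 (basin B).
state_near_atom1 = classify(order_parameter(0.0, 0.2, 1.0))
state_near_atom3 = classify(order_parameter(0.0, 0.8, 1.0))
```

Under this reading, both preferred positions of atom 2 fall well inside their basins, and the midpoint (d = 0) lies in the transition region.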
tps_npost            = 4
tps_grps1            = a_1 a_2
tps_grps2            = a_2 a_3
tps_dimension        = one
tps_weight_dim       = 1 -1
tps_initial_max      = -0.5
tps_initial_min      = -1
tps_final_max        = 1
tps_final_min        = 0.5
tps                  = rand_ini
tps_maxcycle         = 5
tps_maxshoot         = 10
tps_endpoint         = yes
tps_kin_ref          = 100
tps_Temperature      = 300
tps_forward_steps    = 400
tps_backward_steps   = 400
tps_maxframe         = 1
tps_ntrrout          = 1
=========================
4.3 Obtaining the $\langle h_B(t) \rangle_{AB}$ Curve
Computing the rate constant with TPS requires two ingredients: the flux versus time, i.e., the $\langle h_B(t) \rangle_{AB}$ curve, and the probability distribution
along an order parameter, $P(\lambda, t)$, or specifically P(d) in this case,
which is used to calculate the value of C(t) at a specific time t
(see Theory). With an initial path at hand, we can start TPS to
obtain these ingredients for the rate constant calculations. The settings for this purpose are:
=====Part of tps.mdp =====
tps_npost            = 4
tps_grps1            = a_1 a_2
tps_grps2            = a_2 a_3
tps_dimension        = one
tps_weight_dim       = 1 -1
tps_initial_max      = -0.5
tps_initial_min      = -1
tps_final_max        = 1
tps_final_min        = 0.5
tps                  = normal
tps_maxcycle         = 150
tps_maxshoot         = 10
tps_maxshift         = 10
tps_endpoint         = no
tps_kin_ref          = 100
tps_reput_length     = 300
tps_maxframe         = 800
tps_ntrrout          = 0
=========================
Fig. 3 Results for the $\langle h_B(t) \rangle_{AB}$ curve. (a) Black curve: the averaged $\langle h_B(t) \rangle_{AB}$ curve. Grey curves: the five
$\langle h_B(t) \rangle_{AB}$ curves obtained from five independent samplings. (b) The derivative of the $\langle h_B(t) \rangle_{AB}$ curve shows a
plateau, indicating a length of 16 ps to be sufficient. Grey: the derivative of the black curve in a. Black: the
grey curve smoothed by averaging over five nearby points
We read the initial reactive path again via the option -rerun.
Here, we run 150 cycles of TPS, with 10 shooting moves and 10
shifting moves in each cycle, for a total of 3,000 TPS runs.
This will take about 4 days to complete on a single standard processor. The resulting $\langle h_B(t) \rangle_{AB}$ curve is saved in hahb.dat.
To obtain an accurate $\langle h_B(t) \rangle_{AB}$ curve, we recommend
running five independent simulations (see Note 7) and combining the resulting five hahb.dat files into one by simple averaging
(Fig. 3a). Here, the derivative of $\langle h_B(t) \rangle_{AB}$ reaches a plateau at 13 ps
with $d\langle h_B(t) \rangle_{AB}/dt = 0.1\ \mathrm{ps}^{-1}$, as shown in Fig. 3b (see Note 8).
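The post-processing described above (averaging independent curves, differentiating, and smoothing over five nearby points as in Fig. 3b) can be sketched as follows. The layout of hahb.dat is not specified here, so the sketch assumes each curve has already been read into a list of values sampled at a fixed time step; the function names are hypothetical.

```python
def average_curves(curves):
    """Point-by-point mean of several curves sampled on the same time grid."""
    return [sum(points) / len(points) for points in zip(*curves)]

def derivative(ys, dt):
    """Central finite differences, one-sided at the two ends."""
    n = len(ys)
    out = [(ys[1] - ys[0]) / dt]
    out += [(ys[i + 1] - ys[i - 1]) / (2.0 * dt) for i in range(1, n - 1)]
    out.append((ys[-1] - ys[-2]) / dt)
    return out

def smooth(ys, window=5):
    """Moving average over `window` nearby points, truncated at the edges."""
    half = window // 2
    out = []
    for i in range(len(ys)):
        chunk = ys[max(0, i - half):min(len(ys), i + half + 1)]
        out.append(sum(chunk) / len(chunk))
    return out
```

A typical use would be `smooth(derivative(average_curves(curves), dt))`, after which the plateau value can be read off and inserted into Eq. 16.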
4.4 Obtaining the $P(\lambda, t)$ Distribution
Fig. 4 Calculation of P(d) through TPS in windows. (a) Distribution of P(d) in different windows. (b) The
connected distribution of P(d) over the whole configuration space. Dashed grey lines: the boundaries
between regions A and B and the transition region
The mdrun command will generate four output files that help to
monitor the progress of the sampling: acc.dat, endpoint.dat, hahb.dat, and summary.dat. They are explained below:
acc.dat: It summarizes the number of shooting trials, the number
of successful shooting trials, the number of shifting trials, and the
number of successful shifting trials at each frame. It also includes
the acceptance ratio for shooting and shifting.
endpoint.dat: It gives the endpoints of the transition paths in terms of the
value of the order parameter, which is used to calculate $P(\lambda, t)$
when tps_endpoint = yes.
hahb.dat: It gives the $\langle h_B(t) \rangle_{AB}$ curve when tps_endpoint = no.
summary.dat: It summarizes the overall number of shooting and
shifting cycles and their acceptance ratios. An example is given
below:
=============== summary.dat ==================
The totol TPS cycle is -----------------------------3000
The totol shooting cycle is ------------------------1524
The totol leftshift cycle is -----------------------747
The totol rightshift cycle is ---------------------729
The totol acceptance is ---------------------------0.3076667
The acceptance for shooting is --------------------0.0577428
The acceptance for leftshift is ------------------0.5689424
The acceptance for rightshift is ------------------0.5624143
===================================
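The numbers in such a summary can be cross-checked with a few lines of code. The sketch below parses the example text shown above (spelling as printed by the program) and verifies that the cycle counts add up and that the total acceptance equals the count-weighted mean of the three per-move acceptances. The parsing pattern is an assumption based only on this example.

```python
import re

SUMMARY = """\
The totol TPS cycle is -----------------------------3000
The totol shooting cycle is ------------------------1524
The totol leftshift cycle is -----------------------747
The totol rightshift cycle is ---------------------729
The totol acceptance is ---------------------------0.3076667
The acceptance for shooting is --------------------0.0577428
The acceptance for leftshift is ------------------0.5689424
The acceptance for rightshift is ------------------0.5624143
"""

def parse_summary(text):
    """Map each label (text between 'The' and 'is') to its numeric value."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"The (.+?) is -+\s*([0-9.]+)\s*$", line)
        if match:
            values[match.group(1)] = float(match.group(2))
    return values

summary = parse_summary(SUMMARY)
moves = ("shooting", "leftshift", "rightshift")
cycle_sum = sum(summary["totol %s cycle" % m] for m in moves)
accepted = sum(summary["totol %s cycle" % m] * summary["acceptance for %s" % m]
               for m in moves)
recomputed = accepted / summary["totol TPS cycle"]
```

For this example the shooting acceptance is much lower than the shifting acceptance, so the overall ratio is dominated by the shifting moves.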
5 Notes
1. The modified GROMACS package supports simulations on
only a single CPU and not in parallel, as neither domain
decomposition nor particle decomposition is supported
in the current implementation.
2. Equation 10 is based on the assumption that the barrier is so
high that the time of the actual transition is much smaller than
the inverse of the rate constant. Therefore, Eq. 10 is only applicable to systems with high energy barriers, i.e., of several $k_B T$.
3. For many systems, the choice of an order parameter is trivial.
One can run a relatively long simulation in each of the two stable states,
and then find an order parameter that distinguishes the stable states
by inspecting the coordinate spaces that the two simulations
sampled in the two basins. Usually, an inspection by eye is enough.
If not, principal component analysis [23] can assist in identifying an order parameter. Once an order parameter is found, one
defines the two stable states according to the distribution of
the sampled configurations along the order parameter. Make
sure that the two basins are separated and cover the major part
of the sampled configurations in that state.
4. One can use multiple coordinates to define regions A and B if
the interest is to investigate the mechanism of the transition
process rather than the rate constant. If one wants to obtain the
rate constant, region A can be defined with multiple coordinates, while region B is preferably defined with a single coordinate, as this reduces the computational expense. If defining
region B by multiple coordinates is nevertheless essential, the
distribution of $P(\lambda, t)$ is required in the multidimensional
Acknowledgment
We are grateful to the Klaus Tschira Foundation for financial
support.
References
1. Lane TJ, Bowman GR, Beauchamp K et al (2011) Markov state model reveals folding and functional dynamics in ultra-long MD trajectories. J Am Chem Soc 133:18413–18419
2. Bowman GR, Pande VS (2010) Protein folded states are kinetic hubs. Proc Natl Acad Sci U S A 107:10890–10895
3. van der Spoel D, Seibert MM (2006) Protein folding kinetics and thermodynamics from atomistic simulations. Phys Rev Lett 96:238102
4. Best RB, Hummer G (2006) Diffusive model of protein folding dynamics with Kramers turnover in rate. Phys Rev Lett 96:228104
5. Popa I, Fernández JM, Garcia-Manyes S (2011) Direct quantification of the attempt frequency determining the mechanical unfolding of ubiquitin protein. J Biol Chem 286:31072–31079
6. Dellago C, Bolhuis PG, Csajka FS et al (1998) Transition path sampling and the calculation of rate constants. J Chem Phys 108:1964
7. Dellago C, Bolhuis PG, Chandler D (1998) Efficient transition path sampling: application
22. Bolhuis PG, Chandler D, Dellago C et al (2002) Transition path sampling: throwing ropes over rough mountain passes, in the dark. Annu Rev Phys Chem 53:291–318
23. Jolliffe I (2005) Principal component analysis. Wiley Online Library
24. Dellago C, Bolhuis PG, Geissler PL (2006) Transition path sampling methods. In: Computer simulations in condensed matter systems: from materials to chemical biology, vol 1. Springer, Berlin, pp 349–391
25. Bolhuis PG (2003) Transition-path sampling of β-hairpin folding. Proc Natl Acad Sci U S A 100:12129–12134
26. Hu J, Ma A, Dinner AR (2006) Bias annealing: a method for obtaining transition paths de novo. J Chem Phys 125:114101
27. Grubmüller H (1995) Predicting slow structural transitions in macromolecular systems: conformational flooding. Phys Rev E 52:2893
28. Laio A, Gervasio FL (2008) Metadynamics: a method to simulate rare events and reconstruct the free energy in biophysics, chemistry and material science. Rep Prog Phys 71:126601
29. Crehuet R, Field MJ (2007) A transition path sampling study of the reaction catalyzed by the enzyme chorismate mutase. J Phys Chem B 111:5708–5718
30. Ma A, Dinner AR (2005) Automatic method for identifying reaction coordinates in complex systems. J Phys Chem B 109:6769–6779
Chapter 3
Current Status of Protein Force Fields for Molecular
Dynamics Simulations
Pedro E.M. Lopes, Olgun Guvench, and Alexander D. MacKerell Jr.
Abstract
The current status of classical force fields for proteins is reviewed. These include additive force fields as well
as the latest developments in the Drude and AMOEBA polarizable force fields. Parametrization strategies
developed specifically for the Drude force field are described and compared with the additive CHARMM36
force field. Results from molecular simulations of proteins and small peptides are summarized to illustrate
the performance of the Drude and AMOEBA force fields.
Key words Force field, Molecular dynamics, Drude polarizable force field, CHARMM, AMOEBA,
AMBER, GROMOS, OPLS, NAMD, Electronic polarization
Introduction
Classical molecular dynamics (MD) simulations of proteins using
empirical force fields have reached a mature state after 35 years of
development and are now widely used as tools to investigate protein
structure and dynamics under a wide variety of conditions. These
include studies of ligand binding, enzymatic reaction mechanisms,
protein folding and unfolding, and protein–protein interactions.
Fundamental to such simulations is the determination of the time
evolution of the energy of the system (a protein, for example) as a function of its atomic coordinates. An accurate description of the
energy is thus required, since the lower energy states are expected
to be populated. The forces acting on individual atoms are obtained
from the negative gradient of the energy function, which is differentiable.
In chemistry the set of potential energy functions from which the
forces are derived is commonly referred to as a force field (FF).
As a result of many years of careful refinement, current additive
protein energy functions are of sufficient quality that they may be
used predictively for studying protein dynamics and protein–protein
interactions and in pharmacological applications [1]. It is
clear that the next major step in advancing protein force field accuracy
Andreas Kukol (ed.), Molecular Modeling of Proteins, Methods in Molecular Biology, vol. 1215,
DOI 10.1007/978-1-4939-1465-4_3, Springer Science+Business Media New York 2015
2.1 CHARMM Force Field
3.1 Drude Polarizable Force Field
4.1 Generic Parametrization Strategies for the Drude Polarizable Force Field
The quality of FFs is heavily dependent on the quality of the underlying parameters. To obtain parameters of sufficient quality that are
capable of producing accurate simulation results, procedures have
been developed to target properties such as molecular geometries
and vibrations, pure solvent properties, and free energies of solvation, among others, during the parametrization. In this section we
will describe parametrization of the polarizable Drude FF implemented in CHARMM. Reference to the well-established protocol
used to derive CHARMM additive FF parameters will be made
whenever a parallel is useful. The general outline of the parametrization process has been described for the CHARMM additive FF
in several publications (see refs. 1 and 19 for more details). Note
that parameter optimization remains an iterative process in the
polarizable FF and several rounds of parametrization are typically
performed until a satisfactory level of agreement with target data is
obtained.
A common strategy in parameter optimization of biological macromolecules is that parameters are developed for small, representative
model compounds and then transferred to the larger macromolecules. The advantages of this approach are: (1) smaller models are
easier to treat using both MM and QM methods and (2) more
experimental data are available for the smaller systems, including
thermodynamic properties of condensed phases, such as heats of
vaporization or sublimation and free energies of aqueous solvation.
It is crucial to include such data in the parameter optimization process to get an accurate description of the nonbonded portion of the FF.
for small molecules [95]. LPs typically carry the charge of the atom
(e.g., N, O, S in proteins) to which they are attached. The associated polarizability and Thole factor are both assigned to the parent
atom. Anisotropic polarizability of hydrogen bond acceptors was
found to be required to reproduce interactions with ions as a function of orientation. Initial values for the partial atomic charges are
taken from the C22 additive all-atom FF, and those for the polarizabilities are based on adjusted Miller's atomic hybrid polarizability
(ahp) values [96].
Although gas-phase properties (e.g., dipole moments) are
easily reproduced with full atomic polarizabilities, scaling of the
polarizabilities has been shown to be necessary to reproduce condensed-phase properties [64]. A scaling factor of approximately
0.7 was found appropriate for the SWM4-DP and SWM4-NDP
water models, while for other classes of molecules scaling factors
range from 0.6 to 1.0, with 1.0 being full polarizability. For
instance, scaling factors are 0.7 for primary and secondary alcohols
[67], 0.85 for aromatics [68], N-containing heterocycles [94],
nucleic acid bases [73] and ethers [97], and 1.0 for alkanes [42].
Other scaling factors are 0.7 for thiols, 0.85 for dimethyl disulfide,
and 0.6 for ethylmethyl sulfide [72]. A value of 0.724 was recently
used in ion parameters [98]. Final optimization of the electrostatic parameters consists of testing the model for reproduction of
the pure solvent dielectric constants and adjusting the polarizability scaling if necessary.
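The scaling factors quoted above can be collected in a small lookup table. The sketch below is only a bookkeeping aid: the dictionary keys and the function name are hypothetical, and it assumes that the scaled polarizability is simply the gas-phase value multiplied by the class-specific factor.

```python
# Polarizability scaling factors as quoted in the text (1.0 = full polarizability).
SCALING_FACTORS = {
    "water (SWM4-DP, SWM4-NDP)": 0.7,  # stated as approximately 0.7
    "primary and secondary alcohols": 0.7,
    "aromatics": 0.85,
    "N-containing heterocycles": 0.85,
    "nucleic acid bases": 0.85,
    "ethers": 0.85,
    "alkanes": 1.0,
    "thiols": 0.7,
    "dimethyl disulfide": 0.85,
    "ethylmethyl sulfide": 0.6,
    "ions": 0.724,
}

def scaled_polarizability(alpha_gas, molecule_class):
    """Scale a gas-phase polarizability by the factor for the given class."""
    return SCALING_FACTORS[molecule_class] * alpha_gas

# All quoted factors fall in the stated range of 0.6 to 1.0.
in_range = all(0.6 <= s <= 1.0 for s in SCALING_FACTORS.values())
```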
Development of parameters to model vdW forces in the Drude
FF, which are treated using the Lennard-Jones (LJ) 6–12 term,
follows closely the protocol established for the additive FF and will
only be briefly outlined here. Jorgensen and coworkers [99, 100]
pioneered the use of condensed-phase simulations, usually pure
liquids, as the basis for optimization of LJ parameters that account for both vdW attraction and interatomic repulsion. Typically, once electrostatic parameters are determined, the
LJ parameters for a model compound can be adjusted to reproduce
experimental pure solvent properties such as heat of vaporization,
density, isothermal compressibility, heat capacity, heat of sublimation, lattice geometry, and free energy of aqueous solvation, as
available. Although this is an effective method for the fine-tuning
of the parameters, there are important issues. One is parameter
correlation, such that LJ parameters for different atoms in a molecule and/or the magnitudes of εij and Rmin on the same atom can
compensate for individual unbalanced values, making it difficult to
gauge whether they are balanced relative to one another [101].
To overcome this problem, a method has been developed to determine the relative value of the LJ parameters based on high-level
QM data [102], with the absolute values being based on scans of εij
and Rmin that reproduce experimental data [103, 104]. This
approach requires supramolecular interactions between rare gases
Table 1
Gas phase dipole moments of alanine dipeptide and (Ala)5a, molecular polarizability of alanine
dipeptide, and relative energies of (Ala)5
Molecular dipole moment of alanine dipeptide (Debye)
R
QMb
C5
DrudeNMA
DrudeALA
Drude-2013
QM
DrudeNMA
DrudeALA
Drude-2013
6.2
5.0
6.4
6.7
4.7
5.8
2.3
2.6
1.3
0.1
3.1
3.0
4.4
5.6
1.8
2.3
1.6
0.9
1.7
1.5
1.0
0.9
0.6
0.3
5.9
4.9
5.4
5.8
1.2
0.9
1.3
1.3
C5
QMc
DrudeNMA
DrudeALA
Drude-2013
QM
DrudeNMA
22.0
13.5
22.4
20.8
11.6
24.4
DrudeALA
4.5
Drude-2013
9.3
C5
QMb
DrudeNMA
DrudeALA
Drude-2013
QM
DrudeNMA
DrudeALA
Drude-2013
x x
13.57
13.40
16.18
15.30
15.49
16.02
19.89
16.07
y y
12.72
12.60
14.29
14.36
12.06
11.87
13.39
12.78
zz
11.71
11.03
12.68
9.94
10.35
9.78
11.05
10.39
Drude-NMA
Drude-ALA
Drude-2013
6.59
6.21
5.31
3.89
14.83
5.77
0.42
10.17
a (Ala)5 is acetyl-(Ala)5-N-methylamide
b QM dipole moments and polarizabilities of alanine dipeptide obtained at the B3LYP/aug-cc-pVDZ level with the polarizabilities scaled by 0.85
c QM dipole moments for (Ala)5 obtained at the B3LYP/6-31G* level
d Single point energies were calculated at the RIMP2/cc-pVTZ//RIMP2/cc-pVDZ level
Fig. 1 Illustration of induced dipoles on dipeptide moieties of alanine dipeptide and (Ala)5. Values in parenthesis
are for alanine dipeptide
weaker than in the longer polypeptide where Ci−1 feels the electric
field originating from the same amino acid's NH group. The case
is similar for the N atoms, with Ni+1 showing a much stronger
induced local dipole in (Ala)5 as compared to the alanine dipeptide.
The induced dipole on C is smaller on (Ala)5, enhancing the
dipole interaction between Ni and Ci. This results in two effects.
First, local dipoles associated with the peptide bonds interact with
each other, enhancing the local dipole moments associated with
each peptide bond, and second, the larger dipole strengthens electrostatic interactions with water, leading to overstabilization of the
C5 conformation. Indeed, a comparison of the dipole moments of
acetyl-(Ala)5-N-methylamide for the NMA-based model with QM
data indicated the overall dipole moment of the C5 conformation
to be significantly overestimated (Table 1). It was, therefore,
hypothesized that the overestimation, which would lead to even
more favorable interactions with aqueous solvent, arose because the
electrostatic parameter optimization procedure based on NMA
alone did not define balanced electrostatic interactions between the
individual peptide bonds. Based on this analysis it was concluded
that use of larger model compounds allowing communication
between adjacent peptide bonds was required in the determination
of electrostatic parameters, with the initial candidate being the alanine dipeptide.
Electrostatic parameters based on the alanine dipeptide were
determined by averaging the components over five independent
sets of parameters obtained from electrostatic potential (ESP) fitting
corresponding to the αR, αL, C5, PPII, and C7eq conformations.
This model is referred to as Drude-ALA in the text below. For
each conformation the electrostatic parameter optimization, which
included the partial atomic charges, atomic polarizabilities, and
atom-based Thole factors, was performed using the FITCHARGE
module of CHARMM by fitting to the QM ESP maps as described
above. The outcome is electrostatic parameters that better reproduce the change in the ESP associated with electrostatic interactions
between the peptide bonds in the different relative orientations.
The resulting Drude-ALA model yielded a smaller dipole moment
for the C5 conformation for acetyl-(Ala)5-N-methylamide (Table 1).
Simulations of (Ala)5 in aqueous solution were also performed
and compared to Drude-NMA, and while the Drude-ALA model
showed improved agreement with experiment, the agreement was
still poor as compared to the additive C36 FF. It was found that the
PPII region started to be populated, though the C5 conformation
still dominated, indicating that the inclusion of electrostatic interactions between the peptide bonds during parameter optimization did
improve the quality of the FF. However, those improvements were
clearly insufficient, indicating that different target data were needed
to obtain a more accurate electrostatic model for the polypeptide
backbone.
target data was obtained for a number of amino acids, notable examples being Ile, Lys, and Thr. Overall, the final OC values are typically
0.7 or higher, though lower values were also found, including Asn χ2,
Asp χ1, Gln χ2, and Glu χ1. The final parameters were used for the
reported polypeptide and protein simulations. Detailed descriptions of the
optimization protocol and final results are presented in ref. 120.
4.4 The AMOEBA Force Field and Parametrization of Proteins
5.1 Peptide Simulations with C36 Additive, AMOEBA-2013, and Drude-2013 Force Fields
Full Proteins
Fig. 2 100-ns snapshots from Drude-2013 simulations (red) of lysozyme (135L) and dethiobiotin synthase
(1BYI) superimposed on the starting crystallographic structures (blue)
Summary
The field of empirical FF based simulations of proteins continues
to develop. Since the last publication of a similar review great progress has been made, including the publication of two polarizable
force fields for proteins as well as improvements in the AMBER
and CHARMM additive protein force fields. Work on other classes
of biopolymers has also made significant progress allowing for simulations of heterogeneous systems. As other researchers start using
the recently published force fields, in particular the polarizable
force fields, limitations will certainly be found and corrections and
improvements are expected.
As was emphasized in this review, development of electrostatic
parameters in the Drude force field is very complex. It is expected
that new optimization algorithms together with more sophisticated target data will lead to significant progress. Polarizable models for other classes of biomolecules based on the Drude oscillator
will be published soon for DNA and carbohydrates as well as a
wider range of lipids.
While polarizable MD simulations will make a significant
contribution to our understanding of protein structure and
function, it should be emphasized that these models are more
sensitive to initial conditions than additive FFs, and can
suffer polarization catastrophes that cause simulations to fail.
To overcome this, it is suggested that systems initially be set up
and equilibrated with an additive FF and then converted to the
polarizable model. To facilitate this procedure, the CHARMM-GUI [130] has been extended to include a new utility, the
Drude Prepper. The Drude Prepper reads equilibrated
CHARMM PSF and coordinate files and converts them to Drude
format files. This includes the production of inputs for MD
simulations using CHARMM or NAMD. This utility will greatly
facilitate the application of the Drude model to a range of
proteins as well as other systems.
Concerning computational efficiency, the Drude model typically requires the use of a 1 fs integration time step during MD
simulations. In addition, there is an approximately twofold overhead associated with the calculation of the polarization contribution to the electrostatics. Thus, the model is approximately fourfold
slower than corresponding additive simulations performed with a
2 fs integration time step. However, the NAMD implementation is
highly parallelizable [59], which will facilitate simulations of large
systems using the Drude model.
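The fourfold estimate follows from multiplying the two factors named above. As a quick check, assuming the time-step penalty and the polarization overhead are independent and multiplicative:

```latex
\frac{t_{\mathrm{Drude}}}{t_{\mathrm{additive}}}
  \approx \frac{2\ \mathrm{fs}}{1\ \mathrm{fs}} \times 2 = 4
```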
Acknowledgement
Financial support from the NIH (GM072558) and computational
support from the University of Maryland Computer-Aided Drug
Design Center, and the Extreme Science and Engineering Discovery
Environment (XSEDE), which is supported by National Science
Foundation grant number OCI-1053575, are acknowledged.
References
1. MacKerell AD (2004) Empirical force fields for biological macromolecules: overview and issues. J Comput Chem 25(13):1584–1604
2. Stone AJ (2008) Intermolecular potentials. Science 321(5890):787–789
3. Freddolino PL, Harrison CB, Liu YX, Schulten K (2010) Challenges in protein-folding simulations. Nat Phys 6(10):751–758
4. Warshel A, Kato M, Pisliakov AV (2007) Polarizable force fields: history, test cases, and prospects. J Chem Theory Comput 3(6):2034–2045
5. Lopes PEM, Roux B, MacKerell AD (2009) Molecular modeling and dynamics studies with explicit inclusion of electronic polarizability: theory and applications. Theor Chem Acc 124(1–2):11–28
6. Zhu X, Lopes PEM, MacKerell AD (2012) Recent developments and applications of the CHARMM force fields. Wiley Interdiscip Rev Comput Mol Sci 2(1):167–185
7. Guvench O, MacKerell AD (2008) Comparison of protein force fields for molecular dynamics simulations. In: Kukol A (ed) Molecular modeling of proteins. Humana Press, Totowa, NJ, pp 63–88
8. Lopes PEM, Harder E, Roux B, MacKerell AD (2009) Formalisms for the explicit inclusion of electronic polarizability in molecular modeling and dynamics studies. In: York DM, Lee T-S (eds) Multi-scale quantum models for biocatalysis. Springer, Netherlands, pp 219–257
9. Salomon-Ferrer R, Case DA, Walker RC (2013) An overview of the Amber biomolecular simulation package. Wiley Interdiscip Rev Comput Mol Sci 3(2):198–210
10. Beauchamp K, Lin Y-S, Das R, Pande V (2012) Are protein force fields getting better? A systematic benchmark on 524 diverse NMR measurements. J Chem Theory Comput 8(4):1409–1414
11. Burkert U, Allinger N (1982) Molecular mechanics. American Chemical Society, Washington, DC
Chapter 4
Lipid Membranes for Membrane Proteins
Andreas Kukol
Abstract
The molecular dynamics (MD) simulation of membrane proteins requires the setup of an accurate
representation of lipid bilayers. This chapter describes the setup of a lipid bilayer system from scratch
using generally available tools, starting with a definition of the lipid molecule POPE, generation of a
lipid bilayer, energy minimization, MD simulation, and data analysis. The data analysis includes the
calculation of area and volume per lipid, deuterium order parameters, self-diffusion constant, and the
electron density profile.
Key words Lipid bilayer, Molecular dynamics, Simulation, Trajectory analysis, Area per lipid, Volume
per lipid, Deuterium order parameter, Self-diffusion constant, Electron density profile
1 Introduction
Molecular simulations of membrane proteins require consideration
of the lipid membrane environment. While molecular dynamics
(MD) simulations with implicit membrane models have been used
successfully [1], an explicit representation of the
lipid bilayer is desirable for higher accuracy. Furthermore,
if lipid–protein interactions are a subject of the study, an
explicit representation of lipid molecules is unavoidable. Having
decided on an explicit representation of lipids, a further choice exists
between coarse-grained, united-atom, and all-atom lipid models
and force fields. In coarse-grained force fields (covered in Chapter
7 of this book) several atoms are subsumed into one particle; for
example, in the MARTINI force field [2, 3] four carbon atoms of
the aliphatic chain are subsumed into one particle. All-atom force
fields usually provide the highest accuracy for the description of
lipids and proteins. United-atom force fields subsume nonpolar
hydrogen atoms into their adjacent carbon atoms, resulting in a
moderate reduction of the number of particles, e.g., for a
1,2-dipalmitoyl-glycero-3-phosphocholine (DPPC) from 130 particles for the all-atom model to 50 particles for the united-atom
Andreas Kukol (ed.), Molecular Modeling of Proteins, Methods in Molecular Biology, vol. 1215,
DOI 10.1007/978-1-4939-1465-4_4, Springer Science+Business Media New York 2015
2 Methods
2.1 Materials
field are assigned to those atoms, bonds, and angles, for example a
predefined bond length and force constant. A good place to obtain
lipid topologies for various force fields is Lipidbook (http://
lipidbook.bioch.ox.ac.uk/) [14]. For this chapter, we will use a
new topology of POPE shown in Fig. 1 based on topologies developed in earlier work for the GROMOS96 53a6 force field [4].
2.3 Lipid Bilayer Setup
[ moleculetype ]
; Name   nrexcl
POPE     3

[ atoms ]
;  nr  type  resnr  residu  atom  cgnr   charge     mass
    1  H      1     POPE    H1     0     0.3000    1.0080 ; qtot:0.3
    2  H      1     POPE    H2     0     0.3000    1.0080 ; qtot:0.6
    3  H      1     POPE    H3     0     0.3000    1.0080 ; qtot:0.9
    4  NL     1     POPE    N4     0    -0.2      14.0067 ; qtot:0.7
    5  CH2    1     POPE    C5     0     0.3      14.0270 ; qtot:1.0
    6  CH2    1     POPE    C6     1     0.4      14.0270 ; qtot:1.0
    7  OA     1     POPE    O7     1    -0.8      15.9994 ; qtot:0.54
    8  P      1     POPE    P8     1     1.7      30.9738 ; qtot:2.3
    9  OM     1     POPE    O9     1    -0.8      15.9994 ; qtot:1.5
   10  OM     1     POPE    O10    1    -0.8      15.9994 ; qtot:0.7
   11  OA     1     POPE    O11    1    -0.7      15.9994 ; qtot:0
   12  CH2    1     POPE    C12    2     0.4      14.0270 ; qtot:0.08
   13  CH1    1     POPE    C13    2     0.3      13.0190 ; qtot:0.52
   14  OE     1     POPE    O14    2    -0.7      15.9994 ; qtot:-0.14
   15  C      1     POPE    C15    2     0.7      12.0110 ; qtot:0.56
   16  O      1     POPE    O16    2    -0.7      15.9994 ; qtot:0.0
   17  CH2    1     POPE    C17    3     0        14.0270 ; qtot:
   18  CH2    1     POPE    C18    4     0        14.0270 ; qtot:
   19  CH2    1     POPE    C19    5     0        14.0270 ; qtot:
   20  CH2    1     POPE    C20    6     0        14.0270 ; qtot:
   21  CH2    1     POPE    C21    7     0        14.0270 ; qtot:
   22  CH2    1     POPE    C22    8     0        14.0270 ; qtot:
   23  CH2    1     POPE    C23    9     0        14.0270 ; qtot:
   24  CR1    1     POPE    C24   10     0        13.0190 ; qtot:
   25  CR1    1     POPE    C25   11     0        13.0190 ; qtot:
   26  CH2    1     POPE    C26   12     0        14.0270 ; qtot:
   27  CH2    1     POPE    C27   13     0        14.0270 ; qtot:
   28  CH2    1     POPE    C28   14     0        14.0270 ; qtot:
   29  CH2    1     POPE    C29   15     0        14.0270 ; qtot:
   30  CH2    1     POPE    C30   16     0        14.0270 ; qtot:
   31  CH2    1     POPE    C31   17     0        14.0270 ; qtot:
   32  CH2    1     POPE    C32   18     0.5      14.0270 ; qtot:
   33  OE     1     POPE    O33   18    -0.7      15.9994 ; qtot:
   34  C      1     POPE    C34   18     0.8      12.0110 ; qtot:
   35  O      1     POPE    O35   18    -0.6      15.9994 ; qtot:
   36  CH2    1     POPE    C36   19     0        14.0270 ; qtot:
   37  CH2    1     POPE    C37   20     0        14.0270 ; qtot:
   38  CH2    1     POPE    C38   21     0        14.0270 ; qtot:
   39  CH2    1     POPE    C39   22     0        14.0270 ; qtot:
   40  CH2    1     POPE    C40   23     0        14.0270 ; qtot:
   41  CH2    1     POPE    C41   24     0        14.0270 ; qtot:
   42  CH2    1     POPE    C42   25     0        14.0270 ; qtot:
   43  CH2    1     POPE    C43   26     0        14.0270 ; qtot:
   44  CH2    1     POPE    C44   27     0        14.0270 ; qtot:
   45  CH2    1     POPE    C45   28     0        14.0270 ; qtot:
   46  CH2    1     POPE    C46   29     0        14.0270 ; qtot:
   47  CH2    1     POPE    C47   30     0        14.0270 ; qtot:
   48  CH2    1     POPE    C48   31     0        14.0270 ; qtot:
   49  CH2    1     POPE    C49   32     0        14.0270 ; qtot:
   50  CH3    1     POPE    C50   33     0        15.0350 ; qtot:
   51  CH2    1     POPE    CA1   34     0        14.0270 ; tail2
   52  CH3    1     POPE    CA2   35     0        15.0350 ; tail2

[ bonds ]
;  ai   aj  funct
    4    5    2    gb_21
    5    6    2    gb_27
    6    7    2    gb_18
    7    8    2    gb_28
    8    9    2    gb_24
    8   10    2    gb_24
    8   11    2    gb_28
   11   12    2    gb_18
   12   13    2    gb_27
   13   14    2    gb_18
   13   32    2    gb_27
   14   15    2    gb_10
   15   16    2    gb_5
   15   17    2    gb_23
   17   18    2    gb_27
   18   19    2    gb_27
   19   20    2    gb_27
   20   21    2    gb_27
   21   22    2    gb_27
   22   23    2    gb_27
   23   24    2    gb_27
   24   25    2    gb_10
   25   26    2    gb_27
   26   27    2    gb_27
   27   28    2    gb_27
   28   29    2    gb_27
   29   30    2    gb_27
   30   31    2    gb_27
   31   51    2    gb_27
   51   52    2    gb_27
   32   33    2    gb_18
   33   34    2    gb_10
   34   35    2    gb_5
   34   36    2    gb_23
   36   37    2    gb_27
   37   38    2    gb_27
   38   39    2    gb_27
   39   40    2    gb_27
   40   41    2    gb_27
   41   42    2    gb_27
   42   43    2    gb_27
   43   44    2    gb_27
   44   45    2    gb_27
   45   46    2    gb_27
   46   47    2    gb_27
   47   48    2    gb_27
   48   49    2    gb_27
   49   50    2    gb_27
    1    4    2    gb_2
    2    4    2    gb_2
    3    4    2    gb_2

[ pairs ]
;  ai   aj  funct
    1    6    1
    2    6    1
    3    6    1
    4    7    1
    5    8    1
    6    9    1
    6   10    1
    6   11    1
    7   12    1
    8   13    1
    9   12    1
   10   12    1
   11   14    1
   11   32    1
   12   15    1
   12   33    1
   13   16    1
   13   17    1
   13   34    1
   14   18    1
   14   33    1
   15   19    1
   15   32    1
   16   18    1
   22   25    1
   24   27    1
   32   35    1
   32   36    1
   33   37    1
   34   38    1
   35   37    1

[ angles ]
;  ai   aj   ak  funct
    4    5    6    2    ga_15
    5    6    7    2    ga_15
    6    7    8    2    ga_26
    7    8    9    2    ga_14
    7    8   10    2    ga_14
    7    8   11    2    ga_5
    8   11   12    2    ga_26
    9    8   10    2    ga_29
    9    8   11    2    ga_14
   10    8   11    2    ga_14
   11   12   13    2    ga_15
   12   13   14    2    ga_13
   12   13   32    2    ga_13
   13   14   15    2    ga_22
   13   32   33    2    ga_15
   14   13   32    2    ga_13
   14   15   16    2    ga_31
   14   15   17    2    ga_16
   15   17   18    2    ga_15
   16   15   17    2    ga_35
   17   18   19    2    ga_15
   18   19   20    2    ga_15
   19   20   21    2    ga_15
   20   21   22    2    ga_15
   21   22   23    2    ga_15
   22   23   24    2    ga_15
   23   24   25    2    ga_27 ; double bond
   24   25   26    2    ga_27 ; double bond
   25   26   27    2    ga_15
   26   27   28    2    ga_15
   27   28   29    2    ga_15
   28   29   30    2    ga_15
   29   30   31    2    ga_15
   30   31   51    2    ga_15
   31   51   52    2    ga_15
   32   33   34    2    ga_22
   33   34   35    2    ga_31
   33   34   36    2    ga_16
   34   36   37    2    ga_15
   35   34   36    2    ga_35
   36   37   38    2    ga_15
   37   38   39    2    ga_15
   38   39   40    2    ga_15
   39   40   41    2    ga_15
   40   41   42    2    ga_15
   41   42   43    2    ga_15
   42   43   44    2    ga_15
   43   44   45    2    ga_15
   44   45   46    2    ga_15
   45   46   47    2    ga_15
   46   47   48    2    ga_15
   47   48   49    2    ga_15
   48   49   50    2    ga_15
    1    4    2    2    ga_10
    2    4    3    2    ga_10
    3    4    1    2    ga_10
    1    4    5    2    ga_11
    2    4    5    2    ga_11
    3    4    5    2    ga_11

[ dihedrals ]
;  ai   aj   ak   al  funct  phi0    cp     mult
    1    4    5    6    1    gd_29
    4    5    6    7    1    gd_4
    4    5    6    7    1    gd_36
    5    6    7    8    1    gd_29
    6    7    8   11    1    gd_20
    6    7    8   11    1    gd_27
    7    8   11   12    1    gd_20
    7    8   11   12    1    gd_27
    8   11   12   13    1    gd_29
   11   12   13   14    1    gd_34
   11   12   13   32    1    gd_34
   11   12   13   32    1    gd_17
   12   13   32   33    1    gd_34
   12   13   32   33    1    gd_17
   12   13   14   15    1    gd_29
   13   32   33   34    1    gd_29
   13   14   15   17    1    gd_13
   14   13   32   33    1    gd_18
   14   15   17   18    1    gd_40
   15   17   18   19    1    gd_34
   17   18   19   20    1    gd_34
   18   19   20   21    1    gd_34
   19   20   21   22    1    gd_34
   20   21   22   23    1    gd_34
   21   22   23   24    1    0       3.350   1
   21   22   23   24    1    180     1.660   2
   21   22   23   24    1    0       7.333   3
   22   23   24   25    3    2.885   4.17 7.8 4.4 0.0 0.0
   24   25   26   27    3    2.885   4.17 7.8 4.4 0.0 0.0
   25   26   27   28    1    0       3.350   1
   25   26   27   28    1    180     1.660   2
   25   26   27   28    1    0       7.333   3
   26   27   28   29    1    gd_34
   27   28   29   30    1    gd_34
   28   29   30   31    1    gd_34
   29   30   31   51    1    gd_34
   30   31   51   52    1    gd_34
   13   32   33   34    1    gd_29
   32   33   34   36    1    gd_13
   33   34   36   37    1    gd_40
   34   36   37   38    1    gd_34
   36   37   38   39    1    gd_34
   37   38   39   40    1    gd_34
   38   39   40   41    1    gd_34
   39   40   41   42    1    gd_34
   40   41   42   43    1    gd_34
   41   42   43   44    1    gd_34
   42   43   44   45    1    gd_34
   43   44   45   46    1    gd_34
   44   45   46   47    1    gd_34
   45   46   47   48    1    gd_34
   46   47   48   49    1    gd_34
   47   48   49   50    1    gd_34

[ dihedrals ]
;  ai   aj   ak   al  funct
   13   14   32   12    2    gi_2
   15   14   17   16    2    gi_1
   34   33   36   35    2    gi_1
   23   24   25   26    2    gi_1 ; double bond

#ifdef POSRES_LIPID
#include "lipid_posre.itp"
#endif

Fig. 1 The complete topology of POPE in the GROMOS96 54a7 force field
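A quick consistency check on a topology such as the one in Fig. 1 is to sum the partial charges, both per charge group and for the whole molecule. The following is a minimal sketch, not part of the original workflow: the function name charge_by_group is hypothetical, and only the atoms with non-zero charges listed in Fig. 1 are included. It assumes the GROMACS [ atoms ] column order nr, type, resnr, residu, atom, cgnr, charge.

```python
def charge_by_group(atom_lines):
    """Sum the charge column of GROMACS [ atoms ] lines per charge group."""
    totals = {}
    for line in atom_lines:
        fields = line.split(";")[0].split()  # strip trailing comments
        if len(fields) < 7:
            continue  # skip blank lines and comment-only lines
        cgnr, charge = int(fields[5]), float(fields[6])
        totals[cgnr] = totals.get(cgnr, 0.0) + charge
    return totals


# Atoms with non-zero partial charges, copied from Fig. 1
# (columns: nr type resnr residu atom cgnr charge)
CHARGED_ATOMS = """
;  nr type resnr residu atom cgnr charge
    1 H    1 POPE H1   0  0.3
    2 H    1 POPE H2   0  0.3
    3 H    1 POPE H3   0  0.3
    4 NL   1 POPE N4   0 -0.2
    5 CH2  1 POPE C5   0  0.3
    6 CH2  1 POPE C6   1  0.4
    7 OA   1 POPE O7   1 -0.8
    8 P    1 POPE P8   1  1.7
    9 OM   1 POPE O9   1 -0.8
   10 OM   1 POPE O10  1 -0.8
   11 OA   1 POPE O11  1 -0.7
   12 CH2  1 POPE C12  2  0.4
   13 CH1  1 POPE C13  2  0.3
   14 OE   1 POPE O14  2 -0.7
   15 C    1 POPE C15  2  0.7
   16 O    1 POPE O16  2 -0.7
   32 CH2  1 POPE C32 18  0.5
   33 OE   1 POPE O33 18 -0.7
   34 C    1 POPE C34 18  0.8
   35 O    1 POPE O35 18 -0.6
""".strip().splitlines()

groups = charge_by_group(CHARGED_ATOMS)
net_charge = sum(groups.values())
for cgnr in sorted(groups):
    print("charge group %2d: %+.3f" % (cgnr, groups[cgnr]))
print("net charge: %.3f" % abs(net_charge))
```

With the charges as listed, the ammonium group and the phosphate group should carry opposite unit charges and the molecule as a whole should be neutral, which is what one expects for a zwitterionic lipid such as POPE.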
Fig. 2 The topology of the whole lipid bilayer system composed of 128 POPE molecules and 2560 water molecules.
The molecular topologies are read from itp-files via the #include commands
Fig. 3 The parameters for initial energy minimization. Note that the values for nsteps (normally thousands) and
emstep (normally 0.1) are set very low in this example
Fig. 4 The Perl script ScaleNonBond.pl to rescale the non-bonding interactions of a GROMACS processed
topology
Fig. 5 The parameters that were used to run the 100 ns MD simulation of the lipid bilayer system
Since this chapter is aimed at providing lipid membranes for membrane proteins, rather than investigating the properties of lipid
bilayers in detail, the aim of the data analysis is to establish whether the
simulation reproduces experimentally known parameters to reasonable accuracy. Typically, the area per lipid and volume per lipid
are compared with experiment; for POPE these are available from a
study by Rappolt et al. [15]. Deuterium order parameters and self-diffusion constants are available for some lipids. If they are not
available for the particular type of lipid, they may be compared
with values from other lipids in order to check for errors in the
topology or simulation.
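The two most commonly compared quantities can be estimated directly from the simulation box. The sketch below shows one common convention and is not necessarily the procedure used for the values reported in this chapter: the area per lipid is the lateral box area divided by the number of lipids in one leaflet, and the volume per lipid is the box volume minus the volume occupied by water, divided by the total number of lipids. The function names and the volume per water molecule are assumptions to be replaced with values appropriate for the water model and temperature in use.

```python
def area_per_lipid(box_x_nm, box_y_nm, lipids_per_leaflet):
    """Lateral box area divided by the number of lipids in one leaflet (nm^2)."""
    return box_x_nm * box_y_nm / lipids_per_leaflet


def volume_per_lipid(box_volume_nm3, n_water, water_volume_nm3, n_lipids):
    """Box volume minus the water volume, divided by the number of lipids (nm^3)."""
    return (box_volume_nm3 - n_water * water_volume_nm3) / n_lipids


# Hypothetical box dimensions for a bilayer of 128 lipids (64 per leaflet)
# solvated with 2560 water molecules; 0.0305 nm^3 per water is an assumed value.
box_x, box_y, box_z = 6.3, 6.3, 5.6
print(area_per_lipid(box_x, box_y, 64))
print(volume_per_lipid(box_x * box_y * box_z, 2560, 0.0305, 128))
```

Applying these functions to every frame of the box-size time series gives curves of the kind used to judge equilibration and to compute averages with their uncertainties.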
Fig. 6 Area/lipid (black curve) and volume/lipid (grey curve) over the course of the 100 ns simulation
Table 1
Properties of the POPE lipid bilayer from simulations compared with experimental data

Property                     Simulation                      Experiment
Area/lipid                   (0.620 ± 0.008) nm²             0.6025 nm² [15]
Volume/lipid                 (1.135 ± 0.003) nm³             1.175 nm³ [15]
Self-diffusion coefficient   (6.42 ± 0.02) × 10⁻⁸ cm²/s
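To put the comparison with experiment into numbers, the relative deviation of each simulated value from the experimental one can be computed. This is a small illustrative sketch using the area and volume per lipid quoted for POPE; the function name relative_deviation_percent is hypothetical.

```python
def relative_deviation_percent(simulated, experimental):
    """Signed deviation of the simulated value from experiment, in percent."""
    return (simulated - experimental) / experimental * 100.0


area_dev = relative_deviation_percent(0.620, 0.6025)    # nm^2
volume_dev = relative_deviation_percent(1.135, 1.175)   # nm^3
print("area/lipid:   %+.1f %%" % area_dev)
print("volume/lipid: %+.1f %%" % volume_dev)
```

Both deviations are a few percent, with the area slightly overestimated and the volume slightly underestimated relative to experiment.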
[C50]
For the sn2 lipid acyl chain, the order parameters must be
calculated separately for the saturated and unsaturated carbons.
The index file for the saturated carbons sn2.ndx contains all
atoms:
[C15]
[C17]
[C18]
[C31]
[CA1]
[CA2]
The index file for the unsaturated carbons sn2_unsat.ndx
contains index groups for the unsaturated carbons and the two
neighbors on each side:
[C23], [C24], [C25], [C26]
The index groups are made interactively with:
make_ndx -f after_100ns.gro -o sn1.ndx
This opens an interactive session, in which you create
index groups for each acyl chain atom using the add command: a c34, a c36, a c37, and so on. Finally,
you delete the default groups with del 0-5 and quit.
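For many carbon atoms the interactive session becomes tedious, and the same index file can be generated with a short script. The following is a sketch under stated assumptions rather than a replacement for make_ndx: it assumes the fixed-column .gro layout (residue name in columns 6-10, atom name in columns 11-15), numbers atoms sequentially from 1 in file order, and uses the hypothetical function names index_groups and format_ndx.

```python
def index_groups(gro_lines, atom_names, resname="POPE"):
    """Collect 1-based atom indices per atom name from the lines of a .gro file."""
    groups = {name: [] for name in atom_names}
    n_atoms = int(gro_lines[1])  # second line of a .gro file holds the atom count
    for serial, line in enumerate(gro_lines[2:2 + n_atoms], start=1):
        if line[5:10].strip() != resname:
            continue
        name = line[10:15].strip()
        if name in groups:
            groups[name].append(serial)
    return groups


def format_ndx(groups, per_line=15):
    """Format the groups as index-file text: one [ name ] block per group."""
    out = []
    for name, atoms in groups.items():
        out.append("[ %s ]" % name)
        for i in range(0, len(atoms), per_line):
            out.append(" ".join(str(a) for a in atoms[i:i + per_line]))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # File names follow the example in the text; the atom list is illustrative.
    with open("after_100ns.gro") as handle:
        lines = handle.read().splitlines()
    chain = ["C34"] + ["C%d" % i for i in range(36, 51)]
    with open("sn1.ndx", "w") as handle:
        handle.write(format_ndx(index_groups(lines, chain)))
```

The order of the groups in the output follows the order of the atom names passed in, which matters because the order parameters are reported per group along the chain.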
2. Calculate the order parameters over the last 70 ns of the simulation from the trajectory:
g_order -f run_100ns.xtc -s run_100ns.tpr -n sn1.ndx -od deuter_sn1.xvg -b 30000
g_order -f run_100ns.xtc -s run_100ns.tpr -n sn2.ndx -od deuter_sn2.xvg -b 30000
g_order -f run_100ns.xtc -s run_100ns.tpr -n sn2_unsat.ndx -od deuter_sn2unsat.xvg -b 30000
3. Using a text editor, replace the order parameters for the unsaturated carbons in deuter_sn2.xvg with the corresponding values
from deuter_sn2unsat.xvg (see Note 4).
2.6.3 Lateral Self-Diffusion Coefficient
1. An index file needs to be prepared that contains all atom numbers belonging to lipid molecules:
make_ndx -f after_100ns.gro -o lipids.ndx
One of the default index groups should correspond to the
lipid, e.g., 2 POPE. Then type keep 2, followed by q to save
and quit.
2. The self-diffusion coefficient can then be calculated with the
g_msd tool:
g_msd -f run_100ns.xtc -s run_100ns.tpr -n lipids.ndx -lateral z -mol diffusion.xvg -o msd.xvg -b 50000
A value of (6.42 ± 0.02) × 10⁻⁸ cm²/s is reported, which is in
the right range for lipid diffusion. Note that only the last 50 ns
of the trajectory were analyzed in the example above due to
computer memory limitations.
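The relation behind this number is the Einstein relation in two dimensions, MSD(t) = 4Dt, so the lateral diffusion coefficient is one quarter of the slope of the mean square displacement against time. The sketch below illustrates the fit and the unit conversion on synthetic data; it is not the g_msd implementation, and the function names are hypothetical. One nm²/ps corresponds to 10⁻² cm²/s.

```python
NM2_PER_PS_TO_CM2_PER_S = 1e-2  # 1 nm^2 = 1e-14 cm^2, 1 ps = 1e-12 s


def fit_slope(times, values):
    """Ordinary least-squares slope of values against times."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    den = sum((t - mean_t) ** 2 for t in times)
    return num / den


def lateral_diffusion_cm2_s(times_ps, msd_nm2):
    """Lateral diffusion coefficient from MSD = 4 D t (two dimensions)."""
    return fit_slope(times_ps, msd_nm2) / 4.0 * NM2_PER_PS_TO_CM2_PER_S


# Synthetic MSD curve with a known diffusion coefficient and a small offset
d_true_nm2_ps = 6.42e-6
times = [1000.0 * i for i in range(1, 51)]            # ps
msd = [4.0 * d_true_nm2_ps * t + 0.05 for t in times]  # nm^2
print(lateral_diffusion_cm2_s(times, msd))
```

In practice the fit should be restricted to the linear regime of the MSD curve, since short times are dominated by non-diffusive motion and long times by poor statistics.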
55
H1 = 1
H2 = 1
H3 = 1
N4 = 7
C5 = 8
C6 = 8
O7 = 8
P8 = 15
O9 = 8
O10 = 8
...
...
...
OW = 8
HW1 = 1
HW2 = 1
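A file of this form, with the number of entries on the first line followed by one "name = electrons" line per atom name, can be generated from the atom names and types of the topology. The sketch below is illustrative only. The electron counts for H, NL, CH2, the oxygen types, and P reproduce the values shown above, where a united atom counts the electrons of the carbon plus its implicit hydrogens; the counts for CH1, CH3, CR1, and C are assumptions that follow the same counting rule, and the names ELECTRONS_BY_TYPE and electrons_file are hypothetical.

```python
# Electrons per atom type; united-atom carbons include their implicit hydrogens.
ELECTRONS_BY_TYPE = {
    "H": 1, "NL": 7, "P": 15,
    "OA": 8, "OM": 8, "OE": 8, "O": 8, "OW": 8,
    "CH2": 8,                            # as in the listing above
    "CH1": 7, "CH3": 9, "CR1": 7, "C": 6,  # assumed by the same rule
}


def electrons_file(atoms):
    """Return the file text for a list of (atom name, atom type) pairs."""
    lines = [str(len(atoms))]
    for name, atom_type in atoms:
        lines.append("%s = %d" % (name, ELECTRONS_BY_TYPE[atom_type]))
    return "\n".join(lines) + "\n"


# First ten lipid atoms from the topology plus the three water atoms
example_atoms = [
    ("H1", "H"), ("H2", "H"), ("H3", "H"), ("N4", "NL"), ("C5", "CH2"),
    ("C6", "CH2"), ("O7", "OA"), ("P8", "P"), ("O9", "OM"), ("O10", "OM"),
    ("OW", "OW"), ("HW1", "H"), ("HW2", "H"),
]
print(electrons_file(example_atoms))
```

For the full system the list would contain all 52 lipid atom names and the three water atom names, which gives the 55 entries announced on the first line of the file.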
3 Notes
1. Clicking further on Next Step provides us with the required
input file to run CHARMM MD simulations to equilibrate the
bilayer on the local computer. Since we want to use a different
force field, we will continue the equilibration with Gromacs.
Fig. 8 The electron density in electrons/nm³ along the z-coordinate of the simulation box
Acknowledgements
This work was supported by the School of Life and Medical
Sciences, University of Hertfordshire and has made use of the
University of Hertfordshire Science and Technology Research
Institute high-performance computing facility. I thank all research
groups that made their tools and programs available to the research
community.
References
1. Tanizaki S, Feig M (2006) Molecular dynamics simulations of large integral membrane proteins with an implicit membrane model. J Phys Chem B 110(1):548-556
2. Monticelli L, Kandasamy SK, Periole X, Larson RG, Tieleman DP, Marrink SJ (2008) The MARTINI coarse-grained force field: Extension to proteins. J Chem Theory Comput 4(5):819-834
3. Marrink SJ, Risselada HJ, Yefimov S, Tieleman DP, de Vries AH (2007) The MARTINI force field: Coarse grained model for biomolecular simulations. J Phys Chem B 111(27):7812-7824
4. Kukol A (2009) Lipid Models for United-Atom Molecular Dynamics Simulations of Proteins. J Chem Theory Comput 5(3):615-626
5. Ulmschneider JP, Ulmschneider MB (2009) United Atom Lipid Parameters for Combination with the Optimized Potentials for Liquid Simulations All-Atom Force Field. J Chem Theory Comput 5(7):1803-1813
6. Schmid N, Eichenberger AP, Choutko A, Riniker S, Winger M, Mark AE et al (2011) Definition and testing of the GROMOS force-field versions 54A7 and 54B7. Eur Biophys J Biophys Lett 40(7):843-856
7. Pronk S, Pall S, Schulz R, Larsson P, Bjelkmar P, Apostolov R et al (2013) GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit. Bioinformatics 29(7):845-854
8. Hess B, Kutzner C, van der Spoel D, Lindahl E (2008) GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable molecular simulation. J Chem Theory Comput 4(3):435-447
Chapter 5
Molecular Dynamics Simulations of Membrane Proteins
Philip C. Biggin and Peter J. Bond
Abstract
Membrane protein structures are underrepresented in the Protein Data Bank (PDB) due to difficulties
associated with expression and crystallization. As such, it is one area where computational studies, particularly Molecular Dynamics (MD) simulations, can provide useful additional information. Recently, there has
been substantial progress in the simulation of lipid bilayers and membrane proteins embedded within
them. Initial efforts at simulating membrane proteins embedded within a lipid bilayer were relatively slow
and interactive processes, but recent advances now mean that the setup and running of membrane protein
simulations is somewhat more straightforward, though not without its problems. In this chapter, we outline practical methods for setting up and running MD simulations of a membrane protein embedded
within a lipid bilayer and discuss methodologies that are likely to contribute future improvements.
Key words Molecular dynamics, Simulation, Computational, Membrane proteins, Ion channels
1 Introduction
Membrane proteins are thought to constitute approximately 30 %
of genomes [1]. Furthermore it has been estimated that over half
of all drug targets are membrane proteins [2]. However, due to
problems associated with expression and crystallization, the number of high-resolution crystal structures is less than 1 % of the total
number of structures (see http://blanco.biomol.uci.edu/mpstruc/
for a maintained list of membrane protein structures). The situation is further complicated by the fact that many membrane proteins undergo quite large conformational changes in order to
complete their function (for example transporter proteins [3-5]
which cycle between at least two distinct states). Crystallography
will at best only be able to capture a time and space averaged snapshot of these states. Computer simulations on the other hand, and
in particular Molecular Dynamics (MD) simulations, are useful
tools that in addition to providing information on the stability of a
membrane protein can also provide insight into the manner in
which these conformational changes can proceed. Thus there has
Andreas Kukol (ed.), Molecular Modeling of Proteins, Methods in Molecular Biology, vol. 1215,
DOI 10.1007/978-1-4939-1465-4_5, Springer Science+Business Media New York 2015
2 Theory
The underlying theory for molecular dynamics simulations is
covered in Chapters 1 and 3, and therefore in this section we briefly
discuss some specific considerations that researchers should bear in
mind when performing simulations of membrane proteins. Perhaps
the most important of these is the timescale of the problem that
is under consideration and the resource that is available. The many
different aspects of membrane dynamics span a large timescale ranging from a few picoseconds (for a protein side-chain to rotate)
through to minutes and longer (for flip-flop motion of lipids). Indeed,
where resources are minimal and only an approximate representation
of the bilayer is required, one may be content with using a slab of
octane to represent the hydrophobic core of the bilayer [14] or even
a hybrid model [15]. Substantial recent efforts have been made to
approximate the lipid molecules in a different way by using a coarse-grain approach, where typically four atoms are represented by one particle [16]. These methods have become popular as they allow much
longer timescale events to be explored, but of course they are less
detailed than a fully atomistic simulation.
3 Methods
There are obviously two main components to a membrane protein
simulation: the actual protein and the lipid bilayer it is to be embedded in. Although we focus on the issues concerning the whole
system, it is worth briefly reviewing practical considerations for
these individual components.
3.1 Preparation of the Protein
question one is trying to address with the simulation. For the case
where only a few atoms are missing from a small number of
side-chains one can manually build in the missing atoms using an
interactive modelling program such as PyMOL [22] or What-If [23]
(see Note 1). For the more complicated case where whole loops are
missing, typically one has to resort to programs which can build
random structures which are geometrically correct such as Modeller
[24, 25]. Indeed, in some cases, it may be that construction of an
entire homology model is required (covered in Chapters 15, 16
and 17). Another related consideration is how to deal with the
termini in the structure. Frequently, the structure is not the whole
sequence of the protein, and therefore charged termini may not
be appropriate. One common procedure has been to build on capping groups that help to best mimic the continuing protein chain
(see Note 2). A simpler approach involves simply protonating the
C-terminus and deprotonating the N-terminus.
In all but the very high-resolution structures, one will still have
to add hydrogen atoms, as these will not be present in the PDB
file. Although this is a very simple process, there are decisions to be
made even here. (1) First, the choice of force field is
important, in particular whether it is an all-atom model, such as the
CHARMM parameter sets [26, 27], or a united-atom model in
which only polar hydrogens are explicit, as in the GROMOS [28-30]
and Berger lipid [31] sets. United-atom force fields give the benefit of
reduced computational effort due to the reduction in the number of
particles, but all-atom models might be preferred in some cases
where greater accuracy is required. It is worth bearing in mind that lipid
force fields are continually under refinement to improve agreement with experimental data [29, 32, 33]. There are many force
fields available for simulating membranes [34], and recent efforts
towards systematically comparing them and assessing their relative
strengths and weaknesses have been reported [35]. (2) Secondly,
the protonation states of ionizable side-chains in proteins must be
considered. Various programs exist to calculate the pKa of ionizable side-chains (PROPKA [36, 37], H++ [38,
39], WHAT IF [23]), several of which also exist as online servers
(see Note 1). A Graphical User Interface (GUI) has recently been
developed as a plug-in for VMD [40] to help interpret the results
of PROPKA-based pKa predictions [41]. Most of the programs
rely upon calculating an estimate of the free energy (via the thermodynamic cycle) of protonating the residue within its proteinaceous environment. It may be the case that the protonation state
is not important, in which case default ionization states at pH 7 are
assumed. However, there are examples where the protonation state
may be critical, as exemplified by the protonation state of Glu71 in
KcsA [42-45]. The position of the hydrogens on histidine residues
should also be considered carefully, usually by simple visual inspection to optimize local hydrogen bonding.
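Once a pKa estimate is available, the expected protonation at the simulation pH follows from the Henderson-Hasselbalch relation. The sketch below is a generic illustration, not the method of any of the programs cited above; the function name and the pKa values used are hypothetical and chosen only to show the two limiting cases.

```python
def fraction_protonated(pka, ph=7.0):
    """Fraction of a titratable group that is protonated at the given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Hypothetical pKa values: an unperturbed acidic side-chain and one whose
# pKa is shifted upwards by a buried, hydrophobic environment.
for label, pka in [("acidic side-chain, pKa 4.0", 4.0),
                   ("shifted side-chain, pKa 8.5", 8.5)]:
    print("%s: %.3f protonated at pH 7" % (label, fraction_protonated(pka)))
```

A group whose predicted pKa lies within about one unit of the target pH is populated substantially in both states, and it is for such residues that the choice of a single protonation state in a fixed-charge simulation deserves the most care.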
3.3 Setup of the Protein in the Membrane
Fig. 1 (a) shows the protein BtuB (dark molecular surface), embedded in the bilayer after the removal of overlapping lipids (only protein and lipid are shown in this figure for clarity). Lipid atoms are shown as van der
Waals spheres. During the course of the equilibration phase, lipid molecules will move in around the protein as
shown in (b) which is an equilibrated system
the lipids move towards the protein and the system equilibrates.
The length of this equilibration phase is usually determined by
monitoring the area per lipid as a function of time. After a period
of time (typically between 1 and 3 ns for systems with 512 lipids)
one should see this plateau off. This value can be checked against
experimental data, although this can be difficult to come by for
exactly the same system. Before unconstrained production or further dynamics can be performed it is best to allow the protein to
relax in stages. There are many different approaches reported in
the literature, which can appear to be rather subjective, but the
underlying philosophy is to work back from the backbone of the
protein (see Note 6).
3.4 An Alternative Coarse-Grained Method
Fig. 2 Illustration of how the atomistic model translates into the coarse-grained model for the KcsA
potassium channel. Aromatic particles are shown as black van der Waals spheres, hydrophobic or backbone
particles are shown in light grey, and polar/charged particles are shown in dark grey
Fig. 3 (a) Shows the random starting configuration of the coarse-grained simulation of KcsA with dipalmitoylphosphatidylcholine (DPPC). KcsA is drawn as a black backbone trace. Lipid acyl chain particles are drawn as
light grey van der Waals spheres, glycerol backbone particles are shown in dark grey, and lipid headgroups
(including the phosphates) are drawn as black spheres. Water molecules are not shown for clarity. (b) is the
configuration after 200 ns, which clearly shows that the system has evolved into a bilayer arrangement with
KcsA embedded within it
The last step is to actually run your atomistic simulation. The primary
emphasis has been on using parameters and ensembles that best
reproduce the properties of lipid bilayers in the absence of proteins. A full review of these considerations is beyond the scope of
this chapter, but the interested reader is referred to several articles
that discuss sources of error and the best choice of parameters in
membrane simulations [17, 103-105].
There are many properties that one could check in the simulation, but probably the most useful is the area per lipid, which gives
an indication of molecular packing and the membrane fluidity. It is
also a property that is sensitive to simulation set up whilst also
being a reasonably reliable indicator that other properties will
also be correct. It is important to remember here what your question is: large undulations across large membrane patches will
require much longer simulation time than a study of water-headgroup interactions, for example.
Finally, there are practical considerations such as disk-space
and storage of very large trajectories (see Note 7), a problem that
is presumably going to parallel the increase in computer power.
4 Conclusions
We have discussed two approaches that can be used to set up and
perform molecular dynamics simulations of membrane proteins.
The advantage of the first atomistic approach is that it is easy to use
and generally applicable. A disadvantage of this approach is that to
some extent it depends on a subjective positioning of the protein
within the bilayer in terms of its overall tilt and its disposition along
the bilayer normal. The second approach, via the use of coarse-grain methodologies, allows one to circumvent these problems.
The combination of both of these methodologies allows one
to explore a wide range of time and length scales with respect to
membrane proteins and should provide valuable information on
their structure and function.
5 Notes
1. There is also an online server version of the What-If program
(http://swift.cmbi.ru.nl/servers/html/index.html) that provides useful tools to rebuild missing atoms in side-chains. Stereochemical checking tools are also available at this
site (useful if you are starting from a model). Similarly, online
servers now exist for pKa calculations of ionisable side-chains,
Table 1
Lipid configurations available for download

PI                    URL                                                         Lipids
Scott Feller          http://www.lipid.wabash.edu/                                POPC, DOPC, DPPC, SDPC
Helmut Heller         http://heller.userweb.mwn.de/membrane/membrane.html         POPC
Wonpil Im             http://www.charmm-gui.org/?doc=input/membrane               Many combinations possible
Mikko Karttunen       http://www.softsimu.net/downloads.shtml                     DMTAP, DMP, DPPC
Peter Tieleman        http://wcm.ucalgary.ca/tieleman/downloads                   DPC micelles, POPC, DMPC, DPPC, PLPC
Alexander Lyubartsev  http://people.su.se/~jjm/Stockholm_Lipids/Downloads.html    Various Stockholm lipids
Jochen Hub            http://cmb.bio.uni-goettingen.de/downloads.html             Lipid patches with cholesterol
Oliver Beckstein      http://lipidbook.bioch.ox.ac.uk/                            Various
Acknowledgements
We thank the Leverhulme Trust and Unilever for support and
Dr Jorge Pikunic for the BtuB coordinates and useful discussions.
References
1. Wallin E, von Heijne G (1998) Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci 7:1029-1038
2. Terstappen GC, Reggiani A (2001) In silico research in drug discovery. Trends Pharmacol Sci 22:23-26
3. Lemieux MJ, Huang Y, Wang DN (2004) The structural basis of substrate translocation by the Escherichia coli glycerol-3-phosphate transporter: a member of the major facilitator superfamily. Curr Opin Struct Biol 14:405-412
4. Guan L, Kaback HR (2006) Lessons from lactose permease. Annu Rev Biophys Biomol Struct 35:67-91
5. Gether U, Andersen PH, Larsson OM, Schousboe A (2006) Neurotransmitter transporters: molecular function of important drug targets. Trends Pharmacol Sci 27:375-383
Chapter 6
Membrane-Associated Proteins and Peptides
Marc F. Lensink
Abstract
This chapter discusses the practical aspects of setting up molecular dynamics simulations of membrane-associated proteins and peptides, and the analysis thereof. Topology files for selected lipids are provided
and selected analysis tools presented. These include tools for the creation of lipid bilayers of mixed lipid
content (DOPE) and easy extraction of lipid coordinates (g_zcoor, g_xycoor), the calculation of helical
axes (g_helixaxis) and aromatic order parameters (g_arom), the determination of peptide- or protein-interacting lipids (g_under), and the investigation of lipid-specific interactions through the calculation of
lipid-bridged residue-residue contacts (g_prolip).
Key words Molecular dynamics, Lipid bilayer, Membrane, Peptide-lipid interaction, Phospholipid,
Cholesterol, GROMACS, Helix axis, DOPE, Lipid order parameter, Specific interaction
1 Introduction
The underrepresentation of membrane protein structures in the
Protein Data Bank [1] is a direct result of the inherent difficulty of
membrane protein crystallization [2], but it stands in sharp contrast with the relevance of membrane proteins to cellular functioning. Roughly 30 % of all genomic sequences encode membrane
proteins, and most major processes in the cell are initiated at the
membrane surface. Due to the lack of atomic-resolution structural
data, molecular modeling and simulation techniques are expected
to play an increasingly relevant role in the study of membrane-related systems. Membrane protein simulations complicate matters
with respect to soluble proteins at a number of levels. The generally larger size of membrane proteins makes for slower convergence, and the presence of a lipid bilayer imposes a larger system box
size; both add up to longer simulation times. The longer simulation times produce larger trajectory files, which are more complicated and take longer to process. However, the bilayer plane
typically aligns with one of the primary system axes, offering an
easy frame of reference.
Andreas Kukol (ed.), Molecular Modeling of Proteins, Methods in Molecular Biology, vol. 1215,
DOI 10.1007/978-1-4939-1465-4_6, Springer Science+Business Media New York 2015
This chapter presents some tips and tricks for the simulation of
membrane-bound proteins and peptides. These include simulation
setup and the creation of lipid bilayers of mixed content, but also
selected analysis tools are presented, that for example allow the
easy extraction of coordinates from the trajectory, the calculation
of helical axes and aromatic order parameters, the determination of
peptide- or protein-interacting lipids, and the investigation of specific protein-lipid interactions through the calculation of lipid-bridged residue-residue contacts. The tips and tricks presented
in this chapter refer to the GROMACS [3] suite of programs
(see Note 1), but their principles are generally applicable to other
simulation packages. The presented simulation setup and analysis
tips are based on two simulation studies: the association of a cationic peptide to a neutral and charged lipid bilayer [4], and the
detection of protein-lipid binding in an integral membrane protein
[5]. The analysis tools I wrote are either shell scripts or programs that use the
GROMACS C programming libraries. Most of these are package-independent since they only require a trajectory file, which is no
more than a sequence of structures.
2 System Setup
The general setup of molecular dynamics simulations requires
three input files: a structure file, describing the atomic positions of
the molecules in the system; a topology file, in which the inter-
atomic bonded and non-bonded connections are defined; and a
parameter file, supplying the algorithm with the necessary runtime
parameters. The first two of these I will briefly discuss. Programs or
files in this section are printed in bold typeface.
2.1 The Coordinate File
2.1.1 Situation A: Placing a Peptide on Top of the Lipid Bilayer
In this case the coordinates of peptide and lipids do not occupy the
same space and a simple concatenation of input files suffices.
1. Prepare a PDB file with the coordinates of your peptide.
2. Rotate the structure to adopt the wanted orientation: parallel
or perpendicular to the bilayer.
3. Have its geometric center coincide with that of the lipid bilayer
and increase the z coordinates (see Note 2) by 5-6 nm
(see Note 3).
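The translation in step 3 can be sketched in a few lines. This is a minimal illustration that works on plain coordinate tuples rather than on PDB files: the function names are hypothetical, all coordinates are assumed to be in nm (PDB files store them in Å, so they would need to be converted or the offset scaled accordingly), and the default offset of 5 nm is only one value from the range given above.

```python
def geometric_center(coords):
    """Unweighted mean position of a list of (x, y, z) tuples."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))


def place_above_bilayer(peptide, bilayer, z_offset_nm=5.0):
    """Move the peptide so that its geometric center lies z_offset_nm above
    the geometric center of the bilayer, with matching x and y."""
    pc = geometric_center(peptide)
    bc = geometric_center(bilayer)
    shift = (bc[0] - pc[0], bc[1] - pc[1], bc[2] - pc[2] + z_offset_nm)
    return [(x + shift[0], y + shift[1], z + shift[2]) for x, y, z in peptide]


# Toy coordinates standing in for the peptide and bilayer atoms
peptide = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
bilayer = [(3.0, 3.0, 3.0), (5.0, 5.0, 5.0)]
print(place_above_bilayer(peptide, bilayer))
```

The rotation in step 2 has to be applied before this translation, since rotating afterwards about any point other than the peptide center would move the peptide away from its intended position.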
The common approach in setting up molecular dynamics simulations is to translate a complete coordinate file, containing the
atomic coordinates of the bilayer, protein, water, and counterions,
into a topology using the residue building blocks. In spite of the
maturation of molecular dynamics force fields for the simulation of
proteins, lipid parameters are not as well integrated into these as
one would like. A useful approach therefore is to separate the
[Structure diagram: phospholipid with labeled phosphate (PO4), head group (R), carbonyl, and double bond regions. Head groups shown: POPC, R = (CH2)2N(CH3)3+; POPE, R = (CH2)2NH3+; POPG, R = CH2(CHOH)2; POPA, no R group; POPS, R = CH2CH(NH3+)CO2-]
Fig. 1 Molecular structure of selected phospholipids, here with a sn1 palmitoyl and sn2 oleoyl tail. The different
head groups determine the overall charge of the molecule: PG, PA and PS carry a negative charge, while PE
and PC are neutral. The dashed boxes indicate commonly defined groups of atoms
creation of topology files for protein and lipids and use a container
to combine them. The full procedure then becomes:
1. Extract the coordinates of your protein and process these with
pdb2gmx to create a topology. In case you have a small peptide, consider capping the C- and/or N-terminus (see Note 8).
2. Cut everything from the topology file that is not referring to
the molecule definition and save the result into a file called
protein.itp. This file you can then include at the appropriate
position in your global container topol.top.
3. Collect the topological descriptions for the other molecules in
the system, i.e., lipids, water, and counterions, and combine
the topology files (see Note 9).
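The resulting container file is short: a series of #include lines followed by the [ system ] and [ molecules ] sections. The sketch below assembles such a file as text. It is an illustration under stated assumptions, not a prescribed layout: protein.itp and topol.top are the names used above, whereas the other include file names, the molecule names, and the molecule counts are hypothetical placeholders.

```python
def build_topology(system_name, includes, molecules):
    """Assemble the text of a container topology from include paths and
    (molecule name, count) pairs."""
    lines = ['#include "%s"' % path for path in includes]
    lines += ["", "[ system ]", system_name, "", "[ molecules ]"]
    lines += ["%-12s %d" % (name, count) for name, count in molecules]
    return "\n".join(lines) + "\n"


topology_text = build_topology(
    "Peptide on a lipid bilayer",
    # Hypothetical file names except protein.itp; the force field comes first.
    ["forcefield.itp", "protein.itp", "lipid.itp", "water.itp", "ions.itp"],
    # Hypothetical names and counts; the order must follow the coordinate file.
    [("Protein", 1), ("POPC", 128), ("SOL", 5000), ("CL", 4)],
)

with open("topol.top", "w") as handle:
    handle.write(topology_text)
```

Keeping each molecule type in its own include file means that a lipid topology can be exchanged or modified without touching the protein topology generated by pdb2gmx.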
2.3 The Lipid Bilayer
2.4 Modification of Lipid Topology and Mixed-Lipid Bilayers
Fig. 2 A solvated bilayer containing 90% DPPC and 10% cholesterol molecules. DPPC displayed as black
wireframe, cholesterol as blue sticks, and water molecules as red spheres. The configuration was created with
DOPE; the figure was prepared with PyMOL (The PyMOL Molecular Graphics System, version 1.6.0, Schrödinger, LLC)
3 Analysis
The dynamics of a protein is strongly affected by the presence of a
lipid bilayer. Backbone hydrogen bond shielding [9] and a
decreased dielectric constant in the membrane core [10] promote
the formation of secondary structure, both α or β. In addition, the
bilayer environment places a restraining force on the protein
dynamics due to the decreased fluidity with respect to a soluble
environment. Such external forces are characterized by slow-
motion displacements of secondary structure elements, readily
identified from RMSD plots after fitting to a common reference
frame, typically the protein transmembrane domain. An additional
frame of reference exists in the surface plane of the bilayer, ignoring eventual curvature effects. The orientation of a protein with
respect to the membrane it is binding to is especially relevant in the
case of peptidemembrane association.
3.1 Coordinate Frame in Bilayer Simulations, g_zcoor and g_xycoor
3.2 Calculation of Helical Axis, g_helixaxis
[Figure 3, panel (b) plot: order parameters SL and SN (vertical axis, -0.5 to 1.0) versus time (ns).]
Fig. 3 Aromatic order parameters. (a) Visualization of aromatic order parameters. Solid and dashed arrows
represent SL and SN, respectively. When either arrow is aligned with the normal to the bilayer plane (long arrow), the
respective order parameter equals 1. (b) Concomitant behavior of the aromatic order parameters for a single
tryptophan residue during a 20 ns molecular dynamics simulation. The two order parameters cannot both be aligned with the z axis (equal 1) at the same time, but both can be orthogonal to it (equal -1/2). Notice the immediate
decrease in SL after an increase of SN
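The limiting values quoted in the caption of Fig. 3 are those of a second-rank order parameter. The sketch below assumes the usual form S = (3 cos²θ - 1)/2, with θ the angle between a ring vector and the bilayer normal; this definition is not spelled out in the surviving text, and the function name is hypothetical.

```python
import math

def order_parameter(vector, normal=(0.0, 0.0, 1.0)):
    """Second-rank order parameter S = (3 cos^2(theta) - 1) / 2.

    Assumed definition: theta is the angle between `vector` (the ring long
    axis for SL, the ring normal for SN) and the bilayer normal.
    """
    dot = sum(v * n for v, n in zip(vector, normal))
    norm_v = math.sqrt(sum(v * v for v in vector))
    norm_n = math.sqrt(sum(n * n for n in normal))
    cos_theta = dot / (norm_v * norm_n)
    return 0.5 * (3.0 * cos_theta ** 2 - 1.0)

# A ring whose long axis points along z: SL is 1, SN is -1/2.
s_l = order_parameter((0.0, 0.0, 1.0))   # long axis aligned with the normal
s_n = order_parameter((0.0, 1.0, 0.0))   # ring normal lies in the bilayer plane
```

Because the long axis and the ring normal are perpendicular to each other, their squared cosines with respect to z sum to at most 1, so SL + SN cannot exceed 1/2. This is why the two parameters cannot both equal 1.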
3.5 Calculating Properties of Interacting Lipids
[Figure 4 plot: deuterium order parameter Scd (0.00 to 0.30) versus carbon atom number; one panel is labeled sn2 oleoyl.]
Fig. 4 Lipid deuterium order parameters calculated over a 50 ns molecular dynamics trajectory of a 16-residue
peptide bound to a lipid bilayer. Solid circles denote order parameters calculated over all lipids, open squares
are for peptide-interacting lipids only. Peptide-interacting lipids here account for about 12% of the lipid bilayer
(15 lipids), or 20-25% of the bilayer leaflet
transmembrane helices. But in addition, the lipid carbonyl or phosphate groups can act as acceptors for hydrogen bonds donated
by the protein, or salt bridges may be formed between opposite
charges, e.g., between phosphate and arginine or lysine.
Hydrophobic interactions are the weakest of these; the strongest are salt bridges, which are in fact a particularly strong form of hydrogen bond. Lipids that surround the
protein and anchor it in the bilayer are called annular lipids.
Annular lipids show increased residence times, but they exchange with
bulk lipids on a regular basis. For sufficiently long simulations such effects can be quantified in the lipid diffusion. However,
there is increasing evidence for the existence of non-annular lipid
binding sites, where specifically bound lipids are necessary to
achieve biological function [17, 18].
Molecular dynamics simulations may detect such strong interactions through the occurrence of lipid-mediated salt bridges, where a
single lipid bridges a negatively and a positively charged residue at the
same time. A striking example is a phosphatidylethanolamine (PE)
lipid binding simultaneously to two neighboring charged residues,
D68 and K69, in lactose permease [5]. The binding to K69 is
nonspecific (any phospholipid contains a phosphate group), but
Fig. 5 Example of a lipid-mediated salt bridge. The figure shows a POPE lipid
bound to both Asp-68 and Lys-69 of lactose permease (LacY). LacY is
drawn in cartoon representation, the PE lipid and residues 68 and 69 in ball-and-stick.
The bond lengths displayed are between the hydrogen bond donor and the
hydrogen atom itself
one gets the persistence factor F, which exists for both the donor
(Fdonor) and the acceptor (Facceptor) interaction. The persistence
factor indicates the strength of the interaction and is typically
correlated with residue conservation [5].
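Note 22 defines the persistence factor as the number of frames in which the bridge is active divided by the total number of frames. A minimal sketch of that division follows. It assumes the per-frame activity has already been determined, since the criterion for "active" is not given in the surviving text, and both function names are hypothetical.

```python
def persistence_factor(active_frames):
    """Fraction of frames in which the bridge is active (Note 22)."""
    frames = list(active_frames)
    if not frames:
        raise ValueError("trajectory contains no frames")
    return sum(1 for active in frames if active) / len(frames)

def combined_presence(per_lipid_activity):
    """Persistence when any one of several lipids may form the bridge.

    `per_lipid_activity` holds one equal-length boolean sequence per lipid;
    a frame counts as active if at least one lipid bridges in that frame.
    """
    return persistence_factor(any(frame) for frame in zip(*per_lipid_activity))

# Bridge present in 3 of 4 frames.
f_example = persistence_factor([True, False, True, True])
```

Combining over lipids before dividing avoids counting a frame twice when two lipids bridge the same residue pair at once.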
3.8 Downloadable Files
to
calculated
lipid-bridged
residue-residue
and topology files for POPS, POPC, POPE, and POPG lipids
are made available to the scientific community (see Note 11).
Gromacs needs to be installed (see Note 24) because these programs
link dynamically to the gromacs libraries. The simulations themselves,
however, need not have been performed with gromacs for these
programs to be usable.
4 Notes
1. http://www.gromacs.org/
2. We assume the z axis aligns with the normal to the bilayer
plane.
3. A typical bilayer has a thickness of 4-4.5 nm. A 16-residue
alpha-helical peptide has a length of about 2.5 nm. If we want
to place the peptide perpendicular to the bilayer plane at a minimum
distance of about 2 nm, we need to overcome half the bilayer
thickness, half the peptide length, and add the extra 2 nm, i.e.,
translate by at least 5.5 nm.
4. You can use a solvated bilayer box since the solvation procedure will remove overlapping waters.
5. This is the file vdwradii.dat, which can be copied from the
gromacs topology directory to the working directory.
6. If your protein structure comes from the Protein Data Bank, it
likely features in the Orientation of Proteins in Membranes
database [20, 21]. The database contains membrane protein
structures with a disk of dummy atoms located at the point in
the lipid bilayer (at either side) where the hydrophilic to hydrophobic transfer energy derivative maximizes, i.e., roughly at
the height of the phosphorus atoms in a phospholipid bilayer.
The protein already contains the correct x and y orientation, so
only a translation in the z axis is needed.
7. Steps 3 and 4 can be taken care of by inflategro. After about
eight iterations, the deflation can be increased to 5% per step.
8. Capping is generally necessary to avoid artifacts from a
terminal charge caused by the artificial chain breaking. Take
special care with capping if the simulations complement
experiments where the peptide was capped at one or both
ends. Capping is most easily performed using the residue topology database, by adding a residue with the correct name
at the terminus; hydrogens are then added automatically.
20. For a 128-lipid bilayer this still means that the trajectory has to
be traversed 128 times. When only the peptide-interacting lipids are required, a first step would be the identification of these
lipids to avoid unnecessary processing of the trajectory.
21. Scanning of a trajectory file containing all coordinates in the
system, including water, may become prohibitively slow for
extended simulation times. For many analyses not all coordinates are required and in those cases it is advised to create a
copy of the trajectory file, but containing only those coordinates needed for the analysis. This step usually results in a trajectory that is small enough to avoid the necessity of cutting it
in pieces.
22. This is defined as the combined fractional presence over the
entire simulation and can be calculated through division of the
number of frames the bridge is active by the total number of
frames in the simulation.
23. This is the root mean square fluctuation of atomic positions,
which basically gives information as to how mobile the atom is.
24. The programs compile against gromacs versions 4.5 and 4.6.
Compilation against earlier and later versions may require
minor adaptation of the code.
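The estimate in Note 3 can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name is hypothetical, and the numbers are the ones quoted in the note.

```python
def min_translation(bilayer_thickness_nm, peptide_length_nm, clearance_nm):
    """Minimum z translation that places a perpendicular peptide above the bilayer.

    Half the bilayer thickness and half the peptide length separate the two
    centers when they just touch; the clearance is added on top.
    """
    return bilayer_thickness_nm / 2.0 + peptide_length_nm / 2.0 + clearance_nm

# Thickest bilayer quoted in Note 3, a 2.5 nm peptide and 2 nm of clearance.
shift = min_translation(4.5, 2.5, 2.0)
```

Using the upper end of the thickness range (4.5 nm) gives the 5.5 nm quoted in the note; the lower end (4 nm) would give 5.25 nm.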
References
1. Berman H, Henrick K, Nakamura H, Markley JL (2007) The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res 35(Database issue):D301-D303
2. Carpenter EP, Beis K, Cameron AD, Iwata S (2008) Overcoming the challenges of membrane protein crystallography. Curr Opin Struct Biol 18(5):581-586
3. Pronk S, Pall S, Schulz R, Larsson P, Bjelkmar P, Apostolov R et al (2013) GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit. Bioinformatics 29(7):845-854
4. Lensink MF, Christiaens B, Vandekerckhove J, Prochiantz A, Rosseneu M (2005) Penetratin-membrane association: W48/R52/W56 shield the peptide from the aqueous phase. Biophys J 88(2):939-952
5. Lensink MF, Govaerts C, Ruysschaert JM (2010) Identification of specific lipid-binding sites in integral membrane proteins. J Biol Chem 285(14):10519-10526
6. Schmidt TH, Kandt C (2012) LAMBADA and InflateGRO2: efficient membrane alignment and insertion of membrane proteins for molecular dynamics simulations. J Chem Inf Model 52(10):2657-2669
Chapter 7
Coarse-Grained Force Fields for Molecular Simulations
Jonathan Barnoud and Luca Monticelli
Abstract
Molecular dynamics (MD) simulations at the atomic scale are a powerful tool to study the structure and
dynamics of model biological systems. However, because of their high computational cost, the time and
length scales of atomistic simulations are limited. Biologically important processes, such as protein folding,
ion channel gating, signal transduction, and membrane remodeling, are difficult to investigate using atomistic simulations. Coarse-graining reduces the computational cost of calculations by reducing the number of
degrees of freedom in the model, allowing simulations of larger systems for longer times. In the first part of
this chapter we review briefly some of the coarse-grained models available for proteins, focusing on the
specific scope of each model. Then we describe in more detail the MARTINI coarse-grained force field, and
we illustrate how to set up and run a simulation of a membrane protein using the Gromacs software package.
We explain step-by-step the preparation of the protein and the membrane, the insertion of the protein in the
membrane, the equilibration of the system, the simulation itself, and the analysis of the trajectory.
Key words Coarse-graining, Molecular dynamics, Force field, MARTINI, Protein, Lipid membrane
Introduction
Atomistic molecular dynamics (MD) simulation provides structural
and dynamic information on molecular systems on a sub-nanometer
length scale, with femtosecond time resolution. It is a powerful
tool to interpret experiments, to predict structure and dynamics of
simple systems, and to get an insight into processes that are difficult
to explore experimentally due to limited length or time resolution.
Biologically relevant phenomena, like protein folding, ion channel
gating, signal transduction, and membrane remodeling, often occur
on time scales of microseconds or greater [1]. These time scales
are computationally expensive for atomistic MD simulations, which
rarely extend beyond the microsecond. Sampling is often an issue:
it is not always possible to run long enough or large enough simulations. Therefore, some phenomena are out of reach for
state-of-the-art atomistic MD.
Several strategies exist to increase the range of what can be
sampled. One strategy is to make better use of modern hardware;
Andreas Kukol (ed.), Molecular Modeling of Proteins, Methods in Molecular Biology, vol. 1215,
DOI 10.1007/978-1-4939-1465-4_7, Springer Science+Business Media New York 2015
Theory
Materials
We will use the Gromacs software package to prepare, run, and
analyze an MD simulation. We will use Gromacs version 4.5 [44]
(or 4.6 but without GPU acceleration). See Note 1 about how to
use the Martini force field with Gromacs 4.6 and GPU acceleration. Some additional third-party scripts and programs need to
be installed too. We will display plots with xmgrace (http://plasma-gate.weizmann.ac.il/Grace/).
The martinize script can be
Methods
We will describe how to perform a molecular dynamics simulation
of a box containing a rhodopsin dimer embedded in a dioleoylphosphatidylcholine (DOPC) bilayer, with explicit water and ions.
The final box size will be about 13 nm × 13 nm × 12 nm in the X,
Y, and Z dimension, respectively. The membrane will lie in the XY
plane of that box. Besides defining the chemical content, starting
an MD simulation also requires a description of initial coordinates
and velocities of each particle. The initial velocities will be generated automatically to produce a Maxwell distribution. In Gromacs,
a so-called topology file (TOP) contains all the information on
the chemical content of the simulated system (types and number of
molecules) and on the force field used to calculate their mutual
interactions.
4.1 Preparation of the Protein
Now that the protein is ready, the next step is to add a lipid bilayer.
This can be done with several methods. The main challenge is to
obtain an equilibrated membrane patch at little computational cost. The easiest method is to start from a pre-equilibrated patch. Such patches can be found on the Martini Web
site for some lipids; the instructions to change the lipid type are
Fig. 1 Chain A of 1L9H at the atomistic (left) and coarse-grained (right) resolution. The protein backbone is
represented in dark grey and the side chains in light grey. The coarse-grained structure has been drawn using
the script cg_bonds
DOPC             504

= Martini
integrator       = steep
nsteps           = 400
nstlist          = 10
rlist            = 1.4
coulombtype      = Shift
rcoulomb_switch  = 0.0
rcoulomb         = 1.2
epsilon_r        = 15
vdw_type         = Shift
rvdw_switch      = 0.9
rvdw             = 1.2
Protein_B
DOPC             504
-------------------
= Y Y Y Y Y Y
Fig. 2 Membrane protein system viewed from the side (left) and from the top (right). The protein is represented
as spheres, with the backbone in dark grey and the side chains in light grey. The lipids are represented as licorice, with the polar head in dark grey and the tails in light grey. The black rectangle shows the border of the
simulation box; outside of the box is the periodic image
We notice that the net charge of our system is −3e, because the
protein is not neutral. To neutralize the system we will add sodium
ions. We can replace three water particles with sodium ions by
changing the atom name from "W" to "NA+", and the residue
name from "W" to "ION" in the structure (GRO) file and in the
topology (TOP) file. Notice that, in the GRO file, the alignment
of the columns needs to be maintained.
It has been reported that water, in the Martini model, has a
melting temperature higher than 0 °C, and it can freeze at room
temperature under certain conditions. To address this issue, version 2.0 of the model introduced a new water particle type with a
larger radius, which interacts with all non-water beads in exactly
the same way as the original water bead, but has a slightly different
interaction with water beads: the sigma parameter of the Lennard-Jones
interaction with water is increased to 0.57 nm; this way water
packing is perturbed and freezing at room temperature is avoided.
We will replace 5 % of the water particles by these "antifreeze"
water particles. As the box should contain about 9,990 water
particles (this number can change between two runs of the procedure), we will replace 500 random water particles with antifreeze
water, by changing the atom name and the residue name from
"W" to "WF". We will also reorder the atoms so that ions and
antifreeze beads are grouped; then we name the modified file
hydrated3.gro. The replace_atoms script automates such atom
manipulations. Using this script, replacing water beads by ions and
antifreeze water can be done with:
cat hydrated2.gro | replace_atoms.py -n 3 \
-o W -r ION -a NA+ | replace_atoms.py \
-n 500 -o W -r WF -a WF > hydrated3.gro
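To make the column-alignment requirement concrete, here is a minimal Python sketch of the renaming step. It is not the replace_atoms script itself: the function name and its arguments are hypothetical, and it assumes the standard fixed-width GRO atom line (residue number, residue name, atom name and atom number, five characters each, followed by the coordinates).

```python
import random

def replace_beads(atom_lines, n, old_name, new_resname, new_atomname, seed=None):
    """Rename n randomly chosen beads in GRO atom lines, keeping column widths."""
    rng = random.Random(seed)
    # Candidate beads carry old_name as both residue name and atom name.
    candidates = [
        i for i, line in enumerate(atom_lines)
        if line[5:10].strip() == old_name and line[10:15].strip() == old_name
    ]
    if n > len(candidates):
        raise ValueError("not enough beads to replace")
    chosen = set(rng.sample(candidates, n))
    renamed = []
    for i, line in enumerate(atom_lines):
        if i in chosen:
            # Residue name is left-justified, atom name right-justified.
            line = line[:5] + f"{new_resname:<5}" + f"{new_atomname:>5}" + line[15:]
        renamed.append(line)
    return renamed

# Four made-up water beads, only to exercise the function.
water = [f"{i:5d}{'W':<5}{'W':>5}{i:5d}{0.5 * i:8.3f}{0.0:8.3f}{0.0:8.3f}"
         for i in range(1, 5)]
with_ion = replace_beads(water, 1, "W", "ION", "NA+", seed=0)
```

The sketch only renames beads in place. It does not reorder the atoms so that ions and antifreeze beads are grouped, which the procedure above also requires.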
Now that we changed the content of the box, we need to
update the topology file. topol.top should include the description
of ions (also available on the Martini Web site). The 3 sodium ions,
the 500 antifreeze water particles, and the 9,487 remaining water
particles (this number can change) should be included in the list of
molecules. The topol.top file should now look like:
---- topol.top ----
#include "martini_v2.2.itp"
#include "martini_v2.0_lipids.itp"
#include "Protein_A.itp"
#include "Protein_B.itp"
#include "martini_v2.0_ions.itp"

[ system ]
Hydrated DOPC bilayer and rhodopsin dimer

[ molecules ]
Protein_A        1
Protein_B        1
DOPC             467
NA+              3
WF               500
W                9487
-------------------
4.5 Minimize the Energy and Equilibrate
All the components of our system are now in place, but the system
most likely still has high energy, due mostly to non-optimal packing of the lipids and to unfavorable lipid-protein contacts. In addition, because we used a large distance criterion to minimize water
overlap when we hydrated the box, the system density is probably
too low. As a first step towards equilibration, we will run an energy
minimization to reduce unfavorable contacts:
grompp -f param_em -c hydrated3.gro \
-p topol.top -o em
mdrun -deffnm em -v
Then we will run a short MD simulation to equilibrate the system
density. As for the energy minimization, we need a parameter file
(MDP). We will run the simulation for 20 ns with a timestep of 20 fs,
which amounts to 1,000,000 integration steps. On a modern workstation, this should take about an hour. We will run the simulation at
310 K and 1 bar using the Berendsen weak coupling algorithm for
temperature and pressure [58]. This algorithm may not result in a
correct kinetic energy distribution, so we will use the Parrinello-Bussi
thermostat [59] and the Parrinello-Rahman barostat [60] for the
production simulation. We do not use them for equilibration because the Parrinello-Bussi and Parrinello-Rahman
algorithms tend to produce large fluctuations when temperature and
pressure are far from their target values, which lengthens equilibration. Write the param_eq.mdp file as follows:
---- param_eq.mdp ----
integrator            = md
dt                    = 0.02
nsteps                = 1000000
nstcomm               = 10
nstxout               = 0
nstvout               = 0
nstfout               = 0
nstlog                = 1000
nstenergy             = 100
nstxtcout             = 1000
xtc_precision         = 100
nstlist               = 10
rlist                 = 1.4
coulombtype           = Shift
rcoulomb_switch       = 0.0
rcoulomb              = 1.2
epsilon_r             = 15
vdw_type              = Shift
rvdw_switch           = 0.9
rvdw                  = 1.2
tcoupl                = Berendsen
tc-grps               = System
tau_t                 = 4.0
ref_t                 = 310
Pcoupl                = berendsen
Pcoupltype            = semiisotropic
tau_p                 = 4.0
compressibility       = 1e-5 1e-5
ref_p                 = 1.0 1.0
gen_vel               = yes
gen_temp              = 310
constraints           = none
constraint_algorithm  = Lincs
lincs_order           = 4
lincs_warnangle       = 30
----------------------
It is important to verify that the system is equilibrated before starting a production run. To this end, visual inspection is a useful,
quick first step. The trajectory can be displayed with VMD [47].
During the first few steps of the trajectory, the protein tilts to find
a suitable orientation in the membrane. The box changes size as
water becomes denser and the area per lipid adjusts.
Visual inspection is not sufficient to determine whether the system
has reached equilibrium. At the end of the equilibration run, box
dimensions, potential energy and kinetic energy should have converged to stable values. Energies and box dimensions are stored in
eq.edr. They can be extracted using g_energy, and visualized with
xmgrace. Gromacs allows decomposing the potential energy by groups
of atoms, called energy groups. It is then possible to check whether the
non-bonded interaction between the protein and the lipids
has converged. Energy groups need to be specified in the simulation
parameters before the run, though. See the Gromacs manual on
energygrps for how to use energy groups. Other properties such as the
density profile or protein orientation can be checked too.
If the system did not reach equilibrium, the equilibration run
should be extended.
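One simple way to turn "converged to stable values" into a number is to compare block averages over the end of the run. The sketch below is a hypothetical heuristic, not a Gromacs feature: the function name, the choice of the last two quarters of the series and the 1 % tolerance are all assumptions. It expects a series already extracted with g_energy, for example a box dimension.

```python
import math

def has_converged(values, rel_tol=0.01):
    """Compare the means of the third and fourth quarters of a time series.

    Returns True when the two block averages differ by less than rel_tol
    relative to the larger of their magnitudes.
    """
    n = len(values)
    if n < 4:
        raise ValueError("need at least four data points")
    third = values[n // 2: 3 * n // 4]
    fourth = values[3 * n // 4:]
    mean_a = sum(third) / len(third)
    mean_b = sum(fourth) / len(fourth)
    scale = max(abs(mean_a), abs(mean_b), 1e-12)
    return abs(mean_a - mean_b) / scale <= rel_tol

# Made-up series: one relaxing towards 12.0, one still drifting.
relaxing = [12.0 + 2.0 * math.exp(-i / 20.0) for i in range(400)]
drifting = [12.0 + 0.01 * i for i in range(400)]
```

A check of this kind complements, but does not replace, inspection of the plots: a slow drift smaller than the tolerance would pass unnoticed.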
4.6 The Production Run
After equilibration, we can run the actual simulation. The simulation parameter file for a production run is similar to that of the equilibration run, but some details need to be changed: the duration
of the run and (possibly, but not necessarily) the temperature
and pressure coupling algorithms. We will run the simulation for
1 µs, i.e., 50,000,000 steps, so the nsteps parameter has to be
adapted. We will use the Parrinello-Bussi thermostat (v-rescale)
and the Parrinello-Rahman barostat. We copy param_eq.mdp to
tcoupl           = v-rescale
tau_t            = 1.0
ref_t            = 310
Pcoupl           = parrinello-rahman
Pcoupltype       = semiisotropic
tau_p            = 12.0 12.0
compressibility  = 3e-4 3e-4
ref_p            = 1.0 1.0
We set gen_vel to no to start the simulation with the velocities generated during the equilibration as they are written in eq.gro.
Then we run the simulation:
grompp -f param_md.mdp -c eq.gro \
-p topol.top -o md
mdrun -deffnm md -v
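The step counts used for nsteps follow directly from the run length and the 20 fs timestep (dt = 0.02 ps in the MDP file). A short sketch of the conversion, with a hypothetical helper name:

```python
def n_steps(total_time_ns, dt_fs):
    """Number of integration steps needed to cover total_time_ns with a dt_fs timestep."""
    total_fs = round(total_time_ns * 1_000_000)  # 1 ns = 1,000,000 fs
    steps, remainder = divmod(total_fs, dt_fs)
    if remainder:
        raise ValueError("run length is not a whole number of timesteps")
    return steps

equilibration_steps = n_steps(20, 20)    # 20 ns equilibration
production_steps = n_steps(1000, 20)     # 1 µs (1000 ns) production run
```

With these inputs the equilibration needs 1,000,000 steps and the 1 µs production run needs 50,000,000.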
4.7 Analysis
Once the simulation is done, one should again verify that system
size and energy do not drift. Again, this can be checked with g_energy.
Analyzing a Martini trajectory is not different from analyzing
any other MD trajectory. For example, one can look at the root
mean square deviation with g_rms or the root mean square fluctuations with g_rmsf (see Fig. 3).
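For readers who want to see what the root mean square fluctuation measures, here is a minimal pure-Python sketch. It is not how g_rmsf is implemented: the function name and input layout are hypothetical, and the frames are assumed to be already fitted to a common reference.

```python
import math

def rmsf(frames):
    """Root mean square fluctuation of each particle around its mean position.

    `frames` is a list of frames; each frame is a list of (x, y, z) tuples,
    one per particle, in the same order in every frame.
    """
    n_frames = len(frames)
    n_atoms = len(frames[0])
    result = []
    for a in range(n_atoms):
        mean = [sum(frame[a][d] for frame in frames) / n_frames for d in range(3)]
        msf = sum(
            sum((frame[a][d] - mean[d]) ** 2 for d in range(3))
            for frame in frames
        ) / n_frames
        result.append(math.sqrt(msf))
    return result

# Two made-up frames: the first bead is static, the second moves along x.
example = rmsf([[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]])
```

The static bead has zero fluctuation, while the bead oscillating between x = 1 and x = 3 fluctuates by 1 around its mean position at x = 2.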
Some Gromacs-related tools will prompt for a group to process, and you may want to run some analyses on the protein main
chain. In Martini, beads from the main chain are typically named
BB, so you can create a group for them by entering "a BB" as a command in make_ndx.
Notes
1. Gromacs 4.6 features improved support for graphics
processing units (GPUs). This support requires the use of a new
way to handle neighbor lists, the so-called Verlet cutoff
scheme. Shift functions are replaced by exact cutoffs, changing
the shape of the non-bonded potential. Reducing the cutoff to
1.1 nm instead of 1.2 nm seems to allow the use of this new
algorithm without affecting the properties of common systems
simulated with the Martini model. Overall, the change in the
cutoff scheme results in a speedup of almost 100 %, but at this
time the precise effects remain mostly untested.
2. Using the polarizable water model requires a few changes in
the protocol described here. First of all, we need to replace the
[Figure 3 plots. Top panel: RMSD (nm) versus time (ns). Bottom panel: RMSF (nm) versus residue number. Legend: chain A, chain B.]
Fig. 3 Evolution of the root mean square deviation (RMSD) from the initial structure during the production
run (top panel), and root mean square fluctuation (RMSF) of the backbone, averaged over the entire production run (bottom panel). The RMSD is calculated on the whole protein after a least-squares fit on
the backbone
Acknowledgments
The authors thank Juliette Martin and Nicoletta Ceres for their
useful comments on the manuscript.
References
1. Dror RO, Dirks RM, Grossman JP, Xu H, Shaw DE (2012) Biomolecular simulation: a computational microscope for molecular biology. Annu Rev Biophys 41:429-452
2. Shaw DE, Chao JC, Eastwood MP, Gagliardo J, Grossman JP, Ho CR et al (2008) Anton, a special-purpose machine for molecular dynamics simulation. Commun ACM 51:91
3. Lindorff-Larsen K, Piana S, Dror RO, Shaw DE (2011) How fast-folding proteins fold. Science 334:517-520
4. Bussi G, Laio A, Parrinello M (2006) Equilibrium free energies from non-equilibrium metadynamics. Phys Rev Lett 96:090601
5. Sugita Y, Okamoto Y (1999) Replica-exchange molecular dynamics method for protein folding. Chem Phys Lett 314:141-151
6. Torrie GM, Valleau JP (1977) Nonphysical sampling distributions in Monte Carlo free-energy estimation: umbrella sampling. J Comput Phys 23:187-199
7. Carbone P, Varzaneh HAK, Chen X, Müller-Plathe F (2008) Transferability of coarse-grained force fields: the polymer case. J Chem Phys 128:064904
8. Levitt M, Warshel A (1975) Computer simulation of protein folding. Nature 253:694-698
9. Liwo A, Oldziej S, Pincus MR, Wawak RJ, Rackovsky S, Scheraga HA (1997) A united-residue force field for off-lattice protein-structure simulations. I. Functional forms and parameters of long-range side-chain interaction potentials from protein crystal data. J Comput Chem 18:849-873
10. Maupetit J, Tuffery P, Derreumaux P (2007) A coarse-grained protein force field for folding and structure prediction. Proteins 69:394-408
11. Bereau T, Deserno M (2009) Generic coarse-grained model for protein folding and aggregation. J Chem Phys 130:235106
12. Pasi M, Lavery R, Ceres N (2013) PaLaCe: a coarse-grain protein model for studying mechanical properties. J Chem Theory Comput 9:785-793
13. Zacharias M (2003) Protein-protein docking with a reduced protein model accounting for side-chain flexibility. Protein Sci 12:1271-1282
14. Setny P, Zacharias M (2011) A coarse-grained force field for Protein-RNA docking. Nucleic Acids Res 39:9118-9129
15. Den Otter WK, Renes MR, Briels WJ (2010) Asymmetry as the key to clathrin cage assembly. Biophys J 99:1231-1238
Chapter 8
Tackling Sampling Challenges in Biomolecular Simulations
Alessandro Barducci, Jim Pfaendtner, and Massimiliano Bonomi
Abstract
Molecular dynamics (MD) simulations are a powerful tool to give an atomistic insight into the structure
and dynamics of proteins. However, the time scales accessible in standard simulations, which often do not
match those in which interesting biological processes occur, limit their predictive capabilities. Many
advanced sampling techniques have been proposed over the years to overcome this limitation. This chapter
focuses on metadynamics, a method based on the introduction of a time-dependent bias potential to accelerate sampling and recover equilibrium properties of a few descriptors that are able to capture the complexity of a process at a coarse-grained level. The theory of metadynamics and its combination with other
popular sampling techniques such as the replica exchange method is briefly presented. Practical applications of these techniques to the study of the Trp-Cage miniprotein folding are also illustrated. The examples contain a guide for performing these calculations with PLUMED, a plugin to perform enhanced
sampling simulations in combination with many popular MD codes.
Key words Enhanced sampling, Metadynamics, PLUMED, Replica exchange methods, Molecular
dynamics, Collective variables, Free energy
1 Introduction
Uniquely providing insights into the structure and dynamics of
complex biomolecular systems at the atomistic level, MD simulations can play a fundamental role in molecular biology.
Unfortunately, the large numbers of particles that are needed for
an accurate model of biomolecules and the complexity of their
free-energy landscape make simulations computationally expensive
and prevent exhaustive sampling by standard MD in all but the
simplest cases. Recently, the development of dedicated hardware
[1] and distributed computing protocols [2] has in part alleviated
these issues. Nevertheless, the time scales accessible to MD are still
significantly shorter than those typical of several interesting biomolecular processes as well as of many experimental techniques.
To extend the time scales of MD simulations, several advanced
sampling methods have been proposed over the years [3, 4]. A comprehensive review of such methods is beyond the scope of this chapter.
Andreas Kukol (ed.), Molecular Modeling of Proteins, Methods in Molecular Biology, vol. 1215,
DOI 10.1007/978-1-4939-1465-4_8, Springer Science+Business Media New York 2015
2 Theory
2.1 Metadynamics
In MetaD, an external history-dependent bias potential is constructed in the space of a few selected degrees of freedom, generally called collective variables (CVs). CVs are functions S of the
microscopic coordinates R of the system:
$$ S(R) = \left( S_1(R), \ldots, S_d(R) \right), \tag{1} $$
which are able to provide a coarse-grained description of the process under study. In particular, CVs must distinguish the relevant
states of the system and include all the kinetically relevant degrees
of freedom. The MetaD bias potential (VG) can be written as a sum
of Gaussians deposited along the system trajectory in the CVs
space. In the well-tempered approach [11], VG has the following
functional form at time t:
$$ V_G(S,t) = \int_0^t dt'\, \omega(t') \exp\left( -\sum_{i=1}^{d} \frac{\left( S_i(R) - S_i(R(t')) \right)^2}{2\sigma_i^2} \right), \tag{2} $$
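where $\sigma_i$ is the width of the Gaussian for the i-th CV. The time-dependent
energy rate $\omega(t)$ is defined as:
$$ \omega(t) = \omega_0 \exp\left( -\frac{V_G(S,t)}{k_B \Delta T} \right), \tag{3} $$
$$ V_G(S, t \to \infty) = -\frac{\Delta T}{T + \Delta T} F(S) + C, \tag{4} $$
$$ F(S) = -k_B T \log \int dR\, \delta\left( S - S(R) \right) \exp\left( -\frac{U(R)}{k_B T} \right), \tag{5} $$
$$ V_G(S, t \to \infty) = -\left( 1 - \gamma^{-1} \right) F(S) + C. \tag{6} $$
Equations 4 and 6 agree when the bias factor is $\gamma = (T + \Delta T)/T$.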
In the long-time limit, the CVs probability density P(S, t) can be
written as:
$$ P(S, t \to \infty) \propto \exp\left( -\frac{F(S)}{k_B (T + \Delta T)} \right), \tag{7} $$
Fig. 1 Well-tempered MetaD simulation in a one-dimensional model system. (a) The free-energy profile (thick
black line) is characterized by three local minima separated by energy barriers higher than kBT. The sum of the
underlying free energy and the bias potential is shown at different times of the simulation (grey lines). The bias
potential is rescaled by the bias factor, following Eq. 6. (b, c) Time series of the CV S (b) and of the Gaussian
height w (c) in the first 10,000 steps of simulation. The system is prepared in S = 1. As the bias potential
grows, the Gaussian height w decreases, following the well-tempered recipe of Eq. 3. Around t = 200, the
system escapes from the initial basin into the first minimum on the right. At this point the deposition rate suddenly increases (c), as expected in well-tempered MetaD when visiting a previously unexplored region of the
CV space. After this basin is completely filled, the system starts diffusing between the first and second minima
(800 < t < 1,500). When a sufficient amount of bias is accumulated, the system is pushed to visit the third basin
on the right and the deposition rate increases again. Finally, around t = 4,500 the underlying free energy is
almost completely compensated by the bias potential. At this point, the system starts diffusing smoothly in the
CV space, while the Gaussian height is progressively decaying to zero
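The decay of the Gaussian height described in Fig. 1 can be reproduced qualitatively with a toy model. The sketch below is not the simulation behind the figure and is not PLUMED code. It assumes an overdamped Langevin walker with unit mobility on a hypothetical double-well free energy, and all parameter values and function names are illustrative. Each deposited Gaussian has height w = w0 exp(-V_G/(k_B ΔT)), the discrete counterpart of Eq. 3.

```python
import math
import random

def bias_potential(s, centers, heights, sigma):
    """Sum of the deposited Gaussians evaluated at s (discretized Eq. 2)."""
    return sum(h * math.exp(-((s - c) ** 2) / (2.0 * sigma ** 2))
               for c, h in zip(centers, heights))

def bias_force(s, centers, heights, sigma):
    """Minus the derivative of the bias potential with respect to s."""
    return sum(h * (s - c) / sigma ** 2
               * math.exp(-((s - c) ** 2) / (2.0 * sigma ** 2))
               for c, h in zip(centers, heights))

def model_force(s, barrier=5.0):
    """Force from the assumed double well F(s) = barrier * (s^2 - 1)^2."""
    return -4.0 * barrier * s * (s * s - 1.0)

def run_well_tempered(n_steps=2000, stride=25, w0=0.5, sigma=0.2,
                      kT=1.0, k_delta_T=9.0, dt=1e-3, seed=1):
    """Return the centers and heights of the Gaussians deposited along the run."""
    rng = random.Random(seed)
    s = -1.0  # start in the left well
    centers, heights = [], []
    for step in range(n_steps):
        force = model_force(s) + bias_force(s, centers, heights, sigma)
        # Overdamped Langevin update with unit mobility.
        s += force * dt + math.sqrt(2.0 * kT * dt) * rng.gauss(0.0, 1.0)
        if step % stride == 0:
            # Well-tempered rescaling of the Gaussian height (Eq. 3).
            height = w0 * math.exp(
                -bias_potential(s, centers, heights, sigma) / k_delta_T)
            centers.append(s)
            heights.append(height)
    return centers, heights

centers, heights = run_well_tempered()
```

With kT = 1 and k_B ΔT = 9 the bias factor of Eq. 6 is γ = 10. Plotting `heights` against the deposition index should show the same qualitative pattern as panel (c): a decay while a basin fills, and a jump whenever the walker reaches a region that carries little bias.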
Fig. 2 Schematic representation of the PT scheme. (a) N = 5 independent copies of the system are simulated
at different temperatures. Periodically, an exchange between replicas at different temperatures (typically
neighbors) is proposed and accepted based on the Metropolis criteria defined in Eqs. 8 and 9. The time needed
to span the entire temperature range, i.e., the round-trip time, is often used as measure of PT efficiency. (b)
Quasi-Gaussian potential energy distributions at different temperatures, as typically observed in simulations
of proteins in explicit solvent. The PT acceptance probability ultimately depends on the overlap of the potential
energy distributions at two temperatures. The number of replicas needed to guarantee a similar overlap at a
fixed temperature range scales as the square root of the number of degrees of freedom, thus making PT computationally prohibitive in large systems
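$$ p(j \to k) = \min\left\{ 1, \exp\left( \Delta_{j,k}^{\mathrm{PT}} \right) \right\}, \tag{8} $$
with
$$ \Delta_{j,k}^{\mathrm{PT}} = \left( \frac{1}{k_B T_j} - \frac{1}{k_B T_k} \right) \left( U(R_j) - U(R_k) \right), \tag{9} $$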
where Rj and Rk are the configurations at temperature Tj and Tk,
respectively. Equation 9 indicates that the acceptance probability is
ultimately determined by the overlap between the energy distributions of two replicas (Fig. 2b).
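As an illustration of the Metropolis criterion of Eqs. 8 and 9, the sketch below evaluates the acceptance probability for one proposed swap. It assumes the standard form p = min{1, exp(Δ)}; the function names are hypothetical, energies are taken in kJ/mol and temperatures in K, and the example energies are made up.

```python
import math

K_B = 0.0083144626  # Boltzmann constant in kJ mol^-1 K^-1

def delta_pt(u_j, u_k, t_j, t_k, k_b=K_B):
    """Exponent of Eq. 9 for swapping the configurations of replicas j and k."""
    return (1.0 / (k_b * t_j) - 1.0 / (k_b * t_k)) * (u_j - u_k)

def acceptance_probability(delta):
    """Metropolis acceptance probability, min{1, exp(delta)}."""
    return 1.0 if delta >= 0.0 else math.exp(delta)

# The colder replica (300 K) holds the lower-energy configuration, so the
# swap is accepted with probability below 1.
p_swap = acceptance_probability(delta_pt(-1010.0, -1000.0, 300.0, 310.0))
```

When the colder replica holds the higher-energy configuration the exponent is positive and the swap is always accepted, which is how low-energy configurations migrate towards low temperatures.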
One advantage of PT is that there is no need to select a priori
an arbitrary set of CVs, once the temperatures are chosen. However,
the efficiency of the algorithm depends on the benefits provided by
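$$ \Delta_{j,k}^{\mathrm{PT\text{-}MetaD}} = \Delta_{j,k}^{\mathrm{PT}} + \frac{1}{k_B T_j} \left[ V_G^{(j)}\left( S(R_j), t \right) - V_G^{(j)}\left( S(R_k), t \right) \right] + \frac{1}{k_B T_k} \left[ V_G^{(k)}\left( S(R_k), t \right) - V_G^{(k)}\left( S(R_j), t \right) \right], \tag{10} $$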
where VG(j) and VG(k) are the bias potentials acting on the j-th and
k-th replicas, respectively.
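The bias terms of Eq. 10 add to the plain PT exponent of Eq. 9. A minimal sketch follows, assuming that the four bias values (each replica's bias evaluated at both configurations) have already been computed. The function name and argument names are hypothetical; energies are in kJ/mol and temperatures in K.

```python
K_B = 0.0083144626  # Boltzmann constant in kJ mol^-1 K^-1

def delta_pt_metad(u_j, u_k, vj_at_j, vj_at_k, vk_at_k, vk_at_j,
                   t_j, t_k, k_b=K_B):
    """Exchange exponent for PT-MetaD: the PT term plus the bias terms.

    vj_at_k is the bias of replica j evaluated at the configuration of
    replica k, and similarly for the other three bias arguments.
    """
    beta_j = 1.0 / (k_b * t_j)
    beta_k = 1.0 / (k_b * t_k)
    pt_term = (beta_j - beta_k) * (u_j - u_k)
    bias_term = beta_j * (vj_at_j - vj_at_k) + beta_k * (vk_at_k - vk_at_j)
    return pt_term + bias_term

# With all biases set to zero the expression reduces to the PT exponent.
plain = delta_pt_metad(-1010.0, -1000.0, 0.0, 0.0, 0.0, 0.0, 300.0, 310.0)
```

A replica that currently sits on a large amount of its own bias (vj_at_j larger than vj_at_k) raises the exponent, so swaps that move it away from an already filled region are favored.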
PT-MetaD is particularly effective because it compensates for
some of the weaknesses of each method taken individually. The
negative effect of neglecting a slow degree of freedom in the choice
of the MetaD CVs is alleviated by PT, which allows the system to
cross moderately high free-energy barriers on all degrees of freedom. On the other hand, the MetaD bias potential allows crossing
higher barriers on a few selected CVs, in such a way that the sampling efficiency of PT-MetaD is greater than that of PT alone.
Nevertheless, PT-MetaD still suffers from the poor scaling of
computational resources with system size. This issue may be circumvented by including the potential energy of the system among
the set of MetaD CVs, as in the WTE approach [12]. This leads to
the so-called PT-MetaD-WTE scheme [17], in which replica diffusion in temperature space is enhanced by the increased energy fluctuations at all temperatures.
3 Materials
Simulations of the Trp-Cage miniprotein have been carried out
[17] using GROMACS version 4.5.3 [18] and the PLUMED plugin version 1.2.2 [10]. However, for didactical purposes the scripts
reported here have been updated to PLUMED version 2 [19].
Figures have been prepared with UCSF Chimera [20] and
Matplotlib [21]. All simulations should be run in parallel on a cluster machine. The reader should refer to GROMACS and PLUMED
user manuals for detailed instructions about how to compile and
execute the codes.
4 Methods
In this section we show a few applications of the enhanced sampling techniques introduced above to study the Trp-Cage miniprotein folding, using an atomistic description of both solute and
solvent degrees of freedom. Trp-cage is a 20-residue protein whose
structure has been determined by NMR [9] (PDB code 1L2Y),
and its folding process has been extensively studied by several
experimental [9, 22–25] and computational [26–31] techniques.
This section is organized as follows. In Subheading 4.1, we
describe the steps needed to prepare and equilibrate the system. In
Subheading 4.2, we illustrate a simple MetaD simulation that uses
2 CVs. In Subheadings 4.3 and 4.4, we combine MetaD with a
multi-replica approach (PT) and with WTE, respectively. In
Subheading 4.5, we present a quantitative analysis of the convergence of the simulations along with an estimate of the error in the
reconstructed FES.
4.1 System Preparation
4.2 MetaD
Fig. 3 Definition of the MetaD CVs used to study the folding of the Trp-Cage
miniprotein. (a) S counts the number of hydrogen bonds (black dotted lines)
formed in and between the α-helical regions (orange cartoon) of the NMR native
structure. (b) Shc counts the number of contacts in the hydrophobic core, defined
here by residues Y3, W6, P12, and P18 (ball and stick)
$$s_{ij} = \frac{1 - \left(r_{ij}/r_0\right)^{n}}{1 - \left(r_{ij}/r_0\right)^{m}}, \qquad (11)$$
where rij is the distance between the two atoms, r0 is a characteristic contact distance, and the pair n, m defines the steepness of the
CV (see Note 1). The first CV (S) describes the number of
backbone-backbone α-helical hydrogen bonds formed (Fig. 3a):
$$S = \sum_{i=1}^{N_H} \sum_{j=1}^{N_O} s_{ij}, \qquad (12)$$
where r0 = 0.25 nm, n = 8, m = 12, and the sums are over the hydrogen and oxygen atoms that form an α-helical hydrogen bond in the
native state. The second CV (Shc) describes the number of contacts
in the hydrophobic core (Fig. 3b):
$$S_{hc} = \sum_{\substack{i > j \\ i, j \in \mathrm{core}}} s_{ij}, \qquad (13)$$
where r0 = 0.50 nm, n = 8, and m = 12, and the sum is over representative side-chain carbon atoms of the residues belonging to the
hydrophobic core (Y3, W6, P12, and P18).
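The switching function of Eq. 11 and the two contact-counting CVs of Eqs. 12 and 13 can be sketched in plain Python as follows. This is only an illustration of the formulas: the function names are hypothetical, coordinates are assumed to be given in nm as (x, y, z) tuples, and the atom selections are left to the caller. In an actual simulation the CVs and their derivatives are computed by PLUMED.

```python
import math


def switching(r, r0, n=8, m=12):
    """Switching function of Eq. 11 for a distance r and contact distance r0."""
    x = r / r0
    if abs(x - 1.0) < 1e-9:
        # Analytic limit of (1 - x^n) / (1 - x^m) for x -> 1.
        return n / m
    return (1.0 - x ** n) / (1.0 - x ** m)


def hbond_cv(hydrogens, oxygens, r0=0.25):
    """Eq. 12: double sum over the selected hydrogen and oxygen atoms."""
    return sum(switching(math.dist(h, o), r0) for h in hydrogens for o in oxygens)


def core_cv(core_atoms, r0=0.50):
    """Eq. 13: sum over all distinct pairs (i > j) of the selected core atoms."""
    total = 0.0
    for i in range(len(core_atoms)):
        for j in range(i):
            total += switching(math.dist(core_atoms[i], core_atoms[j]), r0)
    return total
```

Each pair contributes a value close to 1 when the two atoms are much closer than r0 and a value close to 0 when they are far apart, so both CVs behave as smooth contact counts.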
The MetaD bias potential is constructed using an initial deposition rate equal to 2.5 kJ/mol every 0.5 ps. Each Gaussian has a
width equal to 0.4 for both CVs. The bias factor is set to 8 (see
Note 2). The following input file can be used to run the MetaD
simulation with PLUMED 2:
Fig. 4 MetaD simulations of the Trp-Cage miniprotein. Time series of the CVs (a) S and (b) Shc, along with the
Gaussian height (c). The 2 CVs seem to fail in describing all the relevant slow modes of the system, since we
do not observe a smooth exploration of the CV space in the time scale of this simulation
high number of α-helical hydrogen bonds and hydrophobic contacts. At this stage, the deposition of the bias potential suddenly
increases again. During the rest of the simulation, the trajectory
visits conformations with a variable number of α-helical hydrogen
bonds and hydrophobic contacts, but never returns to the region
of completely unstructured configurations that has been explored
earlier (S < 0.5).
The difficulty of the bias potential in pushing Trp-Cage back
and forth from one conformational region to the other is a symptom
that our 2 CVs do not capture all the relevant modes of the system.
Otherwise, we would observe a smoother exploration of the CV
space and a quasi-diffusive dynamics upon convergence of the bias
potential. This outcome is not totally unexpected, since the configurational ensemble of Trp-Cage is extremely wide and characterized
by several metastable states that these CVs seem not to properly
describe. At this point one can adopt several different strategies:
1. Complementing the existing set of CVs with additional CVs;
2. Devising more effective CVs. Possible choices include CVs
that are able to describe the collective behavior of the folding
process, such as the Path Collective Variables [34] or CVs based
on dimensionality reduction [35, 36];
3. Biasing separately different CVs in a bias-exchange MetaD
approach [37];
4. Combining MetaD with other sampling algorithms.
In the following subsection, we show how to combine MetaD
in S and Shc with PT and the resulting benefit to sampling
efficiency.
4.3 PT-MetaD
The first thing to check in a PT-MetaD simulation is the correctness of the REM setup, in particular the temperature distribution. This can be done by analyzing both the average acceptance
rate between neighboring temperatures (reported in the
GROMACS log file) and the overall trajectory of each replica in
temperature space (Fig. 5a; see Notes 6 and 7).
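One way to quantify the diffusion of a replica in temperature space is to measure its round-trip times. The sketch below assumes that the temperature index visited by one replica at each saved step (0 for the lowest temperature, n_temps - 1 for the highest) has already been extracted from the simulation output; the function name and the definition of a round trip used here (lowest temperature, then highest, then back to the lowest) are assumptions of this illustration.

```python
def round_trip_times(temp_index, n_temps, dt=1.0):
    """Durations of lowest -> highest -> lowest temperature excursions.

    temp_index: sequence of temperature indices visited by one replica.
    n_temps: number of temperatures in the ladder.
    dt: time between consecutive entries of temp_index.
    """
    trips = []
    start = None          # step at which the current round trip started
    reached_top = False   # whether the highest temperature was visited since start
    for step, idx in enumerate(temp_index):
        if idx == 0:
            if start is None:
                start = step
            elif reached_top:
                trips.append((step - start) * dt)
                start = step
                reached_top = False
        elif idx == n_temps - 1 and start is not None:
            reached_top = True
    return trips
```

For instance, `round_trip_times([0, 1, 2, 1, 0, 1, 0, 1, 2, 2, 1, 0], 3)` returns `[4.0, 7.0]`. A replica that never completes a round trip returns an empty list, which is a sign that the temperature distribution should be revised.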
As for the MetaD simulation, we examine the evolution of the
system in the CV space along with the deposition rate of the bias
potential at 300 K (Fig. 6). In contrast to single-replica MetaD, this
is not a continuous trajectory, due to the exchanges with other temperatures (see Note 6). It is clear from this analysis that the statistics
accumulated at 300 K span the entire range of the CV space throughout
the simulation (Fig. 6a, b). This is confirmed by the smooth average
decrease of the bias deposition rate over the simulation time (Fig. 6c).
Fig. 5 Temperature diffusion of a representative replica in PT-MetaD (a) and PT-MetaD-WTE (b). Time is measured per replica. Despite having only 10 intermediate temperatures instead of 100 to cover the same temperature range (300–600 K), the diffusion of PT-MetaD-WTE appears to be smooth thanks to the static bias on
energy. The average round-trip times of PT-MetaD and PT-MetaD-WTE are 6.3 ns and 4.0 ns, respectively
Fig. 6 PT-MetaD simulation of the Trp-Cage miniprotein. (Discontinuous) time series of (a) S and (b) Shc at
300 K, along with the Gaussian height (c). Time is measured per replica. Thanks to the exchange with other
temperatures, an exhaustive sampling of the CV space is achieved
Fig. 7 PT-MetaD simulation of the Trp-Cage miniprotein. (Continuous) time series of (a) S and (b) Shc for a
representative replica diffusing in temperature. Time is measured per replica. It is crucial to reconstruct the
continuous trajectories of the replicas to assess whether the excursions in temperature do lead to an exhaustive sampling of the CV space
Even if the behavior is reassuring of the correctness of the simulation setup, further analysis is required to declare convergence.
Indeed, due to the REM protocol, 100 trajectories contribute to
the statistics reported in Fig. 6. Since all the replicas are prepared
in different regions of the CV space, the exhaustive sampling
observed in Fig. 6 could result from the exchanges between replicas, which individually still suffer from sampling problems.
Therefore we have to check the trajectories of each individual replica across temperatures to assess the diffusion in the CV space
(Fig. 7 and see Note 6). This step is crucial to verify whether diffusion in temperature space can effectively help the system cross
barriers in degrees of freedom not included in the CVs.
4.4 PT-MetaD-WTE
To reduce the number of replicas needed to cover a given temperature range, we couple PT-MetaD with WTE. Enlarging the energy
In Fig. 8a we show the trajectory in energy space along a preliminary 0.5 ns simulation in the NVT ensemble, followed by a
1 ns MetaD simulation to converge the WTE bias at two representative temperatures. At the end of the latter simulation, the potential energy averages and fluctuations are close to the theoretical
WTE values (Fig. 8b and see Note 9).
Fig. 8 Sampling WTE at two representative temperatures (300 and 324 K). (a) Time series of the potential
energy during a 0.5 ns preliminary NVT simulation, followed by a 1 ns MetaD simulation used to converge the
bias on the potential energy. While in the NVT part of the trajectory, the potential energy distributions at the two
temperatures are well separated, in the WTE ensemble a significant overlap is obtained. (b, c) Time series of
the ratio of the average potential energy (b) and fluctuations (c) to the canonical values. At the end of the simulation, these ratios are close to the theoretical WTE values (γ = 24)
In a second step, we run a PT-MetaD-WTE simulation using a
history-dependent two-dimensional MetaD bias on S and Shc and
a static bias on the energy. The latter has been stored on a grid and
written to file (BIAS) at the end of the preliminary WTE run
described above (see Note 10). The following input file can be
used to run the PT-MetaD-WTE simulation with PLUMED 2:
The FES as a function of the MetaD CVs can be calculated by integrating the Gaussians deposited along the simulation after proper
rescaling (see Note 11). For the PT-MetaD-WTE simulations, an
additional step needs to be performed, i.e., the removal of the
effect of the static bias on energy. This can be easily done by applying a Torrie–Valleau correction [39] to the statistics accumulated
in the WTE ensemble.
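The idea behind the correction for a static bias can be sketched as follows: each frame sampled in the biased ensemble is given a weight proportional to exp(+V/kBT), where V is the static bias acting on that frame, and the free energy is obtained from the weighted histogram. The Python sketch below is a simplified illustration for a single CV and a purely static bias; the function name, the binning scheme, and the units (kJ/mol, K) are assumptions, and it does not replace the rescaling of the deposited Gaussians described above.

```python
import math

KB = 0.0083144626  # Boltzmann constant in kJ/(mol K); unit choice is an assumption


def reweighted_fes(cv_values, bias_values, temperature, bin_edges):
    """Free energy along one CV from frames sampled under a static bias.

    cv_values: CV value of each frame.
    bias_values: static bias energy acting on each frame.
    bin_edges: increasing list of bin boundaries.
    Returns the free energy of each bin, shifted so that the minimum is zero.
    """
    kbt = KB * temperature
    vmax = max(bias_values)
    # Subtracting vmax keeps the exponentials bounded without changing relative weights.
    weights = [math.exp((v - vmax) / kbt) for v in bias_values]
    hist = [0.0] * (len(bin_edges) - 1)
    for s, w in zip(cv_values, weights):
        for b in range(len(hist)):
            if bin_edges[b] <= s < bin_edges[b + 1]:
                hist[b] += w
                break
    total = sum(hist)
    fes = [-kbt * math.log(h / total) if h > 0.0 else float("inf") for h in hist]
    fmin = min(fes)
    return [f - fmin for f in fes]
```

Bins that were never visited are reported with an infinite free energy, which makes gaps in the sampling explicit rather than hiding them.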
Fig. 9 Trp-Cage miniprotein FES from the PT-MetaD (a) and PT-MetaD-WTE (b) simulations. (c) Convergence can
be assessed by monitoring the free-energy differences between relevant regions of the CV space as the simulation progresses. Time is measured as the fraction of the total aggregated time (simulation time per replica
multiplied by the number of replicas), i.e., 5.0 μs and 2.5 μs for PT-MetaD and PT-MetaD-WTE, respectively
5 Notes
1. While in analysis the simplest definition of contact is a step
function of the interatomic distance, in MetaD we need to use
a continuous and differentiable function in order to calculate
the additional forces due to the bias potential.
2. In order to perform a well-tempered MetaD simulation, one
has to set the following parameters: the Gaussian width σi (one
per CV), the initial deposition rate ω0, and the well-tempered
bias factor γ. The Gaussian width should be comparable to the
shape of basins in the underlying FES. This can be estimated a
priori by performing short unbiased MD simulations and computing CV fluctuations. Typically, the Gaussian width should
not be greater than 1/3 of the fluctuations. Recently, a more
advanced approach has been proposed to automatically tune
this parameter [40]. The initial deposition rate does not affect
the long time behavior [11]. However, a small initial deposition rate would result in a longer filling time, while a too high
rate might be problematic in the transient regime if the CVs
are not properly chosen. Typically, in simulation of proteins an
initial deposition rate of at most 1 kBT per ps is used. The bias
factor affects the probability distribution of the CV in the long
time limit. Therefore, the optimal bias factor should be large
enough to cross all the relevant barriers in the process under
study, and small enough to limit sampling to the relevant
regions of the CV space. As discussed in ref. 11, overestimation
of the optimal value is to be preferred to underestimation.
3. In PT-MetaD simulations, one should carefully choose the
value of the minimum and maximum temperature and how
the replicas are distributed in this interval. The lowest temperature usually corresponds to the temperature of interest
(typically 300 K). The highest temperature should guarantee a
fast sampling of all the degrees of freedom other than the CVs.
Replicas should be chosen so that a sufficient overlap between
potential energy distributions of neighboring temperatures is
achieved, thus guaranteeing a good acceptance probability for
the exchanges. The proper distribution depends on the system
specific heat and its dependence on temperature. When simulating proteins in explicit solvent in the NVT ensemble, the
maximum temperature is typically set to 600–700 K and the
appropriate temperature distribution is given in ref. 41.
4. Equilibration of all the replicas prior to applying the MetaD
bias is of great importance in PT-MetaD. It is indeed not convenient to initiate the simulation from identical conformations
since this would result in a long transient due to the large
accumulation of bias in the initial region of the CVs space.
Equilibration can be achieved either through a PT run or multiple NVT simulations.
5. Some of the algorithms routinely used to perform simulations
at constant temperature cannot reproduce the correct energy
fluctuations of the NVT ensemble. Since the exchange process
of the REM protocol is strongly dependent on the potential
energy distributions, one must implement a correct thermostat, such as Nosé–Hoover [42], Langevin, or Bussi–Donadio–Parrinello [43], to avoid possible artifacts [44].
6. The trr and xtc files produced by GROMACS contain configurations sampled at constant temperature; therefore, due to
the PT exchanges between replicas, these are not continuous
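$$F_A = -k_B T \log \int_{S \in A} dS\, \exp\left(-\frac{F(S)}{k_B T}\right). \qquad (14)$$
$$d_{RMSD}\left(F(S), F_{REF}(S)\right) = \sqrt{\frac{1}{\Omega} \int_{\Omega} dS\, \left(F(S) - F_{REF}(S)\right)^2}, \qquad (15)$$
$$d_{KL}\left(F(S), F_{REF}(S)\right) = \int dS\, \exp\left(-\frac{F_{REF}(S)}{k_B T}\right) \frac{F(S) - F_{REF}(S)}{k_B T}, \qquad (16)$$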
which assigns a weight to each point based on its reference
probability density.
Since free energies are always defined modulo an irrelevant
additive constant, one should optimally align the two profiles
in order to minimize the distance between them (using
Eqs. 15, 16, or other metrics). When using the RMSD as the metric,
optimal alignment can be achieved by calculating the average
free energy in the volume of interest and subtracting it from the
profile, so that each free-energy profile is offset to have its average
value at zero.
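The comparison of a free-energy profile with a reference can be sketched on a discrete grid as follows. This is a minimal illustration with hypothetical function names: both profiles are assumed to be tabulated on the same grid of CV values, the RMSD uses the mean-subtraction alignment described above, and for the Kullback-Leibler distance each profile is shifted so that its Boltzmann weights sum to one on the grid, which is an assumption made here to remove the arbitrary additive constant.

```python
import math


def align_to_zero_mean(profile):
    """Offset a free-energy profile so that its average value is zero."""
    mean = sum(profile) / len(profile)
    return [f - mean for f in profile]


def rmsd_distance(f, f_ref):
    """Root-mean-square deviation between two optimally aligned profiles."""
    a, b = align_to_zero_mean(f), align_to_zero_mean(f_ref)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def normalize(profile, kbt):
    """Shift a profile so that sum(exp(-F / kbt)) over the grid equals one."""
    fmin = min(profile)
    z = sum(math.exp(-(f - fmin) / kbt) for f in profile)
    return [f - fmin + kbt * math.log(z) for f in profile]


def kl_distance(f, f_ref, kbt):
    """Discrete analogue of the Kullback-Leibler distance between two profiles."""
    a, b = normalize(f, kbt), normalize(f_ref, kbt)
    return sum(math.exp(-y / kbt) * (x - y) / kbt for x, y in zip(a, b))
```

Both distances vanish for two profiles that differ only by a constant offset. Because the Kullback-Leibler distance weights each grid point by its reference probability, it is dominated by the low free-energy regions, whereas the RMSD treats all grid points equally.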
Acknowledgements
AB thanks the Swiss National Science Foundation for financial
support under the Ambizione grant PZ00P2_136856. J.P.
acknowledges the support of NSF award CMMI-1032368. The
simulations of Trp-Cage miniprotein were made possible in part by
the National Science Foundation through TeraGrid resources provided by NICS. These simulations were also facilitated through the
use of computational, storage, and networking infrastructure provided by the Hyak supercomputer system, supported in part by the
University of Washington eScience Institute.
References
1. Shaw DE, Maragakis P, Lindorff-Larsen K et al (2010) Atomic-level characterization of the structural dynamics of proteins. Science 330:341–346
2. Beberg AL, Ensign DL, Jayachandran G, Khaliq S, Pande VS (2009) Folding@home: lessons from eight years of volunteer distributed computing. IEEE International Symposium on Parallel & Distributed Processing, 2009. IPDPS 2009, 23–29 May 2009, Rome, pp 1624–1631
3. Chipot C, Pohorille A (2007) Free energy calculations: theory and applications in chemistry and biology. Springer, Berlin
4. Dellago C, Bolhuis PG (2009) Transition path sampling and other advanced simulation techniques for rare events. Adv Polym Sci 221:167–233
45. Ceriotti M, Brain GAR, Riordan O, Manolopoulos DE (2012) The inefficiency of re-weighted sampling and the curse of system size in high-order path integration. P Roy Soc a-Math Phys 468:2–17
46. Angioletti-Uberti S, Ceriotti M, Lee PD, Finnis MW (2010) Solid-liquid interface free energy through metadynamics simulations. Phys Rev B 81:125416
47. Berteotti A, Barducci A, Parrinello M (2011) Effect of urea on the beta-hairpin conformational ensemble and protein denaturation mechanism. J Am Chem Soc 133:17200–17206
48. Sutto L, D'Abramo M, Gervasio FL (2010) Comparing the efficiency of biased and unbiased molecular dynamics in reconstructing the free energy landscape of met-enkephalin. J Chem Theory Comput 6:3640–3646
49. Kullback S, Leibler RA (1951) On Information and Sufficiency. Ann Math Stat 22:142–143
Chapter 9
Calculation of Binding Free Energies
Vytautas Gapsys, Servaas Michielssens, Jan Henning Peters,
Bert L. de Groot, and Hadas Leonov
Abstract
Molecular dynamics simulations enable access to free energy differences governing the driving force
underlying all biological processes. In the current chapter we describe alchemical methods allowing the
calculation of relative free energy differences. We concentrate on the binding free energies that can be obtained
using non-equilibrium approaches based on the Crooks Fluctuation Theorem. Together with the theoretical
background, the chapter covers practical aspects of hybrid topology generation, simulation setup, and free
energy estimation. An important aspect of the validation of a simulation setup is illustrated by means of calculating free energy differences along a full thermodynamic cycle. We provide a number of examples, including
protein–ligand and protein–protein binding as well as ligand solvation free energy calculations.
Key words Free energy, Molecular dynamics, Alchemical transitions, Protein–ligand binding, Protein–protein
interaction, Non-equilibrium methods, Hybrid topology, Crooks Fluctuation Theorem
1 Introduction
Whether or not a process happens spontaneously is determined
by its free energy. This is because, without an external source of
energy, systems evolve to their lowest free energy state. Likewise,
the rate at which that state is reached depends on free energy
barriers along the pathways to that minimum. Hence, free energies
are of central importance as they determine, e.g., binding affinities (spontaneous binding or not) or protein folding (the folded
state is usually the free energy minimum). In addition, as barriers
are linked to rates via rate theory, free energy barriers determine
binding, folding, permeation, and reaction kinetics. Therefore,
free energies are among the most critical thermodynamic quantities
to accurately be derived by computational techniques, not only
because they play such a fundamental role, but also because they
can be directly and quantitatively compared to experimental data.
Andreas Kukol (ed.), Molecular Modeling of Proteins, Methods in Molecular Biology, vol. 1215,
DOI 10.1007/978-1-4939-1465-4_9, Springer Science+Business Media New York 2015
$$F = U - TS \qquad (1)$$
$$G = H - TS \qquad (2)$$
$$p(x) \propto e^{-G(x)/k_B T} \qquad (3)$$
$$\frac{p(A)}{p(B)} = e^{-(G(A) - G(B))/k_B T} = e^{-\Delta G/k_B T} \qquad (4)$$
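2 Theory
2.1 Definition of Free Energy
$$p_A = \frac{e^{-\beta F_A}}{e^{-\beta F}} \qquad (5)$$
$$\Delta F = -\frac{1}{\beta} \ln \frac{p_B}{p_A} = -\frac{1}{\beta} \ln \frac{Q_B}{Q_A} \qquad (6)$$
$$Q(N,V,T) = \frac{1}{h^{3N} N!} \int\!\!\int e^{-\beta H(p_1 \ldots p_N,\, q_1 \ldots q_N)}\, dp_1 \ldots dp_N\, dq_1 \ldots dq_N \qquad (7)$$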
where H(p,q) is the Hamiltonian of the system, q and p denote coordinates and momenta, respectively, and h is Planck's constant. For a
multi-particle system, integration over all the degrees of freedom is
not computationally feasible. Hence, a simulation is often used to
sample the accessible phase space volume. In a simulation, high
energy microstates will be visited only rarely (or will not be visited
at all), hence rendering this approach unsuitable for the estimation
of absolute free energies. For the free energy differences, however,
the inaccessible phase space regions will be discarded for both
states, A and B, thus allowing for an accurate assessment of ΔF due
to cancellation of errors.
Up to now, we considered a canonical ensemble with the associated canonical partition function Q and Helmholtz free energy F.
In practice, however, experimental measurements are usually
performed at isothermal-isobaric conditions generating an NPT
ensemble. In such a case, a partition function is defined as
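$$Q(N,P,T) = \frac{1}{h^{3N} N!} \int\!\!\int\!\!\int e^{-\beta\left(H(p_1 \ldots p_N,\, q_1 \ldots q_N) + PV\right)}\, dV\, dp_1 \ldots dp_N\, dq_1 \ldots dq_N \qquad (8)$$
$$\Delta G = -\frac{1}{\beta} \ln \frac{Q_B}{Q_A} = -\frac{1}{\beta} \ln \left\langle e^{-\beta\left(H_B(p,q) - H_A(p,q)\right)} \right\rangle_A \qquad (9)$$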
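Equation 9 expresses the free energy difference as an exponential average of the energy difference between the two states, evaluated over configurations sampled in state A. A minimal Python sketch of this average is given below; the function name is hypothetical, the energy differences are assumed to have been computed beforehand, and a log-sum-exp shift is used purely for numerical stability.

```python
import math


def fep_delta_g(delta_h, kbt):
    """Exponential-averaging estimate: -kbt * ln < exp(-delta_h / kbt) >_A.

    delta_h: values of H_B - H_A evaluated on configurations sampled in state A.
    kbt: thermal energy in the same units as delta_h.
    """
    exponents = [-dh / kbt for dh in delta_h]
    shift = max(exponents)
    # log-sum-exp: factor out the largest term so that exp() cannot overflow.
    mean_exp = sum(math.exp(x - shift) for x in exponents) / len(exponents)
    return -kbt * (shift + math.log(mean_exp))
```

The estimate is dominated by the configurations with the lowest energy difference, which is why this estimator converges poorly when the two states overlap little in phase space.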
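$$\Delta G = \frac{1}{\beta} \ln \frac{\left\langle f\left(H_A(p,q) - H_B(p,q) + C\right) \right\rangle_B}{\left\langle f\left(H_B(p,q) - H_A(p,q) - C\right) \right\rangle_A} + C \qquad (10)$$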
f (H
B
(11)
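$$\Delta G = \int_0^1 \left\langle \frac{\partial H}{\partial \lambda} \right\rangle_{\lambda} d\lambda \qquad (12)$$
(13)
$$W = \int_0^1 \frac{\partial H}{\partial \lambda}\, d\lambda \qquad (14)$$
$$\Delta G = \langle W \rangle - \frac{\beta \sigma_W^2}{2} \qquad (15)$$
$$\Delta G = -\frac{1}{\beta} \ln \left\langle e^{-\beta W} \right\rangle \qquad (16)$$
$$\frac{P_f(W)}{P_r(-W)} = e^{\beta (W - \Delta G)} \qquad (17)$$
$$\ln \frac{P_f(W)}{P_r(-W)} = \beta W - \beta \Delta G \qquad (18)$$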
Plotting the left hand side of Eq. 18 against the work values yields
a line with the slope β. The line intercepts the work axis at a value
equal to ΔG. While the approach is easy to implement, there are
several caveats regarding such a direct estimation. Firstly, a sufficient
overlap between the work histograms is achieved only for near-equilibrium transitions where little work is dissipated. Secondly,
only the work values from the overlap region will contribute to the
free energy estimate, whereas the rest of the measurements will not
be used.
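The direct estimate based on Eq. 18 can be sketched as a straight-line fit to the logarithm of the ratio of the two work histograms. In the Python sketch below, the function name is hypothetical, and the forward and (negated) reverse work histograms are assumed to be normalized and tabulated on the same bin centers; only bins populated in both histograms enter the fit, which illustrates the second caveat mentioned above.

```python
import math


def crooks_line_fit(centers, p_forward, p_reverse_neg):
    """Least-squares line through ln[P_f(W) / P_r(-W)] versus W.

    centers: common bin centers of the two work histograms.
    p_forward: normalized histogram of the forward work values.
    p_reverse_neg: normalized histogram of the negated reverse work values.
    Returns (slope, work-axis intercept); the intercept estimates the free energy
    difference and the slope should be close to 1 / kBT.
    """
    xs, ys = [], []
    for w, pf, pr in zip(centers, p_forward, p_reverse_neg):
        if pf > 0.0 and pr > 0.0:  # only the overlap region contributes
            xs.append(w)
            ys.append(math.log(pf / pr))
    if len(xs) < 2:
        raise ValueError("the histograms overlap in fewer than two bins")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return slope, -intercept / slope
```

A fitted slope that deviates strongly from 1/kBT is a useful warning that the overlap region is too sparsely populated for a reliable estimate.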
To alleviate the latter problems, the work histograms can be
approximated by an analytical distribution. Following from the
CFT, the intersection point of the two distributions corresponding
to the forward and reverse transitions marks a work value equal to
ΔG. Nanda et al. [33] proposed using a universal probability density function [36] that accounts for the asymmetry of the
distributions. Goette and Grubmüller showed that in practice a
Gaussian approximation also yields accurate free energy estimates
[35]. They derived a Crooks Gaussian Intersection (CGI) estimator which is expressed as
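$$\Delta G = \frac{\dfrac{\langle W_f \rangle_{n_f}}{\sigma_f^2} - \dfrac{\langle -W_r \rangle_{n_r}}{\sigma_r^2} \pm \sqrt{\dfrac{1}{\sigma_f^2 \sigma_r^2}\left(\langle W_f \rangle_{n_f} + \langle W_r \rangle_{n_r}\right)^2 + 2\left(\dfrac{1}{\sigma_f^2} - \dfrac{1}{\sigma_r^2}\right) \ln \dfrac{\sigma_r}{\sigma_f}}}{\dfrac{1}{\sigma_f^2} - \dfrac{1}{\sigma_r^2}} \qquad (19)$$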
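where ⟨W_f⟩_{n_f} and ⟨−W_r⟩_{n_r} are the averages of the forward and (negated) reverse work values over the n_f forward and n_r reverse transitions, respectively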
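$$\sum_{i=1}^{n_f} \frac{1}{1 + \exp\left(\ln\frac{n_f}{n_r} + \beta (W_i - \Delta G)\right)} = \sum_{j=1}^{n_r} \frac{1}{1 + \exp\left(\ln\frac{n_r}{n_f} - \beta (W_j - \Delta G)\right)} \qquad (20)$$
$$\sigma_{\Delta G}^2 = \frac{1}{\beta^2 n_{f+r}} \left[ \left\langle \frac{1}{2 + 2\cosh\left(\ln\frac{n_f}{n_r} + \beta (W_i - \Delta G)\right)} \right\rangle_{n_{f+r}}^{-1} - \left( \frac{n_{f+r}}{n_f} + \frac{n_{f+r}}{n_r} \right) \right] \qquad (21)$$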
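The two estimators of Eqs. 19 and 20 can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used by the authors: the function names are hypothetical, the reverse work values are passed already negated, the sample standard deviation is used for the Gaussian widths, the root of Eq. 19 closest to the midpoint of the two means is selected, and Eq. 20 is solved by bisection.

```python
import math
from statistics import mean, stdev


def cgi(w_f, w_r_neg):
    """Crooks Gaussian Intersection: crossing point of two Gaussians (Eq. 19)."""
    mu_f, mu_r = mean(w_f), mean(w_r_neg)
    s_f, s_r = stdev(w_f), stdev(w_r_neg)
    a = 1.0 / s_f ** 2 - 1.0 / s_r ** 2
    midpoint = 0.5 * (mu_f + mu_r)
    if abs(a) < 1e-12:
        return midpoint  # equal widths: the Gaussians cross halfway between the means
    b = mu_f / s_f ** 2 - mu_r / s_r ** 2
    disc = (mu_f - mu_r) ** 2 / (s_f ** 2 * s_r ** 2) + 2.0 * a * math.log(s_r / s_f)
    roots = [(b + sign * math.sqrt(disc)) / a for sign in (1.0, -1.0)]
    return min(roots, key=lambda x: abs(x - midpoint))


def _fermi(x):
    """Numerically stable 1 / (1 + exp(x))."""
    if x > 0.0:
        e = math.exp(-x)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(x))


def bar(w_f, w_r_neg, beta, tol=1e-10):
    """Solve Eq. 20 for the free energy difference by bisection."""
    m = math.log(len(w_f) / len(w_r_neg))

    def g(dg):
        lhs = sum(_fermi(m + beta * (w - dg)) for w in w_f)
        rhs = sum(_fermi(-m - beta * (w - dg)) for w in w_r_neg)
        return lhs - rhs  # increases monotonically with dg

    works = list(w_f) + list(w_r_neg)
    lo, hi = min(works) - 1.0, max(works) + 1.0
    while g(lo) > 0.0:
        lo -= hi - lo
    while g(hi) < 0.0:
        hi += hi - lo
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

Both functions take the raw work values rather than histograms, so no binning choice is involved. Comparing the two estimates on the same data is a simple consistency check: a large discrepancy suggests that the Gaussian approximation does not hold for the work distributions at hand.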
As discussed in the previous sections, alchemical free energy calculations explore unphysical pathways by combining Hamiltonians of
physical states with an external parameter λ. Having two (or more)
separate Hamiltonians implies the necessity to define multiple
topologies for the system at every end state. To be able to couple
Hamiltonians, a mapping between the topologies needs to be
established. Two approaches of constructing the topologies for
alchemical free energy calculations have been introduced [38, 39].
Transitions exploiting unphysical pathways across a thermodynamic cycle may require particle creation or annihilation. This is
always the case for a dual topology approach, whereas for a single
topology particle creation/annihilation is required only when the
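$$\frac{q_i q_j}{4\pi \varepsilon_0 \varepsilon_r \left(\alpha_Q (1 - \lambda) + r_{ij}^{p}\right)^{1/p}} + 4 \lambda \epsilon_{ij} \left( \frac{1}{\cdots} - \frac{1}{\cdots} \right) \qquad (22)$$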
create unwanted additional minima in the potential, where particles are kept in a close proximity throughout a transition, eventually leading to a strong repulsion when reaching an end state. The
strong repulsions result in large work dissipation, decreasing the accuracy of the free energy estimation. A different soft-core
approach was proposed to solve the problems of singularity points,
numerical instabilities and additional minima [45]. In the latter
approach the softening of the non-bonded interactions is applied
at the force level by modifying the non-bonded interactions such
that a finite, but non-zero, force is reached at short inter-particle
distances.
3 Topology Generation
Topology generation is the particular aspect of an alchemical free
energy setup that makes it different from a regular molecular
dynamics simulation. The topology must describe both states of a
molecule undergoing a transition. Therefore, regardless of which
topology approach one chooses, single or dual, a mapping between
the atoms of the two states needs to be established.
At the first step, a mapping algorithm, when provided with the
structures of two molecules, should list atoms that need to be
morphed into one another or be turned into dummies. To automate such a process several approaches are available. A graph theory based connectivity analysis of the molecules can be used to find
a subset of connected atoms for morphing, e.g. a maximum common subgraph algorithm. The atoms not falling within the identified subset would be marked to become dummies in one of the
states. The drawback of a graph based approach is the fact that
while the atoms mapped for morphing may be close to each other
in a graph representation, in Cartesian coordinates the distance
between them may be large, resulting in potential convergence
issues in the simulations.
A different approach to atom mapping is based on a Euclidean
distance criterion. For the two superimposed molecules distances
between all the atom pairs need to be calculated. By defining a
threshold value (e.g., 0.5 Å) pairs of atoms with distances below
the threshold are selected for morphing. The threshold parameter
can be adjusted depending on a specific situation. However, one
needs to be careful and avoid introducing unreasonable mappings
of spatially distant atoms. Creating fragments in a molecule connected via dummies should be avoided, since bonded interactions
of the dummies would restrain degrees of freedom that should
not be restricted. Similarly, breaking ring systems when morphing atoms may lead to stability and convergence issues. Therefore, it
is better to follow a dual topology approach and create/annihilate
intact rings.
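The distance-based mapping described above can be sketched in a few lines of Python. The sketch assumes that the two molecules have already been superimposed, that coordinates are given in Å as (x, y, z) tuples, and that each atom may be mapped at most once; the function name and the greedy assignment by increasing distance are choices made for this illustration. It does not check for fragments connected via dummies or for broken rings, which still require inspection.

```python
import math


def map_atoms(coords_a, coords_b, threshold=0.5):
    """Pair atoms of two superimposed molecules that lie closer than threshold.

    Returns (pairs, dummies_a, dummies_b): pairs is a list of (i, j) index pairs
    to be morphed into one another; the remaining atoms of each molecule are
    the ones that would become dummies in the other state.
    """
    candidates = []
    for i, a in enumerate(coords_a):
        for j, b in enumerate(coords_b):
            d = math.dist(a, b)
            if d < threshold:
                candidates.append((d, i, j))
    candidates.sort()  # closest pairs are assigned first
    used_a, used_b, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_a and j not in used_b:
            pairs.append((i, j))
            used_a.add(i)
            used_b.add(j)
    dummies_a = [i for i in range(len(coords_a)) if i not in used_a]
    dummies_b = [j for j in range(len(coords_b)) if j not in used_b]
    return sorted(pairs), dummies_a, dummies_b
```

Lowering the threshold yields fewer morphed atoms and more dummies, which is the safer choice when the superposition of the two molecules is uncertain.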
4 Thermodynamic Cycles
As previously mentioned, the absolute free energy of a system is
difficult to determine, but fortunately most problems can be formulated in terms of relative free energies. Free energy differences
are both more approachable and contain important information
about the system. The change in free energy due to binding
(ΔGbinding) can be determined experimentally (e.g., by means of
calorimetry). Although absolute binding affinities can be calculated using, for example, umbrella sampling, such calculations are
usually cumbersome, as the whole binding/unbinding process
needs to be taken into account, while its path is unknown in most
cases. Therefore, investigating the effect of a change of the system
(e.g., an amino acid mutation or a ligand modification) on binding
is usually more feasible. For this, the double free energy difference
(ΔΔGmutation,binding) is calculated.
The alchemical methods calculate the work needed to move
a system from one state to another through unphysical pathways.
As the free energy of a system is a function of its state, the free
energy difference found between the states is independent of the
path taken between them. Sufficient sampling is of critical importance in all free energy methods, and the computational difficulty
of reaching convergence dramatically increases with the magnitude
of the perturbation.
Binding free energies usually involve a large perturbation: in
one state, both binding partners are free in solution, in the other
they are in a complex. The phase space overlap between these two
states is often small resulting in a slow convergence. An alchemical
transition from state A (a wild-type protein or a ligand) to the state
B (a mutated or modified molecule), while physically impossible,
requires a much smaller perturbation. As we are interested in the
difference in the binding free energies between the states A and B,
we can make use of the fact that the free energy of a state does not
depend on the path taken to reach it. Hence, the free energy differences along a closed cycle of reactions (like the one depicted in
Fig.2) will always add up to zero. This feature allows calculation of
the double differences in free energy (ΔΔG) of binding, thermostability, partitioning in different solvents, etc. between two states
of a system
(23)
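The bookkeeping implied by a closed thermodynamic cycle can be illustrated with a short Python sketch. The function and argument names are hypothetical: the two alchemical legs are the transformation of A into B in the complex and free in solution, and the two physical legs are the binding of A and of B. Because the free energy differences around the cycle add up to zero, both routes must give the same double difference.

```python
def ddg_from_alchemical(dg_mut_complex, dg_mut_solvent):
    """Double free energy difference of binding from the two alchemical legs."""
    return dg_mut_complex - dg_mut_solvent


def ddg_from_binding(dg_bind_a, dg_bind_b):
    """Double free energy difference of binding from the two physical legs."""
    return dg_bind_b - dg_bind_a


def cycle_residual(dg_bind_a, dg_mut_complex, dg_bind_b, dg_mut_solvent):
    """Sum of the free energy differences around the closed cycle (ideally zero)."""
    return dg_bind_a + dg_mut_complex - dg_bind_b - dg_mut_solvent
```

In practice only the alchemical legs are computed, and a non-zero residual in a cycle where all legs are available indicates insufficient sampling or an inconsistent setup.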
independent. In practice, placing the structures 3 nm apart is sufficient to obtain accurate free energy estimates. To prevent an
interaction between the solvated molecule and the protein due to
motions during the simulation, a position restraint on a single
atom of the molecule free in solution can be imposed. Once a system is set up this way, the estimated free energy corresponds to the
ΔΔG of binding.
Even if a charge changing mutation/modification is set up to
remain in a neutral simulation box during an alchemical transition, some unwanted electrostatic artifacts may persist due to the
finite size and periodicity effects [62].
Fig. 4 (a) Closed thermodynamic cycle involving three states. (b) Including the
effects of the dummy atoms a closed thermodynamic cycle with three states is
transformed to one with six states
using longer simulations would be preferred. However, using longer simulations also holds the risk that they will get trapped in
artificial minima caused by force field artifacts. To avoid this we
recommend using multiple short simulations as done in this example. This approach ensures better sampling and reduces the risk of
getting trapped in artificial minima. Furthermore, a more rigorous
and straightforward error estimation could be performed using
this approach, provided that a sufficient number of transitions is
achieved for each independent simulation. In such a case the free
energy can be calculated for each trajectory separately and the error
can be evaluated from the deviation among the independent ΔG
estimates.
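The error estimate from independent simulations can be sketched as the standard error of the mean of the individual free energy estimates. The function name below is hypothetical, and the sketch assumes that the estimates are statistically independent and that each one is based on a sufficient number of transitions.

```python
import math
from statistics import mean, stdev


def combine_estimates(dg_values):
    """Mean and standard error of independent free energy estimates."""
    n = len(dg_values)
    if n < 2:
        raise ValueError("at least two independent estimates are required")
    return mean(dg_values), stdev(dg_values) / math.sqrt(n)
```

For example, `combine_estimates([10.0, 12.0, 14.0])` returns a mean of 12.0 and a standard error of about 1.15 in the same units as the input.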
In the example above we used a convenient double-system/
single-box setup. However, in case a closed cycle is constructed
by considering the branches of a thermodynamic cycle separately,
one additional caveat in terms of topology construction needs to
be taken into account. Both single and dual topology procedures
mostly involve dummy atoms, and those need to be considered
in a closed cycle. A simple thermodynamic cycle, as represented in
Fig. 4a, containing three vertices, might end up in a cycle with six
vertices (Fig. 4b), and edges containing only dummy transitions
that are not easily accessible [40, 41]. In a typical free energy simulation one is often interested in the difference between two ΔG
values, where the contribution of the dummy atoms cancels out,
e.g. in protein thermostability calculations the effect of the dummy
atoms is the same in the reference (unfolded) state as in the folded
state, or for ligand binding affinities the effect is the same for the
ligand in solvent as the ligand bound to a protein. Therefore, one
possibility for the construction of a valid closed thermodynamic
Fig. 5 Results for a double system in a single box, having both L2A and A2L
structures in one box. The number of equilibrium simulations of 10 ns used can
be read from the x-axis. Each point on the graph consists of 100 non-equilibrium
trajectories of 50 ps
factors, including first of all the size of the system and the significance of the change, but also the desired precision. It might be
preferable to get fast approximate results in a screening process,
while more computational effort would be spent to obtain a precise
value for a specific mutation. The quality of the simulation protocol
can be validated using closed thermodynamic cycles (Subheading 5),
but a closer look at the transition work distributions may also indicate how the results can be improved.
As the CFT is based on the assumption that the transition runs
start from an equilibrium ensemble, improving the sampling of
these ensembles usually yields the greatest improvement to the
quality of the result. This is best achieved by running several parallel simulations rather than a single long one, as molecular dynamics simulations tend to get stuck in local energy minima.
The difference in work values between the forward and backward transitions is caused by the fact that the simulations are
performed in non-equilibrium conditions. The work distributions
for forward and backward transitions will be closer to each other
the slower the transitions are. Hence, increasing the length of these
simulations would improve the result if the work distributions are
not properly overlapping.
7 Trypsin Inhibitors
Binding affinity estimation for small organic compounds is of high
importance in the search for potential drug candidates. Hence, we
will analyze in more detail a study of alchemical ΔΔG calculations
for a set of trypsin inhibitors. Talhout et al. [63] performed isothermal titration calorimetry (ITC) measurements of the binding free energies
for a number of p-n-alkylbenzamidinium molecules. We will use this
set of ligands to illustrate the workflow of the alchemical ligand
binding free energy calculations, and the ITC measurements will
serve us as a reference to assess the quality of our estimates. More
information on the computational studies on this set of trypsin
inhibitors can be found in the publications [45, 63].
Fig. 7 Trypsin inhibitor analysis. (a) A set of alkylbenzamidinium molecules. Molecule ordering corresponds
to the pairs of ligands for which the free energy differences were calculated. (b) Structure 3PTB with a cocrystallized benzamidine served as a starting structure for the MD simulations. (c) ΔG values from the ITC
measurements [63] and calculated estimates
7.1 Topology and Starting Structure
7.3 Estimation of the Free Energy Differences
[Figure: scatter plot against the experimental binding free energy (kJ/mol); points labeled L18V, L18A, L18S, L18W, L18M, L18I, L18F, L18T, and L18Y; legend: Expected, 1 kcal/mol dev., 2 kcal/mol dev.]
Fig. 12 (a) Non-equilibrium free energy calculation of ATP solvation and desolvation which should serve as a
closed cycle. The left section of the plot shows the work values as a function of the transition number. Since
the transitions are started from consecutive frames it could indicate whether the equilibrium trajectory is drifting. The right region of the plot combines the work values into histograms. (b) Distribution of ATP and Mg
orientations measured according to distance from P and P. P is not shown since its distance does not vary
and fluctuates around 0.3nm
Fig. 13 An extended closed cycle of an ATP-Mg complex annihilated and solvated in solution. The horizontal
arrows are performed in both directions simultaneously for a closed cycle consistency check
shortens, arises from the initial structure, but it gradually equilibrates into the top cluster of state Y within 120 ns and never
re-visits the former cluster. Since these conformations do not interchange, they can be seen as two distinct chemical species. Thus,
turning one chemical species on and another off does not represent
two opposite reactions and will not converge to ΔG = 0.
Spontaneous transitions between these conformations do not
occur in MD simulations with a length of 100 ns, and they are kept
separated by a steep barrier. However, the barrier is solely maintained
by the highly attractive Coulomb interactions between the phosphates
and the magnesium. Since the free energy is a state function, the
path from the fully solvated to annihilated (uninteracting dummies)
state could be chosen to pass through a reduced charge state, where
the charge on the magnesium ion is scaled to +1, while the charge
on the phosphate groups is scaled in reverse to maintain the neutrality
of the system. This effectively creates two transitions that need to be
calculated: one from a fully appeared and fully charged state to a
reduced charge state, and then another one into dummies.
To enable the convergence of the closed cycle of annihilating
and solvating an ATP-Mg complex in solution, the calculation is
decomposed further into the two ATP orientations, such that transitions (annihilation or solvation) are performed for each orientation (X and Y) of the ATP (see footnote 1), as depicted by the horizontal arrows
in Fig. 13. There is no need to compute the vertical transitions, but
we note that both vertical transitions maintain
ΔG ≈ 0. In the reduced charge state, conformations X and Y interchange multiple times and are equally populated. As for the
dummy state, the difference between X and Y is merely a difference
of orientation in space, which is imposed by a distance restraint.
The distribution of work values for each of the four horizontal
transitions shown in Fig. 13 is plotted in Fig. 14. These transitions
are, as before, performed in both directions to close a thermodynamic cycle. Four hundred and fifty transitions were performed in
each direction, and their length was increased to 2 ns. Indeed, this time
the resulting ΔG values are mostly within 1.5 kJ/mol of
zero (see footnote 2), except for the last transition (reduced charge state into dummies in state Y). This state is much more flexible than its fully
charged version and might need more time to converge, e.g., via
longer equilibrium simulations. However, electrostatic artifacts
could remain due to the nature of the mutation (turning off a
charge) and the finite size of the system. In principle, determining
whether the generated equilibrium ensemble is indeed at equilibrium is difficult. However, examining the series of work
values taken from consecutive initial snapshots of the equilibrium ensemble may indicate whether there is a drift and whether
more equilibration time is needed. For example, if we had only
a quarter of the equilibrium trajectory of state X +2 to +1 and
the corresponding transitions (first 120 transitions in Fig. 14), the
drift in work values might have hinted at a trajectory drift, but
further equilibration and transitions from later snapshots show that
those work values are reproduced again and again.
Footnote 1: Distance restraints on the P, P and the Mg2+ will keep the atoms in their
respective orientation in one of the two top clusters shown in Fig. 12b.
Footnote 2: A shorter transition time of 200 ps also gave a result that was fairly close to
zero, but longer times were used to reduce the error.
Fig. 14 Results from non-equilibrium transitions as depicted by the horizontal arrows in Fig. 13. The snapshot
number represents the transitions
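The inspection of work values taken from consecutive starting snapshots can be supported by a simple numerical check. The sketch below fits a straight line through the work values as a function of the transition index and compares the means of the two halves of the series. It assumes the work values are ordered by the time of their starting snapshot; the function name `work_drift` is hypothetical, and a non-zero slope is only a hint, not proof, of an unequilibrated trajectory.

```python
import numpy as np

def work_drift(work_values):
    """Return (slope per transition, mean of second half minus mean of first half)."""
    w = np.asarray(work_values, dtype=float)
    idx = np.arange(w.size)
    slope, _intercept = np.polyfit(idx, w, 1)  # least-squares line
    half = w.size // 2
    shift = w[half:].mean() - w[:half].mean()
    return float(slope), float(shift)
```

As the example in the text shows, an apparent drift in the first part of the series may disappear once transitions from later snapshots are added, so the check is best repeated as the equilibrium trajectory is extended.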
Table 1
Parameters and suggested values for the non-equilibrium free
energy calculations
Parameter                      Value
init-lambda                    0 or 1
delta-lambda (equilibration)
delta-lambda (transition)      1/nsteps
nstdhdl
sc-coul                        yes
sc-alpha                       0.3
sc-sigma                       0.25
sc-power                       (24)
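The transition settings of Table 1 can be assembled programmatically, which avoids arithmetic slips when the transition length changes. The following is a sketch only: the helper names `transition_mdp` and `format_mdp` are hypothetical, the negative sign of delta-lambda for a transition that starts at init-lambda = 1 is an assumption that follows from requiring the run to end at the opposite end state, and only parameters whose values are given in Table 1 are written.

```python
def transition_mdp(nsteps, forward=True):
    """Free energy mdp settings for one non-equilibrium transition.

    forward=True  : lambda goes from 0 to 1 in nsteps steps
    forward=False : lambda goes from 1 to 0 in nsteps steps
    """
    if nsteps <= 0:
        raise ValueError("nsteps must be positive")
    init_lambda = 0 if forward else 1
    delta_lambda = (1.0 if forward else -1.0) / nsteps
    return {
        "free-energy": "yes",
        "init-lambda": init_lambda,
        "delta-lambda": delta_lambda,
        "nsteps": nsteps,
        "sc-coul": "yes",
        "sc-alpha": 0.3,
        "sc-sigma": 0.25,
    }

def format_mdp(settings):
    """Render the settings as 'key = value' lines for an mdp file."""
    return "\n".join(f"{key:<14}= {value}" for key, value in settings.items())

# Example: a 50 ps transition, assuming a 2 fs time step (25,000 steps).
print(format_mdp(transition_mdp(25000)))
```

For an equilibrium run at one of the end states, delta-lambda is not needed; only free-energy and init-lambda have to be set, as described in Note 1.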
10 Notes
1. In the Gromacs simulation package the free energy code is
activated by setting the flag free-energy=yes in a molecular dynamics parameter (mdp) file. Triggering this option automatically enables the ∂H/∂λ output to an external file. The
initial state of a system is set by defining init-lambda to
be equal to 0 or 1 for the states A and B, respectively. Setting
the two aforementioned parameters is sufficient to perform an
equilibrium sampling simulation at one of the end states. For
the transition runs, an increment in λ needs to be specified by
setting the parameter delta-lambda to a non-zero value.
delta-lambda has to be estimated such that an end state is
25. Bruckner S, Boresch S (2011) Efficiency of alchemical free energy simulations. II. Improvements for thermodynamic integration. J Comput Chem 32(7):1320-1333
26. Jarzynski C (1997) Nonequilibrium equality for free energy differences. Phys Rev Lett 78(14):2690-2693
27. Cuendet MA (2006) The Jarzynski identity derived from general Hamiltonian or non-Hamiltonian dynamics reproducing NVT or NPT ensembles. J Chem Phys 125:144109
28. Hummer G (2001) Fast-growth thermodynamic integration: error and efficiency analysis. J Chem Phys 114:7330-7337
29. Gore J, Ritort F, Bustamante C (2003) Bias and error in estimates of equilibrium free-energy differences from nonequilibrium measurements. Proc Natl Acad Sci USA 100(22):12564-12569
30. Crooks GE (1998) Nonequilibrium measurements of free energy differences for microscopically reversible Markovian systems. J Stat Phys 90(5-6):1481-1487
31. Crooks GE (1999) Entropy production fluctuation theorem and the nonequilibrium work relation for free energy differences. Phys Rev E 60(3):2721-2726
32. Chelli R, Marsili S, Barducci A, Procacci P (2007) Recovering the Crooks equation for dynamical systems in the isothermal-isobaric ensemble: a strategy based on the equations of motion. J Chem Phys 126:044502
33. Nanda H, Lu N, Woolf TB (2005) Using non-Gaussian density functional fits to improve relative free energy calculations. J Chem Phys 122(13):134110-1 to 134110-8
34. Maragakis P, Ritort F, Bustamante C, Karplus M, Crooks GE (2008) Bayesian estimates of free energies from nonequilibrium work data in the presence of instrument noise. J Chem Phys 129:024102
35. Goette M, Grubmüller H (2009) Accuracy and convergence of free energy differences calculated from nonequilibrium switching processes. J Comput Chem 30(3):447-456
36. Bramwell ST, Christensen K, Fortin J-Y, Holdsworth PCW, Jensen HJ, Lise S, López JM, Nicodemi M, Pinton J-F, Sellitto M (2000) Universal fluctuations in correlated systems. Phys Rev Lett 84(17):3744-3747
37. Massey FJ Jr (1951) The Kolmogorov-Smirnov test for goodness of fit. J Am Stat Assoc 46(253):68-78
38. Pearlman DA, Kollman PA (1991) The overlooked bond-stretching contribution in free energy perturbation calculations. J Chem Phys 94:4532-4545
68. Hornak V, Abel R, Okur A, Strockbine B, Roitberg A, Simmerling C (2006) Comparison of multiple Amber force fields and development of improved protein backbone parameters. Proteins Struct Funct Bioinform 65(3):712-725
69. Mobley DL, Chodera JD, Dill KA (2006) On the use of orientational restraints and symmetry corrections in alchemical free energy calculations. J Chem Phys 125(8):084902. doi:10.1063/1.2221683. http://link.aip.org/link/?JCP/125/084902/1
70. Shirts MR, Pitera JW, Swope WC, Pande VS (2003) Extremely precise free energy calculations of amino acid side chain analogs: comparison of common molecular mechanics force fields for proteins. J Chem Phys 119(11):5740-5761
71. Shirts MR, Mobley DL, Chodera JD, Pande VS (2007) Accurate and efficient corrections for missing dispersion interactions in molecular simulations. J Phys Chem B 111(45):13052-13063
Part II
Conformational Change
Chapter 10
The Use of Experimental Structures
to Model Protein Dynamics
Ataur R. Katebi, Kannan Sankar, Kejue Jia, and Robert L. Jernigan
Abstract
The number of solved protein structures deposited in the Protein Data Bank (PDB) has increased dramatically in recent years. For some specific proteins, this number is very high; for example, there are over 550
solved structures for HIV-1 protease, a protein that is essential for the life cycle of human immunodeficiency virus (HIV), which causes acquired immunodeficiency syndrome (AIDS) in humans. The large
number of structures for the same protein and its variants includes a sample of different conformational
states of the protein. A rich set of structures solved experimentally for the same protein has information
buried within the dataset that can explain the functional dynamics and structural mechanism of the protein. To extract the dynamics information and functional mechanism from the experimental structures, this
chapter focuses on two methods: Principal Component Analysis (PCA) and Elastic Network Models
(ENM). PCA is a widely used statistical dimensionality reduction technique to classify and visualize high-dimensional data. ENMs, on the other hand, are a well-established, simple biophysical method for modeling
the functionally important global motions of proteins. This chapter covers the basics of these two.
Moreover, an improved ENM version that utilizes the variations found within a given set of structures for
a protein is described. As a practical example, we have extracted the functional dynamics and mechanism
of HIV-1 protease dimeric structure by using a set of 329 PDB structures of this protein. We have
described, step by step, how to select a set of protein structures, how to extract the needed information
from the PDB files for PCA, how to extract the dynamics information using PCA, how to calculate ENM
modes, how to measure the congruency between the dynamics computed from the principal components
(PCs) and the ENM modes, and how to compute entropies using the PCs. We provide the computer
programs or references to software tools to accomplish each step and show how to use these programs and
tools. We also include computer programs to generate movies based on PCs and ENM modes and describe
how to visualize them.
Key words HIV-1 protease, Principal component analysis, Elastic network model, Protein dynamics,
Acquired immunodeficiency syndrome, Protein data bank
1 Introduction
There are large numbers of structures in the protein data bank
(PDB [1]) for many categories of enzymes. Shown in Fig.1 are the
most abundant enzyme structures ordered by enzyme commission
(EC) numbers. Some other examples for individual EC categories,
Andreas Kukol (ed.), Molecular Modeling of Proteins, Methods in Molecular Biology, vol. 1215,
DOI 10.1007/978-1-4939-1465-4_10, Springer Science+Business Media New York 2015
Fig. 1 Numbers of related protein structures available for extracting protein functional dynamics: snapshot of the PDB statistics for the largest categories of
enzymes (08/30/2013). In total, there are over 17,000 enzyme structures, and a
significant number of structures for many diverse enzyme types. The most common structure on the left of this histogram with 1,285 structures is EC 3.2.1.17 that
includes lysozymes, and at the right side is 5.2.1.8 acetylcholinesterases with 337
different structures (taken from enzyme classification data provided by PDB:
http://www.pdb.org/pdb/statistics/histogram.do?mdcat=entity&mditem=pdbx_ec&name=Enzyme%20Classification [1])
2 Theory
2.1 Principal Component Analysis (PCA)
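$$c_{ij} = \left\langle \left(r_i - \langle r_i \rangle\right)\left(r_j - \langle r_j \rangle\right) \right\rangle \qquad (1)$$
where brackets indicate averages over the entire set of structures.
The covariance matrix C can be decomposed as
$$C = P \Delta P^{T}, \qquad (2)$$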
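The two equations above translate directly into a few lines of numerical code. The sketch below is not the chapter's Matlab script; it is a Python illustration that assumes the aligned coordinates are already stored in a matrix X with one row per structure and 3N columns (x, y, z of each Cα atom). The function name `pca_from_structures` is hypothetical. The covariance is averaged over the number of structures, following Eq. 1, and the eigenvectors returned as columns of P are the PCs.

```python
import numpy as np

def pca_from_structures(X):
    """PCA of a coordinate matrix X with shape (n_structures, 3 * n_residues).

    Returns the eigenvalues in descending order, the matrix P whose columns
    are the corresponding PCs, and the covariance matrix C.
    """
    X = np.asarray(X, dtype=float)
    dev = X - X.mean(axis=0)            # r_i - <r_i>
    C = dev.T @ dev / X.shape[0]        # Eq. 1, averaged over structures
    evals, P = np.linalg.eigh(C)        # C is symmetric
    order = np.argsort(evals)[::-1]     # largest variance first
    return evals[order], P[:, order], C
```

The fraction of the variance captured by each PC is then `evals / evals.sum()`, and the projection of the structures onto the PCs, as used for PC-PC plots, is `(X - X.mean(axis=0)) @ P`.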
$$\frac{\gamma}{2}\, D H D^{T}, \qquad (3)$$
$$H = M \Lambda M^{T}, \qquad (4)$$
$$O_{ij} = \frac{\left| P_i \cdot M_j \right|}{\left\| P_i \right\| \left\| M_j \right\|} \qquad (5)$$
where P_i is the ith PC for model P and M_j is the jth PC or normal
mode for model M. A perfect match yields an overlap value of 1.
They also defined the cumulative overlap (CO) between the first k
vectors of M and P_i as
$$CO(k) = \left( \sum_{j=1}^{k} O_{ij}^{2} \right)^{1/2} \qquad (6)$$
which measures how well the first k PCs for model M together can
capture the motion of a single PC for model P.
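The overlap and cumulative overlap are straightforward to compute once the PCs and modes are available as vectors of equal length. The following is a minimal sketch in Python rather than the MAVEN Matlab functions used in this chapter; the names `overlap` and `cumulative_overlap` are illustrative, and the modes are assumed to be stored as the columns of a matrix.

```python
import numpy as np

def overlap(p, m):
    """Overlap between two vectors: |p . m| / (|p| |m|), between 0 and 1."""
    p = np.asarray(p, dtype=float)
    m = np.asarray(m, dtype=float)
    return float(abs(p @ m) / (np.linalg.norm(p) * np.linalg.norm(m)))

def cumulative_overlap(p, modes, k):
    """Cumulative overlap between vector p and the first k columns of modes."""
    modes = np.asarray(modes, dtype=float)
    squares = [overlap(p, modes[:, j]) ** 2 for j in range(k)]
    return float(np.sqrt(sum(squares)))
```

When the modes form an orthonormal set, the cumulative overlap grows monotonically with k and reaches 1 once the modes span the direction of p.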
2.5 Coarse-Grained Global Entropies Calculated from Principal Component Analysis
(7)
where PC_i is the ith PC, λ_i is the ith eigenvalue, and N is the total
number of eigenvalues.
Andricioaei et al. also reported a similar result for entropy calculation from the covariance matrices of the atomic fluctuations, as
shown in equation 7 of their paper [29]. It should be noted that
this expression is different from that for normal modes of the elastic network models, which, because of the averaging, normally
involves the inverse of the eigenvalues.
3 Materials
There are a huge number of available HIV-1 protease structures in
the PDB (564 X-ray and three NMR structures as of 07/26/2013),
which provides a remarkably rich set of different conformational
Fig. 2 Description of HIV-1 protease homo-dimer and its critical structural components that facilitate the functional dynamics. (a) HIV-1 protease has two symmetric subunits: subunit A (red) and subunit B (blue). (b) Each
subunit has several structural components that are important for its coordinated motions. Fulcrum (orange,
residues 9-21) is a comparatively less mobile region that swings up and down similar to the flap elbow. E-34
(blue): hinge residue which is responsible for transmitting the motion from the fulcrum to the flap region. Flap
elbow (magenta, residues 37-42): hinge residue E-34 drives the motion of this region to transfer the dynamics further away from the fulcrum to the upper flap region. This loop can make top-down and bottom-up
swings. When the flap elbow swings from top to bottom, the flap domain opens up, and when it swings upward
the flap domain closes. The flap domain (residues 43-58) consists of the flap tip (yellow, residues 49-52) and
β-hairpin flaps (dark orange, residues 43-48 and 53-58). Opening and closing of the flap domains enable the
protein to bind ligands and release its products after proteolysis. Cantilever (green, residues 59-75) functions
as a base for the flap domain. The C-terminal β-hairpin flap is held by the N-terminal end of the cantilever and
this arrangement is important to control the swinging of the flap [30, 31]
We have used 329 PDB structures of HIV-1 protease for the computations to extract protein dynamics from experimental structures.
The PDB Ids of the data set are here (see Notes 1 and 2):
1A8G
1A8K
1A94
1AXA
1B6J
1B6K
1B6L
1B6M 1B6P
1BDL
1BWA 1BWB
1C6X
1C6Y
1C6Z
1C70
1D4S
1D4Y
1DAZ 1DIF
1EBW 1EBY
1EBZ
1EC0
1EC1
1EC2
1EC3
1F7A
1FEJ
1FFF
1FFI
1FG8
1FGC
1FQX
1G2K
1G35
1GNM
1HIV
1HOS 1HPO
1HPS
1HPV
1HPX
1HSG
1HSH
1HTE
1HVJ
1HVK 1HVL
1HVR 1HVS
1IZH
1IZI
1K6C
1K6P
1K6V
1KJ7
1K1U
1LZQ 1M0B
1K2B
1FF0
1K2C
1BV9
1FG6
1K6T
1AJV
1AJX
1KJ4
1KJF
1KJG
1KJH
1MT7
1MT8
1MT9
1NH0 1NPA
1NPV
1QBR
1QBS
1RPI
1RQ9 1RV7
1SDT
1SDU 1SDV
1SGU
1SH9
1SP5
1T3R
1T7I
1T7J
1T7K
1TCX
1TW7 1U8G
1VIJ
1VIK
1XL2
1XL5
1YT9
1YTG
1YTH
1Z8C
1ZBG
1ZLF
1ZPK
1ZSF
1ZSR
2A1E
2A4F
2AID
2AOF
2AQU 2AVM
2AVO
2AVS
2AVV
2AZC
2B7Z
2BB9
2BBB
2BPV
2BPW 2BPX
2BPY
2BPZ
2BQV
2CEJ
2F80
2F81
2F8G
2FDD
2FDE
2FGU
2FGV
2FNS
2FNT
2FXD
2FXE
2HB3
2HC0
2HS1
2HS2
2I4D
2I4U
2I4V
2I4W
2I4X
2IDW
2IEN
2IEO
2J9J
2J9K
2JE4
2O4L
2O4P
2O4S
2P3A
2P3B
2P3C
2P3D
2PK5
2PK6
2PQZ
2PYN
2Q3K
2Q63
2Q64
2QAK
2QCI
2QI1
2QI3
2QI4
2QI5
2QI6
2QI7
2Z4O
1MER 1MES
3A2O
3AID
2R3T
2R3W
2R43
2R5P
2R5Q
2RKF
2UPJ
2UXZ
3BGB 3BGC
3BVA
3BVB
3CKT
3CYW
3CYX
3D1X
3D3T
3FX5
3GI5
3GI6
3I7E
3KF0
3KFN
3KFR
3KFS
3LZS
3NU6 3NU9
3NUJ
3NUO 3O9F
3O9G
3O9H 3O9I
3OK9
3OTS
3QAA 3R4B
3S43
3S53
3S56
3S85
3SO9
3T11
3U7S
3UCB 3UF3
3UHL
4DQB 4DQC
4EJK
4EJL
4FAE
4FL8
4FLG
4HVP
4I8W
4J54
7HVP 7UPJ
8HVP
9HVP
4FM6
4I8Z
4J55
3S54
4J5J
4PHV
4 Methods
To successfully complete the procedures described in this section,
one needs the following software/programs:
Perl 5: Several Perl scripts are included here. The Perl programming language [32] can be downloaded free at www.perl.org.
Python: A Python script is used to calculate the internal distances between residue pairs for the set of 329 protein structures. A Python environment can be downloaded at http://www.python.org/.
Matlab: Several Matlab scripts are included here that can be executed in a Matlab programming environment [33]. The Matlab
product site is http://www.mathworks.com/products/matlab/.
MAVEN: This software was developed in the Jernigan lab [12].
In our Matlab code, we have invoked several MAVEN functions:
ANM.m: This is a function from MAVEN [12] used in
experimentalDynamics.m to compute ENM normal modes
from a given PDB structure.
modeAnimator.m: This is a function from MAVEN used
in experimentalDynamics.m to visualize the ENM modes
and PCs by creating movies.
readPDB.m, writePDB.m: These two Matlab functions from
MAVEN are used to read and write PDB files, respectively.
CompareVectors.m: This function from MAVEN is used
in experimentalDynamics.m to compare the directions of
PCs and ENM modes.
plot_compareVectors.m: This function from MAVEN
plots the results obtained from the above CompareVectors.m.
mat2vec.m: This function converts a matrix to a vector.
Download and save the following perl scripts in the same folder
experimentalDynamics. Run these perl scripts in the same sequence
as they are listed below:
Table 1
Summary of the steps for extracting biomolecular dynamics
Program/file name              Function
retainFirstAltLocation.pl      Retains the first alternate location for each ATOM and HETATM when
                               multiple locations for that ATOM/HETATM exist. It operates
                               on a set of PDB files.
replaceHETATM.pl
retainCA.pl                    Copies the CA atoms from a set of PDB files with no TER keyword
                               between chains to comply with the MUSTANG input file format.
pdbIds.txt                     This file lists the PDB Ids for the 329 PDB structures used here.
readAlignedPDBcoordinates.m
internal.py
calc_Entropy_PC.m
The above files, the files used from MAVEN, other accessory files and dataset can be downloaded at http://ribosome.bb.iastate.edu/4papers/2013/ataur/experimentalDynamics/
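The three preprocessing scripts in Table 1 are Perl programs; their logic can be illustrated compactly in Python. The sketch below is not a translation of those scripts. It assumes standard fixed-column PDB records, keeps the first alternate location of each atom, rewrites HETATM as ATOM, and retains only CA atoms. The function name `clean_pdb_lines` is hypothetical, and the sketch assumes a single model per file and no HETATM calcium ions (which also carry the atom name CA).

```python
def clean_pdb_lines(lines):
    """Keep first alternate locations, turn HETATM into ATOM, retain CA atoms.

    Uses PDB fixed columns: record name 1-6, atom name 13-16,
    chain 22, residue number plus insertion code 23-27.
    """
    seen = set()
    kept = []
    for line in lines:
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue                      # drops TER and all other records
        atom_name = line[12:16].strip()
        key = (line[21], line[22:27], atom_name)
        if key in seen:
            continue                      # a later alternate location
        seen.add(key)
        if record == "HETATM":
            line = "ATOM  " + line[6:]    # modified residue becomes ATOM
        if atom_name == "CA":
            kept.append(line)
    return kept
```

Applied to the records of residue 8 and modified residue 67 of 2p3a, this would leave one CA record per residue, both labeled ATOM.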
Schema 1 The records of ATOM type for residue 8 and modified residue 67 of the PDB file 2p3a
(fields, left to right: Record id, Atom No, atom name, alternate location and residue name, chain, Res No, Cart. Coordinates, occupancy, B-factor, element)
ATOM     72  N  AARG A   8      26.288  -8.483  -5.941  0.60 23.05           N
ATOM     73  N  BARG A   8      26.517  -8.547  -6.064  0.40 23.23           N
ATOM     74  CA AARG A   8      25.875  -8.023  -4.614  0.60 21.71           C
ATOM     75  CA BARG A   8      26.053  -8.180  -4.733  0.40 22.23           C
ATOM     76  C  AARG A   8      26.929  -8.501  -3.624  0.60 21.10           C
ATOM     77  C  BARG A   8      27.135  -8.490  -3.723  0.40 21.09           C
HETATM  572  N  ACME A  67      31.550 -12.012   8.379  0.70 29.54           N
HETATM  573  N  BCME A  67      31.558 -11.938   8.292  0.30 29.11           N
HETATM  574  CA ACME A  67      32.726 -12.421   7.660  0.70 31.90           C
HETATM  575  CA BCME A  67      32.776 -12.485   7.653  0.30 30.89           C
HETATM  588  C  ACME A  67      33.921 -12.522   8.646  0.70 32.23           C
HETATM  589  C  BCME A  67      33.963 -12.613   8.623  0.30 31.53           C

The same records after only the first alternate location is retained:
ATOM     72  N  AARG A   8      26.288  -8.483  -5.941  0.60 23.05           N
ATOM     74  CA AARG A   8      25.875  -8.023  -4.614  0.60 21.71           C
ATOM     76  C  AARG A   8      26.929  -8.501  -3.624  0.60 21.10           C
HETATM  572  N  ACME A  67      31.550 -12.012   8.379  0.70 29.54           N
HETATM  574  CA ACME A  67      32.726 -12.421   7.660  0.70 31.90           C
HETATM  588  C  ACME A  67      33.921 -12.522   8.646  0.70 32.23           C

The same records after HETATM is replaced by ATOM:
ATOM     72  N  AARG A   8      26.288  -8.483  -5.941  0.60 23.05           N
ATOM     74  CA AARG A   8      25.875  -8.023  -4.614  0.60 21.71           C
ATOM     76  C  AARG A   8      26.929  -8.501  -3.624  0.60 21.10           C
ATOM    572  N  ACME A  67      31.550 -12.012   8.379  0.70 29.54           N
ATOM    574  CA ACME A  67      32.726 -12.421   7.660  0.70 31.90           C
ATOM    588  C  ACME A  67      33.921 -12.522   8.646  0.70 32.23           C

The same records after only the CA atoms are retained:
ATOM     74  CA AARG A   8      25.875  -8.023  -4.614  0.60 21.71           C
ATOM    574  CA ACME A  67      32.726 -12.421   7.660  0.70 31.90           C
data-CA: This folder has all the backbone PDB files for multiple structural alignments.
Description: This file has the path of the source directory
where MUSTANG will find the input files for multiple structural alignment. After the path information, this file also has
the list of the PDB file names that MUSTANG will read from
the source directory. The list of the filenames in this file is in
the same order as the list of the PDB Ids in the pdbIds.txt file,
which has the 329 PDB Ids that are listed in Subheading 3.2.
Update the line in the description file that records the path of the
source directory for the input files (path to the files in the data-CA
subfolder) that are to be aligned.
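If the description file has to be regenerated, for example after changing the set of structures, it can be written by a short script. The sketch below is an illustration under stated assumptions: it assumes that MUSTANG expects the source directory on a line starting with ">" and each file name on a line starting with "+", which should be checked against the MUSTANG documentation, and it assumes that the files in data-CA are named after their PDB Ids with a .pdb suffix. The function name `build_description` is hypothetical.

```python
def build_description(source_dir, pdb_ids, suffix=".pdb"):
    """Return the text of a MUSTANG description file.

    The order of the file names follows the order of pdb_ids, so it stays
    consistent with pdbIds.txt.
    """
    lines = [f"> {source_dir}"]                       # path line (assumed marker)
    lines += [f"+{pdb_id}{suffix}" for pdb_id in pdb_ids]  # file lines (assumed marker)
    return "\n".join(lines) + "\n"

# Example with a placeholder path and the first three PDB Ids of the dataset.
print(build_description("/path/to/experimentalDynamics/data-CA",
                        ["1A8G", "1A8K", "1A94"]))
```

The returned text can be saved under the name description in the folder from which MUSTANG is run.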
Run the following command to execute MUSTANG:
mustang-3.2.1 -f description -o alignAll -F fasta -r ON
This will create the following two files:
pdbIds.txt alignAll.pdb
This will create a subfolder alignedPDBs in the experimentalDynamics folder. This subfolder will have the 329 PDB files with
the aligned Cα atoms of each structure. When the Cartesian
coordinates of each file are placed in a matrix such that each
row corresponds to the coordinates of one PDB Id, this matrix can
be used for principal component analysis (see Note 4).
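The construction of this coordinate matrix can be sketched as follows. This is a Python illustration rather than the chapter's Matlab function; it assumes that each aligned file contains only ATOM records of CA atoms in standard PDB columns and that all structures have the same number of residues, as is the case for a gap-free alignment. The function names are hypothetical.

```python
import numpy as np

def ca_coordinates(pdb_lines):
    """Flatten the CA coordinates of one structure into [x1, y1, z1, x2, ...]."""
    coords = []
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.extend([float(line[30:38]),   # x
                           float(line[38:46]),   # y
                           float(line[46:54])])  # z
    return coords

def coordinate_matrix(structures):
    """Stack structures (each a list of PDB lines) into an (n, 3N) matrix.

    Rows follow the order of the input, e.g. the order in pdbIds.txt.
    """
    rows = [ca_coordinates(lines) for lines in structures]
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all structures must have the same number of CA atoms")
    return np.array(rows, dtype=float)
```

For the dataset used here, the resulting matrix would have 329 rows, one per PDB Id.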
4.2 Use of Cartesian PCs to Extract Functional Dynamics from the Protein Structures
Matlab script experimentalDynamics.m reads the Cartesian coordinates of the structures from the MUSTANG aligned files and
performs PCA on them.
4.2.1 Significance of Principal Components (PCs)
Fig. 3 Distributions of the 329 PDB structures by PCA. (a) Distribution of the structures on a PC1-PC2 plot.
(b) Distribution of the structures on a PC1-PC3 plot. (c) Distribution of the structures on a PC2-PC3 plot. In plots
a and b, open structures are located on the left side; closed structures are located on the right side; and the
intermediate structures fall in between. Distribution of structures on the PC2-PC3 plot (panel c) is based
primarily on the conformational differences along the flap elbow region. PC1, PC2, and PC3 capture 30%, 20%, and 7%
of the variances in the dataset, respectively
Matlab program experimentalDynamics.m has the code to compute the ANM modes by using the MAVEN function ANM.m,
and it then computes the overlap and the cumulative overlaps with
the previously computed PCs by using another MAVEN function
CompareDynamics.m. Figure 5, generated by MAVEN function
plot_compareDynamics.m, shows the overlaps between the first
ten PCs and the first ten ANM modes. The highest overlap is 60%,
found between PC1 and ANM mode 3.
Table 2 shows the cumulative overlaps between PCs and the
ANM modes. The cumulative overlap between each of the first and
the second PCs and the first 20 modes is 80% or higher. Interestingly,
the cumulative overlap reaches 80% between the second PC and
the first six modes. This clearly indicates that given an appropriate
experimental dataset the motions captured by the PCs conform
quite closely with the ANM motions.
Fig. 4 Visualization of the first three PCs of HIV-1 protease on the structures. (a)
Structures showing the closed form (left, PDB 1ebw) and open form (right, PDB
1rpi) of HIV-1 protease. The two subunits are shown in red and blue color and in
ribbon diagram. (b) Snapshots of the structures displaced along the directions of
PC1 shown in connected line segment. The direction of motions of the protein
along each PC is shown with a black arrow. It can be seen that the opening-closing motion of the flaps can be easily identified from the extrema of PC1. Two
extrema are shown for each motion in each row, together with arrows that indicate the directions for transition to the other structure. (c) PC2 images are shown
looking down from the top of those in PC1 and PC3. PC2 is a twisting of the flap
regions whereas (d) PC3 is a hinge motion between the core and flaps, with the
core and flaps moving to and fro relative to one another
Fig. 5 Overlap between PCs and ANM modes. PC1 and mode 3 give the highest overlap, 60%
Table 2
Cumulative overlap between the first three PCs and sets of the ANM modes
ANM modes/PCs    PC1     PC2     PC3
3 modes          0.62    0.71    0.44
6 modes          0.64    0.80    0.54
10 modes         0.77    0.83    0.59
20 modes         0.80    0.85    0.65
CO between a PC and ANM modes is shown in bold type if it is greater than 0.80
Table 3
Overlaps between PCs and the new ANM modes
PCs/new ANM modes    Mode 1    Mode 2    Mode 3
PC1                  0.09      0.79      0.40
PC2                  0.34      0.01      0.24
PC3                  0.34      0.01      0.10
Table 4
Cumulative overlaps between PCs and the new ANM modes
New ANM modes/PCs    PC1     PC2     PC3
3 modes              0.90    0.42    0.35
6 modes              0.91    0.44    0.41
20 modes             0.95    0.89    0.84
overlap between PC2 and the first three modified ANM modes is 42%; on
the other hand, this value between PC2 and the first three conventional ANM modes is 71%, a much higher value. Therefore, only in
some cases does the cumulative overlap between a PC and the new ANM
modes improve on the corresponding value between a PC
and the conventional ANM modes. But when 20 new ANM modes
are included, the values are consistently higher.
Taken together, this suggests that modified ANM can improve
the performance of the ANM models.
4.4 Computing Entropy Using PCs
5 Conclusion
This chapter gives the background of two important methods,
PCA and ENM. By following the steps with the set of 329 HIV-1
PDB structures, one can get hands-on experience on how to
6 Notes
1. Selecting a set of structures: There are 564 HIV-1 X-ray structures in PDB (07/26/2013). Among them, 329 PDB structures are selected so that the MUSTANG structural alignment
does not produce any gaps in the corresponding aligned
sequences. If a different set of structures is selected that produces gaps after multiple structural alignment, the residues in
a structure that fall along the gaps on the alignment need to be
removed before the PCA calculation.
2. Construction of the selected dataset: It is important to select a
dataset that represents the whole conformational landscape of
a protein structure. In panels A and B of Fig. 3, the open and
closed structures are clustered on the left and the right side,
respectively, and the intermediate conformations (1aid, 3t11,
4ej8, etc.) span the middle region. Though the number of
Acknowledgments
We gratefully acknowledge the support provided by NIH Grant
R01GM072014 and NSF Grant MCB-1021785.
We used several Matlab functions from MAVEN by Zimmerman
et al. [12] as mentioned in Subheading 4. MAVEN is also useful to
compute PCs, PC plots, ANM modes, and overlaps between PCs and
ANM modes.
References
1. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The protein data bank. Nucleic Acids Res 28(1):235-242, PMCID: PMC102472
2. Hotelling H (1933) Analysis of a complex of statistical variables into principal components. J Educ Psychol 24:417-441
3. Manly B (1986) Multivariate statistics: a primer. Chapman & Hall, Boca Raton
4. Pearson K (1901) On lines and planes of closest fit to systems of points in space. Philos Mag 2(6):559-572
5. Amadei A, Linssen AB, Berendsen HJ (1993) Essential dynamics of proteins. Proteins 17:412-425
6. Amadei A, Linssen AB, de Groot BL, van Aalten DM, Berendsen HJ (1996) An efficient
Chapter 11
Computing Ensembles of Transitions with Molecular
Dynamics Simulations
Juan R. Perilla and Thomas B. Woolf
Abstract
A molecular understanding of conformational change is important for connecting structure and function.
Without the ability to sample on the meaningful large-scale conformational changes, the ability to infer
biological function and to understand the effect of mutations and changes in environment is not possible.
Our Dynamic Importance Sampling method (DIMS), part of the CHARMM simulation package, is a
method that enables sampling over ensembles of transition intermediates. This chapter outlines the context
for the method and the usage within the program.
Key words Conformational transition, Sampling intermediates, Relative free energy, Statistical
mechanics of proteins, Structure-function
1 Introduction
Starting from a seminal paper by McCammon et al. [1], the field of
protein molecular dynamics has evolved rapidly. This reflects the
interest of a broad section of the biophysics community in understanding how a detailed X-ray or NMR structure can be connected
to measured function [2]. The force-fields that are integrated on
the computer to enable the sampling of motions have also improved
dramatically along with the hardware that is enabling more and
more complex systems of growing size and timescales to be
explored on a detailed level with the use of computers. Probably
the most extensive tests of force-fields and simulation times have
come from the DE Shaw group and their recent sets of microsecond to millisecond simulations of BPTI and of a WW-domain [3].
In this case the computations were able to reach conformational
substates that were not obvious from the X-ray structures, but
that are fully consistent with the available experimental information. The exciting news from their results is that the force-fields
currently in use seem to be capable of generating insights on
much longer time-scales than had ever before been attempted.
Andreas Kukol (ed.), Molecular Modeling of Proteins, Methods in Molecular Biology, vol. 1215,
DOI 10.1007/978-1-4939-1465-4_11, Springer Science+Business Media New York 2015
The less exciting news is that reaching these time scales, even with
supercomputer access, is challenging for larger systems and for
complex conformational change. Thus there remains a need to
develop methods to enhance sampling of rare events that are poorly
sampled in a general molecular dynamics trajectory.
We suggest that understanding large conformational change
is the next real frontier for the molecular dynamics community.
The finding that fluctuations from the X-ray or NMR starting
points agree with experiment is now accepted by most
researchers. The DE Shaw results thus further confirm and extend
the validity of the molecular dynamics method. However, this validation of the method still leaves many questions for how to best
apply it to a given biophysical question. If we want to know how
a channel gates or how a kinase switches states, then it is not obvious that the best solution is the DE Shaw one of simply running
the calculations for a very, very long time. Phrased another way, the
number of interesting biophysical questions far outnumbers the
available computer cycles. So, we need to find ways to efficiently
use the computer resources that we do have access to and to most
confidently sample the interesting biological changes. In other
words, answering these important biologically inspired questions demands a level of computation that will not be readily
met by even the most advanced computer systems. In general the
initial excitement about the molecular dynamics method as a panacea then led to disappointments and is now in a slowly growing
re-enthusiasm as the computer speeds have reached a stage where
more elaborate calculations with greater statistical accuracy can be
performed. This is also reflected in the recent Nobel Prize awards
for the development of the approach.
1.1 Why Transitions Are Important
Fig. 1 Human epidermal growth factor receptor (HER3) exhibits a large conformational change that prevents exposure of the dimerization arm (blue and pink)
by interactions between domains I and IV (blue and green) [5, 6]
The basic idea of molecular dynamics is the integration of a coupled set of differential equations. By specifying the initial positions
(xyz-space) through the use of a structure determined by X-ray or
NMR methods, and by drawing an initial assignment of velocities
to be consistent with a thermodynamic temperature, the equations
of motion in classical space are well defined. The problems with the
methods currently used are largely due to limitations in the time-scale
of sampling and the relative accuracy of the potential functions.
The potential function defines the forces that are used in the
$F = ma = -\nabla V(x)$ that underlies the solution of the coupled set of
1.3 Importance Sampling and SDE
parameter along a complex multidimensional terrain. In one situation the transition state is fully sampled along the order parameter.
In contrast, for another situation, the order parameter creates a
systematic error in sampling along the transition surface, suggesting that the barrier is found about half-way along the order parameter, rather than realizing that the bottleneck is closer to the
starting state in the projection of the dimensions. This problem
can only be considered even more difficult for a multidimensional
system with an even larger number of degrees of freedom that need
to be averaged out in order to get good sampling.
Fig. 2 Most dominant mode from PCA analysis for the human epidermal growth
factor receptor (HER3). The direction of the motion is similar to the conformational change observed in crystal structures (Fig. 1)
4.1 Loading the Structures
Fig. 3 If a trial molecular dynamics step is towards the target B then the motion
is accepted. A motion away from the target is only accepted with a certain probability, and this probability decreases as the trial move is away from the target.
The algorithm is similar to a Brownian ratchet system with the general direction
of the random walk being towards the target [83]
! target configuration
OPEN READ UNIT 1 CARD NAME target.crd
READ COOR CARD UNIT 1
CLOSE UNIT 1
COOR COPY DIMS
! starting configuration
OPEN READ UNIT 1 CARD NAME start.crd
READ COOR CARD UNIT 1
CLOSE UNIT 1
4.2 Setting Up DIMS
! set up DCAR-DIMS
This example gently moves the system toward the target without
a restriction on the total time. The rejection constant in this
example is set to =1 105, selection of the rejection constant is
system dependent, and must be explored before production runs.
If the barrier height is not high enough under some conditions this
algorithm will not converge. When the barriers to conformational
change are small, this approach will converge with a better DIMS
or Onsager-Machlup (OM) score. Because DIMS uses RMSD as the default
progress parameter, the two structures generally need to be aligned
in order to eliminate rotations and translations. In the
previous example the structures are aligned every 1,000 steps using
the second atom selection for the alignment.
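The soft-ratcheting acceptance rule illustrated in Fig. 3 can be sketched in a few lines of Python. This is an illustration only and not the CHARMM implementation: the Gaussian form of the acceptance probability, the function names, and the `softness` parameter (which stands in for the role played by the rejection constant) are assumptions made for this example.

```python
import math
import random


def acceptance_probability(progress_old, progress_new, softness):
    """Probability of accepting a trial step in a soft-ratcheting scheme.

    progress_old and progress_new are distances to the target (for example
    the RMSD), so a smaller value means the system is closer to the target.
    The Gaussian decay for moves away from the target is an assumed form.
    """
    delta = progress_new - progress_old
    if delta <= 0.0:
        # The trial step moves toward the target: always accept.
        return 1.0
    # The trial step moves away from the target: accept less often the
    # further the step goes in the wrong direction.
    return math.exp(-((delta / softness) ** 2))


def accept_step(progress_old, progress_new, softness, rng=random):
    """Draw a random number and decide whether the trial step is kept."""
    probability = acceptance_probability(progress_old, progress_new, softness)
    return rng.random() < probability
```

A small `softness` makes the ratchet strict, while a large value lets the system wander more freely; as the text notes for the rejection constant, a suitable value is system dependent and has to be explored before production runs.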
As previously mentioned, DIMS can also use a bias based on
the Normal Modes from the initial structure, and recalculate them
as the simulation progresses. Calculation of the modes is performed
by using the block normal mode method available in CHARMM.
A typical usage of the NM-biasing method is:
DIMS DBNM DSCALe 0.1 SKIP 500 BSKIP 50 NBIAs 27 - ! dims options
SERL GENR SCAL 0.5882 TMEM 420 MEMO 20 MEMA 400 NMOD 30 - ! BNM options
COFF 2.0 HARD - ! NM Hard Cutoff
ORIEnt 20 - ! DIMS selection
COMB 3 NBES 15
NWINDow 12
MTRA @I NMUNit 10
DSUNIT
Notes
Trajectories generated by DIMS are completely independent [46];
therefore, ensembles of transitions can be generated by running
multiple instances of CHARMM. Additionally, DIMS can
compute the Onsager-Machlup action functional for each trajectory as the simulation progresses. This calculation is activated
by adding the flag OMSC to the DYNA command [46]. The current implementation of DIMS is not limited to RMSD as the
progress variable: it can also use interatomic distances, angles, and
dihedrals, as well as combinations of these, in order to generate
collective variables. This flexibility allows, for instance, the use of
native contacts of the target structure as the progress variable.
References
1. McCammon JA, Gellin B, Karplus M (1977) Dynamics of folded proteins. Nature 267:585–590
2. Schlick T, Collepardo-Guevara R, Halvorsen LA, Jung S, Xiao X (2011) Biomolecular modeling and simulation: a field coming of age. Q Rev Biophys 138
3. Shaw DE et al (2010) Atomic-level characterization of the structural dynamics of proteins. Science 330:341–346
4. Creighton TE (1993) Proteins: structures and molecular properties. Macmillan, New York
5. Ferguson KM et al (2003) EGF activates its receptor by removing interactions that autoinhibit ectodomain dimerization. Mol Cell 11:507–517
6. Perilla JR, Leahy DJ, Woolf TB (2013) Molecular dynamics simulations of transitions for ECD epidermal growth factor receptors show key differences between human and drosophila forms of the receptors. Proteins 81:1113–1126
7. Gerstein M, Lesk AM, Chothia C (1994) Structural mechanisms for domain movements in proteins. Biochemistry 33:6739–6749
8. Fischer S (1992) Conjugate peak refinement: an algorithm for finding reaction paths and accurate transition states in systems with many degrees of freedom. Chem Phys Lett 194:252–261
9. Gruia AD, Bondar A-N, Smith JC, Fischer S (2005) Mechanism of a molecular valve in the halorhodopsin chloride pump. Structure 13:617–627
10. Elber R, Karplus M (1987) A method for determining reaction paths in large molecules: application to myoglobin. Chem Phys Lett 139:375–380
21. Pratt LR (1986) A statistical method for identifying transition states in high dimensional problems. J Chem Phys 85:5045–5048
22. Chandler D, Pratt LR (1976) Statistical mechanics of chemical equilibria and intramolecular structures of nonrigid molecules in condensed phases. J Chem Phys 65:2925–2940
23. Bolhuis PG, Chandler D (2000) Transition path sampling of cavitation between molecular scale solvophobic surfaces. J Chem Phys 113:8154–8160
24. Huo S, Straub JE (1997) The MaxFlux algorithm for calculating variationally optimized reaction paths for conformational transitions in many body systems at finite temperature. J Chem Phys 107:5000–5006
25. Ren W, Eijnden EV, Maragakis P, Weinan E (2005) Transition pathways in complex systems: application of the finite-temperature string method to the alanine dipeptide. J Chem Phys 123:134109
26. Maragliano L, Fischer A, Vanden-Eijnden E, Ciccotti G (2006) String method in collective variables: minimum free energy paths and isocommittor surfaces. J Chem Phys 125:24106
27. Eastman P, Gronbech-Jensen N, Doniach S (2001) Simulation of protein folding by reaction path annealing. J Chem Phys 114:3823–3841
28. Onsager L, Machlup S (1953) Fluctuations and irreversible processes. Phys Rev 91:1505–1512
29. Jónsson H, Mills G, Jacobsen KW (1998) Classical and quantum dynamics in condensed phase simulations. In Berne BJ, Coker DF. Proceedings of the International School of Physics. LERICI, Villa Marigola. pp. 385–404
30. Crehuet R, Field MJ (2003) A temperature-dependent nudged-elastic-band algorithm. J Chem Phys 118:9563–9571
31. Peters B, Heyden A, Bell A, Chakraborty A (2004) A growing string method for determining transition states: comparison to the nudged elastic band and string methods. J Chem Phys 120:7877–7886
32. Trygubenko S, Wales D (2004) A doubly nudged elastic band method for finding transition states. J Chem Phys 120:2082–2094
33. Mathews D, Case D (2006) Nudged elastic band calculation of minimal energy paths for the conformational change of a GG noncanonical pair. J Mol Biol 357:1683–1693
34. Kuczera K, Jas GS, Elber R (2009) Kinetics of
helix unfolding: molecular dynamics simula-
Chapter 12
Accelerated Molecular Dynamics and Protein
Conformational Change: A Theoretical and Practical
Guide Using a Membrane Embedded Model
Neurotransmitter Transporter
Patrick C. Gedeon, James R. Thomas, and Jeffry D. Madura
Abstract
Molecular dynamics simulation provides a powerful and accurate method to model protein conformational
change, yet timescale limitations often prevent direct assessment of the kinetic properties of interest.
A large number of molecular dynamic steps are necessary for rare events to occur, which allow a system to
overcome energy barriers and conformationally transition from one potential energy minimum to another.
For many proteins, the energy landscape is further complicated by a multitude of potential energy wells,
each separated by high free-energy barriers and each potentially representative of a functionally important
protein conformation. To overcome these obstacles, accelerated molecular dynamics utilizes a robust bias
potential function to simulate the transition between different potential energy minima. This straightforward approach more efficiently samples conformational space in comparison to classical molecular dynamics simulation, does not require advanced knowledge of the potential energy landscape and converges to
the proper canonical distribution. Here, we review the theory behind accelerated molecular dynamics and
discuss the approach in the context of modeling protein conformational change. As a practical example,
we provide a detailed, step-by-step explanation of how to perform an accelerated molecular dynamics
simulation using a model neurotransmitter transporter embedded in a lipid cell membrane. Changes in
protein conformation of relevance to the substrate transport cycle are then examined using principal
component analysis.
Key words Biological transport, Membranes, Molecular dynamics simulation, Neurotransmitter
transport proteins, Protein conformation
1 Introduction
Classic molecular dynamics (cMD) simulations are used to study
the kinetic behavior of proteins. By implementing fundamental
laws of motion, the technique is able, on an atom-by-atom basis,
to accurately predict the dynamic behavior of proteins in various
modeled environments. The technique effectively samples conformational space in a time-dependent manner, and if conducted for
Andreas Kukol (ed.), Molecular Modeling of Proteins, Methods in Molecular Biology, vol. 1215,
DOI 10.1007/978-1-4939-1465-4_12, Springer Science+Business Media New York 2015
2 Theory
Expanding on a previous hyperdynamics method explored by
Voter [20, 21], the aMD method allows for assessment of protein
conformational change by reducing the computational time a simulated protein spends in a potential energy basin, allowing the system
to traverse potential energy barriers more readily. This is accomplished by adding a bias potential to the true potential when the
system's potential energy falls below a threshold level (Fig. 1). While
Voter's method of implementing the bias potential requires the
Hessian matrix to be diagonalized at each time step in order to
identify transition state regions, the aMD method is based
on a simpler bias potential proposed by Steiner et al. [22] and
implemented by Rahman and Tully [23]. In this "puddles" method,
the bias potential is selected so that the modified potential near the minima remains constant if the true potential of the
system falls below a selected threshold level. Accordingly, diagonalization of the Hessian matrix is not required at each step, allowing the simulation method to be applied to larger systems such
as proteins.
Specifically, a nonnegative, continuous bias boost potential
function ΔV(r) is defined such that when the true potential of a
system, V(r), falls below a specified boost energy, E, the simulation
is carried out using the modified potential V*(r) = V(r) + ΔV(r),
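$$V^{*}(r) = \begin{cases} V(r), & V(r) \geq E \\ V(r) + \Delta V(r), & V(r) < E \end{cases} \tag{1}$$
$$\Delta t_i^{*} = \Delta t \, e^{\beta \Delta V[r(t_i)]} \tag{2}$$
$$t^{*} = \sum_{i}^{N} \Delta t_i^{*} = \Delta t \sum_{i}^{N} e^{\beta \Delta V[r(t_i)]} \tag{3}$$
$$t^{*} = t \left\langle e^{\beta \Delta V[r(t_i)]} \right\rangle \tag{4}$$
where $\langle e^{\beta \Delta V[r(t_i)]} \rangle$ is the boost factor, a measure of the simulation
acceleration extent, and N is the total number of simulation steps.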
Importantly, the aMD method converges to the canonical distribution, allowing for accurate determination of equilibrium and
other thermodynamic properties. The phase space for the modified
potential can be reweighted at each point by multiplying individual
configurations by the bias strength at each configuration, resulting
in a corrected ensemble average equivalent to that observed with
the normal potential [24, 25]. This approach as well as other
reweighting approaches allowing for accurate free energy calculation of the resulting trajectories continue to be explored.
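The reweighting described above can be sketched as follows. This is a minimal illustration, assuming that the boost ΔV of every saved frame is available in kcal/mol and that each frame is weighted by exp(βΔV); the function names and the example temperature are chosen for this sketch and are not part of any simulation package.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def boost_weights(delta_v, temperature=310.0):
    """Return exp(beta * dV) for every frame (dV in kcal/mol)."""
    beta = 1.0 / (K_B * temperature)
    return [math.exp(beta * dv) for dv in delta_v]


def boost_factor(delta_v, temperature=310.0):
    """Average of exp(beta * dV) over the trajectory (the boost factor)."""
    weights = boost_weights(delta_v, temperature)
    return sum(weights) / len(weights)


def reweighted_average(values, delta_v, temperature=310.0):
    """Ensemble average of an observable, corrected for the bias potential."""
    weights = boost_weights(delta_v, temperature)
    weighted_sum = sum(a * w for a, w in zip(values, weights))
    return weighted_sum / sum(weights)
```

Frames that received no boost (ΔV = 0) keep a weight of one, so an unbiased trajectory gives back the plain average. Because the weights are exponentials, a few frames with a large boost can dominate the average, which is one reason alternative reweighting schemes continue to be explored.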
In the original method described by Rahman and Tully, the
boost potential ΔV(r) is defined as E − V(r). With this method,
when the true potential energy is below the threshold boost energy
E, the modified potential energy is V*(r) = E. This produces flat
regions or "puddles" at potential energy basins. While this method
is computationally inexpensive, complications due to the discontinuity at points where the unmodified potential meets the modified
potential result in the need for special computations which increase
the computational burden overall. More importantly, at a high
boost energy, the flat modified potential exists at a level higher
than most transition state regions. As a result the system may
undergo a random walk and is slow to converge.
Alternatively, the aMD method utilizes a "snow drift" approach
which fills the minima, producing a smoother landscape. The
shape of the underlying potential energy surface is maintained even
at a high boost energy E. The method results in a smooth transition where the unmodified potential energy above the boost energy
meets the modified potential energy. In order to accomplish this,
ΔV(r) is defined by the equation
$$\Delta V(r) = \frac{(E - V(r))^{2}}{\alpha + (E - V(r))} \tag{5}$$
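Equations 1 and 5 translate directly into a short function. The sketch below uses hypothetical function names and plain numbers for V, E, and α; it only illustrates how the tuning parameter α interpolates between the flat "puddles" potential (α = 0) and the unmodified potential (large α).

```python
def boost_potential(v, e, alpha):
    """Boost dV(r) of Eq. 5; zero when the true potential is at or above E."""
    if v >= e:
        return 0.0
    diff = e - v
    return diff * diff / (alpha + diff)


def modified_potential(v, e, alpha):
    """Modified potential V*(r) of Eq. 1."""
    return v + boost_potential(v, e, alpha)


# With alpha = 0 every point below E is lifted exactly to E (a flat puddle);
# a larger alpha keeps more of the shape of the underlying surface.
for alpha in (0.0, 10.0, 1000.0):
    print(alpha, modified_potential(0.0, 10.0, alpha))
```

The boost vanishes as V(r) approaches E from below, which is what gives the smooth join between the modified and unmodified regions.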
While aMD simulation allows rapid sampling of multiple conformational states, analyzing the resultant trajectories can be problematic
due to the large amount of data produced. While the traditional root
mean square deviation (RMSD) method can be used to distinguish
between different conformational states, this method is not as effective as the system undergoes transitions between subtle yet significant
conformational states yielding low RMSD values. Instead, principal
component analysis (PCA) can be used as a more sensitive method to
distinguish between different conformational states.
PCA reduces the dimensionality of large data sets by calculating a covariance matrix and its eigenvectors. Vectors with the highest eigenvalues become the most significant principal components.
When principal components are plotted against each other, similar
structures cluster. Each cluster then theoretically represents a different protein conformational state.
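The covariance and eigenvector step can be illustrated without any external library. The sketch below is a toy example under stated assumptions: each frame is a flat tuple of already superposed coordinates, and only the dominant principal component is extracted, using power iteration rather than a full eigendecomposition. All names are hypothetical, and power iteration needs a starting vector that is not orthogonal to the dominant mode.

```python
import math


def covariance_matrix(frames):
    """Mean structure and covariance matrix of the coordinates."""
    n_frames = len(frames)
    n_coords = len(frames[0])
    mean = [sum(f[j] for f in frames) / n_frames for j in range(n_coords)]
    centered = [[f[j] - mean[j] for j in range(n_coords)] for f in frames]
    cov = [
        [sum(row[i] * row[j] for row in centered) / (n_frames - 1)
         for j in range(n_coords)]
        for i in range(n_coords)
    ]
    return mean, cov


def dominant_component(cov, iterations=500):
    """Largest eigenvalue and its eigenvector, found by power iteration."""
    n = len(cov)
    vector = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        vector = [x / norm for x in product]
    product = [sum(cov[i][j] * vector[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(vector[i] * product[i] for i in range(n))
    return eigenvalue, vector


def project(frame, mean, vector):
    """Score of one frame along a principal component."""
    return sum((x - m) * v for x, m, v in zip(frame, mean, vector))


# Toy trajectory: large motion along the first coordinate, small along the second.
frames = [(-1.0, 0.0), (1.0, 0.0), (0.0, 0.1), (0.0, -0.1)]
mean, cov = covariance_matrix(frames)
eigenvalue, pc1 = dominant_component(cov)
scores = [project(f, mean, pc1) for f in frames]
```

Plotting such scores for the first few components against each other gives the cluster plots described above.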
Since PCA can be calculated using coordinates from any subset
of atoms within a given protein, atom selection can have a large
effect on observed outcomes. A common protocol used to avoid
sample noise from random fluctuations is to calculate the PCA only
for backbone carbon atoms. Alternatively, specific residues or segments of a given protein can be selected based on experimental
observations and isolated in a PCA analysis. For example, following aMD simulation of the leucine transporter, the combination of
coordinate positions for the backbone carbon atoms of the transmembrane helical domains 1b and 6a was a better discriminator of
conformations than the calculations of either the whole structure
or any other helical domains alone or in combination [15].
3 Methods
Below we present a protocol for performing aMD simulation and
assessing protein conformational change. As an example, we highlight our recent work modeling the bacterial leucine transporter
(LeuT), a homologue of the eukaryotic Na+/Cl−-dependent neurotransporters responsible for terminating synaptic transmission
by driving the cellular uptake of neurotransmitters, including the
biogenic amines. These proteins are the targets of numerous pharmacological compounds and their dysfunction is associated with
disorders of the nervous system. Through the use of aMD simulation, we have gained insight into the function of this class of proteins [15, 26].
In order to provide the reader with a guide such that an equivalent technique can be extended to the study of any protein of
interest, we will discuss (1) building a simulation environment
suitable for aMD simulation and the study of protein conformational change, (2) preparatory energy minimizations, heating the
system, cMD equilibration, and aMD production runs, and (3)
analysis of protein conformational change using PCA.
3.1 Computer Software
3.2 Obtain the Starting Protein Coordinates
3.3 Build Coordinates for a Lipid Membrane
Fig. 4 LeuT positioned within the POPE lipid membrane. Nonpolar residues (white) at the membrane interface
are in contact with the hydrophobic lipid tails of the POPE membrane, and polar residues (green) at the membrane
interface are either in contact with the polar head groups or out of the transverse plane of the lipid bilayer
Review of the headers in the 2A65 PDB file indicates that the four
most N- and C-terminal residues and residues N133 and A134 are
missing. We will build the missing non-terminal residues (N133
and A134) into our structure, but leave out the terminal residues.
This task can be accomplished in UCSF Chimera and MODELLER
as follows:
1. In UCSF Chimera load the membrane-aligned 2A65 PDB file
by selecting File → Open and navigating to the PDB structure.
While it is often not necessary to specify the file type, it is
good practice to do so to avoid unexpected
errors.
2. Select Tools → Structure Editing → Model/Refine Loops. This
brings up sequence information for the protein. Missing residues are highlighted with red boxes. Selecting Model/Refine
Loops will also open UCSF Chimera's interface to MODELLER.
3. In the Model Loops/Refine Structure window select "non-terminal missing structure." This selects residues that are found
in the PDB SEQRES record, but missing from the PDB file
coordinates. This selection only selects residues constrained at
both ends by existing structures, so the N- and C-terminal
missing residues will not be built.
4. Other building parameters can also be modified. In this example, we allow 0 residues adjacent to the missing regions to
move, generate 1 model, use the Discrete Optimized Protein
Energy (DOPE) energy score [32], and run MODELLER
3.8 Generate a Nonstandard POPE Lipid Unit Using Antechamber
Prior to performing production calculations such as energy minimizations or MD simulations in AMBER, it is necessary to define:
(1) a "prmtop" file that contains the required force field parameters and a description of the molecular topology and (2) an
"inpcrd" file that contains the atomic coordinates. An inpcrd file
can also contain atomic velocities and periodic box information if
defined. Here, AMBER's LEaP program will be used to generate
the files necessary for production calculations.
In order to realistically model protein behavior, it is furthermore necessary to solvate the system with water molecules and
ions. Any missing atoms including hydrogen atoms and N- and
C-terminus specific atoms which have not previously been included
can be added at this point. Conveniently, LEaP will perform all of
saveamberparm system leut_system.prmtop leut_system.inpcrd
quit
$AMBERHOME/bin/sander -O -i minimize.in -o minimize.out -c leut_system.inpcrd -p leut_system.prmtop -r leut_system_min.rst
Here -O specifies that output files should overwrite any
existing files, -i specifies the name of the mdin file, -o specifies the name of the output file, -c specifies the starting inpcrd
file, -p specifies the prmtop file, and -r specifies the name of
the file for the final coordinates following minimization.
Beginning with the rst file from the previous minimization,
the following mdin file can be used to minimize the lipids as well:
Minimization of solvent and lipids
&cntrl
imin = 1,
maxcyc = 1000,
ncyc = 400,
ntb = 1,
ntr = 1,
cut = 12
/
Hold protein fixed
250.0
RES 1 511
END
END
Finally, all atoms in the simulation are minimized with the following mdin file:
Minimization of all atoms
&cntrl
imin = 1,
maxcyc = 1000,
ncyc = 400,
ntb = 1,
ntr = 0,
cut = 12
/
3.11 Heat the System and Perform an Initial cMD Equilibration
Following energy minimization the system is ready for cMD equilibration. The system temperature is linearly increased from 0 to
310 K (tempi=0.0; temp0=310.0) in order to prevent excessive
and sudden solute fluctuations. A weak restraint is placed on the
protein and lipid residues to further aid in this regard. A Langevin
temperature equilibration scheme (ntt=3) will be used to equalize
and maintain the system temperature using a collision frequency of
1.0 ps^-1 (gamma_ln=1.0).
While equilibration will ultimately be performed using constant
pressure and temperature parameters, as the system is heating up the
pressure that is calculated can be inaccurate, leading to issues if constant pressure parameters are employed. The use of restraints with
obtain information about the system that will be required for aMD
production runs.
The following cMD.in script can be used to accomplish this:
cMD equilibration to obtain parameters necessary for aMD production runs
&cntrl
imin = 0, irest = 1, ntx = 7,
ntb = 2, pres0 = 1.0, ntp = 2, taup = 2.0,
cut = 12.0,
ntr = 0,
ntc = 2, ntf = 2,
tempi = 310.0, temp0 = 310.0,
ntt = 3, gamma_ln = 1.0,
nstlim = 500000, dt = 0.002
ntpr = 2000, ntwx = 2000, ntwr = 2000
/
In the cMD.in script above, the parameters that have
changed in comparison to the heat.in script are defined as follows. Since the simulation is being restarted, the time step is read
in from the previous run (irest=1) and the coordinates being read
in are in ASCII restart format (ntx=7). Anisotropic pressure scaling will be used (ntp=2) to maintain a constant pressure (ntb=2)
with an average pressure of 1 atm (pres0=1.0) and a relaxation
time of 2 ps (taup=2.0). No position restraints are used (ntr=0).
One nanosecond of simulation time will be obtained
(nstlim=500000, dt=0.002). The following command can be
used to execute the script:
$AMBERHOME/bin/sander -O -i cMD.in -o cMD.out -c leut_system_heated.rst -p leut_system.prmtop -r leut_system_cMD_equil.rst
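The run length and the number of saved frames implied by the cMD.in settings can be checked with a few lines of arithmetic. The helper below is a hypothetical convenience for this check, not part of AMBER; it assumes dt is given in picoseconds and that one frame is written every ntwx steps.

```python
def run_summary(nstlim, dt_ps, ntwx):
    """Total simulated time and trajectory size for a set of mdin values."""
    total_ps = nstlim * dt_ps
    return {
        "total_ns": total_ps / 1000.0,   # simulated time in nanoseconds
        "frames": nstlim // ntwx,        # coordinate frames written
        "ps_per_frame": ntwx * dt_ps,    # spacing between saved frames
    }


summary = run_summary(nstlim=500000, dt_ps=0.002, ntwx=2000)
print(summary)
```

For the values above this gives 1 ns of simulation stored as 250 frames spaced 4 ps apart.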
3.12 Perform aMD Production Runs
The cMD equilibrated system and data obtained from the cMD
equilibration can now be used for aMD production runs. aMD
production runs are conducted in a very similar fashion to the
cMD equilibration described in Subheading 3.11, with the only
difference being the definition of additional aMD-specific
parameters. The following aMD.in script can be used for
aMD production runs:
aMD production run
&cntrl
imin=0, irest=1, ntx=7,
ntb = 2, pres0 = 1.0, ntp = 2, taup = 2.0,
cut = 12.0,
ntr = 0,
ntc = 2, ntf = 2,
tempi = 310.0, temp0 = 310.0,
ntt = 3, gamma_ln = 1.0,
library(bio3d)
pdb <- read.pdb("LeuT_frame1.pdb")
dcd <- read.dcd("LeuT_trj.dcd")
Select the atoms of interest for the PCA analysis. The atom.select
command pulls the indices of atoms which correspond to the atom
selection. For our analysis, we choose to use all alpha carbon positions of select residues. The elety parameter of the atom.select
command specifies the atom type (CA=alpha carbon), and the
resno parameter can be used to select residues by number.
fit.xyz(pdb$xyz[ca.ind$xyz],
The PCA of the coordinates can be taken based on the trajectory with the pca.xyz command.
trj.pca <- pca.xyz(trj.fit)
With the Bio3D package installed, the plot command has
been overloaded to create a default PCA plot with four graphs.
Three are the z-scores of the first three principal components
plotted against each other in two dimensions. The last is a scree plot
representing how much of the variance of the data set is captured
by each principal component (Fig.5).
plot(trj.pca)
The plot points can be computationally clustered and colored
by cluster. This can be done by creating a distance matrix of the
principal components of interest (principal components 1, 2, and 3
were used in this analysis).
d <- dist(trj.pca$z[,1:3])
A dendrogram of the distance matrix can be calculated and
plotted for visualization. The plot of the dendrogram can be used
to determine the number of clusters desired from the analysis.
hc <- hclust(d)
plot(hc)
The cutree command can be used to create a color vector which
will color each point based on the number of groups desired. It
takes two arguments, the dendrogram and k which is the desired
number of clusters. Here, the LeuT data appeared to fall into seven
clusters. The PCA plots can be colored by replotting the PCA and
using the output from the cutree command as an argument to the
col parameter (Fig.6). For structure analysis, representative structures from each cluster can be isolated and analyzed in molecular
visualization software of choice like VMD.
grps <- cutree(hc,k=7)
plot(trj.pca,col=grps)
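To make the clustering step less of a black box, the sketch below shows what a hierarchical clustering followed by a cut into k groups does to a handful of points. It is a simplified pure-Python illustration with hypothetical names, using complete linkage and Euclidean distances, and it is not a substitute for the R functions used above.

```python
import math


def cluster_labels(points, k):
    """Agglomerative clustering (complete linkage) cut into k groups."""
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        # Complete linkage: distance between the two most distant members.
        return max(math.dist(points[a], points[b]) for a in c1 for b in c2)

    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the two closest clusters and continue.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [0] * len(points)
    for label, members in enumerate(clusters, start=1):
        for m in members:
            labels[m] = label
    return labels


# Four points in the plane of the first two principal components.
pc_scores = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
groups = cluster_labels(pc_scores, k=2)
print(groups)
```

The resulting labels play the same role as the color vector passed to the plot: frames with the same label belong to the same putative conformational state.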
3.15 Validation and Use of the PCA
PCA of aMD data is useful for finding protein segments that are
most involved in structural changes. The RMSF (root mean square
fluctuations) calculation can be used to determine how much each
residue moves during the trajectory.
rf <- rmsf(trj.fit)
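For readers who want to see what an RMSF calculation involves, the following sketch computes it for a toy trajectory. It assumes the frames have already been superposed and uses the population form of the average (division by the number of frames); the exact normalization used by a given package may differ, and the function name is chosen for this example only.

```python
import math


def rmsf_per_atom(frames):
    """Root mean square fluctuation of each atom about its mean position.

    frames is a list of frames; each frame is a list of (x, y, z) tuples,
    one per atom, taken from an already superposed trajectory.
    """
    n_frames = len(frames)
    n_atoms = len(frames[0])
    result = []
    for a in range(n_atoms):
        mean = [sum(f[a][k] for f in frames) / n_frames for k in range(3)]
        msd = sum(
            sum((f[a][k] - mean[k]) ** 2 for k in range(3)) for f in frames
        ) / n_frames
        result.append(math.sqrt(msd))
    return result


# Atom 1 does not move; atom 2 oscillates by 1 unit along x.
toy = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0)],
]
print(rmsf_per_atom(toy))
```

Residues with a much larger RMSF than the rest, such as flexible termini, are the ones most likely to dominate the principal components.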
Fig. 6 Bio3D plot of PCA data colored by cluster after calculation of cluster groups
Fig. 7 Root mean square fluctuations graph superposed with the $au vectors
representing the amount of variance captured by each principal component. This
graph has all residues included, and the carboxy terminus appears to dominate
the principal component
Fig. 8 RMSF graph superposed with the $au vectors of the principal components
after adjusting the PCA by removing residues 506–512
barplot(rf.adj, col="purple", border="purple", main="Adjusted")
par(new=TRUE)
plot(trj.pca.adj$au[,1], type="l", col="orange", lwd=3)
points(trj.pca.adj$au[,2], type="l", col="green", lwd=3)
points(trj.pca.adj$au[,3], type="l", col="skyblue", lwd=3)
Once a suitable RMSF plot has been obtained, the PCA can be
reclustered and replotted to visualize the new PCA result (Fig.9).
d <- dist(trj.pca.adj$z[,1:3])
hc <- hclust(d)
grps <- cutree(hc,k=7)
plot(trj.pca.adj,col=grps)
The fluctuations captured by each principal component can
also be visualized as a trajectory in VMD. Use the mktrj.pca command to generate a VMD trajectory for visualizing the captured
fluctuations in each principal component desired. Load the PDB
file into VMD and change the representation to Trace or Tube,
since the PCA only printed out the alpha carbons.
Fig. 9 Adjusted PCA plot of LeuT with residues 506–512 removed and each point colored by cluster
mktrj.pca(trj.pca,pc=1,mag=1, file="PC1.pdb")
As mentioned above in the background, it may be necessary to
reweight the PCA plots to determine the validity of each point. The
following steps are necessary in order to accomplish this:
1. Extract the $\Delta V[r(t_i)]$ for each structural point from the simulation log files using text manipulation tools like grep and awk.
2. Load these values into R.
3. Calculate $e^{\beta \Delta V[r(t_i)]}$
3.16 Compare the Resulting Conformations with Additional Structural Data
References
1. Lindorff-Larsen K, Piana S, Dror RO, Shaw DE (2011) How fast-folding proteins fold. Science 334(6055):517–520. doi:10.1126/science.1208351
2. Shaw DE, Maragakis P, Lindorff-Larsen K, Piana S, Dror RO, Eastwood MP, Bank JA, Jumper JM, Salmon JK, Shan Y, Wriggers W (2010) Atomic-level characterization of the structural dynamics of proteins. Science 330(6002):341–346. doi:10.1126/science.1187409
3. Genchev GZ, Kallberg M, Gursoy G, Mittal A, Dubey L, Perisic O, Feng G, Langlois R, Lu H (2009) Mechanical signaling on the single protein level studied using steered molecular dynamics. Cell Biochem Biophys 55(3):141–152. doi:10.1007/s12013-009-9064-5
4. Baker JL, Biais N, Tama F (2013) Steered molecular dynamics simulations of a type IV pilus probe initial stages of a force-induced conformational transition. PLoS Comput Biol 9(4):e1003032. doi:10.1371/journal.pcbi.1003032
5. Forti F, Boechi L, Estrin DA, Marti MA (2011) Comparing and combining implicit ligand sampling with multiple steered molecular
30. Grant BJ, Rodrigues AP, ElSawy KM, McCammon JA, Caves LS (2006) Bio3d: an R package for the comparative analysis of protein structures. Bioinformatics 22(21):2695–2696. doi:10.1093/bioinformatics/btl461
31. Yamashita A, Singh SK, Kawate T, Jin Y, Gouaux E (2005) Crystal structure of a bacterial homologue of Na+/Cl−-dependent neurotransmitter transporters. Nature 437(7056):215–223. doi:10.1038/nature03978
32. Shen MY, Sali A (2006) Statistical potential for assessment and prediction of protein structures. Protein Sci 15(11):2507–2524. doi:10.1110/ps.062416606
33. Aksimentiev A, Sotomayor M, Wells D (2012) Membrane proteins tutorial. Theoretical and Computational Biophysics Group, University of Illinois at Urbana-Champaign, Champaign
Chapter 13
Simulations and Experiments in Protein Folding
Giovanni Settanni
Abstract
The interplay between simulations and experiments of protein folding has largely contributed to the
elucidation of many important aspects of the phenomenon. In this chapter, I briefly describe the experiments which provide information on the kinetics of the protein folding process, and help to characterize
the folding transition state. Then, I show how to probe the kinetics of protein folding using molecular
dynamics simulations, how to compare the simulations with the experiments, and how to help rationalize
the latter, ultimately offering a molecular picture of the process. After the production of suitable molecular
dynamics simulation data in the form of trajectories, the procedure involves, in sequence, the identification
of the stable states of the protein, the identification of the transition pathways connecting the stable states,
the identification of the transition state conformations, comparison with experimental results, and finally, the
identification of the molecular determinants or reaction coordinates of the folding process, that is, the
features that clearly help distinguish the transition state from the stable states.
Key words Kinetics, Transition state, Committor, Phi-value, Clustering, Kinetic network
1 Introduction
Since its discovery [1], the phenomenon of reversible folding of
proteins has engaged generations of scientists from many different
fields. After decades of research the emerging picture often used
to describe the protein folding process, at least in its simplest form, is that of a first order phase transition between the native
state of the protein, and the unfolded/denatured state, the former
being enthalpically stabilized by favorable interactions internal to
the protein chain and with the solvent, the latter being entropically
stabilized by the large number of conformations that the protein
chain can sample within this state.
This rationalization of the protein folding process led to more
focused attempts to identify the determinants of the process.
Indeed, a first order phase transition implies the crossing of a free
energy barrier, which, then, represents a bottleneck in the transition
process. The height of the free energy barrier, which can be measured by experiments on the folding kinetics, and the way it changes
Andreas Kukol (ed.), Molecular Modeling of Proteins, Methods in Molecular Biology, vol. 1215,
DOI 10.1007/978-1-4939-1465-4_13, Springer Science+Business Media New York 2015
Giovanni Settanni
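\[
\Phi_f = \frac{\Delta G^{mut}_{\ddagger-U} - \Delta G^{WT}_{\ddagger-U}}{\Delta G^{mut}_{F-U} - \Delta G^{WT}_{F-U}} = \frac{\Delta\Delta G_{\ddagger-U}}{\Delta\Delta G_{F-U}} = \ln\frac{k_f^{mut}}{k_f^{WT}} \Big/ \Delta\Delta G_{F-U} \qquad (1)
\]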
where ΔG_{F-U} is the folding free energy, i.e., the free energy difference between the folded and unfolded states (see Note 1). In the classical interpretation, phi-values close to one mean that the mutation
has produced the same effect on the free energy of the transition state
and of the folded state (the free energy of the unfolded state is used as
reference); thus, the conformation of the mutated residue must be
similar in the two states. On the other hand, a phi-value close to
zero means that the mutation affected mostly the free energy of
the folded state and not that of the transition state; thus the conformation
of the residue in the transition state must be similar to that in the unfolded
state. Intermediate and non-classical (Φ<0 or Φ>1) phi-values
cannot be interpreted straightforwardly and each case requires
specific attention.
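As a minimal sketch of the classical interpretation above, the following Python fragment computes a phi-value as the ratio of the two free energy changes and sorts it into the cases just described. The function names and the tolerance `tol` used to decide what counts as "close to" zero or one are assumptions made for illustration only; they are not part of the protocol.

```python
def phi_value(ddg_ts_u, ddg_f_u):
    """Phi-value as the ratio of the mutation-induced change in the
    transition-state free energy to the change in the folding free
    energy (both measured relative to the unfolded state)."""
    if ddg_f_u == 0.0:
        raise ValueError("the mutation does not change the folding free energy")
    return ddg_ts_u / ddg_f_u


def classify_phi(phi, tol=0.2):
    """Sort a phi-value into the classes discussed in the text.
    The tolerance `tol` is an arbitrary, illustrative choice."""
    if phi < 0.0 or phi > 1.0:
        return "non-classical"
    if phi >= 1.0 - tol:
        return "native-like in the transition state"
    if phi <= tol:
        return "unfolded-like in the transition state"
    return "intermediate"


if __name__ == "__main__":
    phi = phi_value(ddg_ts_u=0.9, ddg_f_u=1.0)
    print(phi, classify_phi(phi))
```

In practice, mutations that change the folding free energy only slightly give a small denominator and therefore unreliable phi-values, so a threshold on the denominator is a sensible addition.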
Systematic measurements of phi-values of many amino acids,
sometimes termed alanine/glycine scans, are being made available
for many proteins. These measurements help to identify either
which amino acids take part in the formation of a folding nucleus
around which the overall folding process takes place (the amino
acids with large Φ), or the presence of a diffuse transition state,
where there are no amino acids with large Φ. Early results [4] have
shown that in many cases it is possible to correlate the phi-value of
an amino acid to the fraction of native atomic contacts formed in
the transition state. This relationship offers a way both to test
simulations and to interpret simulation data where experimental
data are not available [5].
2 Materials
All the simulations and analysis are performed on Linux workstations/clusters. In case study 1 simulations are performed using the
program GROMACS [16]. In case study 2 CHARMM [17] is used
to run the simulations. In either case, analysis of the trajectories is
performed using the program WORDOM [18]. Shell scripts and
AWK scripts are used for data analysis. The programs GNUPLOT
(http://www.gnuplot.info) and GRACE (http://plasma-gate.weizmann.ac.il/Grace/) are used for plotting graphs. The program
VMD [19] is used for visualization of protein structures and trajectories. The native conformation of the simulated Trpz1 peptide is
obtained from the Protein Data Bank, with pdbid 1LE0 [20].
3 Methods
3.1 Models, Force Fields, Simulation Techniques
Transition events between the stable states of the system can be identified only after the stable states themselves have been identified.
To this end, a set of observables relevant for the description of the
protein/peptide is selected and the time series of these observables
are measured along the trajectories. In the case of the Trpz1 peptide,
the root mean square deviation (RMSD) of the Cα atoms from the
native conformation and the number of native hydrogen bonds
formed along the backbone (HB) represent relevant observables.
Many ways are available to perform these measurements. I will show
how to use the program WORDOM (http://wordom.sourceforge.
net/), which is particularly fast and easy to handle. WORDOM analysis can be invoked using the following syntax:
where the wordom input file .wrd contains the following for
RMSD calculations:
Fig. 1 (a) Representative time series of the observables HB, RMSD and VIR for case study 1, simulations at
450 K. The horizontal lines mark the boundaries of each state as identified from the histograms of the observables. These boundaries mark the starting and ending points of transition pathways (regions shaded in
magenta). (b) Histogram of the observables from case study 1, simulations at 450 K. Lines around the populated regions indicate the boundaries of the stable states of the peptide. Adapted from ref. 10 with permission
from Elsevier. (c) Histogram of the observables for case study 2. The boundaries of the native (blue) and
denatured (red) state are shown
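The way state boundaries delimit transition pathways, as described in the caption above, can be illustrated with a short Python sketch. It labels each frame as native, denatured or neither, and then collects the stretches of unassigned frames that start in one state and end in the other. For simplicity only one observable (RMSD) is used, and the thresholds `native_max` and `denatured_min` are hypothetical values, not those of the case studies.

```python
def label_frame(rmsd, native_max, denatured_min):
    """Assign a frame to the native ('N') or denatured ('D') state,
    or to neither (None), from hypothetical RMSD boundaries."""
    if rmsd <= native_max:
        return "N"
    if rmsd >= denatured_min:
        return "D"
    return None


def transition_paths(labels):
    """Return (first_frame, last_frame, origin, destination) for every
    stretch of unassigned frames that leaves one state and enters the
    other. Excursions that return to the state of origin are ignored,
    as are direct jumps with no intermediate frame."""
    paths = []
    last_state, last_index = None, None
    for i, state in enumerate(labels):
        if state is None:
            continue
        if last_state is not None and state != last_state and i - last_index > 1:
            paths.append((last_index + 1, i - 1, last_state, state))
        last_state, last_index = state, i
    return paths


if __name__ == "__main__":
    rmsd_series = [0.5, 0.8, 2.0, 3.0, 4.5, 5.0, 3.0, 4.2, 2.5, 0.9]
    labels = [label_frame(r, native_max=1.0, denatured_min=4.0) for r in rmsd_series]
    print(transition_paths(labels))
```

With several observables, `label_frame` would simply test all the boundaries at once; the logic of `transition_paths` stays the same.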
3.4 Identifying the Transition State Ensemble
In case study 1, we have measured the committor for a few conformations along the transition pathways; we then need to find out, by
analyzing those conformations, whether there are observables that
correlate with the committor. We identified several possible observables and tested them by measuring the degree of correlation with
the committor (Fig. 3). In practice, this can be achieved by:
(a) Using wordom or VMD to measure possible observables along
the trajectory.
(b) Extracting (e.g., using awk) the value of the observables for
the conformations with known committor.
(c) Plotting the committor vs the observable.
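Besides plotting, the degree of correlation in step (c) can be quantified numerically. The following Python sketch computes the Pearson and the rank (Spearman) correlation between committor values and one observable, using only the standard library. The input values are invented for illustration, and the rank function does not handle ties.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Rank of each value (0 = smallest); ties are not averaged."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for position, index in enumerate(order):
        result[index] = float(position)
    return result


def spearman(x, y):
    """Rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    committor = [0.1, 0.3, 0.5, 0.9]     # invented values
    observable = [4.0, 2.5, -0.5, -3.0]  # invented values
    print(pearson(committor, observable), spearman(committor, observable))
```

A rank correlation is useful here because a good reaction coordinate only needs to vary monotonically with the committor, not linearly.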
In Fig. 3 only a fraction of all the tested observables is shown.
Visual inspection of the structures with pc~0.5 and comparison
with those from the two stable states must be used to guide the
selection of the relevant observables. As is often the case, the
observable that best represents the transition is a complex
combination of factors that describes the reciprocal position of
several groups of atoms. In the case of Trpz1 we found that the
reciprocal position of the four tryptophan side chains is crucial in
determining the stage of the transition. When the distance between
the side chains of Trp9 and Trp4 becomes smaller than the distance
between Trp4 and Trp2, then the committor increases. In other
words, when the Trp side chains align with each other as in the
native state, although the packing may not yet be native, then the
committor increases and those conformations are more likely to
fold rather than unfold.
In case study 2, we have measured an approximate committor
p'c for all the sampled conformations, so we can test whether this
approximation represents a good reaction variable. To this end
we adopt the procedure suggested by Hummer and coworkers
[34], in which we compute the conditional probability p(TP|p'c)
that conformations with a given p'c belong to a transition pathway.
In practice, we need to:
(a) Combine the p'c data with the RMSD and HB data in one file,
so that for every sequential snapshot of the trajectory we have
p'c, RMSD and HB at the same time.
(b) Write a script (e.g., using awk) that bins the conformations
according to p'c and counts the fraction of conformations in
each p'c bin that belong to a transition pathway (i.e., p(TP|p'c)).
(c) Plot the resulting p(TP|p'c).
Fig. 2 Conformations of the Trpz1 peptide with pc>0.7 (a), pc~0.5 (b) and pc<0.2 (c) from case study 1, simulations at 450 K. Adapted from ref. 10 with permission from Elsevier. Superimposed conformations of the native
(d), the TS (e) and the denatured state (f) in case study 2. Adapted with permission from ref. 25. Copyright (2011)
American Chemical Society. In all cases the Trp side chains have been rendered as licorice sticks, while the
other side chains have been hidden for clarity. In d–f the backbone has been colored according to the prevalent
secondary structure (red extended beta-sheet, blue beta-bridge, yellow turn, gray random coil)
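The binning in step (b) can be sketched in a few lines of Python instead of awk. The sketch assumes that each snapshot already carries a committor value between 0 and 1 and a flag stating whether it lies on a transition pathway; the function name and the number of bins are illustrative choices.

```python
def p_tp_given_pc(pc_values, on_transition_path, n_bins=10):
    """Fraction of conformations in each committor bin that belong to a
    transition pathway. Empty bins are reported as None."""
    counts = [0] * n_bins
    tp_counts = [0] * n_bins
    for pc, on_tp in zip(pc_values, on_transition_path):
        # A committor of exactly 1.0 is placed in the last bin.
        bin_index = min(int(pc * n_bins), n_bins - 1)
        counts[bin_index] += 1
        if on_tp:
            tp_counts[bin_index] += 1
    return [tp / n if n else None for tp, n in zip(tp_counts, counts)]


if __name__ == "__main__":
    pc = [0.05, 0.05, 0.50, 0.55, 0.95]        # invented values
    tp = [False, False, True, False, False]    # invented flags
    print(p_tp_given_pc(pc, tp, n_bins=4))
```

The resulting list can be written to a two-column file (bin center, probability) and plotted with GNUPLOT or GRACE as in the rest of the protocol.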
Fig. 3 Committor pc (folding probability) plotted as a function of several observables for case study 1. The difference in the distance between the C atom pairs of Trp2-Trp4 and Trp9-Trp4 (Dzip) has a good correlation with
pc especially in the transition region, thus it may represent a good reaction coordinate
Fig. 4 (a) Distribution of p'c for case study 2. (b) The conditional probability to be
on a transition pathway given the value of p'c (p(TP|p'c)). Adapted with permission
from ref. 25. Copyright 2011 American Chemical Society
\[
\Phi^S(R) = \frac{\sum_{i \in NC(R)} \rho_{TSE}(i)}{\sum_{i \in NC(R)} \rho_N(i)} \qquad (2)
\]
where the sums are extended to all the native atomic contacts
NC(R) of residue R. ρN(i) and ρTSE(i) are the fractions of conformations in which contact i is formed in the native (N) and TS
ensemble, respectively. The assumption about the correspondence
of the phi-value and the structural phi-value ΦS has been later
verified by measuring in silico the folding kinetics of a model peptide and its mutants using the techniques described above [33].
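As a minimal sketch, the structural phi-value of one residue can be computed from the contact fractions as follows. The dictionaries map a contact identifier to the fraction of conformations in which that contact is formed; the identifiers and numbers are invented for illustration, and contacts missing from the TS ensemble data are treated as never formed.

```python
def structural_phi(native_fraction, tse_fraction):
    """Ratio of the summed contact fractions in the transition state
    ensemble to those in the native state, over the native contacts
    of one residue."""
    denominator = sum(native_fraction.values())
    if denominator == 0.0:
        raise ValueError("the residue has no native contacts")
    numerator = sum(tse_fraction.get(contact, 0.0) for contact in native_fraction)
    return numerator / denominator


if __name__ == "__main__":
    # Invented contact fractions for the native contacts of one residue.
    native = {"contact_1": 1.0, "contact_2": 0.5}
    tse = {"contact_1": 0.5, "contact_2": 0.25}
    print(structural_phi(native, tse))
```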
In practice, to measure ΦS along the trajectory, we need to
count the atomic contacts along the trajectory. As per the commonly
adopted definition, an atomic contact is present when the distance
between a pair of heavy atoms is lower than 6.0 Å. For the contact
calculations, we initially include all the possible pairs of heavy
atoms, with the exclusion of those involving the same residue,
nearest neighbor residues and backbone atoms. To do that we create an AWK script that reads the pdb file of the peptide and generates a WORDOM input file listing all the mentioned pairs of atoms.
The resulting WORDOM input file has the following structure:
Fig. 5 Structural phi-values ΦS(R) for the TSE identified in case study 2. Adapted
with permission from ref. 25. Copyright 2011 American Chemical Society
4 Notes
1. Experimentally, the unfolding phi-value is often preferred to
the folding phi-value as it is affected by smaller uncertainties:
\( \Phi_{unf} = \Delta\Delta G_{\ddagger-F} / \Delta\Delta G_{U-F} = \ln\left(k_{unf}^{mut}/k_{unf}^{WT}\right) / \Delta\Delta G_{U-F} \).
The two phis are linked by the following relationship: \( \Phi_f = 1 - \Phi_{unf} \).
2. In this kind of simulation, mostly unfolding transitions are
observed, although, by invoking the reversibility of Newtonian
dynamics, they have been assumed to provide information on
the folding pathway as well [5]. In general, however, such a
large change in environmental conditions may induce changes
in the free energy landscape of the protein and in the folding/
unfolding pathways. Extensive testing of the method against
experiments has shown that this does not occur often, possibly
because of the robustness of the folding/unfolding pathways
of proteins, which are the end result of a very long evolution
process.
3. Notwithstanding the conditions used for the simulations, the
limited size of the system and the simulation protocol involving
a slow heating phase prevent the observation of the liquid-vapor
phase transition of water.
4. For the choice of the right cutoff, when the native state of the
peptide is known, our suggested strategy consists in measuring
the dRMSD with respect to the native state along the trajectory
and producing a histogram of dRMSD values. This histogram
typically shows either a small peak at low dRMSD or, at least, a
shoulder. That peak is due to the other conformations of the
native state observed along the trajectory and its location in
terms of dRMSD is a good starting point for the clustering cutoff
because the native state is possibly the narrowest free energy
minimum to be observed in the simulations.
5. In the present case, clusters are portions of n-spheres in the
space of distance matrices, where n is the number of independent
elements of the distance matrix, n = N(N−1)/2, with N the
number of atoms used for the distance matrix.
6. The number of visited clusters changes in a discrete way. To
measure the VIR, we smooth the number of visited clusters by
convolving it with a narrow Gaussian kernel (width 100 ps)
and then taking the derivative.
7. In reality, for case study 2 it is possible to identify four
different states using a Markov-state-model approach; however, from the slowest relaxation in the system it is possible to
establish that the largest free energy barrier separates the denatured states from the variably folded states. Within those broad
Part III
Protein Structure Determination
Chapter 14
Comparative Modeling of Proteins
Gerald H. Lushington
Abstract
Much of the biochemistry that underlies health, medicine, and numerous biotechnology applications is
regulated by proteins, whereby the ability of proteins to effect such processes is dictated by the three-dimensional structural assembly of the proteins. Thus, a detailed understanding of biochemistry requires
not only knowledge of the constituent sequence of proteins, but also a detailed understanding of how that
sequence folds spatially. Three-dimensional analysis of protein structures is thus proving to be a critical
mode of biological and medical discovery in the early twenty-first century, providing fundamental insight
into function that produces useful biochemistry and dysfunction that leads to disease. The large number
of distinct proteins precludes rigorous laboratory characterization of the complete structural proteome,
but fortunately efficient in silico structure prediction is possible for many proteins that have not been
experimentally characterized. One technique that continues to provide accurate and efficient protein structure predictions, called comparative modeling, has become a critical tool in many biological disciplines.
The discussion herein is an updated version of a previous 2008 treatise focusing on the general philosophy
of comparative modeling methods and on specific strategies for successfully achieving reliable and accurate
models. The chapter discusses basic aspects of template selection, sequence alignment, spatial alignment,
loop and gap modeling, side chain modeling, structural refinement and validation, and provides an important new discussion on automated computational tools for protein structure prediction.
Key words Proteins, Comparative modeling, Homology, Threading, Sequence alignment, Structure
alignment, Loop modeling, Structure refinement, Structure validation
Abbreviations
AA
BSE
CASP
CATH
C
DNA
H-bond
MD
NMR
NR
PDB
Andreas Kukol (ed.), Molecular Modeling of Proteins, Methods in Molecular Biology, vol. 1215,
DOI 10.1007/978-1-4939-1465-4_14, Springer Science+Business Media New York 2015
PrPC
PrPSc
ps
PSI
RMSD
3D
Introduction
One of the wonders of life is its tremendous diversity, not only in
terms of organisms of vastly different sizes and characteristics but
even within any one single organism. The lowly bivalve, for example, is composed of different tissues that range across an incredible
breadth of color, transparency, flexibility, hardness, adhesiveness,
and electrical conductivity. Such variation is owed primarily to proteins: a class of materials composed of a modest set of ~20 distinct
amino acids (AA) building blocks that evolution has chemically
permuted into the most diverse collection of unique, naturally
occurring substances of any molecular class in existence. By varying the length and sequence of constituent AA chains, proteins can
assemble to form the fundamental matrices of materials that are
harder than stone, softer than soap, as translucent as glass, as
opaque as soot, soluble in water or grease, excellent conductors or
insulators, and among the most efficient fluorophores,
capacitors, diodes, and catalysts known to man. One of the most
important keys to rationalizing, exploiting, and refining such attributes, and seeking corrective measures when they go awry, is a
detailed understanding of the three-dimensional (3D) assembled
structure(s) of proteins, for it is in this form that proteins adopt
their unique functional properties and exert their intended influence on their surrounding environment.
Beginning with the first atomic-level resolution of a protein
structure (whale myoglobin by Kendrew et al. [1]), 3D protein
models have provided a wealth of insight into biomolecular properties and processes, inspiring a growing thirst for structural detail.
However, whereas biomolecular sequencing has become highly
amenable to efficient, high throughput characterization, the experimental resolution of the 3D protein structure remains time-consuming and frequently poses significant challenges for
traditional characterization techniques such as X-ray crystallography, with eventual success contingent on luck, persistence, ingenuity, or the application of innovative techniques [2].
Seminal early work by Anfinsen, which will be discussed later, implied,
however, that the structure of any protein was uniquely dictated by
its constituent sequence [3], which suggests that the combination
of a comprehensive catalog of protein sequences and an accurate
understanding of the relationship between sequence and structure
could provide the basis for correspondingly comprehensive understanding of the structural proteome. In this spirit, the Protein
Structure Initiative (PSI) (http://www.nigms.nih.gov/Research/
FeaturedPrograms/PSI/) was initiated in 2000 with the expressed
goal of making the three-dimensional, atomic-level structures of
most proteins easily obtainable from knowledge of their corresponding
DNA sequences. As for the human genome project, such objectives are of a scope (around 100,000 proteins, not counting posttranslational modifications and conformational variants) that
requires that conventional resolution methods such as X-ray crystallography and NMR be supplemented by efficient and analytically rigorous computational modeling techniques that perceive
and exploit relationships between primary AA sequence and the
biologically observed 3D structural manifestation of the protein.
This chapter thus offers a brief discussion of in silico protein structure prediction, focusing mainly on comparative modeling.
Whereas other papers (e.g., Baker and Šali [4], Martí-Renom et al.
[5]) provide comprehensive reviews of the underlying method
development and research achievements in the field, this chapter
discusses the motivation for using comparative modeling and outlines the practical considerations to be made in assembling such a
model. In the years since the original version of this chapter was
composed [6], computational developments have taken place that
have substantially changed the practice of computational structural
biology: many of the protocols described within the original text
have been effectively implemented as systematically automated
protocols that are often capable of producing plausible (and often
excellent) results with minimal human intervention. This revised
version recognizes this valuable service as an important contribution to the efficient acquisition of knowledge and has added a section which profiles some of the best current (ca. 2013) resources
for automated comparative modeling. However, prudent application of automated protocols still requires the confidence of understanding the underlying manipulations and computations, thus the
meat of this chapter remains a detailed discussion of how protein
structure predictions are computationally effected.
Methods
In practice, comparative modeling is best viewed not as one technique
but rather as a strategy for assembling information from various component methods (including assembly and associative techniques)
toward a unified 3D structure prediction. In general, these component steps can be approximately summarized as follows:
1. identify template proteins with structural similarity to the target as gauged (optimally) from sequence-based homology, or
from physicochemical similarity.
2. align the target sequence with all relevant template sequences
according to the same arguments of homology or physicochemical similarity employed in step 1.
3. spatially align all of the template structures into a single framework, and use the sequence alignment to project the target
protein backbone onto this framework.
4. estimate structures for target protein fragments that are ill-represented by the template manifold, or else omit them from
the predicted structure.
5. align target side chains with analogous side chains from the
template structures, or intelligently guess their disposition
according to known spatial and torsional preferences.
6. refine unphysical contacts and strains via conformational
searches, and,
7. evaluate the final relaxed model for physical tenability.
Each step above entails various methodological and strategic considerations, some of which provide opportunities for
iterative feedback to prior steps, as is shown graphically in Fig. 1.
These considerations will be elaborated upon in the remainder
of this section.
Fig. 1 Flow diagram for comparative modeling of proteins showing standard process (solid arrows) and
feedback/refinement mechanisms (dashed arrows)
3.1 Template Identification
Most programs commonly used for template identification also yield a tentative sequence alignment relative
to the target. In homologous cases with greater than 50 % target-template sequence conservation over the mutually aligned portion
of the structure, it is generally assumed that the alignment prediction algorithm will produce a qualitatively reliable alignment with
only modest local misalignments (no positional errors more than
several residues). Over a data set of broadly varying protein similarity, the PROSPECT-II assessment of threading reliability was that
the program could achieve about a 60 % average accuracy in
Conserved disulfide bonds and salt bridges are typically incorporated into the target model directly during the backbone assembly
process. Beyond this, side chain positions for highly conserved
Refinement
Validation
and validated structure model can open many doors for subsequent
analysis. In addition to valuable insight derived from simple visual
inspection, the model can form a reliable basis for many other
modeling analyses, as are discussed extensively in other chapters of
this book.
References
1. Kendrew JC, Bodo G, Dintzis HM, Parrish RG, Wyckoff H, Phillips DC (1958) A three-dimensional model of the myoglobin molecule obtained by x-ray analysis. Nature 181:662–666
2. Konermann L, Pan Y (2012) Exploring membrane protein structural features by oxidative labeling and mass spectrometry. Expert Rev Proteomics 9:497–504
3. Anfinsen CB, Redfield RR, Choate WI, Page J, Carroll WR (1965) Studies on the gross structure, cross-linkages, and terminal sequences in ribonuclease. J Biol Chem 207:201–210
4. Baker D, Šali A (2001) Protein structure prediction and structural genomics. Science 294:93–96
5. Martí-Renom MA, Stuart AC, Fiser A, Sánchez R, Melo R, Šali A (2000) Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 29:291–325
6. Lushington GH (2008) Comparative modeling of proteins. Methods Mol Biol 443:199–212
7. Dayhoff MO (1972) Atlas of protein sequence and structure. National Biomedical Research Foundation, Georgetown University, Washington DC
8. Schaeffer RD, Daggett V (2011) Protein folds and protein folding. Protein Eng Des Sel 24:11–19
9. Pearl F, Todd A, Sillitoe I, Dibley M, Redfern O, Lewis T, Bennett C, Marsden R, Grant A, Lee D, Akpor A, Maibaum M, Harrison A, Dallman T, Reeves G, Diboun I, Addou S, Lise S, Johnston C, Sillero A, Thornton J, Orengo C (2005) The CATH domain structure database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis. Nucleic Acids Res 33:D247–D251
10. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247:536–540
11. Prusiner SB (1991) Molecular biology of prion diseases. Science 252:1515–1522
12. Takagi F, Koga N, Takada S (2003) How protein thermodynamics and folding mechanisms are altered by the chaperonin cage: molecular
Chapter 15
De Novo Membrane Protein Structure Prediction
Timothy Nugent
Abstract
Recent advances in identifying residue–residue contacts from large multiple sequence alignments have
enabled impressive gains to be made in the field of protein structure prediction. In this chapter, we discuss
these advances and provide a step-by-step guide to applying the latest tools to the de novo modelling of
alpha-helical transmembrane proteins. As a practical example, we demonstrate the process of building an
accurate 3D model of a G protein-coupled receptor, correctly orientated in the membrane, using only its
primary protein sequence.
Key words Transmembrane protein, De novo modelling, Contact prediction, Structural bioinformatics
Introduction
Membrane proteins are encoded by approximately 30 % of the
genes of a typical genome and perform crucial roles in a diverse
range of essential biological processes including transport of ions
and small molecules, intercellular communication and signal transduction. They are also important drug targets, with estimates suggesting that about 60 % of current drug targets are membrane
proteins [1]. Recently, there has been encouraging progress in
structure determination led by structural genomics initiatives that
explicitly target integral membrane proteins, resulting in increasing
coverage of important protein families [2, 3]. Despite this, coverage of membrane protein fold space remains sparse as only about
1 % of structures in the Protein Data Bank (PDB) describe membrane proteins, of which about 300 are unique. However, the technical difficulties associated with purification and structure
determination by X-ray crystallography and NMR spectroscopy are
likely to prohibit a rapid increase in these numbers. Computational
structure prediction therefore provides a vital alternative approach
with which to further our understanding of both the structure and
function of this important class of proteins.
Andreas Kukol (ed.), Molecular Modeling of Proteins, Methods in Molecular Biology, vol. 1215,
DOI 10.1007/978-1-4939-1465-4_15, Springer Science+Business Media New York 2015
Methods
This chapter will describe the process of generating a 3D model of
an alpha-helical membrane protein starting with only its primary
sequence and in the absence of structural homologues. A number
of steps are required that will be discussed in detail: (1) predicting
residue–residue contacts, (2) predicting secondary structure and
transmembrane topology, (3) generating candidate 3D structures,
(4) recombining candidate structures to generate a final model, (5)
refinement, (6) orientating of the refined model in the membrane,
and (7) model quality assessment. To reproduce these steps you will
need access to a UNIX/Linux workstation with a number of software packages and databases installed. While we will focus on tools
developed in-house at UCL, many of the methods can easily be
substituted or combined with other programs. As a target sequence,
we will use the 329-residue bovine rhodopsin protein (PDB code
1GZM), a prototypical G protein-coupled receptor (GPCR).
2.1 Predicting Residue–Residue Contacts
TRATIO 0.6
MAXFRAGS 5
MAXFRAGS2 25
CONFILE 1gzmA.con
ZFILE 1gzmA.zcoord
------------------------------
Three files are referenced: the first is an alignment file, although
the format is slightly different to that used by PSICOV. Instead,
the first three lines from the 1gzmA.ess file are added to the beginning of the PSICOV alignment file. Additionally, the second line is
set to the size of the PSICOV alignment file. The file can therefore
be generated as follows:
head -3 1gzmA.ess > 1gzmA_FILM3.aln
cat 1gzmA.aln >> 1gzmA_FILM3.aln
Then simply change the second line in 1gzmA_FILM3.aln to
the alignment size, which can be determined with:
cat 1gzmA.aln | wc -l
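For readers who prefer a single script to manual editing, the steps above can be sketched in Python. This is only an illustration of the procedure described in the text, not part of FILM3; the function name is hypothetical, and it assumes the .ess file has at least three lines and that the alignment size is simply the number of lines in the PSICOV alignment file, as counted by wc -l.

```python
from pathlib import Path


def build_film3_alignment(ess_path, aln_path, out_path):
    """Prepend the first three lines of the .ess file to the PSICOV
    alignment and set the second line to the number of alignment lines.
    Returns that number."""
    ess_lines = Path(ess_path).read_text().splitlines()
    aln_lines = Path(aln_path).read_text().splitlines()
    header = ess_lines[:3]
    if len(header) < 3:
        raise ValueError("expected at least three lines in the .ess file")
    header[1] = str(len(aln_lines))  # alignment size replaces line 2
    Path(out_path).write_text("\n".join(header + aln_lines) + "\n")
    return len(aln_lines)


if __name__ == "__main__":
    size = build_film3_alignment("1gzmA.ess", "1gzmA.aln", "1gzmA_FILM3.aln")
    print("alignment size:", size)
```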
The file should then appear as follows, assuming the original
PSICOV alignment file contained 1,969 sequences:
----------1gzmA_FILM3.aln---------
1gzmA
1969
CCCCCCCChhhhCCCCCCCCCCCCCCCCCCCCChHHHHHHHHHHHH
HHHHHHHHHHHHHHHHHhhCCCCChhHH
MNGTEGPNFYVPFSNKTGVVRSPFEAPQYYLAEPWQFSMLAAYMFL
LIMLGFPINFLTLYVTVQHKKLRTPLNY
etc.
-----------------------------------
The other two files contained in 1gzmA.nfpar are the contacts
predicted by PSICOV (1gzmA.con) and the predicted Z-axis
coordinate file (1gzmA.zcoord). In the FILM3 paper, the use of
Z-coordinate distance constraints generated lower energy models
in eight cases out of 28; the use of these coordinates can therefore
be suppressed by removing or commenting out the corresponding
line. The remaining options in 1gzmA.nfpar control aspects of
the Replica Exchange Monte Carlo function, which in general can
be left at their default values. The MAXSTEPS parameter is the
total number of fragment swaps to make, POOLSIZE is the number of replica conformations to use, and INITEMP and TRATIO
indicate the starting temperature and temperature ratio between
each replica. MAXFRAGS and MAXFRAGS2 are the minimum and
all
models
found
in
models
1gzmA.con
AGVAFYIFTHQGSDFGPIFMTIPAFFAKTSAVYNPVIYIMMNKQ
FRNCMVTTLCCGKNDDE*
>P1;SEQ
sequence:SEQ:::::SEQ::0.00:0.00
MNGTEGPNFYVPFSNKTGVVRSPFEAPQYYLAEPWQFSMLAA
YMFLLIMLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVADL
FMVFGGFTTTLYTSLHGYFVFGPTGCNLEGFFATLGGEIALW
SLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACA
APPLVGWSRYIPEGMQCSCGIDYYTPHEETNNESFVIYMFVV
HFIIPLIVIFFCYGQLVFTVKEAAAQQQESATTQKAEKEV
TRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAF
FAKTSAVYNPVIYIMMNKQFRNCMVTTLCCGKNDDE*
-----------------------------------
In this file, the name of the structure in the first and second lines
must match the name of the final recombined model, in this case
1gzmA_recomb. The MODELLER Python script is as follows:
----------1gzmA_refine.py---------
from modeller import *
from modeller.automodel import *

log.verbose()
env = environ()

class MyModel(automodel):
    def special_restraints(self, aln):
        rsr = self.restraints
        at = self.atoms
        # Alpha-helical restraints, one per predicted transmembrane helix
        rsr.add(secondary_structure.alpha(self.residue_range(39,63)))
        rsr.add(secondary_structure.alpha(self.residue_range(73,96)))
        rsr.add(secondary_structure.alpha(self.residue_range(109,133)))
        rsr.add(secondary_structure.alpha(self.residue_range(154,173)))
        rsr.add(secondary_structure.alpha(self.residue_range(203,224)))
        rsr.add(secondary_structure.alpha(self.residue_range(252,274)))
        rsr.add(secondary_structure.alpha(self.residue_range(287,309)))

a = MyModel(env, alnfile = '1gzmA_refine.ali',
            knowns = '1gzmA_recomb', sequence = '1gzmA_recomb')
a.starting_model = 1
a.ending_model = 1
a.md_level = refine.slow   # refinement level used for the MD step
a.make()
----------------------------------
Here, the residue ranges to which alpha-helical secondary
structure restraints are applied, according to MEMSAT-SVM
transmembrane helix boundary predictions, can be added using
the rsr.add command. The alignment file and model name must be
referenced accordingly on the following line. The actual refinement step is initiated by the refine.slow command, which uses
molecular dynamics with simulated annealing [43]. Finally, run the
script using Python:
python 1gzmA_refine.py
This will generate the final recombined and refined model
(Fig. 1).
2.6 Orientation of the Model in the Membrane
Fig. 2 The model orientated in the membrane. The blue plane indicates the
membrane inner leaflet; the red plane is the membrane outer leaflet
1gzmA_refined_EMBED.
Fig. 4 Transmembrane helices are coloured in orange. It is clear that all seven
helices lie within the membrane plane
The example used here, bovine rhodopsin, represents a good target, consisting of a single transmembrane domain with enough
aligned sequences to produce a reliable model. However, other
Conclusions
This chapter should provide a useful introduction to de novo 3D
modelling of alpha-helical membrane proteins using predicted contacts. Presented here are a number of powerful tools that, in combination, are capable of generating accurate models of large
transmembrane protein domains. Such models should be particularly useful for directing experimental studies on families where
structural data is unavailable. It is clear that the use of contacts
predicted by methods such as PSICOV provides extremely powerful constraints for de novo modelling, and it is likely that this strategy will become applicable to even more protein families as
sequence databases continue to grow. We estimate that the PFAM
database [51] contains more than 500 single architecture transmembrane domains with >400 aligned sequences (enough to
accurately predict contacts using methods such as PSICOV, plmDCA, or PconsC) but no experimentally determined 3D structure. Applying FILM3 to these families has the potential to
significantly expand our knowledge of transmembrane fold space,
and it is likely that many of these families will be of significant biomedical and pharmacological interest.
Notes
1. The TM-score is intended to be a more accurate measure of
structural similarity than RMSD or GDT. Scores are in
the range (0, 1], with 1 indicating a perfect match between
two structures. Scores below 0.20 typically correspond to randomly chosen unrelated proteins, while scores >0.5 generally indicate
the same fold [49].
2. PSICOV can be downloaded from http://bioinfadmin.cs.ucl.ac.uk/downloads/PSICOV/. Follow the included compilation instructions to build the PSICOV binary.
3. HHblits binaries, source code, and accompanying databases
can be downloaded from http://toolkit.genzentrum.lmu.de/hhblits/.
4. HMMER binaries and source code can be downloaded from
http://hmmer.janelia.org/.
5. PSIPRED can be downloaded from http://bioinfadmin.cs.ucl.ac.uk/downloads/psipred/. The NCBI toolkit (ftp://ftp.ncbi.nih.gov)
and PSI-BLAST (ftp://ftp.ncbi.nih.gov/blast) are also required. Configure the PSIPRED script by
adding the NCBI binary directory and database paths.
Follow the included compilation instructions to build the
PSIPRED binary.
6. Download MEMSAT-SVM from http://bioinfadmin.cs.ucl.ac.uk/downloads/memsat-svm/, configuring it in exactly the
same way as PSIPRED.
7. FILM3 can be downloaded from http://bioinfadmin.cs.ucl.ac.uk/downloads/FILM3/. Compile the three programs as
per the instructions.
8. ProFit can be downloaded from http://www.bioinf.org.uk/software/profit/.
9. Download MODELLER from http://salilab.org/modeller/.
You will need to register to receive the license key required to
run it.
10. MEMEMBED can be downloaded from http://bioinf.cs.ucl.ac.uk/downloads/memembed/. Follow the included compilation instructions to build the MEMEMBED binary.
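To make the TM-score ranges quoted in Note 1 more concrete, the sketch below evaluates the standard TM-score sum for one fixed superposition. It is a simplified illustration: the TM-score program itself searches for the superposition that maximizes this value, which is not attempted here, and the 0.5 Å floor on d0 is an assumption added to keep the scale positive for short chains. The function name tm_score is hypothetical.

```python
def tm_score(distances, target_length):
    """TM-score of one fixed superposition.

    distances: distances (in Angstrom) between aligned residue pairs.
    target_length: number of residues in the target (reference) structure.
    """
    if target_length <= 15:
        raise ValueError("the d0 formula used here needs target_length > 15")
    # Length-dependent distance scale; floor of 0.5 A is an assumption.
    d0 = max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


# Identical structures give 1; large deviations push the score towards 0.
print(tm_score([0.0] * 100, 100))
print(tm_score([20.0] * 100, 100))
```

Because the sum is normalized by the target length rather than by the number of aligned pairs, unaligned residues lower the score.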
References
1. Hopkins AL, Groom CR (2002) The druggable genome. Nat Rev Drug Discov 1:727-730
2. Kloppmann E, Punta M, Rost B (2012) Structural genomics plucks high-hanging membrane proteins. Curr Opin Struct Biol 22:326-332
Chapter 16
NMR-Based Modeling and Refinement
of Protein 3D Structures
Wim F. Vranken, Geerten W. Vuister,
and Alexandre M.J.J. Bonvin
Abstract
NMR is a well-established method to characterize the structure and dynamics of biomolecules in solution.
High-quality structures can now be produced thanks to both experimental advances and computational
developments that incorporate new NMR parameters and improved protocols and force fields in the structure calculation and refinement process. In this chapter, we give a short overview of the various types of
NMR data that can provide structural information, and then focus on the structure calculation methodology itself. We discuss and illustrate with tutorial examples classical structure calculation, refinement, and
structure validation approaches.
Key words NMR, Structure calculation, Structure refinement, Structure validation
Introduction
The first step of a structure determination by NMR spectroscopy
consists of the acquisition of NMR data, typically using heteronuclear multidimensional experiments, that allow the assignment
of all atoms/spins of a molecule (1H, 15N, 13C) to their chemical
shift values (Fig. 1). Once this chemical shift assignment step is
completed, 13C- and 15N-edited 3D NOESY spectra are generally
used to obtain inter-atomic distances from nuclear Overhauser
effects (NOE). These NOESY spectra provide the most detailed
structural information that can be obtained from NMR and are
still the most common core data used to define the 3D structure of
the protein [1, 2]. In addition to distance information, other
parameters, such as J-couplings [3], residual dipolar couplings
(RDCs) [4], paramagnetic relaxation enhancements (PRE) and
pseudo-contact shifts [5] can be measured, providing additional
information to define the protein structure. Recent developments
have enabled the calculation of the structures of relatively small
proteins (less than ~12 kDa) from chemical shift values alone [6-8],
a procedure that will be briefly outlined in this chapter.
Fig. 1 Schematic overview of the structure calculation process in NMR. This chapter deals with the boxes in
black, with indication of the software covered
Andreas Kukol (ed.), Molecular Modeling of Proteins, Methods in Molecular Biology, vol. 1215,
DOI 10.1007/978-1-4939-1465-4_16, Springer Science+Business Media New York 2015
The experimental NMR parameters are converted to in silico
restraints, and 3D structures are generated from restrained molecular dynamics simulations, usually following some form of molecular
dynamics simulated annealing scheme (MD/SA) [9]. Multiple
structures are calculated in this way, starting from the same experimental data but different random starting conditions. Provided
that enough data of sufficient quality are available, the structures
will converge onto the same overall fold. These structures are nowadays often further refined in explicit solvent (water), which has
been shown to significantly improve their quality [10, 11]. Finally,
the structures that best satisfy as many experimental restraints as
possible, together with proper general chemical properties of proteins (such as bond lengths and angles), are then selected to form
an ensemble of structures that represents the definitive solution of
the structure calculation process.
In this chapter we will discuss the classical NMR structure
calculation, refinement, and validation methods, with some reference to new chemical shift-based approaches. These will be illustrated with tutorial examples making use of the programs CYANA
[12] and CNS [13, 14] using the RECOORD [11] approach for
water refinement [10], and the CS-ROSETTA protocol [7] based
on chemical shift data. These are followed by a description of
structure validation with the program CING [15].
Theory
This section gives a very brief overview of the NMR data relevant
in the determination of 3D protein structures.
2.1.2 NOEs
Classical protein structure determination by NMR relies on obtaining a dense network of distance restraints derived from nuclear
Overhauser effects (NOEs) between nearby hydrogen atoms in a
protein [1, 2]. Together, these restraints provide the essential
information for defining the tertiary structure of a protein.
The NOE originates from cross-relaxation between dipolar
coupled spins as a result of through-space spin-spin interactions
that result in the transfer of magnetization from one spin to
another. The NOE approximately scales with the distance r between
the two spins as 1/r^6. Because of this 1/r^6 dependency, NOEs are
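The 1/r^6 scaling of the NOE can be turned into a rough distance estimate by calibrating against a proton pair of known separation. The sketch below assumes the isolated spin-pair approximation and a single reference pair; the function name noe_distance and the numerical values are illustrative only and do not come from any particular program.

```python
def noe_distance(intensity, ref_intensity, ref_distance):
    """Estimate a distance from an NOE intensity using I ~ 1/r^6.

    ref_intensity and ref_distance describe a calibration pair whose
    separation is known (the value used below is purely illustrative).
    """
    if intensity <= 0 or ref_intensity <= 0:
        raise ValueError("intensities must be positive")
    return ref_distance * (ref_intensity / intensity) ** (1.0 / 6.0)


# A peak 64 times weaker than the reference corresponds to twice the distance.
print(noe_distance(intensity=1.0, ref_intensity=64.0, ref_distance=2.5))
```

The sixth-root dependence means that even large errors in peak intensity translate into comparatively small errors in distance, which is one reason NOE-derived distances are normally used as upper bounds or broad ranges rather than exact values.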
atoms that are not at a fixed distance from each other, there is also
a distance dependence and hence usually only RDCs measured for
inter-nuclear vectors with a fixed distance are used in the structure
calculations. Residual dipolar couplings can be added as orientational restraints to the target function of the structure calculation
algorithm [34].
2.1.6 Diffusion Anisotropy
2.1.7 Paramagnetic Relaxation Effects
If a paramagnetic metal ion is present in a protein, or if it is introduced via, for example, a chelating agent chemically bound to the
protein, the NMR signals of the nuclei in a shell around it will be
affected [37] by several effects, including contact and pseudocontact shifts, relaxation rate enhancements, and cross-correlation
effects. Analogously to RDCs and diffusion anisotropy, these effects,
depending on their type, can provide both distance and orientation
information, which can be converted into restraints for use in
various structure calculation programs [38, 39].
2.2 Structure Calculation Software
The experimental information sources discussed above can be converted into restraints that can be used in the structure calculation
process. Several computer programs have provisions for using the
experimental NMR restraints; the most commonly used ones are
CNS [13, 14], Xplor-NIH [40] and CYANA [12, 41], although
many others are available, e.g. SCULPTOR [42], the SANDER
module of AMBER [43], GROMACS [44] and YASARA [45].
Structure calculations in essence transform the experimental
data (as restraints) into in silico atomic coordinate information. The
calculations are usually based on a molecular dynamics simulated annealing protocol performed in torsion angle and/or
Cartesian space, followed by a final refinement phase in explicit solvent (water). A general feature of all these protocols is the use of
a target function: lower values of this function for a calculated
structure indicate better agreement with the experimental data and
with known molecular information. This molecular information is
defined by a force field that contains physical energy terms for interactions such as van der Waals interactions and electrostatics, as well
as terms describing the molecular geometry such as bond lengths,
bond angles, etc. During the initial stages of a structure calculation
Typically a large pool of structures is generated during the structure calculation process, from which a final ensemble of best
structures is then selected. This choice of an ensemble of structures, rather than one single one, reflects the uncertainty in the
experimental NMR data: often structures that agree with the
experimental data equally well but differ locally (such as in loop
regions) can be obtained. The most widely used structure selection
procedure is based on the agreement with the experimental data
(rather arbitrarily defined as a small number of restraint violations)
and a low (overall) energy of the structures. Typically ensembles
containing the 20 lowest energy models are selected, although this
number is arbitrary. Ideally, the selected ensemble should represent
the available conformational space accessible to the structure while
simultaneously satisfying the experimental restraints. From this
ensemble, a representative structure is usually defined; no real consensus exists, however, on how it should be selected. The wwPDB
NMR validation taskforce recommends selecting the structure that
differs the least from all other structures within the ensemble, i.e.
the medoid (the structure with the lowest atom coordinate RMSD
from all other structures) [46].
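The selection procedure described above can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure of any specific program: each model is assumed to be summarized by a total energy and a count of restraint violations, the pairwise RMSD matrix is assumed to have been computed elsewhere, and the function names are hypothetical.

```python
def select_ensemble(models, max_violations, size=20):
    """Keep models with few restraint violations, then take the lowest energies.

    models: list of dicts with keys 'name', 'energy' and 'violations'.
    """
    accepted = [m for m in models if m["violations"] <= max_violations]
    return sorted(accepted, key=lambda m: m["energy"])[:size]


def medoid_index(rmsd):
    """Index of the structure with the lowest summed RMSD to all others.

    rmsd: square, symmetric matrix (list of lists) of pairwise RMSD values.
    """
    totals = [sum(row) for row in rmsd]
    return min(range(len(rmsd)), key=totals.__getitem__)


models = [
    {"name": "m1", "energy": -120.0, "violations": 0},
    {"name": "m2", "energy": -150.0, "violations": 5},
    {"name": "m3", "energy": -140.0, "violations": 1},
    {"name": "m4", "energy": -100.0, "violations": 0},
]
ensemble = select_ensemble(models, max_violations=2, size=2)
print([m["name"] for m in ensemble])

pairwise_rmsd = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
print(medoid_index(pairwise_rmsd))
```

Both the violation cutoff and the ensemble size are free parameters here, which mirrors the arbitrariness of these choices noted in the text.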
The final ensemble is subsequently subjected to structure validation procedures in order to verify its quality. It is useful to distinguish the well-defined from the ill-defined regions of the ensemble
during this process: these can be defined with, for example,
CYRANGE [47], FindCore [48] or circular variance methods
[49]. In practice, the quality indicators that are most commonly
used to assess especially the well-defined regions of an NMR
ensemble are [50]:
Table 1
Internet resources of NMR-related programs and databases mentioned in this chapter
Software          Internet address
CNS               http://cns.csb.yale.edu/v1.3
RECOORD           http://www.ebi.ac.uk/pdbeapps/nmr/recoord/
PDBe              http://www.pdbe.org/
BMRB              http://www.bmrb.wisc.edu
CCPN              http://www.ccpn.ac.uk
CING              http://nmr.cmbi.ru.nl/icing/
PSVS              http://psvs-1_3.nesg.org/
TALOS+            http://spin.niddk.nih.gov/bax/nmrserver/talos/
CYANA             http://www.cyana.org
CS-ROSETTA        http://haddock.science.uu.nl/enmr/services/CS-ROSETTA3
TALOS+            http://haddock.science.uu.nl/enmr/services/TALOS
CYANA             http://www.enmr.eu/webportal/cyana.html
FormatConverter   http://haddock.chem.uu.nl/enmr/format-converter.html
Methods
In this tutorial section we describe the procedures to generate various types of NMR restraints and how they can then be used in
structure calculations using CYANA or CNS with the RECOORD
scripts. The CS-ROSETTA chemical shift-only approach is also
discussed, followed by a description of structure validation using
the CING webserver. A classical, NOE distance-based structure
formatConverter
2. Under the Project menu option, click New and enter a name
for your project; you will not be able to use the FormatConverter
until you have done this.
3. Under the Import menu option, go to Single files, select the
type of data you want to import, then the name of the software
the data comes from. A window will pop up, click on the Select
file button, navigate to the file, and click Select. Click
IMPORT to start the importing process.
4. Depending on the type of data, you might get popups that ask
you to validate information or provide additional input. Press
on the ? button in these popups for additional information.
5. After successful import, repeat the process from step 3. Due to
the way the CCPN framework stores information, you will
need to import both the sequence of your molecule and experimental information to create a complete CCPN project. At
this point you will get a prompt to initiate the linkResonances
process. Click Yes when asked, and Yes again to perform this
3. Under the Prediction Options tab, you can deselect the Apply
Offset Correction tick mark if you are sure your chemical
shifts are correctly referenced (see Note 2).
4. Under the Submission Details tab, enter your email address
(twice) and click Submit to start TALOS+. The screen will
change, and after a few minutes you will receive an email from
the NMR Server Agent with details on where to retrieve the
results.
5. Create a directory where you want to store the results, open
the email, and save all six files to this directory (detailed information on these files is available from the TALOS+ server).
6. In order to convert the TALOS+ predictions to accurate dihedral angle restraints, they have to be manually inspected.
Connect to the jRAMA+ online viewer: http://spin.niddk.nih.gov/bax/software/TALOS+/JRAMA+.
The first time you access this page you have to explicitly state
that you trust this server; click the tick mark box followed by Run.
7. Click on the TALOS+ image in the top left corner; a pop-up
will appear. In this popup, click on the File menu option (top
left) and select Open Prediction Files. In the file window that
appears, navigate to the directory you created to store the
TALOS+ results, and select the pred.tab file. A set of other
windows will now appear; the RCI-S2 and Secondary Structure
Plot gives an overview of the chemical shift predicted backbone
dynamics (from RCI [58]) and secondary structure for your
protein.
8. Go to the window that contains your protein sequence with a
color-coded box for each amino acid. The color coding indicates the following:
(a) Green: the prediction is Good for this residue and can
be used to create a dihedral angle restraint.
(b) Yellow: the prediction is Ambiguous and should be
manually inspected before use.
(c) Red: the prediction is Bad and should not be used.
(d) Blue: the residue is highly Dynamic and a prediction is
not possible.
If you click on a box, the other windows will update to show
detailed information about the prediction for this residue, and
you can examine the φ/ψ distributions of the detected matches
and override the TALOS+ decisions on which residues should
be included in the prediction and which ones are outliers. You
can change the prediction status of this residue by selecting a box
at the bottom of the main window; only residues with
Good status will be written to the dihedral restraint file, as
discussed in the next point.
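The rule that only residues with Good status contribute restraints can be sketched as follows. Several things here are assumptions rather than TALOS+ behaviour: the predictions are taken as already-parsed records (the pred.tab layout is not reproduced), the status strings follow the labels used in the list above, the restraint width is set to twice the predicted spread with a 20 degree minimum, and the output uses the four-selection "assign" form of CNS dihedral restraints (energy constant, target angle, range, exponent). The function name is hypothetical.

```python
def cns_dihedral_restraints(predictions, min_width=20.0, energy_constant=1.0):
    """Build CNS-style phi/psi restraint lines from accepted predictions.

    predictions: list of dicts with keys resid, phi, psi, dphi, dpsi, status.
    Only records whose status is "Good" are converted.
    """
    lines = []
    for p in predictions:
        if p["status"] != "Good":
            continue
        i = p["resid"]
        # Assumed rule: range = 2 x predicted spread, at least min_width degrees.
        phi_width = max(min_width, 2.0 * p["dphi"])
        psi_width = max(min_width, 2.0 * p["dpsi"])
        # phi(i) is defined by C(i-1), N(i), CA(i), C(i)
        lines.append(
            f"assign (resid {i - 1} and name C) (resid {i} and name N) "
            f"(resid {i} and name CA) (resid {i} and name C) "
            f"{energy_constant:.1f} {p['phi']:.1f} {phi_width:.1f} 2"
        )
        # psi(i) is defined by N(i), CA(i), C(i), N(i+1)
        lines.append(
            f"assign (resid {i} and name N) (resid {i} and name CA) "
            f"(resid {i} and name C) (resid {i + 1} and name N) "
            f"{energy_constant:.1f} {p['psi']:.1f} {psi_width:.1f} 2"
        )
    return lines


example = [
    {"resid": 10, "phi": -63.0, "psi": -42.0, "dphi": 5.0, "dpsi": 12.0,
     "status": "Good"},
    {"resid": 11, "phi": -90.0, "psi": 120.0, "dphi": 30.0, "dpsi": 40.0,
     "status": "Ambiguous"},
]
for line in cns_dihedral_restraints(example):
    print(line)
```

Residue 11 is skipped because its status is not Good; after manual inspection in the viewer its status could be changed and the record would then be converted as well.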
Fig. 2 Example of TALOS output from the WeNMR server based on the chemical shifts for entry 2lci. (The server
is accessible via the NMR ->Chemical Shifts menu of www.wenmr.eu)
Fig. 3 CS-ROSETTA web form on the WeNMR server. (The server is accessible via the NMR -> Structure
Calculation menu of www.wenmr.eu)
Note that you can adjust the RMSD range to only include
well-defined regions of the protein structure (if this information is known).
(f) A CYANA file with information for the structure calculation run (AUTO.cya) containing the lines:
peaks       := peaks1.xpk,peaks2.xpk
prot        := shifts.prot
constraints := dihedrals.aco
tolerance   := 0.030,0.040,0.25
structures  := 100,20
steps       := 10000
randomseed  := 34983434
The tolerances determine which atoms (based on their individual chemical shift values) are assigned to the NOESY peaks;
wider tolerances will result in CYANA detecting more possible
assignments for the NOESY peaks. If the tolerances are too narrow, fewer assignment possibilities will be found and correct
assignments might be missed. The above protocol assumes that
the NOESY peaks are not yet assigned; additional protocols are
available from the CYANA Web site and other resources.
2. Start CYANA with the command:
cyana
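The effect of the tolerance setting discussed in step 1 can be illustrated with a small matching routine. This is a conceptual sketch only and does not reproduce how CYANA performs its automated assignment; the atom labels and chemical shift values are invented for the example.

```python
def candidate_assignments(peak_shift, shifts, tolerance):
    """Return atoms whose chemical shift lies within tolerance of a peak position.

    shifts: dict mapping an atom label to its assigned chemical shift (ppm).
    """
    return [atom for atom, value in shifts.items()
            if abs(value - peak_shift) <= tolerance]


# Invented example values (ppm).
proton_shifts = {"HA 12": 4.21, "HA 30": 4.235, "HB2 7": 4.30}

print(candidate_assignments(4.22, proton_shifts, tolerance=0.005))
print(candidate_assignments(4.22, proton_shifts, tolerance=0.030))
print(candidate_assignments(4.22, proton_shifts, tolerance=0.100))
```

With the narrowest tolerance no candidate is found, so a correct assignment could be missed, while the widest tolerance returns every atom and leaves more ambiguity to be resolved during the calculation.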
For the structure calculation part we are going to describe the use
of the program CNS [13, 14] with a simulated annealing protocol
derived from ARIA [64] followed by refinement in explicit solvent
[10]. All the scripts mentioned in this section can be downloaded
from the RECOORD [11] Webpage (see Table 1).
1. Download: Create a folder where you will run the calculations, download the tar file containing the RECOORD
scripts into it, and decompress it:
mkdir struct-calc
cd struct-calc/
wget http://www.ebi.ac.uk/pdbeapps/nmr/data/recoord/RECOORDscripts-cns1.3.tgz
tar xzfv RECOORDscripts-cns1.3.tgz
In case the wget command does not work, use a Web browser
to download the scripts manually from the RECOORD webpage (see Table 1). This tutorial uses CNS version 1.3.
2. Initialise: Before starting the calculations, you need to set up
your current path for the scripts to work. To do this,
edit the file changeScriptsDir.sh located
in RECOORDscripts-cns1.3, replace the directory path for
newDir in line 8 with your current path, and execute the script:
./changeScriptsDir.sh
3. Get data: Make sure you have a working version of CNS set
up; the last step is then to create a working directory and
assign a project name for the protein you are working on.
This project name will be used to generate the file names at the
different stages of the protocol. We will use as example the
OR36 data for the PDB 2LCI structure with the corresponding NMR restraints available for this entry from the
BioMagResBank (BMRB) [65]. First download and rename
the PDB structure file:
mkdir 2lci
cd 2lci
wget -O pdb2lci.ent.gz http://www.ebi.ac.uk/pdbe-srv/view/files/2lci.ent.gz
gunzip pdb2lci.ent.gz
mv pdb2lci.ent 2lci.pdb
Then get the NMR restraints from this link:
http://www.bmrb.wisc.edu/servlets/MRGridServlet?pdb_id=2lci&min_items=0&block_text_type=3-converted-DOCR
In the result table, click on the number in the distance row
under the XPLOR/CNS column, then click on the link in
the mrblock_id column. Copy and paste these restraints in a
text file called unambig.tbl in the 2lci/ directory (see Note 6).
Alternatively you can export CNS restraints from CCPN projects (see Note 7) or use the data for OR36 directly from the
restraints/cns/ directory of the example project.
4. Generation of molecular topology files: We can generate the
molecular topology either from the primary sequence or from
a PDB coordinate file, depending on availability (see Note 8).
We will use here the downloaded PDB file:
../RECOORDscripts-cns1.3/generate.sh 2lci.pdb
../RECOORDscripts-cns1.3/generate_extended.sh 2lci_cns.mtf
The extended structure is in the file 2lci_cns_extended.pdb.
You should check the ERRORS_generate_extended
file for errors, and again check the generated file in your favourite molecule viewer.
6. Simulated annealing stage: For the structure calculation
itself, we can use three different types of restraints (if available): unambig.tbl (NOE distance restraints), hbonds.tbl
(hydrogen bond restraints) and dihedrals.tbl (dihedral angle restraints). The script annealing.sh will generate a CNS parameter file (run.cns) with all details and
specifications for the structure calculation protocol, and will
start the calculation. This script should be run from a higher
directory level than the previous two:
cd ..
RECOORDscripts-cns1.3/annealing.sh 2lci
Individual job files will be generated and executed for each
model you want to calculate. By default two models will be
generated in the created str/ folder, with names similar to
2lci_cns_[1-2].pdb. The CNS input and output files can
be found in the directory cnsRef/, together with possible
error files (see Note 9). The header of every PDB file generated
contains information about violations and energy values.
7. Water refinement stage: Once the simulated annealing phase
is finished and all resulting structures have been written into
the str/ directory, we can proceed to water refinement:
RECOORDscripts-cns1.3/re_h2o.sh 2lci
In the str/ directory, a new directory called wt/ will be created, the best energy structures will be copied there and subsequently refined (see Note 10).
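Before starting the simulated annealing stage (step 6), it can be worth confirming that the restraints pasted into unambig.tbl in step 3 were copied completely. The following is a minimal sketch that counts restraint statements; it assumes each restraint starts on its own line with "assign" (or an abbreviation beginning with "assi") followed by a selection in parentheses, and that "!" starts a comment. The function name and the sample restraints are invented for illustration.

```python
import re

# A restraint statement: "assign"/"assi..." followed by an opening parenthesis.
ASSIGN = re.compile(r"^\s*assi\w*\s*\(", re.IGNORECASE)


def count_restraints(text):
    """Count lines that start a CNS-style restraint statement."""
    count = 0
    for line in text.splitlines():
        statement = line.split("!", 1)[0]  # drop anything after a "!" comment
        if ASSIGN.match(statement):
            count += 1
    return count


sample = """! two invented example restraints
assign (resid 3 and name HA) (resid 4 and name HN) 2.5 0.7 0.7
ASSI (resid 5 and name HB1) (resid 9 and name HG2) 4.0 2.2 1.0 ! weak
"""
print(count_restraints(sample))
```

The count can then be compared with the number of restraints reported in the BMRB result table for the same block.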
3.6 Structure Validation and Quality Assessment
Fig. 4 iCing server and results. (a) iCing server upload page. Selection area is indicated by the red box. (b)
Summary page for entry 2lci. (c) Residue-specific page for Leu14 of entry 2lci. The residue has a red ROG
score as a result of poor sidechain χ1/χ2 dihedral angle conformation (Janin plot not visible). Many related elements (previous and next residues, restraints, chemical shifts, etc.) can directly be accessed through the links
on this page. (d) DihedralByResidue plots. A quick overview is obtained by scrolling down. Individual residue
pages can be accessed through the links
data involving the featured residue, as well as the critiques defining its ROG scores. Thus, a residue page presents a comprehensive account of all relevant information
pertaining to a specific residue. Hyperlinks connect the page to
all other relevant pages; e.g. the previous and next residue in
the chain, but also all residues connected through restraints or
all its atoms and their chemical shifts. The analysis of the conformation of residues and identification of potential problems
can also conveniently be done using the Dihedral plots
per residue page (Fig. 4d), which displays the relevant
Ramachandran, Janin and D1D2 plots of all residues sequentially, in one scrollable interface from which the relevant residue can also be selected.
6. Other information: It is useful to display the residue-specific
ROG scores mapped onto a structure of the molecule. For
this, macros for the structure visualization programs JMOL
[69], YASARA [45], PyMol (The PyMOL Molecular Graphics
System, Schrödinger, LLC.) and MOLMOL [70] are provided. They can be accessed by following the Programs
flat- > Macros link from the home page. Closely clustered
red/orange residues in the structure are highly suspect and
warrant further investigation.
The CING results are based, in part, upon the results of the
programs PROCHECK_NMR [68] and WHATIF [67]. Direct
links to the output of these programs are also provided.
Notes
1. To use the FormatConverter, install the latest stable version of
the CCPN software, which can be downloaded from http://www.ccpn.ac.uk/downloads/stable. A less versatile but easier
to use alternative is the FormatConverter web version at
http://haddock.chem.uu.nl/enmr/format-converter.html.
Further information on the FormatConverter is available from
http://www.ccpn.ac.uk/software/fcfolder, as well as a detailed
tutorial at http://www.ccpn.ac.uk/software/tutorials/intro.
Examples of full CCPN projects that can be used for structure
calculation are available from the RECOORD project at
http://www.ebi.ac.uk/pdbe-apps/nmr/recoord/ by clicking
the CCPN projects link near the bottom of the page.
2. Chemical shifts are calculated from absolute frequencies relative
to a reference frequency; this way the positions of NMR resonances can be expressed independently of the magnetic field
strength. If this reference frequency is not correctly set all chemical shift values calculated with it will be equally offset from their
true value. There are a host of programs available to check if
Part IV
Protein-Ligand Interactions
Chapter 17
Methods for Predicting Protein-Ligand Binding Sites
Zhong-Ru Xie and Ming-Jing Hwang
Abstract
Ligand binding is required for many proteins to function properly. A large number of bioinformatics tools
have been developed to predict ligand binding sites as a first step in understanding a protein's function or
to facilitate docking computations in virtual screening based drug design. The prediction usually requires
only the three-dimensional structure (experimentally determined or computationally modeled) of the
target protein to be searched for ligand binding site(s), and Web servers have been built, allowing the free
and simple use of prediction tools. In this chapter, we review the underlying concepts of the methods used
by various tools, and discuss their different features and the related issues of ligand binding site prediction.
Some cautionary notes about the use of these tools are also provided.
Key words Structural bioinformatics, Protein-ligand interaction, Protein surface grid, Molecular probe,
Surface pocket and cavity, Ligand binding site prediction, Bioinformatics software and servers
Introduction
Interaction with a ligand molecule is essential for many proteins to
carry out their biological function. This interaction is generally
specific, not only in terms of the molecules involved in the interaction, but also in the location (i.e., the site of ligand binding) in
which the interaction takes place. In order to gain knowledge
about the interaction and, by extension, the protein's function
and how to influence its activity by, for example, designing small
molecule drugs, considerable efforts have been made to develop
methods that can predict ligand binding sites (LBSs) of proteins
computationally, and a very large number of bioinformatics tools
are now available for LBS prediction (reviewed in [1-4]). In general,
because of the location specificity of LBSs, most of these methods
have exploited one or more of four types of properties (evolutionary, geometric, energetic, and statistical) in order to distinguish
the binding site from other parts of the protein surface. In this
review, we will survey the many LBS prediction methods and
classify them on the basis of the site-distinguishing properties they
Andreas Kukol (ed.), Molecular Modeling of Proteins, Methods in Molecular Biology, vol. 1215,
DOI 10.1007/978-1-4939-1465-4_17, Springer Science+Business Media New York 2015
Methods
2.1 Template-Based Methods
Reference
[8, 86]
[9, 87-89]
[10]
[11, 90]
[21, 28]
[27]
Firestar
I-TASSER
Lee's method
IntFOLD
ProBis
Im's method
[30]
[31]
PocketPicker
VICE
[7, 79]
FINDSITE
I. Template-based methods
3DLigandSite
[6, 16]
Method1
Table 1
List of LBS prediction methods
http://projects.biotec.tu-dresden.de/pocket/
http://www.reading.ac.uk/bioinf/IntFOLD/
http://probis.cmm.ki.si
http://firedb.bioinfo.cnio.es/Php/FireStar.php
http://zhanglab.ccmb.med.umich.edu/I-TASSER/
http://www.sbg.bio.ic.ac.uk/~3dligandsite/
http://www.cssb.biology.gatech.edu/findsite
Web server2
Yes
Yes
Yes
Yes
Structure
viewer3
(continued)
Note
[35]
[36]
[37-39]
[40]
[41, 42]
[93]
[94]
[95]
SCREEN
POCASA
CASTp
MSPocket
fpocket
PocketDepth
DEPTH
DoGSiteScorer
Reference
Method1
Table 1
(continued)
http://scbx.mssm.edu/sitehound/sitehound-web/Input.html
http://dogsite.zbh.uni-hamburg.de/
http://bioserv.rpbs.univ-paris-diderot.fr/cgi-bin/fpocket
http://proline.physics.iisc.ernet.in/pocketdepth/
http://mspc.bii.a-star.edu.sg/tankp/run_depth.html
http://altair.sci.hokudai.ac.jp/g6/service/pocasa/
http://sts.bioengr.uic.edu/castp/
http://bhapp.c2b2.columbia.edu/screen2/cgi-bin/screen2.cgi
Web server2
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Structure
viewer3
Note
[57, 58]
http://lise.ibms.sinica.edu.tw
http://opus.bch.ed.ac.uk/stp/
http://ftsite.bu.edu
[62]
[65]
[67]
MetaPocket 2.0
MEDock
Thornton's method
http://projects.biotec.tu-dresden.de/metapocket/
LISE
see Note 7
see Note 8
3
see Note 9
Yes
[50]
[49]
Yes
Yes
Morita's method
FTSite
http://www.modelling.leeds.ac.uk/qsitefinder/
[48]
Structure
viewer3
Q-SiteFinder
Web server2
Reference
Method1
Note
The main task of geometry-based methods is to identify, by computing some types of geometric measures, pocket(s) on the protein
surface that can accommodate small ligand molecules. Statistical
studies made on protein-ligand complex structures archived in PDB
indicate that small molecule ligands tend to bind at deflated regions
of the protein surface, in particular, its largest (and/or deepest)
cavities [29]. Consequently, most geometry-based methods have
focused on identifying the largest pockets in proteins. However, how
to determine and identify cavities on the protein surface is a more
complicated problem than it might appear at first sight, and, over the
years, many diverse and creative approaches have been explored.
The first step in many LBS prediction methods in this category
is to find empty space on the protein surface, and one popular
approach is to spray grid points on the target protein and find
empty grids (those not occupied by protein atoms) [11, 29-32].
For example, LIGSITEcsc [29] scans grids in seven directions (x, y, z,
and four cubic diagonals) to identify surface-empty-surface
connections, then clusters the empty grids from these to identify
empty spaces for potential ligand binding. Another approach is to
place empty spheres on the protein surface [33, 34]. For example,
to find large empty spaces, SURFNET [34] places empty spheres
between every pair of protein atoms that have no intervening
protein atoms. One variation of the empty sphere approach involves
rolling two spheres with different radii on the target protein to
generate an inner and outer "surface" and reveal empty spaces, i.e.,
empty pockets, between these two surfaces [35, 36]. Yet another
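The grid-scanning idea described above for LIGSITEcsc can be sketched on a small cubic grid. This is a simplified illustration, not the published algorithm: protein occupancy is given directly as a set of grid cells, the clustering of buried cells into pockets is left out, and the function names are hypothetical. For each empty cell, the code counts along how many of the seven scan directions (x, y, z, and four cubic diagonals) the cell is enclosed by protein on both sides.

```python
from itertools import product

# The seven scan directions: three axes and four cubic diagonals.
DIRECTIONS = [(1, 0, 0), (0, 1, 0), (0, 0, 1),
              (1, 1, 1), (1, 1, -1), (1, -1, 1), (-1, 1, 1)]


def _hits_protein(cell, step, occupied, size):
    """Walk from cell along step; True if a protein cell is met inside the grid."""
    x, y, z = cell
    dx, dy, dz = step
    while True:
        x, y, z = x + dx, y + dy, z + dz
        if not (0 <= x < size and 0 <= y < size and 0 <= z < size):
            return False
        if (x, y, z) in occupied:
            return True


def buriedness(occupied, size):
    """Map every empty cell to its number of protein-enclosed directions (0-7)."""
    scores = {}
    for cell in product(range(size), repeat=3):
        if cell in occupied:
            continue
        count = 0
        for step in DIRECTIONS:
            opposite = tuple(-c for c in step)
            if (_hits_protein(cell, step, occupied, size)
                    and _hits_protein(cell, opposite, occupied, size)):
                count += 1
        scores[cell] = count
    return scores


# Toy "protein": six cells surrounding the centre of a 5 x 5 x 5 grid.
protein = {(1, 2, 2), (3, 2, 2), (2, 1, 2), (2, 3, 2), (2, 2, 1), (2, 2, 3)}
scores = buriedness(protein, size=5)
print(scores[(2, 2, 2)], scores[(0, 0, 0)])
```

Empty cells with a high count are candidates for pocket space; a full method would then cluster neighbouring high-scoring cells and rank the clusters.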
Another class of methods computes neither geometry nor interaction energy, but the statistics of certain properties for their propensities to be at, or associated with, known LBSs. One typical
propensity-based method is Hirayama's method [54], in which the
predicted binding pockets generated by a program named Alpha
Site Finder [55] are re-ranked by the amino acid composition,
which shows small, but statistically significant, differences between
LBSs and non-LBSs. Propensity-based methods often re-rank the
pockets predicted by other methods, mainly geometry-based
methods. For example, the surface triplet propensities (STP) algorithm [56] assigns a propensity score to each atom located in the
binding pockets predicted by SURFNET [34], then re-ranks the
SURFNET pockets by simply counting the number of high-scoring
atoms. In contrast, a new propensity-based method developed in
our laboratory [57, 58] does not rely on pockets pre-identified by
other methods. This new method was named LISE for Ligand
Interacting and binding Site Enriched protein triangles. In LISE,
a protein triangle is a triplet of protein surface atoms
that simultaneously interact with a ligand molecule; such triplets
are enriched at LBSs, and each is assigned
an enrichment factor deduced from a statistical analysis of a set of
protein-ligand complex structures [59].
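The re-ranking step used by propensity-based methods such as STP can be illustrated with a short sketch. It assumes that pockets have already been identified by a geometry-based method and that every atom lining a pocket carries a propensity score; the pocket names, scores, and threshold are invented for the example and do not reproduce the published scoring scheme.

```python
def rerank_pockets(pockets, threshold):
    """Order pockets by the number of atoms scoring at or above threshold.

    pockets: dict mapping a pocket name to a list of per-atom propensity scores.
    Returns pocket names, best first.
    """
    counts = {name: sum(1 for score in scores if score >= threshold)
              for name, scores in pockets.items()}
    return sorted(counts, key=lambda name: counts[name], reverse=True)


# Invented example: three pockets in their original geometric ranking.
pockets = {
    "P1": [0.2, 0.9, 0.1],
    "P2": [0.8, 0.95, 0.7, 0.1],
    "P3": [0.1],
}
print(rerank_pockets(pockets, threshold=0.7))
```

In this example the geometrically top-ranked pocket P1 drops to second place because P2 contains more high-scoring atoms.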
2.6 Related Issues
Notes
1. The first step in LBS prediction should be to search for a
homologous structure(s). If ligand-bound homologous
structure(s) exist in PDB, a template-based prediction is usually quite reliable, particularly if the prediction is supported by
other types of methods.
2. With the exception of FINDSITE [7, 79] and perhaps a few
others, most methods do not report their prediction accuracy
on homology-derived models, although, according to
FINDSITE, homology models can be tolerated for template-based LBS predictions.
3. It is generally recommended to use multiple different methods
to find consensus and/or to identify potential LBSs for
evaluation.
4. Several LBS prediction methods have been developed for specific protein families [80–84]; for target proteins belonging
Chapter 18
Information-Driven Structural Modelling
of Protein–Protein Interactions
João P.G.L.M. Rodrigues, Ezgi Karaca, and Alexandre M.J.J. Bonvin
Abstract
Protein–protein docking aims at predicting the three-dimensional structure of a protein complex starting
from the free forms of the individual partners. As assessed in the CAPRI community-wide experiment, the
most successful docking algorithms combine pure laws of physics with information derived from various
experimental or bioinformatics sources. Of these so-called information-driven approaches, HADDOCK
stands out as one of the most successful representatives. In this chapter, we briefly summarize which
experimental information can be used to drive the docking prediction in HADDOCK, and then focus on
the docking protocol itself. We discuss and illustrate with a tutorial example a classical protein–protein
docking prediction, as well as more recent developments for modelling multi-body systems and large
conformational changes.
Key words Biomolecular interactions, Information-driven docking, Conformational changes,
Multi-body docking, HADDOCK, Molecular modelling
1 Introduction
Docking is defined as the modelling of the three dimensional (3D)
structure of a molecular complex from its known unbound
constituents. It was developed to aid in the structural elucidation
of transient or weak interactions, which can be challenging to characterize experimentally due to, for example, difficulties in crystallization or because the molecular weight rules out a thorough
classical NMR analysis. The advent of explicit treatment of molecular flexibility, together with better and more efficient algorithms
for both sampling and scoring, has earned docking a solid reputation amongst experimentalists. In turn, this attention brought new
challenges such as the prediction of large molecular assemblies,
protein–nucleic acid complexes, high-throughput predictions of
entire metabolic pathways, or understanding the molecular origins
of binding affinity and specificity [1, 2].
Andreas Kukol (ed.), Molecular Modeling of Proteins, Methods in Molecular Biology, vol. 1215,
DOI 10.1007/978-1-4939-1465-4_18, Springer Science+Business Media New York 2015
2 Theory
This section briefly discusses various useful information sources,
how these can be used to drive docking predictions, and describes
the HADDOCK strategy to produce structural models of biomolecular complexes. Since HADDOCK uses CNS for its structure
calculations, details on the implementation of particular restraint
types are best found on the CNS Web site (http://cns-online.org/v1.3)
or in related publications [9, 10].
2.1 Sources of Information for Data-Driven Docking
2.1.1 Common NMR Structural Information Sources
2.1.2 NMR Chemical Shift Perturbations
2.1.4 Hydrogen/Deuterium Exchange
H/D exchange provides information on the solvent accessible residues of a protein. In a deuterated medium, amide protons exposed
to the solvent exchange rapidly while those buried by the protein
structure do not. Upon interaction, the interface of the proteins
also becomes inaccessible to solvent exchange. Following this event
by either NMR with 15N HSQC spectra or by mass spectrometry
reveals the solvent-accessible surface of the bound complex and,
indirectly, the interfacial residues.
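The logic of reading an interface out of H/D exchange data can be sketched as a simple set difference: residues whose amide protons exchange in the free protein but are protected in the complex are candidate interface residues. This is a minimal sketch under the assumption that exchange has already been reduced to a yes/no call per residue; the function name and residue numbers are hypothetical, and real data would need thresholds on exchange rates.

```python
def interface_candidates(exchanging_free, exchanging_bound):
    """Residues that exchange in the free protein but not in the complex."""
    return sorted(set(exchanging_free) - set(exchanging_bound))


# Hypothetical residue numbers whose amide protons exchange rapidly.
free = [12, 15, 16, 40, 41, 77]
bound = [12, 40, 77]
candidates = interface_candidates(free, bound)
print(candidates)
```

Protection upon binding can also arise from conformational changes away from the interface, so such a list is best treated as indirect evidence, in line with the text above.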
2.1.5 NMR Pseudocontact Shifts
2.1.8 Low-Resolution Shape Information (Cryo-EM, SAXS, CCS)
2.1.9 Bioinformatics Predictions
Rigid-Body Energy Minimization (RBEM, it0)
In the initial docking stage, the interacting partners are first separated
in space and each is randomly rotated around its center of mass to
remove any orientational bias. They are then subjected to a rigid-body
energy minimization protocol, in which first only the orientation
of the partners is optimized, and then both rotations and translations
are allowed, effectively resulting in the docking of the molecules.
Given the fast calculation of each docking model at this stage, it is
typically worth generating a large number of models to cover the
interaction space. By default, 1,000 models are written to disk, although
10,000 are sampled: each model is the result of five internal docking
trials and, for each trial, the 180°-rotated solution around the normal to
the interface is sampled as well. These models are then ranked
according to the HADDOCK score (see below), and a fraction of
them, typically 200, is selected for further flexible refinement.
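The random reorientation that precedes rigid-body minimization can be illustrated with a short sketch. It is not HADDOCK or CNS code: it assumes unit atomic masses (so the center of mass equals the geometric center), draws a uniformly distributed rotation from a random unit quaternion, and uses hypothetical coordinates.

```python
import math
import random


def random_rotation_matrix(rng):
    """Uniform random 3D rotation built from a random unit quaternion."""
    u1, u2, u3 = rng.random(), rng.random(), rng.random()
    x = math.sqrt(1 - u1) * math.sin(2 * math.pi * u2)
    y = math.sqrt(1 - u1) * math.cos(2 * math.pi * u2)
    z = math.sqrt(u1) * math.sin(2 * math.pi * u3)
    w = math.sqrt(u1) * math.cos(2 * math.pi * u3)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def centroid(coords):
    """Geometric center; equals the center of mass for unit masses."""
    n = len(coords)
    return tuple(sum(atom[i] for atom in coords) / n for i in range(3))


def rotate_about_centroid(coords, rotation):
    """Rotate all atoms rigidly around their common centroid."""
    center = centroid(coords)
    rotated = []
    for atom in coords:
        d = [atom[i] - center[i] for i in range(3)]
        rotated.append(tuple(
            center[i] + sum(rotation[i][j] * d[j] for j in range(3))
            for i in range(3)
        ))
    return rotated


rng = random.Random(0)  # fixed seed so the example is reproducible
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0), (0.0, 0.0, 3.0)]
rotated = rotate_about_centroid(coords, random_rotation_matrix(rng))
```

Because the transformation is rigid, the centroid and all interatomic distances are unchanged; only the orientation of the molecule differs.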
Semi-flexible Simulated Annealing in Torsion Angle Space (TAD/SA, it1)
The second stage of the HADDOCK protocol fine-tunes each complex by flexible refinement of its interface. This second stage starts
with a rigid-body SA step to optimize the orientation of the components. Then, the side chains of the interface (automatically defined
for each docking model as all residues within 10 Å of a partner
molecule) are allowed to move in a second SA stage. A third and
final SA stage optimizes both the backbone and the side chains of the interface residues to allow for some conformational rearrangements.
Finally, a short energy minimization in Cartesian space relaxes the
models. A new ranking of the models is produced at this step, but,
usually, all models are allowed to undergo the third and final
refinement step in explicit solvent.
Restrained Molecular Dynamics in Explicit Solvent (Water)
2.3 Restraints Implemented in HADDOCK
2.3.1 Ambiguous Interaction Restraints (AIRs)
When the information is sufficiently accurate to allow unambiguous pairing of atoms or residues between partner molecules, it is
possible to incorporate it in HADDOCK as unambiguous distance
restraints. These use the same functional form of the distance
restraining potential as the AIRs (flat bottom harmonic potential
with upper and lower bounds transitioning to a linear potential
To ensure compactness of solutions, for example when using symmetry restraints without experimental information, HADDOCK
allows the definition of distance restraints between the geometric
centers of the molecules (based only on CA atoms). These
so-called center of mass restraints can also be used when there is no
information about the binding interface (ab initio docking), albeit
with a lower chance of success, since the only factors in play are the
physical terms of the scoring function.
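As an illustration of the quantity such a restraint acts on, the following sketch computes the CA-based geometric center of each molecule from PDB ATOM records and the distance between the two centers. It is not taken from HADDOCK; the helper `atom_line` and all coordinates are hypothetical and only serve to build correctly aligned example records (atom name in columns 13–16, coordinates in columns 31–54).

```python
import math


def ca_geometric_center(pdb_lines):
    """Average position of all CA atoms found in the ATOM records."""
    xyz = [
        (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        for line in pdb_lines
        if line.startswith("ATOM") and line[12:16].strip() == "CA"
    ]
    if not xyz:
        raise ValueError("no CA atoms found")
    return tuple(sum(p[i] for p in xyz) / len(xyz) for i in range(3))


def center_distance(mol_a, mol_b):
    """Distance between the CA-based centers of two molecules."""
    return math.dist(ca_geometric_center(mol_a), ca_geometric_center(mol_b))


def atom_line(serial, name, resseq, x, y, z):
    """Hypothetical helper that writes a column-aligned ATOM record."""
    return (f"ATOM  {serial:5d} {name:^4s} ALA  {resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


mol_a = [
    atom_line(1, "N", 1, 10.0, 10.0, 10.0),  # ignored: not a CA atom
    atom_line(2, "CA", 1, 0.0, 0.0, 0.0),
    atom_line(3, "CA", 2, 2.0, 0.0, 0.0),
]
mol_b = [atom_line(1, "CA", 1, 1.0, 3.0, 4.0)]
distance = center_distance(mol_a, mol_b)
print(round(distance, 3))
```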
3 Materials
The following Subheading 4 describes an example protocol on
the usage of HADDOCK to model the interaction between two
proteins by using experimental information. In order to follow the
3.2 Crystallography and NMR Suite (CNS) v1.3
3.3 ProFit
3.4 NACCESS
3.5 PyMOL
4 Methods
4.1 Modelling of Complexes with HADDOCK
Make sure that your input models are compliant with the PDB
format, in particular that an END statement is present as the last
line of the file. Furthermore, the segment identifier (characters
73–76 in each ATOM statement) and chain identifier fields (character 22 in each ATOM statement) should be empty strings (i.e., filled
with spaces). If you use a crystal structure, make sure that there are
no double occupancies or residue insertions. If you are using an
ensemble of models, split the file into individual files that contain
only one structure each (see Note 3).
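These format requirements can be checked automatically before starting a run. The sketch below is not part of HADDOCK; it assumes the standard PDB column layout (alternate-location indicator in column 17, chain identifier in column 22, insertion code in column 27, segment identifier in columns 73–76), and the function names and example records are hypothetical.

```python
def check_pdb_for_haddock(lines):
    """Return a list of human-readable problems found in the PDB lines."""
    problems = []
    non_empty = [line for line in lines if line.strip()]
    if not non_empty or non_empty[-1].strip() != "END":
        problems.append("last line is not an END statement")
    for number, line in enumerate(lines, start=1):
        if not line.startswith("ATOM"):
            continue
        record = line.rstrip("\n").ljust(80)  # pad short records
        if record[72:76].strip():
            problems.append(f"line {number}: segment identifier not empty")
        if record[21] != " ":
            problems.append(f"line {number}: chain identifier not empty")
        if record[16] != " ":
            problems.append(f"line {number}: alternate location (double occupancy)")
        if record[26] != " ":
            problems.append(f"line {number}: residue insertion code")
    return problems


def make_atom(altloc=" ", chain=" ", icode=" ", segid=""):
    """Hypothetical helper producing one column-aligned ATOM record."""
    line = (f"ATOM  {1:5d}  CA {altloc}ALA {chain}{1:4d}{icode}   "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{0.0:6.2f}")
    return line.ljust(72) + segid.ljust(4)


clean = [make_atom(), "END"]
bad = [make_atom(chain="A", segid="PROT"), make_atom(altloc="B"),
       make_atom(icode="A")]
print(check_pdb_for_haddock(clean))
for problem in check_pdb_for_haddock(bad):
    print(problem)
```

The check only reports problems; how to resolve them (for example, which alternate conformation to keep) remains a decision for the user.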
As input data, you should combine chemical shift perturbation
data (or other data indicating residues at the interface) and solvent
accessibility data calculated with NACCESS: use only those
residues that have both a high enough chemical shift perturbation
(see Note 4) and a high enough relative accessibility. In the example, the residue solvent accessibilities calculated with NACCESS
are already provided in the files e2a_1F3G.rsa and hpr/hpr_rsa_ave.lis
(the latter containing the average for the 10 starting models for hpr). From these files you can select the residues
with high enough (e.g., >~40%) accessibility (see Note 5). You
could calculate the accessibility values yourself using the following
command:
Passive residues are defined as the solvent accessible surface neighbors of active residues. To define and visualize them you can use a
molecular visualization program, for example PyMOL,
You will use the active and passive residues for both molecules
to generate Ambiguous Interaction Restraints (AIRs). To do so, go
to the HADDOCK GenTBL service (http://haddock.chem.uu.nl/services/GenTBL/)
and follow the instructions. You should
save the resulting file as ambig.tbl in the working directory;
note that, in the e2a-hpr example directory, an example file
named e2a-hpr_air.tbl is already present and can be used for
comparison (see Note 6).
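Before the residue lists are submitted to GenTBL, the active-residue rule described earlier (a high enough chemical shift perturbation combined with a high enough relative accessibility) can be applied programmatically. The sketch below assumes both quantities have already been parsed into dictionaries keyed by residue number; the function name, the perturbation cutoff, and all values are hypothetical, and only the approximately 40% accessibility cutoff comes from the text.

```python
def select_active_residues(csp, rel_accessibility, csp_cutoff, acc_cutoff=40.0):
    """Residues passing both the perturbation and the accessibility cutoff."""
    return sorted(
        residue
        for residue, perturbation in csp.items()
        if perturbation > csp_cutoff
        and rel_accessibility.get(residue, 0.0) > acc_cutoff
    )


# Hypothetical chemical shift perturbations (ppm) and relative
# solvent accessibilities (%) keyed by residue number.
csp = {38: 0.25, 40: 0.05, 45: 0.30, 46: 0.22}
accessibility = {38: 62.0, 40: 80.0, 45: 12.5, 46: 41.0}
active = select_active_residues(csp, accessibility, csp_cutoff=0.1)
print(active)
```

Residues missing from the accessibility table are treated as buried here, which is a conservative design choice rather than a requirement of the protocol.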
4.1.3 Setup of a New Run: new.html
4.1.4 Run.cns
Fig. 1 Plot of HADDOCK scores versus interface RMSD from the lowest energy model for the three stages of
the docking protocol (blue, green, and red, for it0, it1, and water refinement, respectively). One can clearly see
a funnel at low RMSD values becoming more apparent after flexible refinement
The cd command brings you back into the main run directory,
from where you start HADDOCK again. Only the analysis of the
best 10 models of the first cluster in the water will be run. Once
this is finished, go to the respective analysis directory and inspect
the various files. The RMSD from the average models should now
be low (check rmsave.disp).
Having run the HADDOCK analysis on a cluster basis for each
cluster, you should now have new directories in the water directory, called analysis_clustX_best10. Each of these analysis
directories now contains cluster-specific statistics. You can also
visualize the clusters, using for example PyMOL. We provide a Perl
script in the tools/ directory, joinpdb, which allows concatenation of the various PDB files into one single ensemble file:
Fig. 2 Superimposition of the top model of the best scoring cluster onto the native
structure (PDB ID 1GGR). The molecules were superimposed on backbone atoms
of E2A, which is shown in white surface representation with the phosphorylated
histidine colored according to the atom types (blue, red, and orange, for nitrogen,
oxygen, and phosphorus, respectively). The HPR molecules are shown in cartoon
representation (the model in blue and the native in peach) and the histidine residue involved in the phosphate transfer in ball-and-stick. The model is in excellent
agreement with the native structure (interface RMSD = 0.97 Å). The proximity of
the two histidines across the interface, which was not defined as a restraint in
HADDOCK, is consistent with the biological function of this phosphotransferase
complex
In general, the top-ranked models of the cluster with the lowest
HADDOCK score are considered the representatives of the biological system. However, scoring in docking remains a difficult
problem and we do recommend, if possible, using additional independent information to validate the results (e.g., mutagenesis
data). The selected model should explain as much as possible of
what is known about the system (see Fig. 2).
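Selecting the cluster with the lowest HADDOCK score can be sketched as follows. The example ranks clusters by the average score of their best-scoring members; using the top four models per cluster is an assumption made for this illustration, and the function name, cluster names, and scores are hypothetical rather than output of a real run.

```python
from statistics import mean


def rank_clusters(cluster_scores, top_n=4):
    """Rank clusters by the mean score of their `top_n` best (lowest) models."""
    averages = {
        cluster: mean(sorted(scores)[:top_n])
        for cluster, scores in cluster_scores.items()
    }
    # Lower (more negative) scores are better, so sort in ascending order.
    return sorted(averages.items(), key=lambda item: item[1])


# Hypothetical HADDOCK scores per cluster (arbitrary units).
cluster_scores = {
    "clust1": [-120.0, -110.0, -100.0, -90.0, -20.0],
    "clust2": [-130.0, -60.0, -50.0, -40.0],
    "clust3": [-95.0, -94.0],
}
ranked = rank_clusters(cluster_scores)
print(ranked)
```

Averaging over a fixed number of top models keeps large clusters with a few poor members from being penalized, and it prevents a single outlier (such as the best model of clust2 here) from dominating the ranking.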
4.2 Other Docking Scenarios
4.2.1 Multi-body Docking
4.2.3 Protein–Peptide Docking
Although at the other end of the size spectrum, small systems such as
peptides are also challenging with regard to sampling. Their extreme
flexibility and the many conformations they can adopt upon binding make them difficult to model and usually require long
molecular dynamics simulations or other advanced sampling
methods, none of which is feasible, time-wise, for use
in HADDOCK.
To cover the conformational landscape of peptides, we developed a shortcut approach. In this custom-tailored protocol,
the peptide is provided as an ensemble of the three most common
conformations: α-helical, β-strand, and polyproline II (see Note 17).
Additionally, the number of MD steps in the flexible refinement
5 Notes
1. HADDOCK has a special feature, solvated docking, that
allows water molecules to be introduced at the interface of the
complex for the entire duration of the docking protocol. This feature should only be used when the experimental information is
accurate enough to drive the docking and the interface is
expected to be wet. In short, solvated docking starts by surrounding each molecule with a shell (approximately 4 Å wide) of
water molecules, optimized via a short MD simulation, prior to
the RBEM stage. After the minimization, all water molecules
that are not at the interface are removed. At the interface, only
a fraction of the molecules is kept (by default 25%), with the
removal being carried out via a biased Monte Carlo sampling
method whose criterion is based on a statistical potential of amino
acid–water contact propensities. Finally, energetically unfavorable water molecules (those with a positive intermolecular
energy) are removed, which might lead to a complete desolvation of the interface, and another round of RBEM is performed
to optimize the final complex. The remainder of the HADDOCK
protocol is unchanged, with the difference that interfacial
water molecules might be included in the further refinement.
We refer the reader to the following references for an in-depth
explanation of solvated docking in HADDOCK: [36–39].
2. HADDOCK must be correctly installed for the $HADDOCK
environment variable to be defined. Check the installation
instructions provided with the software.
3. If your input PDBs contain missing segments, this might lead
to domains drifting away during the refinement stage. To avoid
this, simply define a few unambiguous distance restraints
between CA atoms from the various sub-domains, setting
the actual measured distance as a target distance and the
bounds to 0.0. The same can be done to ensure that an ion
coordination geometry is properly maintained. Missing residues at the interface or in hinge regions must be handled with
extreme care not to compromise the biological integrity of the
models. Missing atoms, on the other hand, are not problematic since HADDOCK rebuilds them based on the topology
files of the force field, as long as the residue name is defined in
them. Also, termini charges are very important for the docking
protocol, as they can lead to artificial interactions. By default,
References
1. Melquiond AS, Karaca E, Kastritis PL et al
(2012) Next challenges in protein–protein
docking: from proteome to interactome and
beyond. Comput Mol Sci 2:642–651
2. Kastritis PL, Bonvin AM (2013) Molecular origins of binding affinity: seeking the
Archimedean point. Curr Opin Struct Biol
23(6):868–877
3. Schlick T, Collepardo-Guevara R, Halvorsen
LA et al (2011) Biomolecular modeling and
simulation: a field coming of age. Quart Rev
Biophys 44:191–228
4. Janin J (2013) The targets of CAPRI rounds
20–27. Proteins 81(12):2075–2081
5. Lensink MF, Janin J (2013) Docking, scoring
and affinity prediction in CAPRI. Proteins
81(12):2082–2095
6. de Vries SJ, Melquiond ASJ, Kastritis PL et al
(2010) Strengths and weaknesses of data-driven
docking in critical assessment of prediction of interactions. Proteins 78:3242–3249
7. Dominguez C, Boelens R, Bonvin AMJJ
(2003) HADDOCK: a protein–protein docking approach based on biochemical or biophysical information. J Am Chem Soc
125:1731–1737
8. Linge JP, Habeck M, Rieping W et al (2003)
ARIA: automated NOE assignment and
NMR structure calculation. Bioinformatics
19:315–316
9. Brünger AT, Adams PD, Clore GM et al
(1998) Crystallography & NMR system: a new
software suite for macromolecular structure
determination. Acta Crystallogr D Biol
Crystallogr 54:905–921
10. Brunger AT (2007) Version 1.2 of the crystallography and NMR system. Nat Protocol
2:2728–2733
11. de Vries SJ, Bonvin AMJJ (2008) How proteins get in touch: interface prediction in the
study of biomolecular complexes. Curr Protein
Pept Sci 9:394–406
12. Karaca E, Bonvin AMJJ (2013) Advances in
integrative modeling of biomolecular complexes. Methods 59:372–381
13. Schmitz C, Melquiond AS, de Vries SJ et al
(2012) Protein–protein docking with
HADDOCK, NMR of biomolecules: towards
mechanistic systems biology, 1st edn. Wiley-VCH
Verlag GmbH & Co. KGaA, Weinheim,
pp 521–535
14. Wang G, Louis JM, Sondej M et al (2000)
Solution structure of the phosphoryl transfer
complex between the signal transducing proteins
Chapter 19
Identifying Putative Drug Targets and Potential
Drug Leads: Starting Points for Virtual
Screening and Docking
David S. Wishart
Abstract
The availability of 3D models of both drug leads (small molecule ligands) and drug targets (proteins) is
essential to molecular docking and computational drug discovery. This chapter describes a simple approach
that can be used to identify both drug leads and drug targets using two popular Web-accessible databases:
(1) DrugBank and (2) The Human Metabolome Database. First, it is illustrated how putative drug targets
and drug leads for exogenous diseases (i.e., infectious diseases) can be readily identified and their 3D structures selected using only the genomic sequences from pathogenic bacteria or viruses as input. The second
part illustrates how putative drug targets and drug leads for endogenous diseases (i.e., noninfectious diseases or chronic conditions) can be identified using similar databases and similar sequence input. This
chapter is intended to illustrate how bioinformatics and cheminformatics can work synergistically to help
provide the necessary inputs for computer-aided drug design.
Key words Drug, Disease, Drug target, Metabolite, Bioinformatics, Sequence comparison, Chemical
similarity, Exogenous disease, Endogenous disease
Introduction
As most readers have already seen in previous chapters, protein
modeling is a mature field that allows many interesting biological
questions to be addressed using only a computer. Insights gained
through computational modeling have helped us to better understand proteins and their many important structure–function relationships. While macromolecular modeling has helped enormously
to advance basic biology, one of the central justifications for the
enormous resources that have gone into this field over the past
30 years is the hope that molecular modeling could, one day,
accelerate both drug discovery and drug design [1–3].
Computational drug discovery is a subfield of macromolecular
modeling that involves the docking or virtual screening of one or
more small-molecule compounds against a chosen protein target.
Andreas Kukol (ed.), Molecular Modeling of Proteins, Methods in Molecular Biology, vol. 1215,
DOI 10.1007/978-1-4939-1465-4_19, Springer Science+Business Media New York 2015
Theory
When medicinal chemists or pharmaceutical scientists think about
drugs and drug targets they generally classify them into two separate groups: (1) those that are associated with endogenous
human diseases and (2) those that are associated with infectious or
exogenous diseases. Endogenous diseases are typically chronic
human disorders or conditions that arise due to germ-line mutations (genetic diseases), somatic mutations (cancer), age (atherosclerosis, immune disorders), or some other internal factors. On
the other hand, exogenous diseases are typically temporary diseases
or conditions that arise from external, nonhuman agents such as
viruses, bacteria, fungi, protozoans, or poisonous animals (snakes,
insects). The vast majority of drug targets (96 %) and drugs (89 %)
are associated with endogenous diseases, while only a tiny minority
of drug targets (4 %) and drugs (11 %) are actually associated with
exogenous or infectious diseases [13, 15].
The identification of putative drug targets and drug leads for exogenous diseases can take one of two paths, both of which depend
substantially on bioinformatics and sequence database comparisons. One can either attempt to identify a completely novel drug
target/drug lead or one can attempt to identify a drug target/
drug lead that is similar (or even identical) to an existing class of
drug targets or drug leads. In both cases, one needs either the
complete protein or DNA sequence of the pathogen of interest.
Fortunately, with the advent of next-generation DNA sequencing,
the entire DNA sequence for hundreds of infectious agents of
interest is already known or can be determined in as little as a day.
If one chooses to identify a completely novel drug target or
drug lead the task is then to identify those genes or proteins in the
genome that are: (1) essential to viability; (2) disease causing; or
(3) presented on the surface of the organism. Surface-bound proteins may be identified by sequence analysis by looking for transmembrane segments using such tools as TMHMM [16] or
PSORTb [17]. Essential genes, especially for bacteria, may be
identified by comparing sequences to existing databases of essential
genes such as in the Database of Essential Genes [18]. Likewise
disease-causing genes can be identified by comparing sequences
between non-pathogenic forms of the microbe with pathogenic
forms (say E. coli O157 vs. E. coli MG1655) or through the identification of pathogenicity islands using tools such as IslandViewer
[19]. Alternately essential genes or disease causing genes may be
experimentally identified through knock-out mutations or deletions. Generally all viral genes in a viral genome are essential while
only 200-300 bacterial genes in a given bacterial genome are essential. Furthermore, among most pathogens, only a small fraction of
Methods
For this section we will describe two protocols. One will describe the
identification of drug targets and drug leads for a novel retrovirus
that exhibits strong similarity to the AIDS virus (HIV) (see Notes 1–8).
The other will describe the identification of drug leads (from a
preexisting list of putative drug targets) for prostate cancer.
top of the page with the eight clickable titles Home, Browse,
Search, Downloads, About, Help, Tools, Contact Us.
2. Click on the Search link. A submenu should appear that
displays several search options including ChemQuery, Text
Query, Interax Interaction Search, Sequence Search, and
Data Extractor. Select the Sequence Search option. A window with the title Sequence Search should appear (Fig. 1). As
seen in the figure the window contains a standard online
BLAST search form with a text box window, with eight different BLASTP parameter settings. There are also options for the
Drug type and Database to be searched, with a variety of
options. In almost all cases users can leave everything (except
the Drug type and Database selection) in their default position. A unique feature of the Sequence Search program is its
capacity to handle multiple FASTA-formatted sequences. This
allows users to BLAST multiple sequencesor even entire
proteomes.
3. For this example we will be looking for potential drug targets
to a newly isolated retrovirus. To obtain the set of sequences to
paste into the Sequence Search text box, launch a new browser
window and go to: http://www.wishartlab.com/molecularmodelingproteins/virus. Click on the Virus hyperlink. A list
of 15 viral protein sequences should be visible. Select all 15
sequences by clicking and dragging through the window with
your mouse. Copy the sequences (using the Copy option on
your browser or using Ctrl + C).
4. Now click on the Sequence Search browser window to activate
it and paste the sequences into the Sequence Search text box by
clicking your mouse in the text box and using the Paste option
on your browser (or Ctrl + V). You have now pasted 15 different
protein sequences from the newly sequenced retrovirus. Use the
scroll bars on the right side of the text box to see if all 15
sequences are there (numbered Peptide 1 to Peptide 15).
5. Now select the DrugBank sequence database to search. For
this example go down to the bottom of the Sequence Search
window and select Drug Type Approved and Database
Target. This means you will search through all known protein targets of FDA approved drugs. Once this is done, press
the Search button. Within a few seconds the BLAST search for
all 15 input sequences should be completed. The program will
return a text-based BLAST summary for each of the 15 proteins that were submitted. The top portion of the Sequence
Search output consists of a summary of the submitted
sequences. Below that is the BLAST result for the first sequence
(Peptide 1) listing the E-value, the bit score, the query length,
the name of the closest match, and the alignment with the
query sequence at the top and the DrugBank database match
Fig. 2 Screenshot of the output from the DrugBank BLAST search using the 16 viral protein sequences
belonging to a novel retrovirus
Fig. 3 A view of the tabular output found in the DrugCard for Indinavir
menu bar located near the top of the page with the seven clickable titles Home, Browse, Search, About, Downloads,
Metabolomics Toolbox, and Contact Us.
4. Click on the Search link. A submenu should appear that displays
nine different search options including Chem Query, Text
Query, Sequence Search, Advanced Search, etc. Select the
Sequence Search option. A window with the title Sequence
Search should appear (Fig. 5). As seen in the figure the window
contains a standard online BLAST search form with a text box
window, with eight different BLASTP parameter settings. A
unique feature of the Sequence Search program is its capacity
Fig. 6 Screenshot of the output from a BLAST search against the HMDB using the ten protein sequences identified as potential prostate cancer drug targets
Fig. 7 Screenshot of the MetaboCard for Sarcosine. The hyperlinks for the MOL, SDF, and PDB structure files
(below the structure) are also visible
Notes
1. The examples given in Subheadings 3.1 and 3.2 are realistic
but somewhat simplified compared to what might be necessary
for real life drug discovery. In particular, the identification of
drug targets always requires some critical assessment of the
utility and viability of the drug target or drug lead. This typically requires a good deal of library research and additional
experimentation. For instance, one must determine whether
the drug target(s) should be inhibited (therefore requiring an
antagonist) or activated (therefore requiring an agonist). As a
general rule, the development of antagonists is much
easier than that of agonists.
2. It is usually a good idea to determine whether the putative
drug target has been previously identified and whether experimental lead compounds have already been explored. Even if a
drug target appears viable, one should take particular care to
determine if the protein is essential, unique, or conditionally
expressed for the associated disease or condition. Nonessential,
nonunique, or continuously expressed proteins are generally
not good drug targets. Likewise proteins with generally weak
affinities (i.e., most carbohydrate binding proteins) or poor
turnover rates often turn out to be poor drug targets.
3. The selection of drug leads also requires some careful
consideration. While DrugBank, HMDB, and PubChem can
offer many useful suggestions, they are not the only sources for
drug leads. Surveys through the literature or careful searches
through specialized drug-screening databases can often yield
very useful ideas. Once a collection of drug leads has been
identified, it is usually prudent to assess the suitability of the
compound as a drug. Drug compounds must not be too soluble, too lipophilic, too unstable, or too toxic. These requirements are closely related to their physicochemical properties,
which are also related to their Absorption, Distribution,
Metabolism, Excretion, and Toxicity (ADMET).
4. ADMET prediction is becoming increasingly common in
early-stage drug discovery, drug screening, and drug design.
Indeed, many computational chemists would argue that
ADMET prediction is something that should always be done
in the early phases of drug-lead selection. Fortunately there are
now a number of software packages, online servers, and standardized rules (Lipinski's rule of five) for determining the likely
success or drug-likeness of a compound.
5. Among existing tools, AdmetSAR [36] and PreADMET
(http://preadmet.bmdrc.org/) probably represent two of the
most comprehensive and complete ADMET servers currently
available.
6. AdmetSAR is both a server and a database with more than
210,000 literature-derived ADMET data values for nearly
100,000 compounds corresponding to 45 kinds of ADMET-associated properties obtained for different proteins, cell types,
and organisms. Through database matches, machine learning
classifiers and rule-based regression models derived from its
large database and various molecular descriptors, the
AdmetSAR server also allows users to predict up to 27 ADMET
properties for query compounds. Some of these properties
include probabilities for blood–brain barrier penetration,
Caco-2 permeability, intestinal absorption, P-gp inhibition/
substrate status, CYP isotype inhibitor or substrate status, renal
cation transporter substrate status, carcinogenicity, and Ames,
fish, or honeybee toxicity. The server accepts SMILES string
data as input and rapidly returns a hyperlinked list of values,
probabilities, or qualitative classification statements (noninhibitor, toxic, nontoxic, etc.). Each entry is also hyperlinked
to a brief description of the ADMET feature.
7. The PreADMET server (http://preadmet.bmdrc.org/) supports a variety of applications including molecular descriptor
calculations (2,000+ values), drug likeness calculations, Caco-2
cell permeability, MDCK cell permeability, human intestinal
absorption (HIA), skin permeability, blood–brain barrier
permeability, plasma protein binding, Ames toxicity, and rodent
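As an illustration of the standardized rules mentioned in Note 4, Lipinski's rule of five can be applied to precomputed molecular descriptors. The sketch below assumes that molecular weight, logP, and hydrogen-bond donor and acceptor counts have already been obtained from a cheminformatics tool; the function names and descriptor values are hypothetical, and tolerating at most one violation is a common convention adopted here as an assumption.

```python
def lipinski_violations(mol_weight, log_p, h_donors, h_acceptors):
    """List which of the four rule-of-five criteria a compound violates."""
    violations = []
    if mol_weight > 500:
        violations.append("molecular weight > 500 Da")
    if log_p > 5:
        violations.append("logP > 5")
    if h_donors > 5:
        violations.append("more than 5 hydrogen-bond donors")
    if h_acceptors > 10:
        violations.append("more than 10 hydrogen-bond acceptors")
    return violations


def is_drug_like(mol_weight, log_p, h_donors, h_acceptors, max_violations=1):
    """True when the number of violations does not exceed `max_violations`."""
    found = lipinski_violations(mol_weight, log_p, h_donors, h_acceptors)
    return len(found) <= max_violations


# Hypothetical descriptor values for two candidate drug leads.
compound_a = dict(mol_weight=350.4, log_p=2.1, h_donors=2, h_acceptors=5)
compound_b = dict(mol_weight=720.9, log_p=6.3, h_donors=6, h_acceptors=12)
print(is_drug_like(**compound_a), lipinski_violations(**compound_a))
print(is_drug_like(**compound_b), lipinski_violations(**compound_b))
```

Such a filter is only a first pass; the server-based ADMET predictions described in Notes 5 to 7 cover properties that these four descriptors cannot capture.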
Chapter 20
Molecular Docking to Flexible Targets
Jesper Sørensen, Özlem Demir, Robert V. Swift, Victoria A. Feher,
and Rommie E. Amaro
Abstract
It is widely accepted that protein receptors exist as an ensemble of conformations in solution. How best to
incorporate receptor flexibility into virtual screening protocols used for drug discovery remains a significant challenge. Here, stepwise methodologies are described to generate and select relevant protein conformations for virtual screening in the context of the relaxed complex scheme (RCS), to design small molecule
libraries for docking, and to perform statistical analyses on the virtual screening results. Methods include
equidistant spacing, RMSD-based clustering, and QR factorization protocols for ensemble generation and
ROC analysis for ensemble selection.
Key words Relaxed complex scheme, Ligand filtering, Protein flexibility, QR factorization, RMSD-based
clustering, ROC analysis
1 Introduction
It is widely accepted that proteins do not exist in solution as a single
rigid structure but rather as an ensemble of conformations [1–3].
The atomic fluctuations that give rise to this ensemble range from
small rotations of an individual amino acid methyl group to much
larger fluctuations concerted between groups of residues and the
protein backbone, loops, or domains. The necessity to consider
alternate conformations, including subtle structural changes in a
binding pocket, is highlighted by the difficulties reported for accurate ranking in cross-docking exercises [4–7]. Put another way, no
single structure can represent the binding modes for all the competent inhibitors of a drug target. As computational chemists are often
seeking new inhibitors that bind to receptor pockets, these fluctuations need to be accounted for in our computational methods [8–11].
Modeling these protein ensembles thus provides an opportunity and
has demonstrated success in discovering novel and/or selective
inhibitors that bind to subpockets or alternate conformations not
obvious in the snapshot of a given crystal structure [12–15].
Andreas Kukol (ed.), Molecular Modeling of Proteins, Methods in Molecular Biology, vol. 1215,
DOI 10.1007/978-1-4939-1465-4_20, Springer Science+Business Media New York 2015
2 Materials
Software used:
Classical and accelerated MD simulations were performed with
NAMD 2.9 [41, 42], using the Amber99SB force field [43]. The
docking programs used are Schrödinger Glide v. 6.0, 2013 with the
SP scoring function (Schrödinger, LLC, New York, New York, www.schrodinger.com) [44, 45] and Autodock Vina version 1.1.2
(http://vina.scripps.edu) [46]. Statistics calculations were performed in MATLAB version R2011b 7.13.0.564 (MathWorks,
Natick, MA, www.mathworks.com/matlab). Figures of the protein
conformations were produced using VMD 1.9.1 [47].
The starting materials for ensemble docking are:
1. A crystal structure, an NMR ensemble, or a homology model
used for MD simulation.
2. A set of ligand files formatted properly for the docking program used.
3. The docking program.
In this case, the single 1.20 Å resolution crystal structure for
TbREL1 (PDB ID: 1XDN) [48] was used.
Multiple ligand files were compiled for docking: 121 and 40
known binders for TbREL1 from the Drug Discovery Unit (DDU)
Diversity compounds and Kinase set [49], respectively; the ATP
ligand extracted from the TbREL1 co-crystal structure [48]; and
additional known binders found in previous virtual screening
efforts for TbREL1 [39, 40]. We used these known actives to
generate a set of nonbinders (decoys) using DUD-E [50].
3 Methods
Here, we outline the steps in selecting a representative receptor
ensemble for docking from a dataset of molecular dynamics trajectories using the RCS, the preparation of the ligand files for docking, and the statistical analysis of the virtual screening results.
Virtual screening efforts incorporating receptor flexibility have
previously been reported for TbREL1 [39, 40, 51]. Discovery of
inhibitors towards RNA editing enzymes in trypanosomatid pathogens has also been reviewed recently [52].
3.1 Generating a Conformational Ensemble
A set of receptor structure coordinates is a prerequisite to generating an ensemble and can be derived from X-ray crystallography or
NMR spectroscopy. If no receptor structure is available for the
target, a homology model based on a structure of a related protein
can be used, preferably with a high sequence identity [53–56].
In the event that several crystal structures of the biomolecular
target are available, these should be incorporated, as they will often
parameter file (prmtop) and a script that reads a number of commands for ptraj. We instructed ptraj to read in (using the trajin
command) frames 1–10,000 (the number of frames in the simulated
trajectory), but to read only every 1,000th frame and output these
(using the trajout command), resulting in ten frames with an equidistant spacing of 10 ns. An advantage of ptraj
is that it can align the protein structures to a reference
structure, i.e., the crystal structure, using the rms command. In the
input file below for ptraj, the file system.inpcrd contains the crystal
structure conformation, which is loaded and used as the reference
structure for aligning. The output pdb files are used for docking.
(Script 1)
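The frame arithmetic described above can be checked with a few lines of Python. This is a minimal sketch and not the authors' ptraj script; the function names are hypothetical, and it assumes that frames are numbered from 1, that a stride of 1,000 selects frames 1, 1001, ..., 9001, and that frames were saved every 10 ps (see Note 3).

```python
def equidistant_frames(first, last, stride):
    """Return the 1-based frame numbers first, first + stride, ... up to last."""
    return list(range(first, last + 1, stride))


def spacing_ns(stride, save_interval_ps):
    """Time between two selected frames, in nanoseconds."""
    return stride * save_interval_ps / 1000.0


# Assumed values: 10,000 saved frames, every 1,000th frame kept,
# one frame saved every 10 ps.
frames = equidistant_frames(1, 10_000, 1_000)
print(len(frames), "frames:", frames)
print("spacing:", spacing_ns(1_000, 10.0), "ns")
```

Under these assumptions the selection yields ten snapshots separated by 10 ns, matching the numbers quoted in the text.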
RMSD-based clustering is fairly common [68, 69], yet it has a number of variables to be determined by the user, which should be
chosen based on the problem at hand. There are too many variations to outline here; instead, we refer the reader to an excellent
paper reviewing the possibilities [69] and provide a simple example
of a commonly used method. When developing an ensemble for
virtual screening, in which the goal is to capture the most diverse
conformations of the active site, the RMSD calculation should be
performed with a small set of atoms or residues that line the active
site or are within a certain distance from the ligand (if one is bound
in the protein). However, if one is interested in larger scale conformational changes, as one may encounter with large loops near an
active or allosteric ligand site, then the protein backbone atoms
are most likely a better selection. Here we have chosen the former approach (see Note 5 for the residue selection). The RMSD-based clustering was performed in ptraj [75], although several
other programs are available for this task. As a first step, the trajectory snapshots were aligned to the crystal structure conformation based on the same residue selection, but only taking the
backbone CA atoms into account. The remaining variables refer
to the different variations of RMSD-based clustering, which have
been described in great detail by Shao et al. [69]. In ptraj we
have used the cluster command, employing the average-linkage
(Script 2)
For the classical MD simulations, we have excluded clusters with
populations lower than 5% of the trajectory, although in some
cases sparsely populated states may also be viable for discovery. This
results in seven clusters. For the accelerated MD simulations, we
have chosen to keep all ten clusters, because we expect a much better conformational sampling, while not necessarily visiting the
same conformation as often as in the case of conventional MD.
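To illustrate the idea behind average-linkage clustering followed by a population cutoff, the following sketch clusters a precomputed distance matrix in plain Python. It is not ptraj's implementation and not the authors' Script 2; the function names, the toy one-dimensional "coordinates", and the choice of three clusters are assumptions made for illustration. In practice the matrix would hold pairwise RMSD values for the selected active-site atoms.

```python
def average_linkage(dist, n_clusters):
    """Agglomerative clustering on a symmetric distance matrix.

    Starts with one cluster per snapshot and repeatedly merges the two
    clusters with the smallest average pairwise distance until
    n_clusters remain. Returns a list of clusters (lists of indices).
    """
    clusters = [[i] for i in range(len(dist))]

    def avg(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = avg(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


def drop_sparse(clusters, n_frames, min_fraction=0.05):
    """Keep only clusters holding at least min_fraction of all frames."""
    return [c for c in clusters if len(c) / n_frames >= min_fraction]


# Toy data (hypothetical): two well-populated states and one outlier.
coords = [0.1 * i for i in range(19)] + [5.0 + 0.1 * i for i in range(20)] + [20.0]
dist = [[abs(a - b) for b in coords] for a in coords]

clusters = average_linkage(dist, n_clusters=3)
kept = drop_sparse(clusters, n_frames=len(coords))
print(sorted(len(c) for c in clusters), "->", sorted(len(c) for c in kept))
```

With a 5% cutoff the single-member cluster is discarded, which mirrors the treatment of the classical MD clusters described above; for the accelerated MD clusters the filter would simply be skipped.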
An alternative ensemble clustering methodology can also be
performed using QR factorization, which enables one to efficiently
reduce the number of MD snapshots to a minimal set without compromising the diversity in the geometric characteristics of
the binding pocket [39]. QR factorization is a mathematical technique that performs repeated Householder transformations with
column pivoting to reorder the ensemble of structures such that
they are arranged with increasing linear dependence. The steps
required for preprocessing the MD trajectory files for QR factorization are provided in a tutorial at the NBCR Web site listed below.
http://nbcr.ucsd.edu/wiki/index.php/SI2011_track3_CADD_QR_factorization_tutorial
The processed files can then be submitted to the publicly available server on the same Web site listed below.
http://nbcr-222.ucsd.edu/opal2/CreateSubmissionForm.do?serviceURL=http://localhost:8080/opal2/services%2Ftrajqr_1.0
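The reordering produced by QR factorization with column pivoting can be sketched in plain Python. The example below is not the code behind the server mentioned above. It uses pivoted Gram-Schmidt orthogonalization rather than Householder transformations, which selects the same pivot order in exact arithmetic, and the function name and toy vectors are hypothetical. Each vector stands in for the flattened, aligned coordinates of one snapshot.

```python
import math


def pivoted_order(vectors, tol=1e-10):
    """Order vectors as QR factorization with column pivoting would.

    At each step the vector with the largest remaining norm is selected,
    and its direction is projected out of all remaining vectors.
    Returns a list of (index, residual_norm) pairs.
    """
    residual = {i: list(v) for i, v in enumerate(vectors)}
    order = []
    while residual:
        norms = {i: math.sqrt(sum(x * x for x in v)) for i, v in residual.items()}
        pivot = max(norms, key=norms.get)
        norm = norms[pivot]
        order.append((pivot, norm))
        q = residual.pop(pivot)
        if norm > tol:
            q = [x / norm for x in q]
            for v in residual.values():
                dot = sum(a * b for a, b in zip(q, v))
                for k in range(len(v)):
                    v[k] -= dot * q[k]
    return order


# Toy "snapshots" (hypothetical): vector 1 is nearly redundant with vector 0.
snapshots = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.5]]
order = pivoted_order(snapshots)
print(order)
```

Snapshots that add little new geometric information end up last with a small residual norm, so truncating the ordered list at a chosen threshold gives a reduced, nonredundant set.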
Structure extraction techniques based on the shape and chemical properties of the active site are also emerging [70–74, 79, 80],
but not used here (see Note 6). Furthermore, structural water
molecules in the active site should be considered (see Note 7).
The final protein conformations used for VS were extracted using
ptraj as detailed above in scripts 1 and 2 and output in the pdb format.
The pdb files were then converted to the pdbqt format for Autodock
Vina (see Note 8). The active site center was defined by X, Y, and Z
coordinates, and a fixed cubic box was used to enclose the active site (see
Note 9). For Glide docking, a receptor grid file was generated using
the XGlide script provided by Schrödinger (see Note 10). The PDB
files were used as input. In total, 25 protein conformations were used
for docking: eight different setups of the crystal structure, varying
the inclusion of structural water molecules, and 17 structures from ATP-bound simulations, seven of which were extracted by RMS clustering
of conventional MD trajectories and the remaining ten from
accelerated MD trajectories. We first explored which crystal structure
setup was best able to discriminate binders from nonbinders and then
added this one setup to the 17 clusters from MD. Thus, 18 protein
conformations were included for ensemble statistics, resulting in the
evaluation of 262,143 different ensembles.
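The number of ensembles quoted above is consistent with counting every non-empty subset of the 18 conformations, 2^18 - 1 = 262,143. The sketch below shows that count and one way to enumerate the subsets; the function names and conformation labels are hypothetical, and treating an ensemble as an unordered subset is an assumption.

```python
from itertools import combinations


def count_ensembles(n_conformations):
    """Number of non-empty subsets of n receptor conformations."""
    return 2 ** n_conformations - 1


def iter_ensembles(conformations):
    """Yield every non-empty ensemble, smallest ensembles first."""
    for size in range(1, len(conformations) + 1):
        yield from combinations(conformations, size)


print(count_ensembles(18))

# Small illustration with three hypothetical labels.
names = ["xtal", "cMD_1", "aMD_1"]
print(list(iter_ensembles(names)))
```

Because the count doubles with every added conformation, exhaustive evaluation is practical for 18 structures but quickly becomes expensive for larger sets.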
3.3 Ligand Library Construction and Preparation for Docking
3.5 Generating Docking Statistics
When one has a set of known binders and known nonbinders for a
target, statistical methods can be used to assist in the selection of
Fig. 2 (a) Probability distribution function of binding free energies of binders and nonbinders in a virtual screening
experiment represented with the solid and dashed lines, respectively. (b) ROC plot corresponding to the virtual
screening in panel a. ROC plot for a random selection is also depicted with a dashed line for comparison
in which erfinv is the inverse error function and the standard error is
calculated using the formula [91, 92]:
\[ SE_{\mathrm{calculated}} = \sqrt{\frac{\sigma_{NB}^{2}(\mathrm{AUC})}{N_{NB}} + \frac{\sigma_{B}^{2}(\mathrm{AUC})}{N_{B}}} \]
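As a rough numerical illustration, the standard error above can be evaluated and used to ask whether an AUC differs from the random value of 0.5. This is a sketch under stated assumptions: the square root over the summed terms, the normal approximation used for the one-sided p-value, the function names, and all input numbers are illustrative choices rather than values or code from this chapter.

```python
import math


def se_calculated(var_nb, n_nb, var_b, n_b):
    """Standard error of the AUC from the nonbinder and binder variances."""
    return math.sqrt(var_nb / n_nb + var_b / n_b)


def p_better_than_random(auc, se, auc_random=0.5):
    """One-sided p-value under a normal approximation (an assumption here)."""
    z = (auc - auc_random) / se
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical variances and set sizes.
se = se_calculated(var_nb=1.0, n_nb=100, var_b=0.5, n_b=50)
print("SE =", round(se, 4))

# AUC of 0.7 with a standard error of 0.1, tested against random (0.5).
print("p =", round(p_better_than_random(0.7, 0.1), 5))
```

An AUC two standard errors above 0.5 corresponds to a one-sided p-value of roughly 0.023 under this approximation.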
Fig. 3 (a) Probability distribution function of AUC for a real virtual screening protocol with a mean value of 0.7 and
a standard deviation of 0.1 (depicted with a solid-line curve), and for a random selection with a mean value of 0.5
and a standard deviation of 0.1 (depicted with a dashed-line curve). The shaded area corresponds to the p-value
to evaluate whether the real virtual screening protocol performs better than random. (b) Probability distribution
function of AUC for a set of two real virtual screening experiments with a mean value of 0.2 and a standard
deviation of 0.1 (depicted with a solid-line curve), and for a set of two identical virtual screening experiments with
a mean value of 0.0 and a standard deviation of 0.1 (depicted with a dashed-line curve). The shaded area
corresponds to the p-value to evaluate whether the two real virtual screening experiments perform identically
Fig. 4 (a) The crystal structure of TbREL1 with ATP bound (1XDN.pdb), also highlighting the magnesium ion and
three water molecules bound deep in the protein that interact with ATP. Black markers highlight important
interactions: the E60-R111 salt-bridge, Mg2+-triphosphate tail of ATP, E86 and V88 backbone hydrogen bonds
to ATP, Y58-D10 hydrogen bond, D210-R288 salt-bridge, R288-water-N7 hydrogen bond, and stacking of F209
and the adenosine moiety. K87 is highlighted, as it is the catalytic residue that gets adenylated when attacking
Pα in ATP. (b) A setup of the crystal structure with one specific water molecule at the deep end of the pocket
that has been shown to improve the VS enrichment, (c) a representative structure from conventional MD, and (d) a representative structure from accelerated MD
appear in a separate publication. The mild enrichment demonstrated by this example could be a result of many factors, the most
likely being the challenging test case we posed to these docking protocols. Here we have a set of known binders with low affinity
(10 μM < IC50 < 100 μM) and have asked these programs to distinguish them from a set of DUD-E ligands with similar physicochemical
and topological properties, a task that continues to be an important area of research in computer-aided drug design.
4 Notes
1. The magnesium parameters can be downloaded from the
Bryce group AMBER parameter database (http://www.pharmacy.manchester.ac.uk/bryce/amber) where we have contributed the parameter files with permission from the parameter
developers [61].
2. We used dual-boost accelerated MD in NAMD [41, 42],
which applies a boost to the entire potential energy, and also a
boost to the dihedral potential. For each boosting term two
parameters are set: the energy threshold (E) and a tuning
parameter (α), which determines the depth of the potential
energy well. To determine the boost energy a short (15ns)
classical MD simulation is performed, and the average of the
POTENTIAL and DIHED terms in the NAMD output are
calculated. The boost parameters are then determined according to the following formulas. The factor 4 in the dihedral
terms is not a fixed factor; values of 3.5–6 have been reported
in the literature [58, 59, 94]. Since we have used the TIP4P-ew
water model [62], we have counted the extra particle on the
water model as an extra atom.
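\[ E(\mathrm{dihed}) = \mathrm{DIHED}_{\mathrm{NAMD}} + 4 \times \#\mathrm{residues} \]
\[ \alpha(\mathrm{dihed}) = \tfrac{1}{5} \times 4 \times \#\mathrm{residues} \]
\[ E(\mathrm{total}) = \mathrm{POTENTIAL}_{\mathrm{NAMD}} + 0.16 \times \#\mathrm{atoms} \]
\[ \alpha(\mathrm{total}) = 0.16 \times \#\mathrm{atoms} \]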
3. Here we have extracted conformations every 10 ps. This is
specified in NAMD using the DCDfreq variable.
4. Equidistant means that there is an even spacing in time
between the snapshots extracted from the simulations. The
ideal is to select a number that will allow diverse conformations of the binding site; however, such a number is highly
system specific. Another aspect to take into account is not to
set this number too low, as this will lead to too many protein
conformations, which will require more computational
resources for the VS, and adding too many protein conformations is not recommended [31, 66].
5. The RMS clustering included the following residues: Tyr58,
Glu60, Glu86, Lys87, Asn92, Arg111, Asp159, Phe209,
Asp210, Glu283, Val286, Arg288, Arg292, Lys307, and
Arg309. These have all previously been highlighted in structural analyses as belonging to the active site [48, 64].
6. Osguthorpe et al. have created an ensemble based on shape
diversity of the active site that had an enriching effect on their
VS [70, 71]. The MDpocket utility found in the Fpocket algorithm
will calculate the volume and specific chemical properties of a
specified pocket from each snapshot of an MD simulation [72,
73]. Subsequent clustering can then be performed on these
data to extract a representative set of structures describing variability in the active site. Alternatively, the FTMap algorithm
floods the protein surface with a set of small organic molecules
and calculates an interaction energy, thereby predicting druggable hotspots in the protein [79, 80]. This algorithm has
recently been extended for the analysis of MD trajectories
[95]. This method can also be useful in identifying, visualizing, and characterizing new subpockets of the target site.
7. There are three water molecules in the cavity, wat1, wat2, and
wat3, following the nomenclature of our previously published
work [64]. Thus, we have made the following combinations:
no waters, wat1, wat2, wat3, wat1+wat2, wat1+wat3,
wat2+wat3, and wat1+wat2+wat3. In total, there are eight different receptor configurations. In recent years programs for the
analysis of structurally resolved water molecules have been
developed [16]. Schrödinger has developed the WaterMap
framework to explore and exploit water molecules bound inside
the ligand binding site in drug discovery [96, 97]. Molegro
Virtual Docker [98] has developed a docking algorithm with
attached water molecules that are then retained or displaced in
the docked pose based on energy contributions [99].
8. This conversion was done with the utility prepare_receptor4.py
in Autodock, which takes a pdb file as input and outputs a
pdbqt file.
9. The center was specified as x=41.1100, y=34.9382, and
z=35.8160, based on the ATP binding site. The box was
defined as a cube with the edge length set to 25 Å. As all the
structures used were previously aligned to the crystal structure,
the cubic box should encapsulate the active site in all the
protein conformations.
10. The script is available on the Schrödinger Web site script center (http://www.schrodinger.com/scriptcenter/) and is
already preinstalled in Maestro. The script can be used to easily
19. The central limit theorem (CLT) states that if enough independent measurements of a property are performed on the
same system, the average property will be distributed like a
Gaussian. The center of the CLT curve will be the true mean,
and its width will change with the variance value.
20. If methods A and B are combined, the standard deviation for
this new method is \( \sigma_{A+B} = \sqrt{\sigma_{A}^{2} + \sigma_{B}^{2}} \).
21. In this example, the best docking score that each ligand gets
among the specified receptor conformations is picked.
Alternatively, one could pick the average of the docking scores
of each ligand for the specified receptor conformations. Or one
could choose to compute a weighted-average of docking scores
using the population percentages of each cluster if the receptor
conformations are extracted by RMSD-based clustering.
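The three ways of combining docking scores across an ensemble described in Note 21 can be written compactly. This is a minimal sketch, not the analysis code used for this chapter; the function name, the example scores, and the cluster populations are hypothetical, and it assumes that a more negative docking score is better.

```python
def ensemble_score(scores, mode="best", weights=None):
    """Combine one ligand's docking scores over several receptor conformations.

    scores  -- docking scores, one per conformation (more negative = better)
    mode    -- "best", "mean", or "weighted"
    weights -- cluster populations, required for mode="weighted"
    """
    if mode == "best":
        return min(scores)
    if mode == "mean":
        return sum(scores) / len(scores)
    if mode == "weighted":
        if weights is None or len(weights) != len(scores):
            raise ValueError("weights must match scores")
        return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical scores for one ligand against three cluster representatives
# with populations of 50%, 30%, and 20% of the trajectory.
scores = [-7.0, -8.5, -6.0]
populations = [50.0, 30.0, 20.0]

print(ensemble_score(scores, "best"))
print(ensemble_score(scores, "mean"))
print(ensemble_score(scores, "weighted", populations))
```

Taking the best score rewards a ligand for fitting any single conformation well, whereas the population-weighted average favors ligands that score well against the most frequently visited states.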
Acknowledgements
This work was funded in part through the NIH Director's New
Innovator Award Program DP2-OD007237 and the National
Science Foundation's XSEDE supercomputer resources grant
LRAC CHE060073N to R.E.A. Support from the National
Biomedical Computation Resource (P41 GM103426), the Center
for Theoretical Biophysics, and UCSD Drug Discovery Institute is
gratefully acknowledged. J.S. thanks the Alfred Benzon Foundation
for a generous postdoctoral fellowship.
References
1. Frauenfelder H, Sligar SG, Wolynes PG
(1991) The energy landscapes and motions of
proteins. Science 254(5038):1598–1603
2. Boehr DD, Nussinov R, Wright PE (2009)
The role of dynamic conformational ensembles
in biomolecular recognition. Nat Chem Biol
5(11):789–796. doi:10.1038/nchembio.232
3. Forman-Kay JD (1999) The dynamics in
the thermodynamics of binding. Nat Struct
Biol 6(12):1086–1087. doi:10.1038/70008
4. Cross JB, Thompson DC, Rai BK, Baber JC,
Fan KY, Hu Y, Humblet C (2009)
Comparison of several molecular docking
programs: pose prediction and virtual screening accuracy. J Chem Inf Model 49(6):1455–1474. doi:10.1021/ci900056c
5. Cheng T, Li X, Li Y, Liu Z, Wang R (2009)
Comparative assessment of scoring functions
on a diverse test set. J Chem Inf Model
49(4):1079–1093. doi:10.1021/ci9000053
6. Armen RS, Chen J, Brooks CL 3rd (2009)
An evaluation of explicit receptor flexibility in
Chem Inf Model 47(2):488–508. doi:10.1021/ci600426e
101. Sheridan RP, Singh SB, Fluder EM, Kearsley
SK (2001) Protocols for bridging the
peptide to nonpeptide gap in topological
similarity searches. J Chem Inf Comput
Sci 41(5):1395–1406. doi:10.1021/ci0100144