FightAIDS@Home - Phase II
Introduction
FightAIDS@Home Phase II is a milestone in the collaboration between
Dr. Arthur Olson and the
Molecular Graphics
Laboratory at the Scripps Research Institute and Dr. Ronald M. Levy's
group at Temple University. Dr. Olson initiated the largest
HIV virtual screening effort ever using his computational molecular
docking software, AutoDock, over a decade ago using
IBM's distributed volunteer computing grid,
the World Community Grid (WCG), called
FightAIDS@Home.
As director of the HIV Interaction and Viral Evolution
(HIVE) Center, an NIH funded HIV collaborative research
center, the virtual screening results play a significant role
informing a portion of the Center's research. The HIVE Center
comprises a multidisciplinary team of scientists whom together aim to
elucidate the structural and dynamic relationships between interacting
HIV macromolecules in an effort to design future therapeutics to combat
HIV and its evolution of drug resistance.
Dr.
Ronald Levy brings to the HIVE Center over 30 years of
leadership and experience in the development and application of
molecular dynamics simulations to study the structure and dynamics of
proteins and their complexes.
During FightAIDS@Home
Phase I, millions of compounds have been screened against HIV
related protein targets, and thousands of potential drug candidate
compounds have been identified. In Phase II, we plan to refine the
selection of these thousands of possible compounds by performing more
detailed computational experiments using our molecular dynamics engine IMPACT,
and the Binding Energy Distribution Analysis Method (BEDAM).
The introduction of BEDAM simulations into FightAIDS@Home Phase II
presents an enormous opportunity to refine and enrich the results from
Phase I, but also presents a technical challenge as the constraints on
the simulations running on the WCG are much different
than the constraints on the simulations when running on more
conventional computational resources.
The first experiments of FightAIDS@Home Phase 2 seek to achieve two
goals: first, to confirm that the new simulation schema is working as
intended and gives sufficiently reliable results compared to
traditionally run simulations; second, to demonstrate that using BEDAM
in conjunction with AutoDock
results in better predictions than using AutoDock or BEDAM alone.
There exists a symbiotic relationship between docking
and more computationally demanding free-energy methods like
BEDAM—without docking, the computing time required to score thousands of
ligands
with free energy methods is intractable, and without free energy
methods, the relationship between docking scores and experimental
binding affinities remains
much more empirical and less accurate.
Follow the following links to learn more about FightAIDS@Home Phase 1,
IBM's World Community Grid, and general information regarding
FightAIDS@Home Phase 2. To learn more detailed information of how
FightAIDS@Home Phase 2 works, continue reading this page.
If you haven't already, click on the following link to download the
World Community Grid software to run FightAIDS@Home Phase 2 (FAH2):
and make sure that FightAIDS@Home - Phase 2 is checked on your My
Projects page. You can also find the
minimum
system requirements and
installation
and registration instructions by following their links.
A drug candidate binding the LEDGF allosteric
binding pocket of HIV Integrase.
Overview of Molecular Dynamics and BEDAM
Physical free energy models of binding
One key aspect of drug discovery is to identify compounds which bind
strongly and specifically to the target receptor, and therefore there
exists considerable interest in the development of computational
models to predict the strength of protein-ligand interactions.
Thermodynamically, this interaction strength between ligand molecule
and receptor is measured by the binding free energy, and many computer
models aim to predict the protein-ligand binding free energy by
simulating the interactions between protein and ligand in a bound
complex. Docking methods use empirical scoring functions to
estimate binding free energies in order to distinguish between between
ligands that bind strongly from ligands that bind weakly or not at all.
On the other hand, physical free energy models, which use physics
based effective potentials, seek to
compute accurate protein-ligand binding free energies based on the
principles of statistical
mechanics. Unlike docking methodologies, many of these methods
are exceptionally computationally intensive and are highly dependent
on both accurate modeling of interaction force fields (ligand-ligand,
ligand-solvent, protein-protein, protein-solvent, and protein-ligand)
and efficient sampling of all rotational and translational, internal
and external, degrees of
freedom of the ligand and protein.
The statistical
mechanics theory of binding provides a prescription to compute
the binding free energy from first principles, which can be
implemented in various ways. One method developed in the Levy lab is
called the binding energy distribution analysis method
(BEDAM),
which runs using the Levy group's molecular dynamics engine
IMPACT.
This methodology uses Hamiltonian
replica exchange
molecular dynamics simulations of the
ligand-receptor complex in an implicit solvent model to construct a free energy
path that connects the unbound and bound states of the ligand with the
receptor.
Expand the following subsections for in-depth descriptions of how
BEDAM molecular dynamics simulations work!
A molecular dynamics (MD) simulation is a computer simulation of how
atoms and molecules move according to Newton's
laws of motion. Using Newton's second law "F = ma", the
positions and velocities of thousands
of atoms are continuously updated by computing the forces (and therefore the accelerations)
acting on each atom. Each simulation step is typically on the order of
a femtosecond (10^-15 second), and therefore a complete simulation, or
trajectory, typically consists of
millions of steps. To simulate a system for a millisecond takes a
trillion steps with a 1fs time step!
For every time step in the simulation, there are two primary
computations that occur:
- given the current positions of all the atoms, compute the forces
acting on each atom
- using the recently calculated forces, update the current
positions of the atoms (by integrating the equations of motion)
Step 1 is typically the most computationally intensive part of the
simulation because, in principle, one must consider the interactions
between all pairs of atoms, which means computing the distances
between each pair and considering all types of interactions between
those two atoms. Although many algorithms have been designed to
reduce the number of force calculations performed at each step, most
of the computer cycles used in an MD simulation are used to compute
forces.
Because these simulations involve computing a massive system of
coupled equations using floating point
arithmetic many, many times,
two trajectories starting from identical starting conditions will
diverge exponentially quickly. One consequence of this is that we
cannot use duplicate or quorum simulations to validate the results we
draw from the computational experiment. However, the fact that two
simulations with almost identical initial conditions diverge is not
evidence that there is little to gain from such simulations. It's
important to understand that MD simulations are used to predict
average behaviors and quantities of the system being simulated, and in
this sense, MD differs fundamentally from other types of numerical
trajectory-computing simulations, like computing the path of a
satellite or space shuttle through space where knowing the ``true''
trajectory is critical. There is substantial numerical evidence that
shows MD simulations of many-body systems of sufficient length are
representative of the ``true'' trajectories.
Although we simulate the motion of many individual atoms coupled
together, the quantities in which we are interested, temperature,
pressure, binding affinities, etc., do not depend on the specifics of
every particle, but instead on the statistical, or average, properties
of the molecular motion. It is the field of statistical mechanics
which provides the framework for using microscopic molecular dynamics
simulations to predict the macroscopic thermodynamic and kinetic
behavior of these complexes, like the binding affinities of ligands to
protein receptors and the kinetics of protein-ligand binding.
Ultimately there are two principles that determine the overall
usefulness of a MD simulation: sampling efficiently and forcefield
accuracy. Generally, sampling describes the process of selecting
elements from a population such that properties of those elements are
representative of the entire population. In terms of MD trajectories, sampling describes the
notion that the conformations,
energies, etc. of the trajectory are representative of the average
properties of the set of all trajectories of that system. The more
efficiently a simulation can sample the conformational space (set of
all 3D arrangements of atoms) or energy landscape (description of
energy values given positions/velocities), the more value a simulation
of a fixed length is, or put another way, better sampling decreases
the amount of simulation time needed to estimate quantities of
interest.
There are many types of sampling algorithms in molecular dynamics
designed to accelerate the sampling of particular degrees of freedom
during the simulations. In typical BEDAM simulations the sampling of
ligand-receptor complexes, and thus binding energy estimates, is
accelerated using a technique known as Hamiltonian Replica Exchange.
This technique takes advantage of parallel computing, where several
copies (called ``replicas'') of the
simulation are run in parallel. The Hamiltonian is the energy function,
and in BEDAM simulations has a component related to the coupling
between receptor and ligand which can be tuned to on, off, or
somewhere in between. Each replica in the simulation has this
receptor-ligand coupling dialed into a different strength and all
replicas are launched in parallel. Often during the simulation, the
replicas all stop and exchange receptor-ligand
coupling strengths and then the trajectories continue on; this is
where the name Hamiltonian exchange comes from in Hamiltonian Replica
Exchange.
An example of how the Hamiltonian parameter
lambda changes over time due to (left) replica exchange and (right)
lambda scheduling.
Implications for FAH2: The process of replica exchange
requires inter-replica communication, something that is often possible
on supercomputers and clusters, but is not well matched to the WCG.
As an alternative, we have implemented a different sampling protocol
called ``lambda scheduling'' where,
instead of exchanging Hamiltonian parameters with other replicas,
replicas will cycle through the various values of receptor-ligand
coupling on a precise schedule. Although it is not our standard way
of computing binding free energies, the methodology is well-founded in
non-equilibrium
statistical mechanics. Doing so complicates the statistical
mechanics methods we use to compute binding free energies (because
lambda scheduling is a non-equilibrium
process where as replica exchange is an
equilibrium process), but it
satisfies the constraints of working on the grid. Our first order of
business on the grid is testing this sampling paradigm to insure we
can estimate equilibrium properties like protein-ligand
binding affinities properly.
The other component determining the usefulness of MD simulations is
that of the accuracy of the forces being
calculated. Because the force computation is the most computationally
intensive portion of the simulation, many algorithms have been
developed to efficiently calculate all the necessary interactions,
each with their own advantages. One major decision when beginning a
molecular dynamics simulation concerns how to treat interactions with
water (the solvent) as water-protein
interactions drive structural organization at the protein (
solute) surface.
Typically, one chooses to model water explicitly or implicitly.
Explicit solvation simulations place their system of interest in a
bath of water molecules, enough to fully envelope the protein or
protein-ligand complex as well as fill any cavities, and the motions
of the water molecules are explicitly simulated in addition to the
rest of the atoms in their system. Alternatively, instead of modeling
the individual water molecules, implicit solvation simulations
represent the water as a bulk dielectric
and estimates of the solute-solvent interactions are approximated by
interactions with this dielectric. While hybrid solvation models
exist, they are outside the scope of this article.
Implications for FAH2: Implicit solvent simulations have
several advantages over explicit representations. Namely, computation
time is lower due to many fewer interaction calculations, the need to
sample and equilibrate many solvent conformations is eliminated, and
noise due to solvent structure is eliminated. Due to our lab's
experience with and development effort towards implicit solvation
models (namely, AGBNP, AGBNP2, AGBNP3),
the simulations running on the World Community Grid model solvent
implicitly. One relevant drawback of using an implicit solvation
model is that the force calculations are much harder to vectorize,
meaning that a parallelized version of IMPACT and BEDAM suitable for
running on GPUs is not currently
available. However, this is an active area of funded research of a
close collaborator,
Emilio Gallicchio.
Running BEDAM on IBM's WCG
Differences between Phase I and Phase II
For FAH2, the concept of a batch has changed from 1,000-10,000
ligand-receptor complexes per batch to 1 complex per batch, and each
workunit in that batch is running thousands of simulations of
that complex which we call replicas.
These replicas differ from each other in important ways, such as
different energy function parameters and different starting
conformations of the ligand and receptor.
Another key difference between Phase I and Phase II is that the
simulation corresponding to one replica is broken up into several
workunits that must be completed serially, i.e., the output of one
workunit serves as input for the next workunit. Each batch generates
tens of thousands of workunits that together form the simulations of
hundreds or thousands of replica trajectories.
First computational experiments
Additionally, the distributed and heterogeneous nature of the volunteer
grid imposes some constraints on how the simulations can be run.
We're in the process of learning how best to utilize this enormous
resource by running each batch with two different simulation schema
and several different analysis protocols. Doing so will allow us to
refine our future simulation schema in order to maximize the impact of
donated computing cycles. As volunteers ourselves, we do not want to
see wasted computing cycles and are working with the constraints
imposed by our dynamics engine, our forcefield model, and the nature
of the WCG to find the optimal computing methodology for us and the
volunteers.
The two simulation protocols currently being tested employ sampling
techniques that differ from the way these simulations are run on
clusters and supercomputers. The first is what we call
``lambda scheduling'' or ``lambda
cycling'', which is where throughout a replica's simulation, key
parameters associated with the coupling between ligand and receptor
are cycled through a known set of values. This greatly helps to
accelerate the sampling of the conformational space. Simulating
multiple replicas with the same parameter changing schedule helps
accelerate this sampling further.
The second simulation scheme is independent sampling in which no two
replicas will ever have the same combination of energy function
parameters, initial structures, etc., and these parameters are
constant throughout all workunits pertaining to each replica's
trajectory. With both simulation schema, long simulation times are
needed. We have developed analysis techniques which can combine the
results from all of these different replica trajectories to produce
estimates of the protein-ligand complex's binding free energy.