Infinite hierarchies of moment equations appear in many non-linear systems, including biochemical reaction networks Johnson et al. (2015), quantum optics Schack and Schenzle (1990), and kinetic theory of gases Levermore (1996), and more generally in the analysis of stochastic Kuehn (2015) and partial Frewer (2015) differential equations.
Spatial reaction networks are a typical example of such systems. Differential equations for moments derived from the chemical master equation (CME) Gardiner (2009) usually depend on longer range spatial correlations. Moment closure approximations are widely used in chemical and biochemical modeling applications to obtain a solvable system by expressing higher order terms through lower order ones (see Johnson et al. (2015) for a review on moment closure methods for reaction networks).
Here a differential equation model is introduced and learned, which can be parameterized either from the relevant physics or otherwise using an appropriate set of basis functions. The learning problem can be formulated in continuous space in the form of a partial differential equation (PDE) constrained optimization problem. This offers a natural approach to introduce PDEs into machine learning. In the Supplemental Material, we review the general variational problem for DBDs, and make explicit the connection to the work in this paper.
The learning algorithm shares many advantages with the well-known Boltzmann machine learning algorithm Ackley et al. (1985). For example, in generative modeling approaches such as variational autoencoders (VAEs), the distribution over latent variables is chosen (typically Gaussian), while in DBDs it is learned from data. Further, the energy function in DBDs can be used to interpret the reduced models learned Ernst et al. (2018). Other approaches for learning temporal data, such as recurrent neural networks (RNNs), do not offer a reduced model probability distribution, and do not give insight into the learned moment closure approximation as done in Section V.
In this paper, we use the architecture of deep Boltzmann machines (DBMs) to train deep dynamic Boltzmann distributions for learning moment closure approximations. The learned model replaces long range spatial correlations with correlations involving latent variables, whose activations are learned from data. We show how the centering transformation Montavon and Müller (2012); Melchior et al. (2016) can be applied in the adjoint method Ernst et al. (2018); Cao et al. (2003) to derive a learning algorithm which does not require pre-training.
This paper is organized as follows: Section II introduces a simple example demonstrating moment hierarchies, Section III reviews DBMs and centered DBMs, Section IV presents the learning algorithm for centered dynamic DBMs, and Section V presents numerical examples.
II Moment closure in the Lotka-Volterra system
In this section, we review the appearance of a hierarchy of moments in a spatial probabilistic system. As an example, consider the Lotka-Volterra predator-prey system, described by the reactions:
where H and P denote the predator (hunter) and prey populations, and each reaction proceeds at a given rate.
In a well-mixed system where spatial effects are not included, the mean numbers of H and P obey the following ordinary differential equation (ODE) system:
These differential equations are solvable for given initial conditions.
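As a concrete illustration, the well-mixed ODE system can be integrated numerically. The sketch below uses forward Euler with the standard Lotka-Volterra form; the rate constants `k1` (predation), `k2` (hunter death), and `k3` (prey birth) are illustrative values, not those used in this paper.

```python
# Minimal sketch of the well-mixed Lotka-Volterra ODEs, integrated with
# forward Euler. Rate constants are illustrative, not the paper's values.
def lotka_volterra_euler(h0, p0, k1=0.01, k2=0.1, k3=0.1, dt=0.01, steps=1000):
    """Return the trajectory of mean hunter (h) and prey (p) counts."""
    h, p = h0, p0
    traj = [(h, p)]
    for _ in range(steps):
        dh = k1 * h * p - k2 * h   # predation gain minus hunter death
        dp = k3 * p - k1 * h * p   # prey birth minus predation loss
        h, p = h + dt * dh, p + dt * dp
        traj.append((h, p))
    return traj

traj = lotka_volterra_euler(h0=50.0, p0=100.0)
```

For small enough `dt`, the trajectory traces the familiar closed predator-prey orbits; forward Euler slowly drifts off the exact orbit, which is acceptable for this illustration.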
In a spatial setting, consider a lattice in the single-occupancy limit, where each lattice site is either empty or occupied by a single particle of species H or P. The mean numbers of H and P obey:
where the sums run over lattice sites and over neighboring sites, respectively. These equations do not close: a nearest-neighbor moment appears on the right-hand side. Further, the differential equation describing this term depends on yet higher-order ones, e.g. next-nearest neighbors, or three-particle correlations (if particles are allowed to diffuse, the diffusion constant would also appear here). Moment closure methods seek to obtain tractable approximations to this infinite hierarchy of differential equations.
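Schematically (in generic notation, not the paper's), such a hierarchy has the form of a chain of moment equations, each coupling to the next-higher order:

```latex
% Schematic form of a non-closing moment hierarchy: the equation for each
% moment involves the next-higher-order moment, ad infinitum.
\begin{align}
\frac{d}{dt}\langle x_i \rangle &= f\bigl(\langle x_i \rangle,\ \langle x_i x_j \rangle\bigr), \\
\frac{d}{dt}\langle x_i x_j \rangle &= g\bigl(\langle x_i x_j \rangle,\ \langle x_i x_j x_k \rangle\bigr), \qquad \dots
\end{align}
```

A closure approximation truncates this chain by expressing some moment on the right in terms of lower-order ones.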
II.1 Stochastic simulations of reaction-diffusion systems
As an alternative to the deterministic approach of solving differential equations such as (2), stochastic approaches can be used. For example, the Gillespie stochastic simulation algorithm (SSA) can be derived from the CME Gillespie (1977); Mjolsness (2013). Spatial adaptations of the Gillespie SSA can similarly be used to generate stochastic simulations of (3), including in continuous space Kerr et al. (2008). Here we adopt a simple lattice-based algorithm Ernst et al. (2018) in which particles hop on a grid, undergo unimolecular reactions following the Gillespie SSA, and undergo bimolecular reactions with some chosen probability upon encounters.
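A toy version of such a lattice scheme can be sketched as follows. The dictionary-based grid representation, the species labels, and the predation-on-encounter rule (hunter + prey becoming two hunters with probability `p_react`) are simplifications for illustration, not the paper's implementation.

```python
import random

# Toy sketch of a lattice reaction-diffusion sweep: each particle hops to
# a random neighboring site; a hunter-prey encounter converts both sites
# to hunters with probability p_react. A particle may occasionally be
# updated twice per sweep if it moves into a just-vacated site; this is
# acceptable for a sketch.
def step(lattice, n, p_react=0.5, rng=random):
    """One sweep over an n x n periodic lattice dict {(i, j): 'H' or 'P'}."""
    for (i, j) in list(lattice.keys()):
        if (i, j) not in lattice:          # particle already moved/consumed
            continue
        di, dj = rng.choice([(0, 1), (0, -1), (1, 0), (-1, 0)])
        tgt = ((i + di) % n, (j + dj) % n)
        if tgt not in lattice:             # free neighboring site: diffuse
            lattice[tgt] = lattice.pop((i, j))
        elif {lattice[(i, j)], lattice[tgt]} == {"H", "P"}:
            if rng.random() < p_react:     # predation on encounter
                lattice[(i, j)] = "H"
                lattice[tgt] = "H"
    return lattice
```

Unimolecular reactions (hunter death, prey birth) would be added as additional per-particle events drawn from their Gillespie rates; they are omitted here for brevity.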
To simulate the Lotka-Volterra system, we use a 2D lattice of sites with von Neumann connectivity and periodic boundary conditions. For reaction rates, we use , and bimolecular reaction probability corresponding to . The initial state has particles of hunter and prey randomly distributed, and simulations are run for timesteps with . Figure 1 shows snapshots of these simulations, which feature spatial patterns including waves.
We generate simulations as training data used in the rest of this paper. The mean numbers of hunter and prey are shown in Figure 1. Self-intersections in this low-order moment space reflect the dependence on spatial correlations, which differ over time. The challenge for a deep learning problem is to identify relevant higher-order correlations that separate states identical in low-order moments, and to learn a closure model expressing these correlations in terms of a finite number of parameters.
III DBMs for modeling reaction-diffusion systems
III.1 Energy-based models
Let the lattice on which particles diffuse be designated as the visible layer of the DBM. Let there be a total of layers, where denotes the visible layer and the hidden layers, each with units. Let the units in the -th layer be one of species . The state of each layer is described by a matrix, where entries denote the absence or presence of species at site . We consider lattices in the single-occupancy limit, corresponding to the implicit constraint.
A general energy model for fully connected layers Ernst et al. (2018) has biases for unit in layer occupied by species , and weights connecting unit of species in layer with unit of species in layer . In this paper, we focus on learning hierarchical statistics with a smaller parameter space by making the following simplifications: (1) we consider locally connected layers, where each unit in layer is connected to units in layer , and (2) biases and weights are shared across units in a layer: and . The energy function is:
where the sum runs over the local connectivity between two layers. Since each site in layer is connected to sites in layer , this sum comprises terms in total.
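The shared-parameter energy can be sketched for the simplified case of binary units and a single species per layer (the multi-species indices are dropped for brevity; `neighbors` is a hypothetical helper encoding the local connectivity between adjacent layers):

```python
# Sketch of a layered energy with a shared bias b_l per layer and a shared
# weight w_l coupling layer l to layer l+1 over local connections.
# layers[l] is a dict {site: 0 or 1}; neighbors(l, site) yields the sites
# in layer l+1 connected to `site`.
def energy(layers, biases, weights, neighbors):
    e = 0.0
    for l, layer in enumerate(layers):
        for site, s in layer.items():
            e -= biases[l] * s                        # shared bias term
            if l + 1 < len(layers):
                for nb in neighbors(l, site):         # local connectivity
                    e -= weights[l] * s * layers[l + 1].get(nb, 0)
    return e
```

The negative-sign convention here is the standard one for Boltzmann machines, so that larger biases and weights favor co-activation.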
III.2 Learning rule for DBMs
Maximizing the log likelihood of observed data gives the well known learning rule for DBMs Salakhutdinov and Hinton (2009):
where one expectation is taken over the model distribution and the other over the data distribution, and updates are applied by gradient ascent on the log likelihood with a given learning rate. The moments under the model distribution (the sleep phase) are estimated by Gibbs sampling, where each unit is sampled conditioned on the units connected to it. Each step of sampling is performed in two phases: one pass for layers with even indexes, and one pass for layers with odd indexes. Estimating the moments under the data distribution (keeping the visible layer clamped at a data vector, i.e. the wake phase) can be done for DBMs by mean field methods Salakhutdinov and Hinton (2009), or else by Gibbs sampling Montavon and Müller (2012).
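One block-Gibbs step of this even/odd schedule can be sketched for a toy chain of binary layers with one-to-one connectivity (a deliberate simplification of the patch connectivity used in the paper; weights and biases are scalars per layer):

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Sketch of one block-Gibbs step: layers at even depth are resampled given
# their neighboring layers, then layers at odd depth.
def gibbs_step(states, biases, weights, rng=random):
    """states[l] is a list of binary units in layer l; weights[l] couples
    unit i of layer l to unit i of layer l + 1 (one-to-one for brevity)."""
    L = len(states)
    for parity in (0, 1):                    # even pass, then odd pass
        for l in range(parity, L, 2):
            for i in range(len(states[l])):
                field = biases[l]
                if l > 0:
                    field += weights[l - 1] * states[l - 1][i]
                if l + 1 < L:
                    field += weights[l] * states[l + 1][i]
                states[l][i] = 1 if rng.random() < sigmoid(field) else 0
    return states
```

Because even-depth layers are conditionally independent given the odd-depth layers (and vice versa), each pass is a valid block update.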
III.3 Centering transformation and the centered gradient
where are the species-dependent centers in layer . Every regular DBM can be transformed to a centered DBM by transforming parameters as:
This can be used to derive the centered gradient Melchior et al. (2016): after sampling the moments of a regular DBM (6), transform to a centered DBM, calculate the gradient with respect to the centered parameters, and transform back to obtain the gradient for the regular DBM parameters. The result is
as derived in the Supplemental Material. To reduce noise, the centers are updated toward the average unit state using an exponential sliding average with a sliding parameter:
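The sliding center update can be sketched as follows; the sliding parameter is named `nu` here purely for illustration (the paper's symbol is not shown in this text):

```python
# Sketch of the exponential sliding average for the centers: each layer's
# center moves toward the current mean unit state with sliding factor nu.
def update_centers(centers, states, nu=0.1):
    new = []
    for c, layer in zip(centers, states):
        mean = sum(layer) / len(layer)
        new.append((1.0 - nu) * c + nu * mean)
    return new
```

A small `nu` averages over many sampling steps, trading responsiveness for reduced noise in the centers.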
IV Dynamic centered DBMs
We follow earlier work on dynamic Boltzmann distributions Johnson et al. (2015); Ernst et al. (2018, 2018) by generalizing to time-varying interactions (see Supplemental Material for a review). Instead of minimizing the Kullback-Leibler (KL) divergence of the model distribution to the data distribution at an instant in time, we now seek to minimize the KL divergence integrated over all times. Define the objective function:
The integral over time can lead to undesired extrema, particularly for periodic systems. In practice these can be eliminated using a small sliding time window of fixed width that is shifted forward every few optimization steps Ernst et al. (2018).
IV.1 Reduced dynamic model
To describe the time-evolving interactions, introduce the autonomous differential equation system:
for given initial conditions , . Here is a chosen domain of interaction parameters (weights and biases), and are functions with free parameter vectors to be learned.
The functions on the right-hand side can be chosen based on the physics of the system under consideration to learn a reduced physical model. For reaction-diffusion systems, their forms can be derived from the CME when using a fully visible Boltzmann distribution Ernst et al. (2018). A more black-box approach is to introduce basis functions to parameterize (12). Following Ernst et al. (2018) we use the family of finite elements Arnold and Logg (2014)
, which has the advantage that in dimensions higher than one, basis functions are simply tensor products of 1D cubic polynomials. The learning algorithm in Section IV.3 requires continuously differentiable functions; these polynomials are therefore the cubic Hermite polynomials, which in 1D control four degrees of freedom: the value and the first derivative at each endpoint. For a domain of given length, this yields a finite number of degrees of freedom in total, with corresponding coefficients to be learned for each right-hand-side function in the chosen basis.
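For reference, the four standard cubic Hermite shape functions on the unit interval are:

```python
# The four cubic Hermite shape functions on the reference interval [0, 1]:
# h00, h01 control the value at the left/right endpoint, and h10, h11
# control the first derivative there.
def hermite_basis(t):
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return h00, h10, h01, h11
```

In higher dimensions, a basis function on a cubic cell is simply a tensor product of these 1D shape functions, one factor per coordinate.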
IV.2 Moment closure approximation of dynamic centered DBMs
The key advantage of the dynamic Boltzmann distribution setting is the moment closure approximation that can be learned from data: any given moment evolves as (see Supplemental Material):
The learned differential equation of every interaction (weights and biases) contributes to the closure model, weighted by a covariance term between the observable of interest and the observable for which that interaction is the Lagrange multiplier. Equation (14) should be directly compared against (3): higher-order terms of visible units appearing in (3) are exchanged for correlations with latent random variables, whose activations are learned.
IV.3 Adjoint method learning problem with centering transformation
We next formulate a learning problem for the parameters defining the dynamical system (12). This is a specific case of a variational problem for the functions appearing on the right-hand side of a differential equation Ernst et al. (2018); Gamkrelidze (1978), as shown in the Supplemental Material. To enforce the constraints (12), introduce adjoint variables conjugate to the centered parameters. The adjoint equations can be derived from the Hamiltonian (Supplemental Material), giving
with boundary conditions , and where
While analytic expressions for these quantities are not available, in practice they can be easily estimated numerically.
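A central-difference estimate, as one possible numerical scheme (the paper's exact estimator is not shown in this text), might look like:

```python
# Central-difference sketch for estimating the derivative of a scalar
# function F with respect to its argument, as needed in the adjoint
# equations when analytic derivatives are unavailable.
def central_diff(F, theta, eps=1e-5):
    return (F(theta + eps) - F(theta - eps)) / (2.0 * eps)
```

The truncation error is of order `eps**2`, so a moderate `eps` suffices; too small an `eps` instead amplifies sampling noise in the estimated moments.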
The sensitivity (update) equations for the parameters to be learned are:
where the update step is: with learning rate (see Supplemental Material).
Algorithm 1 is an example of how the learning problem can be solved in practice. Alternative approaches for solving PDE-constrained optimization problems can be applied, such as sequential quadratic programming (SQP). A benefit of the current algorithm is its simplicity: the inner loop at each timestep is equivalent to the wake/sleep phases of the Boltzmann machine learning algorithm Ackley et al. (1985). A lower bound on the log-probability of test data can therefore be obtained using established methods such as Annealed Importance Sampling (AIS) Salakhutdinov and Hinton (2009). It is naturally possible to use accelerated gradient descent methods such as Adam Kingma and Ba (2014).
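The overall forward/backward structure can be sketched as follows. This is a hedged skeleton, not the paper's Algorithm 1: all quantities are reduced to scalars, and the moment-matching term that drives the adjoint variable is abstracted into a hypothetical callable `adjoint_source`.

```python
# Skeleton of one optimization step: Euler-integrate the parameter ODE
# forward, integrate the adjoint variable backward from a zero terminal
# condition, accumulate the sensitivity, and update the coefficient u.
def train_step(u, theta0, rhs, rhs_du, adjoint_source, dt, n_steps, lr):
    # Forward pass: d(theta)/dt = F(theta; u).
    thetas = [theta0]
    for _ in range(n_steps):
        thetas.append(thetas[-1] + dt * rhs(thetas[-1], u))
    # Backward pass: the adjoint variable phi starts at zero at the final
    # time; adjoint_source stands in for the moment-matching term.
    phi = 0.0
    grad = 0.0
    for k in range(n_steps, 0, -1):
        phi -= dt * adjoint_source(thetas[k], k)
        grad += dt * phi * rhs_du(thetas[k], u)   # sensitivity accumulation
    # Gradient-descent update of the basis coefficient.
    return u - lr * grad
```

In the full algorithm, the `adjoint_source` role is played by the gap between wake- and sleep-phase moments estimated by Gibbs sampling at each timestep.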
V Numerical examples
We demonstrate Algorithm 1 for learning a moment closure approximation for the Lotka-Volterra system of Section II. As training data, we use the stochastic simulations generated in Section II.1. Note that we only consider a single initial condition in this work; however, it is possible to learn a larger parameter space from stochastic simulations with varying initial conditions Ernst et al. (2018).
For the architecture of the DBM, we use a locally connected DBM with one visible and two hidden layers as shown in Figure 2(a). Each patch in layer is connected to a single unit in layer . The number of units in each layer is held constant at , with boundary units implementing periodic boundary conditions to the layer below, reflecting those used in the stochastic simulations.
Many systems feature a large number of species, e.g. states of ion channels as different sub-units are activated. Having species in layer and species in layer leads to species dependent weights. To limit this parameter inflation, we consider two species in each layer, and weights and . We found the omitted cross-species weights to be less important, as a similar effect is produced by crowding in the hidden layers during sampling.
As initial condition, we use the maximum entropy distribution consistent with the initial hunter and prey particles. This corresponds to initial parameters , and all other weights and biases set to zero. The algorithm therefore must learn to activate latent variables to control spatial correlations. Starting from zero weights and biases allows the hidden layers to enforce either a strengthening or suppressing (positive or negative weights) relationship with neighboring layers. We restrict the domain of the differential equations (12) to be . Since the distribution starts with only the visible biases as non-zero, it is important to include these in the domain. When training on stochastic simulations from many initial conditions, using fewer parameters in the domain improves generalization Ernst et al. (2018). The side length of the cubic cell (13) used in all dimensions was .
Algorithm 1 is implemented as a C++ library freely available online Ernst et al. (2019) (included in the Supplemental Material). We use simple batch gradient descent with batch size , learning rate for all parameters for optimization steps, and sliding factor (the center is averaged over all units allowing a larger factor than in non-dynamic centered DBMs). We use steps of Gibbs sampling for estimating both the wake and sleep phase moments, with persistent chains in the sleep phase Tieleman (2008). The differential equations (12,15) are solved using Euler’s method with the step size from the stochastic simulations. The time window in (17) is of size timesteps, and we slide every two optimization steps. Figure 2(b) shows the learned parameter trajectories. An example of the hidden layer states is shown in Figure 2(c), showing the learned hierarchical representation of spatial patterns in the hidden layers.
The moment closure approximation (14) is of particular interest: one can analyze which learned weighted covariance terms contribute to the time evolution of an observable. Consider for example the mean number of prey. Figure 3, top-left, shows the time evolution of this observable under the trained model, obtained by averaging over sampled lattices ( steps of Gibbs sampling from a random configuration) using the learned parameters at each timestep. For testing, we compare the trajectory against stochastic simulations, which agree well (for training on varying initial conditions, a test data set of stochastic simulations may be used Ernst et al. (2018)). Next, we calculate each term in (14), shown in rows two and three (to reduce noise, we first smooth the interactions with a low-pass filter with cutoff before sampling and calculating these terms). The sign of weight and bias terms is generally opposite; this reflects the selective activation of units based on the patches in neighboring layers, rather than broad, spatially uncorrelated activations. To validate the accuracy of these terms, we also plot their sum in the bottom row, which is nearly identical to the true time derivative calculated from the stochastic simulations.
The two oscillation cycles in Figure 1 evolve differently due to the dependence on higher-order moments. How is this difference reflected in the learned moment closure approximation? In the first oscillation in Figure 3, the time evolution is primarily driven by the weight term from the first hidden layer, indicating that mainly nearest-neighbor structure is relevant. In the second oscillation, the contribution from terms in the second hidden layer has grown significantly, indicating that longer-range correlations are relevant. Indeed, visually inspecting samples of the stochastic simulations (see Figure 1) shows that larger domains of hunter and prey are formed in the second oscillation.
Examining different quantities in this fashion gives insight into the learned moment closure approximation. In the center column of Figure 3, we examine the mean number of nearest neighbors of prey. The terms in (14) are approximately scaled versions of the terms for the mean number of prey. This is expected, since the reaction system does not explicitly give a source or sink for nearest-neighbor correlations between prey. Not all terms are accurately captured. For example, the right column shows the mean number of next-nearest neighbors (Manhattan distance two) of hunter and prey, which is overestimated in the learned model. This is explained by weight terms from hunter and prey both contributing positively to the time evolution, rather than competitively as before.
We have introduced a method for learning moment closure approximations from data using multiple hidden layers. A key result is the closure equation (14), which replaces long-range spatial correlations in the visible layer with correlations with latent variables, whose activation is learned. The learning problem is that of a dynamic Boltzmann distribution, combined with the architecture of a DBM. The centering transformation from centered DBMs is extended to the adjoint system for the dynamic case, such that pre-training is unnecessary. The hierarchical architecture in Figure 2(a) is tailored to reflect the moment equations derived from the CME (3), naturally capturing correlations relevant to the moment closure problem. A further important result is the use of multinomial variables in the hidden layers, which allows interpretable learned representations as in Figure 2(c).
Further avenues for improvement exist. It may be possible to adapt the "serendipity" family of finite elements Arnold and Logg (2014) to reduce the number of basis functions and therefore the computational cost Vincent et al. (2015), although this family is not generally continuously differentiable, requiring modifications to the learning problem. For modeling applications, physically informed parameterizations are particularly interesting, e.g. for reaction-diffusion systems Ernst et al. (2018), and generally in mathematical modeling Mjolsness (2018).
Acknowledgements. This work was supported by National Institute of Aging grant R56AG059602 (E.M., O.K.E., T.B., T.S.), Human Frontiers Science Program grant HFSP - RGP0023/2018 (E.M.), and NIH P41-GM103712, NIH R01MH115456, and AFOSR MURI FA9550-18-1-0051 (O.K.E., T.B., T.S.).
- Ackley et al. (1985) David H. Ackley, Geoffrey E. Hinton, and Terrence J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.
- Arnold and Logg (2014) D.N. Arnold and A. Logg. Periodic table of the finite elements. SIAM News, 47(9), 2014.
- Cao et al. (2003) Y. Cao, S. Li, L. Petzold, and R. Serban. Adjoint sensitivity analysis for differential-algebraic equations: The adjoint DAE system and its numerical solution. SIAM Journal on Scientific Computing, 24(3):1076–1089, 2003.
- Ernst et al. (2018) Oliver K. Ernst, Tom Bartol, Terrence Sejnowski, and Eric Mjolsness. Learning dynamic Boltzmann distributions as reduced models of spatial chemical kinetics. The Journal of Chemical Physics, 149(3):034107, 2018.
- Ernst et al. (2018) Oliver K. Ernst, Tom Bartol, Terrence Sejnowski, and Eric Mjolsness. Learning moment closure in reaction-diffusion systems with spatial dynamic Boltzmann distributions. arXiv e-prints, art. arXiv:1808.08630, Aug 2018.
- Ernst et al. (2019) Oliver K. Ernst, Tom Bartol, Terrence Sejnowski, and Eric Mjolsness. Dynamic Boltzmann Distributions C++ Library v4.0, 2019.
- Frewer (2015) Michael Frewer. Application of Lie-group symmetry analysis to an infinite hierarchy of differential equations at the example of first order ODEs. arXiv e-prints, art. arXiv:1511.00002, Oct 2015.
- Gamkrelidze (1978) R. V. Gamkrelidze. Principles of Optimal Control Theory. Springer US, Boston, MA, 1978.
- Gardiner (2009) C. W. Gardiner. Stochastic methods: A handbook for the natural and social sciences. Springer, Berlin, 2009.
- Gillespie (1977) Daniel T. Gillespie. Exact stochastic simulation of coupled chemical reactions. The Journal of Physical Chemistry, 81(25):2340–2361, 1977.
- Johnson et al. (2015) Todd Johnson, Tom Bartol, Terrence Sejnowski, and Eric Mjolsness. Model reduction for stochastic CaMKII reaction kinetics in synapses by graph-constrained correlation dynamics. Physical Biology, 12(4):045005, 07 2015.
- Kerr et al. (2008) Rex A. Kerr, T. M. Bartol, Boris Kaminsky, Markus Dittrich, Jen-Chien Jack Chang, Scott B. Baden, Terrence Sejnowski, and J. R. Stiles. Fast Monte Carlo simulation methods for biological reaction-diffusion systems in solution and on surfaces. SIAM Journal on Scientific Computing, 30(6):3126, 10 2008.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Kuehn (2015) Christian Kuehn. Moment closure - A brief review. arXiv e-prints, art. arXiv:1505.02190, May 2015.
- Levermore (1996) C David Levermore. Moment closure hierarchies for kinetic theories. Journal of statistical Physics, 83(5-6):1021–1065, 1996.
- Melchior et al. (2016) Jan Melchior, Asja Fischer, and Laurenz Wiskott. How to center deep Boltzmann machines. The Journal of Machine Learning Research, 17(1):3387–3447, 2016.
- Mjolsness (2013) Eric Mjolsness. Time-ordered product expansions for computational stochastic system biology. Physical Biology, 10(3):035009, jun 2013.
- Mjolsness (2018) Eric Mjolsness. Prospects for declarative mathematical modeling of complex biological systems. arXiv:1804.11044, 2018.
- Montavon and Müller (2012) Grégoire Montavon and Klaus-Robert Müller. Deep Boltzmann machines and the centering trick. In Neural Networks: Tricks of the Trade, pages 621–637. Springer, 2012.
- Salakhutdinov and Hinton (2009) Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann machines. In Artificial intelligence and statistics, pages 448–455, 2009.
- Schack and Schenzle (1990) Rüdiger Schack and Axel Schenzle. Moment hierarchies and cumulants in quantum optics. Phys. Rev. A, 41:3847–3852, Apr 1990.
- Tieleman (2008) Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th international conference on Machine learning, pages 1064–1071. ACM, 2008.
- Vincent et al. (2015) Kevin Vincent, Matthew Gonzales, Andrew Gillette, Christopher Villongco, Simone Pezzuto, Jeffrey Omens, Michael Holst, and Andrew McCulloch. High-order finite element methods for cardiac monodomain simulations. Frontiers in Physiology, 6:217, 2015.