Deep Learning Moment Closure Approximations using Dynamic Boltzmann Distributions

by Oliver K. Ernst, et al.

The moments of spatial probabilistic systems are often given by an infinite hierarchy of coupled differential equations. Moment closure methods are used to approximate a subset of low order moments by terminating the hierarchy at some order and replacing higher order terms with functions of lower order ones. For a given system, it is not known beforehand which closure approximation is optimal, i.e. which higher order terms are relevant in the current regime. Further, the generalization of such approximations is typically poor, as higher order corrections may become relevant over long timescales. We have developed a method to learn moment closure approximations directly from data using dynamic Boltzmann distributions (DBDs). The dynamics of the distribution are parameterized using basis functions from finite element methods, such that the approach can be applied without knowing the true dynamics of the system under consideration. We use the hierarchical architecture of deep Boltzmann machines (DBMs) with multinomial latent variables to learn closure approximations for progressively higher order spatial correlations. The learning algorithm uses a centering transformation, allowing the dynamic DBM to be trained without the need for pre-training. We demonstrate the method for a Lotka-Volterra system on a lattice, a typical example in spatial chemical reaction networks. The approach can be applied broadly to learn deep generative models in applications where infinite systems of differential equations arise.




I Introduction

Infinite hierarchies of moment equations appear in many non-linear systems, including biochemical reaction networks Johnson et al. (2015), quantum optics Schack and Schenzle (1990), and kinetic theory of gases Levermore (1996), and more generally in the analysis of stochastic differential equations Kuehn (2015) and partial differential equations Frewer (2015).

Spatial reaction networks are a typical example of such systems. Differential equations for moments derived from the chemical master equation (CME) Gardiner (2009) usually depend on longer range spatial correlations. Moment closure approximations are widely used in chemical and biochemical modeling applications to obtain a solvable system by expressing higher order terms through lower order ones (see Johnson et al. (2015) for a review on moment closure methods for reaction networks).

Dynamic Boltzmann distributions (DBDs) have been introduced as generative models that learn a reduced dynamical system from data Ernst et al. (2018, 2018); Johnson et al. (2015). Here a differential equation model is introduced and learned, which can be parameterized either from the relevant physics, or otherwise using an appropriate set of basis functions. The learning problem can be formulated in continuous space in the form of a partial differential equation (PDE) constrained optimization problem, offering a natural approach to introduce PDEs into machine learning. In the Supplemental Material, we review the general variational problem for DBDs and make explicit the connection to the work in this paper.

The learning algorithm shares many advantages with the well-known Boltzmann machine learning algorithm Ackley et al. (1985). For example, in generative modeling approaches such as variational autoencoders (VAEs), the distribution over latent variables is chosen (typically Gaussian), while in DBDs it is learned from data. Further, the energy function in DBDs can be used to interpret the learned reduced models Ernst et al. (2018). Other approaches for learning temporal data, such as recurrent neural networks (RNNs), do not offer a reduced model probability distribution, and do not give insight into the learned moment closure approximation as done in Section IV.2.
In this paper, we use the architecture of deep Boltzmann machines (DBMs) to train deep dynamic Boltzmann distributions for learning moment closure approximations. The learned model replaces long range spatial correlations with correlations involving latent variables, whose activations are learned from data. We show how the centering transformation Montavon and Müller (2012); Melchior et al. (2016) can be applied in the adjoint method Ernst et al. (2018); Cao et al. (2003) to derive a learning algorithm which does not require pre-training.

This paper is organized as follows: Section II introduces a simple example demonstrating moment hierarchies, Section III reviews DBMs and centered DBMs, Section IV presents the learning algorithm for centered dynamic DBMs, and Section V presents numerical examples.

II Moment closure in the Lotka-Volterra system

Figure 1: Left: Mean number of hunter and prey in the Lotka-Volterra system, starting from a given number of particles of each population (see text). Right: Two lattices at an intersection point in the moment space of the left panel (red circle). Each has close to the average number of particles (prey in black and hunters in blue), but different spatial correlations lead to differing time derivatives (3).

In this section, we review the appearance of a hierarchy of moments in a spatial probabilistic system. As an example, consider the Lotka-Volterra predator-prey system, described by the reactions:
\begin{align}
P \xrightarrow{\;\alpha\;} 2P ,
\qquad
H + P \xrightarrow{\;\beta\;} 2H ,
\qquad
H \xrightarrow{\;\gamma\;} \varnothing ,
\end{align}
where $H$ and $P$ denote the predator (hunter) and prey populations, and $\alpha, \beta, \gamma$ are reaction rates.

In a well mixed system where spatial effects are not included, the mean numbers of $H$ and $P$, denoted by $\langle n_H \rangle$ and $\langle n_P \rangle$, obey the following ordinary differential equation (ODE) system:
\begin{align}
\frac{d \langle n_P \rangle}{dt} &= \alpha \langle n_P \rangle - \beta \langle n_H \rangle \langle n_P \rangle , \\
\frac{d \langle n_H \rangle}{dt} &= \beta \langle n_H \rangle \langle n_P \rangle - \gamma \langle n_H \rangle .
\end{align}
These differential equations are solvable for given initial conditions $\langle n_H \rangle(0)$ and $\langle n_P \rangle(0)$.

In a spatial setting, consider a lattice in the single occupancy limit, where $x_i^{(s)} \in \{0,1\}$ indicates whether the $i$-th lattice site is occupied by species $s \in \{H, P\}$. The mean numbers of $H$ and $P$ obey:
\begin{align}
\frac{d}{dt} \sum_i \langle x_i^{(P)} \rangle &= \alpha \sum_i \langle x_i^{(P)} \rangle - \beta \sum_i \sum_{j \in \mathcal{N}(i)} \langle x_i^{(H)} x_j^{(P)} \rangle , \\
\frac{d}{dt} \sum_i \langle x_i^{(H)} \rangle &= \beta \sum_i \sum_{j \in \mathcal{N}(i)} \langle x_i^{(H)} x_j^{(P)} \rangle - \gamma \sum_i \langle x_i^{(H)} \rangle ,
\end{align}
where $i$ sums over lattice sites and $j \in \mathcal{N}(i)$ over sites neighboring $i$. These equations do not close - a nearest neighbor moment appears on the right hand side. Further, the differential equation describing this term depends on yet higher order ones, e.g. next-nearest neighbor or three particle correlations (if particles are allowed to diffuse, the diffusion constant would appear here). Moment closure methods seek to obtain tractable approximations to this infinite hierarchy of differential equations.
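To make the contrast concrete, the closed well-mixed system (2) can be integrated directly. A minimal sketch in Python follows; the rate values are illustrative only, not those used later in the paper:

```python
def lotka_volterra_rhs(nP, nH, alpha=1.0, beta=0.05, gamma=0.5):
    # Well-mixed mean-field equations (2): the system closes because the
    # pair moment <n_H n_P> is replaced by the product <n_H><n_P>.
    dP = alpha * nP - beta * nH * nP
    dH = beta * nH * nP - gamma * nH
    return dP, dH

def integrate(nP, nH, dt=1e-3, steps=5000):
    # Forward Euler integration of the closed ODE system.
    for _ in range(steps):
        dP, dH = lotka_volterra_rhs(nP, nH)
        nP += dt * dP
        nH += dt * dH
    return nP, nH
```

No such direct integration is possible for (3): its right hand side requires the nearest neighbor moments, whose own equations require still higher order ones.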

II.1 Stochastic simulations of reaction-diffusion systems

As an alternative to the deterministic approach of solving differential equations such as (2), stochastic approaches can be used. For example, the Gillespie stochastic simulation algorithm (SSA) can be derived from the CME Gillespie (1977); Mjolsness (2013). Spatial adaptations of the Gillespie SSA can similarly be used to generate stochastic simulations of (3), including in continuous space Kerr et al. (2008). Here we adopt a simple lattice based algorithm Ernst et al. (2018) in which particles hop on a grid, undergo unimolecular reactions following the Gillespie SSA, and undergo bimolecular reactions with some chosen probability upon encounters.
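A heavily simplified sketch of one sweep of such a lattice scheme is below. The hop rule and predation probability here are illustrative stand-ins for the full algorithm of Ernst et al. (2018), and the prey birth and hunter death reactions are omitted for brevity:

```python
import random

def lattice_step(grid, L, p_pred=0.5, rng=random):
    """One sweep: diffusion hops, then predation H + P -> 2H on encounters.
    grid maps (i, j) -> 'H' or 'P' on an L x L periodic lattice in the
    single-occupancy limit."""
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # von Neumann neighborhood
    moved = set()
    # Diffusion: each particle attempts one hop to a random neighboring
    # site, rejected if the target site is occupied.
    for site in list(grid.keys()):
        if site in moved or site not in grid:
            continue
        di, dj = rng.choice(moves)
        target = ((site[0] + di) % L, (site[1] + dj) % L)
        if target not in grid:
            grid[target] = grid.pop(site)
            moved.add(target)
    # Predation: a hunter converts each neighboring prey with
    # probability p_pred.
    for site, species in list(grid.items()):
        if species == 'H':
            for di, dj in moves:
                nb = ((site[0] + di) % L, (site[1] + dj) % L)
                if grid.get(nb) == 'P' and rng.random() < p_pred:
                    grid[nb] = 'H'
    return grid
```

Both hops and predation conserve the total particle count; only the omitted birth and death reactions change it.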

To simulate the Lotka-Volterra system, we use a 2D lattice of sites with von Neumann connectivity and periodic boundary conditions. For reaction rates, we use , and bimolecular reaction probability corresponding to . The initial state has particles of hunter and prey randomly distributed, and simulations are run for timesteps with . Figure 1 shows snapshots of these simulations, which feature spatial patterns including waves.

We generate simulations as training data used in the rest of this paper. The mean number of hunter and prey is shown in Figure 1. Self intersections in this low order moment space reflect the dependence on spatial correlations which differ over time. The challenge for a deep learning problem is to identify relevant higher order correlations to separate states identical in low order moments, and to learn a closure model for expressing these correlations in terms of a finite number of parameters.

III DBMs for modeling reaction-diffusion systems

Energy based models can be used to describe the state of a reaction-diffusion system at an instant in time. Here, we briefly review this notation Ernst et al. (2018) and key results for DBMs.

III.1 Energy based models

Let the lattice on which particles diffuse be designated as the visible layer of the DBM. Let there be a total of $L$ layers, where $l = 0$ denotes the visible layer and $l = 1, \dots, L-1$ the hidden layers, each with $N_l$ units. Let the units in the $l$-th layer be one of species $s \in \mathcal{S}_l$. The state of each layer is described by a matrix $x_l$, where entries $x_{l,i}^{(s)} \in \{0,1\}$ denote the absence or presence of species $s$ at site $i$. We consider lattices in the single-occupancy limit, corresponding to the implicit constraint $\sum_s x_{l,i}^{(s)} \le 1$.

A general energy model for fully connected layers Ernst et al. (2018) has biases $b_{l,i}^{(s)}$ for unit $i$ in layer $l$ occupied by species $s$, and weights $W_{l,i,j}^{(s,s')}$ connecting unit $i$ of species $s$ in layer $l$ with unit $j$ of species $s'$ in layer $l+1$. In this paper, we focus on learning hierarchical statistics with a smaller parameter space by making the following simplifications: (1) we consider locally connected layers, where each unit in layer $l+1$ is connected to a patch of units in layer $l$, and (2) biases and weights are shared across units in a layer: $b_l^{(s)}$ and $W_l^{(s,s')}$. The energy function is:
\begin{align}
E[\{x_l\}] = - \sum_{l=0}^{L-1} \sum_{s} b_l^{(s)} \sum_i x_{l,i}^{(s)} - \sum_{l=0}^{L-2} \sum_{s,s'} W_l^{(s,s')} \sum_{\langle i,j \rangle} x_{l,i}^{(s)} x_{l+1,j}^{(s')} ,
\end{align}
where $\langle i,j \rangle$ sums over the local connectivity between two layers. Since each site in layer $l+1$ is connected to a patch of sites in layer $l$, this sum comprises one term per connection.
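For concreteness, the shared-parameter energy can be evaluated as follows. This sketch assumes a single species per layer (the paper allows several), and the `connections` structure encoding the local connectivity is an assumption of this illustration:

```python
def dbm_energy(layers, biases, weights, connections):
    """Energy of a locally connected DBM with shared biases/weights.

    layers:      list of lists; layers[l][i] in {0, 1}.
    biases:      biases[l], one shared bias per layer.
    weights:     weights[l], one shared weight coupling layer l to l+1.
    connections: connections[l] is a list of (i, j) pairs giving the
                 local connectivity between layer l and layer l+1.
    """
    e = 0.0
    for l, layer in enumerate(layers):
        e -= biases[l] * sum(layer)  # shared bias term per layer
    for l, pairs in enumerate(connections):
        for i, j in pairs:  # one term per local connection
            e -= weights[l] * layers[l][i] * layers[l + 1][j]
    return e
```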

III.2 Learning rule for DBMs

Maximizing the log likelihood of observed data gives the well known learning rule for DBMs Salakhutdinov and Hinton (2009):
\begin{align}
\Delta b_l^{(s)} &\propto \langle \mu_l^{(s)} \rangle_{\text{data}} - \langle \mu_l^{(s)} \rangle_{\text{model}} , \\
\Delta W_l^{(s,s')} &\propto \langle \mu_l^{(s,s')} \rangle_{\text{data}} - \langle \mu_l^{(s,s')} \rangle_{\text{model}} ,
\end{align}
where $\mu_l^{(s)} = \sum_i x_{l,i}^{(s)}$ and $\mu_l^{(s,s')} = \sum_{\langle i,j \rangle} x_{l,i}^{(s)} x_{l+1,j}^{(s')}$ are the moments conjugate to the biases and weights, $\langle \cdot \rangle_{\text{model}}$ denotes an average taken over the model distribution, and $\langle \cdot \rangle_{\text{data}}$ denotes an average taken over the data distribution. The update steps are $b_l^{(s)} \leftarrow b_l^{(s)} + \lambda \Delta b_l^{(s)}$ and similarly for $W_l^{(s,s')}$, with learning rate $\lambda$.

Estimating the moments can be done using the well-known wake-sleep algorithm Ackley et al. (1985). The moments under the model distribution (sleep phase) are given by Gibbs sampling from the conditional distribution of each unit:
\begin{align}
p\big( x_{l,i}^{(s)} = 1 \mid x_{l-1}, x_{l+1} \big) = \frac{e^{a_{l,i}^{(s)}}}{1 + \sum_{s''} e^{a_{l,i}^{(s'')}}} ,
\qquad
a_{l,i}^{(s)} = b_l^{(s)} + \sum_{s'} \sum_{j} \Big( W_{l-1}^{(s',s)} x_{l-1,j}^{(s')} + W_{l}^{(s,s')} x_{l+1,j}^{(s')} \Big) ,
\end{align}
where $j$ sums over units connected to unit $i$ in the adjacent layers, and the $1$ in the normalizer accounts for the empty state in the single-occupancy limit. Each step of sampling is performed in two phases: one pass for layers with even indexes, and one pass for layers with odd indexes. Estimating the moments under the data distribution (keeping the visible layer clamped at a data vector, i.e. the wake phase) can be done for DBMs by mean field methods Salakhutdinov and Hinton (2009), or else by Gibbs sampling Montavon and Müller (2012).
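A single multinomial unit can be sampled as follows. In this sketch, `activations` maps each species to its total input (bias plus weighted sum over connected units), and the added 1 in the normalizer is the empty state:

```python
import math
import random

def sample_unit(activations, rng=random):
    """Sample a multinomial unit in the single-occupancy limit.

    activations: dict species -> total input a (bias + weighted sum of
    connected units). Returns the occupying species, or None for empty.
    """
    exps = {s: math.exp(a) for s, a in activations.items()}
    z = 1.0 + sum(exps.values())  # the 1 is the empty state
    r = rng.random() * z
    acc = 1.0  # empty state considered first
    if r < acc:
        return None
    for s, w in exps.items():
        acc += w
        if r < acc:
            return s
    return None  # guard against floating point edge cases
```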

III.3 Centering transformation and the centered gradient

A centered DBM Melchior et al. (2016); Montavon and Müller (2012) with parameters $\hat{b}_l^{(s)}, \hat{W}_l^{(s,s')}$ has the energy function:
\begin{align}
\hat{E}[\{x_l\}] = - \sum_{l} \sum_{s} \hat{b}_l^{(s)} \sum_i \big( x_{l,i}^{(s)} - c_l^{(s)} \big) - \sum_{l} \sum_{s,s'} \hat{W}_l^{(s,s')} \sum_{\langle i,j \rangle} \big( x_{l,i}^{(s)} - c_l^{(s)} \big) \big( x_{l+1,j}^{(s')} - c_{l+1}^{(s')} \big) ,
\end{align}
where $c_l^{(s)}$ are the species-dependent centers in layer $l$. Every regular DBM can be transformed to a centered DBM by transforming parameters as:
\begin{align}
\hat{W}_l^{(s,s')} &= W_l^{(s,s')} , \\
\hat{b}_l^{(s)} &= b_l^{(s)} + \sum_{s'} \Big( n_{l-1} W_{l-1}^{(s',s)} c_{l-1}^{(s')} + n_{l+1} W_l^{(s,s')} c_{l+1}^{(s')} \Big) ,
\end{align}
where $n_{l \pm 1}$ counts the connections from a unit in layer $l$ to units in layer $l \pm 1$.
This can be used to derive the centered gradient Melchior et al. (2016): after sampling the moments of a regular DBM (6), transform to a centered DBM, calculate the gradient with respect to the centered parameters, and transform back to obtain the gradient for the regular DBM parameters. The result is
\begin{align}
\Delta W_l^{(s,s')} &\propto \big\langle \tilde{\mu}_l^{(s,s')} \big\rangle_{\text{data}} - \big\langle \tilde{\mu}_l^{(s,s')} \big\rangle_{\text{model}} , \\
\Delta b_l^{(s)} &\propto \langle \mu_l^{(s)} \rangle_{\text{data}} - \langle \mu_l^{(s)} \rangle_{\text{model}} - \sum_{s'} \Big( n_{l-1} \Delta W_{l-1}^{(s',s)} c_{l-1}^{(s')} + n_{l+1} \Delta W_l^{(s,s')} c_{l+1}^{(s')} \Big) ,
\end{align}
where $\tilde{\mu}_l^{(s,s')} = \sum_{\langle i,j \rangle} ( x_{l,i}^{(s)} - c_l^{(s)} ) ( x_{l+1,j}^{(s')} - c_{l+1}^{(s')} )$ is the centered pair moment, as derived in the Supplemental Material. To reduce noise, the centers are updated as the average unit's state with an exponential sliding average with sliding parameter $\nu$:
\begin{align}
c_l^{(s)} \leftarrow (1 - \nu) \, c_l^{(s)} + \nu \, \frac{1}{N_l} \big\langle \mu_l^{(s)} \big\rangle_{\text{data}} .
\end{align}
IV Dynamic centered DBMs

We follow earlier work on dynamic Boltzmann distributions Johnson et al. (2015); Ernst et al. (2018, 2018) by promoting the interactions to time-varying functions $b_l^{(s)}(t)$ and $W_l^{(s,s')}(t)$ (see Supplemental Material for review). Instead of minimizing the Kullback-Leibler (KL) divergence of the model distribution to the data distribution at an instant in time, we now seek to minimize the KL divergence integrated over all times. Define the objective function:
\begin{align}
S = \int_0^T dt \; \mathcal{D}_{\text{KL}} \big( p(t) \,\|\, \tilde{p}(t) \big) .
\end{align}
The integral over time can lead to undesired extrema, particularly for periodic systems. In practice these can be eliminated using a small sliding time window $\int_{\tau}^{\tau + \Delta\tau} dt$, where $\Delta\tau$ is some small fixed value and $\tau$ is shifted forward every few optimization steps Ernst et al. (2018).

IV.1 Reduced dynamic model

To describe the time-evolving interactions, introduce the autonomous differential equation system:
\begin{align}
\frac{d \theta_k}{dt} = F_k(\{\theta\}; u_k) \quad \text{for } \theta_k \in \mathcal{D} ,
\end{align}
for given initial conditions $\theta_k(0) = \theta_{k,0}$. Here $\mathcal{D}$ is a chosen domain of interaction parameters (weights and biases), and $F_k$ are functions with free parameter vectors $u_k$ to be learned.

The functions $F_k$ on the right side can be chosen based on the physics of the system under consideration to learn a reduced physical model. For reaction-diffusion systems, the forms of $F_k$ can be derived from the CME when using fully visible Boltzmann distributions Ernst et al. (2018). A more blackbox aligned approach is to introduce basis functions to parameterize (12). Following Ernst et al. (2018), we use a family of cubic Hermite finite elements Arnold and Logg (2014), which has the advantage that in dimensions higher than one, basis functions are simply tensor products of 1D cubic polynomials. The learning algorithm in Section IV.3 requires differentiable functions - these polynomials are therefore the Hermite polynomials that in 1D control four degrees of freedom: the value and the first derivative at each endpoint. For a domain covered by cells of a given side length, this gives a fixed number of degrees of freedom and corresponding coefficients to be learned:
\begin{align}
F_k(\{\theta\}; u_k) = \sum_q u_{k,q} \, f_q(\{\theta\}) ,
\end{align}
and similarly for each interaction, where $f_q$ is the appropriate basis function.
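In 1D, the four cubic Hermite basis functions of a single cell, and interpolation from endpoint values and derivatives, can be written as follows (higher dimensional basis functions are tensor products of these):

```python
def hermite_basis(t):
    """The four 1D cubic Hermite basis functions on [0, 1]: one per
    degree of freedom (value and first derivative at each endpoint)."""
    h00 = 2 * t**3 - 3 * t**2 + 1   # value at left endpoint
    h10 = t**3 - 2 * t**2 + t       # derivative at left endpoint
    h01 = -2 * t**3 + 3 * t**2      # value at right endpoint
    h11 = t**3 - t**2               # derivative at right endpoint
    return h00, h10, h01, h11

def hermite_interp(t, f0, df0, f1, df1):
    # Interpolate from endpoint values f0, f1 and derivatives df0, df1.
    h00, h10, h01, h11 = hermite_basis(t)
    return f0 * h00 + df0 * h10 + f1 * h01 + df1 * h11
```

The value basis functions form a partition of unity, and interpolation is exact for cubic polynomials on the cell.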

IV.2 Moment closure approximation of dynamic centered DBMs

The key advantage of the dynamic Boltzmann distribution setting is the moment closure approximation that can be learned from data: any given moment $\langle A \rangle$ evolves as (see Supplemental Material):
\begin{align}
\frac{d \langle A \rangle}{dt} = \sum_k \frac{d \theta_k}{dt} \Big( \langle A \, \mu_k \rangle - \langle A \rangle \langle \mu_k \rangle \Big) ,
\end{align}
where $d\theta_k / dt = F_k(\{\theta\}; u_k)$. The learned differential equation of every interaction (weights and biases) contributes to the closure model, weighted by a covariance term between $A$ and the observable $\mu_k$ for which $\theta_k$ is the Lagrange multiplier. Equation (14) should be directly compared against (3). Higher order terms of visible units appearing in (3) are exchanged for correlations with latent random variables, whose activations are learned from data.
IV.3 Adjoint method learning problem with centering transformation

We next formulate a learning problem for the parameters $u_k$ defining the dynamical system (12). This is a specific case of a variational problem for the functions appearing on the right hand side of a differential equation Ernst et al. (2018); Gamkrelidze (1978), as shown in the Supplemental Material. To enforce the constraints (12), introduce adjoint variables $\zeta_k$ conjugate to the centered parameters $\hat{\theta}_k$. The adjoint equations can be derived from the Hamiltonian $\mathcal{H}$ (Supplemental Material), giving
\begin{align}
\frac{d \zeta_k}{dt} = - \frac{\partial \mathcal{H}}{\partial \hat{\theta}_k} ,
\end{align}
with boundary conditions $\zeta_k(\tau + \Delta\tau) = 0$, where the derivative on the right hand side involves the moments appearing in the centered gradient (9). While analytic expressions for these moment derivatives are not available, in practice they can be easily estimated by sampling.

The sensitivity (update) equations for the parameters $u_k$ to be learned are:
\begin{align}
\frac{\partial S}{\partial u_{k,q}} = \int_{\tau}^{\tau + \Delta\tau} dt \; \zeta_k \, \frac{\partial F_k}{\partial u_{k,q}} ,
\end{align}
where the update step is $u_{k,q} \leftarrow u_{k,q} - \lambda \, \partial S / \partial u_{k,q}$ with learning rate $\lambda$ (see Supplemental Material).
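The adjoint recipe can be checked on a scalar toy problem $d\theta/dt = u f(\theta)$ with objective $J = \int_0^T g(\theta) \, dt$, for which the adjoint equation and sensitivity take the same general form as (15) and (17). This sketch (not the paper's implementation) uses forward Euler for the state and a backward pass for the adjoint:

```python
def adjoint_sensitivity(theta0, u, f, df, dg, T, n=1000):
    """Sensitivity dJ/du for dtheta/dt = u*f(theta), J = int_0^T g dt:
        dzeta/dt = -dg/dtheta - zeta * u * df/dtheta,  zeta(T) = 0,
        dJ/du    = int_0^T zeta * f(theta) dt.
    """
    dt = T / n
    # Forward pass: store the state trajectory.
    thetas = [theta0]
    for _ in range(n):
        thetas.append(thetas[-1] + dt * u * f(thetas[-1]))
    # Backward pass: integrate the adjoint, accumulate the sensitivity.
    zeta, djdu = 0.0, 0.0
    for i in range(n, 0, -1):
        th = thetas[i]
        djdu += dt * zeta * f(th)
        zeta += dt * (dg(th) + zeta * u * df(th))  # step backward in time
    return djdu
```

For $f = 1$, $g = \theta$, $\theta_0 = 0$, the exact sensitivity is $T^2/2$, which the backward pass reproduces.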

Algorithm 1 is an example of how the learning problem can be solved in practice. Alternative approaches for solving PDE-constrained optimization problems can be applied, such as sequential quadratic programming (SQP). A benefit of the current algorithm is its simplicity - the inner loop at each timestep is equivalent to the wake/sleep phases of the Boltzmann machine learning algorithm Ackley et al. (1985). A lower bound on the log-probability of test data can therefore be obtained using established methods such as Annealed Importance Sampling (AIS) Salakhutdinov and Hinton (2009). It is naturally possible to use accelerated gradient descent methods such as Adam Kingma and Ba (2014).

1:Input: Initial conditions for all interaction parameters, the time interval, a formula for the learning rate, the sliding factor, the batch size, the time window size, and a formula for sliding the window.
2:Initialize: the centers to the data means; all other quantities to zero.
3:while not converged do
4:     for each timepoint in the current window with the given timestep do
5:          Solve the constraints (12) for the current timepoint from the previous timepoint.
6:          Estimate wake & sleep phase moments (15) over the batch (e.g. by Gibbs sampling).
7:          Update the centers according to (10).
8:          Calculate & store the derivative terms in (15) and in (17).
9:     end for
10:      Solve the adjoint system (15) backwards in time over the window.
11:      Update the parameters according to (17) with the learning rate.
12:      Slide the time window forward if necessary to eventually cover the full interval.
13:end while
Algorithm 1 Learning algorithm for dynamic centered DBMs

V Numerical examples

We demonstrate Algorithm 1 for learning a moment closure approximation for the Lotka-Volterra system of Section II. As training data, we use the stochastic simulations generated in Section II.1. Note that we only consider a single initial condition in this work - however, it is possible to learn a larger parameter space from stochastic simulations with varying initial conditions Ernst et al. (2018).

For the architecture of the DBM, we use a locally connected DBM with one visible and two hidden layers, as shown in Figure 2(a). Each patch of units in a layer is connected to a single unit in the layer above. The number of units in each layer is held constant, with boundary units implementing periodic boundary conditions to the layer below, reflecting those used in the stochastic simulations.

Many systems feature a large number of species, e.g. states of ion channels as different sub-units are activated. Allowing every species in one layer to couple to every species in the next leads to a multiplicatively growing number of species dependent weights. To limit this parameter inflation, we consider two species in each layer, with weights coupling like species only. We found the omitted cross-species weights to be less important, as a similar effect is produced by crowding in the hidden layers during sampling.

Figure 2: (a) Locally connected DBM with 3 layers. Every patch of units in a layer (blue) is connected to a single unit in the layer above with two species-dependent weights. Gray units implement periodic boundary conditions to the layer below. (b) Interactions learned for the Lotka-Volterra system using Algorithm 1 (see text). (c) Spatial patterns in the layers of (a) at a timepoint after training, obtained by Gibbs sampling from a random configuration (raw probabilities shown for prey in black, hunter in blue).
Figure 3: Columns: quantities of interest. Top row: comparison with stochastic simulations. Second and third rows: terms in the moment closure approximation (14). Bottom row: the derivative from summing the second & third rows as in (14), compared to the derivative from stochastic simulations.

As initial condition, we use the maximum entropy distribution consistent with the initial hunter and prey particle numbers. This corresponds to non-zero initial visible biases, with all other weights and biases set to zero. The algorithm therefore must learn to activate latent variables to control spatial correlations. Starting from zero weights and biases allows the hidden layers to enforce either a strengthening or suppressing (positive or negative weights) relationship with neighboring layers. We restrict the domain of the differential equations (12) accordingly. Since the distribution starts with only the visible biases as non-zero, it is important to include these in the domain. When training on stochastic simulations from many initial conditions, using fewer parameters in the domain improves generalization Ernst et al. (2018). A single cubic cell side length (13) was used in all dimensions.

Algorithm 1 is implemented as a C++ library freely available online Ernst et al. (2019) (included in the Supplemental Material). We use simple batch gradient descent with a single learning rate for all parameters, and a sliding factor for the centers (the center is averaged over all units in a layer, allowing a larger factor than in non-dynamic centered DBMs). We use Gibbs sampling for estimating both the wake and sleep phase moments, with persistent chains in the sleep phase Tieleman (2008). The differential equations (12,15) are solved using Euler's method with the same step size as the stochastic simulations. The time window in (17) spans a small number of timesteps, and is slid forward every two optimization steps. Figure 2(b) shows the learned parameter trajectories. An example of the hidden layer states is shown in Figure 2(c), showing the learned hierarchical representation of spatial patterns in the hidden layers.

The moment closure approximation (14) is of particular interest. Here one can analyze which learned weighted covariance terms contribute to the time evolution of an observable. Consider for example the mean number of prey. Figure 3, top-left, shows the time evolution of this observable under the trained model, obtained by averaging over sampled lattices (Gibbs sampling from a random configuration) using the learned parameters at each timestep. For testing, we compare the trajectory against stochastic simulations, which agree well (for training on varying initial conditions, a test data set of stochastic simulations may be used Ernst et al. (2018)). Next, we calculate each term in (14), shown in rows two and three (to reduce noise, we first smooth the interactions with a low-pass filter before sampling and calculating these terms). The signs of the weight and bias terms are generally opposite - this reflects the selective activation of units based on the patches in neighboring layers, rather than broad, spatially uncorrelated activations. To validate the accuracy of these terms, we also plot their sum in the bottom row, which is nearly identical to the true time derivative calculated from the stochastic simulations.

The two oscillation cycles in Figure 1 evolve differently due to the dependence on higher order moments. How is this difference reflected in the learned moment closure approximation? In the first oscillation in Figure 3, the time evolution is primarily driven by the weight term from the first hidden layer, indicating that mainly nearest neighbor structure is relevant. In the second oscillation, the contribution from terms in the second hidden layer has grown significantly, indicating longer range correlations are relevant. Indeed, visually inspecting samples of the stochastic simulations (see Figure 1) shows that larger domains of hunter and prey are formed in the second oscillation.

Examining different quantities in this fashion gives insight into the learned moment closure approximation. In the center column of Figure 3, we examine the mean number of nearest neighbors of prey. The terms in (14) are approximately scaled versions of the terms for the mean number of prey. This is expected, since the reaction system does not explicitly give a source or sink for nearest neighbor correlations between prey. Not all terms are accurately captured. For example, the right column shows the mean number of next-nearest neighbors (Manhattan distance two) of hunter and prey, which are overestimated in the learned model. This is explained by weight terms from hunter and prey both contributing positively to the time evolution, rather than competitively as before.

VI Discussion

We have introduced a method for learning moment closure approximations from data using multiple hidden layers. A key result is the closure equation (14), which replaces long range spatial correlations in the visible layer with correlations with latent variables, whose activation is learned. The learning problem is that for a dynamic Boltzmann distribution, combined with the architecture of a DBM. The centering transformation from centered DBMs is extended to the adjoint system for the dynamic case, such that pre-training is unnecessary. The hierarchical architecture in Figure 2(a) is tailored to reflect the moment equations derived from the CME (3), naturally capturing correlations relevant to the moment closure problem. A further important result is the use of multinomial variables in the hidden layers, which allows interpretable learned representations as in Figure 2(c).

Further avenues for improvement exist. It may be possible to adapt the "serendipity" family of finite elements Arnold and Logg (2014) to reduce the number of basis functions and therefore the computational cost Vincent et al. (2015), although this family does not retain all the properties required here, necessitating modifications to the learning problem. For modeling applications, physically informed parameterizations are particularly interesting, e.g. for reaction-diffusion systems Ernst et al. (2018), and generally in mathematical modeling Mjolsness (2018).

This work was supported by National Institute of Aging grant R56AG059602 (E.M., O.K.E., T.B., T.S.), Human Frontiers Science Program grant HFSP - RGP0023/2018 (E.M.), and NIH P41-GM103712, NIH R01MH115456, and AFOSR MURI FA9550-18-1-0051 (O.K.E., T.B., T.S.).


  • Ackley et al. (1985) David H. Ackley, Geoffrey E. Hinton, and Terrence J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.
  • Arnold and Logg (2014) D.N. Arnold and A. Logg. Periodic table of the finite elements. SIAM News, 47(9), 2014.
  • Cao et al. (2003) Y. Cao, S. Li, L. Petzold, and R. Serban. Adjoint sensitivity analysis for differential-algebraic equations: The adjoint DAE system and its numerical solution. SIAM Journal on Scientific Computing, 24(3):1076–1089, 2003.
  • Ernst et al. (2018) Oliver K. Ernst, Tom Bartol, Terrence Sejnowski, and Eric Mjolsness. Learning dynamic Boltzmann distributions as reduced models of spatial chemical kinetics. The Journal of Chemical Physics, 149(3):034107, 2018.
  • Ernst et al. (2018) Oliver K. Ernst, Tom Bartol, Terrence Sejnowski, and Eric Mjolsness. Learning moment closure in reaction-diffusion systems with spatial dynamic Boltzmann distributions. arXiv e-prints, art. arXiv:1808.08630, Aug 2018.
  • Ernst et al. (2019) Oliver K. Ernst, Tom Bartol, Terrence Sejnowski, and Eric Mjolsness. Dynamic Boltzmann Distributions C++ Library v4.0, 2019.
  • Frewer (2015) Michael Frewer. Application of Lie-group symmetry analysis to an infinite hierarchy of differential equations at the example of first order ODEs. arXiv e-prints, art. arXiv:1511.00002, Oct 2015.
  • Gamkrelidze (1978) R. V. Gamkrelidze. Principles of Optimal Control Theory. Springer US, Boston, MA, 1978.
  • Gardiner (2009) C. W. Gardiner. Stochastic methods: A handbook for the natural and social sciences. Springer, Berlin, 2009.
  • Gillespie (1977) Daniel T. Gillespie. Exact stochastic simulation of coupled chemical reactions. The Journal of Physical Chemistry, 81(25):2340–2361, 1977.
  • Johnson et al. (2015) Todd Johnson, Tom Bartol, Terrence Sejnowski, and Eric Mjolsness. Model reduction for stochastic CaMKII reaction kinetics in synapses by graph-constrained correlation dynamics. Physical Biology, 12(4):045005, 2015.
  • Kerr et al. (2008) Rex A. Kerr, T. M. Bartol, Boris Kaminsky, Markus Dittrich, Jen-Chien Jack Chang, Scott B. Baden, Terrence Sejnowski, and J. R. Stiles. Fast Monte Carlo simulation methods for biological reaction-diffusion systems in solution and on surfaces. SIAM Journal on Scientific Computing, 30(6):3126–3149, 2008.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kuehn (2015) Christian Kuehn. Moment closure - A brief review. arXiv e-prints, art. arXiv:1505.02190, May 2015.
  • Levermore (1996) C David Levermore. Moment closure hierarchies for kinetic theories. Journal of statistical Physics, 83(5-6):1021–1065, 1996.
  • Melchior et al. (2016) Jan Melchior, Asja Fischer, and Laurenz Wiskott. How to center deep Boltzmann machines. The Journal of Machine Learning Research, 17(1):3387–3447, 2016.
  • Mjolsness (2013) Eric Mjolsness. Time-ordered product expansions for computational stochastic system biology. Physical Biology, 10(3):035009, jun 2013.
  • Mjolsness (2018) Eric Mjolsness. Prospects for declarative mathematical modeling of complex biological systems. arXiv:1804.11044, 2018.
  • Montavon and Müller (2012) Grégoire Montavon and Klaus-Robert Müller. Deep Boltzmann machines and the centering trick. In Neural Networks: Tricks of the Trade, pages 621–637. Springer, 2012.
  • Salakhutdinov and Hinton (2009) Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann machines. In Artificial intelligence and statistics, pages 448–455, 2009.
  • Schack and Schenzle (1990) Rüdiger Schack and Axel Schenzle. Moment hierarchies and cumulants in quantum optics. Phys. Rev. A, 41:3847–3852, Apr 1990.
  • Tieleman (2008) Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning, pages 1064–1071. ACM, 2008.
  • Vincent et al. (2015) Kevin Vincent, Matthew Gonzales, Andrew Gillette, Christopher Villongco, Simone Pezzuto, Jeffrey Omens, Michael Holst, and Andrew McCulloch. High-order finite element methods for cardiac monodomain simulations. Frontiers in Physiology, 6:217, 2015.