Black box variational inference for state space models
Latent variable time-series models are among the most heavily used tools from machine learning and applied statistics. These models have the advantage of learning latent structure both from noisy observations and from the temporal ordering in the data, where it is assumed that meaningful correlation structure exists across time. A few highly-structured models, such as the linear dynamical system with linear-Gaussian observations, have closed-form inference procedures (e.g. the Kalman Filter), but this case is an exception to the general rule that exact posterior inference in more complex generative models is intractable. Consequently, much work in time-series modeling focuses on approximate inference procedures for one particular class of models. Here, we extend recent developments in stochastic variational inference to develop a 'black-box' approximate inference technique for latent variable models with latent dynamical structure. We propose a structured Gaussian variational approximate posterior that carries the same intuition as the standard Kalman filter-smoother but, importantly, permits us to use the same inference approach to approximate the posterior of much more general, nonlinear latent variable generative models. We show that our approach recovers accurate estimates in the case of basic models with closed-form posteriors, and more interestingly performs well in comparison to variational approaches that were designed in a bespoke fashion for specific non-conjugate models.
Latent variable models are commonplace in time-series analysis, with applications across statistics, engineering, the sciences, finance and economics. The core approach is to assume that latent variables x_t, which are correlated across time t, underlie correlated observations y_t. Standard models for latent dynamics include the linear dynamical system (LDS) and hidden Markov models. While each approach comes with a distinct model and set of computational tools, often the basic goal of inference is the same: to discern the filtering distribution p(x_t | y_1, ..., y_t) and the smoothing distribution p(x_t | y_1, ..., y_T) of the latent variables. Closed-form expressions for these distributions are available when the overall probabilistic model has tree or chain structure that admits closed-form message passing. In general, inference in non-Gaussian or nonlinear models requires numerical approximation or sampling.
Markov chain Monte Carlo sampling and particle filtering for general time-series models are well-developed but typically do not scale well to large-scale problems. Even when a generative model p_θ(y, x), with parameters θ, has been trained, we can only access an analytically-intractable posterior through further sampling. Here, we take a variational approach to time-series modeling: rather than attempting to compute the posterior of our generative model exactly, we approximate it with a distribution q_φ(x|y) with variational parameters φ. Inference proceeds by simultaneously optimizing p (through its model parameters θ) and q (through its variational parameters φ) such that q_φ(x|y) approximates the true posterior p_θ(x|y).
Our main contributions are (1) a structured approximate posterior that can express temporal dependencies, and (2) a fast and scalable inference algorithm. We propose a multivariate Gaussian approximate posterior with block tri-diagonal inverse covariance, and formulate an algorithm that scales (in both time and space complexity) only linearly in the length of the time-series. For inference, we make use of recent advances in stochastic gradient variational Bayes (SGVB) (Rezende et al., 2014; Kingma & Welling, 2013; Kingma et al., 2014) to learn an approximate posterior with a complex functional dependence upon the observations y. Using this approach we are able to learn a neural network (NN) that maps the observations y into the smoothed posterior (sometimes called the recognition model). This approach is also 'black-box' in the sense that the inference algorithm does not depend explicitly upon the functional form of the generative model p_θ(y, x).
Our motivations lie in the study of high-dimensional time-series, such as neural spike-train recordings (Kao et al., 2015). We seek to infer trajectories that provide insight into the latent, low-dimensional structure in the dynamics of such data. Recent, related approaches to variational inference in time-series models focus upon the design and learning of rich generative models capable of capturing the statistical structure of large, complex datasets (Gan et al., 2015; Chung et al., 2015). In contrast, our focus is upon computationally efficient inference in structured, interpretable parameterizations that build upon methods fundamental in scientific applications.
We apply our smoothing approach for approximate posterior inference of a well-studied generative model: the Poisson linear dynamical system (PLDS) model. We find that our general, black-box approach outperforms a specialized variational Bayes expectation maximization (VBEM) approach (Emtiyaz Khan et al., 2013) to inference in PLDS, reaching comparable solutions to VBEM before it can complete a single EM iteration. Additionally, we apply our method to inference in a one-dimensional, nonlinear dynamical system, showing that we are able to accurately recover nonlinear relationships in the posterior mean.
In variational inference we approximate an intractable posterior distribution p_θ(x|y) with a distribution q_φ(x|y) that comes from a tractable class (e.g., the Gaussian family) and is parameterized by variational parameters φ. (In many approaches to variational inference the dependence of q upon the observations y is dropped; in our case, the parameterization of q may depend explicitly upon y.) We learn θ and φ together by optimizing the evidence lower bound (ELBO) of the marginal likelihood (Jordan et al., 1999), given by,

log p_θ(y) >= E_{q_φ(x|y)}[ log p_θ(y, x) ] + H[ q_φ(x|y) ] = L(θ, φ).   (1)

The quantity L(θ, φ) is the ELBO, and H[q] = -E_q[log q] is the entropy of the approximating posterior. Our goal is to differentiate with respect to θ and φ so as to maximize L,

(θ*, φ*) = argmax_{θ, φ} L(θ, φ).   (2)
For the remainder of this section, we use the notation f(x) = log p_θ(y, x) - log q_φ(x|y), so that L(θ, φ) = E_{q_φ(x|y)}[ f(x) ] and the gradient we require is

∇_{θ,φ} L(θ, φ) = ∇_{θ,φ} E_{q_φ(x|y)}[ f(x) ].   (3)

Typically at least some terms of eq. 3 cannot be integrated in closed form. While it is often possible to estimate the gradient by sampling directly from q_φ(x|y), in general the approximate gradient exhibits high variance (Paisley et al., 2012). One approach to addressing this difficulty, independently proposed by Kingma & Welling (2013), Rezende et al. (2014) and Titsias & Lázaro-Gredilla (2014), is to compute the integral using the "reparameterization trick": choose an easy-to-sample random variable ε with distribution p(ε) and parameterize x through a function of observations and parameters,

x = g_φ(ε, y).   (4)

The point of this notation is to make clear that g_φ is a deterministic function: all randomness in x comes from the random variable ε. This allows us to approximate the gradient using the simple estimator,

∇_{θ,φ} L(θ, φ) ≈ (1/N) Σ_{i=1}^{N} ∇_{θ,φ} f( g_φ(ε_i, y) ),   (5)

where ε_1, ..., ε_N are iid samples from p(ε). In Kingma & Welling (2013) this estimator is referred to as the Stochastic Gradient Variational Bayes (SGVB) estimator. Empirically, eq. 5 has much lower variance than previous sampling-based approaches to the estimation of eq. 3 (Kingma & Welling, 2013; Titsias & Lázaro-Gredilla, 2014).
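As a concrete numerical illustration of the estimator in eq. 5 (a minimal numpy sketch with a toy objective, not taken from the paper), consider estimating the gradient of E_{x~N(μ, σ²)}[x²] by reparameterizing x = μ + σε with ε ~ N(0, 1). The true gradient is (2μ, 2σ), so the estimate can be checked directly:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_estimate(mu, sigma, n_samples=100_000):
    """Reparameterization-trick gradient of E_{x~N(mu, sigma^2)}[x^2].

    With x = mu + sigma * eps, the expectation becomes E_eps[(mu + sigma*eps)^2],
    and the gradient passes through the deterministic map g(eps) = mu + sigma*eps:
    d/dmu (x^2) = 2x, and d/dsigma (x^2) = 2x * eps.
    """
    eps = rng.standard_normal(n_samples)
    x = mu + sigma * eps
    return (2 * x).mean(), (2 * x * eps).mean()

# Analytically E[x^2] = mu^2 + sigma^2, so the true gradient is (2*mu, 2*sigma).
g_mu, g_sigma = grad_estimate(1.5, 0.5)
```

Averaging over many samples of ε, both components converge to the analytic values, unlike score-function estimators whose variance is typically much larger for the same sample budget.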
An important property of eq. 5 is that it does not depend upon the particular form of p_θ: we need only be able to evaluate it at the samples x_i = g_φ(ε_i, y). It is in this sense that our approach is "black-box": in principle, inference works the same way regardless of our choice of generative model p_θ(y, x). In practice, of course, different modeling choices will affect the computation time and the convergence rate of the method.
The estimator also permits significant freedom in our parameterization of the transformation g_φ: for inference, we just need to be able to differentiate g_φ with respect to φ. While it is possible to use the SGVB approach with a separate set of parameters for each observation (as in Hoffman et al. (2013), for instance), much recent work has used deep neural networks (DNNs) to train a function that maps the observations y directly into the posterior (Rezende et al., 2014; Kingma & Welling, 2013; Kingma et al., 2014). Under this approach, with a trained g_φ, no additional gradient steps are needed to obtain the approximate posterior for new observations y.
Using a black-box inference approach, learning a state-space model is in part just a matter of parameterizing a generative model with time-series structure. However, the posterior will in general have temporal correlation structure inadequately captured by the approximate posteriors studied in most previous variational inference literature. The first challenge, then, is to formulate an approximate posterior expressive enough to capture the temporal correlations characteristic of time-series models. We write y_t and x_t for the response and latent variable, respectively, at a particular time t, and let y = (y_1, ..., y_T) and x = (x_1, ..., x_T) denote their "stacked" versions, where x ∈ R^{nT} and y ∈ R^{mT} for latent dimension n and observation dimension m.
One common, convenient choice of approximate posterior is the multivariate normal. In the notation of Section 2, the multivariate normal comes about if we choose ε ~ N(0, I) and take g_φ to be an affine function of ε (Titsias & Lázaro-Gredilla, 2014). We can then express a sample as,

x = μ_φ(y) + R_φ(y) ε,   (6)

so that x is distributed as multivariate normal with mean μ_φ(y) and covariance

Σ_φ(y) = R_φ(y) R_φ(y)^T,   (7)

q_φ(x|y) = N( μ_φ(y), Σ_φ(y) ).   (8)
A potential downside of the Gaussian approach is that in our setting Σ is an nT x nT matrix, and so the number of parameters scales quadratically in T. This makes q difficult to manage and learn for large-scale datasets. A simple workaround is to consider Σ with a special structure that reduces the effective number of parameters. Possible examples include using a diagonal covariance (fully-factorized, or "mean field", approximation) (Bishop, 2006), or a diagonal covariance matrix plus a low-rank matrix (for instance, a sum of outer products) (Rezende et al., 2014).
For modeling time-series data, we seek an approximate posterior capable of expressing our strong expectation that the latent variables change smoothly over time. While the Gaussian approximate posterior of eq. 8 can represent arbitrary correlation structure, we propose a Gaussian approximate posterior whose parameterization scales only linearly in T. To do so, we borrow from the toolkit of the standard Kalman filter. In an LDS model with Gaussian observations, the posterior is a multivariate Gaussian with a block tri-diagonal inverse covariance. This block-tridiagonal structure results from (and expresses) the conditional independence properties of the LDS prior.
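This structure can be verified numerically. The following sketch (our own illustration, not code from the paper; the dynamics, loading, and noise matrices are arbitrary choices) builds the exact posterior of a small linear-Gaussian LDS by brute-force Gaussian conditioning and checks that the resulting inverse covariance has numerically zero blocks beyond the first off-diagonal:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, T = 2, 3, 5                                     # latent dim, obs dim, time steps
A = 0.9 * np.array([[np.cos(0.3), -np.sin(0.3)],
                    [np.sin(0.3),  np.cos(0.3)]])     # stable rotational dynamics
Q = 0.1 * np.eye(n)                                   # innovation covariance
C = rng.standard_normal((m, n))                       # loading matrix
R = 0.5 * np.eye(m)                                   # observation noise covariance
Sig0 = np.eye(n)                                      # initial state covariance

# Prior covariance of the stacked latents x = (x_1, ..., x_T):
# cov(x_t) follows the Lyapunov recursion, cov(x_s, x_t) = A^(s-t) cov(x_t).
Sig_t = [Sig0]
for _ in range(T - 1):
    Sig_t.append(A @ Sig_t[-1] @ A.T + Q)
Sx = np.zeros((n * T, n * T))
for t in range(T):
    for s in range(t, T):
        blk = np.linalg.matrix_power(A, s - t) @ Sig_t[t]
        Sx[s*n:(s+1)*n, t*n:(t+1)*n] = blk
        Sx[t*n:(t+1)*n, s*n:(s+1)*n] = blk.T

# Condition the stacked latents on stacked observations y = Cblk x + noise.
Cblk = np.kron(np.eye(T), C)
Rblk = np.kron(np.eye(T), R)
K = Sx @ Cblk.T @ np.linalg.inv(Cblk @ Sx @ Cblk.T + Rblk)
Sig_post = Sx - K @ Cblk @ Sx                         # posterior covariance

# The posterior *precision* is block tri-diagonal: blocks more than one
# step off the diagonal vanish up to numerical error.
J = np.linalg.inv(Sig_post)
off_band = max(np.abs(J[t*n:(t+1)*n, s*n:(s+1)*n]).max()
               for t in range(T) for s in range(T) if abs(t - s) > 1)
```

The off-band blocks come out at floating-point noise, while the tri-diagonal band carries the Markov coupling induced by the dynamics.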
To enable our approximate posterior to express the same correlation structure we parameterize the inverse covariance of eq. 8, Σ^{-1}, to be block tri-diagonal. Our final posterior takes the form:

q_φ(x|y) = N( μ_φ(y), [ R_φ(y) R_φ(y)^T ]^{-1} ),   (10)

where μ_φ(y) is the posterior mean and R_φ(y) is a lower block bi-diagonal matrix with n x n blocks, so that the precision Σ^{-1} = R R^T is block tri-diagonal. (A lower block bi-diagonal matrix has non-zero blocks only on the diagonal and the first lower off-diagonal.)
In this subsection we drop subscripts and functional notation for clarity and refer, for instance, to Σ_φ(y) simply as Σ. We can perform inference efficiently by exploiting the special structure of Σ^{-1}.
While Σ is in general a dense matrix, we parameterize Σ^{-1} as block tri-diagonal. Since Σ^{-1} is symmetric, in practice we represent only the diagonal and first block off-diagonal matrices of Σ^{-1}. Matrix inversion and sampling may be performed quickly using the Cholesky decomposition, Σ^{-1} = R R^T. (Block structure is frequently exploited in computation of the Cholesky decomposition; see for instance Björck (1996).) The computation of the lower-triangular Cholesky factor R is linear (in both time and space) in the length of the time-series T.
We can sample x ~ q_φ(x|y) as in eq. 6, where now:

x = μ + (R^T)^{-1} ε = μ + R^{-T} ε.
For an arbitrary matrix A, computation of A^{-1} ε scales cubically in the dimensionality of the matrix. However, by exploiting the lower-triangular, banded structure of R, the solve R^{-T} ε scales only linearly in T (Trefethen & Bau III, 1997). The entropy of q is also easy to compute, since log|Σ| = -log|Σ^{-1}| = -2 Σ_i log (R)_{ii}, which requires only the diagonal elements of the Cholesky factor.
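The computations above can be sketched in a few lines of numpy/scipy (an illustration under assumed block sizes, not the paper's implementation): the Cholesky factor of a block tri-diagonal precision is block bi-diagonal, a single triangular solve turns standard-normal noise into a posterior sample, and the entropy needs only the log-diagonal of R:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(2)
n, T = 2, 6                       # block size and number of time steps
N = n * T

# Assemble a symmetric positive-definite block tri-diagonal precision J.
J = np.zeros((N, N))
for t in range(T):
    Dt = rng.standard_normal((n, n))
    J[t*n:(t+1)*n, t*n:(t+1)*n] = Dt @ Dt.T + 3 * np.eye(n)   # PD diagonal blocks
for t in range(T - 1):
    Bt = 0.1 * rng.standard_normal((n, n))                    # off-diagonal blocks
    J[(t+1)*n:(t+2)*n, t*n:(t+1)*n] = Bt
    J[t*n:(t+1)*n, (t+1)*n:(t+2)*n] = Bt.T

# Cholesky factor J = R R^T; R inherits the band and is block bi-diagonal.
R = cholesky(J, lower=True)

# Sampling: x = mu + R^{-T} eps has covariance R^{-T} R^{-1} = J^{-1}.
mu = rng.standard_normal(N)
eps = rng.standard_normal(N)
x = mu + solve_triangular(R.T, eps, lower=False)

# Entropy needs only diag(R): log|Sigma| = -2 * sum(log diag R).
H = 0.5 * N * (1 + np.log(2 * np.pi)) - np.sum(np.log(np.diag(R)))
```

For a truly banded representation (to realize the O(T) cost rather than the dense O(T^3) factorization used in this toy check), the same computation can be routed through banded-storage Cholesky routines.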
In short, for learning θ and φ we need never explicitly represent any part of Σ. For data analysis and model comparison, however, it may be useful to compute the marginal covariances cov(x_t | y) and cov(x_t, x_{t+1} | y). These covariances correspond to the block-diagonal and first block off-diagonals of Σ, and may also be computed efficiently (Jain et al., 2007).
Our overall approach is closely related to the standard forward-backward algorithm used for instance in Kalman smoothing. However, there is a major technical distinction between its standard use (e.g., in expectation maximization) and our approach: we explicitly differentiate the parameters through the matrix factorization Σ^{-1} = R R^T.
While eq. 10 succinctly states the general mathematical form of the smoothing posterior, the practical performance of the algorithm depends upon the specifics of the parameterization. There are many possible parameterizations, especially since the parameters μ and Σ^{-1} may be arbitrary functions of the observations y. To illustrate, we discuss two distinct parameterizations. We use the notation NN_ψ(·) to indicate that a parameter is defined as a function of its inputs through a neural network with parameters ψ; the parameters of all networks are incorporated into the variational parameters φ.
We can naturally parameterize μ and Σ^{-1} of eq. 10 using 3 neural networks. We use one neural network to represent a map from observations to the posterior mean,

μ_t = NN_μ(y_t),

where y_t is a segment of the observations y and μ_t is the t-th block of μ. We can parameterize the block tri-diagonal inverse covariance Σ^{-1} by parameterizing each of its blocks separately: a second network produces the diagonal blocks, D_t = NN_D(y_t), and a third produces the first (lower) off-diagonal blocks, B_t = NN_B(y_t, y_{t+1}).
In practice, we found it necessary to enforce the positive-definiteness of the covariance by adding a diagonal matrix cI to the diagonal blocks, where c is a fixed constant. In the experiments, we refer to this parameterization as VILDSblk.
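One simple way to realize this constraint (a sketch of the cI trick described above; the function name and default constant are ours) is to map an unconstrained square network output M to M M^T + cI, which is symmetric positive semi-definite plus a strictly positive diagonal shift, hence strictly positive-definite for any c > 0:

```python
import numpy as np

def make_pd(M, c=1e-3):
    """Map an unconstrained square matrix M (e.g. a raw NN output block)
    to a symmetric positive-definite block: M @ M.T is PSD, and adding
    c * I (c > 0 a fixed constant) shifts all eigenvalues above zero."""
    return M @ M.T + c * np.eye(M.shape[0])

rng = np.random.default_rng(3)
block = make_pd(rng.standard_normal((2, 2)))
```

Note that the construction is safe even in the degenerate case where the network output is exactly zero, since the cI term alone keeps the block invertible.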
We can also define the approximate posterior through a product of Gaussian factors,

q_φ(x|y) ∝ N(x | m(y), C(y)) N(x | 0, P),

where C(y) and P are nT x nT matrices and m(y) is an nT-dimensional vector. In this set-up, we can view N(x | 0, P) as a prior. In terms of eq. 10, the final posterior is then given by:

Σ = ( C(y)^{-1} + P^{-1} )^{-1},   μ = Σ C(y)^{-1} m(y).
In order for this to be a parameterization of the smoothing posterior, eq. 10, both factor precisions must be block tri-diagonal. The multiplicative interaction between the posterior mean and covariance leads to different performance from the parameterization described in Section 4.1. Further, we can choose to initialize the means with a given degree of smoothness, which is not possible in the formulation of Section 4.1. In the experiments, we refer to this parameterization as VILDSmult; in Appendix A we describe the specific parameterization we used for the two factors.
In the experiments, we refer to SGVB with the smoothing approximate posterior as VILDS. We refer to SGVB with an approximate posterior independent across time as mean field (MF). The mean field posterior is given by,

q_φ(x|y) = ∏_{t=1}^{T} N( x_t | μ_t(y_t), Σ_t(y_t) ),

where each Σ_t(y_t) is a full n x n covariance matrix. We optimize all parameters by gradient ascent, using the SGVB estimator of eq. 5 to approximate the gradient.
For training VILDS and MF, we performed gradient descent on all parameters of the generative model and approximate posterior. We tried several adaptive stochastic gradient optimization methods, including ADAM (Kingma & Ba, 2014), Adadelta (Zeiler, 2012), Adagrad (Duchi et al., 2011) and RMSprop (Tieleman & Hinton, 2012). In the experiments we show here, we used Adadelta to learn all parameters. We gradually decreased the base learning rate by a fixed factor after a period of "epochs" without an increase of the objective function.
First, we illustrate the efficacy of our approach by showing that we can recover the analytic posterior of a Kalman filter model. Under a Kalman filter model, the latents are governed by an LDS,

x_{t+1} = A x_t + ε_t,

with Gaussian innovation noise ε_t ~ N(0, Q) with covariance matrix Q. Observations are coupled to the latents through a loading matrix C,

y_t = C x_t + η_t,

where the η_t are Gaussian noise with diagonal covariance.
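For reference, data of this kind can be simulated in a few lines (our own sketch; the dynamics, loading, and noise values below are arbitrary stand-ins matching the dimensions used in the experiment):

```python
import numpy as np

def simulate_lds(A, Q, C, r_diag, T, rng):
    """Simulate x_{t+1} = A x_t + eps_t with eps_t ~ N(0, Q), and
    observations y_t = C x_t + eta_t with eta_t ~ N(0, diag(r_diag))."""
    n, m = A.shape[0], C.shape[0]
    x = np.zeros((T, n))
    y = np.zeros((T, m))
    chol_Q = np.linalg.cholesky(Q)
    for t in range(T):
        if t > 0:
            x[t] = A @ x[t - 1] + chol_Q @ rng.standard_normal(n)
        y[t] = C @ x[t] + np.sqrt(r_diag) * rng.standard_normal(m)
    return x, y

rng = np.random.default_rng(4)
theta = 0.1
A = 0.98 * np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])   # stable 2-D dynamics
x, y = simulate_lds(A, 0.05 * np.eye(2), rng.standard_normal((100, 2)),
                    0.1 * np.ones(100), T=5000, rng=rng)
```

Because the spectral radius of A is below one, the simulated trajectories stay bounded over the 5000 time-points used here.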
We simulated 5000 time-points from a 2-dimensional latent dynamical system model, with 100-dimensional linear observations. We parameterize the VILDS approximate posterior using a 5-layer, dense NN for each of the mean and covariance parameterizations. We use a rectified-linear nonlinearity between each layer, followed by a linear output layer mapping into the parameterization. We compare both of the approximate posterior parameterizations described in Section 4. We refer to the parameterization of Section 4.1 as VILDSblk, and that of Section 4.2 as VILDSmult. For both choices of approximate posterior, the VILDS smoothed posterior means (Fig. 1) show good agreement with the true Kalman filter posterior. The VILDS smoothed posterior variances also show good agreement with the Kalman filter posterior covariance (not shown).
The Kalman filter/smoother is exact for an LDS with linear-Gaussian observations. A common generalization in the literature is an LDS with non-Gaussian observations. One well-studied example is the Poisson LDS (PLDS). Under this model, the latents are again governed by an LDS,
x_{t+1} = A x_t + ε_t,

with Gaussian innovation noise ε_t ~ N(0, Q) with covariance matrix Q. Observations are modulated by a log-rate r_t, which is coupled to the latent state via a loading matrix C,

r_t = C x_t + b.

The vector b is a bias term for each element of the response. Given the log-rate r_t, the observations are conditionally Poisson: y_t ~ Poisson(exp(r_t)).
With Poisson observations, the posterior does not have a closed form. Several methods have been proposed for approximate learning and inference in the special case of the PLDS (Buesing et al., 2014, 2012; Macke et al., 2011); the Laplace approximation is also frequently used (Paninski et al., 2010; Fahrmeir & Kaufmann, 1991). We compare VILDS to the variational Bayes expectation-maximization approach proposed by Emtiyaz Khan et al. (2013). This VBEM approach uses a full, unconstrained Gaussian as a variational approximate posterior, and performs EM iterations through a dual-space parameterization. We refer to it by the abbreviation VBDual, to emphasize this dual-space parameterization. We parameterize both the MF and VILDS approximate posteriors using a 5-layer, dense NN for each of the mean and covariance parameterizations. For VILDS, we use the parameterization of Section 4.2. We use a rectified-linear nonlinearity between each layer, followed by a linear output layer mapping into the parameterization. We simulated samples from a PLDS model, and initialized all three methods (VILDS, MF and VBDual) using the nuclear-norm minimization methods outlined in Pfau et al. (2013).
To better illustrate the timecourse of learning, each epoch consisted of only a small number of fixed-size minibatches, and a single gradient step was taken for each minibatch. We ran both MF and VILDS for a fixed 500 iterations. We find that VILDS reaches a higher ELBO value than either MF or VBDual, and does so before VBDual can complete a single expectation-maximization iteration (see Fig. 2). Further, the posterior means learned by VILDS are smoother than those learned using the MF approximate posterior (see Fig. 3).
VILDS can perform approximate posterior inference for nonlinear, non-Gaussian generative models. To illustrate, we simulated samples from a toy one-dimensional nonlinear dynamical model in which the innovation and observation noise terms are each iid random variables. For the approximate posterior, we parameterized both the mean and the covariance using 8-layer networks where each layer has only a single unit and a rectified-linear nonlinearity. We use the LDS-inspired parameterization of Section 4.2. As shown in Fig. 4, VILDS is capable of recovering the nonlinear relationship in the state space. For these experiments, we held the generative model parameters θ fixed and learned only the variational parameters φ.
We proposed a Gaussian variational approximate posterior with block tri-diagonal covariance structure capable of expressing “smoothed” trajectories of a time-series posterior. Exploiting the block tri-diagonal covariance structure, inference scales only linearly (in both time and space complexity) in the length of a time-series. Using the SGVB approach to variational inference, we can perform approximate inference for a wide class of latent variable generative models.
Despite the generality of the inference algorithm, the approach is limited by the Gaussian approximate posterior: most latent variable time-series generative models have non-Gaussian posteriors. One possible route forward is offered by the methods of Rezende & Mohamed (2015) and Dinh et al. (2014), which permit learning and inference with a non-Gaussian approximate posterior within the SGVB framework.
We implemented all methods in Python using Theano with the Lasagne library (Bastien et al., 2012; Bergstra et al., 2010). We plan to release the source code on Github shortly.
As we were preparing this manuscript we became aware of Krishnan et al. (2015), which studies a closely-related (but distinct) method. In future work, we plan to perform detailed comparisons between the methods.
Funding for this research was provided by DARPA N66001-15-C-4032, Google Faculty Research award, and ONR N00014-14-1-0243 (LP); Simons Global Brain Research Award 325171 (LP and JC); Sloan Research Fellowship (JPC). We thank David Carlson and Megan McKinney, Esq. for useful comments on the manuscript, and Gabriel Synnaeve for making his Python code publicly available.
The posterior mean and covariances may be computed by the standard Kalman forward-backward algorithm. To see this, we can write the posterior in matrix form as the product of two Gaussians, one of which plays the role of an LDS prior specified by a positive-definite covariance matrix Q, analogous to the "innovation noise" in the standard Kalman filter, and a linear dynamics matrix A. (For stable dynamics, we assume that the eigenvalues of A have magnitude less than one; in the examples we considered, we did not need to enforce this constraint.)
We can then re-write the approximate posterior as the product of a Gaussian factor N(x | m, C) and the prior N(x | 0, P). By the standard product-of-normal-densities identity, the product also has a multivariate normal distribution, with μ and Σ given by eq. 19 (which we repeat here):

Σ = ( C^{-1} + P^{-1} )^{-1},   μ = Σ C^{-1} m.   (19)
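The identity can be checked numerically (a small self-contained sketch; the symbols m, C, P follow the notation above): the precisions of the two factors add, the mean is the precision-weighted combination, and so the log-density mismatch between the product of factors and the resulting normal is a constant in x:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)
d = 3

def random_cov(rng, d):
    # A generic symmetric positive-definite covariance for the check.
    W = rng.standard_normal((d, d))
    return W @ W.T + d * np.eye(d)

m = rng.standard_normal(d)          # mean of the "data" factor N(m, C)
Cc = random_cov(rng, d)             # covariance of the "data" factor
P = random_cov(rng, d)              # covariance of the "prior" factor N(0, P)

# Product of the two densities is, up to normalization, another Gaussian:
# Sigma = (C^{-1} + P^{-1})^{-1},  mu = Sigma C^{-1} m.
Sigma = np.linalg.inv(np.linalg.inv(Cc) + np.linalg.inv(P))
mu = Sigma @ np.linalg.solve(Cc, m)

# log N(x|m,C) + log N(x|0,P) - log N(x|mu,Sigma) should not depend on x.
pts = rng.standard_normal((4, d))
vals = (multivariate_normal(m, Cc).logpdf(pts)
        + multivariate_normal(np.zeros(d), P).logpdf(pts)
        - multivariate_normal(mu, Sigma).logpdf(pts))
```

The residual `vals` is constant across the test points, confirming that the product is Gaussian with exactly the mean and covariance of eq. 19.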