Black box variational inference for state space models

11/23/2015 · Evan Archer et al. · Google · Stony Brook University · Columbia University

Latent variable time-series models are among the most heavily used tools from machine learning and applied statistics. These models have the advantage of learning latent structure both from noisy observations and from the temporal ordering in the data, where it is assumed that meaningful correlation structure exists across time. A few highly-structured models, such as the linear dynamical system with linear-Gaussian observations, have closed-form inference procedures (e.g. the Kalman Filter), but this case is an exception to the general rule that exact posterior inference in more complex generative models is intractable. Consequently, much work in time-series modeling focuses on approximate inference procedures for one particular class of models. Here, we extend recent developments in stochastic variational inference to develop a `black-box' approximate inference technique for latent variable models with latent dynamical structure. We propose a structured Gaussian variational approximate posterior that carries the same intuition as the standard Kalman filter-smoother but, importantly, permits us to use the same inference approach to approximate the posterior of much more general, nonlinear latent variable generative models. We show that our approach recovers accurate estimates in the case of basic models with closed-form posteriors, and more interestingly performs well in comparison to variational approaches that were designed in a bespoke fashion for specific non-conjugate models.

1 Introduction

Latent variable models are commonplace in time-series analysis, with applications across statistics, engineering, the sciences, finance and economics. The core approach is to assume that latent variables $z_1, \dots, z_T$, which are correlated across time $t$, underlie correlated observations $x_1, \dots, x_T$. Standard models for latent dynamics include the linear dynamical system (LDS) and hidden Markov models. While each approach comes with a distinct model and set of computational tools, often the basic goal of inference is the same: to discern the filtering distribution $p(z_t \mid x_1, \dots, x_t)$ and the smoothing distribution $p(z_t \mid x_1, \dots, x_T)$ of the latent variables. Closed-form expressions for these distributions are available when the overall probabilistic model has tree or chain structure that admits closed-form message passing. In general, inference in non-Gaussian or nonlinear models requires numerical approximation or sampling.

Markov chain Monte Carlo sampling and particle filtering for general time-series models are well-developed but typically do not scale well to large-scale problems. Even when a model $p_\theta(x, z)$, with parameters $\theta$, has been trained, we can only access an analytically-intractable posterior through further sampling. Here, we take a variational approach to time-series modeling: rather than attempting to compute the posterior of our generative model, we approximate it with a distribution $q_\phi(z \mid x)$ with variational parameters $\phi$. Inference proceeds by simultaneously optimizing $p_\theta$ (through its model parameters $\theta$) and $q_\phi$ (through its variational parameters $\phi$) such that $q_\phi(z \mid x)$ approximates the true posterior $p_\theta(z \mid x)$.

Our main contributions are (1) a structured approximate posterior that can express temporal dependencies, and (2) a fast and scalable inference algorithm. We propose a multivariate Gaussian approximate posterior with block tri-diagonal inverse covariance, and formulate an algorithm that scales (in both time and space complexity) only linearly in the length of the time-series. For inference, we make use of recent advances in stochastic gradient variational Bayes (SGVB) (Rezende et al., 2014; Kingma & Welling, 2013; Kingma et al., 2014) to learn an approximate posterior with a complex functional dependence upon the observations $x$. Using this approach we are able to learn a neural network (NN) that maps $x$ into the smoothed posterior $q_\phi(z \mid x)$ (sometimes called the recognition model). This approach is also ‘black-box’ in the sense that the inference algorithm does not depend explicitly upon the functional form of the generative model $p_\theta(x, z)$.

Our motivations lie in the study of high-dimensional time-series, such as neural spike-train recordings (Kao et al., 2015). We seek to infer trajectories that provide insight into the latent, low-dimensional structure in the dynamics of such data. Recent, related approaches to variational inference in time-series models focus upon the design and learning of rich generative models capable of capturing the statistical structure of large, complex datasets (Gan et al., 2015; Chung et al., 2015). In contrast, our focus is upon computationally efficient inference in structured, interpretable parameterizations that build upon methods fundamental in scientific applications.

We apply our smoothing approach for approximate posterior inference of a well-studied generative model: the Poisson linear dynamical system (PLDS) model. We find that our general, black-box approach outperforms a specialized variational Bayes expectation maximization (VBEM) approach (Emtiyaz Khan et al., 2013) to inference in PLDS, reaching comparable solutions to VBEM before it can complete a single EM iteration. Additionally, we apply our method to inference in a one-dimensional, nonlinear dynamical system, showing that we are able to accurately recover nonlinear relationships in the posterior mean.

2 Stochastic gradient variational Bayes

In variational inference we approximate an intractable posterior distribution $p_\theta(z \mid x)$ with a distribution $q_\phi(z \mid x)$ that comes from a tractable class (e.g., the Gaussian family) and is parameterized by variational parameters $\phi$. (In many approaches to variational inference the dependence of $q$ upon $x$ is dropped; in our case, the parameterization of $q_\phi$ may depend explicitly upon the observations $x$.) We learn $\theta$ and $\phi$ together by optimizing the evidence lower bound (ELBO) of the marginal likelihood (Jordan et al., 1999), given by,

$\log p_\theta(x) \;\geq\; \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x, z) - \log q_\phi(z \mid x)\right]$   (1)
$\qquad\qquad\; = \; \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x, z)\right] + \mathcal{H}\left[q_\phi(z \mid x)\right] \;\equiv\; \mathcal{L}(\theta, \phi).$   (2)

The quantity $\mathcal{L}(\theta, \phi)$ is the ELBO, and $\mathcal{H}\left[q_\phi(z \mid x)\right] = -\mathbb{E}_{q_\phi}\left[\log q_\phi(z \mid x)\right]$ is the entropy of the approximating posterior. Our goal is to differentiate with respect to $\theta$ and $\phi$ so as to maximize $\mathcal{L}$,

$\nabla_{\theta, \phi}\, \mathcal{L}(\theta, \phi) = \nabla_{\theta, \phi}\, \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x, z) - \log q_\phi(z \mid x)\right].$   (3)

For the remainder of this section, we suppress the explicit dependence of $q_\phi$ on $x$ where it is clear from context. Typically at least some terms of eq. 3 cannot be integrated in closed form. While it is often possible to estimate the gradient by sampling directly from $q_\phi(z \mid x)$,

in general the approximate gradient exhibits high variance (Paisley et al., 2012). One approach to addressing this difficulty, independently proposed by Kingma & Welling (2013), Rezende et al. (2014) and Titsias & Lázaro-Gredilla (2014), is to compute the integral using the “reparameterization trick”: choose an easy-to-sample random variable $\epsilon$ with distribution $p(\epsilon)$ and parameterize $z$ through a function $g_\phi$ of observations and parameters,

$z = g_\phi(\epsilon, x), \qquad \epsilon \sim p(\epsilon).$   (4)

The point of this notation is to make clear that $g_\phi$ is a deterministic function: all randomness in $z$ comes from the random variable $\epsilon$. This allows us to approximate the gradient using the simple estimator,

$\nabla_{\theta, \phi}\, \mathcal{L}(\theta, \phi) \approx \frac{1}{L} \sum_{l=1}^{L} \nabla_{\theta, \phi} \left[\log p_\theta\!\left(x, g_\phi(\epsilon^{(l)}, x)\right) - \log q_\phi\!\left(g_\phi(\epsilon^{(l)}, x) \,\big|\, x\right)\right],$   (5)

where $\epsilon^{(1)}, \dots, \epsilon^{(L)}$ are iid samples from $p(\epsilon)$. In Kingma & Welling (2013) this estimator is referred to as the Stochastic Gradient Variational Bayes (SGVB) estimator. Empirically, eq. 5 has much lower variance than previous sampling-based approaches to the estimation of eq. 3 (Kingma & Welling, 2013; Titsias & Lázaro-Gredilla, 2014).

An important property of eq. 5 is that it does not depend upon the particular form of the generative model $p_\theta(x, z)$: we need only be able to evaluate it at the samples $z^{(l)} = g_\phi(\epsilon^{(l)}, x)$. It is in this sense that our approach is “black-box”: in principle, inference works the same way regardless of our choice of generative model. In practice, of course, different modeling choices will affect the computation time and the convergence rate of the method.

The estimator also permits significant freedom in our parameterization of the transformation $g_\phi$: for inference, we just need to be able to differentiate $g_\phi$ with respect to $\phi$. While it is possible to use the SGVB approach with a separate set of parameters for each observation (as in Hoffman et al. (2013), for instance), much recent work has used deep neural networks (DNNs) to train a function that maps $x$ directly into the posterior (Rezende et al., 2014; Kingma & Welling, 2013; Kingma et al., 2014). Under this approach, with a trained $g_\phi$, no additional gradient steps are needed to obtain an approximate posterior for new observations $x$.
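
To make the estimator concrete, the following minimal sketch (our illustration, not the paper's code) evaluates the reparameterized Monte Carlo estimate underlying eq. 5 for a toy model with a standard-normal prior, a Gaussian likelihood, and a diagonal-Gaussian approximate posterior. In the paper, gradients of this quantity are taken automatically (e.g., by Theano); here we show only the Monte Carlo objective itself.

import numpy as np

rng = np.random.default_rng(0)

# Toy generative model: z ~ N(0, I), x | z ~ N(z, sigma_x^2 I).
sigma_x = 0.5

def log_p_joint(x, z):
    # log p(x, z) = log p(z) + log p(x | z), both isotropic Gaussians
    log_pz = -0.5 * np.sum(z**2) - 0.5 * z.size * np.log(2 * np.pi)
    log_px_given_z = (-0.5 * np.sum((x - z)**2) / sigma_x**2
                      - 0.5 * x.size * np.log(2 * np.pi * sigma_x**2))
    return log_pz + log_px_given_z

def log_q(z, mu, log_s):
    # log of a diagonal Gaussian q(z | x) with mean mu and log-std log_s
    s2 = np.exp(2 * log_s)
    return (-0.5 * np.sum((z - mu)**2 / s2)
            - 0.5 * np.sum(np.log(2 * np.pi * s2)))

def elbo_estimate(x, mu, log_s, L=10):
    # Reparameterization: z = mu + exp(log_s) * eps, with eps ~ N(0, I)
    total = 0.0
    for _ in range(L):
        eps = rng.standard_normal(mu.shape)
        z = mu + np.exp(log_s) * eps
        total += log_p_joint(x, z) - log_q(z, mu, log_s)
    return total / L

x = rng.standard_normal(3)            # one "observation"
mu, log_s = np.zeros(3), np.zeros(3)  # variational parameters
print(elbo_estimate(x, mu, log_s))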

3 Variational approach to state-space modeling

Using a black-box inference approach, learning a state-space model is in part just a matter of parameterizing a generative model with time-series structure. However, the posterior will in general have temporal correlation structure inadequately captured by the approximate posteriors studied in most previous variational inference literature. The first challenge, then, is to formulate an approximate posterior expressive enough to capture the temporal correlations characteristic of time-series models. We take $x_t \in \mathbb{R}^m$ and $z_t \in \mathbb{R}^n$ to be the response and latent variable, respectively, at a particular time $t$, and let $x = (x_1, \dots, x_T)$ and $z = (z_1, \dots, z_T)$ denote their “stacked” versions, where $x \in \mathbb{R}^{mT}$ and $z \in \mathbb{R}^{nT}$.

3.1 Gaussian approximate posterior

One common, convenient choice of approximate posterior is the multivariate normal. In the notation of Section 2, the multivariate normal comes about if we choose $\epsilon \sim \mathcal{N}(0, I)$ and take $g_\phi$ to be an affine function of $\epsilon$ (Titsias & Lázaro-Gredilla, 2014). We can then express a sample as,

$z = \mu_\phi(x) + R_\phi(x)\, \epsilon,$   (6)
$\epsilon \sim \mathcal{N}(0, I),$   (7)

so that $z$ is distributed as multivariate normal with mean $\mu_\phi(x)$ and covariance $\Sigma_\phi(x) = R_\phi(x) R_\phi(x)^\top$:

$q_\phi(z \mid x) = \mathcal{N}\!\left(\mu_\phi(x),\; R_\phi(x) R_\phi(x)^\top\right).$   (8)

It is easy to sample from a Gaussian approximate posterior using eq. 6, and the entropy term within eq. 2 has a closed form:

$\mathcal{H}\left[q_\phi(z \mid x)\right] = \tfrac{1}{2} \log\left|R_\phi(x) R_\phi(x)^\top\right| + \tfrac{nT}{2} \log(2\pi e).$   (9)

A potential downside of the Gaussian approach is that in our setting $z \in \mathbb{R}^{nT}$, and so the number of parameters in the covariance scales quadratically in $nT$. This makes the covariance difficult to manage and learn for large-scale datasets. A simple workaround is to consider a $\Sigma_\phi(x)$ with special structure that reduces the effective number of parameters. Possible examples include using a diagonal covariance (the fully-factorized, or “mean field”, approximation) (Bishop, 2006), or a diagonal covariance matrix plus a low-rank matrix (for instance, a sum of outer products) (Rezende et al., 2014).
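
As an illustration of how such structure reduces the parameter count, the sketch below (our own toy example, not from the paper) draws samples under a mean-field (diagonal) covariance and under a diagonal-plus-low-rank covariance; the dimensions and noise scales are arbitrary stand-ins.

import numpy as np

rng = np.random.default_rng(1)
nT = 200   # dimension of the stacked latent z (n * T)
k = 3      # rank of the low-rank correction

mu = np.zeros(nT)
d = np.exp(0.1 * rng.standard_normal(nT))   # per-coordinate standard deviations

# Mean-field: diagonal covariance diag(d**2) -> O(nT) parameters.
z_mf = mu + d * rng.standard_normal(nT)

# Diagonal plus low-rank: Sigma = diag(d**2) + U @ U.T -> O(nT * k) parameters.
# Sampling with two independent noise sources gives exactly this covariance.
U = 0.1 * rng.standard_normal((nT, k))
eps1 = rng.standard_normal(nT)
eps2 = rng.standard_normal(k)
z_lr = mu + d * eps1 + U @ eps2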

3.2 Smoothing Gaussian approximate posterior

For modeling time-series data, we seek an approximate posterior capable of expressing our strong expectation that the latent variables change smoothly over time. While the Gaussian approximate posterior of eq. 8 can represent arbitrary correlation structure, we propose a Gaussian approximate posterior whose parameterization scales only linearly in the length of the time-series $T$. To do so, we borrow from the toolkit of the standard Kalman filter. In an LDS model with Gaussian observations, the posterior is a multivariate Gaussian with a block tri-diagonal inverse covariance. This block tri-diagonal structure results from (and expresses) the conditional independence properties of the LDS prior.

To enable our approximate posterior to express the same correlation structure, we parameterize the inverse covariance of eq. 8, $\Sigma_\phi(x)^{-1}$, to be block tri-diagonal. Our final posterior takes the form:

$q_\phi(z \mid x) = \mathcal{N}\!\left(\mu_\phi(x),\; \left[R_\phi(x) R_\phi(x)^\top\right]^{-1}\right),$   (10)

where $\mu_\phi(x)$ is the posterior mean and $R_\phi(x)$ is a lower block bi-diagonal matrix with $n \times n$ blocks (a lower block bi-diagonal matrix has non-zero blocks only on its main diagonal and first lower block off-diagonal).

3.2.1 Computation

In this subsection we drop subscripts and functional notation for clarity and refer, for instance, to $R_\phi(x)$ simply as $R$. We can perform inference efficiently by exploiting the special structure of $\Sigma^{-1} = R R^\top$.

While $\Sigma$ is in general a dense matrix, we parameterize $\Sigma^{-1}$ as block tri-diagonal. Since $\Sigma^{-1}$ is symmetric, in practice we represent only the diagonal and first block off-diagonal matrices of $\Sigma^{-1}$. Matrix inversion and sampling may be performed quickly using the Cholesky decomposition $\Sigma^{-1} = R R^\top$ (block structure is frequently exploited in computation of the Cholesky decomposition; see for instance Björck (1996)). The computation of the lower-triangular Cholesky factor $R$ is linear (in both time and space) in the length of the time-series $T$.

We can sample as in eq. 6, where now:

$g_\phi(\epsilon, x) = \mu_\phi(x) + R_\phi(x)^{-\top}\, \epsilon.$   (11)

For an arbitrary matrix, computing the inverse (or solving the associated linear system) scales cubically in the dimensionality of the matrix. However, by exploiting the lower-triangular, block bi-diagonal structure of $R$, the solve in eq. 11 scales only linearly in $T$ (Trefethen & Bau III, 1997). The entropy of $q_\phi$ is also easy to compute, since $\log|\Sigma| = -\log\left|R R^\top\right| = -2 \sum_i \log\, [R]_{ii}$.

In short, for learning $\theta$ and $\phi$ we need never explicitly represent any part of $\Sigma$. For data analysis and model comparison, however, it may be useful to compute the marginal covariances $\mathrm{cov}(z_t, z_t)$ and $\mathrm{cov}(z_t, z_{t+1})$. These covariances correspond to the block diagonal and first block off-diagonals of $\Sigma$, and may also be computed efficiently (Jain et al., 2007).
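
The following sketch (ours, using small dense matrices for clarity) illustrates the computations of this subsection: build a block tri-diagonal precision $\Sigma^{-1}$, take its Cholesky factor $R$, draw a sample $\mu + R^{-\top}\epsilon$ with a single triangular solve, and read the entropy off the diagonal of $R$. A practical implementation would store only the non-zero blocks and use block or banded solvers to obtain the linear scaling in $T$; the block values below are placeholders.

import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(2)
n, T = 2, 5   # latent dimension and number of time steps (small, for illustration)

# Build a block tri-diagonal precision matrix Sigma^{-1} (here as a dense array).
D = [3.0 * np.eye(n) for _ in range(T)]       # diagonal blocks
B = [0.5 * np.eye(n) for _ in range(T - 1)]   # first lower off-diagonal blocks
P = np.zeros((n * T, n * T))
for t in range(T):
    P[t*n:(t+1)*n, t*n:(t+1)*n] = D[t]
for t in range(T - 1):
    P[(t+1)*n:(t+2)*n, t*n:(t+1)*n] = B[t]
    P[t*n:(t+1)*n, (t+1)*n:(t+2)*n] = B[t].T

# Cholesky factor of the precision: Sigma^{-1} = R R^T, with R lower block bi-diagonal.
R = cholesky(P, lower=True)

# Sample z = mu + R^{-T} eps: one triangular solve, no explicit matrix inverse.
mu = np.zeros(n * T)
eps = rng.standard_normal(n * T)
z = mu + solve_triangular(R, eps, lower=True, trans='T')

# Entropy uses log|Sigma| = -2 * sum(log diag(R)).
log_det_Sigma = -2.0 * np.sum(np.log(np.diag(R)))
entropy = 0.5 * log_det_Sigma + 0.5 * n * T * np.log(2 * np.pi * np.e)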

Our overall approach is closely related to the standard forward-backward algorithm used, for instance, in Kalman smoothing. However, there is a major technical distinction between its standard use (e.g., in expectation maximization) and our approach: we explicitly differentiate the variational parameters through the matrix factorization $\Sigma^{-1} = R R^\top$.

4 Parameterization of the smoothing posterior

While eq. 10 succinctly states the general mathematical form of the smoothing posterior, the practical performance of the algorithm depends upon the specifics of the parameterization. There are many possible parameterizations, especially since the parameters $\mu_\phi(x)$ and $R_\phi(x)$ may be arbitrary functions of the observations $x$. To illustrate, we discuss two distinct parameterizations. We use the notation $\alpha = \mathrm{NN}_\alpha(\cdot)$ to indicate that a parameter $\alpha$ is defined as a function of its inputs through a neural network with parameters $\phi_\alpha$. The parameters of all networks are incorporated into $\phi$:

$\phi = \bigcup_\alpha \phi_\alpha.$   (12)

4.1 Diagonal and block off-diagonal parameterization

We can naturally parameterize $\mu_\phi(x)$ and $\Sigma_\phi(x)$ of eq. 10 using three neural networks. We use one neural network to represent a map from the observation at time $t$ to the corresponding block of the posterior mean,

$\mu_t = \mathrm{NN}_\mu(x_t),$   (13)

where $x_t$ is a segment of $x$, and $\mu_\phi(x) = (\mu_1, \dots, \mu_T)$. We can parameterize the block tri-diagonal inverse covariance $\Sigma_\phi(x)^{-1}$,

$\Sigma_\phi(x)^{-1} = \begin{bmatrix} D_1 & B_1^\top & & \\ B_1 & D_2 & B_2^\top & \\ & B_2 & \ddots & \ddots \\ & & \ddots & D_T \end{bmatrix},$   (14)

by parameterizing each of the blocks separately:

$D_t = \mathrm{NN}_D(x_t),$   (15)
$B_t = \mathrm{NN}_B(x_t, x_{t+1}).$   (16)

In practice, we found it necessary to enforce the positive-definiteness of the covariance by adding a diagonal matrix $cI$ to $\Sigma_\phi(x)^{-1}$, where $c$ is a fixed constant. In the experiments, we refer to this parameterization as VILDSblk.
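
A minimal sketch of this parameterization (our illustration; the network widths, random weights, and the constant c are placeholders, not the paper's settings). Each "network" here is a single hidden layer standing in for a trained map; the diagonal blocks are symmetrized and given a fixed diagonal boost, mirroring the positive-definiteness fix described above.

import numpy as np

rng = np.random.default_rng(3)
m, n, T = 10, 2, 50      # observation dim, latent dim, time steps (illustrative)
c = 1.0                  # fixed constant added to the diagonal

def make_mlp(d_in, d_out, width=16):
    # One-hidden-layer ReLU network; random weights stand in for trained parameters phi.
    W1 = 0.1 * rng.standard_normal((width, d_in)); b1 = np.zeros(width)
    W2 = 0.1 * rng.standard_normal((d_out, width)); b2 = np.zeros(d_out)
    return lambda v: W2 @ np.maximum(W1 @ v + b1, 0.0) + b2

nn_mu = make_mlp(m, n)            # x_t            -> mu_t           (eq. 13)
nn_D  = make_mlp(m, n * n)        # x_t            -> block D_t      (eq. 15)
nn_B  = make_mlp(2 * m, n * n)    # (x_t, x_{t+1}) -> block B_t      (eq. 16)

x = rng.standard_normal((T, m))   # stand-in observations

mu = np.stack([nn_mu(x[t]) for t in range(T)])   # (T, n) posterior mean

# Assemble the block tri-diagonal inverse covariance of eq. 14.
P = np.zeros((n * T, n * T))
for t in range(T):
    D_t = nn_D(x[t]).reshape(n, n)
    D_t = 0.5 * (D_t + D_t.T) + c * np.eye(n)    # symmetrize, add c*I on the diagonal
    P[t*n:(t+1)*n, t*n:(t+1)*n] = D_t
for t in range(T - 1):
    B_t = nn_B(np.concatenate([x[t], x[t+1]])).reshape(n, n)
    P[(t+1)*n:(t+2)*n, t*n:(t+1)*n] = B_t
    P[t*n:(t+1)*n, (t+1)*n:(t+2)*n] = B_t.T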

4.2 Product-of-Gaussians approximate posterior

We can also define the approximate posterior through a product of Gaussian factors, $q_\phi(z \mid x) \propto q_1(z \mid x)\, q_2(z)$, where:

$q_1(z \mid x) = \prod_{t=1}^{T} \mathcal{N}\!\left(z_t;\; m_t(x_t),\; C_t(x_t)\right),$   (17)
$q_2(z) = \mathcal{N}\!\left(z;\; 0,\; \Sigma_p\right),$   (18)

where $C_t(x_t)$ and $\Sigma_p$ are covariance matrices and each $m_t(x_t)$ is an $n$-dimensional vector. In this set-up, we can view $q_2(z)$ as a prior. In terms of eq. 10, the final posterior is then given by:

$\Sigma_\phi(x) = \left(\Sigma_p^{-1} + C(x)^{-1}\right)^{-1},$   (19)
$\mu_\phi(x) = \Sigma_\phi(x)\, C(x)^{-1}\, m(x),$   (20)

where $C(x) = \mathrm{blockdiag}\!\left[C_1(x_1), \dots, C_T(x_T)\right]$ and $m(x) = \left(m_1(x_1), \dots, m_T(x_T)\right)$.

In order for this to be a parameterization of the smoothing posterior, eq. 10, $\Sigma_p^{-1}$ and $C(x)^{-1}$ must be block tri-diagonal. The multiplicative interaction between the posterior mean and covariance leads to different performance from the parameterization described in Section 4.1. Further, we can choose to initialize the means with a given degree of smoothness, which is not possible in the formulation of Section 4.1. In the experiments, we refer to this parameterization as VILDSmult; in Appendix A we describe the specific parameterization we used for $q_1$ and $q_2$.
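
The combination in eqs. 19-20 is the standard product-of-Gaussians identity. A small dense sketch (ours; the block-diagonal data factor and the block tri-diagonal "prior" precision below are arbitrary stand-ins, not learned quantities):

import numpy as np

rng = np.random.default_rng(4)
n, T = 2, 4
N = n * T

# Data-dependent factor q1: block-diagonal precision C^{-1} and mean m.
Cinv = np.zeros((N, N))
m = rng.standard_normal(N)
for t in range(T):
    G = rng.standard_normal((n, n))
    Cinv[t*n:(t+1)*n, t*n:(t+1)*n] = G @ G.T + np.eye(n)   # positive-definite block

# "Prior" factor q2: block tri-diagonal precision Lambda = Sigma_p^{-1}.
Lam = 2.0 * np.eye(N)
for t in range(T - 1):
    Lam[(t+1)*n:(t+2)*n, t*n:(t+1)*n] = -0.4 * np.eye(n)
    Lam[t*n:(t+1)*n, (t+1)*n:(t+2)*n] = -0.4 * np.eye(n)

# Product of the two Gaussians (eqs. 19-20): Sigma = (Lambda + C^{-1})^{-1},
# mu = Sigma @ C^{-1} @ m.  The sum of precisions stays block tri-diagonal.
Sigma = np.linalg.inv(Lam + Cinv)
mu = Sigma @ Cinv @ m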

5 Experiments

In the experiments, we refer to SGVB with the smoothing approximate posterior as VILDS. We refer to SGVB with an approximate posterior independent across time as mean field (MF). The mean field posterior is given by,

$q_\phi(z \mid x) = \prod_{t=1}^{T} \mathcal{N}\!\left(z_t;\; \mu_t(x_t),\; \Sigma_t(x_t)\right),$   (21)

where $\Sigma_t(x_t)$ is a full covariance matrix. We optimize all parameters by gradient ascent, using the SGVB approach to estimate the gradient with eq. 5.

For training VILDS and MF, we performed stochastic gradient ascent on the ELBO with respect to all parameters of the generative model and approximate posterior. We tried several adaptive-gradient stochastic optimization methods, including ADAM (Kingma & Ba, 2014), Adadelta (Zeiler, 2012), Adagrad (Duchi et al., 2011) and RMSprop (Tieleman & Hinton, 2012). In the experiments we show here, we used Adadelta to learn all parameters, gradually decreasing the base learning rate by a fixed factor whenever the objective function failed to increase for a set number of epochs.
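
This learning-rate schedule amounts to a simple patience rule. The sketch below (ours; the decay factor and patience are placeholders, not the paper's values) shows the control logic only, with the per-epoch gradient update left abstract.

def train(step_fn, eval_elbo, n_epochs, base_lr=1.0, decay=0.5, patience=10):
    """Patience-based decay: shrink the learning rate whenever the ELBO has not
    improved for `patience` consecutive epochs.  `step_fn(lr)` runs one epoch of
    Adadelta/SGD updates; `eval_elbo()` returns the current objective value."""
    lr = base_lr
    best, since_best = -float('inf'), 0
    for epoch in range(n_epochs):
        step_fn(lr)
        elbo = eval_elbo()
        if elbo > best:
            best, since_best = elbo, 0
        else:
            since_best += 1
            if since_best >= patience:
                lr *= decay          # gradually decrease the base learning rate
                since_best = 0
    return best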

Figure 1: Posterior mean inference compared with ground truth. Each panel shows the posterior mean along one dimension of the two-dimensional state space of a Kalman filter. We show a window of time-points of the posterior means from a sample Kalman filter experiment. We fit using VILDS with the parameterization described in Section 4.1 (VILDSblk) and that described in Section 4.2 (VILDSmult). The true posterior means computed using the closed-form Kalman filter equations (black) agree closely with those recovered using VILDSmult (blue) and VILDSblk (red).

5.1 Kalman filter model

First, we illustrate the efficacy of our approach by showing that we can recover the analytic posterior of a Kalman filter model. Under a Kalman filter model, the latents are governed by an LDS,

$z_{t+1} = A z_t + \epsilon_t,$   (22)

with Gaussian innovation noise $\epsilon_t \sim \mathcal{N}(0, Q)$ with covariance matrix $Q$. Observations $x_t$ are coupled to the latents through a loading matrix $C$,

$x_t = C z_t + \eta_t,$   (23)

where the observation noise terms $\eta_t$ are Gaussian with diagonal covariance.

We simulated 5000 time-points from a 2-dimensional latent dynamical system model with 100-dimensional linear observations. We parameterize the VILDS approximate posterior using a 5-layer, dense NN for each of the mean and covariance components, with a rectified-linear nonlinearity between each layer followed by a linear output layer mapping into the parameterization. We compare both of the approximate posterior parameterizations described in Section 4, referring to the parameterization of Section 4.1 as VILDSblk and that of Section 4.2 as VILDSmult. For both choices of approximate posterior, the VILDS smoothed posterior means (Fig. 1) show good agreement with the true Kalman filter posterior means. The VILDS smoothed posterior variances and time-lagged covariances also show good agreement with the Kalman filter posterior covariance (not shown).
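
For reference, here is a sketch (ours) of simulating data from the model of eqs. 22-23 with the dimensions quoted in this experiment; the dynamics, loading, and noise values are arbitrary stand-ins, not the settings used in the paper.

import numpy as np

rng = np.random.default_rng(5)
n, m, T = 2, 100, 5000          # latent dim, observation dim, time points (as in the text)

angle = 2 * np.pi / 100
A = 0.99 * np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])   # stable rotational dynamics
Q = 0.01 * np.eye(n)                                     # innovation covariance
C = rng.standard_normal((m, n)) / np.sqrt(n)             # loading matrix
obs_std = 0.5 * np.ones(m)                               # diagonal observation noise

z = np.zeros((T, n))
x = np.zeros((T, m))
z[0] = rng.multivariate_normal(np.zeros(n), Q)
for t in range(T):
    if t > 0:
        z[t] = A @ z[t-1] + rng.multivariate_normal(np.zeros(n), Q)   # eq. 22
    x[t] = C @ z[t] + obs_std * rng.standard_normal(m)                # eq. 23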

5.2 Poisson LDS (PLDS)

Figure 2: Speed comparison: ELBO convergence vs. time (seconds) for VILDS, MF and VBDual. VBDual is fit by an iterative procedure that optimizes a dual-space cost function for each E-step (Emtiyaz Khan et al., 2013). The first VBDual E-step takes many iterations to converge, causing the long initial gap before the first VBDual datapoint; subsequent VBDual E-steps are less time-consuming. VILDS and MF are learned by stochastic gradient descent using the adaptive-gradient technique Adadelta (Zeiler, 2012). Gradients are computed on minibatches, and each datapoint is collected after a fixed number of minibatches (one “epoch”). Both MF and VILDS were run for a fixed number of epochs. VILDS achieves the highest ELBO value, followed by MF and then VBDual, and reaches ELBO values comparable to VBDual before VBDual can complete a single EM iteration.

The Kalman filter/smoother is exact for an LDS with linear-Gaussian observations. A common generalization in the literature is an LDS with non-Gaussian observations. One well-studied example is the Poisson LDS (PLDS). Under this model, the latents are again governed by an LDS,

$z_{t+1} = A z_t + \epsilon_t,$   (24)

with Gaussian innovation noise $\epsilon_t \sim \mathcal{N}(0, Q)$ with covariance matrix $Q$. Observations are modulated by a log-rate $r_t$, which is coupled to the latent state via a loading matrix $C$,

$r_t = C z_t + d.$   (25)

The vector $d$ is a bias term for each element of the response. Given the log-rate $r_t$, observations $x_t$ are Poisson-distributed,

$x_{t,i} \mid z_t \sim \mathrm{Poisson}\!\left(\exp(r_{t,i})\right).$   (26)
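
A sketch (ours) of sampling from the PLDS generative model of eqs. 24-26; the dimensions and parameter values below are illustrative placeholders, not the settings used in the experiments.

import numpy as np

rng = np.random.default_rng(6)
n, m, T = 2, 20, 1000            # latent dim, observation dim, time points (illustrative)

A = 0.95 * np.eye(n)             # linear dynamics
Q = 0.05 * np.eye(n)             # innovation covariance
C = 0.3 * rng.standard_normal((m, n))   # loading matrix
d = -1.0 * np.ones(m)            # per-dimension bias of the log-rate

z = np.zeros((T, n))
x = np.zeros((T, m), dtype=int)
for t in range(T):
    if t > 0:
        z[t] = A @ z[t-1] + rng.multivariate_normal(np.zeros(n), Q)   # eq. 24
    r = C @ z[t] + d                 # log-rate, eq. 25
    x[t] = rng.poisson(np.exp(r))    # Poisson observations, eq. 26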

With Poisson observations, the posterior does not have a closed form. Several methods have been proposed for approximate learning and inference in the special case of the PLDS (Buesing et al., 2014, 2012; Macke et al., 2011); the Laplace approximation is also frequently used (Paninski et al., 2010; Fahrmeir & Kaufmann, 1991). We compare VILDS to the variational Bayes expectation-maximization approach proposed by Emtiyaz Khan et al. (2013). This VBEM approach uses a full, unconstrained Gaussian as a variational approximate posterior, and performs EM iterations through a dual-space parameterization; we refer to it by the abbreviation VBDual to emphasize this dual-space parameterization. We parameterize both the MF and VILDS approximate posteriors using a 5-layer, dense NN for each of the mean and covariance components, with a rectified-linear nonlinearity between each layer followed by a linear output layer mapping into the parameterization. For VILDS, we use the parameterization of Section 4.2. We simulated samples from a PLDS model, and initialized all three methods (VILDS, MF and VBDual) using the nuclear-norm minimization methods outlined in Pfau et al. (2013).

To better illustrate the timecourse of learning, each epoch consisted of only a small number of minibatches, and a single gradient step was taken for each minibatch. We ran both MF and VILDS for a fixed 500 iterations. We find that VILDS reaches a higher ELBO value than either MF or VBDual, and does so before VBDual can complete a single expectation-maximization iteration (see Fig. 2). Further, the posterior means learned by VILDS are smoother than those learned using the MF approximate posterior (see Fig. 3).

Figure 3: Comparison of posterior means learned using the mean field approximate posterior (green), VILDS (red) and VBDual with an unstructured Gaussian approximate posterior (black). The model has a rotational invariance, and so we project the MF and VILDS posteriors onto the VBDual posterior means using least squares. The posterior mean trajectories learned by VILDS are visibly smoother than the MF posterior mean trajectories.

5.3 Nonlinear dynamics simulation

VILDS can perform approximate posterior inference for nonlinear, non-Gaussian generative models. To illustrate, we simulated 5000 samples from a toy one-dimensional nonlinear dynamical model given by:

(27)
(28)

where the two noise terms are each iid random variables. For the approximate posterior, we parameterized both the mean and covariance components using 8-layer networks in which each layer has only a single unit and a rectified-linear nonlinearity. We use the LDS-inspired parameterization of Section 4.2. As shown in Fig. 4, VILDS is capable of recovering the nonlinear relationship in the state space. For these experiments, we held the generative model parameters $\theta$ fixed and learned only the variational parameters $\phi$.
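
Purely for illustration, the sketch below simulates a one-dimensional latent system of this general shape. The sine-map dynamics and noise scales are our placeholders and are not the system of eq. 27; they simply produce the kind of curved (z_t, z_{t+1}) relationship discussed in Fig. 4.

import numpy as np

rng = np.random.default_rng(7)
T = 5000

z = np.zeros(T)
x = np.zeros(T)
for t in range(T):
    if t > 0:
        # Placeholder nonlinear dynamics; the paper's system is given in eq. 27.
        z[t] = np.sin(2.0 * z[t-1]) + 0.1 * rng.standard_normal()
    x[t] = z[t] + 0.1 * rng.standard_normal()   # noisy observation of the latent

# Plotting the (z[t], z[t+1]) pairs reveals the curved dynamical relationship
# that a linear dynamical system cannot capture.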

Figure 4: True and VILDS-fit posterior means for the nonlinear dynamics simulation. VILDS was fit to 5000 samples drawn from the nonlinear dynamical system described in eq. 27. Each point in both plots represents a single ordered pair $(z_t, z_{t+1})$. (A) We illustrate the nonlinear dynamical relationship between $z_t$ and $z_{t+1}$; in a linear dynamical system this relationship would be a straight line. In red, we show the posterior means recovered by VILDS. (B) The VILDS posterior means of (A), plotted alone for comparison.

6 Conclusion

We proposed a Gaussian variational approximate posterior with block tri-diagonal inverse covariance structure capable of expressing “smoothed” trajectories of a time-series posterior. Exploiting this block tri-diagonal structure, inference scales only linearly (in both time and space complexity) in the length of a time-series. Using the SGVB approach to variational inference, we can perform approximate inference for a wide class of latent variable generative models.

Despite the generality of the inference algorithm, the approach is limited by the Gaussian approximate posterior: most latent variable time-series generative models have non-Gaussian posteriors. One possible route forward is offered by the methods of Rezende & Mohamed (2015) and Dinh et al. (2014), which permit learning and inference with a non-Gaussian approximate posterior within the SGVB framework.

We implemented all methods in Python using Theano with the Lasagne library (Bastien et al., 2012; Bergstra et al., 2010). We plan to release the source code on Github shortly.

As we were preparing this manuscript we became aware of Krishnan et al. (2015), which studies a closely-related (but distinct) method. In future work, we plan to perform detailed comparisons between the methods.

Acknowledgments

Funding for this research was provided by DARPA N66001-15-C-4032, Google Faculty Research award, and ONR N00014-14-1-0243 (LP); Simons Global Brain Research Award 325171 (LP and JC); Sloan Research Fellowship (JPC). We thank David Carlson and Megan McKinney, Esq. for useful comments on the manuscript, and Gabriel Synnaeve for making his Python code publicly available.

References

Appendix A Smoothing approximate posterior with explicit forward/backward

The posterior mean and covariances may be computed by the standard Kalman forward-backward algorithm. To see this, we can write the posterior in matrix form as the product of two Gaussians. We have

$q_1(z \mid x) = \mathcal{N}\!\left(z;\; m(x),\; C(x)\right),$   (29)
$q_2(z) = \mathcal{N}\!\left(z;\; 0,\; \Sigma_p\right),$   (30)

where,

$\Sigma_p^{-1} = \begin{bmatrix} Q^{-1} + A^\top Q^{-1} A & -A^\top Q^{-1} & & \\ -Q^{-1} A & Q^{-1} + A^\top Q^{-1} A & \ddots & \\ & \ddots & \ddots & -A^\top Q^{-1} \\ & & -Q^{-1} A & Q^{-1} \end{bmatrix},$   (31)

where the positive-definite matrix $Q$ is a covariance matrix, analogous to the “innovation noise” in the standard Kalman filter, and the matrix $A$ is a linear dynamics matrix (for stable dynamics, we assume that the eigenvalues of $A$ have magnitude less than one; in the examples we considered, we did not need to enforce this constraint).

We can then re-write the approximate posterior as the product $q_\phi(z \mid x) \propto q_1(z \mid x)\, q_2(z)$. By the standard product-of-normal-densities identity, $q_\phi(z \mid x)$ also has a multivariate normal distribution, $\mathcal{N}(\mu, \Sigma)$, where $\mu$ and $\Sigma$ are given by eq. 19 (which we repeat here):

$\Sigma = \left(\Sigma_p^{-1} + C(x)^{-1}\right)^{-1},$   (32)
$\mu = \Sigma\, C(x)^{-1}\, m(x).$   (33)

Computation proceeds just as in Section 3.2.1, except that now the computation of eq. 19 takes the form,

$\mu = \left(R R^\top\right)^{-1} C(x)^{-1}\, m(x) = R^{-\top}\!\left(R^{-1} C(x)^{-1}\, m(x)\right), \qquad R R^\top = \Sigma_p^{-1} + C(x)^{-1},$   (34)

which may be computed efficiently by exploiting the block bi-diagonal structure of $R$.
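
To make the structure of eq. 31 concrete, here is a small numerical sketch we add (not from the paper): it assembles the block tri-diagonal prior precision from a dynamics matrix A and innovation covariance Q (taking the initial-state covariance equal to Q for simplicity), and checks that its inverse matches the joint covariance of the corresponding LDS prior.

import numpy as np

rng = np.random.default_rng(8)
n, T = 2, 6
A = 0.8 * np.eye(n)          # dynamics (eigenvalues inside the unit circle)
Q = 0.5 * np.eye(n)          # innovation covariance
Q0 = Q.copy()                # covariance of z_1 (taken equal to Q here)

Qi, Q0i = np.linalg.inv(Q), np.linalg.inv(Q0)

# Block tri-diagonal prior precision Sigma_p^{-1} (eq. 31).
Lam = np.zeros((n * T, n * T))
for t in range(T):
    if t == 0:
        Dt = Q0i + A.T @ Qi @ A
    elif t < T - 1:
        Dt = Qi + A.T @ Qi @ A
    else:
        Dt = Qi
    Lam[t*n:(t+1)*n, t*n:(t+1)*n] = Dt
for t in range(T - 1):
    Lam[(t+1)*n:(t+2)*n, t*n:(t+1)*n] = -Qi @ A
    Lam[t*n:(t+1)*n, (t+1)*n:(t+2)*n] = -A.T @ Qi

# Check: inv(Lam) matches the joint covariance of the LDS prior
# z_1 ~ N(0, Q0), z_{t+1} = A z_t + eps_t, eps_t ~ N(0, Q).
S = np.zeros((n * T, n * T))
marg = [Q0]
for t in range(1, T):
    marg.append(A @ marg[-1] @ A.T + Q)      # marginal covariances P_t
for s in range(T):
    for t in range(s, T):
        cross = np.linalg.matrix_power(A, t - s) @ marg[s]   # cov(z_t, z_s)
        S[t*n:(t+1)*n, s*n:(s+1)*n] = cross
        S[s*n:(s+1)*n, t*n:(t+1)*n] = cross.T
assert np.allclose(np.linalg.inv(Lam), S)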