1 Introduction
Continuous latent time series models are popular tools in Bayesian machine learning. One thereby combines multiple copies of a likelihood model through a time series prior, thus enabling the latent variables to drift over time while still sharing statistical strength accross all times (Blei & Lafferty, 2006; Wang et al., 2008; Sahoo et al., 2012; Gultekin & Paisley, 2014; Charlin et al., 2015; Ranganath et al., 2015; Jerfel et al., 2017; Bamler & Mandt, 2017).
Variational Inference (VI) enables approximate inference for complex models by solving an optimization problem (Jordan et al., 1999). One chooses a variational distribution and fits it to the posterior, where the fullyfactorized mean field class is the most widely used. However, the standard mean field class is not a good choice when it comes to time series models. Instead, one often uses structured variational distributions in time (Wainwright et al., 2008; Blei & Lafferty, 2006). The perhaps simplest such choice is a multivariate Gaussian with a tridiagonal precision matrix. This can be seen as using the probabilistic model of the Kalman filter as a variational distribution (variational Kalman filter) (Blei & Lafferty, 2006), and reflects the fact that the prior is a firstorder Markov process.
In this paper, we introduce an efficient VI algorithm for fitting such tridiagonal Gaussian variational models to a large class of complex latent time series models. We draw on Black Box Variational Inference (BBVI) with reparameterization gradients (Salimans & Knowles, 2013; Kingma & Welling, 2014; Rezende et al., 2014; Ruiz et al., 2016), where one forms a MonteCarlo estimate of the variational lower bound’s gradient. The problem with structured variational distributions is that computing this reparameterization gradient is expensive; a naive implementation involves a dense matrix multiplication and scales as , where is the number of time steps. In this paper, we lay out a general algorithm which gives the reparameterization gradient in . This algorithm can be thought of a variant of the forwardbackward algorithm (Murphy, 2012) in the context of BBVI, and relies on a reparameterization procedure that never involves the inversion or multiplication of a dense matrix. We illustrate our approach on the dynamic word embedding model (Bamler & Mandt, 2017).
2 Background and Problem Setting
We start by specifying our generative and variational models and give an overview of the proposed algorithm.
Generative model.
We consider models with timedependent observations , where is the number of time steps. At each time step , the generative process is described by an arbitrary likelihood function , and
is a latent variable. We furthermore assume that the prior is Markovian. The joint probability thus factorizes as follows,
(1) 
where denotes an arbitrary prior on the first latent variable. Many models fall into this class. Our goal is an efficient posterior approximation for this model class, using a structured variational approximation and black box inference techniques.
Kalman smoothing revisited.
Before we describe our approach, we revisit the Kalman smoother (Rauch et al., 1965) as an efficient algorithm for a particularly simple realization of Eq. 1 where all conditional distributions are Gaussian. This is often called a Wiener process (more precisely, it is a GaussMarkov process). The prior is Gaussian with tridiagonal precision matrix , and the likelihood is a Gaussian centered around with precision . Thus, the posterior is also Gaussian, , and can be obtained analytically. One finds and . Obtaining the posterior modes involves solving the linear system of equations . For an arbitrary matrix , this involves operations. However, for the Wiener process, is tridiagonal and one can solve for in linear time using a specialized algorithm.
The forwardbackward algorithm (Murphy, 2012) implicitly decomposes into a product of two bidiagonal matrices and , where is lower triangular and is upper triangular. One starts with a forward pass through the data, in which one solves
for the auxiliary vector
using forward substitution, followed by a backward pass in which one solves for using back substitution. As detailed in section 3.3, we use a similar philosophy.Variational model.
For a general likelihood model , the exact posterior of Eq. 1 is intractable. To circumvent this problem, we use BBVI to minimize the KL divergence between a variational distribution and the true posterior. Here, summarizes all variational parameters. We use a structured variational distribution as our variational model that is motivated by the exact posterior of the analytically solvable model discussed above. We thus consider a Gaussian with a tridiagonal precision matrix ,
(2) 
This can be understood as varying the parameters of a (fictitious) Wiener process so that its posterior approximates the posterior of the true model (Blei & Lafferty, 2006).
As we discuss below, the tridiagonal structure of allows us to fit the variational distribution efficiently. Note that the covariance matrix is dense, thus encoding longrange correlations between any two time steps. Modelling correlations is important for many time series applications, in particular when the prior is strong e.g. due to little evidence per time step.
Black box variational inference.
In BBVI, one fits to the posterior , maximizing the evidence lower bound (ELBO) using stochastic gradients. The reparameterization trick amounts to representing the ELBO as
(3) 
The entropy is often an analytic function or can be estimated using other tricks. Here, is a vector of
independent standardnormal distributed random variables, and
denotes functions that are defined such that the random variable has probability density (see Section 3.2 below).In order to implement an efficient BBVI algorithm, one needs to be able to estimate the gradient of efficiently, i.e., in time. This involves the following three tasks:
All three of the above tasks can easily be solved efficiently if one chooses a mean field variational distribution, i.e., if factorizes over all time steps. However, the mean field approximation ignores correlations between time steps, which are important in many time series models as discussed above. In Section 3 we address each one of the above tasks individually.
3 Inference
In this section, we give the details of our new black box variational inference algorithm.
3.1 Evaluating the Entropy
The entropy of a multivariate Gaussian with precision matrix is , up to an additive constant. Evaluating the determinant of a general matrix takes operations in practice. To avoid this expensive operation, we parameterize the precision matrix via its Cholesky decomposition,
(4) 
Here, is an upper triangular matrix. Since we restrict to have a tridiagonal structure, is bidiagonal (Kılıç & Stanica, 2013), i.e., it has the structure
(5) 
with . As the mapping from to is unique, it suffices to optimize the nonzero components of in order to find the optimal entries of . It turns out that we never have to construct the matrix explicitly. The variational parameters are thus the marginal means (see Eq. 2) and the nonzero components of . Using the relation , we can evaluate the entropy in linear time,
(6) 
3.2 Evaluating the reparameterization functions
In contrast to the entropy, the expected logjoint in Eq. 3 cannot be evaluated analytically for a general model. We obtain an unbiased gradient estimator of the expected logjoint by drawing independent samples for .
For what follows, let denote any of the
variational parameters. Using the chain rule, the estimate of the gradient of the expected logjoint with respect to
is(7) 
where we defined
(8)  
To further simplify Eq. 7 we specialize the reparameterization function to our variational model. Using Eqs. 2 and 4, we find
(9) 
Here, is a dense (upper triangular) matrix. Instead of computing the inverse, we evaluate the term on the righthand side of Eq. 9 by solving the linear system for using back substitution. This takes only operations due to the bidiagonal structure of , see Eq. 5. We can therefore evaluate Eq. 8 in time for each sample .
3.3 Estimating the Reparameterization Gradient
Last, we show how to efficiently estimate the reparameterization gradient in time. In this subsection we describe an efficient method to evaluate the Jacobian that appears on the righthand side of Eq. 7.
A naive evaluation of all gradient estimates would require operations: the sums on the righthand side of Eq. 7 run over terms, and the number of variational parameters for which the gradient estimate has to be evaluated grows at least linearly in for a reasonably flexible variational distribution.
We use the following trick to evaluate all gradient estimates in linear time in
. For any invertible matrix
that depends on some parameter , we have(10) 
Solving for , we obtain
(11) 
Eq. 11 expresses the derivative of the dense (upper triangular) matrix in terms of the derivative of the bidiagonal matrix . In fact, both and are sparse matrices with only a single nonzero entry, see Eq. 5. The dense matrix still appears on the righthand side of Eq. 11 but we avoid evaluating it explicitly. Instead, we again solve a bidiagonal linear system of equations. Combining Eqs. 7, 9 and 11, we obtain the formal expressions
(12)  
(13)  
(14) 
where bold face denotes a column vector of derivatives of the logjoint as defined in Eq. 8. Instead of computing the inverse in Eqs. 13–14, we obtain the vectors and by solving the linear systems and , respectively, in time using the bidiagonal structure of . Since is lowerdiagonal and is upperdiagonal these operations carryout a forward and a backward pass through the time steps, respectively. This is similar to the forwardbackward algorithm for the Wiener process with Gaussian likelihood discussed in Section 2.
4 Experiments
To demonstrate the benefits of our approach, we summarize selective experimental results from work on dynamic word embeddings, which used our proposed BBVI algorithm.
Dynamic SkipGram Model.
Bamler & Mandt (2017) proposed the dynamic skipgram model as a generalization of the skipgram model to timestamped text corpora (Mikolov et al., 2013). The model, thus, represents words as latent trajectories in an embedding space, and is able to track how words change their meaning over time.
To train the model, one first summarizes the text corpus in terms of certain sufficient statistics, using a finite vocabulary. For each time stamp, one builds cooccurrence matrices of words and their surrounding words. In addition, the model requires socalled nagative examples, expressing the likelihood of cooccurrance by mere chance. The model then models these statistics in terms of the geometrical arrangement of certain latent word and context embedding vectors. The prior over these embedding vectors is a latent OrnsteinUhlenbeck process (a Wiener process in the presence of a restoring force), enforcing a continuous drift over time. For more details, we refer to (Bamler & Mandt, 2017).
Algorithms and Baselines.
SGI denotes the nonBayesian skipgram model with independent random initializations of word vectors (Mikolov et al., 2013; Hamilton et al., 2016). SGP denotes the same approach as above, but with word and context vectors being preinitialized with the values from the previous year, as in (Kim et al., 2014). Both approaches have no dynamical priors and hence no overhead and just scale linearly with . DSGF denotes the dynamic skipgram filtering algorithm, proposed in (Bamler & Mandt, 2017), which also runs in , but uses the mean field approximation and only uses information form the past. Finally, DSGS denotes the dynamic skipgram smoothing algorithm. This is the algorithm proposed in this paper, applied to the dynamic skipgram model.
Data.
We used data from the Google books corpus (Michel et al., 2011) from the last two centuries (). We also used the “State of the Union” (SoU) addresses of U.S. presidents, which spans more than two centuries. Finally, we used a Twitter corpus of news tweets for randomly drawn dates from to
. Details on hyperparameters and preprocessing are given in
(Bamler & Mandt, 2017).Results.
We focus on the quantitative results of (Bamler & Mandt, 2017), showing show that the dynamic skipgram smoothing algorithm (described and generalized in this paper) generalizes better to unseen data than all baselines at similar run time. We thereby analyze heldout predictive likelihoods on wordcontext pairs at a given time , where is excluded from the training set.
The predictive likelihoods as a function of time are shown in Figure 1. On all three corpora, differences between the two implementations of the static model (SGI and SGP) are small, which suggests that preinitializing the embeddings with the previous result seems to have little impact on the predictive power. Loglikelihoods for the skipgram filter (DSGF) grow over the first few time steps as the filter sees more data, and then saturate. Skipgram smoothing (this paper) further outperforms skipgram filtering.
To conclude, we stress that a structured BBVI algorithm with quadratic instead of linear run time in would be impractical. We therefore hope that our structured reparameterization trick will spur further research on complex latent time series models.
References
 Bamler & Mandt (2017) Bamler, Robert and Mandt, Stephan. Dynamic word embeddings. In International Conference on Machine Learning, 2017.
 Blei & Lafferty (2006) Blei, David M and Lafferty, John D. Dynamic Topic Models. In Proceedings of the 23rd International Conference on Machine Learning, pp. 113–120. ACM, 2006.
 Charlin et al. (2015) Charlin, Laurent, Ranganath, Rajesh, McInerney, James, and Blei, David M. Dynamic Poisson Factorization. In Proceedings of the 9th ACM Conference on Recommender Systems, pp. 155–162, 2015.
 Gultekin & Paisley (2014) Gultekin, San and Paisley, John. A Collaborative Kalman Filter for TimeEvolving Dyadic Processes. In Proceedings of the 2nd International Conference on Data Mining, pp. 140–149, 2014.
 Hamilton et al. (2016) Hamilton, William L, Leskovec, Jure, and Jurafsky, Dan. Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 1489–1501, 2016.
 Jerfel et al. (2017) Jerfel, Ghassen, Basbug, Mehmet E, and Engelhardt, Barbara E. Dynamic Compound Poisson Factorization. In Artificial Intelligence and Statistics, 2017.
 Jordan et al. (1999) Jordan, Michael I, Ghahramani, Zoubin, Jaakkola, Tommi S, and Saul, Lawrence K. An Introduction to Variational Methods for Graphical Models. Machine learning, 37(2):183–233, 1999.
 Kılıç & Stanica (2013) Kılıç, Emrah and Stanica, Pantelimon. The Inverse of Banded Matrices. Journal of Computational and Applied Mathematics, 237(1):126–135, 2013.
 Kim et al. (2014) Kim, Yoon, Chiu, YiI, Hanaki, Kentaro, Hegde, Darshan, and Petrov, Slav. Temporal Analysis of Language Through Neural Language Models. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pp. 61–65, 2014.
 Kingma & Welling (2014) Kingma, Diederik P and Welling, Max. Autoencoding variational Bayes. In ICLR, 2014.
 Michel et al. (2011) Michel, JeanBaptiste, Shen, Yuan Kui, Aiden, Aviva Presser, Veres, Adrian, Gray, Matthew K, Pickett, Joseph P, Hoiberg, Dale, Clancy, Dan, Norvig, Peter, Orwant, Jon, et al. Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331(6014):176–182, 2011.
 Mikolov et al. (2013) Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, pp. 3111–3119. 2013.
 Murphy (2012) Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT press, 2012.
 Ranganath et al. (2015) Ranganath, Rajesh, Perotte, Adler J, Elhadad, Noémie, and Blei, David M. The Survival Filter: Joint Survival Analysis with a Latent Time Series. In UAI, pp. 742–751, 2015.
 Rauch et al. (1965) Rauch, Herbert E, Striebel, CT, and Tung, F. Maximum likelihood estimates of linear dynamic systems. AIAA journal, 3(8):1445–1450, 1965.

Rezende et al. (2014)
Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan.
Stochastic Backpropagation and Approximate Inference in Deep Generative Models.
In The 31st International Conference on Machine Learning (ICML), 2014.  Ruiz et al. (2016) Ruiz, Francisco, Titsias, Michaelis, and Blei, David. The generalized reparameterization gradient. In NIPS, 2016.

Sahoo et al. (2012)
Sahoo, Nachiketa, Singh, Param Vir, and Mukhopadhyay, Tridas.
A Hidden Markov Model for Collaborative Filtering.
MIS Quarterly, 36(4):1329–1356, 2012.  Salimans & Knowles (2013) Salimans, Tim and Knowles, David A. Fixedform variational posterior approximation through stochastic line ar regression. Bayesian Analysis, 8(4), 2013.
 Wainwright et al. (2008) Wainwright, Martin J, Jordan, Michael I, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
 Wang et al. (2008) Wang, Chong, Blei, David, and Heckerman, David. Continuous time dynamic topic models. In Proceedings of the TwentyFourth Conference on Uncertainty in Artificial Intelligence, pp. 579–586, 2008.
Comments
There are no comments yet.