1 Introduction
Hidden Markov models (HMMs) [1] are popular probabilistic models for sequential data in a variety of fields, including natural language processing, speech recognition, weather forecasting, financial prediction and bioinformatics. However, their traditional inference methods, such as variational inference (VI) [2] and Markov chain Monte Carlo (MCMC) [3], are not readily scalable to large datasets. For example, one dataset in our experiment consists of million observations.
An important milestone for scaling VI was made by Hoffman et al. [4], who proposed stochastic VI (SVI), which computes cheap gradients from minibatches of data and updates the model parameters before completing a pass over the full dataset. A more recent, scalable and more accurate algorithm was proposed by Foulds et al. [5], who applied such stochastic optimization to collapsed latent Dirichlet allocation (LDA) [6]; their stochastic collapsed variational inference (SCVI) algorithm has been successful in large-scale topic modelling.
However, while these recent advances have been studied extensively for topic models that assume a simple bag-of-words data setting [4, 6, 7, 8, 9], there has been little research on whether and how they can be applied in a time-dependent data setting. Some work, such as SVI for Bayesian time series models [10] and collapsed VI (CVI) for HMMs [11], considers settings where datasets consist of many independent time series, which naturally avoids breaking the sequential dependencies. Perhaps the only true exception is the SVI algorithm for HMMs proposed by Foti et al. [12] in the setting of a single long time series, where the sequential dependencies must be broken.
In this paper, we follow the success of SCVI for LDA [5] and study the SCVI algorithm applied to a single long time series. In a collapsed HMM, we break a long chain into subchains, and we propose a novel sum-product algorithm to update the posteriors of subchains, taking into account their edge transitions due to the sequential dependencies. Our sum-product algorithm can be understood as an alternative buffering method to the one in [12]. Our experiments on two discrete datasets show that our SCVI algorithm for HMMs is scalable to very large datasets, memory efficient and significantly more accurate than the existing SVI algorithm.
2 Background
A hidden Markov model (HMM) [1] consists of a hidden state sequence and a corresponding observation sequence . Let there be hidden states. For convenience, we let the start state be and set . Let be the transition matrix where , and be the initial state distribution where . For
, we specify the Dirichlet priors with symmetric hyperparameters
on , in a Bayesian setting.
A hidden sequence is generated by a Markov process, and each observation is generated conditioned on its hidden state. We have for ,
(1) 
where parametrizes the observation likelihood for the hidden state , with
. Without loss of generality, we assume that the observation likelihoods and their conjugate prior take exponential forms. The exponential family is a broad class of probability distributions including multinomial, Gaussian, gamma, Poisson, Dirichlet, Wishart and many others; and there is a conjugate prior distribution for each member in this class. We have for
,(2)  
(3) 
The base measure and log normalizer are scalar functions; and the parameter and sufficient statistics
are vector functions. The subscripts
and represent the local hidden variables and global model parameters, respectively. The dimensionality of the prior hyperparameter is equal to .
3 Stochastic Collapsed Variational Inference
There is substantial empirical evidence [5, 11, 13] that marginalizing the model parameters improves both the accuracy and the efficiency of inference. We therefore integrate out the model parameters, and the marginal data likelihood of an HMM is:
(4) 
The gamma functions and log normalizers result from the marginalization. denotes the transition count from the hidden state to , . dot denotes the summed out column, e.g., . denotes the posterior hyperparameter for the hidden state , and , where is the standard delta function.
Given an observed sequence x
, the task of Bayesian inference in the collapsed space is to compute the posterior distributions over the hidden sequence,
. The posteriors over the model parameters can be estimated by taking a variational Bayesian maximization step with our estimated
[2]. As the exact computation is intractable, we introduce a variational distribution in a tractable family and we maximize the evidence lower bound (ELBO) denoted by ,(5) 
We consider the tractable family under the generalized mean field assumption [14] in the collapsed space: we break a single long hidden sequence into a set of subchains. We have . We do not make any further assumptions about the inner structure of each subchain, preserving the inner transition information. It is worth emphasizing that the time series dependencies in the HMM are not broken; only the variational posterior is factorized. Therefore, information can still flow across different subchains via edge transitions.
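As a minimal illustration of the bookkeeping behind this factorization, the following Python sketch cuts the index range of a long chain into equal-length subchains and draws a few of them uniformly at random, as a stochastic update would. The function names, the requirement that the subchain length divide the chain length, and sampling without replacement are assumptions made for the sketch, not details taken from the derivation.

```python
import random


def split_into_subchains(num_steps, sub_len):
    """Return (start, end) index pairs, end exclusive, tiling range(num_steps).

    Assumes sub_len divides num_steps, mirroring the equal-length
    simplification used for notation in the text.
    """
    if sub_len <= 0 or num_steps % sub_len != 0:
        raise ValueError("sub_len must be positive and divide num_steps")
    return [(s, s + sub_len) for s in range(0, num_steps, sub_len)]


def sample_minibatch(bounds, batch_size, rng):
    """Draw batch_size distinct subchains uniformly at random."""
    return rng.sample(bounds, batch_size)


# Example: a chain of 12 time steps cut into subchains of length 3.
bounds = split_into_subchains(12, 3)
minibatch = sample_minibatch(bounds, 2, random.Random(0))
```

Only the variational posterior is split along these boundaries; the time steps just outside each (start, end) pair are the neighbours through which edge transitions still carry information.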
For notational simplicity, we let each subchain be of the length and be the number of subchains given a long chain. For each hidden subchain , we denote the corresponding observed subchain by . Combining the work of SCVI for LDA [5] and CVI for HMMs [11], we uniformly sample an observation subchain , and we derive the posterior update for with a zeroth-order Taylor approximation [6],
(6)  
(7)  
(8)  
(9)  
(10) 
where denotes the global expected transition count from state to , and denotes the global emission statistics at hidden state . Unlike CVI for HMM [11], we do not need to maintain local statistics and thus our algorithm is memory efficient. We show the algorithmic procedure to infer in Section 3.1.
Given , we can collect the local transition counts and emission statistics and update the global statistics with an online average weighted by a step size ,
(11)  
(12) 
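The online average can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the exact update in (11) and (12): the step size schedule `(t + tau) ** (-kappa)` follows the common form used in SVI [4], and the `scale` factor (for instance the number of subchains) that rescales subchain statistics to the size of the full chain is an assumption; `step_size` and `online_average` are hypothetical names.

```python
def step_size(t, kappa, tau=1.0):
    """Assumed schedule rho_t = (t + tau) ** (-kappa), with forgetting rate kappa."""
    return (t + tau) ** (-kappa)


def online_average(global_stats, local_stats, rho, scale):
    """Blend global statistics with rescaled local statistics.

    global_stats and local_stats are K x K nested lists (e.g. expected
    transition counts). scale is an assumed factor that turns the
    statistics of one subchain into an estimate for the whole chain.
    """
    return [
        [(1.0 - rho) * g + rho * scale * l for g, l in zip(g_row, l_row)]
        for g_row, l_row in zip(global_stats, local_stats)
    ]


# Example: two hidden states, one update with rho = 0.5 and scale = 2.
global_counts = [[1.0, 1.0], [1.0, 1.0]]
local_counts = [[2.0, 0.0], [0.0, 2.0]]
updated = online_average(global_counts, local_counts, rho=0.5, scale=2.0)
```

Because only the global statistics are stored and overwritten, no per-subchain statistics need to be kept between iterations, which is the source of the memory efficiency noted above.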
3.1 Modified Forward-Backward Algorithm
Given a subchain , we denote the hidden variable before it and the hidden variable after it to be the guarding variables; and we denote and to be the edge transitions. In (6), the edge transitions prevent us from applying the standard forward-backward algorithm [15] to the HMM parametrized by the surrogate parameters and
. Therefore, we propose a modified sum-product algorithm to buffer subchain edges with guarding variables. We start by defining a joint distribution of a subchain and its guarding variables using the factor graph shown in Figure 1,
(13)  
(14)  
(15)  
(16)  
(17)  
(18) 
The functions associated with each factor node are given in (14)–(18). It is easy to verify that summing over the guarding variables of the joint probability in (13) reduces to in (6). Now we can use the sum-product algorithm [16] to compute the required marginals of . Specifically, we first pick as the root node and pass the messages from the leaf node , and then we pass messages in the reverse direction. In both recursions, we have eliminated the messages of the ‘variable node to factor node’ type [17]. We have,
(19)  
(20) 
where the initial messages are simply the distributions of the two guarding variables,
(21) 
After the messages have been passed in both directions, we compute the required variable marginals and pairwise marginals by,
(22)  
(23) 
The normalization constant can be obtained by normalizing any of these marginals. This completes our algorithm to infer .
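The recursions above can be sketched as follows in Python. This is an illustrative reading rather than a transcription of (19)–(23): it assumes that the surrogate transition weights `A`, the surrogate emission weights `lik`, and the current beliefs `q_left` and `q_right` over the two guarding variables are given as plain nested lists, and all of these names are hypothetical. The sketch omits message rescaling, which a practical implementation would need for long subchains.

```python
def subchain_marginals(A, lik, q_left, q_right):
    """Sum-product on a subchain buffered by two guarding variables.

    A[j][k]    : nonnegative transition weight from state j to state k
    lik[t][k]  : nonnegative emission weight of observation t under state k
    q_left[j]  : belief over the guarding variable before the subchain
    q_right[j] : belief over the guarding variable after the subchain
    Returns (gamma, xi): gamma[t][k] are single-variable marginals and
    xi[t][j][k] are pairwise marginals of positions (t, t + 1).
    """
    K, L = len(A), len(lik)
    # Initial messages: guard beliefs pushed through the edge transitions.
    enter = [sum(q_left[j] * A[j][k] for j in range(K)) for k in range(K)]
    leave = [sum(A[k][j] * q_right[j] for j in range(K)) for k in range(K)]

    # Forward pass, from the left guard towards the end of the subchain.
    alpha = [[enter[k] * lik[0][k] for k in range(K)]]
    for t in range(1, L):
        prev = alpha[-1]
        alpha.append([lik[t][k] * sum(prev[j] * A[j][k] for j in range(K))
                      for k in range(K)])

    # Backward pass, from the right guard towards the start of the subchain.
    beta = [None] * L
    beta[L - 1] = leave
    for t in range(L - 2, -1, -1):
        nxt = beta[t + 1]
        beta[t] = [sum(A[k][j] * lik[t + 1][j] * nxt[j] for j in range(K))
                   for k in range(K)]

    # The normalizer can be read off at any position; the last one is used here.
    Z = sum(alpha[L - 1][k] * beta[L - 1][k] for k in range(K))
    gamma = [[alpha[t][k] * beta[t][k] / Z for k in range(K)] for t in range(L)]
    xi = [[[alpha[t][j] * A[j][k] * lik[t + 1][k] * beta[t + 1][k] / Z
            for k in range(K)] for j in range(K)] for t in range(L - 1)]
    return gamma, xi
```

Setting `q_left` and `q_right` to point masses recovers ordinary forward-backward on a chain with known boundary states, which is one way to sanity-check such a routine.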
Our modified sum-product algorithm is an alternative buffering method to the one proposed by Foti et al. [12] in their SVI algorithm for a single long time series. A key difference is that we assume independent subchains and allow messages to be passed across the borders via local beliefs of the guarding variables in (21), whereas the subchains in the SVI algorithm are naturally correlated. However, the price for preserving the correlation is that they assume the hidden chain is irreducible and aperiodic, so that each subchain starts with an initial distribution equal to the stationary distribution of the whole chain. A second, superficial difference is that we buffer a subchain with only two guarding variables, whereas Foti et al. buffered a subchain with more observations.
4 Experiments
We evaluated the utility of our buffering method and compared the performance of our SCVI algorithm against the SVI algorithm on two synthetic datasets created from the Wall Street Journal (WSJ) and New York Times (NYT). Both corpora are made of sentences, which in turn are sequences of words. For each sentence, the underlying sequence can be understood as a Markov chain of hidden part-of-speech (PoS) tags [18], and words are drawn conditioned on PoS tags, making HMMs natural models. We shuffled both datasets, added special symbols after each sentence to denote the ends and concatenated them. We used the first million words in the concatenated WSJ and
million words in the concatenated NYT as our two long time series, respectively. As the evaluation metrics, we used predictive log likelihoods by holding out
words of each time series as testing sets.
For both the SVI and our SCVI algorithms: we set the transition and emission priors to be
; we initialized the global statistics using exponential distributions suggested by Hoffman et al.
[4]; we set assuming a universal PoS tag set [19]; when buffering was turned off, we set the initial distribution of each subchain to the whole chain’s stationary distribution. For SVI, when buffering was turned on, we buffered a subchain with words on both sides. We varied the subchain lengths, and used minibatches of subchains to reduce the sampling variance. Following Foti et al.
[12], we fixed the total length of all subchains in a minibatch , where is the minibatch size. Increasing means decreasing and vice versa. Also, we varied the forgetting rates , which parametrize the step sizes . Under each of the combined settings, we ran both algorithms for iterations.
The results figure presents the predictive log likelihoods on the WSJ (left and middle) and NYT (right). We see that in most settings our SCVI algorithm outperformed the SVI algorithm by large margins, extending the success of SCVI for LDA [5] to time series data. The only exception is when , where both algorithms performed comparably on the NYT. For SCVI, a smaller forgetting rate was preferred, which further promotes scalability, whereas SVI was less sensitive. When is small, there are noticeable improvements from the respective buffering methods in both algorithms. For SCVI, we attribute the improvement to the inter-subchain communication through guarding variables.
5 Conclusion
We have presented a stochastic collapsed variational inference algorithm for HMMs in the setting of a single long time series, together with an alternative buffering method that modifies the standard forward-backward recursions. Our SCVI algorithm is significantly more accurate than the SVI algorithm on two large datasets, and our buffering method is robust to poor choices of subchain length. For future work, we aim to derive the true natural gradients of the ELBO to prove the convergence of our algorithm [20], although we never saw a non-converging case in our experiments.
References
 [1] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. pages 267–296, 1990.
 [2] Matthew Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, The Gatsby Computational Neuroscience Unit, University College London, 2003.
 [3] L Scott. Bayesian methods for hidden Markov models: Recursive computing in the 21st century. Journal of the American Statistical Association, 97:337–351, 2002.
 [4] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. J. Mach. Learn. Res., 14(1):1303–1347, May 2013.
 [5] James R. Foulds, L. Boyles, C. DuBois, Padhraic Smyth, and Max Welling. Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation. In KDD, 2013.
 [6] Yee Whye Teh, David Newman, and Max Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, volume 19, 2007.
 [7] Matthew D. Hoffman, David M. Blei, and Francis R. Bach. Online learning for latent Dirichlet allocation. In NIPS, pages 856–864. Curran Associates, Inc., 2010.
 [8] Chong Wang and David Blei. Truncation-free stochastic variational inference for Bayesian nonparametric models. Advances in Neural Information Processing Systems, 2012.
 [9] Michael Bryant and Erik B. Sudderth. Truly nonparametric online variational inference for hierarchical Dirichlet processes. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2699–2707. Curran Associates, Inc., 2012.

 [10] Matthew Johnson and Alan Willsky. Stochastic variational inference for Bayesian time series models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1854–1862. JMLR Workshop and Conference Proceedings, 2014.
 [11] Pengyu Wang and Phil Blunsom. Collapsed variational Bayesian inference for hidden Markov models. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), Scottsdale, AZ, USA, 2013.
 [12] Nicholas Foti, Jason Xu, Dillon Laird, and Emily Fox. Stochastic variational inference for hidden Markov models. In Advances in Neural Information Processing Systems 27, pages 3599–3607. 2014.
 [13] Arthur Asuncion, Max Welling, Padhraic Smyth, and Yee Whye Teh. On smoothing and inference for topic models. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Arlington, Virginia, United States, 2009.
 [14] Eric P. Xing, Michael I. Jordan, and Stuart Russell. A generalized mean field algorithm for variational inference in exponential families. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, pages 583–591, 2003.
 [15] Leonard E. Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, 37(6):1554–1563, 1966.
 [16] F. R. Kschischang, B. J. Frey, and H. A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Trans. Inf. Theor., 47(2):498–519, September 2001.
 [17] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
 [18] Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1st edition, 2000.
 [19] Slav Petrov, Dipanjan Das, and Ryan T. McDonald. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey, May 23-25, 2012, pages 2089–2096, 2012.
 [20] Francisco J. R. Ruiz, Neil D. Lawrence, and James Hensman. True natural gradient of collapsed variational Bayes. In NIPS Workshop on Advances in Variational Inference, Montreal, 2014.