1 Background
A hidden Markov model (HMM) [1] consists of a hidden state sequence and a corresponding observation sequence. We fix the number of hidden states and, for convenience, introduce a designated start state. The model is parametrized by a transition matrix, whose rows give the next-state distributions, and an initial state distribution. A hierarchical Dirichlet process HMM (HDPHMM) [2, 3] allows the use of an unbounded number of hidden states by constructing an infinite mean vector from a stick-breaking process and drawing each state's transition vector from a Dirichlet process with the shared mean as its base measure:

(1)
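The stick-breaking construction in (1) can be simulated with a finite truncation (the truncation is only for simulation; the model itself is unbounded). A minimal numpy sketch, with all variable names and hyperparameter values illustrative:

```python
import numpy as np

def stick_breaking(gamma, K):
    """Truncated stick-breaking draw of the global weights: break off a
    Beta(1, gamma) fraction of the remaining stick at each step."""
    v = np.random.beta(1.0, gamma, size=K)
    beta = np.empty(K)
    remaining = 1.0
    for k in range(K):
        beta[k] = v[k] * remaining
        remaining *= 1.0 - v[k]
    return beta / beta.sum()  # renormalize to absorb the truncated tail

def draw_transitions(alpha, beta):
    """Under the truncation, each state's transition vector is a
    Dirichlet draw centred on the shared mean beta."""
    K = len(beta)
    return np.array([np.random.dirichlet(alpha * beta) for _ in range(K)])

beta = stick_breaking(gamma=1.0, K=20)
A = draw_transitions(alpha=5.0, beta=beta)  # rows are next-state distributions
```

Larger concentration values pull every row of the transition matrix closer to the shared mean, which is the statistical-sharing effect the HDP prior provides.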
A hidden sequence is generated by a first-order Markov process, and each observation is generated conditioned on its hidden state:
(2) 
where the emission parameters determine the observation likelihoods for each hidden state. We assume that the observation likelihoods and their conjugate prior take exponential-family forms:
(3)  
(4) 
The base measure and log normalizer are scalar functions; the natural parameter and sufficient statistics are vector functions. The subscripts distinguish the local hidden variables from the global model parameters. The dimensionality of the prior hyperparameter is determined by the sufficient statistics. For a complete Bayesian treatment, we place vague Gamma priors on the concentration parameters. The graphical model is shown in figure 1 (left).

2 Stochastic Collapsed Variational Inference
HMMs and HDPHMMs are popular probabilistic models for sequential data. However, their traditional inference methods, such as variational inference (VI) [4] and Markov chain Monte Carlo (MCMC) [3, 5], are not readily scalable to large datasets (e.g., one dataset in our experiments contains millions of sequences). In this paper, we follow the success of stochastic collapsed variational inference (SCVI) for latent Dirichlet allocation (LDA) [6], and we propose a scalable SCVI algorithm for HMMs and HDPHMMs. Our algorithm achieves better predictive performance than stochastic variational inference (SVI) [7] applied to HMMs [8] and to HDPHMMs [9]. We present our derivation in three steps: 1. we marginalize out the model parameters; 2. we derive stochastic updates for each sequence; 3. we derive stochastic updates for the posteriors of the HDP parameters (for HDPHMMs only). For notational simplicity, we consider a dataset of sequences of equal length, with analogous notation for the hidden sequences.
2.1 Collapsed HDPHMMs
There is substantial empirical evidence [10, 11, 6] that marginalizing out the model parameters improves both the accuracy and the efficiency of inference. The marginal data likelihood of an HDPHMM is:
(5) 
The gamma functions and log normalizers arise from the marginalization. The transition counts record the number of transitions from one hidden state to another, with a dot subscript denoting a summed-out column. Each hidden state also carries a posterior hyperparameter, which accumulates the prior hyperparameter and the sufficient statistics of the observations assigned to that state (written with the standard delta function).
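As an illustration, the transition counts entering (5) can be accumulated in a single pass over a state sequence. A minimal sketch, assuming integer-coded states and using state 0 as a stand-in for the paper's conventional start state:

```python
import numpy as np

def transition_counts(states, K):
    """Count n[j, k] = number of j -> k transitions in a state sequence.
    A conventional start state (coded 0 here) precedes the sequence."""
    n = np.zeros((K, K), dtype=int)
    prev = 0  # start state
    for s in states:
        n[prev, s] += 1
        prev = s
    return n

n = transition_counts([1, 2, 2, 1], K=3)
row_totals = n.sum(axis=1)  # the "summed-out column" n[j, .]
```

The row totals give the summed-out counts that pair with the concentration parameter in the gamma-function terms of (5).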
The gamma functions make (5) a nuisance to differentiate. Following [12], we replace them with integrals over auxiliary variables, and the joint likelihood becomes:
(6) 
where one set of auxiliary variables is Beta distributed, another counts the tables sharing a label within a restaurant of a Chinese restaurant franchise, and the remaining factor is an unsigned Stirling number of the first kind. The factor graph with the auxiliary variables is given in figure 1 (right).

We are interested in the posterior over the hidden sequences and auxiliary variables. As exact computation is intractable, we introduce a variational distribution in a tractable family,
(7) 
and we maximize the evidence lower bound (ELBO):
(8) 
2.2 Inference for Sequences
To infer the posterior over hidden sequences, we factorize it as a product over independent sequences. Combining the work on SCVI for LDA [6] and CVI for HMMs [11], we randomly sample a sequence and derive its variational update with a zeroth-order Taylor approximation [13]:
(9)  
(10) 
in which the geometric expectation appears, along with the expected transition counts between hidden states and the expected emission statistics at each hidden state. Details of the expectations used throughout the paper are given in Appendix A.
As this distribution is proportional to an HMM parametrized by the surrogate transition and emission parameters, we can apply the forward-backward algorithm [14]. After collecting the local transition counts and emission statistics, we update the global statistics by taking a weighted average:
(11)  
(12) 
where the step size satisfies the usual Robbins-Monro conditions (the step sizes sum to infinity, while their squares sum to a finite value).
Unlike CVI for HMMs [11], our algorithm is memory efficient: the global statistics are updated without subtracting the local statistics, so the local statistics never need to be stored explicitly.
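The per-sequence step can be sketched as follows: run forward-backward under the surrogate parameters, collect the expected counts, then blend them into the global statistics with a step size, as in (11)-(12). This is a sketch, not the authors' implementation; the surrogate-parameter computation is elided, and all names are illustrative:

```python
import numpy as np

def forward_backward(A, lik):
    """Forward-backward for an HMM with surrogate transition matrix A (K x K)
    and per-step observation likelihoods lik (T x K), using a uniform initial
    distribution for simplicity. Returns state marginals and expected
    transition counts summed over time."""
    T, K = lik.shape
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = lik[0] / lik[0].sum()
    for t in range(1, T):                      # forward pass (rescaled)
        alpha[t] = (alpha[t - 1] @ A) * lik[t]
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):             # backward pass (rescaled)
        beta[t] = A @ (beta[t + 1] * lik[t + 1])
        beta[t] /= beta[t].sum()
    marginals = alpha * beta
    marginals /= marginals.sum(axis=1, keepdims=True)
    counts = np.zeros((K, K))
    for t in range(T - 1):                     # expected transitions per step
        m = np.outer(alpha[t], beta[t + 1] * lik[t + 1]) * A
        counts += m / m.sum()
    return marginals, counts

def blend(global_stats, local_stats, scale, rho):
    """Weighted average as in (11)-(12): decay the old statistics by
    (1 - rho) and add the minibatch statistics rescaled by `scale`."""
    return (1.0 - rho) * global_stats + rho * scale * local_stats
```

Rescaling the minibatch statistics to represent the full dataset before averaging is the standard device in stochastic (collapsed) variational methods; because the blend overwrites the old value, no per-sequence statistics need to be kept, matching the memory argument above.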
2.3 Inference for HDP
For notational clarity, we write the variational posteriors of the HDP parameters as governed by their variational parameters:
(13) 
We derive stochastic updates for the HDP posteriors. For a randomly selected sequence, we form an artificial dataset consisting of replicates of the observed and hidden sequences. Assuming we can compute the required auxiliary-variable expectations from this artificial dataset, we derive intermediate variational parameters and take a weighted average with their old estimates. Hence, we have the following updates (the parameter updated in (16) is also a function of the others):

(14)
(15)  
(16) 
where dots denote summed-out indices. The details on computing the auxiliary-variable expectations are in Appendix B.

Stochastic optimization often benefits from minibatches, which reduce both the variance of noisy samples and the cost of updating the variational parameters. We therefore update the global statistics after each minibatch is processed, and the HDP posteriors after a larger batch is processed. Altogether, our SCVI algorithm for HDPHMMs is given in Appendix C; it applies to HMMs by removing the outermost loop.
3 Experiments
We evaluated our SCVI algorithm applied to HMMs (denoted SCVI) and to HDPHMMs (denoted SCVIHDP), against the SVI algorithm applied to HMMs [8] (denoted SVI). SVI applied to HDPHMMs was omitted, since we were unable to obtain a noticeable improvement over SVI when using a point estimate of the top-level stick [9]. We used two discrete datasets, the Wall Street Journal (WSJ) and the New York Times (NYT). Both datasets consist of sentences; for each sentence, the underlying sequence can be understood as a Markov chain of hidden part-of-speech (PoS) tags [15], with words drawn conditioned on the PoS tags, making (HDP)HMMs natural models. We used predictive log likelihood as our evaluation metric.
For SVI and SCVI, we set the transition priors to encourage sparsity. For SCVIHDP, we set the HDP priors to be vague. We chose the first-iteration step so that all the algorithms started with the same transition prior counts. Finally, all the emission priors were set identically, and all the global statistics were initialized using exponential distributions, as suggested by [7].

The first three rows in figure 2 present the predictive log likelihood results of the three inference algorithms on WSJ (with the sentences split into training and test sets). We fixed the number of hidden states (or truncation level)^1 and varied the minibatch sizes and forgetting rates, which parametrize the step sizes. A large batch size was also set for SCVIHDP. We let each algorithm run through the dataset several times and reported the per-time-step likelihoods. In all settings, our SCVI outperformed SVI by large margins, extending the success of SCVI for LDA [6] to time series data. Further, our collapsed HDP inference helped SCVIHDP surpass our SCVI by noticeable margins.

^1 One goal of our experiments is to show the improvement obtained by sharing statistics through our HDP inference. Using a larger truncation level than the number of hidden states would put SCVIHDP in an advantageous position, and we would not be able to identify the source of the improvement.
The fourth row in figure 2 presents the predictive log likelihood results of the three inference algorithms on NYT (millions of sentences, split into training and test sets), using settings complementary to those for WSJ: we fixed the minibatch size and forgetting rate and varied the number of hidden states. We ran all the algorithms (implemented in Cython) for a fixed number of hours and reported the likelihoods against wall-clock time. Given the same time, our SCVI converged much better than SVI. Our SCVIHDP overlapped with our SCVI towards the end, but it was consistently better before that, making better use of its time.
4 Conclusion
In this paper, we have presented a general stochastic collapsed variational inference algorithm that is scalable to very large time series datasets, memory efficient, and significantly more accurate than the existing SVI algorithm. Our algorithm is also the first truly variational algorithm for HDPHMMs, avoiding point estimates, and it comes with performance gains. For future work, we aim to derive the true natural gradients of the ELBO to prove, and further speed up, the convergence of our algorithm [16], although we never observed a non-converging case in our experiments.
References
 [1] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Readings in Speech Recognition, pages 267–296. Morgan Kaufmann, 1990.
 [2] Matthew J. Beal, Zoubin Ghahramani, and Carl E. Rasmussen. The infinite hidden Markov model. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.
 [3] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
 [4] Matthew Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, The Gatsby Computational Neuroscience Unit, University College London, 2003.
 [5] Jurgen Van Gael, Yunus Saatci, Yee W. Teh, and Zoubin Ghahramani. Beam sampling for the infinite hidden Markov model. In ICML '08: Proceedings of the 25th International Conference on Machine Learning, pages 1088–1095, New York, NY, USA, 2008. ACM.
 [6] James R. Foulds, L. Boyles, C. DuBois, Padhraic Smyth, and Max Welling. Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation. In KDD, 2013.
 [7] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. J. Mach. Learn. Res., 14(1):1303–1347, May 2013.
 [8] Nicholas Foti, Jason Xu, Dillon Laird, and Emily Fox. Stochastic variational inference for hidden Markov models. In Advances in Neural Information Processing Systems 27, pages 3599–3607. 2014.
 [9] Matthew Johnson and Alan Willsky. Stochastic variational inference for Bayesian time series models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1854–1862. JMLR Workshop and Conference Proceedings, 2014.
 [10] Arthur Asuncion, Max Welling, Padhraic Smyth, and Yee Whye Teh. On smoothing and inference for topic models. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Arlington, Virginia, United States, 2009.
 [11] Pengyu Wang and Phil Blunsom. Collapsed variational Bayesian inference for hidden Markov models. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), Scottsdale, AZ, USA, 2013.
 [12] Y. W. Teh, K. Kurihara, and M. Welling. Collapsed variational inference for HDP. In Advances in Neural Information Processing Systems, volume 20, 2008.
 [13] Yee Whye Teh, David Newman, and Max Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, volume 19, 2007.
 [14] Leonard E. Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, 37(6):1554–1563, 1966.
 [15] Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1st edition, 2000.
 [16] Francisco J. R. Ruiz, Neil D. Lawrence, and James Hensman. True natural gradient of collapsed variational Bayes. In NIPS Workshop on Advances in Variational Inference, Montreal, 2014.
Appendix A
In this section, we present the standard (geometric) expectations that appeared in the main paper.
If a variable is Beta distributed (e.g., the auxiliary variables in the main paper), we have,
(17) 
Similarly, for the other posterior distributions appearing in the main paper, we have,

(18)
Finally, if two variables are independent (e.g., pairs of variational factors in the main paper), the geometric expectation of their product factorizes into the product of their geometric expectations.
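For instance, the Beta case above has a closed form through digamma functions: for x ~ Beta(a, b), E[log x] = psi(a) - psi(a + b), so the geometric expectation is its exponential. A stdlib-only sketch (the digamma approximation, via the standard recurrence plus asymptotic series, is my own choice of implementation):

```python
import math

def digamma(x):
    """psi(x) via the recurrence psi(x) = psi(x + 1) - 1/x, switching to
    an asymptotic series once the argument is large enough (x > 0)."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x \
        - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def geometric_expectation_beta(a, b):
    """G[x] = exp(E[log x]) for x ~ Beta(a, b)."""
    return math.exp(digamma(a) - digamma(a + b))
```

By Jensen's inequality the geometric expectation always lies below the ordinary mean, which is why the surrogate parameters built from geometric expectations remain well behaved.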
Appendix B
In this section, we present the details on computing the auxiliary-variable expectations. For the expected table counts, we first note a bounding inequality.^2

^2 The bounded quantity is the expected number of tables in the restaurant. The inequality holds by a property of the CRP: the expected number of tables grows not linearly but logarithmically with the number of customers.

Thus we compute the expectation as follows:
(19)  
(20)  
(21)  
(22) 
The approximation in (19) comes from the technique proposed by Teh et al., detailed in [12]. In (20), the relevant probability is that of at least one transition between a given pair of states; the second equality comes from the fact that the sequence is replicated under exactly the same distribution. In (21) and (22), we propose a fast approximate method: we partition a hidden sequence into a set of overlapping but independent clusters. Allowing the clusters to overlap preserves all the pairwise transition information, while the independence assumption permits the above linear-time computations, as in [12]. The same strategy applies to computing the remaining expectation:

(23)
(24)  
(25)  
(26) 
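The logarithmic-growth property invoked in footnote 2 can be checked numerically: under a CRP with a given concentration, the expected number of occupied tables after n customers is the sum over customers of concentration / (concentration + i - 1), which grows like the concentration times log n. A small sketch (names illustrative):

```python
import numpy as np

def expected_tables(theta, n):
    """Expected number of occupied tables in a CRP with concentration
    theta after n customers: sum_{i=1..n} theta / (theta + i - 1).
    Customer i opens a new table with exactly that probability."""
    i = np.arange(1, n + 1)
    return float(np.sum(theta / (theta + i - 1)))

m = expected_tables(theta=1.0, n=1000)  # far below the linear bound n
```

With concentration 1 this is simply the harmonic number, roughly log(1000) + 0.577 for a thousand customers, illustrating why the table-count expectations stay small even for long sequences.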
Appendix C
In this section, we present our SCVI algorithm for HDPHMMs.