Clustering-Enhanced Stochastic Gradient MCMC for Hidden Markov Models with Rare States

10/31/2018 · by Rihui Ou, et al.

MCMC algorithms for hidden Markov models, which often rely on the forward-backward sampler, scale poorly with sample size due to the temporal dependence inherent in the data. Recently, a number of approaches have been developed for posterior inference which exploit the mixing of the hidden Markov process to approximate the full posterior using small chunks of the data. However, in the presence of imbalanced data resulting from rare latent states, these minibatch estimates will often exclude rare-state data, resulting in poor inference of the associated emission parameters and inaccurate prediction or detection of rare events. Here, we propose to use a preliminary clustering to over-sample the rare clusters and reduce variance in gradient estimation within Stochastic Gradient MCMC. We demonstrate very substantial gains in predictive and inferential accuracy on real and synthetic examples.


1 Introduction

As time series data become increasingly large, traditional algorithms for inference suffer from the added computational burden and are often intractable. As a result, there has been increased interest in the development of scalable algorithms. We are particularly interested in obtaining reliable uncertainty quantification (UQ) in inference and prediction. There is a recent literature on relevant methods, including approaches that subsample ‘chunks’ of the time series (Johnson, 2014; Hughes et al., 2015) and data thinning (Jiang and Willett, 2016). In the present work, we focus on Hidden Markov Models (HMMs) – well regarded for their versatility in characterizing a breadth of phenomena including protein folding dynamics (Stigler et al., 2011), credit ratings (Petropoulos et al., 2016), speech recognition (Huang et al., 1990), stock market forecasting (Hassan and Nath, 2005), and rare event detection (Zhang et al., 2005; Chan et al., 2004). In particular, we consider the case where the time series exhibits normal behavior over large intervals of time, interrupted intermittently by different transient dynamics associated with one or more rare states in the latent Markov chain, akin to the dynamics of network attacks (Ourston et al., 2003) and solar flares (Hall and Willett, 2013).

Due to the temporal dependence inherent in sequential data, methods for inferring HMM parameters – such as Monte Carlo methods (Scott, 2002), Expectation-Maximization (Bishop, 2006), and variational Bayes (Johnson and Willsky, 2014) – often rely on an implementation of the forward-backward algorithm (FB), named as such due to the need to make both a forward and a backward pass through the entire time series. At each iteration, one attains a local update of the unknown latent states in the HMM using FB to obtain marginal distributions of these states. This is followed by a global update of the parameters for the emission distributions. As the FB must pass through the entire sequence of observations in the time series, it results in a time complexity which is linear in the number of samples, burdensome for very large time series.

One natural approach to addressing the computational burden is to subsample the data. Unfortunately, while a number of subsampling strategies have been developed in the iid setting (Campbell and Broderick, 2017; Scott et al., 2016; Huggins et al., 2016; Quiroz et al., 2018; Fu and Zhang, 2017), the sequential nature of time series makes this strategy more difficult. Most relevant to the present work, Foti et al. (2014) constructed a Stochastic Variational Inference HMM (SVI-HMM) algorithm which breaks long sequences apart to construct minibatch samples. The approach is justifiable due to the inherent memory decay in HMMs when the time series is long. Unfortunately, as noted in Ma et al. (2017), this algorithm suffers from a number of drawbacks, in particular the well-known under-estimation of posterior correlation and the limitation to conjugate emission distributions. Through an alternative formulation of the HMM, Ma et al. (2017) also made use of the underlying mixing of the latent states to separate the time series to develop a Stochastic Gradient MCMC (SG-MCMC) framework for inferring the emission parameters and transition probabilities. Their approach demonstrated drastically improved computational performance compared to traditional FB. In this SG-MCMC algorithm, estimates of the correlation decay rate of the underlying Markov chain are obtained to determine appropriately long buffers separating subseries into effectively independent chunks. These subseries of the full data are then subsampled to obtain minibatch estimates of the full-data log-posterior gradient.

There is one large drawback common to both SVI-HMM and SG-MCMC. In the presence of one or more rare states, minibatches will often fail to include any portion of the time series in which the transient dynamics occur. As a result, one should expect current minibatch-based estimation to fail to discover the rare state(s), even though these states are often of particular interest. Motivated by this observation, we propose to first cluster the subseries, resulting in a biased subsampling strategy which has a high probability of including rare dynamics, which we then implement within SG-MCMC. Our Cluster-enhanced Stochastic Gradient MCMC (CSG-MCMC) approach adds minimal computational complexity while drastically improving the accuracy of gradient estimates. We provide numerical evidence from synthetic data experiments that the clustering-based estimates have smaller variance than those of Ma et al. (2017), thereby minimizing error and improving predictive performance. Additional experiments are provided using heart rate data from the MIMIC-III database (Johnson et al., 2016).

2 Background

2.1 Hidden Markov Models and Motivation

HMMs consist of discrete-valued latent states $x_{1:T}$ generated by a Markov chain and corresponding observations $y_{1:T}$ generated by an emission distribution determined by the state. Specifically, the joint distribution of $x_{1:T}$ and $y_{1:T}$ factorizes as

$$p(y_{1:T}, x_{1:T} \mid A, \phi) = p(x_1) \prod_{t=2}^{T} p(x_t \mid x_{t-1}, A) \prod_{t=1}^{T} p(y_t \mid x_t, \phi), \qquad (1)$$

where $A$ is the Markov transition matrix for the latent state, $\phi$ consists of parameters of the associated emission distributions, and $p(x_1)$ is the distribution of the initial state. Throughout, we assume the latent Markov chain is recurrent and irreducible and that the latent states are at stationarity, so that $x_1$ is drawn from the unique stationary distribution $\pi$ for $A$.

As a motivating example, consider the following extreme case. Suppose there are two latent states with transition matrix

$$A = \begin{pmatrix} 1-\epsilon_{12} & \epsilon_{21} \\ \epsilon_{12} & 1-\epsilon_{21} \end{pmatrix}, \qquad (2)$$

whose columns give the distribution of the next state given the current state, and which has stationary distribution $\pi = \big(\tfrac{\epsilon_{21}}{\epsilon_{12}+\epsilon_{21}}, \tfrac{\epsilon_{12}}{\epsilon_{12}+\epsilon_{21}}\big)$. Furthermore, consider Gaussian emissions

$$y_t \mid x_t = k \sim \mathcal{N}(\mu_k, \sigma^2), \qquad k \in \{1, 2\}, \qquad (3)$$

where $\mu_1 = 0$ and $\mu_2 = 1$. A realization of this process, shown in Figure 1 with $\epsilon_{12} \ll \epsilon_{21}$ and small $\sigma^2$, shows rare visits to a neighborhood of one.

Figure 1: One realization of the HMM with transition matrix (2) and emission distributions (3) is shown above. Observe the large portion of the time series near zero with intermittent visits near one.
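To make the motivating example concrete, the following sketch simulates an HMM of the form (2)–(3); the particular values of $\epsilon_{12}$, $\epsilon_{21}$, $\mu_k$, and $\sigma$ are illustrative choices, not the ones used to produce Figure 1.

```python
import numpy as np

def simulate_hmm(T, A, means, sigma, rng):
    """Simulate a length-T HMM path with Gaussian emissions N(means[k], sigma^2).
    A[j, k] = P(x_{t+1} = j | x_t = k), i.e. columns are conditional distributions."""
    K = A.shape[0]
    # start from the stationary distribution of A (right eigenvector for eigenvalue 1)
    evals, evecs = np.linalg.eig(A)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = pi / pi.sum()
    x = np.empty(T, dtype=int)
    x[0] = rng.choice(K, p=pi)
    for t in range(1, T):
        x[t] = rng.choice(K, p=A[:, x[t - 1]])
    y = rng.normal(means[x], sigma)
    return x, y

rng = np.random.default_rng(0)
eps = 0.005                                # illustrative: rare transitions into state 2
A = np.array([[1 - eps, 0.5],              # column k gives P(next state | current state k)
              [eps,     0.5]])
x, y = simulate_hmm(200_000, A, means=np.array([0.0, 1.0]), sigma=0.1, rng=rng)
print("fraction of time in the rare state:", (x == 1).mean())
```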

Now, assume that only the common variance, $\sigma^2$, of the emission distributions is known and we are interested in inference on the transition matrix and means. We use the conjugate priors

$$\mu_k \sim \mathcal{N}(m_0, s_0^2), \qquad A_{\cdot k} \sim \mathrm{Dirichlet}(\alpha), \qquad k = 1, 2,$$

where $A_{\cdot k}$ corresponds to column $k$ of $A$. Given the minimal overlap of the two emission distributions, the latent states can be (almost exactly) identified through the observations. Thus, we treat $x_{1:T}$ as known given $y_{1:T}$, thereby ignoring the sequential dependence in the data. In this case, the posterior distributions are

$$\mu_k \mid y_{1:T}, x_{1:T} \sim \mathcal{N}\!\left(\sigma_k^2\Big(\frac{m_0}{s_0^2} + \frac{n_k \bar{y}_k}{\sigma^2}\Big),\ \sigma_k^2\right), \qquad A_{\cdot k} \mid x_{1:T} \sim \mathrm{Dirichlet}\big(\alpha + (n_{k1}, n_{k2})\big), \qquad (4)$$

where $n_k$ is the number of observations in state $k$, $\bar{y}_k$ is the sample mean of the observations in state $k$, $n_{jk}$ is the number of transitions from state $j$ to state $k$, and $\sigma_k^2 = (1/s_0^2 + n_k/\sigma^2)^{-1}$ is the posterior variance of $\mu_k$.

Suppose now that we subdivide the full time series into $M$ subchains each of length $L$ and construct a posterior sampling algorithm as follows. At each step, we draw uniformly at random $m$ subchains and sample from (4) using only these subsampled chains. When $\pi_2 \ll 1/(mL)$, state 2 will be so rare that we can expect $n_2 = 0$ and $n_{j2} = n_{2j} = 0$ at (almost) every iteration, so that the posterior estimates of $\mu_2$ and $A_{\cdot 2}$ will be essentially equivalent to their priors.

Of course, poor inference for the rare-state parameters may also occur using the full data. However, one should expect this minibatch subsampling approach to greatly exacerbate the issue, particularly when the subchains cover a small proportion of the full chain, i.e. $mL \ll T$. As both the SVI-HMM of Foti et al. (2014) and the SG-MCMC of Ma et al. (2017) are based on this sampling approach, it is reasonable to expect both methods to suffer when there are one or more rare latent states. In fact, it is possible that such an approach will fail to detect rare states altogether.

Consider the following alternate posterior sampling algorithm which avoids this issue. First, we cluster the subchains into groups $C_1$, which contains the subchains where all latent states are one, and $C_2$, which contains the subchains with at least one observation from state two. We then use a stratified subsample in (4) by drawing $m_i$ subsamples uniformly at random from cluster $C_i$, where $m_1 + m_2 = m$. At each step, we can be assured that $n_2$ and the transition counts involving state two will be positive, thereby improving the inference regarding the rare state. We investigate this subsampling strategy in concert with SG-MCMC, which we now review.
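As a rough illustration of why the stratified draw helps, the sketch below contrasts uniform minibatches of subchain centers with a stratified draw from two clusters of the kind just described; the cluster construction here uses the (normally unknown) rare-state labels purely for illustration.

```python
import numpy as np

def uniform_subsample(centers, m, rng):
    """Naive minibatch: draw m subchain centers uniformly at random."""
    return rng.choice(centers, size=m, replace=False)

def stratified_subsample(clusters, m_per_cluster, rng):
    """Stratified draw: m_c centers from each cluster C_c, guaranteeing that the
    cluster containing rare-state subchains is always represented."""
    draws = [rng.choice(c, size=min(m, len(c)), replace=False)
             for c, m in zip(clusters, m_per_cluster)]
    return np.concatenate(draws)

# toy setting: 10,000 subchains, only 40 of which touch the rare state
rng = np.random.default_rng(1)
centers = np.arange(10_000)
rare = rng.choice(centers, size=40, replace=False)
clusters = [np.setdiff1d(centers, rare), rare]   # C_1: common only, C_2: contains state two

hits = sum(np.intersect1d(uniform_subsample(centers, 20, rng), rare).size > 0
           for _ in range(1_000))
print("uniform minibatches touch the rare state in %.1f%% of draws" % (hits / 10))
print("each stratified draw includes", len(stratified_subsample(clusters, [18, 2], rng)),
      "centers, two of them from the rare cluster")
```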

2.2 Stochastic Gradient MCMC for HMMs

Consider the alternate formulation of the HMM which marginalizes out the hidden states to attain the marginal likelihood,

$$p(y_{1:T} \mid \theta) = \mathbf{1}_K^{\top} \left[ \prod_{t=T}^{1} P_\theta(y_t)\, A \right] \pi, \qquad (5)$$

where $P_\theta(y_t)$ is the diagonal matrix with entries $p(y_t \mid x_t = k, \phi)$, $\mathbf{1}_K$ is a $K$-dimensional vector of ones, and $\pi$ is the stationary distribution for $A$. Given a prior $p(\theta)$, the posterior distribution of $\theta = (A, \phi)$ given observations $y_{1:T}$ is then

$$p(\theta \mid y_{1:T}) \propto p(y_{1:T} \mid \theta)\, p(\theta). \qquad (6)$$

Assuming a continuous parameter $\theta$, marginalizing the latent states allows one (in theory) to make use of sampling algorithms such as Hamiltonian Monte Carlo or SG-MCMC. Unfortunately, both methods require calculation of the gradient of $\log p(\theta \mid y_{1:T})$ with components

$$\frac{\partial}{\partial \theta_j} \log p(\theta \mid y_{1:T}) = \frac{\partial}{\partial \theta_j} \log p(\theta) + \frac{1}{p(y_{1:T} \mid \theta)} \sum_{t=1}^{T} u_t^{\top} \frac{\partial L_t}{\partial \theta_j} v_{t-1}, \qquad (7)$$

where $L_t = P_\theta(y_t) A$ and

$$u_t^{\top} = \mathbf{1}_K^{\top} L_T L_{T-1} \cdots L_{t+1}, \qquad v_{t-1} = L_{t-1} L_{t-2} \cdots L_1\, \pi. \qquad (8)$$

When $T$ is large, evaluating the gradient becomes intractable, largely due to the repeated matrix multiplication needed to calculate $u_t$ and $v_{t-1}$.

To circumvent this issue, Ma et al. (2017) construct gradient approximations utilizing minibatch subsampling in conjunction with the mixing of the latent Markov chain. We summarize their main idea in a slightly simpler construction.

Fix $L$ with $L$ odd. For a center index $c$, consider the subchain of length $L$ centered at $c$ and let $\mathcal{T}_c = \{c-(L-1)/2, \ldots, c+(L-1)/2\}$ denote its time indices. Now partition the full time series into the $M = T/L$ sequential subchains such that every observation belongs to exactly one subchain. Here, we fix the subchain centers $c_i = (L+1)/2 + (i-1)L$ for $i = 1, \ldots, M$. We may then express (7) using these subseries as

$$\frac{\partial}{\partial \theta_j} \log p(\theta \mid y_{1:T}) = \frac{\partial}{\partial \theta_j} \log p(\theta) + \frac{1}{p(y_{1:T} \mid \theta)} \sum_{i=1}^{M} \sum_{t \in \mathcal{T}_{c_i}} u_t^{\top} \frac{\partial L_t}{\partial \theta_j} v_{t-1}. \qquad (9)$$

Similar to (7), the calculation of $u_t$ and $v_{t-1}$ will be computationally demanding when $T$ is large, as they must pass over the full series after and before $t$ respectively. As such, it is desirable to find approximations. Assume $A$ is known and let $B$ be a buffer length longer than the inverse of the spectral gap of $A$. Given the memory decay in the latent states, any data occurring more than $B$ steps before or after $t$ will be essentially independent of it. Thus, we may make the following approximations,

$$u_t^{\top} \approx \tilde{u}_t^{\top} = \mathbf{1}_K^{\top} L_{t+B} L_{t+B-1} \cdots L_{t+1}, \qquad v_{t-1} \approx \tilde{v}_{t-1} = L_{t-1} L_{t-2} \cdots L_{t-B}\, \pi, \qquad (10)$$

where $\pi$ is the invariant distribution of $A$. These approximations are simply truncations of the products in (8), making them much less demanding to calculate.

A second approximation arises by using a random subsample of subchains in the calculation of (9). The approximated gradient has components

$$\frac{\partial}{\partial \theta_j} \log p(\theta \mid y_{1:T}) \approx \frac{\partial}{\partial \theta_j} \log p(\theta) + \frac{M}{|S|} \sum_{i \in S} \sum_{t \in \mathcal{T}_{c_i}} \frac{\tilde{u}_t^{\top} \frac{\partial L_t}{\partial \theta_j} \tilde{v}_{t-1}}{\tilde{u}_t^{\top} L_t \tilde{v}_{t-1}}, \qquad (11)$$

where $S \subset \{c_1, \ldots, c_M\}$ is a subset of the subseries centers and the factor $M/|S|$ rescales the subsample to the full sum. Taking $|S| \ll M$ further reduces the computational burden of approximating the gradient.

To ensure independence of the terms within (11), some care must be taken to ensure each subsampled chunk and the observations before and after it are sufficiently separated. This is equivalent to the requirement that $|c_i - c_j| \geq L + 2B + \delta$ for all $i \neq j$ with $c_i, c_j \in S$, where $\delta \geq 0$ is a minimum gap between subseries. Ma et al. (2017) use a more flexible sampling approach which enforces this gap condition without fixing the subseries centers a priori. However, we do note that when $|S| \ll M$, poor coverage of rare latent states should be expected.

The SG-MCMC algorithm proceeds using the gradient estimates in (11). Of course, in practice the transition matrix $A$ is unknown, and therefore so are its invariant distribution and spectral gap. The authors address this issue naturally by using updated estimates of $A$ to estimate $\pi$ and the buffer length $B$. See Ma et al. (2017) for the full details, including technical details regarding sampling from the simplex in estimating the columns of $A$. A more general discussion of SG-MCMC can be found in Ma et al. (2015). For now, we summarize the SG-MCMC method in Algorithm 1.

  Set the number of SG steps per iteration
  Initialize $\phi$ and $A$
  for each MCMC iteration do
     Estimate the spectral gap and buffer length $B$ using the current estimate of $A$. Sample subchains of length $L + 2B$.
     for each SG step do
        Update the emission parameters $\phi$ using the gradient estimate (11)
     end for
     Calculate the gradient estimate (11) for the transition parameters. Set the step size.
     for each SG step do
        Update the columns of $A$ using (11), respecting the simplex constraints
     end for
  end for
Algorithm 1 SG-MCMC for HMMs (Ma et al., 2017)
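Algorithm 1 leaves the particular SG-MCMC dynamics unspecified; as a minimal sketch, the update below uses stochastic gradient Langevin dynamics (SGLD) driven by whichever minibatch gradient estimator is plugged in, e.g. (11). This is a generic illustration rather than the exact sampler of Ma et al. (2017).

```python
import numpy as np

def sgld_step(theta, grad_log_post_estimate, step_size, rng):
    """One SGLD update: theta <- theta + (step/2) * grad_estimate + N(0, step) noise.
    grad_log_post_estimate(theta) should return a minibatch estimate of the
    log-posterior gradient, e.g. the buffered estimator (11)."""
    noise = rng.normal(0.0, np.sqrt(step_size), size=theta.shape)
    return theta + 0.5 * step_size * grad_log_post_estimate(theta) + noise

# toy usage with a Gaussian log-posterior whose gradient is -(theta - mu)
rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0])
theta = np.zeros(2)
for _ in range(5_000):
    theta = sgld_step(theta, lambda th: -(th - mu), 1e-2, rng)
print("final SGLD sample near the posterior mean:", theta)
```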

3 Clustering Based SG-MCMC for HMMs

In the implementation of Algorithm 1, one relies on the critical assumption that the gradient estimate in (11) is normally distributed about the true gradient, such that

$$\widetilde{\nabla}_\theta \log p(\theta \mid y_{1:T}) \sim \mathcal{N}\big(\nabla_\theta \log p(\theta \mid y_{1:T}),\, V(\theta)\big), \qquad (12)$$

where $V(\theta)$ is the covariance of the minibatch estimates. When rare states exist within the HMM, this assumption is questionable. Furthermore, even if the estimates are unbiased, one should certainly expect the covariance to be very large given the proclivity of the minibatch samples to exclude rare states. Thus, we focus on designing a subsampling strategy to construct a gradient estimator which (i) will avoid bias in the components corresponding to the rare latent state(s), (ii) has smaller covariance, and (iii) is better approximated by a normal distribution than the original minibatch gradient estimator. While (i) is specific to the case of rare latent states, goals (ii) and (iii) will also improve SG-MCMC in general settings.

To this end, we use stratified sampling instead of uniform minibatches to bias the subsamples to include portions of the time series in the rare state. Again, following the discussion in Section 2.2, we partition the full time series into $M$ sequential subseries, each of length $L$, which we identify by the index of the center point of the subseries, $c_i = (L+1)/2 + (i-1)L$ for $i = 1, \ldots, M$.

As a preprocessing step, we first use kmeans++ to cluster the subsequences into $C$ clusters, $S_1, \ldots, S_C$. We identify each subseries by its center point so that $S_1, \ldots, S_C$ partition $\{c_1, \ldots, c_M\}$. We then draw a subsample $\widetilde{S}_c \subset S_c$ for $c = 1, \ldots, C$ independently. Let $N_c$ be the number of subseries in cluster $c$ and $m_c$ be the number of subsamples drawn from $S_c$. Note that $L$, $C$, and $m_1, \ldots, m_C$ are all tuneable parameters which must be specified.
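A minimal sketch of this preprocessing step, assuming scikit-learn's KMeans (which uses k-means++ initialization) and treating each length-$L$ subseries as a point in $\mathbb{R}^L$:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_subseries(y, L, n_clusters, seed=0):
    """Partition y into consecutive length-L subseries, embed each as a vector in
    R^L, and cluster with k-means++ under the Euclidean metric. Returns, for each
    cluster c, the indices of the subseries center points assigned to it."""
    M = len(y) // L                          # number of whole subseries
    X = np.asarray(y[:M * L]).reshape(M, L)
    labels = KMeans(n_clusters=n_clusters, init="k-means++",
                    n_init=10, random_state=seed).fit_predict(X)
    centers = np.arange(M) * L + L // 2      # index of each subseries' center point
    return [centers[labels == c] for c in range(n_clusters)]

# example: a flat noisy series with one rare excursion ends up in its own cluster
rng = np.random.default_rng(3)
y = rng.normal(0.0, 0.1, size=10_000)
y[5_000:5_050] += 1.0
clusters = cluster_subseries(y, L=25, n_clusters=2)
print([len(c) for c in clusters])
```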

To evaluate (9), one could reorder the sum, grouping the subseries within each cluster together. Therefore, (9) can be rewritten as

$$\frac{\partial}{\partial \theta_j} \log p(\theta \mid y_{1:T}) = \frac{\partial}{\partial \theta_j} \log p(\theta) + \sum_{c=1}^{C} \sum_{c_i \in S_c} g_i(\theta)_j, \qquad g_i(\theta)_j = \sum_{t \in \mathcal{T}_{c_i}} \frac{u_t^{\top} \frac{\partial L_t}{\partial \theta_j} v_{t-1}}{u_t^{\top} L_t v_{t-1}}. \qquad (13)$$

The inner sum over $S_c$ may be approximated using the subsampled indices $\widetilde{S}_c$. Thus, an unbiased estimator of the $j$th component of the gradient is

$$\frac{\widetilde{\partial}}{\partial \theta_j} \log p(\theta \mid y_{1:T}) = \frac{\partial}{\partial \theta_j} \log p(\theta) + \sum_{c=1}^{C} \frac{N_c}{m_c} \sum_{c_i \in \widetilde{S}_c} g_i(\theta)_j, \qquad (14)$$

where we rescale each inner sum over $\widetilde{S}_c$ by $N_c/m_c$ so that each cluster's contribution has the correct expectation.

The calculation of $u_t$ and $v_{t-1}$ is still intractable for large $T$, so we follow Ma et al. (2017) and make use of the approximation (10). Therefore, the stratified estimator is

$$\frac{\widehat{\partial}}{\partial \theta_j} \log p(\theta \mid y_{1:T}) = \frac{\partial}{\partial \theta_j} \log p(\theta) + \sum_{c=1}^{C} \frac{N_c}{m_c} \sum_{c_i \in \widetilde{S}_c} \tilde{g}_i(\theta)_j, \qquad \tilde{g}_i(\theta)_j = \sum_{t \in \mathcal{T}_{c_i}} \frac{\tilde{u}_t^{\top} \frac{\partial L_t}{\partial \theta_j} \tilde{v}_{t-1}}{\tilde{u}_t^{\top} L_t \tilde{v}_{t-1}}. \qquad (15)$$

Unlike Ma et al. (2017), we do not constrain the chosen subsamples within or across clusters to satisfy the gap condition. However, our numerical experiments thus far indicate this estimation procedure indeed meets requirements (i) - (iii). A more detailed analytic investigation is being conducted presently.
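The stratified estimator itself is simple to assemble once the per-subseries contributions are available. The sketch below treats the contribution $\tilde{g}_i(\theta)$ as a user-supplied function (computing the buffered products in (10) is omitted) and applies the $N_c/m_c$ weighting of (14)–(15).

```python
import numpy as np

def stratified_gradient(grad_log_prior, subseries_grad, clusters, m_per_cluster, theta, rng):
    """Stratified estimate of the log-posterior gradient as in (14)/(15):
    grad log p(theta) + sum_c (N_c / m_c) * (sum over a uniform subsample of cluster c
    of the per-subseries contributions). `subseries_grad(theta, center)` returns the
    (exact or buffered) gradient contribution g_i(theta) of one subseries."""
    total = np.asarray(grad_log_prior(theta), dtype=float)
    for centers, m_c in zip(clusters, m_per_cluster):
        N_c = len(centers)
        if N_c == 0:
            continue
        draw = rng.choice(centers, size=min(m_c, N_c), replace=False)
        total = total + (N_c / len(draw)) * sum(subseries_grad(theta, c) for c in draw)
    return total
```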

Our CSG-MCMC with stratified sampling approach is summarized in Algorithm 2 wherein we have elected to specify a fixed buffer length explicitly rather than estimate it from the data. This choice was made to simplify the implementation of CSG-MCMC allowing us to focus more directly on improvements in gradient estimation resulting from the stratified subsampling approach.

Require: Clusters $S_1, \ldots, S_C$ and subsample sizes $m_1, \ldots, m_C$
  Select the buffer length $B$
  Initialize $\phi$ and $A$
  for each MCMC iteration do
     for $c = 1, \ldots, C$ do
        for $k = 1, \ldots, m_c$ do
           Uniformly sample a subseries center from $S_c$ and form the corresponding buffered subchain
        end for
     end for
     Update $\phi$ and $A$ using (15) in SG-MCMC as the estimator of the log-posterior gradient
  end for
Algorithm 2 Cluster-enhanced SG-MCMC

4 Experiments

4.1 Conjugate Emission Distribution

We consider two numerical experiments on synthetic data, each with $T$ observations and Gaussian emission densities, to demonstrate the improvements obtained through subseries clustering. For comparison, we report the log predictive density and the mean squared loss of the estimated transition matrix as a function of wall-clock run time. Additionally, 10-step predictive intervals and estimated gradient variances are provided.

In the first dataset, hereafter referred to as the balanced dataset (BD), there are four latent states with a transition matrix whose stationary distribution is uniform over the states. The Gaussian emission densities have means -6, -3, 0, and 3, and common variance equal to 2.

The second dataset is generated from an imbalanced HMM (ID) with five latent states and a transition matrix whose stationary distribution over the hidden states is imbalanced. The Gaussian emission densities have means -20, -10, 0, 10, and 20, and common variance equal to 1. We will use this imbalanced dataset to demonstrate our stratified sampling scheme's ability to capture rare dynamics. As the emissions have a huge separation and the latent states are not truly rare, one should expect SG-MCMC to work well in this case. Nonetheless, CSG-MCMC outperforms it here too.

In the first case, we subsample the datasets by setting the subseries length $L$ and the minibatch size $m$, and draw minibatches from each of the $C$ clusters. In the second case, we subsample the datasets by setting $L$ and $C$, and draw one sample from each cluster for a total sample size of $m = C$. To cluster the data, we view the subseries as vectors in $\mathbb{R}^L$ and use the Euclidean metric in kmeans++.

We assign improper (constant) priors for all parameters so that $p(\theta) \propto 1$. In this case, the parameters are the emission means and variances and the transition matrix $A$. The simplex constraints on the columns of $A$ are maintained by projecting back to the simplex after each iteration.
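The projection step can be done with the standard Euclidean projection onto the probability simplex; the sorting-based routine below is one common choice and is an assumption about the implementation, not a detail taken from the experiments.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1} via sorting."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + tau, 0.0)

# restore column-stochasticity of A after an unconstrained gradient update
A = np.array([[1.1, 0.4],
              [-0.2, 0.7]])
A_proj = np.apply_along_axis(project_to_simplex, 0, A)
print(A_proj, A_proj.sum(axis=0))   # columns now lie on the simplex
```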

We plot the 10-step-ahead log predictive density of 2000 alternative observations and the predictive intervals to evaluate the performance of both algorithms. On the BD dataset, CSG-MCMC outperforms SG-MCMC in terms of log predictive density (Figure 2). The clustering-based method requires less runtime to attain the same predictive density, which indicates that our method converges faster. Moreover, in the long run, its log predictive density remains higher than that of the other algorithm, which suggests that it provides a more accurate approximation to the posterior.

Concerning predictive intervals, the non-clustering predictive intervals fail to include the first observation most of the time, and they are wider than the clustering-based intervals (Figure 3). CSG-MCMC outperforms SG-MCMC due to the greatly decreased variance in the gradient estimates at every iteration (see Figure 4).

In the study of the ID dataset, we use the error in the estimated transition matrix to evaluate the performance of our clustering-based algorithm with respect to capturing rare dynamics. The convergence of this error indicates that the clustering-based algorithm is markedly faster. Furthermore, the long-run behavior of both algorithms indicates that the clustering-based algorithm provides a more accurate estimate of the posterior. The reason the clustering-based algorithm outperforms the non-clustering algorithm is that the stratified sampling scheme guarantees rare dynamics are selected to estimate the gradient in every iteration, while the non-clustering algorithm will likely fail to incorporate the information from rare dynamics.

Figure 2: The 10-step-ahead log predictive density of 2000 alternative observations versus time in the BD dataset (Left). The transition matrix error versus time in the ID dataset (Right).
Figure 3: The 95% predictive intervals of both methods and 10 realizations of the generated alternative data, compared at 2000 iterations (Top Left), 3300 iterations (Top Right), 4600 iterations (Bottom Left), and 5900 iterations (Bottom Right).
Figure 4: Monte Carlo estimates of the variances of the estimated gradient for different minibatch sizes at a fixed point in the parameter space, plotted as two heat maps. The heat maps indicate that the clustering-based method provides gradient estimates with smaller variance.
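The variance reduction behind Figure 4 can be mimicked in a few lines: when a handful of subseries carry large gradient contributions, a uniform minibatch estimate of their sum has far larger variance than a stratified one. The numbers below are a toy illustration, not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(4)
# per-subseries contributions: most near zero, a few large (rare-state subseries)
g = np.concatenate([rng.normal(0.0, 0.1, 9_960), rng.normal(25.0, 1.0, 40)])
true_sum, m = g.sum(), 20
common, rare = np.arange(9_960), np.arange(9_960, 10_000)

def uniform_est():
    idx = rng.choice(10_000, size=m, replace=False)
    return (10_000 / m) * g[idx].sum()

def stratified_est(m_rare=2):
    i_c = rng.choice(common, size=m - m_rare, replace=False)
    i_r = rng.choice(rare, size=m_rare, replace=False)
    return (len(common) / (m - m_rare)) * g[i_c].sum() + (len(rare) / m_rare) * g[i_r].sum()

u = np.array([uniform_est() for _ in range(2_000)])
s = np.array([stratified_est() for _ in range(2_000)])
print("uniform:    bias %6.1f, sd %6.1f" % (u.mean() - true_sum, u.std()))
print("stratified: bias %6.1f, sd %6.1f" % (s.mean() - true_sum, s.std()))
```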

4.2 Non-conjugate Emission Distribution

As a second experiment to demonstrate the advantages of CSG-MCMC, we study the convergence speed and accuracy when a non-conjugate emission distribution is specified. In this case, we simulate a dataset with $T$ observations and two latent states, with a transition matrix whose stationary distribution is uniform over the two states, so there are no rare states. The emission distribution of the first state is Bernoulli with success probability $p_1$, while that of the second is Bernoulli with success probability $p_2$.

We subsample the time series by setting the subseries length $L$ and the total subsample size $m$, comprised of samples from each of the $C$ clusters. We use uniform priors for the success probabilities in the Bernoulli emissions and improper priors for the remaining parameters.

Similar to Section 4.1, we plot the log predictive density of alternative observations and use the error in the estimated transition matrix to evaluate the estimation performance of both algorithms (Figure 5).

The log predictive density of alternative data shows CSG-MCMC has both a faster convergence speed and better accuracy. Furthermore, the convergence of the transition parameter demonstrates that the clustering-based method provides an accurate estimate of $A$ approximately 10 times faster.

Figure 5: The 10-step-ahead log predictive density of 2000 alternative observations versus time in the non-conjugate model (Left). The transition matrix error versus time in the non-conjugate model (Right).
Figure 6: The heart rate of a single patient (Left). The performance metric versus time for both algorithms (Right).

4.3 Real Data Analysis

The discovery and prediction of rare states is of particular importance in heart rate data, wherein a patient of interest may exhibit normal behavior most of the time with intermittent moments of a dangerously rapid or slow pulse. The increasing prevalence of wearable technology has made it possible to attain second-to-second heart rate data over many days, resulting in massively long time series. In analyzing such time series, naive subsampling is likely to miss rare dynamics, thereby hindering detection, let alone inference, of such rare states.

In this section, we analyze heart rate data from the MIMIC-III database, in which heart rate is reported at approximately one-second intervals, using both CSG-MCMC and SG-MCMC. We restrict our focus to the 84 largest time series. An example of one such series is shown in Figure 6.

We build a model with $K$ hidden states and a Gaussian emission distribution in each state. Due to the computational budget, we randomly select a subsequence from each series and run the algorithms for 1500 iterations each. Since recognizing the states of patients is crucial, we mainly focus on how both algorithms perform with regard to recognizing the dynamics. We use the distance between the current draw of the transition matrix and $\bar{A}$, where $\bar{A}$ is the average of the transition matrix over the last 200 iterations of the algorithm, as our metric of performance.
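A sketch of this performance metric, under the assumption that the distance is a Frobenius norm between each draw of the transition matrix and the average of the final 200 draws:

```python
import numpy as np

def convergence_metric(A_trace, window=200):
    """||A_s - A_bar|| at each iteration s, where A_bar averages the transition
    matrix over the last `window` iterations of the sampler."""
    A_trace = np.asarray(A_trace)                 # shape (num_iterations, K, K)
    A_bar = A_trace[-window:].mean(axis=0)
    return np.linalg.norm(A_trace - A_bar, axis=(1, 2))

# usage: A_trace collects the transition-matrix draws from either sampler
trace = np.random.default_rng(5).dirichlet([5.0, 1.0, 1.0], size=(1500, 3)).transpose(0, 2, 1)
print(convergence_metric(trace)[:3])
```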

We plot the comparison of both algorithms in Figure 6. The plot shows that the transition parameter achieves stationarity faster in CSG-MCMC. Moreover, when the runtime is long, the accuracy of the clustering-based method still dominates that of the other. Thus, the clustering-based method provides a more accurate and faster estimate of the transition parameter .

5 Discussion

The numerical experiments in the previous section indicate clear improvements in CSG-MCMC but also elicit a number of questions worth addressing. We are currently interested in pursuing two primary directions.

First, the results of Figure 4 demonstrate a reduction in variance of the CSG-MCMC in gradient estimation. However, a more thorough review is necessary to specifically address the issues of normality, bias, and variance of the clustering based estimation procedure. Rigorous results would then allow for a careful investigation of the optimal situations in which CSG-MCMC can provide substantial gains in addition to detailed bounds on bias in posterior estimates and potential avenues for improvement.

Additionally, there are a number of different approaches one could use in the preprocessing clustering step. Throughout the present work, we clustered subseries by viewing them as $L$-dimensional vectors and using kmeans++ with the Euclidean metric. This approach has clear limitations with regard to clustering subseries with rare extreme data. For example, consider the original motivating example from Section 2.1. The spikes associated with the rare state could occur at any one of the $L$ points within a given subseries. As a result, there will be roughly $L + 1$ natural clusters in $\mathbb{R}^L$: most of the subseries will cluster near the origin, with additional clusters near each of the canonical basis vectors $e_i$, $i = 1, \ldots, L$. Clustering based on Euclidean distance is unlikely to group data near, say, $e_1$ and $e_L$, potentially absorbing rare-event subseries into clusters comprised of typical dynamics when $C < L + 1$. As a result, this undermines the biased subsampling CSG-MCMC was intended to provide.

Alternatively, one could attempt an alignment of the cluster vectors in a myriad of manners, such as Procrustes analysis or, more simply, sorting the data points within each subseries prior to embedding in $\mathbb{R}^L$. While these two approaches add computation to the preprocessing step, they may greatly improve the grouping of rare-event data into a single cluster. Other strategies include clustering based on one or more summary statistics of the subseries. This would shorten the clustering step if the number of summary statistics is less than $L$.
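A minimal sketch of the second, simpler idea: sort the observations inside each subseries before embedding, so that a rare spike occupies the same coordinate regardless of where in the subseries it occurred (again assuming scikit-learn's KMeans for the k-means++ step).

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_sorted_subseries(y, L, n_clusters, seed=0):
    """Sort within each length-L subseries before clustering in R^L, so rare-event
    subseries are more likely to be grouped into a single cluster."""
    M = len(y) // L
    X = np.sort(np.asarray(y[:M * L]).reshape(M, L), axis=1)
    return KMeans(n_clusters=n_clusters, init="k-means++",
                  n_init=10, random_state=seed).fit_predict(X)
```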

Finally, both the gradient estimation and the clustering will depend on details of the HMM, perhaps in drastic and surprising ways. Thus, additional numerical experimentation is needed to identify setting-specific choices for $L$, the subsample sizes $m_c$, and the clustering methodology within CSG-MCMC.

6 Acknowledgements

We thank the authors of Ma et al. (2017) for providing code for SG-MCMC, enabling a direct comparison between our two methods. The heart rate data was taken from the MIMIC-III database developed by the MIT Lab for Computational Physiology (Johnson et al., 2016).

References

  • Bishop (2006) Bishop, C. (2006). Pattern Recognition and Machine Learning. Information Science and Statistics. New York: Springer-Verlag.
  • Campbell and Broderick (2017) Campbell, T. and T. Broderick (2017, October). Automated Scalable Bayesian Inference via Hilbert Coresets. arXiv:1710.05053 [cs, stat].
  • Chan et al. (2004) Chan, M. T., A. Hoogs, J. Schmiederer, and M. Petersen (2004, August). Detecting rare events in video using semantic primitives with HMM. In Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., Volume 4, pp. 150–154 Vol.4.
  • Foti et al. (2014) Foti, N., J. Xu, D. Laird, and E. Fox (2014). Stochastic variational inference for hidden Markov models. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 27, pp. 3599–3607. Curran Associates, Inc.
  • Fu and Zhang (2017) Fu, T. and Z. Zhang (2017, April). CPSG-MCMC: Clustering-Based Preprocessing method for Stochastic Gradient MCMC. In Artificial Intelligence and Statistics, pp. 841–850.
  • Hall and Willett (2013) Hall, E. C. and R. M. Willett (2013, July). Online Optimization in Dynamic Environments. arXiv:1307.5944 [cs, math, stat].
  • Hassan and Nath (2005) Hassan, M. R. and B. Nath (2005, September). Stock market forecasting using hidden Markov model: A new approach. In 5th International Conference on Intelligent Systems Design and Applications (ISDA’05), pp. 192–196.
  • Huang et al. (1990) Huang, X. D., Y. Ariki, and M. A. Jack (1990). Hidden Markov models for speech recognition.
  • Huggins et al. (2016) Huggins, J., T. Campbell, and T. Broderick (2016). Coresets for Scalable Bayesian Logistic Regression. In 30th Conference on Neural Information Processing Systems, Barcelona, Spain.
  • Hughes et al. (2015) Hughes, M. C., W. T. Stephenson, and E. Sudderth (2015). Scalable Adaptation of State Complexity for Nonparametric Hidden Markov Models. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), Advances in Neural Information Processing Systems 28, pp. 1198–1206. Curran Associates, Inc.
  • Jiang and Willett (2016) Jiang, X. and R. Willett (2016, September). Online Data Thinning via Multi-Subspace Tracking. arXiv:1609.03544 [cs, stat].
  • Johnson et al. (2016) Johnson, A. E. W., T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark (2016, May). MIMIC-III, a freely accessible critical care database. Scientific Data 3.
  • Johnson and Willsky (2014) Johnson, M. and A. Willsky (2014). Stochastic variational inference for Bayesian time series models. In 31st International Conference on Machine Learning, ICML 2014, pp. 3872–3880.
  • Johnson (2014) Johnson, M. J. (2014). Bayesian Time Series Models and Scalable Inference. Thesis, Massachusetts Institute of Technology.
  • Ma et al. (2015) Ma, Y.-A., T. Chen, and E. B. Fox (2015). A Complete Recipe for Stochastic Gradient MCMC. Advances in Neural Information Processing Systems 28, 2899–2907.
  • Ma et al. (2017) Ma, Y.-A., N. J. Foti, and E. B. Fox (2017). Stochastic Gradient MCMC Methods for Hidden Markov Models.
  • Ourston et al. (2003) Ourston, D., S. Matzner, W. Stump, and B. Hopkins (2003, January). Applications of hidden Markov models to detecting multi-stage network attacks. In 36th Annual Hawaii International Conference on System Sciences, 2003. Proceedings, 10 pp.
  • Petropoulos et al. (2016) Petropoulos, A., S. P. Chatzis, and S. Xanthopoulos (2016, July). A novel corporate credit rating system based on Student’s-t hidden Markov models. Expert Systems with Applications 53, 87–105.
  • Quiroz et al. (2018) Quiroz, M., R. Kohn, M. Villani, and M.-N. Tran (2018, March). Speeding Up MCMC by Efficient Data Subsampling. Journal of the American Statistical Association 0(0), 1–13.
  • Scott (2002) Scott, S. L. (2002, March). Bayesian Methods for Hidden Markov Models. Journal of the American Statistical Association 97(457), 337–351.
  • Scott et al. (2016) Scott, S. L., A. W. Blocker, F. V. Bonassi, H. A. Chipman, E. I. George, and R. E. McCulloch (2016, April). Bayes and big data: The consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management 11(2), 78–88.
  • Stigler et al. (2011) Stigler, J., F. Ziegler, A. Gieseke, J. C. M. Gebhardt, and M. Rief (2011, October). The Complex Folding Network of Single Calmodulin Molecules. Science 334(6055), 512–516.
  • Zhang et al. (2005) Zhang, D., D. Gatica-Perez, S. Bengio, and I. McCowan (2005, June). Semi-supervised adapted HMMs for unusual event detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Volume 1, pp. 611–618.
