 # Inference in Hidden Markov Models with Explicit State Duration Distributions

In this letter we borrow from the inference techniques developed for unbounded state-cardinality (nonparametric) variants of the HMM and use them to develop a tuning-parameter free, black-box inference procedure for Explicit-state-duration hidden Markov models (EDHMM). EDHMMs are HMMs that have latent states consisting of both discrete state-indicator and discrete state-duration random variables. In contrast to the implicit geometric state duration distribution possessed by the standard HMM, EDHMMs allow the direct parameterisation and estimation of per-state duration distributions. As most duration distributions are defined over the positive integers, truncation or other approximations are usually required to perform EDHMM inference.


## 1 Introduction

Hidden Markov models (HMMs) are a fundamental tool for data analysis and exploration. Many variants of the basic HMM have been developed in response to shortcomings in the original HMM formulation. In this paper we address inference in the explicit state duration HMM (EDHMM). By state duration we mean the amount of time an HMM dwells in a state. In the standard HMM specification, a state’s duration is implicit and, a priori, distributed geometrically.

The EDHMM (or, equivalently, the hidden semi-Markov model) was developed to allow explicit parameterization and direct inference of state duration distributions. EDHMM estimation and inference can be performed using the forward-backward algorithm, though only if the sequence is short or a tight “allowable” duration interval for each state is hard-coded a priori. If the sequence is short then forward-backward can be run on a state representation that allows for all possible durations up to the observed sequence length. If the sequence is long then forward-backward remains computationally tractable only if the transitions considered are restricted to durations that lie within pre-specified allowable intervals. If the true state durations lie outside those intervals then the resulting model estimates will be incorrect: the learned duration distributions can only reflect what is allowed given the pre-specified duration intervals.

Our contribution is the development of a procedure for EDHMM inference that does not require any hard pre-specification of duration intervals, is efficient in practice, and, as it is an asymptotically exact procedure, does not risk incorrect inference. The technique we use to do this is borrowed from sampling procedures developed for nonparametric Bayesian HMM variants. Our key insight is simple: the machinery developed for inference in HMMs with a countable number of states is precisely the same as that which is needed for doing inference in an EDHMM with duration distributions over countable support. So, while the EDHMM is a distinctly parametric model, the tools from nonparametric Bayesian inference can be applied such that black-box inference becomes possible and, in practice, efficient.

In this work we show specifically that a “beam-sampling” approach works for estimating EDHMMs, learning both the transition structure and duration distributions simultaneously. In demonstrating our EDHMM inference technique we consider a synthetic system in which the state-cardinality is known and finite, but where each state’s duration distribution is unknown. We show that the EDHMM beam sampler performs accurate tracking whilst capturing the duration distributions as well as the probability of transitioning between states.

The remainder of the letter is organised as follows. In Section 2 we introduce the EDHMM; in Section 3 we review beam-sampling for the infinite Hidden Markov Model (iHMM) and show how it relates to the EDHMM inference problem; and in Section 4 we show results from using the EDHMM to model synthetic data.

## 2 Explicit Duration Hidden Markov Model

The EDHMM captures the relationships among state $x_t$, duration $d_t$, and observation $y_t$ over time $t$. It consists of four components: the initial state distribution, the transition distributions, the observation distributions, and the duration distributions.

We define the observation sequence $Y = \{y_1, \ldots, y_T\}$; the latent state sequence $X = \{x_0, \ldots, x_T\}$; and the remaining time in each segment $D = \{d_0, \ldots, d_T\}$, where $x_t \in \{1, \ldots, K\}$, with $K$ the maximum number of states, and $d_t \in \{1, 2, \ldots\}$. We assume that the Markov chain on the latent states is homogeneous, i.e., that

$$p(x_t = j \mid x_{t-1} = i, d_{t-1} = 1) = A_{ij},$$

where $A$ is a $K \times K$ matrix with element $A_{ij}$ at row $i$ and column $j$. The prior on $A$ is row-wise Dirichlet with zero prior mass on self-transitions, i.e., each row vector $A_{i,:}$ is given a Dirichlet prior in which the $i$th Dirichlet parameter is zero.

Each state $i$ is imbued with its own duration distribution $p(d_t \mid x_t = i)$ with parameter $\lambda_i$. Each duration distribution parameter is drawn from a prior $p(\lambda_i)$ which can be chosen in an application-specific way. The collection of all duration distribution parameters is $\lambda = \{\lambda_1, \ldots, \lambda_K\}$. Each state is also imbued with an observation generating distribution $p(y_t \mid x_t = i)$ with parameter $\theta_i$. Each observation distribution parameter is drawn from a prior $p(\theta_i)$, also to be chosen according to the application. The set of all observation distribution parameters is $\theta = \{\theta_1, \ldots, \theta_K\}$. In the following exposition, explicit conditional dependencies on component distribution parameters are omitted to focus on the particulars unique to the EDHMM.
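The row-wise Dirichlet prior with zero mass on self-transitions can be illustrated with a short sketch. This is not the authors' code: the function name `sample_transition_matrix` and the symmetric `concentration` parameter are assumptions made for illustration, since the hyperparameter values are not recoverable from the text.

```python
import numpy as np

def sample_transition_matrix(K, concentration=1.0, rng=None):
    """Draw a K x K transition matrix with zero self-transition probability.

    Each row i is a Dirichlet draw over the K - 1 other states, and the
    diagonal entry (i, i) stays at zero. `concentration` is a hypothetical
    symmetric Dirichlet parameter, not a value taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    A = np.zeros((K, K))
    for i in range(K):
        others = [j for j in range(K) if j != i]
        A[i, others] = rng.dirichlet(np.full(K - 1, concentration))
    return A
```

Placing zero mass on the diagonal means that a state can only persist through its duration variable, so dwell time is governed entirely by the duration distribution.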

In an EDHMM the transitions between states are only allowed at the end of a segment:

$$p(x_t \mid x_{t-1}, d_{t-1}) = \begin{cases} \delta(x_t, x_{t-1}) & \text{if } d_{t-1} > 1 \\ p(x_t \mid x_{t-1}) & \text{otherwise} \end{cases} \tag{1}$$

where the Kronecker delta $\delta(a, b) = 1$ if $a = b$ and zero otherwise. The duration distribution generates segment lengths at every state switch:

$$p(d_t \mid x_t, d_{t-1}) = \begin{cases} \delta(d_t, d_{t-1} - 1) & \text{if } d_{t-1} > 1 \\ p(d_t \mid x_t) & \text{otherwise.} \end{cases} \tag{2}$$

The joint distribution of the EDHMM is

$$p(X, D, Y) = p(x_0)\, p(d_0) \prod_{t=1}^{T} p(y_t \mid x_t, \theta)\, p(x_t \mid x_{t-1}, d_{t-1}, A)\, p(d_t \mid x_t, d_{t-1}, \lambda) \tag{3}$$

corresponding to the graphical model in Figure 1(a). Alternative choices to define the duration variable exist. Algorithm 1 illustrates the EDHMM as a generative model.
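As a rough illustration of the generative process defined by (1) to (3), the following sketch draws a state, duration, and observation sequence. It is a minimal sketch under stated assumptions, not the paper's Algorithm 1: the function name `sample_edhmm`, the uniform initial state distribution, the shifted Poisson duration (one plus a Poisson draw, so that durations are at least one), and the unit-variance Gaussian observations are all choices made here for illustration.

```python
import numpy as np

def sample_edhmm(T, A, rates, means, rng=None):
    """Sample x_0..x_T, d_0..d_T and y_1..y_T from an EDHMM.

    Assumptions (not from the paper): uniform p(x_0), durations drawn as
    1 + Poisson(rate), Gaussian observations with unit variance.
    """
    rng = np.random.default_rng() if rng is None else rng
    K = A.shape[0]
    x = np.empty(T + 1, dtype=int)
    d = np.empty(T + 1, dtype=int)
    y = np.full(T + 1, np.nan)              # y[0] is unused: no observation at t = 0
    x[0] = rng.integers(K)
    d[0] = 1 + rng.poisson(rates[x[0]])
    for t in range(1, T + 1):
        if d[t - 1] > 1:
            # Inside a segment: the state is held and the counter decrements.
            x[t] = x[t - 1]
            d[t] = d[t - 1] - 1
        else:
            # Segment end: switch state via A, then draw a fresh duration.
            x[t] = rng.choice(K, p=A[x[t - 1]])
            d[t] = 1 + rng.poisson(rates[x[t]])
        y[t] = rng.normal(means[x[t]], 1.0)
    return x, d, y
```

Because the diagonal of `A` is zero, every time the counter reaches one the chain is forced into a different state, which is what makes the segment lengths identifiable from the duration variable alone.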

## 3 EDHMM Inference

Our aim is to estimate the conditional posterior distribution of the latent states ($X$ and $D$) and parameters ($A$, $\lambda$, and $\theta$) given observations $Y$ by samples drawn via Markov chain Monte Carlo. Sampling $\lambda$ and $\theta$ given the latent states and observations proceeds per usual textbook approaches. Sampling $A$ given $X$ is straightforward in most situations. Indirect Gibbs sampling of $X$ is possible using auxiliary state-change indicator variables, but such a sampler will not mix well. The main contribution of this paper is to show how to generate posterior samples of $X$ and $D$.

### 3.1 Forward Filtering, Backward Sampling

We can, in theory, use the forward messages from the forward-backward algorithm to sample the conditional posterior distribution of $X$ and $D$. To do this we treat each state-duration tuple as a single random variable (introducing the notation $z_t = (x_t, d_t)$). Doing so recovers the standard hidden Markov model structure and hence standard forward messages can be used directly. A forward filtering, backward sampler for $Z = \{z_0, \ldots, z_T\}$ conditioned on all other random variables requires the classical forward messages:

$$\alpha_t(z_t) = \sum_{z_{t-1}} p(z_t \mid z_{t-1})\, p(y_t \mid z_t)\, \alpha_{t-1}(z_{t-1}) \tag{4}$$

where the transition probability can be factorised according to our modelling assumptions:

$$p(z_t \mid z_{t-1}) = p(x_t \mid x_{t-1}, d_{t-1})\, p(d_t \mid x_t, d_{t-1}). \tag{5}$$

Unfortunately the sum in (4) has at worst an infinite number of terms in the case of duration distributions with countably infinite support and at best a very large number of terms in the case of long sequences. The standard approach to EDHMM inference involves either truncating the considered durations to only those that lie between $d_{\min}$ and $d_{\max}$, or computation involving all possible durations up to the observed length of the sequence ($T$). This leads to per-sample, forward-backward computational complexity of $O(T(K(d_{\max} - d_{\min}))^2)$ and $O(T(KT)^2)$, respectively. Truncation yields inference that will simply fail if an actual duration lies outside the hard-coded allowable duration intervals. Considering all possible durations up to length $T$ is often computationally impossible.

The beam sampler we propose behaves like a dynamic version of the truncation approach, automatically defining and scaling per-state duration truncation intervals. Better still, the way it does this results in an asymptotically exact sample with no risk of incorrect inference resulting from incorrectly pre-specified duration truncations. We do not characterize the computational complexity of the proposed beam sampler in this work but note that it is upper bounded by $O(T(KT)^2)$ (i.e., the beam sampler admits durations of length equal to the entire sequence); in practice it is found to be as efficient as, or more efficient than, the risky hard-truncation approach.
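To make the cost of the hard-truncation approach concrete, the sketch below builds the dense joint transition matrix of (5) over $z = (x, d)$ for durations $1, \ldots, d_{\max}$ and runs the forward recursion (4) on it. The helper names, the shifted-Poisson duration model, and the renormalisation of the duration distribution after truncation are assumptions for illustration only; the renormalisation in particular is exactly the kind of approximation that biases inference when true durations exceed $d_{\max}$.

```python
import math
import numpy as np

def truncated_duration_pmf(rate, d_max):
    """Shifted-Poisson pmf on d = 1..d_max, renormalised after truncation."""
    k = np.arange(d_max)                    # k = d - 1
    log_fact = np.array([math.lgamma(i + 1) for i in k])
    pmf = np.exp(k * math.log(rate) - rate - log_fact)
    return pmf / pmf.sum()

def joint_transition_matrix(A, rates, d_max):
    """Dense p(z_t | z_{t-1}) over z = (x, d), built from (1), (2) and (5)."""
    K = A.shape[0]
    pmf = np.array([truncated_duration_pmf(r, d_max) for r in rates])
    P = np.zeros((K * d_max, K * d_max))
    for x_prev in range(K):
        for d_prev in range(1, d_max + 1):
            row = x_prev * d_max + (d_prev - 1)
            if d_prev > 1:
                # Inside a segment: keep the state, count the duration down.
                P[row, row - 1] = 1.0
            else:
                # Segment end: switch state, then draw a fresh duration.
                for x in range(K):
                    P[row, x * d_max:(x + 1) * d_max] = A[x_prev, x] * pmf[x]
    return P

def forward_messages(P, lik_x, init, d_max):
    """Normalised forward messages (4); lik_x[t - 1, x] = p(y_t | x_t = x)."""
    lik_z = np.repeat(lik_x, d_max, axis=1)  # the likelihood ignores d
    alpha = np.empty((lik_z.shape[0] + 1, P.shape[0]))
    alpha[0] = init
    for t in range(1, alpha.shape[0]):
        a = (alpha[t - 1] @ P) * lik_z[t - 1]
        alpha[t] = a / a.sum()
    return alpha
```

The matrix has $(K d_{\max})^2$ entries, so each forward step costs on the order of $(K d_{\max})^2$ operations in this dense form, which is why the choice of $d_{\max}$ trades accuracy against tractability.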

### 3.2 EDHMM Beam Sampling

A recent contribution to inference in the infinite Hidden Markov Model (iHMM) suggests a way around truncation. The iHMM is an HMM with a countable number of states. Computing the forward message for a forward filtering, backward sampler for the latent states in an iHMM also requires a sum over a countable number of elements. The “beam sampling” approach, which we can apply largely without modification, is to truncate this sum by introducing a “slice” auxiliary variable $u_t$ at each time step. The auxiliary variables are chosen in such a way as to automatically limit each sum in the forward pass to a finite number of terms while still allowing all possible durations.

The particular choice of auxiliary variable is important. We follow the iHMM beam sampler in choosing $u_t$ to be conditionally distributed given the current and previous state and duration in the following way (see the graphical model in Figure 1(b)):

$$p(u_t \mid z_t, z_{t-1}) = \frac{\mathbb{I}\big(0 < u_t < p(z_t \mid z_{t-1})\big)}{p(z_t \mid z_{t-1})} \tag{6}$$

where $\mathbb{I}(\cdot)$ returns one if its operand is true and zero otherwise. Given $U = \{u_1, \ldots, u_T\}$ it is possible to sample the state and duration conditional posterior. Using the notation $Y_1^t$ to indicate sub-ranges of a sequence, the new forward messages we compute are:

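$$\begin{aligned}
\hat{\alpha}_t(z_t) &= p(z_t, Y_1^t, U_1^t) = \sum_{z_{t-1}} p(z_t, z_{t-1}, Y_1^t, U_1^t) \\
&\propto \sum_{z_{t-1}} p(u_t \mid z_t, z_{t-1})\, p(z_t, z_{t-1}, Y_1^t, U_1^{t-1}) \\
&= \sum_{z_{t-1}} \mathbb{I}\big(0 < u_t < p(z_t \mid z_{t-1})\big)\, p(y_t \mid z_t)\, \hat{\alpha}_{t-1}(z_{t-1}).
\end{aligned}$$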

The indicator function results in non-zero probabilities in the forward message for only those states $z_t$ whose probability given $z_{t-1}$ is greater than $u_t$. The beam sampler derives its computational advantage from the fact that the set of $z_t$’s for which this is true is typically small.

The backwards sampling step recursively samples a state sequence from the distribution $p(z_{t-1} \mid z_t, Y, U)$, which can be expressed in terms of the forward variable:

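$$\begin{aligned}
p(z_{t-1} \mid z_t, Y, U) &\propto p(z_t, z_{t-1}, Y, U) \\
&\propto p(u_t \mid z_t, z_{t-1})\, p(z_t \mid z_{t-1})\, \hat{\alpha}_{t-1}(z_{t-1}) \\
&\propto \mathbb{I}\big(0 < u_t < p(z_t \mid z_{t-1})\big)\, \hat{\alpha}_{t-1}(z_{t-1}).
\end{aligned}$$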

The full EDHMM beam sampler is given in Algorithm 2, which makes use of the forward recursion, the slice-variable distribution, and the backwards sampler described in this section.
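The three steps of a beam-sampling sweep (slice variables, thresholded forward pass, backward sampling) can be sketched as follows. This is an illustrative sketch, not the paper's Algorithm 2: the function name `beam_sweep` is hypothetical, and the transition probabilities are held in a finite dense matrix `P` purely to keep the example short. In an EDHMM with unbounded durations one would instead enumerate, at each step, only the finitely many $z_t$ whose transition probability exceeds $u_t$.

```python
import numpy as np

def beam_sweep(z, P, lik, init, rng):
    """One beam-sampling sweep over a discrete chain z_0..z_T.

    z    : current trajectory (length T + 1, integer state indices)
    P    : P[i, j] = p(z_t = j | z_{t-1} = i); finite here for illustration
    lik  : lik[t - 1, j] = p(y_t | z_t = j), shape (T, N)
    init : distribution over z_0
    """
    T, N = lik.shape
    # Slice variables: u_t ~ Uniform(0, p(z_t | z_{t-1})) along the current path.
    u = rng.uniform(0.0, P[z[:-1], z[1:]])
    # Forward pass: only transitions with p(z_t | z_{t-1}) > u_t contribute.
    alpha = np.zeros((T + 1, N))
    alpha[0] = init
    for t in range(1, T + 1):
        allowed = (P > u[t - 1]).astype(float)
        a = lik[t - 1] * (alpha[t - 1] @ allowed)
        alpha[t] = a / a.sum()
    # Backward sampling: draw z_T, then z_{t-1} given z_t using the same slices.
    new_z = np.empty(T + 1, dtype=int)
    new_z[T] = rng.choice(N, p=alpha[T])
    for t in range(T, 0, -1):
        w = alpha[t - 1] * (P[:, new_z[t]] > u[t - 1])
        new_z[t - 1] = rng.choice(N, p=w / w.sum())
    return new_z, u
```

Because each $u_t$ is drawn strictly below the transition probability of the current path, the current trajectory always survives the threshold, so the forward messages never collapse to zero and every sweep returns a valid trajectory.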

### 3.3 Related Work

The need to accommodate explicit state duration distributions in HMMs has long been recognised. Rabiner details the basic approach, which expands the state space to include dwell time before applying a slightly modified Baum-Welch algorithm. This approach specifies a maximum state duration, limiting practical application to cases with short sequences and dwell times. This approach, generalised under the name “segmental hidden Markov models”, includes more general transitions than those Rabiner considered, allowing the next state and duration to be conditioned on the previous state and duration. Efficient approximate inference procedures were developed in the context of speech recognition and speech synthesis, and evolved into symmetric approaches suitable for practical implementation. Recently, a “sticky” variant of the hierarchical Dirichlet process HMM (HDP-HMM) has been developed. The HDP-HMM has countable state-cardinality, allowing estimation of the number of states in the HMM; the sticky aspect addresses long dwell times by introducing a parameter in the prior that favours self-transition.

## 4 Experiments

### 4.1 Synthetic Data

The first experiment uses the 500 data points (Figure 2) generated from a three-state EDHMM. The duration distributions were Poisson, with one rate per state; each observation distribution was Gaussian, with one mean per state and a variance of 1. The transition distributions $A$ were set to

$$A = \begin{bmatrix} 0 & 0.3 & 0.7 \\ 0.6 & 0 & 0.4 \\ 0.3 & 0.7 & 0 \end{bmatrix}.$$

Broad, uninformative priors were chosen for the parameters of the duration and observation distributions. The observation distribution parameters were given a normal-inverse-Wishart (N-IW) prior. The rate parameters for all states were given the same prior.

One thousand samples were collected from the EDHMM beam sampler after a burn-in of 500 samples. The learned posterior distribution of the state duration parameters and means of the observation distributions are shown in Figure 3. The EDHMM achieves high accuracy in the estimated posterior distribution of the observation means, despite the overlap in observation distributions. The rate parameter distributions are reasonably estimated given the small number of observed segments. Figure 4 shows the mean number of transitions visited per time point over each iteration of the sampler.

Figure 4: Mean number of transitions considered per time point by the beam sampler for 1000 post-burn-in sweeps on data from Figure 3. Consider this in comparison to the $(KT)^2 = O(10^6)$ transitions per time point that would need to be considered by standard forward-backward without truncation, a surely-safe, truncation-free, but computationally impractical alternative.
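The $(KT)^2$ figure quoted above follows directly from the experimental setup, with $K = 3$ states and $T = 500$ time points. The short check below reproduces the arithmetic; the function name is a hypothetical helper introduced only for this illustration.

```python
def dense_transitions_per_step(num_states, num_durations):
    """Number of (z_{t-1}, z_t) pairs in a dense forward step over z = (x, d)."""
    joint_states = num_states * num_durations
    return joint_states ** 2

# Truncation-free forward-backward on the synthetic data: K = 3, durations up to T = 500.
untruncated = dense_transitions_per_step(3, 500)
print(untruncated)  # 2250000
```

At roughly $2.25 \times 10^6$ transitions per time point, the untruncated computation is of order $10^6$, as stated in the caption.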

A second experiment was performed to demonstrate the ability of the EDHMM to distinguish between states having differing duration distributions but the same observation distribution. The same model and sampling procedure was used as above, except with modified parameter values. Figure 5 shows that the sampler clearly separates the high state from the other states and clearly reveals the presence of two low states with differing duration distributions. Panel (b) shows posterior samples that indicate that the model is mixing over the ambiguity between two of the states, as it should.

## 5 Discussion

We presented a beam sampler for the explicit state duration HMM. This sampler draws state sequences from the true posterior distribution without any need to make truncation approximations. It remains future work to combine the explicit state duration HMM and the iHMM. Python code associated with the EDHMM is available online.

## References

• Matthew J Beal, Z Ghahramani, and C E Rasmussen. The Infinite Hidden Markov Model. Advances in Neural Information Processing Systems, 1:577–584, 2002.
• C M Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
• Silvia Chiappa. Unified Treatment of Hidden Markov Switching Models, April 2011.
• Emily B Fox, Erik B Sudderth, Michael I Jordan, and Alan S Willsky. An HDP-HMM for systems with state persistence. Proceedings of the 25th International Conference on Machine Learning (2008), 25:312–319, 2008.
• M J F Gales and S J Young. The Theory of Segmental Hidden Markov Models. Technical report, Cambridge University Engineering Department, 1993.
• S Goldwater, T L Griffiths, and M Johnson. A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112(1):21–54, 2009.
• Radford M Neal. Slice sampling. Annals of Statistics, 31(3):705–767, 2003.
• M Ostendorf, V V Digalakis, and O A Kimball. From HMMs to segment models: a unified view of stochastic modeling for speech recognition. IEEE Transactions on Speech and Audio Processing, 4(5):360–378, 1996.
• L R Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
• Y W Teh, M I Jordan, M J Beal, and D M Blei. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
• Jurgen Van Gael, Yunus Saatci, Yee Whye Teh, and Zoubin Ghahramani. Beam sampling for the infinite hidden Markov model. Proceedings of the 25th International Conference on Machine Learning (2008), 25:1088–1095, 2008.
• S Yu. Hidden semi-Markov models. Artificial Intelligence, 174:215–243, 2010.
• Shun-Zheng Yu and Hisashi Kobayashi. Practical implementation of an efficient forward-backward algorithm for an explicit-duration hidden Markov model. IEEE Transactions on Signal Processing, 54(5):1947–1951, 2006.
• Heiga Zen, Keiichi Tokuda, Takashi Masuko, Takao Kobayashi, and Tadashi Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Transactions on Information and Systems, E90-D(5):825–834, 2007.