# A Variational HEM Algorithm for Clustering Hidden Markov Models

The hidden Markov model (HMM) is a generative model that treats sequential data under the assumption that each observation is conditioned on the state of a discrete hidden variable that evolves in time as a Markov chain. In this paper, we derive a novel algorithm to cluster HMMs through their probability distributions. We propose a hierarchical EM algorithm that i) clusters a given collection of HMMs into groups of HMMs that are similar, in terms of the distributions they represent, and ii) characterizes each group by a "cluster center", i.e., a novel HMM that is representative for the group. We present several empirical studies that illustrate the benefits of the proposed algorithm.


## 1 Introduction

The hidden Markov model (HMM) [1] is a probabilistic model that assumes a signal is generated by a doubly embedded stochastic process. A hidden state process, which evolves over discrete time instants as a Markov chain, encodes the dynamics of the signal, and an observation process, at each time conditioned on the current state, encodes the appearance of the signal. HMMs have been successfully applied to a variety of applications, including speech recognition [1], music analysis [2], online handwriting recognition [3], and the analysis of biological sequences [4].

The focus of this paper is an algorithm for clustering HMMs. More precisely, we design an algorithm that, given a collection of HMMs, partitions them into clusters of "similar" HMMs, while also learning a representative HMM "cluster center" that summarizes each cluster appropriately. This is similar to standard k-means clustering, except that the data points are now HMMs instead of vectors in a Euclidean space. Such an HMM clustering algorithm has various potential applications, ranging from hierarchical clustering of sequential data (e.g., speech or motion sequences), over hierarchical indexing for fast retrieval, to reducing the computational complexity of estimating mixtures of HMMs from large datasets via hierarchical modeling (e.g., to learn semantic annotation models for music and video).

One possible approach is to group HMMs in parameter space. However, as HMM parameters lie on a non-linear manifold, they cannot be clustered by a simple application of the k-means algorithm, which assumes real vectors in a Euclidean space. One solution, proposed in [5], first constructs an appropriate similarity matrix between all HMMs that are to be clustered (e.g., based on the Bhattacharyya affinity, which depends on the HMM parameters [6]), and then applies spectral clustering. While this approach has proven successful at grouping HMMs into similar clusters [5], it does not allow generating novel HMMs as cluster centers. Instead, one is limited to representing each cluster by one of the given HMMs, e.g., the HMM which the spectral clustering procedure maps the closest to each spectral clustering center. This may be suboptimal for various applications of HMM clustering.

An alternative to clustering the HMMs in parameter space is to cluster them directly with respect to the probability distributions they represent. To cluster Gaussian probability distributions, Vasconcelos and Lippmann [7] proposed a hierarchical expectation-maximization (HEM) algorithm. This algorithm starts from a Gaussian mixture model (GMM) with a large number of components and reduces it to another GMM with fewer components, where each of the mixture components of the reduced GMM represents, i.e., clusters, a group of the original Gaussian mixture components. More recently, Chan et al. [8] derived an HEM algorithm to cluster dynamic texture (DT) models (i.e., linear dynamical systems, LDSs) through their probability distributions. HEM has been applied successfully to construct GMM hierarchies for efficient image indexing [9], to cluster video represented by DTs [10], and to estimate GMMs or DT mixtures (DTMs, i.e., LDS mixtures) from large datasets for semantic annotation of images [11], video [10] and music [12, 13].

To extend the HEM framework from GMMs to mixtures of HMMs (H3Ms), additional marginalization of the hidden-state processes is required, as for DTMs. However, while Gaussians and DTs allow tractable inference in the E-step of HEM, this is no longer the case for HMMs. Therefore, in this work, we derive a variational formulation of the HEM algorithm (VHEM) to cluster HMMs through their probability distributions, based on a variational approximation derived by Hershey [14]. The resulting algorithm not only clusters HMMs, but also learns novel HMMs that are representative centers of each cluster, in a way that is consistent with the underlying generative probabilistic model of the HMM. The resulting VHEM algorithm can be generalized to handle other classes of graphical models, for which standard HEM would otherwise be intractable, by leveraging similar variational approximations. The efficacy of the VHEM-H3M algorithm is demonstrated for various applications, including hierarchical motion clustering, semantic music annotation, and online handwriting recognition.

## 2 The hidden Markov model

A hidden Markov model (HMM) assumes a sequence of τ observations y_{1:τ} is generated by a doubly embedded stochastic process, where each observation y_t at time t depends on the state of a discrete hidden variable x_t, and where the sequence of hidden states x_{1:τ} evolves as a first-order Markov process. The discrete variables x_t can take one of S values, and the evolution of the hidden process is encoded in a transition matrix A = [a_{β,β′}] whose entries are the state transition probabilities a_{β,β′} = p(x_{t+1}=β′|x_t=β), and an initial state distribution π = [π_1, …, π_S], where π_β = p(x_1=β). Each state β generates observations according to an emission probability density function, p(y_t|x_t=β), which here we assume to be a Gaussian mixture model:

  p(y|x=β) = ∑_{m=1}^{M} c_{β,m} N(y; μ_{β,m}, Σ_{β,m})  (1)

where M is the number of Gaussian components and c_β = [c_{β,1}, …, c_{β,M}] are the mixing weights. In the following, when referring to sequences of length τ, we will use the notation π_{β_{1:τ}} = π_{β_1} ∏_{t=2}^{τ} a_{β_{t−1},β_t} to represent the probability that the HMM generates the state sequence β_{1:τ}. The HMM is specified by the parameters M = {π, A, {c_{β,m}, μ_{β,m}, Σ_{β,m}}_{β,m}}, which can be efficiently learned with the forward-backward algorithm [1], which is based on maximum likelihood.

A hidden Markov mixture model (H3M) [15] models a set of observation sequences as samples from a group of K hidden Markov models, which represent different sub-behaviors. For a given sequence, an assignment variable z selects the parameters of one of the K HMMs, where the z-th HMM is selected with probability ω_z. Each mixture component is parametrized by M_z = {π^z, A^z, {c^z_{β,m}, μ^z_{β,m}, Σ^z_{β,m}}}, and the H3M is parametrized by M = {ω_z, M_z}_{z=1}^{K}. Given a collection of relevant observation sequences, the parameters of M can be learned with recourse to the EM algorithm [15].

To reduce clutter, here we assume that all the HMMs have the same number S of hidden states and that all emission probabilities have M mixture components, though a more general case could be derived.
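As a concrete reference for the generative process just described, the following minimal sketch draws one sequence from an H3M with GMM emissions. The function name `sample_h3m` and the dictionary-based parameter layout are illustrative choices, not notation or code from the paper.

```python
import numpy as np

def sample_h3m(omega, h3ms, tau, rng=None):
    """Draw one length-tau observation sequence from an H3M (illustrative).

    omega : (K,) component priors; h3ms : list of dicts with keys
    'pi' (S,), 'A' (S,S), 'c' (S,M), 'mu' (S,M,d), 'Sigma' (S,M,d,d).
    """
    rng = np.random.default_rng(rng)
    z = rng.choice(len(omega), p=omega)        # pick one HMM component
    hmm = h3ms[z]
    x = rng.choice(hmm['pi'].shape[0], p=hmm['pi'])   # initial hidden state
    ys = []
    for t in range(tau):
        # pick a Gaussian component of the current state's GMM emission
        m = rng.choice(hmm['c'].shape[1], p=hmm['c'][x])
        ys.append(rng.multivariate_normal(hmm['mu'][x, m], hmm['Sigma'][x, m]))
        if t < tau - 1:
            x = rng.choice(hmm['A'].shape[0], p=hmm['A'][x])  # Markov transition
    return z, np.array(ys)
```

Sampling first selects a mixture component z with probability ω_z, then runs that HMM's Markov chain, emitting from the state-conditional GMM at each step.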

## 3 Variational hierarchical EM algorithm for H3Ms

The hierarchical expectation-maximization (HEM) algorithm [7] was initially proposed to cluster Gaussian distributions, by reducing a GMM with a large number of components to a new GMM with fewer components, and was later extended to dynamic texture models [8]. In this section we derive a variational formulation of the HEM algorithm (VHEM) to cluster HMMs.

### 3.1 Formulation

Let M^{(b)} = {ω^{(b)}_i, M^{(b)}_i}_{i=1}^{K^{(b)}} be a base hidden Markov mixture model with K^{(b)} components. The goal of the VHEM algorithm is to find a reduced mixture M^{(r)} = {ω^{(r)}_j, M^{(r)}_j}_{j=1}^{K^{(r)}} with K^{(r)} < K^{(b)} components that represents M^{(b)}. The likelihood of a random sequence y_{1:τ} ~ M^{(b)} is given by

  p(y_{1:τ}|M^{(b)}) = ∑_{i=1}^{K^{(b)}} ω^{(b)}_i p(y_{1:τ}|z^{(b)}=i, M^{(b)}),  (2)

where z^{(b)} ∈ {1, …, K^{(b)}} is the hidden variable that indexes the mixture components, p(y_{1:τ}|z^{(b)}=i, M^{(b)}) is the likelihood of y_{1:τ} under the i-th mixture component, and ω^{(b)}_i is the prior weight for the i-th component. The likelihood of the random sequence y_{1:τ} ~ M^{(r)} is

  p(y_{1:τ}|M^{(r)}) = ∑_{j=1}^{K^{(r)}} ω^{(r)}_j p(y_{1:τ}|z^{(r)}=j, M^{(r)}),  (3)

where z^{(r)} ∈ {1, …, K^{(r)}} is the hidden variable for indexing components in M^{(r)}. Note that we will always use i and j to index the components of the base model, M^{(b)}, and the reduced model, M^{(r)}, respectively. In addition, we will always use β_{1:τ} and ρ_{1:τ} to index the hidden-state sequences of M^{(b)}_i and M^{(r)}_j, respectively. To reduce clutter we will denote M^{(b)}_i = {z^{(b)}=i, M^{(b)}} and M^{(r)}_j = {z^{(r)}=j, M^{(r)}}. In addition, we will use the short-hands M^{(b)}_{i,β_{1:τ}} and M^{(r)}_{j,ρ_{1:τ}} when conditioning over specific state sequences; for example, we denote p(y_{1:τ}|M^{(r)}_{j,ρ_{1:τ}}) = p(y_{1:τ}|x_{1:τ}=ρ_{1:τ}, z^{(r)}=j, M^{(r)}). Finally, we will use m and ℓ for indexing the Gaussian mixture components of the emission probabilities of the base and reduced mixtures, which we will denote as M^{(b),i}_{β,m} and M^{(r),j}_{ρ,ℓ}, respectively.

### 3.2 Parameter estimation - a variational formulation

To obtain the reduced model, we consider a set of N virtual samples drawn from the base model M^{(b)}, such that N_i = N ω^{(b)}_i samples are drawn from the i-th component. We denote the set of N_i virtual samples for the i-th component as Y_i = {y^{(i,m)}_{1:τ}}_{m=1}^{N_i}, where y^{(i,m)}_{1:τ} ~ M^{(b)}_i, and the entire set of N samples as Y = {Y_i}_{i=1}^{K^{(b)}}. Note that, in this formulation, we are not generating virtual samples of the hidden states according to the joint distribution of the base component, p(x_{1:τ}, y_{1:τ}|M^{(b)}_i). The reason is that the hidden-state spaces of the base mixture components may have different representations (e.g., the numbering of the hidden states may be permuted between the components). This basis mismatch would cause problems when the parameters of M^{(r)} are computed from virtual samples of the hidden states of M^{(b)}. Instead, we must treat the hidden states x_{1:τ} as "missing" information, as in the standard EM formulation.

The likelihood of the virtual samples is

  J(M^{(r)}) = log p(Y|M^{(r)}) = ∑_{i=1}^{K^{(b)}} log p(Y_i|M^{(r)}) = ∑_{i=1}^{K^{(b)}} log ∑_{j=1}^{K^{(r)}} ω^{(r)}_j p(Y_i|M^{(r)}_j).  (4)

In particular, for a given i, the computation of log p(Y_i|M^{(r)}) can be carried out by solving the optimization problem [16, 17]:

  log p(Y_i|M^{(r)}) = max_{P_i(z_i)} { log p(Y_i|M^{(r)}) − D(P_i(z_i) ‖ P(z_i=j|Y_i, M^{(r)})) }  (5)
   = max_{P_i(z_i)} ∑_j P_i(z_i=j) [ log ω^{(r)}_j + log p(Y_i|M^{(r)}_j) − log P_i(z_i=j) ]  (6)

for i = 1, …, K^{(b)}, where P_i(z_i) are variational distributions and D(p‖q) = ∫ p(y) log (p(y)/q(y)) dy is the Kullback-Leibler (KL) divergence between two distributions, p and q. In order to obtain a consistent clustering [7], we assume the whole sample Y_i is assigned to the same component of the reduced model, i.e., P_i(z_i=j) = z_{ij}, with z_{ij} ≥ 0 and ∑_{j=1}^{K^{(r)}} z_{ij} = 1, and (6) becomes:

  log p(Y_i|M^{(r)}) = max_{z_{ij}} ∑_j z_{ij} [ log ω^{(r)}_j + log p(Y_i|M^{(r)}_j) − log z_{ij} ]  (7)

Considering that the virtual samples are independent for different values of i, we can solve (6) independently for each i, using the result in Section 6.2, and find

  ẑ_{ij} = ω^{(r)}_j exp{ log p(Y_i|M^{(r)}_j) } / ∑_{j′=1}^{K^{(r)}} ω^{(r)}_{j′} exp{ log p(Y_i|M^{(r)}_{j′}) }.  (8)

For the likelihood of the virtual samples we use

  log p(Y_i|M^{(r)}_j) = ∑_{m=1}^{N_i} log p(y^{(i,m)}_{1:τ}|M^{(r)}_j) = N_i [ (1/N_i) ∑_{m=1}^{N_i} log p(y^{(i,m)}_{1:τ}|M^{(r)}_j) ]  (9)
   ≈ N_i E_{M^{(b)}_i}[ log p(y_{1:τ}|M^{(r)}_j) ]  (10)

where (10) follows from the law of large numbers [7] (as N_i → ∞). Substituting (10) in (8) we get the formula for ẑ_{ij} derived in [7]:

  ẑ_{ij} = ω^{(r)}_j exp{ N_i E_{M^{(b)}_i}[log p(y_{1:τ}|M^{(r)}_j)] } / ∑_{j′=1}^{K^{(r)}} ω^{(r)}_{j′} exp{ N_i E_{M^{(b)}_i}[log p(y_{1:τ}|M^{(r)}_{j′})] }.  (11)
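Equation (11) is a softmax over the components of the reduced model, weighted by the virtual sample counts N_i. A minimal sketch of this update, computed in log-space for numerical stability (the helper name `assignment_weights` is illustrative, not from the paper):

```python
import numpy as np

def assignment_weights(omega_r, ell, N):
    """z_hat[i, j] proportional to omega_r[j] * exp(N[i] * ell[i, j]),
    where ell[i, j] approximates E_{M_i^(b)}[log p(y_{1:tau} | M_j^(r))].

    omega_r : (K_r,) reduced-model prior weights
    ell     : (K_b, K_r) expected log-likelihood matrix
    N       : (K_b,) virtual sample counts per base component
    """
    logits = np.log(omega_r)[None, :] + N[:, None] * ell
    logits -= logits.max(axis=1, keepdims=True)   # stabilize the softmax
    z = np.exp(logits)
    return z / z.sum(axis=1, keepdims=True)       # rows sum to 1, eq. (11)
```

Note that for large N_i the softmax becomes sharply peaked, so the assignments approach the hard clustering of (7).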

We follow a similar approach to compute the expected log-likelihoods E_{M^{(b)}_i}[log p(y_{1:τ}|M^{(r)}_j)]. We introduce variational distributions P^{i,j}_{β_{1:τ}}(x_{1:τ}) to approximate P(x_{1:τ}|y_{1:τ}, M^{(r)}_j) for observations emitted by state sequence β_{1:τ}, and solve the maximization problem

  E_{M^{(b)}_i}[log p(y_{1:τ}|M^{(r)}_j)] =  (12)
   = max_{P^{i,j}} ∑_{β_{1:τ}} π^{(b),i}_{β_{1:τ}} E_{M^{(b)}_{i,β_{1:τ}}}[ log p(y_{1:τ}|M^{(r)}_j) − D(P^{i,j}_{β_{1:τ}} ‖ P(x_{1:τ}|y_{1:τ}, M^{(r)}_j)) ]  (13)
   = max_{P^{i,j}} ∑_{β_{1:τ}} π^{(b),i}_{β_{1:τ}} E_{M^{(b)}_{i,β_{1:τ}}}[ ∑_{ρ_{1:τ}} P^{i,j}_{β_{1:τ}}(x_{1:τ}=ρ_{1:τ}) log ( π^{(r),j}_{ρ_{1:τ}} p(y_{1:τ}|M^{(r)}_{j,ρ_{1:τ}}) / P^{i,j}_{β_{1:τ}}(x_{1:τ}=ρ_{1:τ}) ) ]  (14)

where in (13) we have used the law of total probability to condition the expectation over each state sequence β_{1:τ} of M^{(b)}_i. In general, maximizing (14) exactly sets the variational distribution to the true posterior and reduces (together with (11)) to the E-step of the HEM algorithm for hidden-state models derived in [10]:

  E_{M^{(b)}_i}[log p(y_{1:τ}|M^{(r)}_j)] = E_{M^{(b)}_i}[ E_{x_{1:τ}|y_{1:τ}, M^{(r)}_j}[ log p(y_{1:τ}, x_{1:τ}|M^{(r)}_j) ] ] + c  (15)

where the inner expectation is taken with respect to the current estimate of P(x_{1:τ}|y_{1:τ}, M^{(r)}_j), and c is a term that does not depend on M^{(r)}_j.

### 3.3 The variational HEM for HMMs

The maximization of (14) cannot be carried out in an efficient way, as it involves computing the expected log-likelihood of a mixture. To make it tractable, we follow the variational approximation proposed by Hershey [14], and restrict the maximization to factored distributions in the form of a Markov chain, i.e.,

  P^{i,j}_{β_{1:τ}}(x_{1:τ}=ρ_{1:τ}) = ϕ^{i,j}_{ρ_{1:τ}|β_{1:τ}} = ϕ^{i,j}_1(ρ_1, β_1) ∏_{t=2}^{τ} ϕ^{i,j}_t(ρ_{t−1}, ρ_t, β_t)  (16)

where ∑_{ρ_1} ϕ^{i,j}_1(ρ_1, β_1) = 1 and ∑_{ρ_t} ϕ^{i,j}_t(ρ_{t−1}, ρ_t, β_t) = 1.

Substituting (16) into (14) we get a lower bound to the expected log-likelihood (14), i.e.,

  E_{M^{(b)}_i}[log p(y_{1:τ}|M^{(r)}_j)] ≥ J_{i,j}(M^{(r)}_j, ϕ^{i,j})  ∀ϕ^{i,j}  (17)

where we have defined

  J_{i,j}(M^{(r)}_j, ϕ^{i,j}) = ∑_{β_{1:τ}} π^{(b),i}_{β_{1:τ}} ∑_{ρ_{1:τ}} ϕ^{i,j}_{ρ_{1:τ}|β_{1:τ}} log [ π^{(r),j}_{ρ_{1:τ}} exp( E_{M^{(b)}_{i,β_{1:τ}}}[log p(y_{1:τ}|M^{(r)}_{j,ρ_{1:τ}})] ) / ϕ^{i,j}_{ρ_{1:τ}|β_{1:τ}} ].  (18)

Using the property of HMMs with memory one that observations at different time instants are independent given the corresponding states, we can break the expectation term in equation (18) into the following summation:

  E_{M^{(b)}_{i,β_{1:τ}}}[log p(y_{1:τ}|M^{(r)}_{j,ρ_{1:τ}})] = ∑_{t=1}^{τ} L(M^{(b)}_{i,β_t} ‖ M^{(r)}_{j,ρ_t})  (19)

where L(M^{(b)}_{i,β} ‖ M^{(r)}_{j,ρ}) = E_{M^{(b)}_{i,β}}[log p(y_t|M^{(r)}_{j,ρ})] is the expected log-likelihood of a single emission. As the emission probabilities are GMMs, the computation of (19) cannot be carried out efficiently in closed form. Hence, we use a variational approximation [18], and introduce variational parameters η^{(i,β),(j,ρ)}_{ℓ|m} for m = 1, …, M and ℓ = 1, …, M, with η^{(i,β),(j,ρ)}_{ℓ|m} ≥ 0 and ∑_ℓ η^{(i,β),(j,ρ)}_{ℓ|m} = 1. Intuitively, η^{(i,β),(j,ρ)} is the responsibility matrix between the Gaussian observation components for state β in M^{(b)}_i and state ρ in M^{(r)}_j, where η^{(i,β),(j,ρ)}_{ℓ|m} is the probability that an observation from component m of M^{(b)}_{i,β} corresponds to component ℓ of M^{(r)}_{j,ρ}. Again, we obtain a lower bound:

  L(M^{(b)}_{i,β_t} ‖ M^{(r)}_{j,ρ_t}) ≥ L_η(M^{(b)}_{i,β_t} ‖ M^{(r)}_{j,ρ_t})  ∀η^{(i,β),(j,ρ)}  (20)

where we have defined:

  L_η(M^{(b)}_{i,β_t} ‖ M^{(r)}_{j,ρ_t}) = ∑_{m=1}^{M} c^{(b),i}_{β,m} ∑_{ℓ=1}^{M} η^{(i,β),(j,ρ)}_{ℓ|m} [ log c^{(r),j}_{ρ,ℓ} + L_G(M^{(b),i}_{β,m} ‖ M^{(r),j}_{ρ,ℓ}) − log η^{(i,β),(j,ρ)}_{ℓ|m} ]  (21)

where L_G(M^{(b),i}_{β,m} ‖ M^{(r),j}_{ρ,ℓ}) = E_{M^{(b),i}_{β,m}}[log p(y|M^{(r),j}_{ρ,ℓ})] can be computed exactly for Gaussians:

  L_G(M^{(b),i}_{β,m} ‖ M^{(r),j}_{ρ,ℓ}) = −(1/2) [ d log 2π + log |Σ^{(r),j}_{ρ,ℓ}| + tr( (Σ^{(r),j}_{ρ,ℓ})^{−1} Σ^{(b),i}_{β,m} ) + (μ^{(r),j}_{ρ,ℓ} − μ^{(b),i}_{β,m})ᵀ (Σ^{(r),j}_{ρ,ℓ})^{−1} (μ^{(r),j}_{ρ,ℓ} − μ^{(b),i}_{β,m}) ].  (23)
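The expected log-likelihood between two Gaussians in (23) has a simple closed form; the sketch below (function name illustrative) evaluates it directly with NumPy.

```python
import numpy as np

def gauss_exp_loglik(mu_b, Sigma_b, mu_r, Sigma_r):
    """E_{N(mu_b, Sigma_b)}[log N(y; mu_r, Sigma_r)], i.e. L_G in (23).

    Closed form: -1/2 [ d log 2pi + log|Sigma_r| + tr(Sigma_r^-1 Sigma_b)
                        + (mu_r - mu_b)^T Sigma_r^-1 (mu_r - mu_b) ].
    """
    d = mu_b.shape[0]
    inv_r = np.linalg.inv(Sigma_r)
    diff = mu_r - mu_b
    return -0.5 * (d * np.log(2 * np.pi)
                   + np.log(np.linalg.det(Sigma_r))
                   + np.trace(inv_r @ Sigma_b)
                   + diff @ inv_r @ diff)
```

When the two Gaussians coincide, the value reduces to the negative differential entropy of the Gaussian, which is a useful sanity check.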

Plugging (21) into (19) and (18) we get the lower bound to the expected log-likelihood:

  E_{M^{(b)}_i}[log p(y_{1:τ}|M^{(r)}_j)] ≥ J_{i,j}(M^{(r)}_j, ϕ^{i,j}, η)  ∀ϕ^{i,j}, η  (24)

where we have defined:

  J_{i,j}(M^{(r)}_j, ϕ^{i,j}, η) = ∑_{β_{1:τ}} π^{(b),i}_{β_{1:τ}} ∑_{ρ_{1:τ}} ϕ^{i,j}_{ρ_{1:τ}|β_{1:τ}} [ log π^{(r),j}_{ρ_{1:τ}} + ∑_{t=1}^{τ} L(M^{(b)}_{i,β_t} ‖ M^{(r)}_{j,ρ_t}) − log ϕ^{i,j}_{ρ_{1:τ}|β_{1:τ}} ].  (25)

Maximizing the right-hand side of (24) with respect to ϕ^{i,j} and η finds the most accurate approximation to the true posterior within the restricted class, i.e., the one that achieves the tightest lower bound.

Finally, plugging (24) into (4), we obtain a lower bound on the log-likelihood of the virtual sample:

 J(M(r)) ≥ J(M(r),z,ϕ,η)∀z,ϕ,η (26)

where we have defined:

  J(M^{(r)}, z, ϕ, η) = ∑_{i=1}^{K^{(b)}} ∑_{j=1}^{K^{(r)}} z_{ij} log ω^{(r)}_j − ∑_{i=1}^{K^{(b)}} ∑_{j=1}^{K^{(r)}} z_{ij} log z_{ij}  (27)
   + ∑_{i=1}^{K^{(b)}} ∑_{j=1}^{K^{(r)}} z_{ij} N_i ∑_{β_{1:τ}} π^{(b),i}_{β_{1:τ}} ∑_{ρ_{1:τ}} ϕ^{i,j}_{ρ_{1:τ}|β_{1:τ}} log π^{(r),j}_{ρ_{1:τ}}
   + ∑_{i=1}^{K^{(b)}} ∑_{j=1}^{K^{(r)}} z_{ij} N_i ∑_{β_{1:τ}} π^{(b),i}_{β_{1:τ}} ∑_{ρ_{1:τ}} ϕ^{i,j}_{ρ_{1:τ}|β_{1:τ}} ∑_{t=1}^{τ} L(M^{(b)}_{i,β_t} ‖ M^{(r)}_{j,ρ_t})
   − ∑_{i=1}^{K^{(b)}} ∑_{j=1}^{K^{(r)}} z_{ij} N_i ∑_{β_{1:τ}} π^{(b),i}_{β_{1:τ}} ∑_{ρ_{1:τ}} ϕ^{i,j}_{ρ_{1:τ}|β_{1:τ}} log ϕ^{i,j}_{ρ_{1:τ}|β_{1:τ}}

To find the tightest possible lower bound to the log-likelihood of the virtual sample we need to solve

 J(M(r)) ≥ maxz,ϕ,ηJ(M(r),z,ϕ,η). (28)

Starting from an initial guess for M^{(r)}, the parameters can be estimated by maximizing (28) iteratively with respect to z, ϕ and η (E-step), and M^{(r)} (M-step).

### 3.4 E-step

The E-step first considers the maximization of (28) with respect to η for fixed z and ϕ, i.e.,

 ^η=argmaxηJ(M(r),z,ϕ,η) (29)

It can be easily verified that the maximization (29) does not depend on z and can be carried out independently for each tuple (i, β, j, ρ) using the result in Section 6.2 [18], which gives:

  η̂^{(i,β),(j,ρ)}_{ℓ|m} = c^{(r),j}_{ρ,ℓ} exp{ L_G(M^{(b),i}_{β,m} ‖ M^{(r),j}_{ρ,ℓ}) } / ∑_{ℓ′=1}^{M} c^{(r),j}_{ρ,ℓ′} exp{ L_G(M^{(b),i}_{β,m} ‖ M^{(r),j}_{ρ,ℓ′}) }  (30)

and that the terms L(M^{(b)}_{i,β} ‖ M^{(r)}_{j,ρ}) in (21) can then be computed for each (i, β, j, ρ) as:

  L(M^{(b)}_{i,β} ‖ M^{(r)}_{j,ρ}) = ∑_{m=1}^{M} c^{(b),i}_{β,m} log ∑_{ℓ=1}^{M} c^{(r),j}_{ρ,ℓ} exp{ L_G(M^{(b),i}_{β,m} ‖ M^{(r),j}_{ρ,ℓ}) }.  (31)
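Updates (30) and (31) for a single state pair (β, ρ) amount to a row-wise softmax over the reduced GMM components followed by a weighted log-sum-exp. A minimal sketch, with illustrative names:

```python
import numpy as np

def eta_and_L(c_b, c_r, LG):
    """Variational GMM responsibilities (30) and aggregated bound (31)
    for one state pair (beta, rho); names are illustrative.

    c_b, c_r : (M,) base/reduced mixing weights
    LG[m, l] : Gaussian expected log-likelihood between base component m
               and reduced component l.
    """
    logits = np.log(c_r)[None, :] + LG            # (M, M), rows indexed by m
    mx = logits.max(axis=1, keepdims=True)
    lse = mx[:, 0] + np.log(np.exp(logits - mx).sum(axis=1))  # stable log-sum-exp
    eta = np.exp(logits - lse[:, None])           # eta[m, l] = hat eta_{l|m}, eq. (30)
    L = float(c_b @ lse)                          # eq. (31)
    return eta, L
```

Each row of `eta` sums to one, matching the constraint ∑_ℓ η^{(i,β),(j,ρ)}_{ℓ|m} = 1.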

Next, (28) is maximized with respect to ϕ for fixed z and η, i.e.,

 ^ϕ=argmaxϕJ(M(r),z,ϕ,η) (32)

The maximization does not depend on z and can be carried out independently for each pair (i, j) with a backward recursion [14] that computes

  ϕ̂^{i,j}_t(ρ_{t−1}, ρ_t, β_t) = a^{(r),j}_{ρ_{t−1},ρ_t} exp{ L(M^{(b)}_{i,β_t} ‖ M^{(r)}_{j,ρ_t}) + L^{i,j}_{t+1}(β_t, ρ_t) } / ∑_{ρ} a^{(r),j}_{ρ_{t−1},ρ} exp{ L(M^{(b)}_{i,β_t} ‖ M^{(r)}_{j,ρ}) + L^{i,j}_{t+1}(β_t, ρ) }  (33)
  L^{i,j}_t(β_{t−1}, ρ_{t−1}) = ∑_{β=1}^{S} a^{(b),i}_{β_{t−1},β} log ∑_{ρ=1}^{S} a^{(r),j}_{ρ_{t−1},ρ} exp{ L(M^{(b)}_{i,β} ‖ M^{(r)}_{j,ρ}) + L^{i,j}_{t+1}(β, ρ) }  (34)

for t = τ, …, 2, where it is understood that L^{i,j}_{τ+1}(β, ρ) = 0, and terminates with

  ϕ̂^{i,j}_1(ρ_1, β_1) = π^{(r),j}_{ρ_1} exp{ L(M^{(b)}_{i,β_1} ‖ M^{(r)}_{j,ρ_1}) + L^{i,j}_2(β_1, ρ_1) } / ∑_{ρ} π^{(r),j}_{ρ} exp{ L(M^{(b)}_{i,β_1} ‖ M^{(r)}_{j,ρ}) + L^{i,j}_2(β_1, ρ) }  (35)
  J_{i,j}(M^{(r)}_j, ϕ̂^{i,j}, η) = ∑_{β=1}^{S} π^{(b),i}_{β} log ∑_{ρ=1}^{S} π^{(r),j}_{ρ} exp{ L(M^{(b)}_{i,β} ‖ M^{(r)}_{j,ρ}) + L^{i,j}_2(β, ρ) }.  (36)
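The backward recursion (33)-(36) for one pair (i, j) can be sketched as follows, assuming the aggregated emission terms L(M^{(b)}_{i,β} ‖ M^{(r)}_{j,ρ}) of (31) are precomputed in a matrix; the function names and array layout are illustrative:

```python
import numpy as np

def _lse(a, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def vhem_backward(pi_b, A_b, pi_r, A_r, Lmat, tau):
    """Backward recursion (33)-(36) for one pair (i, j); returns the
    bound J_{i,j} of (36).

    pi_b, A_b : base HMM initial distribution (S,) and transitions (S, S)
    pi_r, A_r : reduced HMM initial distribution and transitions
    Lmat[beta, rho] : L(M^(b)_{i,beta} || M^(r)_{j,rho}) from eq. (31)
    """
    S = pi_b.shape[0]
    Lnext = np.zeros((S, S))            # L^{i,j}_{tau+1} = 0, indexed [beta, rho]
    for _ in range(tau, 1, -1):         # t = tau, ..., 2
        # scores[beta, rho_prev, rho] = log a^(r)_{rho_prev,rho}
        #                               + L(beta, rho) + L^{i,j}_{t+1}(beta, rho)
        scores = np.log(A_r)[None, :, :] + (Lmat + Lnext)[:, None, :]
        inner = _lse(scores, axis=2)    # [beta, rho_prev], the inner log-sum in (34)
        Lnext = A_b @ inner             # L^{i,j}_t(beta_prev, rho_prev), eq. (34)
    # termination, eq. (36)
    final = np.log(pi_r)[None, :] + Lmat + Lnext   # [beta, rho]
    return float(pi_b @ _lse(final, axis=1))
```

The normalized ϕ̂ factors of (33) and (35) are the softmax of the same scores; the sketch keeps only the bound for brevity.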

Next, the maximization of (28) with respect to z for fixed ϕ and η, i.e.,

 ^z=argmaxzJ(M(r),z,ϕ,η) (37)

reduces to computing the ẑ_{ij} as in (11), using (36) to approximate (10).

Finally, we compute the following summary statistics:

  ν^{i,j}_1(σ, γ) = π^{(b),i}_{γ} ϕ̂^{i,j}_1(σ, γ)  (38)
  ξ^{i,j}_t(ρ, σ, γ) = ( ∑_{β=1}^{S} ν^{i,j}_{t−1}(ρ, β) a^{(b),i}_{β,γ} ) ϕ̂^{i,j}_t(ρ, σ, γ), for t = 2, …, τ  (39)
  ν^{i,j}_t(σ, γ) = ∑_{ρ=1}^{S} ξ^{i,j}_t(ρ, σ, γ), for t = 2, …, τ  (40)

and the aggregates

  ν̂^{i,j}_1(σ) = ∑_{γ=1}^{S} ν^{i,j}_1(σ, γ)  (41)
  ν̂^{i,j}(σ, γ) = ∑_{t=1}^{τ} ν^{i,j}_t(σ, γ)  (42)
  ξ̂^{i,j}(ρ, σ) = ∑_{t=2}^{τ} ∑_{γ=1}^{S} ξ^{i,j}_t(ρ, σ, γ).  (43)
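The statistics (38)-(43) follow from a single forward pass over the optimized ϕ̂ factors. A sketch under an illustrative array layout (first index: reduced state, second index: base state):

```python
import numpy as np

def summary_stats(pi_b, A_b, phi1, phit, tau):
    """Forward accumulation of the summary statistics (38)-(43).

    phi1[sigma, gamma]           : hat phi^{i,j}_1
    phit[t-2][rho, sigma, gamma] : hat phi^{i,j}_t for t = 2, ..., tau
    (array layout is an illustrative choice, not the paper's notation)
    """
    S = pi_b.shape[0]
    nu = pi_b[None, :] * phi1                 # nu_1(sigma, gamma), eq. (38)
    nu1_agg = nu.sum(axis=1)                  # hat nu_1(sigma), eq. (41)
    nu_agg = nu.copy()                        # accumulates eq. (42)
    xi_agg = np.zeros((S, S))                 # accumulates eq. (43)
    for t in range(2, tau + 1):
        pred = nu @ A_b                       # sum_beta nu_{t-1}(rho, beta) a^(b)_{beta,gamma}
        xi = pred[:, None, :] * phit[t - 2]   # xi_t(rho, sigma, gamma), eq. (39)
        nu = xi.sum(axis=0)                   # nu_t(sigma, gamma), eq. (40)
        nu_agg += nu
        xi_agg += xi.sum(axis=2)
    return nu1_agg, nu_agg, xi_agg
```

Since each ϕ̂ factor is normalized over the reduced state, the mass of ν^{i,j}_t sums to one at every time step, which gives a quick correctness check.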

The quantity ν^{i,j}_t(σ, γ) is the responsibility between state σ of the reduced HMM M^{(r)}_j and state γ of the base HMM M^{(b)}_i at time t, when modeling a sequence generated by M^{(b)}_i. Similarly, the quantity ξ^{i,j}_t(ρ, σ, γ) is the responsibility between a transition from state to state