The hidden Markov model (HMM)  is a probabilistic model that assumes a signal is generated by a double embedded stochastic process. A hidden state process, which evolves over discrete time instants as a Markov chain, encodes the dynamics of the signal, and an observation process, at each time conditioned on the current state, encodes the appearance of the signal. HMMs have been successfully applied to a variety of applications, including speech recognition , music analysis , on-line hand-writing recognition , analysis of biological sequences .
The focus of this paper is an algorithm for clustering HMMs. More precisely, we design an algorithm that, given a collection of HMMs, partitions them into
clusters of “similar” HMMs, while also learning a representative HMM “cluster center” that summarizes each cluster appropriately. This is similar to standard k-means clustering, except that the data points are HMMs now instead of vectors in
. Such HMM clustering algorithm has various potential applications, ranging from hierarchical clustering of sequential data (e.g., speech or motion sequences), over hierarchical indexing for fast retrieval, to reducing the computational complexity of estimating mixtures of HMMs from large datasets via hierarchical modeling (e.g., to learn semantic annotation models for music and video).
One possible approach is to group HMMs in parameter space. However, as HMM parameters lie on a non-linear manifold, they cannot be clustered by a simple application of the k-means algorithm, which assumes real vectors in a Euclidean space. One solution, proposed in , first constructs an appropriate similarity matrix between all HMMs that are to be clustered (e.g., based on the Bhattacharya affinity, which depends on the HMM parameters 
), and then applies spectral clustering. While this approach has proven successful to group HMMs into similar clusters, it does not allow to generate novel HMMs as cluster centers. Instead, one is limited to representing each cluster by one of the given HMMs, e.g., the HMM which the spectral clustering procedure maps the closest to each spectral clustering center. This may be suboptimal for various applications of HMM clustering.
An alternative to clustering the HMMs in parameter space is to cluster them directly with respect to the probability distributions they represent. To cluster Gaussian probability distributions, Vasconcelos and Lipmann components and reduces it to another GMM with fewer components, where each of the mixture components of the reduced GMM represents, i.e., clusters, a group of the original Gaussian mixture components. More recently, Chan et al.  derived an HEM algorithm to cluster dynamic texture (DT) models (i.e., linear dynamical systems, LDSs) through their probability distributions. HEM has been applied successfully to construct GMM hierarchies for efficient image indexing , to cluster video represented by DTs , and to estimate GMMs or DT mixtures (DTMs,i.e., LDS mixtures) from large datasets for semantic annotation of images , video  and music [12, 13].
To extend the HEM framework for GMMs to mixtures of HMMs (H3Ms), additional marginalization of the hidden-state processes is required, as for DTMs. However, while Gaussians and DTs allow tractable inference in the E-step of HEM, this is no longer the case for HMMs. Therefore, in this work, we derive a variational formulation of the HEM algorithm (VHEM) to cluster HMMs through their probability distributions, based on a variational approximation derived by Hershey . The resulting algorithm not only allows to cluster HMMs, it also learns novel HMMs that are representative centers of each cluster, in a way that is consistent with the underlying generative probabilistic model of the HMM. The resulting VHEM algorithm can be generalized to handle other classes of graphical models, for which standard HEM would otherwise be intractable, by leveraging similar variational approximations. The efficacy of the VHEM-H3M algorithm is demonstrated for various applications, including hierarchical motion clustering, semantic music annotation, and online hand-writing recognition.
2 The hidden Markov model
A hidden Markov model (HMM) assumes a sequence of observations is generated by a double embedded stochastic process, where each observation at time depends on the state of a discrete hidden variable and where the sequence of hidden states evolves as a first order Markov process. The discrete variables can take one of values, and the evolution of the hidden process is encoded in a transition matrix whose entries are the state transition probabilities , and an initial state distribution , where
. Each state generates observation accordingly to an emission probability density function,, which here we assume to be a Gaussian mixture model:
where is the number of Gaussian components and are the mixing weights. In the following, when referring to a sequences of length , we will use the notation to represent the probability that the HMM generates the state sequence . The HMM is specified by the parameters which can be efficiently learned with the forward-backward algorithm , which is based on maximum likelihood.
A hidden Markov mixture model (H3M)  models a set observation sequences as samples from a group of hidden Markov models, which represent different sub-behaviors. For a given sequence, an assignment variable selects the parameters of one of the HMMs, where the HMM is selected with probability . Each mixture component is parametrized by and the H3M is parametrized by . Given a collection of relevant observation sequences, the parameters of can be learned with recourse to the EM algorithm .
To reduce clutter, here we assume that all the HMMs have the same number of states and that all emission probabilities have mixture components, though a more general case could be derived.
3 Variational hierarchical EM algorithm for H3Ms
The hierarchical expectation maximization algorithm (HEM) 
was initially proposed to cluster Gaussian distributions, by reducing a GMM with a large number of components to a new GMM with fewer components, and then extended to dynamic texture models. In this section we derive a variational formulation of the HEM algorithm (VHEM) to cluster HMMs.
Let be a base hidden Markov mixture model with components. The goal of the VHEM algorithm is to find a reduced mixture with components that represent . The likelihood of a random sequence is given by
where is the hidden variable that indexes the mixture components. is the likelihood of under the ith mixture component, and is the prior weight for the ith component. The likelihood of the random sequence is
where is the hidden variable for indexing components in . Note that we will always use and to index the components of the base model, , and the reduced model, , respectively. In addition, we will always use and to index the hidden states of and and for . To reduce clutter we will denote , and . In addition, we will use short-hands and when conditioning over specific state sequences. For example, we denote , and . Finally, we will use and for indexing the gaussian mixture components of the emission probabilities of the base respectively reduced mixture, which we will denote as and .
3.2 Parameter estimation - a variational formulation
To obtain the reduced model, we consider a set of virtual samples drawn from the base model , such that samples are drawn from the th component. We denote the set of virtual samples for the th component as , where , and the entire set of samples as . Note that, in this formulation, we are not generating virtual samples
according to the joint distribution of the base component,. The reason is that the hidden state spaces of each base mixture component may have a different representation, (e.g., , the numbering of the hidden states may be permuted between the components). This basis mismatch will cause problems when the parameters of are computed from virtual samples of the hidden states of . Instead, we must treat as “missing” information, as in the standard EM formulation.
The likelihood of the virtual samples is
for , where are variational distributions and is the Kullback-Leibler (KL) divergence between two distributions, and . In order to obtain a consistent clustering , we assume the whole sample is assigned to the same component of the reduced model, i.e., , with and , and (6) becomes:
For the likelihood of the virtual samples we use
) follows from the law of large numbers (as ). Substituting (10) in (8) we get the formula for derived in :
We follow a similar approach to compute the expected log-likelihoods . We introduce variational distributions to approximate for observations emitted by state sequence , and solve the maximization problem
where in (13
) we have used the law of total probability to condition the expectation over each state sequenceof
In general, maximizing (14) exactly sets the variational distribution to the true posterior and reduces (together with (11)) to the E-step of the HEM algorithm for hidden state models derived in :
where the inner expectation is taken with respect to the current estimate of , and is some term that does not depend on .
3.3 The variational HEM for HMMs
The maximization of (14) cannot be carried out in a efficient way, as it involves computing the expected log-likelihood of a mixture. To make it tractable we follow a variational approximation proposed by Hershey , and restrict the maximization to factored distribution in the form of a Markov chain, i.e.,
where and .
where we have defined
Using the property of HMM with memory one that observations at different time instants are independent given the corresponding states, we can break the expectation term in equation (18) in the following summation
where . As the emission probabilities are GMMs, the computation (19) cannot be carried out efficiently. Hence, we use a variational approximation , and introducte variational parameters for , with , and . Intuitively, is the responsibility matrix between gaussian observation components for state in and state in , where means the probability that an observation from component of corresponds to component of . Again, we obtain a lower bound:
where we have defined:
where can be computed exactly for Gaussians
where we have defined:
Maximizing the right hand side of (24) with respect to finds the most accurate approximation to the real posterior within the restricted class, i.e., the one that achieves the tightest lower bound.
where we have defined:
To find the tightest possible lower bound to the log-likelihood of the virtual sample we need to solve
Starting from an initial guess for , the parameters can be estimated by maximizing (28) irteratively with respect to (E-step) , , and (M-step)
The E-steps first considers the maximization of (28) with respect to for fixed and , i.e.,
and that the terms in (21) can then be computed for each as:
Next, (28) is maximized with respect to for fixed and , i.e.,
The maximization does not depends on and can be carried out independently for each pair with a backward recursion recursion  that computes
for , where it is understood that , and terminates with
Next, the maximization of (28) with respect to for fixed and , i.e.,
Finally, we compute the following summary statistics:
and the aggregates
The quantity is the responsibility between state of the HMM and state of the HMM at time , when modeling a sequence generated by . Similarly, the quantity is the responsibility between a transition from state to state