The hidden Markov model (HMM) (Rabiner and Juang, 1993) is a probabilistic model that assumes a signal is generated by a double embedded stochastic process. A discrete-time hidden state process, which evolves as a Markov chain, encodes the dynamics of the signal, while an observation process encodes the appearance of the signal at each time, conditioned on the current state. HMMs have been successfully applied to a variety of fields, including speech recognition (Rabiner and Juang, 1993), music analysis (Qi et al., 2007) and identification (Batlle et al., 2002), online hand-writing recognition (Nag et al., 1986), analysis of biological sequences (Krogh et al., 1994), and clustering of time series data (Jebara et al., 2007; Smyth, 1997).
This paper is about clustering HMMs. More precisely, we are interested in an algorithm that, given a collection of HMMs, partitions them into
clusters of “similar” HMMs, while also learning a representative HMM “cluster center” that concisely and appropriately represents each cluster. This is similar to standard k-means clustering, except that the data points are HMMs now instead of vectors in.
Various applications motivate the design of HMM clustering algorithms, ranging from hierarchical clustering of sequential data (e.g., speech or motion sequences modeled by HMMs (Jebara et al., 2007)), to hierarchical indexing for fast retrieval, to reducing the computational complexity of estimating mixtures of HMMs from large datasets (e.g., semantic annotation models for music and video) — by clustering HMMs, efficiently estimated from many small subsets of the data, into a more compact mixture model of all data. However, there has been little work on HMM clustering and, therefore, its applications.
Existing approaches to clustering HMMs operate directly on the HMM parameter space, by grouping HMMs according to a suitable pairwise distance defined in terms of the HMM parameters. However, as HMM parameters lie on a non-linear manifold, a simple application of the k-means algorithm will not succeed in the task, since it assumes real vectors in a Euclidean space. In addition, such an approach would have the additional complication that HMM parameters for a particular generative model are not unique, i.e., a permutation of the states leads to the same generative model. One solution, proposed by Jebara et al. (2007), first constructs an appropriate similarity matrix between all HMMs that are to be clustered (e.g., based on the Bhattacharya affinity, which depends non-linearly on the HMM parameters (Jebara et al., 2004)
), and then applies spectral clustering. While this approach has proven successful to group HMMs into similar clusters(Jebara et al., 2007), it does not directly address the issue of generating HMM cluster centers. Each cluster can still be represented by choosing one of the given HMMs, e.g., the HMM which the spectral clustering procedure maps the closest to each spectral clustering center. However, this may be suboptimal for some applications of HMM clustering, for example in hierarchical estimation of annotation models. Another distance between HMM distributions suitable for spectral clustering is the KL divergence, which in practice has been approximated by sampling sequences from one model and computing their log-likelihood under the other (Juang and Rabiner, 1985; Zhong and Ghosh, 2003; Yin and Yang, 2005).
Instead, in this paper we propose to cluster HMMs directly with respect to the probability distributions
they represent. The probability distributions of the HMMs are used throughout the whole clustering algorithm, and not only to construct an initial embedding as in(Jebara et al., 2007)
. By clustering the output distributions of the HMMs, marginalized over the hidden-state distributions, we avoid the issue of multiple equivalent parameterizations of the hidden states. We derive a hierarchical expectation maximization (HEM) algorithm that, starting from a collection of input HMMs, estimates a smaller mixture model of HMMs that concisely represents and clusters the input HMMs (i.e., the input HMM distributions guide the estimation of the output mixture distribution).
The HEM algorithm is a generalization of the EM algorithm — the EM algorithm can be considered as a special case of HEM for a mixture of delta functions as input. The main difference between HEM and EM is in the E-step. While the EM algorithm computes the sufficient statistics given the observed data, the HEM algorithm calculates the expected sufficient statistics averaged over all possible observations generated by the input probability models. Historically, the first HEM algorithm was designed to cluster Gaussian probability distributions (Vasconcelos and Lippman, 1998)
. This algorithm starts from a Gaussian mixture model (GMM) withcomponents and reduces it to another GMM with fewer components, where each of the mixture components of the reduced GMM represents, i.e., clusters, a group of the original Gaussian mixture components. More recently, Chan et al. (2010a) derived an HEM algorithm to cluster dynamic texture (DT) models (i.e., linear dynamical systems, LDSs) through their probability distributions. HEM has been applied successfully to construct GMM hierarchies for efficient image indexing (Vasconcelos, 2001), to cluster video represented by DTs (Chan et al., 2010b), and to estimate GMMs or DT mixtures (DTMs, i.e., LDS mixtures) from large datasets for semantic annotation of images (Carneiro et al., 2007), video (Chan et al., 2010b) and music (Turnbull et al., 2008; Coviello et al., 2011).
To extend the HEM framework for GMMs to mixtures of HMMs (or hidden Markov mixture models, H3Ms), additional marginalization over the hidden-state processes is required, as with DTMs. However, while Gaussians and DTs allow tractable inference in the E-step of HEM, this is no longer the case for HMMs. Therefore, in this work, we derive a variational formulation of the HEM algorithm (VHEM), and then leverage a variational approximation derived by Hershey et al. (2008) (which has not been used in a learning context so far) to make the inference in the E-step tractable. The resulting algorithm not only clusters HMMs, but also learns novel HMMs that are representative centers of each cluster. The resulting VHEM algorithm can be generalized to handle other classes of graphical models, for which exact computation of the E-step in the standard HEM would be intractable, by leveraging similar variational approximations — e.g., the more general case of HMMs with emission probabilities that are (mixtures of) continuous exponential family distributions.
Compared to the spectral clustering algorithm of (Jebara et al., 2007), the VHEM algorithm has several advantages that make it suitable for a variety of applications. First, the VHEM algorithm is capable of both clustering, as well as learning novel cluster centers, in a manner that is consistent with the underlying generative probabilistic framework. In addition, since it does not require sampling steps, it is also scalable with low memory requirements. As a consequence, VHEM for HMMs allows for efficient estimation of HMM mixtures from large datasets using a hierarchical estimation procedure. In particular, intermediate HMM mixtures are first estimated in parallel by running the EM algorithm on small independent portions of the dataset. The final model is then estimated from the intermediate models using the VHEM algorithm. Because VHEM is based on maximum-likelihood principles, it drives model estimation towards similar optimal parameter values as performing maximum-likelihood estimation on the full dataset. In addition, by averaging over all possible observations compatible with the input models in the E-step, VHEM provides an implicit form of regularization that prevents over-fitting and improves robustness of the learned models, compared to a direct application of the EM algorithm on the full dataset. Note that, in contrast to (Jebara et al., 2007), VHEM does not construct a kernel embedding, which is a costly operation that requires calculating of pairwise similarity scores between all input HMMs to form the similarity matrix, as well as the inversion of this matrix.
In summary, the contributions of this paper are three-fold: i) we derive a variational formulation of the HEM algorithm for clustering HMMs, which generates novel HMM centers representative of each cluster; ii) we evaluate VHEM on a variety of clustering, annotation, and retrieval problems involving time-series data, showing improvement over current clustering methods; iii) we demonstrate in experiments that VHEM can effectively learn HMMs from large sets of data, more efficiently than standard EM, while improving model robustness through better regularization. With respect to our previous work, the VHEM algorithm for H3M was originally proposed in (Coviello et al., 2012a)
The remainder of the paper is organized as follows. We review the hidden Markov model (HMM) and the hidden Markov mixture model (H3M) in Section 2. We present the derivation of the VHEM-H3M algorithm in Section 3, followed by a discussion in Section 4. Finally, we present experimental results in Sections 5 and 6.
2 The hidden Markov (mixture) model
A hidden Markov model (HMM) assumes a sequence of observations is generated by a double embedded stochastic process, where each observation (or emission) at time depends on the state of a discrete hidden variable , and the sequence of hidden states evolves as a first-order Markov chain. The hidden variables can take one of values, , and the evolution of the hidden process is encoded in a state transition matrix , where each entry, , is the probability of transitioning from state to state , and an initial state distribution , where .
generates observations according to an emission probability density function,. Here, we assume the emission density is time-invariant, and modeled as a Gaussian mixture model (GMM) with components:
where is the hidden assignment variable that selects the mixture component, with as the mixture weight of the
th component, and each component is a multivariate Gaussian distribution,
with mean and covariance matrix . The HMM is specified by the parameters
which can be efficiently learned from an observation sequence with the Baum-Welch algorithm (Rabiner and Juang, 1993), which is based on maximum likelihood estimation.
The probability distribution of a state sequence generated by an HMM is
while the joint likelihood of an observation sequence and a state sequence is
Finally, the observation likelihood of is obtained by marginalizing out the state sequence from the joint likelihood,
where the summation is over all state sequences of length , and can be performed efficiently using the forward algorithm (Rabiner and Juang, 1993).
A hidden Markov mixture model (H3M) (Smyth, 1997) models a set of observation sequences as samples from a group of hidden Markov models, each associated to a specific sub-behavior. For a given sequence, an assignment variable selects the parameters of one of the HMMs, where the th HMM is selected with probability . Each mixture component is parametrized by
and the H3M is parametrized by , which can be estimated from a collection of observation sequences using the EM algorithm (Smyth, 1997).
To reduce clutter, here we assume that all the HMMs have the same number of hidden states and that all emission probabilities have mixture components. Our derivation could be easily extended to the more general case though.
3 Clustering hidden Markov models
Algorithms for clustering HMMs can serve a wide range of applications, from hierarchical clustering of sequential data (e.g., speech or motion sequences modeled by HMMs (Jebara et al., 2007)), to hierarchical indexing for fast retrieval, to reducing the computational complexity of estimating mixtures of HMMs from large weakly-annotated datasets — by clustering HMMs, efficiently estimated from many small subsets of the data, into a more compact mixture model of all data.
In this work we derive a hierarchical EM algorithm for clustering HMMs (HEM-H3M) with respect to their probability distributions. We approach the problem of clustering HMMs as reducing an input HMM mixture with a large number of components to a new mixture with fewer components. Note that different HMMs in the input mixture are allowed to have different weights (i.e., the mixture weights are not necessarily all equal).
One method for estimating the reduced mixture model is to generate samples from the input mixture, and then perform maximum likelihood estimation, i.e., maximize the log-likelihood of these samples. However, to avoid explicitly generating these samples, we instead maximize the expectation of the log-likelihood with respect to the input mixture model, thus averaging over all possible samples from the input mixture model. In this way, the dependency on the samples is replaced by a marginalization with respect to the input mixture model. While such marginalization is tractable for Gaussians and DTs, this is no longer the case for HMMs. Therefore, in this work, we i) derive a variational formulation of the HEM algorithm (VHEM), and ii) specialize it to the HMM case by leveraging a variational approximation proposed by Hershey et al. (2008). Note that (Hershey et al., 2008) was proposed as an alternative to MCMC sampling for the computation of the KL divergence between two HMMs, and has not been used in a learning context so far.
Let be a base hidden Markov mixture model with components. The goal of the VHEM algorithm is to find a reduced hidden Markov mixture model with (i.e., fewer) components that represents well. The likelihood of a random sequence is given by
where is the hidden variable that indexes the mixture components. is the likelihood of under the th mixture component, as in (6), and is the mixture weight for the th component. Likewise, the likelihood of the random sequence is
where is the hidden variable for indexing components in .
At a high level, the VHEM-H3M algorithm estimates the reduced H3M model in (9) from virtual sequences distributed according to the base H3M model in (8). From this estimation procedure, the VHEM algorithm provides:
a soft clustering of the original components into groups, where cluster membership is encoded in assignment variables that represent the responsibility of each reduced mixture component for each base mixture component, i.e., , for and ;
novel cluster centers represented by the individual mixture components of the reduced model in (9), i.e., for .
Finally, because we take the expectation over the virtual samples, the estimation is carried out in an efficient manner that requires only knowledge of the parameters of the base model, without the need of generating actual virtual samples.
|variables||base model (b)||reduced model (r)|
|index for HMM components|
|number of HMM components|
|number of HMM states|
|HMM state sequence|
|index for component of GMM|
|number of Gaussian components|
|HMM component (of H3M)|
|Gaussian component (of GMM)|
|H3M mixture weights|
|HMM initial state|
|HMM state transition matrix|
|HMM state sequence (b)|
|HMM state sequence (r)|
|HMM observation likelihood (r)|
|GMM emission likelihood (r)|
|Gaussian component likelihood (r)|
|HMM observation sequence (b)|
|GMM emission (b)|
|Gaussian component (b)|
|expected log-likelihood||lower bound||variational distribution|
We will always use and to index the components of the base model and the reduced model , respectively. To reduce clutter, we will also use the short-hand notation and to denote the th component of and the th component of , respectively. Hidden states of the HMMs are denoted with for the base model , and with for the reduced model .
The GMM emission models for each hidden state are denoted as and . We will always use and for indexing the individual Gaussian components of the GMM emissions of the base and reduced models, respectively. The individual Gaussian components are denoted as for the base model, and for the reduced model. Finally, we denote the parameters of th HMM component of the base mixture model as , and for the th HMM in the reduced mixture as .
When appearing in a probability distribution, the short-hand model notation (e.g., ) always implies conditioning on the model being active. For example, we will use as short-hand for , or as short-hand for . Furthermore, we will use as short-hand for the probability of the state sequence according to the base HMM component , i.e., , and likewise for the reduced HMM component.
Expectations will also use the short-hand model notation to imply conditioning on the model. In addition, expectations are assumed to be taken with respect to the output variable ( or ), unless otherwise specified. For example, we will use as short-hand for .
Table 1 summarizes the notation used in the derivation, including the variable names, model parameters, and short-hand notations for probability distributions and expectations. The bottom of Table 1 also summarizes the variational lower bound and variational distributions, which will be introduced subsequently.
3.2 Variational HEM algorithm
To learn the reduced model in (9), we consider a set of virtual samples, distributed according to the base model in (8), such that samples are drawn from the th component. We denote the set of virtual samples for the th component as , where , and the entire set of samples as . Note that, in this formulation, we are not considering virtual samples
for each base component, according to its joint distribution. The reason is that the hidden-state space of each base mixture component may have a different representation (e.g., the numbering of the hidden states may be permuted between the components). This mismatch will cause problems when the parameters of are computed from virtual samples of the hidden states of . Instead, we treat as “missing” information, and estimate them in the E-step. The log-likelihood of the virtual samples is
where, in order to obtain a consistent clustering, we assume the entirety of samples is assigned to the same component of the reduced model (Vasconcelos and Lippman, 1998).
, and uses the law of large numbers to turn the virtual samplesinto an expectation over the base model components . In this paper, we will start with a different objective function to derive the VHEM algorithm. To estimate , we will maximize the average log-likelihood of all possible virtual samples, weighted by their likelihood of being generated by , i.e., the expected log-likelihood of the virtual samples,
where the expectation is over the base model components . Maximizing (11) will eventually lead to the same estimate as maximizing (10), but allows us to strictly preserve the variational lower bound, which would otherwise be ruined when applying the law of large numbers to (10).
A general approach to deal with maximum likelihood estimation in the presence of hidden variables (which is the case for H3Ms) is the EM algorithm (Dempster et al., 1977). In the traditional formulation the EM algorithm is presented as an alternation between an expectation step (E-step) and a maximization step (M-step). In this work, we take a variational perspective (Neal and Hinton, 1998; Wainwright and Jordan, 2008; Csisz et al., 1984), which views each step as a maximization step. The variational E-step first obtains a family of lower bounds to the (expected) log-likelihood (i.e., to (11)), indexed by variational parameters, and then optimizes over the variational parameters to find the tightest bound. The corresponding M-step then maximizes the lower bound (with the variational parameters fixed) with respect to the model parameters. One advantage of the variational formulation is that it readily allows for useful extensions to the EM algorithm, such as replacing a difficult inference in the E-step with a variational approximation. In practice, this is achieved by restricting the maximization in the variational E-step to a smaller domain for which the lower bound is tractable.
3.2.1 Lower bound to an expected log-likelihood
Before proceeding with the derivation of VHEM for H3Ms, we first need to derive a lower-bound to an expected log-likelihood term, e.g., (11). In all generality, let be the observation and hidden variables of a probabilistic model, respectively, where is the distribution of the hidden variables, is the conditional likelihood of the observations, and is the observation likelihood. We can define a variational lower bound to the observation log-likelihood (Jordan et al., 1999; Jaakkola, 2000):
where is the posterior distribution of given observation , and is the Kullback-Leibler (KL) divergence between two distributions, and . We introduce a variational distribution , which approximates the posterior distribution, where and . When the variational distribution equals the true posterior, , then the KL divergence is zero, and hence the lower-bound reaches . When the true posterior cannot be computed, then typically is restricted to some set of approximate posterior distributions that are tractable, and the best lower-bound is obtained by maximizing over ,
Using the lower bound in (14), we can now derive a lower bound to an expected log-likelihood expression. Let be the expectation with respect to with some distribution . Since is non-negative, taking the expectation on both sides of yields,
where (16) follows from Jensen’s inequality (i.e., when is convex), and the convexity of the function. Hence, (17) is a variational lower bound on the expected log-likelihood, which depends on the family of variational distributions .
3.2.2 Variational lower bound
We now derive a lower bound to the expected log-likelihood cost function in (11). The derivation will proceed by successively applying the lower bound from (17) to each expected log-likelihood term that arises. This will result in a set of nested lower bounds. We first define the following three lower bounds:
The first lower bound, , is on the expected log-likelihood of an H3M with respect to an HMM . The second lower bound, , is on the expected log-likelihood of an HMM , averaged over observation sequences from a different HMM . Although the data log-likelihood can be computed exactly using the forward algorithm (Rabiner and Juang, 1993), calculating its expectation is not analytically tractable since an observation sequence from a HMM is essentially an observation from a mixture model.111For an observation sequence of length , an HMM with states can be considered as a mixture model with components. The third lower bound, , is on the expected log-likelihood of a GMM emission density with respect to another GMM . This lower bound does not depend on time, as we have assumed that the emission densities are time-invariant.
Looking at an individual term in (11), is the likelihood under a mixture of HMMs, as in (9), where the observation variable is and the hidden variable is (the assignment of to a component of ). Hence, introducing the variational distribution and applying (17), we have
where in (23) we use the fact that is a set of i.i.d. samples. In (24), is the observation log-likelihood of an HMM, which is essentially a mixture distribution, and hence its expectation cannot be calculated directly. Instead, we use the lower bound defined in (19), yielding
Next, we calculate the lower bound . Starting with (19), we first rewrite the expectation to explicitly marginalize over the HMM state sequence from ,
For the HMM likelihood , given by (6), the observation variable is and the hidden variable is the state sequence . We therefore introduce a variational distribution on the state sequence , which depends on a particular sequence from . Applying (17) to (27), we have
Finally, we derive the lower bound for (20). First, we rewrite the expectation with respect to to explicitly marginalize out the GMM hidden assignment variable ,
Note that is a GMM emission distribution as in (1). Hence, the observation variable is , and the hidden variable is . Therefore, we introduce the variational distribution , which is conditioned on the observation arising from the th component in , and apply (17),
where is the expected log-likelihood of the Gaussian distribution with respect to the Gaussian , which has a closed-form solution (see Section 3.3.1).
In summary, we have derived a variational lower bound to the expected log-likelihood of the virtual samples in (11),
which is composed of three nested lower bounds, corresponding to different model elements (the H3M, the component HMMs, and the emission GMMs),