Hidden Markov models (HMMs) are widely used in machine learning when the data samples are timedependent, for example in speech recognition, language processing, and video analysis. The graphical model of a HMM is shown in Figure 1. HMM models a (time-dependent) sequence of data
as indirect observations of an underlying Markov chainwhich is not available to us. Homogeneous HMMs are parsimonious models, in the sense that they are fully characterized by the transition probability and the emission probability even though the size of the given data can be very large.
Consider a homogeneous HMM such that:
a latent variable can take possible outcomes ;
an ambient variable can take possible outcomes .
Given both and , the complete joint probability factors, and we can easily estimate the transition probability and the emission probability .
, but assuming we know the underlying transition and emission probabilities, we can calculate the observation likelihood using the forward algorithm, estimate the most likely hidden sequence using the Viterbi algorithm, and compute the posterior probability of the hidden states using the forward-backward algorithm.
The most natural problem setting, however, is when neither the hidden state sequence nor the underlying probabilities are known to us—we only have access to a sequence of observations, and our job is to reveal the HMM structure, characterized by the transition matrix and the emission probability from the set of observations .
1.1 Related work
The traditional way of learning a HMM from is via expectation-maximization (EM) (Rabiner and Juang, 1986), in which the expectation step is performed by calling the forward-backward algorithm. This specific instance of EM is also called the Baum-Welch algorithm (Baum et al., 1970; Ghahramani, 2001). However, the complexity of Baum-Welch is prohibitive when is relatively large—the complexity of the forward-backward algorithm is , but EM converges slowly, so the forward-backward algorithm must be called many times. This is a critical issue, because a HMM can only be learned with high accuracy when the number of observation samples is large enough.
One way of designing scalable algorithms for learning HMMs is to work with sufficient statistics – a summary of the given observation sequence, whose size does not grow with . Throughout this paper we assume that the HMM process is stationary (time-invariant), which is true almost surely if the underlying Markov process is ergodic and the process has been going on for a reasonable amount of time. With large enough, we can accurately estimate the co-occurrence probability between two consecutive emissions . According to the graphical model shown in Figure 1, it is easy to see that given the value of , is conditionally independent of all the other variables, leading to the factorization
Let , , and , with their elements defined as
Then, equations (1.1) can be written compactly as
Noticing that is a nonnegative matrix tri-factorization with a number of inconsequential constraints for and to properly represent probabilities, Vanluyten et al. (2008); Lakshminarayanan and Raich (2010); Cybenko and Crespi (2011) proposed using nonnegative matrix factorization (NMF) to estimate the HMM probabilities. However, NMF-based methods have a serious shortcoming in this context: the tri-factorization (2) is in general not unique, because it is fairly easy to find a nonsingular matrix such that both and , and then and are equally good solutions in terms of reconstructing the co-occurrence matrix . When we use and to perform HMM inference, such as estimating hidden states or predicting new emissions, the two models often yield completely different results, unless is a permutation matrix.
A number of works propose to use tensor methods to overcome the identifiability issue. Instead of working with the pairwise co-occurrence probabilities, they start by estimating the joint probabilities of three consecutive observations
. Noticing that these three random variables are conditionally independent given, the triple-occurrence probability factors into
which admits a tensor canonical polyadic decomposition (CPD) model (Hsu et al., 2009; Anandkumar et al., 2012, 2014). Assuming , the CPD is essentially unique if two of the three factor matrices have full column rank, and the other one is not rank one (Harshman, 1970); in the context of HMMs, this is equivalent to assuming and both have linearly independent columns, which is a relatively mild condition. The CPD is known to be unique under much more relaxed conditions (Sidiropoulos et al., ), but in order to uniquely retrieve the transition probability using the relationship
is actually the best we can achieve using triple-occurrences without making further assumptions. 111In the appendix, we prove that if the emission probability is generic and the transition probability is sparse, the HMM can be uniquely identified from triple-occurrence probability for using the latest tensor identifiability result (Chiantini and Ottaviani, 2012). A salient feature in this case is that if the triple-occurrence probability is exactly given (meaning the rank of the triple-occurrence tensor is indeed smaller than ), the CPD can be efficiently calculated using generalized eigendecomposition and related algebraic methods (Sanchez and Kowalski, 1990; Leurgans et al., 1993; De Lathauwer et al., 2004). These methods do not work well, however, when the low-rank tensor is perturbed – e.g., due to insufficient mixing / sample averaging of the triple occurence probabilities.
It is also possible to handle cases where . The key observation is that, given , is conditionally independent of and . Then, grouping
into a single categorical variable takingpossible outcomes, and into another one, we can construct a much bigger tensor of size , and then uniquely identify the underlying HMM structure with as long as certain linear independence requirements are satisfied for the conditional distribution of the grouped variables (Allman et al., 2009; Bhaskara et al., 2014; Huang et al., 2016b; Sharan et al., 2017). It is intuitively clear that for fixed , we need a much larger realization length in order to accurately estimate -occurrence probabilities as grows, which is the price we need to pay for learning a HMM with a larger number of hidden states.
1.2 This paper
The focus of this paper is on cases where , and is large enough to obtain accurate estimate of , but not large enough to accurately estimate triple or higher-order occurrence probabilities. We prove that it is actually possible to recover the latent structure of an HMM only from pairwise co-occurrence probabilities , provided that the underlying emission probability is sufficiently scattered. Compared to the existing NMF-based HMM learning approaches, our formulation employs a different (determinant-based) criterion to ensure identifiability of the HMM parameters. Our matrix factorization approach resolves cases that cannot be handled by tensor methods, namely when is insufficient to estimate third-order probabilities, under an additional condition that is quite mild: that the emission probability matrix must be sufficiently scattered, rather than simply full column-rank.
We apply our method to hidden topic Markov modeling (HTMM) (Gruber et al., 2007), in which case the number of hidden states (topics) is indeed much smaller than the number of ambient states (words). HTMM goes beyond the simple and widely used bag-of-words model by assuming that (ordered) words in a document are emitted from a hidden topic sequence that evolves according to a Markov model. We show improved performance on real data when using this simple and intuitive model to take word ordering into account when learning topics, which also benefits from our identifiability guaranteed matrix factorization method.
As an illustrative example, we showcase the inferred topic of each word in a news article (removing stop words) in Figure 2, taken from the Reuters21578 data set obtained at (Mimaroglu, 2007). As we can see, HTMM gets much more consistent and smooth inferred topics compared to that obtained from a bag-of-words model (cf. appendix for details). This result agrees with human understanding.
2 Second-order vs. Third-order Learning
We start by arguing that for the same observation data , the estimate of the pairwise co-occurrence probability is always more accurate than that of the triple co-occurrence probability .
Let us first explicitly describe the estimator we use for these probabilities. For each observation
, we define a coordinate vector, and if . The natural estimator for the pairwise co-occurrence probability matrix is
and similarly for the triple co-occurrence probability
where denotes vector outer-product. 222In some literature is written as the Kronecker product . Strictly speaking, the Kronecker product of three vectors is a very long vector, not a three-way array. For this reason, we chose to use instead of .
The first observation is that both and
are unbiased estimators: Obviouslyand likewise for the triple-occurrences, and taking their averages does not change the expectation. However, the individual terms in the summation are not independent of each other, making it hard to determine how fast estimates converge to their expectation. The state-of-the-art concentration result for HMMs (Kontorovich, 2006) states that for any 1-Lipschitz function
where is a constant that only depends on the specific HMM structure but not on the function (cf. (Kontorovich, 2006) for details). Taking as any entry in or , we can check that indeed it is 1-Lipschitz, meaning as goes to infinity, both estimators converge to their expectation with negligible fluctuations.
We now prove that for a given set of observations , is always going to be more accurate than
. Since both of them represent probabilities, we use two common metrics to measure the differences between the estimators and their expectations, the Kullback-Leibler divergenceand the total-variation difference .
Let and be obtained from the same set of observations , we have that
The proof of Proposition 1 is relegated to the appendix.
3 Identifiability of HMMs from Pairwise Co-occurrence Probabilities
The arguments made in the previous section motivate going back to matrix factorization methods for learning a HMM when the realization length is not large enough to obtain accurate estimates of triple co-occurrence probabilities. As we have explained in §1.1, the co-occurrence probability matrix admits a nonnegative matrix tri-factorization model (2). There are a number of additional equality constraints. Columns of represent conditional distributions, so . Matrix
represents the joint distribution between two consecutive Markovian variables, therefore. Furthermore, we have that and represent and respectively, and since we assume that the Markov chain is stationary, they are the same, i.e., . Notice that this does not imply that is symmetric, and in fact it is often not symmetric.
Huang et al. (2016a) considered a factorization model similar to (2) in a different context, and showed that identifiability can be achieved under a reasonable assumption called sufficiently scattered, defined as follows.
Definition 1 (sufficiently scattered).
Let denote the polyhedral cone , and denote the elliptical cone . Matrix is called sufficiently scattered if it satisfies that: (i) , and (ii) , where denotes the boundary of , .
The sufficiently scattered condition was first proposed in (Huang et al., 2014) to establish uniqueness conditions for the widely used nonnegative matrix factorization (NMF). For the NMF model , if both and are sufficiently scattered, then the nonnegative decomposition is unique up to column permutation and scaling. Follow up work strengthened and extended the identifiability results based on this geometry inspired condition (Fu et al., 2015; Huang et al., 2016a; Fu et al., 2018). A similar tri-factorization model was considered in (Huang et al., 2016a) in the context of bag-of-words topic modeling, and it was shown that among all feasible solutions of (2), if we find one that minimizes , then it recovers the ground-truth latent factors and , assuming the ground-truth is sufficiently scattered. In our present context, we therefore propose the following problem formulation:
Regarding Problem (5), we have the following identifiability result.
One may notice that in (Huang et al., 2016a), there are no constraints on the core matrix as we do in (5c). In terms of identifiability, it is easy to see that if the ground-truth satisfies (5c), solving (5) even without (5c) will produce a solution that satisfies (5c), thanks to uniqueness. In practice when we are given a less accurate , such “redundant” information will help us improve the estimation error, but that goes beyond identifiability consederations.
The proof of Theorem 1 is referred to (Huang et al., 2016a). Here we provide some insights on this geometry-inspired sufficiently scattered condition, and discuss why it is a reasonable (and thus practical) assumption. The notation comes from the convention in convex analysis that it is the dual cone of the conical hull of the row vectors of , i.e., . Similarly, we can derive that the dual cone of is . A useful property of the dual cone is that for two convex cones and , iff . Therefore, the first requirement of sufficiently scattered in Definition 1 equivalently means
We give a geometric illustration of the sufficiently scattered condition in Figure 2(b) for , and we focus on the 2-dimensional plane . The intersection between this plane and the nonnegative orthant is the probability simplex, which is the triangle in Figure 2(b). The outer circle represents , and the inner circle represents , again intersecting with the plane, respectively. The rows of are scaled to sum up to one, and they are represented by black dots in Figure 2(b). Their conical hull is represented by the shaded region. The polygon with dashed lines represents the dual of , which is indeed a subset of , and touches the boundary of only at the coordinate vectors.
Figure 2(a) shows a special case of sufficiently scattered called separability, which first appeared in (Donoho and Stodden, 2004) also to establish uniqueness of NMF. In this case, all the coordinate vectors appear in rows of , therefore equals the nonnegative orthant. It makes sense that this condition makes the identification problem easier, but it is also a very restrictive assumption. The sufficiently scattered condition, on the other hand, only requires that the shaded region contains the inner circle, as shown in Figure 2(b). Intuitively this requires that the rows of be “well scattered” in the probability simplex, but not to the extent of “separable”. Separability-based HMM identification has been considered in (Barlier et al., 2015; Glaude et al., 2015). However, the way they construct second-order statistics is very different from ours. Figure 2(c) shows a case where is not sufficiently scattered, and it also happens to be a case where is not identifiable.
As we can see, the elliptical cone is tangent to all the facets of the nonnegative orthant. As a result, for to be sufficiently scattered, it is necessary that there are enough rows of lie on the boundary of the nonnegative orthant, i.e., is relatively sparse. Specifically, if is sufficiently scattered, then each column of contains at least zeros (Huang et al., 2014). This is a very important insight, as exactly checking whether a matrix is sufficiently scattered may be computationally hard. In the present paper we further show the following result.
The ratio between the volume of the hyperball obtained by intersecting and and the probability simplex is
The proof is given in the appendix. As grows larger, the volume ratio (6) goes to zero at a super-exponential decay rate. This implies that the volume of the inner sphere quickly becomes negligible compared to the volume of the probability simplex, as becomes moderately large. The take home point is that, for a practical choice of , say , as long as satisfies that each column contains at least zeros, and the positions of the zeros appear relatively random, it is very likely that it is sufficiently scattered, and thus can be uniquely recovered via solving (5).
Our identifiability analysis based on the sufficiently scattered condition poses an interesting non-convex optimization problem (5). First of all, the given co-occurrence probability may not be exact, therefore it may not be a good idea to put (5b) as a hard constraint. For algorithm design, we propose the following modification to problem (5).
In the loss function of (4), the first term is the Kullback-Leibler distance between the empirical probability and the parameterized version (ignoring a constant), and the second term is our identifiability-driven regularization. We need to tune the parameter to yield good estimation results. However, intuitively we should use a with a relatively small value. Suppose is sufficiently accurate, then the priority is to minimize the difference between and ; when there exist equally good fits, then the second term comes into play and helps us pick out a solution that is sufficiently scattered.
Noticing that the constraints of (4) are all convex, but not the loss function, we propose to design an iterative algorithm to solve (4) using successive convex approximation. At iteration when the updates are and , we define
, which defines a probability distribution for fixedand . Using Jensen’s inequality (Jensen, 1906), we have that
where we define . Problem (4) decouples with respect to and , so we can work out their updates individually.
The update of admits a simple closed form solution, which can be derived via checking the KKT conditions. We denote the dual variable corresponding to as . Setting the gradient of the Lagrangian with respect to equal to zero, we have
and should be chosen so that the constraint is satisfied, which amounts to a simple re-scaling.
The update of is not as simple as a closed form expression, but it can still be obtained very efficiently. Noticing that the nonnegativity constraint is implicitly implied by the individual functions in the loss function, we propose to solve it using Newton’s method with equality constraints (Boyd and Vandenberghe, 2004, §10.2). Although Newton’s method requires solving a linear system of equations with number of variables in each iteration, there is special structure we can exploit to reduce the per-iteration complexity down to : The Hessian of the loss function of (4) is diagonal, and the linear equality constraints are highly structured; using block elimination (Boyd and Vandenberghe, 2004, §10.4.2), we ultimately only need to solve a positive definite linear system with variables. Together with the quadratic convergence rate of Newton’s method, the complexity of updating is , where is the desired accuracy for the update. Noticing that the complexity of a naive implementation of Newton’s method would be , the difference is big for moderately large . The in-line implementation of this tailored Newton’s method ThetaUpdate and the detailed derivation can be found in the appendix.
The entire proposed algorithm to solve Problem (4) is summarized in Algorithm 1. Notice that there is an additional line-search step to ensure decrease of the loss function. The constraint set of (4) is convex, so the line-search step will not incur infeasibility. Computationally, we find that any operation that involves can be carried out succinctly by defining the intermediate matrix , where “” denotes element-wise division between two matrices of the same size. The per-iteration complexity of Algorithm 1 is completely dominated by the operations that involve computing with , notably comparing with that of Theta-Update. In terms of initialization, which is important since we are optimizing a non-convex problem, we propose to use the method by Huang et al. (2016a) to obtain an initialization for ; for , it is best if we start with a feasible point (so that the Newton’s iterates will remain feasible), and a simple choice is scaling the matrix to sum up to one. Finally, we show that this algorithm converges to a stationary point of Problem (4), with proof relegated to the appendix based on (Razaviyayn et al., 2013).
5 Validation on Synthetic Data
In this section we validate the identifiability performance on synthetic data. In this case, the underlying transition and emission probabilities are generated synthetically, and we compare them with the estimated ones to evaluate performance. The simulations are conducted in MATLAB using the HMM toolbox, which includes functions to generate observation sequences given transition and emission probabilities, as well as an implementation of the Baum-Welch algorithm (Baum et al., 1970), i.e., the EM algorithm, to estimate the transition and emission probabilities using the observations. Unfortunately, even for some moderate problem sizes we considered, the istreamlined Matlab mplementation of the Baum-Welch algorithm was not able to execute within reasonable amount of time, so its performance is not included here. For the baselines, we compare with the plain NMF approach using multiplicative update (Vanluyten et al., 2008) and the tensor CPD approach (Sharan et al., 2017) using simultaneous diagonalization with Tensorlab (Vervliet et al., 2016). Since we work with empirical distributions instead of exact probabilities, the result of the simultaneous diagonalization is not going to be optimal. We therefore use it to initialize the EM algorithm for fitting a nonnegative tensor factorization with KL divergence loss (Shashanka et al., 2008) for refinement.
We focus on the cases when the number of hidden states is smaller than the number observed states . As we explained in the introduction, even for this seemingly easier case, it is not known that we can guarantee unique recovery of the HMM parameters just from the pair-wise co-occurrence probability. What is known is that the tensor CPD approach is able to guarantee identifiability given exact triple-occurrence probability. We will demonstrate in this section that it is much harder to obtain accurate triple-occurrence probability comparing with the co-occurrence probability. As a result, if the sufficiently scattered assumption holds for the emission probability, the estimated parameters obtained from our method are always more accurate than those obtained from tensor CPD.
Fixing and , we let the number of HMM realizations go from to , and compare the estimation error for the transition matrix and emission matrix by the aforementioned methods. We show the total variation distance between the ground truth probabilities and and their estimations and using various methods. The result is shown in Figure 4. As we can see, the proposed method indeed works best, obtaining almost perfect recovery when sample size is above . The CPD based method does not work as well since it cannot obtain accurate estimates of the third-order statistics that it needs. Initilaized by CPD, EM improves from CPD but the performance is still far away from the proposed method. NMF is also not working well since it does not have identifiability in this case.
6 Application: Hidden Topic Markov Model
Analyzing text data is one of the core application domains of machine learning. There are two prevailing approaches to model text data. The classical bag-of-words model assumes that each word is independently drawn from certain multinomial distributions. These distributions are different across documents, but can be efficiently summarized by a small number of topics, again mathematically modeled as distributions over words; this task is widely known as topic modeling (Hofmann, 2001; Blei et al., 2003). However, it is obvious that the bag-of-words representation is oversimplified. The -gram model, on the other hand, assumes that words are conditionally dependent up to a window-length of . This seems to be a much more realistic model, although the choice of is totally unclear, and is often dictated by memory and computatioal limitations in practice – since the size of the joint distribution grows exponentially with . What is more, it is somewhat difficult to extract “topics” from this model, despite some preliminary attempts (Wallach, 2006; Wang et al., 2007).
We propose to model a document as the realization of a HMM, in which the topics are hidden states emitting words, and the states are evolving according to a Markov chain, hence the name hidden topic Markov model (HTMM). For a set of documents, this means we are working with a collection of HMMs. Similar to other topic modeling works, we assume that the topic matrix is shared among all documents, meaning all the given HMMs share the same emission probability. For the bag-of-words model, each document has a specific topic distribution , whereas for our model, each document has its own topic transition probability ; as per our previous discussion, the row-sum and column-sum of are the same, which are also the topic probability for the specific document. The difference is the Markovian assumption on the topics rather than the over-simplifying independence assumption.
We can see some immediate advantages for the HTMM. Since the Markovian assumption is only imposed on the topics, which are not exposed to us, the observations (words) are not independent from each other, which agrees with our intuition. On the other hand, we now understand that although word dependencies exist for a wide neighborhood, we only need to work with pair-wise co-occurrence probabilities, or 2-grams. This releases us from picking a window length in the -gram model, while maintaining dependencies between words well beyond a neighborhood of words. It also includes the bag-of-words assumption as a special case: If the topics of the words are indeed independent, this just means that the transition probability has the special form . The closest work to ours is by Gruber et al. (2007), which is also termed hidden topic Markov model. However, they make a simplifying assumption that the transition probability takes the form , meaning the topic of the word is either the same as the previous one, or independently drawn from . Both models are special cases of our general HTMM.
In order to learn the shared topic matrix , we can use the co-occurrence statistics for the entire corpus: Denote the co-occurrence statistics for the -th document as , then ; consequently
which is an unbiased estimator for
where is the length of the -th document, and is conceptually a weighted average of all the topic-transition matrices. Then we can apply Algorithm 1 to learn the topic matrix.
We illustrate the performance of our HTMM by comparing it to three popular bag-of-words topic modeling approaches: pLSA (Hofmann, 2001), LDA (Blei et al., 2003), and FastAnchor (Arora et al., 2013), which guarantees identifiability if every topic has a characteristic anchor word. Our HTMM model guarantees identifiability if the topic matrix is sufficiently scattered, which is a more relaxed condition than the anchor word one. On the Reuters21578 data set obtained at (Mimaroglu, 2007), we use the raw document to construct the word co-occurrence statistics, as well as bag-of-words representations for each document for the baseline algorithms. We use the version in which the stop-words have been removed, which makes the HTMM model more plausible since any syntactic dependencies have been removed, leaving only semantic dependencies. The vocabulary size of Reuters21578 is around , making any method relying on triple-occurrences impossible to implement, and that is why tensor-based methods are not compared here.
Because of page limitations, we only show the quality of the topics learned by various methods in terms of coherence. Simply put, a higher coherence means more meaningful topics, and the concrete definition can be found in (Arora et al., 2013). In Figure 5, we can see that for different number of topics we tried on the entire dataset, HTMM consistently produces topics with the highest coherence. Additional evaluations can be found in the appendix.
We presented an algorithm for learning hidden Markov models in an unsupervised setting, i.e., using only a sequence of observations. Our approach is guaranteed to uniquely recover the ground-truth HMM structure using only pairwise co-occurrence probabilities of the observations, under the assumption that the emission probability is sufficiently scattered. Unlike EM, the complexity of the proposed algorithm does not grow with the length of the observation sequence. Compared to tensor-based methods for HMM learning, our approach only requires reliable estimates of pairwise co-occurrence probabilities, which are easier to obtain. We applied our method to topic modeling, assuming each document is a realization of a HMM rather than a simpler bag-of-words model, and obtained improved topic coherence results. We refer the reader to the appendix for detailed proofs of the propositions and additional experimental results.
Appendix A Proof of Proposition 1
For categorical probabilities and , their Kullback-Leiber divergence is defined as
and their total variation distance is defined as
The key to prove Proposition 1 is the fact that the cooccurrence probability can be obtained by marginalizing in the triple-occurrence probability , i.e.,
Similarly, this holds for the cumulative estimates described in §2 of the main paper as well,
Using the log sum inequality, we have that
Summing both sides over and , we result in
Using Hölder’s inequality with -norm and -norm, we have that
Summing both sides over and and then dividing by 2, we obtain
Appendix B Proof of Proposition 2
The volume of a hyper-ball in with radius is
The elliptical cone
intersecting with the hyperplaneis a hyperball in with radius . Therefore, the volume of the inner-ball is
The nonnegative orthan intersecting with is a regular simplex in with side length . Its volume is
Their ratio is
This function of volume ratio is plotted in Figure 6. As we can see, as increases, the volume ratio indeed goes to zero at a super-exponential rate.
Appendix C Derivation of ThetaUpdate
It is described in (Boyd and Vandenberghe, 2004, §10.2) that for solving a convex equality constrained problem
using the Newton’s method, we start at a feasible point , and the iterative update takes the form , where the Newton direction is calculated from solving the KKT system
Assuming and has full row rank, then the KKT system can be solved via elimination, as described in (Boyd and Vandenberghe, 2004, Algorithm 10.3). Suppose , if is diagonal, the cost of calculating is dominated by forming and inverting the matrix with being diagonal.
Now we follow the steps of (Boyd and Vandenberghe, 2004, Algorithm 10.3) to derive explicit Newton iterates for solving (11). First, we re-write the part of (11) that involve here:
Let , then equality constraint has the form where
Matrix does not have full row rank, because the last row of is implied by the rest. Therefore, we can discard the last equality constraint. We will keep it when calculating matrix multiplications for simpler expression, and discard the corresponding entry or column/row for other operations.
Obviously has the form
which costs flops. For a slightly more complicated multiplication
which also costs flops to compute. For ,
At point , the negative gradient is where
and the inverse of the Hessian where
and then delete the last column and row of , and
and then delete the last entry of . We can first solve for by
Then we append a zero at the end of and define
The Newton direction can then be obtained via
In matrix form, it is equivalent to
The in-line implementation of ThetaUpdate is given here.