1 Introduction
For many applications such as language modeling, it is useful to estimate Hidden Markov Models (HMMs) Rabiner (1989) in which observations drawn from a large vocabulary are generated from a much smaller hidden state. Standard HMM estimation techniques such as Gibbs sampling Geman & Geman (1984) and EM Baum et al. (1970); Dempster et al. (1977) methods, although very widely used, can require some effort to apply as they are often either slow or prone to get stuck in local optima. Hsu, Kakade and Zhang, in a path breaking paper, Hsu et al. (2009)
showed that HMMs can, in theory, be efficiently and accurately estimated using closed form calculations on trigrams of observations which have been projected onto a low dimensional space. Key to this approach is the use of singular value decomposition (SVD) on the matrix of covariances between adjacent observations to learn a matrix
that projects observations onto a space of the same dimension as the hidden state. Perhaps surprisingly, cooccurrence statistics on unigrams, pairs, and triples of observations are sufficient to accurately estimate a model equivalent to the original HMM.The true hidden state itself cannot, of course, be estimated (it is not observed), but one can estimate a linear transformation of the hidden state which contains sufficient information to give an optimal (in a sense to be made precise below) estimate of the probability of any sequence of observations being generated by the HMM
Hsu et al. (2009). The method of Hsu et al. (2009), and the extensions to it presented in this paper do not require any EM or Gibbs sampling, but only need an SVD on bigram observation counts. Since SVD is an efficient method guaranteed to return the correct result in a known number of steps, this is a major advantage over the iterative EM method.Hsu et al. Hsu et al. (2009) estimate a size matrix mapping between the the dimension observation space and a reduced dimension space of size
(the dimension of the hidden state space). They also need to estimate a tensor of size
. We provide an alternate formulation that replaces their tensor with one of size . Since the observation vocabulary, , is often much larger than the state space (), this provides significant reduction in model size, and hence, as we show below, in sample complexity.1.1 HMM setup and notation
We now introduce the notation and model used throughout our paper.
Consider an HMM where is an transition matrix on the hidden state, is a emission matrix giving the probabilities of hidden state emitting observation , and
is a vector of initial state probabilities in which
is the probability that . Jaeger Jaeger (2000) showed that the joint probability of a sequence of observations from this HMM is given by(1) 
where , is the unit vector of length with a single 1 in the th position and creates a matrix with the elements of the vector on its diagonal and zeros everywhere else.
is called an ’observation operator’, an idea dating back to multiplicity automata Schutzenbeegeb (1961); Carlyle & Paz (1971); Fliess (1974), and foundational in the theory of Observable Operator Models Jaeger (2000) and Predictive State Representations Littman et al. (2002). It is effectively a third order tensor, giving the distribution vector over states at time as a function of the state distribution vector at the current time and the current observation . Since depends on the hidden state, it is not observable, and hence cannot be directly estimated. But Hsu et al. (2009) showed that under certain conditions there exists a fully observable representation of the observable operator model. We now present a novel, fully reduced dimensional version of the observable representation.
1.2 The reduced dimension model
Define a random variable
, where has orthonormal columns and is a matrix mapping from observations to the reduced dimension space.We show below that
(2) 
holds where
and = , = , and
are easy to estimate using the method of moments.
^{1}^{1}1 Note that is a tensor. When multiplied by a vector , it produces a matrix. is linear in each of the three reduced dimension observations, , and .The matrix can be derived in several ways; Hsu et al. (2009) show that taking it to consist of the left singular vectors of corresponding to the largest singular values gives good properties, where is a matrix such that . The matrix and its properties will be discussed in more detail below.
Note that the model () will be estimated using only trigrams. Once a model has been learned, the probability of any observed sequence can be computed using equation 2, or the conditional probability of the next observation in a sequence can be computed by with recursive updates . The key term in the model is thus , which can be viewed as a tensor which takes as input the current observation and produces a matrix which maps (after normalization) from the current “hidden state estimate” to the next one . More precisely, is a linear function of the conditional expectation of the unobservable hidden state , which is the conditional probability vector over states at time .
1.3 Comparison to Hsu et al.
Hsu et al. Hsu et al. (2009) derive a similar model which we state here for comparison.
(3) 
where
and , as defined above, and are the frequencies of unigrams, bigrams, and trigrams in the observed data. Note that the subscripts on refer to their positions in trigrams of observations of the form .
Our major modeling change will be to replace in equation 3 with the lower dimensional tensor which depends on the reduced dimension projection instead of the unreduced . The models are easily related by the following lemma:
Lemma 1.
Assume the hidden state is of dimension and the rank of is also . Then:
(4)  
(5)  
(6) 
Where (5) requires to be invertible, and (6) requires .^{2}^{2}2 If the matrix is formed from the left singular vectors of corresponding to nonzero singular values, then it will satisfy this condition; See Hsu et al. (2009) lemma 2.
Proof sketch: Paper Jaeger (2000) showed (4), paper Hsu et al. (2009) showed (5), and (6) follows from a telescoping product of the following items:
where . More details are given in the supplemental material.
We improve Hsu et al. (2009) in three ways:

By reducing the size of the matrix that is estimated, we can achieve a lower sample complexity. In particular, our sample complexity does not depend on the size of the vocabulary nor on the frequency distribution of the vocabulary.

Since the conditions given in Hsu et al. (2009) are in terms of the transition matrix , they can not be checked. We instead focus on conditions that are checkable from the data.

Instead of using either a L1 error or a relative entropy error, we estimate the probabilities with relative accuracy. In other words, we show that is smaller than . This often is a more useful bound than just knowing is small. For example, it implies that computing conditional probabilities are off by less than . Both L1 and relative entropy errors can be computed from these bounds.
Our main theorem is weaker (as stated) than Hsu et al. (2009) in that we assume knowledge of rather than estimating it from a thin SVD of as they do. Since the accuracy lost when estimating is identical to that given in their paper, we will not discuss it here.
2 Theorems
The remainder of this paper presents one main theorem giving finite sample bounds for our reduced dimensional HMM estimation method. We first derive these in terms of properties of the first three moments of the reduced rank ’s, where is the random variable which takes on values of the reduced rank observation . We then convert those bounds to be in terms of the estimates, rather than the unobservable true values, of the model.
Our general strategy of estimating is via the method of moments. We have written in terms of , and . Since each of these three items can be written in terms of moments of the ’s we can plug in these moments to generate an estimate of . Thus we can define:
(7) 
where
where , and are the empirical estimates of the first, second and third moments of the ’s, namely , , , where indexes the different independent observations of our data. These moments estimate the mean vector
, the variance matrix
, and the skewness tensor
.Definition 1.
Define as the smallest element of , and . In other words,
where we define are the elements of the tensor . Likewise we define the empirical version as
Definition 2.
Define as the smallest singular value of , and the smallest singular value of .
The parameters and will be central to our analysis. Theorem 1 gives sample complexity bounds on relative error in estimating the probability of a sequence being generated from an HMM as a function of and , and the following lemmas reformulate those bounds into a more useful form in terms of their estimates. As quantified and proved below, both and must be “sufficiently large”; when they approach zero one loses the ability to accurately estimate the model.
If then will not be invertible, and one cannot infer the full information content of the hidden state from its associated observation, violating the condition required in Hsu et al. (2009) for (5) to hold. As becomes increasingly close to zero, it becomes increasingly hard to identify the hidden state, and more observations are required. Problems with small are intrinsically difficult. As has been pointed out by Hsu et al. (2009), some problems of estimating HMM’s are equivalent to the parity problem Terwijn (2002). So for such data, our algorithm need not perform well. For paritylike problems, is in fact zero, or close to it; Hence we end up with a useless bound for such hard problems.
If is close to zero, then even if the absolute error is small, the relative error can be arbitrarily large, as it involves dividing by the small true value of the parameter being estimated. Fortunately, as discussed below, since depends on the somewhat arbitrary matrix , one can shift away from zero by rotating and rescaling .
The proof of Theorem 1 is based on the idea that if we can estimate each term in , and accurately on an absolute scale (which will follow from basic central limit like theorems) then we can estimate them on a relative scale if is large. Hence, our main condition is that is bounded away from zero. In fact, if we take the usual statistical limit of having the sample size go to infinity and holding everything else constant, then:
with probability greater than when is large enough.
The following theorem gives the finite sample bound in terms of a sample complexity:
Theorem 1.
Let be generated by an state HMM. Suppose we are given a which has the property that and . Suppose we use equation (7) to estimate the probability based on independent triples. Then
(8) 
implies that
holds with probability at least .
Before proceeding with the proof of this theorem, we present and prove two corollaries that correspond directly to Theorems 6 and 7 of Hsu et al. (2009).
Corollary 1.
Assume Theorem 1 holds, then with probability at least ,
Corollary 2.
Assume Theorem 1 holds, then we have
Proof of Corollary 2: We have
and using the fact that for small enough we have and , plus the fact that we have
and using a similar fact from above that for small enough , , we get
Define to simplify the following statements. The proof proceeds in two steps. First lemma 2 converts the sample complexity bound into a more useful bounds on and . Then lemma 3 uses these bounds to show the theorem.
Lemma 2.
If
then
(9)  
(10) 
The proof is straightforward and given in the appendix.
Lemma 3.
Proof of Lemma 3: Our estimator (see equation 7) can be written as
We can rewrite this matrix product as
The components can be written as a scalar sum as:
So,
This is just a sum of a product of scalars. Lemma 4 (stated precisely and proven in the appendix) shows that accuracy of our estimates of all elements of , and are bounded by with probability .
Each term in the product can be rewritten as
and so our products can be thought of as, instead of a product of observed quantities, the product of the theoretical quantities times some relative error term. We can bound this relative error term for all entries, which will allow it to factor out nicely over all summands, giving us a relative error term for our overall probability.
Again thinking of as a generic item in , , or , then above has shown that and so the relative error of each term is bounded as
which will hold for all terms with probability . Since , we see that
Since our is a product of such terms, we see that
So by our bound on , we have
holds with probability .
The sample complexity bound in Theorem 1 relies on knowing unobserved parameters of the problem. To avoid this, we modify Lemma 3 to make it observable. In other words, we convert the assumptions of sample complexity into a checkable condition.
Corollary 3.
Let be generated by an state HMM. Suppose we are given a which has the property that . Suppose we use equation (7) to estimate the probability based on independent triples. Then with probability , if the following two inequalities hold
(11)  
(12) 
then
Proof:
Two technical lemma’s are needed for this corollary: Lemma 4 and Lemma 5. They are stated and proved in the supplemental material. Lemma 4 basically says that with high probability, each element of , and is estimated accurately. This is then used in Lemma 5 to show that and are estimated accurately.
Define the event to be the set where all the estimates given in Lemma 4 hold. This event happens with probability . On this event from Lemma 5 we know , so . Hence
thus on the set if (11) and (12) hold, then we see that (9) and (10) both hold and so we can apply Theorem 1. We can now use Theorem 1 to generate our claim on the accuracy of our probability bound. Technically, this proof as given only shows that our corollary holds with probability . But since the set where Theorem 1 fails is exactly , the probability lower bound is .
The advantage of the corollary is that the left hand sides of the two conditions are observable and the right hand sides involve known quantities. Hence one can tell if the condition is true or not–it doesn’t require knowing unobserved parameters. Note that the statement is of the form so interpretation must be done carefully.
3 Discussion: effect of and on accuracy
As discussed above, and have different effects on sample complexity. As approaches zero, model estimation becomes intrinsically hard; some problems do not admit easy estimation. In contrast, role of in sample complexity is more of an artifact. As approaches zero, the relative error can be arbitrarily large, even if the estimated model is good in the sense that the probability estimates are highly accurate.
The problem with can be addressed in a couple ways. In this section, we show that estimating a likelihood ratio rather than the sequence probabilities gives improves relative accuracy bounds. An alternate approach, which we do not pursue here, relies on the observation that depends on the (underspecified) matrix , and that one can thus search for a rotation and rescaling of the matrix that increases .
3.1 Likelihood instead of probabilities
Obscure words correspond to rows of the observation matrix with very small values throughout the row. If we were interested in only estimating the probability of such a word, then these are the easy words–basically guess zero or close to it. But, since we would like to estimate the relative probability accurately, these words are the most challenging. Further, such small probabilities would make computing conditional probabilities unstable since they would then become basically “0/0.” Further, since the values are all small in and in , they do not significantly improve our estimates of , and since they are essentially zeros. Both of these problems can be fixed by considering the problem of estimating a likelihood ratio instead of a probability. So define:
The could be taken to be the marginal probability of observing . It does not, in fact, have to be a probability–just any weighting which helps condition our matrix and our tensor . We can then use a modified version of and in all our existing lemma’s and theorems. The precise statement of these modified versions are in the appendix. What changes is that now is much larger and hence our relative accuracy will be greatly improved. This fact is shown in the empirical section.
3.2 Empirical estimates of and
Figure 2 shows estimates of and
, using the Internet as the corpus as summarized in the Google ngram dataset
^{3}^{3}3http://googleresearch.blogspot.com/2006/08/allourngramarebelongtoyou.html, which contains frequencies of the most frequent 1grams to 5grams occurring on the web. Details on how the figures were generated can be found in the supplementary material. As the size, , of the reduced dimension space is increased, smaller and smaller singular values, , occur in the model, and the value of the smallest parameter in the model decreases. Empirically, both fall off with a power of , giving straight lines on the loglog plot. This data indicates a large sample complexity, the reduction of which will be a focus of future work.4 Prior work and conclusion
Recently, ideas have been proposed that push spectral learning of HMMs in several different directions. Boots et al. (2010) provides a kernelized spectral algorithm that allows for learning an HMM in any domain in which there exists a kernel. This allows for learning of an HMM with continuous output without the need for discretization. Boots & Gordon (2011) provides an analogous algorithm that enables online learning for Transformed Predictive State Representations, and hence the setup in Hsu et al. (2009). Finally, Siddiqi et al. (2009) directly extends Hsu et al. (2009) by relaxing the requirement that the transition matrix be of rank , but instead allows rank , creating a ReducedRank HMM (RRHMM), and then applying the algorithm from Hsu et al. (2009) to learn the observable representation of this RRHMM.
All of the above extensions preserve the basic structure of the tensor , which updates the hidden state estimate (or more precisely, a linear transformation of it) based on the most recent observation . In this paper, we replace with a tensor , which updates the hidden state estimate using a low dimensional projection of the observation . contains only terms, in contrast to the terms contained in . Reducing the number of parameters to be estimated has both computational and statistical efficiency advantages, but requires some changes to the proofs in Hsu et al. (2009). While making these changes, we also give proofs that are simpler, that only use conditions that are checkable from the data, and that bound the relative, rather than absolute error.
This paper focused on the simplest case, in which HMMs have discrete states and discrete observations and in which the observations are reduced to the same sized space as the hidden state, but our approach can be generalized in all of the ways described above.
We have presented an improved spectral method for estimating HMMs. By using a tensor that depends on the reduced rank instead of the full observed in the tensor used by Hsu et al. (2009), we reduced the number of parameters to be estimated by a factor of the ratio of the size of the vocabulary divided by the size of the hidden state. This reduction has corresponding benefits in the sample complexity. We also showed that the sample complexity depends critically upon , the smallest singular value of the covariance matrix . As becomes small, the HMM becomes increasingly hard to identify, and increasing numbers of samples are needed.
References

Baum et al. (1970)
Baum, L.E., Petrie, T., Soules, G., and Weiss, N.
A maximization technique occurring in the statistical analysis of probabilistic functions of markov chains.
The annals of mathematical statistics, 41(1):164–171, 1970.  Boots & Gordon (2011) Boots, B. and Gordon, G.J. An online spectral learning algorithm for partially observable nonlinear dynamical systems. AAAI, 2011.

Boots et al. (2010)
Boots, B., Siddiqi, S.M., Gordon, G., and Smola, A.
Hilbert space embeddings of hidden markov models.
Proc. 27th Intl. Conf. on Machine Learning (ICML)
, 2010.  Carlyle & Paz (1971) Carlyle, J.W. and Paz, A. Realizations by stochastic finite automata. Journal of Computer and System Sciences, 5(1):26–40, 1971.
 Dempster et al. (1977) Dempster, A.P., Laird, N.M., and Rubin, D.B. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.
 Fliess (1974) Fliess, M. Matrices de hankel. J. Math. Pures Appl, 53(197222):423, 1974.
 Geman & Geman (1984) Geman, Stuart and Geman, Donald. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, PAMI6(6):721 –741, nov. 1984. ISSN 01628828. doi: 10.1109/TPAMI.1984.4767596.
 Hoeffding (1963) Hoeffding, Wassily. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):pp. 13–30, 1963. ISSN 01621459. URL http://www.jstor.org/stable/2282952.
 Hsu et al. (2009) Hsu, Daniel, Kakade, Sham M., and Zhang, Tong. A spectral algorithm for learning hidden markov models. COLT, 2009.
 Jaeger (2000) Jaeger, Herbert. Observable operator models for discrete stochastic time series. Neural Computation, 12(6), 2000.
 Littman et al. (2002) Littman, M.L., Sutton, R.S., and Singh, S. Predictive representations of state. Advances in neural information processing systems, 2:1555–1562, 2002.
 Rabiner (1989) Rabiner, L.R. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
 Schutzenbeegeb (1961) Schutzenbeegeb, MP. On the definition of a family of automata. Information and control, 4(23), 1961.
 Siddiqi et al. (2009) Siddiqi, S.M., Boots, B., and Gordon, G.J. Reducedrank hidden markov models. Arxiv preprint arXiv:0910.0902, 2009.
 Terwijn (2002) Terwijn, S. On the learnability of hidden markov models. Grammatical Inference: Algorithms and Applications, pp. 344–348, 2002.
Appendix Supplemental Material
Lemma (Restatement of Lemma 1).
Proof:
As pointed out in the main text, Jaeger (2000) showed (4), and Hsu et al. (2009) showed (5). To show (6), we will first write the characteristics , and in terms of the theoretical matrices, , , , and :
By definition, we have
likewise,
For ,
Note that is a projection operator and since its range is the same as that of we have . So, if , then:
Thus (6) follows from a telescoping product.
Proof of lemma 2:
The proof is simply algebraic manipulation. We have
which implies that
Comments
There are no comments yet.