# Spectral dimensionality reduction for HMMs

Hidden Markov Models (HMMs) can be accurately approximated using co-occurrence frequencies of pairs and triples of observations by using a fast spectral method in contrast to the usual slow methods like EM or Gibbs sampling. We provide a new spectral method which significantly reduces the number of model parameters that need to be estimated, and generates a sample complexity that does not depend on the size of the observation vocabulary. We present an elementary proof giving bounds on the relative accuracy of probability estimates from our model. (Correlaries show our bounds can be weakened to provide either L1 bounds or KL bounds which provide easier direct comparisons to previous work.) Our theorem uses conditions that are checkable from the data, instead of putting conditions on the unobservable Markov transition matrix.

## Authors

• 11 publications
• 2 publications
• 8 publications
06/04/2015

### Spectral Learning of Large Structured HMMs for Comparative Epigenomics

We develop a latent variable model and an efficient spectral algorithm m...
06/24/2021

### Fundamental limits for learning hidden Markov model parameters

We study the frontier between learnable and unlearnable hidden Markov mo...
12/10/2013

### Guaranteed Model Order Estimation and Sample Complexity Bounds for LDA

The question of how to determine the number of independent latent factor...
02/19/2018

### Learning Hidden Markov Models from Pairwise Co-occurrences with Applications to Topic Modeling

We present a new algorithm for identifying the transition and emission p...
09/28/2019

### Emirati-Accented Speaker Identification in Stressful Talking Conditions

This research is dedicated to improving text-independent Emirati-accente...
02/15/2021

### Online learning of Riemannian hidden Markov models in homogeneous Hadamard spaces

Hidden Markov models with observations in a Euclidean space play an impo...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

For many applications such as language modeling, it is useful to estimate Hidden Markov Models (HMMs) Rabiner (1989) in which observations drawn from a large vocabulary are generated from a much smaller hidden state. Standard HMM estimation techniques such as Gibbs sampling Geman & Geman (1984) and EM Baum et al. (1970); Dempster et al. (1977) methods, although very widely used, can require some effort to apply as they are often either slow or prone to get stuck in local optima. Hsu, Kakade and Zhang, in a path breaking paper, Hsu et al. (2009)

showed that HMMs can, in theory, be efficiently and accurately estimated using closed form calculations on trigrams of observations which have been projected onto a low dimensional space. Key to this approach is the use of singular value decomposition (SVD) on the matrix of covariances between adjacent observations to learn a matrix

that projects observations onto a space of the same dimension as the hidden state. Perhaps surprisingly, co-occurrence statistics on unigrams, pairs, and triples of observations are sufficient to accurately estimate a model equivalent to the original HMM.

The true hidden state itself cannot, of course, be estimated (it is not observed), but one can estimate a linear transformation of the hidden state which contains sufficient information to give an optimal (in a sense to be made precise below) estimate of the probability of any sequence of observations being generated by the HMM

Hsu et al. (2009). The method of Hsu et al. (2009), and the extensions to it presented in this paper do not require any EM or Gibbs sampling, but only need an SVD on bigram observation counts. Since SVD is an efficient method guaranteed to return the correct result in a known number of steps, this is a major advantage over the iterative EM method.

Hsu et al. Hsu et al. (2009) estimate a size matrix mapping between the the dimension observation space and a reduced dimension space of size

(the dimension of the hidden state space). They also need to estimate a tensor of size

. We provide an alternate formulation that replaces their tensor with one of size . Since the observation vocabulary, , is often much larger than the state space (), this provides significant reduction in model size, and hence, as we show below, in sample complexity.

### 1.1 HMM set-up and notation

We now introduce the notation and model used throughout our paper.

Consider an HMM where is an transition matrix on the hidden state, is a emission matrix giving the probabilities of hidden state emitting observation , and

is a vector of initial state probabilities in which

is the probability that . Jaeger Jaeger (2000) showed that the joint probability of a sequence of observations from this HMM is given by

 Pr(x1,x2,…,xt)=1⊤AxtAxt−1⋯Ax1π, (1)

where , is the unit vector of length with a single 1 in the th position and creates a matrix with the elements of the vector on its diagonal and zeros everywhere else.

is called an ’observation operator’, an idea dating back to multiplicity automata Schutzenbeegeb (1961); Carlyle & Paz (1971); Fliess (1974), and foundational in the theory of Observable Operator Models Jaeger (2000) and Predictive State Representations Littman et al. (2002). It is effectively a third order tensor, giving the distribution vector over states at time as a function of the state distribution vector at the current time and the current observation . Since depends on the hidden state, it is not observable, and hence cannot be directly estimated. But Hsu et al. (2009) showed that under certain conditions there exists a fully observable representation of the observable operator model. We now present a novel, fully reduced dimensional version of the observable representation.

### 1.2 The reduced dimension model

Define a random variable

, where has orthonormal columns and is a matrix mapping from observations to the reduced dimension space.

We show below that

 Pr(x1,x2,…,xt)=c⊤∞CytCyt−1⋯Cy1c1 (2)

holds where

 c1 = μ c⊤∞ = μ⊤Σ−1 Cy≡C(y) = K(y)Σ−1

and = , = , and

are easy to estimate using the method of moments.

111 Note that is a tensor. When multiplied by a vector , it produces a matrix. is linear in each of the three reduced dimension observations, , and .

The matrix can be derived in several ways; Hsu et al. (2009) show that taking it to consist of the left singular vectors of corresponding to the largest singular values gives good properties, where is a matrix such that . The matrix and its properties will be discussed in more detail below.

Note that the model () will be estimated using only trigrams. Once a model has been learned, the probability of any observed sequence can be computed using equation 2, or the conditional probability of the next observation in a sequence can be computed by with recursive updates . The key term in the model is thus , which can be viewed as a tensor which takes as input the current observation and produces a matrix which maps (after normalization) from the current “hidden state estimate” to the next one . More precisely, is a linear function of the conditional expectation of the unobservable hidden state , which is the conditional probability vector over states at time .

### 1.3 Comparison to Hsu et al.

Hsu et al. Hsu et al. (2009) derive a similar model which we state here for comparison.

 Pr(x1,x2,...,xt)=b⊤∞BxtBxt−1…Bx1b1 (3)

where

 b1 = U⊤P1 b⊤∞ = P⊤1(U⊤P21)+ Bx = (U⊤P3x1)(U⊤P21)+

and , as defined above, and are the frequencies of unigrams, bigrams, and trigrams in the observed data. Note that the subscripts on refer to their positions in trigrams of observations of the form .

Our major modeling change will be to replace in equation 3 with the lower dimensional tensor which depends on the reduced dimension projection instead of the unreduced . The models are easily related by the following lemma:

###### Lemma 1.

Assume the hidden state is of dimension and the rank of is also . Then:

 Pr(x1,x2,…,xt) = 1⊤AxtAxt−1⋯Ax1π (4) = b⊤∞BxtBxt−1⋯Bx1b1 (5) = c⊤∞CytCyt−1⋯Cy1c1 (6)

Where (5) requires to be invertible, and (6) requires .222 If the matrix is formed from the left singular vectors of corresponding to nonzero singular values, then it will satisfy this condition; See Hsu et al. (2009) lemma 2.

Proof sketch: Paper Jaeger (2000) showed (4), paper Hsu et al. (2009) showed (5), and (6) follows from a telescoping product of the following items:

 c1 = U⊤Oπ c⊤∞ = 1⊤(U⊤O)−1 Cy=C(y) = U⊤OAx(U⊤O)−1

where . More details are given in the supplemental material.

We improve Hsu et al. (2009) in three ways:

1. By reducing the size of the matrix that is estimated, we can achieve a lower sample complexity. In particular, our sample complexity does not depend on the size of the vocabulary nor on the frequency distribution of the vocabulary.

2. Since the conditions given in Hsu et al. (2009) are in terms of the transition matrix , they can not be checked. We instead focus on conditions that are checkable from the data.

3. Instead of using either a L1 error or a relative entropy error, we estimate the probabilities with relative accuracy. In other words, we show that is smaller than . This often is a more useful bound than just knowing is small. For example, it implies that computing conditional probabilities are off by less than . Both L1 and relative entropy errors can be computed from these bounds.

Our main theorem is weaker (as stated) than Hsu et al. (2009) in that we assume knowledge of rather than estimating it from a thin SVD of as they do. Since the accuracy lost when estimating is identical to that given in their paper, we will not discuss it here.

## 2 Theorems

The remainder of this paper presents one main theorem giving finite sample bounds for our reduced dimensional HMM estimation method. We first derive these in terms of properties of the first three moments of the reduced rank ’s, where is the random variable which takes on values of the reduced rank observation . We then convert those bounds to be in terms of the estimates, rather than the unobservable true values, of the model.

Our general strategy of estimating is via the method of moments. We have written in terms of , and . Since each of these three items can be written in terms of moments of the ’s we can plug in these moments to generate an estimate of . Thus we can define:

 ˆPr(xt,xt−1,…,x1)=ˆc⊤∞ˆC(yt)ˆC(yt−1)⋯ˆC(y1)ˆc1 (7)

where

 ˆc1 = ˆμ ˆc⊤∞ = ˆμ⊤ˆΣ−1 ˆC(y) = ˆK(y)ˆΣ−1

where , and are the empirical estimates of the first, second and third moments of the ’s, namely , , , where indexes the different independent observations of our data. These moments estimate the mean vector

, the variance matrix

, and the skewness tensor

.

###### Definition 1.

Define as the smallest element of , and . In other words,

 Λ≡min{mini|μi|,mini,j|Σ−1ij|,mini,j,k|Kijk|}

where we define are the elements of the tensor . Likewise we define the empirical version as

 ˆΛ≡min{mini|ˆμi|,mini,j|ˆΣ−1ij|,mini,j,k|ˆKijk|}
###### Definition 2.

Define as the smallest singular value of , and the smallest singular value of .

The parameters and will be central to our analysis. Theorem 1 gives sample complexity bounds on relative error in estimating the probability of a sequence being generated from an HMM as a function of and , and the following lemmas reformulate those bounds into a more useful form in terms of their estimates. As quantified and proved below, both and must be “sufficiently large”; when they approach zero one loses the ability to accurately estimate the model.

If then will not be invertible, and one cannot infer the full information content of the hidden state from its associated observation, violating the condition required in Hsu et al. (2009) for (5) to hold. As becomes increasingly close to zero, it becomes increasingly hard to identify the hidden state, and more observations are required. Problems with small are intrinsically difficult. As has been pointed out by Hsu et al. (2009), some problems of estimating HMM’s are equivalent to the parity problem Terwijn (2002). So for such data, our algorithm need not perform well. For parity-like problems, is in fact zero, or close to it; Hence we end up with a useless bound for such hard problems.

If is close to zero, then even if the absolute error is small, the relative error can be arbitrarily large, as it involves dividing by the small true value of the parameter being estimated. Fortunately, as discussed below, since depends on the somewhat arbitrary matrix , one can shift away from zero by rotating and rescaling .

The proof of Theorem 1 is based on the idea that if we can estimate each term in , and accurately on an absolute scale (which will follow from basic central limit like theorems) then we can estimate them on a relative scale if is large. Hence, our main condition is that is bounded away from zero. In fact, if we take the usual statistical limit of having the sample size go to infinity and holding everything else constant, then:

 ∣∣ ∣∣ˆPr(x1,…,xt)Pr(x1,…,xt)−1∣∣ ∣∣≤18mtσ2mΛ√N√log(m/δ)

with probability greater than when is large enough.

The following theorem gives the finite sample bound in terms of a sample complexity:

###### Theorem 1.

Let be generated by an state HMM. Suppose we are given a which has the property that and . Suppose we use equation (7) to estimate the probability based on independent triples. Then

 N≥128m2(2t+3√1+ϵ−1)2Λ2σ4mlog(2mδ) (8)

implies that

 1−ϵ≤∣∣ ∣∣ˆPr(x1,…,xt)Pr(x1,…,xt)∣∣ ∣∣≤1+ϵ

holds with probability at least .

Before proceeding with the proof of this theorem, we present and prove two corollaries that correspond directly to Theorems 6 and 7 of Hsu et al. (2009).

###### Corollary 1.

Assume Theorem 1 holds, then with probability at least ,

 ∑x1,…,xt|ˆPr(xt,…,xt)−Pr(x1,…,xt)|≤ϵ

Proof of Corollary 1: We have

 1−ϵ≤∣∣ ∣∣ˆPr(x1,…,xt)Pr(x1,…,xt)∣∣ ∣∣≤1+ϵ ⇒ ∣∣ ∣∣ˆPr(x1,…,xt)Pr(x1,…,xt)−1∣∣ ∣∣≤ϵ ⇒ ∣∣ˆPr(x1,…,xt)−Pr(x1,…,xt)∣∣≤ϵPr(x1,…,xt) ⇒ ∑x1,…,xt∣∣ˆPr(x1,…,xt)−Pr(x1,…,xt)∣∣ ≤ϵ∑x1,…,xtPr(x1,…,xt) ⇒ ∑x1,…,xt∣∣ˆPr(x1,…,xt)−Pr(x1,…,xt)∣∣≤ϵ

###### Corollary 2.

Assume Theorem 1 holds, then we have

 KL(Pr(xt|x1, …xt−1)||ˆPr(xt|x1,…xt−1)) =E(lnPr(xt|x1,…xt−1)ˆPr(xt|x1,…xt−1))≤6ϵ

Proof of Corollary 2: We have

 1−ϵ≤∣∣ ∣∣ˆPr(x1,…,xt)Pr(x1,…,xt)∣∣ ∣∣≤1+ϵ ⇒ 1−ϵ≤∣∣ ∣∣ˆPr(xt|x1:t−1)ˆPr(x1:t−1)Pr(xt|x1:t−1)Pr(x1:t−1)∣∣ ∣∣≤1+ϵ ⇒ 1−ϵ1+ϵ≤∣∣ ∣∣ˆPr(xt|x1:t−1)Pr(xt|x1:t−1)∣∣ ∣∣≤1+ϵ1−ϵ

and using the fact that for small enough we have and , plus the fact that we have

 ⇒ 1−3ϵ≤∣∣ ∣∣ˆPr(xt|x1:t−1)Pr(xt|x1:t−1)∣∣ ∣∣≤1+3ϵ ⇒ 11+3ϵ≤∣∣ ∣∣Pr(xt|x1:t−1)ˆPr(xt|x1:t−1)∣∣ ∣∣≤11−3ϵ

and using a similar fact from above that for small enough , , we get

 ⇒ ∣∣ ∣∣Pr(xt|x1:t−1)ˆPr(xt|x1:t−1)∣∣ ∣∣≤1+6ϵ ⇒ ln[ˆPr(xt|x1:t−1)Pr(xt|x1:t−1)]≤ln(1+6ϵ)≤6ϵ ⇒ ∑x1,…,xtPr(x1,…,xt)ln[ˆPr(xt|x1:t−1)Pr(xt|x1:t−1)] ≤6ϵ∑x1,…,xtPr(x1,…,xt) ⇒ Eln[ˆPr(xt|x1:t−1)Pr(xt|x1:t−1)]≤6ϵ

Define to simplify the following statements. The proof proceeds in two steps. First lemma 2 converts the sample complexity bound into a more useful bounds on and . Then lemma 3 uses these bounds to show the theorem.

###### Lemma 2.

If

 N≥128m2(2t+3√1+ϵ−1)2Λ2σ4mlog(2mδ)

then

 Λ ≥ 3Jσ2m(2t+3√1+ϵ−1) (9) σm ≥ 4J (10)

The proof is straightforward and given in the appendix.

###### Lemma 3.

If equation (8) of Theorem 1 is replaced by (9) and (10) then the results of the theorem follow.

Proof of Lemma 3: Our estimator (see equation 7) can be written as

 ˆPr(x1,…,xt)=ˆμ⊤ˆΣ−1ˆK(yt)ˆΣ−1⋯ˆK(y1)ˆΣ−1ˆμ

We can rewrite this matrix product as

 ˆPr(x1,…,xt)= m∑i1=1⋯m∑i2t+3=1 [ˆμ]i1[ˆΣ−1]i1,i2[ˆK(yt)]i2,i3[ˆΣ−1]i3,i4 ⋯[ˆμ]i2t+3

The components can be written as a scalar sum as:

 [ˆK(y)]a,b=y1[ˆK]a,b,1+y2[ˆK]a,b,2+…+ym[ˆK]a,b,m

So,

 ˆPr(x1,…,xt)= ∑i1,…,i2t+3j1,…,jt[ˆμ]i1[ˆΣ−1]i1,i2[ˆK]i2,i3,j1[yt]j1 ⋅[ˆΣ−1]i3,i4[ˆK]i5,i6,j2[yt−1]j2⋯[ˆμ]i2t+3

This is just a sum of a product of scalars. Lemma 4 (stated precisely and proven in the appendix) shows that accuracy of our estimates of all elements of , and are bounded by with probability .

Each term in the product can be rewritten as

 ˆθ = θ(1+ˆθ−θθ)

and so our products can be thought of as, instead of a product of observed quantities, the product of the theoretical quantities times some relative error term. We can bound this relative error term for all entries, which will allow it to factor out nicely over all summands, giving us a relative error term for our overall probability.

Again thinking of as a generic item in , , or , then above has shown that and so the relative error of each term is bounded as

 1−3Jσ2mθ≤ˆθθ≤1+3Jσ2mθ

which will hold for all terms with probability . Since , we see that

 1−3Jσ2mΛ≤ˆθθ≤1+3Jσ2mΛ

Since our is a product of such terms, we see that

 (1−3Jσ2mΛ)2t+3≤ˆPr()Pr()≤(1+3Jσ2mΛ)2t+3

So by our bound on , we have

 1−ϵ≤ˆPr()Pr()≤1+ϵ

holds with probability .

The sample complexity bound in Theorem 1 relies on knowing unobserved parameters of the problem. To avoid this, we modify Lemma 3 to make it observable. In other words, we convert the assumptions of sample complexity into a checkable condition.

###### Corollary 3.

Let be generated by an state HMM. Suppose we are given a which has the property that . Suppose we use equation (7) to estimate the probability based on independent triples. Then with probability , if the following two inequalities hold

 ˆΛˆσ2m ≥ (12m+6m(2t+3√1+ϵ−1))√2log2mδN (11) ˆσm ≥ 10m√2log2mδN. (12)

then

 1−ϵ≤∣∣ ∣∣ˆPr(x1,…,xt)Pr(x1,…,xt)∣∣ ∣∣≤1+ϵ

Proof:

Two technical lemma’s are needed for this corollary: Lemma 4 and Lemma 5. They are stated and proved in the supplemental material. Lemma 4 basically says that with high probability, each element of , and is estimated accurately. This is then used in Lemma 5 to show that and are estimated accurately.

Define the event to be the set where all the estimates given in Lemma 4 hold. This event happens with probability . On this event from Lemma 5 we know , so . Hence

 ˆΛ≥6mσ2m(2t+3√1+ϵ−1)√2log2mδN+6mσ2m√2log2mδN,

thus on the set if (11) and (12) hold, then we see that (9) and (10) both hold and so we can apply Theorem 1. We can now use Theorem 1 to generate our claim on the accuracy of our probability bound. Technically, this proof as given only shows that our corollary holds with probability . But since the set where Theorem 1 fails is exactly , the probability lower bound is .

The advantage of the corollary is that the left hand sides of the two conditions are observable and the right hand sides involve known quantities. Hence one can tell if the condition is true or not–it doesn’t require knowing unobserved parameters. Note that the statement is of the form so interpretation must be done carefully.

## 3 Discussion: effect of Λ and σm on accuracy

As discussed above, and have different effects on sample complexity. As approaches zero, model estimation becomes intrinsically hard; some problems do not admit easy estimation. In contrast, role of in sample complexity is more of an artifact. As approaches zero, the relative error can be arbitrarily large, even if the estimated model is good in the sense that the probability estimates are highly accurate.

The problem with can be addressed in a couple ways. In this section, we show that estimating a likelihood ratio rather than the sequence probabilities gives improves relative accuracy bounds. An alternate approach, which we do not pursue here, relies on the observation that depends on the (underspecified) matrix , and that one can thus search for a rotation and rescaling of the matrix that increases .

### 3.1 Likelihood instead of probabilities

Obscure words correspond to rows of the observation matrix with very small values throughout the row. If we were interested in only estimating the probability of such a word, then these are the easy words–basically guess zero or close to it. But, since we would like to estimate the relative probability accurately, these words are the most challenging. Further, such small probabilities would make computing conditional probabilities unstable since they would then become basically “0/0.” Further, since the values are all small in and in , they do not significantly improve our estimates of , and since they are essentially zeros. Both of these problems can be fixed by considering the problem of estimating a likelihood ratio instead of a probability. So define:

 λq(x1,…,xt)=Pr(x1,x2,…,xt)P1(x1)P1(x2)⋯P1(xt)

The could be taken to be the marginal probability of observing . It does not, in fact, have to be a probability–just any weighting which helps condition our matrix and our tensor . We can then use a modified version of and in all our existing lemma’s and theorems. The precise statement of these modified versions are in the appendix. What changes is that now is much larger and hence our relative accuracy will be greatly improved. This fact is shown in the empirical section.

### 3.2 Empirical estimates of Λ and σm

Figure 2 shows estimates of and

, using the Internet as the corpus as summarized in the Google n-gram dataset

, which contains frequencies of the most frequent 1-grams to 5-grams occurring on the web. Details on how the figures were generated can be found in the supplementary material. As the size, , of the reduced dimension space is increased, smaller and smaller singular values, , occur in the model, and the value of the smallest parameter in the model decreases. Empirically, both fall off with a power of , giving straight lines on the log-log plot. This data indicates a large sample complexity, the reduction of which will be a focus of future work.

## 4 Prior work and conclusion

Recently, ideas have been proposed that push spectral learning of HMMs in several different directions. Boots et al. (2010) provides a kernelized spectral algorithm that allows for learning an HMM in any domain in which there exists a kernel. This allows for learning of an HMM with continuous output without the need for discretization. Boots & Gordon (2011) provides an analogous algorithm that enables online learning for Transformed Predictive State Representations, and hence the setup in Hsu et al. (2009). Finally, Siddiqi et al. (2009) directly extends Hsu et al. (2009) by relaxing the requirement that the transition matrix be of rank , but instead allows rank , creating a Reduced-Rank HMM (RR-HMM), and then applying the algorithm from Hsu et al. (2009) to learn the observable representation of this RR-HMM.

All of the above extensions preserve the basic structure of the tensor , which updates the hidden state estimate (or more precisely, a linear transformation of it) based on the most recent observation . In this paper, we replace with a tensor , which updates the hidden state estimate using a low dimensional projection of the observation . contains only terms, in contrast to the terms contained in . Reducing the number of parameters to be estimated has both computational and statistical efficiency advantages, but requires some changes to the proofs in Hsu et al. (2009). While making these changes, we also give proofs that are simpler, that only use conditions that are checkable from the data, and that bound the relative, rather than absolute error.

This paper focused on the simplest case, in which HMMs have discrete states and discrete observations and in which the observations are reduced to the same sized space as the hidden state, but our approach can be generalized in all of the ways described above.

We have presented an improved spectral method for estimating HMMs. By using a tensor that depends on the reduced rank instead of the full observed in the tensor used by Hsu et al. (2009), we reduced the number of parameters to be estimated by a factor of the ratio of the size of the vocabulary divided by the size of the hidden state. This reduction has corresponding benefits in the sample complexity. We also showed that the sample complexity depends critically upon , the smallest singular value of the covariance matrix . As becomes small, the HMM becomes increasingly hard to identify, and increasing numbers of samples are needed.

## References

• Baum et al. (1970) Baum, L.E., Petrie, T., Soules, G., and Weiss, N.

A maximization technique occurring in the statistical analysis of probabilistic functions of markov chains.

The annals of mathematical statistics, 41(1):164–171, 1970.
• Boots & Gordon (2011) Boots, B. and Gordon, G.J. An online spectral learning algorithm for partially observable nonlinear dynamical systems. AAAI, 2011.
• Boots et al. (2010) Boots, B., Siddiqi, S.M., Gordon, G., and Smola, A. Hilbert space embeddings of hidden markov models.

Proc. 27th Intl. Conf. on Machine Learning (ICML)

, 2010.
• Carlyle & Paz (1971) Carlyle, J.W. and Paz, A. Realizations by stochastic finite automata. Journal of Computer and System Sciences, 5(1):26–40, 1971.
• Dempster et al. (1977) Dempster, A.P., Laird, N.M., and Rubin, D.B. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.
• Fliess (1974) Fliess, M. Matrices de hankel. J. Math. Pures Appl, 53(197-222):423, 1974.
• Geman & Geman (1984) Geman, Stuart and Geman, Donald. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, PAMI-6(6):721 –741, nov. 1984. ISSN 0162-8828.
• Hoeffding (1963) Hoeffding, Wassily. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):pp. 13–30, 1963. ISSN 01621459.
• Hsu et al. (2009) Hsu, Daniel, Kakade, Sham M., and Zhang, Tong. A spectral algorithm for learning hidden markov models. COLT, 2009.
• Jaeger (2000) Jaeger, Herbert. Observable operator models for discrete stochastic time series. Neural Computation, 12(6), 2000.
• Littman et al. (2002) Littman, M.L., Sutton, R.S., and Singh, S. Predictive representations of state. Advances in neural information processing systems, 2:1555–1562, 2002.
• Rabiner (1989) Rabiner, L.R. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
• Schutzenbeegeb (1961) Schutzenbeegeb, MP. On the definition of a family of automata. Information and control, 4(2-3), 1961.
• Siddiqi et al. (2009) Siddiqi, S.M., Boots, B., and Gordon, G.J. Reduced-rank hidden markov models. Arxiv preprint arXiv:0910.0902, 2009.
• Terwijn (2002) Terwijn, S. On the learnability of hidden markov models. Grammatical Inference: Algorithms and Applications, pp. 344–348, 2002.

## Appendix- Supplemental Material

###### Lemma (Restatement of Lemma 1).

Assume the hidden state is of dimension and the rank of is also . Then:

 Pr(x1,x2,…,xt)=1⊤AxtAxt−1⋯Ax1π (4)
 Pr(x1,x2,…,xt)=b⊤∞BxtBxt−1⋯Bx1b1 (5)
 Pr(x1,x2,…,xt)=c⊤∞CytCyt−1⋯Cy1c1 (6)

Where (5) requires to be invertible, and (6) requires .

Proof:

As pointed out in the main text, Jaeger (2000) showed (4), and Hsu et al. (2009) showed (5). To show (6), we will first write the characteristics , and in terms of the theoretical matrices, , , , and :

 μ = U⊤Oπ Σ = U⊤OTdiag(π)O⊤U Σ−1 = (O⊤U)−1diag(π)−1T−1(U⊤O)−1 K(y) = U⊤OTdiag(O⊤Uy)Tdiag(π)O⊤U

By definition, we have

 c1≡μ=U⊤Oπ

likewise,

 c⊤∞ ≡ μ⊤Σ−1 = (π⊤O⊤U)((O⊤U)−1% diag(π)−1 ⋅T−1(U⊤O)−1) = π⊤diag(π)−1T−1(U⊤O)−1 = 1⊤T−1(U⊤O)−1 = 1⊤(U⊤O)−1

For ,

 C(y) = K(y)Σ−1 = U⊤OTdiag(O⊤Uy) ⋅Tdiag(π)O⊤UΣ−1 = U⊤OTdiag(O⊤Uy)(U⊤O)−1

Note that is a projection operator and since its range is the same as that of we have . So, if , then:

 C(y) = U⊤OTdiag(O⊤UU⊤δx)(U⊤O)−1 = U⊤OTdiag(O⊤δx)(U⊤O)−1 = U⊤OAx(U⊤O)−1

Thus (6) follows from a telescoping product.

Proof of lemma 2:

The proof is simply algebraic manipulation. We have

 N≥128m2(2t+3√1+ϵ−1)2Λ2σ4mlog(2mδ)

which implies that

 Λ2 ≥ 128m2(2t+3√1+ϵ−1)2Nσ