 # Estimate exponential memory decay in Hidden Markov Model and its applications

Inference in hidden Markov model has been challenging in terms of scalability due to dependencies in the observation data. In this paper, we utilize the inherent memory decay in hidden Markov models, such that the forward and backward probabilities can be carried out with subsequences, enabling efficient inference over long sequences of observations. We formulate this forward filtering process in the setting of the random dynamical system and there exist Lyapunov exponents in the i.i.d random matrices production. And the rate of the memory decay is known as λ_2-λ_1, the gap of the top two Lyapunov exponents almost surely. An efficient and accurate algorithm is proposed to numerically estimate the gap after the soft-max parametrization. The length of subsequences B given the controlled error ϵ is B=(ϵ)/(λ_2-λ_1). We theoretically prove the validity of the algorithm and demonstrate the effectiveness with numerical examples. The method developed here can be applied to widely used algorithms, such as mini-batch stochastic gradient method. Moreover, the continuity of Lyapunov spectrum ensures the estimated B could be reused for the nearby parameter during the inference.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction to HMM

Hidden Markov models(HMM) are a class of discrete-time stochastic process : is a latent discrete valued state sequence generated by a Markov chain, with values taking in the finite set ; is corresponding observations generated from distributions determined by the latent states . Here it assumes taking values in , but it can easily extended to discrete states.

We can use the forward algorithm to compute the joint distribution

by marginalizing over all other state sequences . is conditionally independent of everything but and is conditionally independent of everything but , i.e, and . The algorithm takes advantage of the conditional independence rules of HMM to perform the calculation recursively. With Bayes’s rule, it follows,

 p(xn,y1:n) =∑xn−1p(yn|xn)p(xn|xn−1)p(xn−1,y1:n−1) (1)

In (1), is called emission distribution with emission parameter , is the transition probability of the Markov chain which is represented by a transition matrix . In most cases, we assume is primitive, i.e, the corresponding Markov chain is irreducible and aperiodic. We denote the parameter of interest as

. If the emission distribution is Gaussian distribution, then the emission parameters are the mean

and the covariance . By using notation of vectors and matrix operations, this joint distribution can be represented by a row vector with th component is . The forward algorithm can be calculated by the following non-homogenous matrix product.

 pn=p0MD1MD2…MDn (2)

where is the initial state distribution , is a diagonal matrix with th entry as which is the emission distribution assume the current state is at . Moreover, if one treats the observation as random events, then are random matrices sampled independently at each step. If one starts with invariant distribution of the Markov chain initially, , then these matrices are sampled in i.i.d manner with probability density distribution

 f(y)=∑jπjp(y|xn=j,ϕj) (3)

If the initial distribution is not , after couple time steps, the distribution follows the invariant distribution and one can assume these matrices are sampled in i.i.d manner anyway. Now it is turned into a product of random matrices problem and these diagonal matrices randomly rescale the columns of . is called forward probability.

If we normalize the vector , it obtains the filtered state probability, , which is not the invariant distribution of the Markov chain,

 ρn=p(xn,y1:n|θ)p(y1:n|θ)=pnpn⋅1 (4)

This process is called filtering. The normalization constant gives the total probability for observing the given sequence up to step irrespective of the final states, which is also called marginal likelihood . Not only this process ensures the numerical stability of random matrices production, but also provides the scaled probability vector of being each state at step . Note the probability vector lives in a simplex,

, which is also called projective space in dynamical system, or space of measure in probability theory. Instead, the joint probability

is in .

Another joint probability column vector is the probability of observing all future events starting with a particular state . It can be computed by the backward algorithm similarly and it is called backward probability. We begin with , and it gives

 bi=MDi+1…MDn1 (5)

It is again a product of random matrices. One can similarly renormalize the backward probability vector for better numerical stability, such that . In fact, with forward and backward probability, we can calculate the probability which is the Hadamard product of two vectors. In fact, the entry for the highest entry of this probability vector can give rough idea which latent state at step lies.

## 2 Exponential Forgetting

Heuristically, in this very long heterogenous matrix multiplication (2), one observes that the final vector is irrelative to the initial vector and almost determined by the last several matrices multiplications, up to a normalization constant. As a matter of fact, if one is not interested in the precise value of the final vector, the subsequence of matrices with length are sufficient to approximate the vector. In more mathematical precise writings: Start any two different initial state probability vector and and after applying exactly the same sequence of matrices, they generate two sequence of filtered state probability and . The distance of two sequence goes to 0 asymptotically almost surely, i.e,

 limn→+∞∥ρn−ρ′n∥=0 a.s. (6)

This phenomenon is called exponential forgetting of prediction filter or loss of memory of HMM.

Example: In the figure 1, Markov chain has three state, emission distribution is a one-dimensional Gaussian on each state and the parameter is

 M=⎡⎢⎣0.0050.990.0050.010.030.960.950.0050.045⎤⎥⎦,μ=[0,0.5,−0.5],σ=[1,1,1]

Starting with every point in the simplex as initial conditions, we apply these points by the same sequence of random matrices. One observes that the triangle consisting all points starts to shrink along and after 40 steps, the triangle is contained within a small circle with radius . As goes to , it will synchronize into a random fixed point, since it is sequence dependent. That implies if one allows error of , it may only requires the last 40 matrices which is irrelevant with the initial condition. So it significantly simplifies computational complexity.

If the diagonal matrices are homogenous, it degenerates to the corollary of Perron-Frobenius theorem for primitive matrices. Now natural questions to arise are: what are conditions for such phenomenon and under these conditions, what are the rate of convergence. This rate in fact answers the critical question that how long the length should be for a given . More questions about the rate are how to estimate the rate numerically or even analytically and does the rate continuously depends on the parameter .

In fact, the sufficient conditions for this phenomenon are given in Le Gland et al [20, 19],

###### Theorem 2.1

If Markov transition matrix is primitive and the emission distribution is strictly positive, then for any , there exists a strictly negative

 limsupn→+∞1nlog∥ρn−ρ′n∥≤−c,  almost\ surely (7)

The theorem implies the filtered state probability forget almost surely their initial conditions exponentially fast and the rate is at least . The techniques they used are Hilbert metric and Birkhoff contraction coefficient , which are extensively applied in non-negative matrix theory [31, 11]. Definitions of both terms are included in the appendix A and it also showed that for positive matrix

which is a sub-class of primitive matrix. It is a bit surprising that eigenvalues of each matrix in the heterogenous matrix production have little to do with this asymptotic behavior. In particular, one can construct a matrix sequence that spectrum radius of each is uniformly less 1, but the product doesn’t even converge to 0. It is because the spectral radius doesn’t process sub-multiplicity property, on the other hand, this Birkhoff contraction coefficient does. Moreover,

if and only if each row of is a scalar multiple of the first row, which is also called weak ergodicity. At last, this coefficient is invariant with rescaling rows and columns of matrix. From these three properties, one immediately concludes when is positive, the heterogeneous matrix production in (2) has the weak ergodicity and the exponential forgetting of the prediction filter follows with convergence rate . To further relax the positive matrix to primitive matrix, the approach is rather technical.

On the other hand, the long time behavior of random matrices production is well studied in multiplicative ergodic theorem (MET) through Lyapunov exponent. It is the heart of a field called Random Dynamical System (RDS) . Lyapunov exponent is exactly the generalization of absolute value of eigenvalues in the terms of random matrices production. Atar et al  and Collet et al  gave the exact convergence rate by Lyapunov exponents,

###### Theorem 2.2
 limsupn→+∞1nlog∥ρn−ρ′n∥=λ2−λ1,  almost\ surely (8)

So the convergence rate is upper bounded by the gap between the first two Lyapunov exponents of the products of random matrices in (2) and in fact realized for almost all realizations. Furthermore, they showed this gap is strictly negative when the transition matrix is primitive and the emission distribution is positive. Then it recovered Le Gland’s results. There is a nice connection between two results: Peres  proved the gap of the first two Lyapunov exponents, in i.i.d random matrices production is upper bounded by . So for positive matrices, these two results connect with each other naturally.

Random dynamical system, as an extension of the theory of non-autonomous dynamical system, has different setup from stochastic process and is somewhat inaccessible to a nonspecialist. Here we will present this theory in the setting of product of random matrices intuitively. The rigorous definition is included in appendix B.

We will describe an i.i.d RDS for the sake of convenience and one can extend easily to independent but not identical RDS. Results presented in this paper can be extended to ergodic case. The state space is and the family of matrices is all the possible diagonal matrices . We would like to study the dynamics of in (2). Although itself has probability meaning, here we are merely treating it as dimensional random variable. Initially, starting from an initial condition , a diagonal matrix is chosen according to the probability density distribution in (3). Then the system moves to the state in step 1. Again, independently of previous maps, another matrix

is chosen according to the same probability density function and the system moves to the state

. The procedure repeats. The random variable is now constructed by means of multiplication of independent random matrices.

The asymptotic limit of the rate of growth for the product of independent random matrices, , is as been studied started at the beginning of the 60 s. It has great relevance for development of the ergodic theory of dynamical system. Furstenberg and Kesten [10, 9] showed

###### Theorem 2.3
 λ1=limn→+∞1nlog∥pn∥∥p0∥,  almost\ surely (9)

this limit exists almost surely, moreover, it is nonrandom quantity and independent of the choice of metric and initial condition.

It is considered as the extension of strong law of large number to i.i.d random matrices

. This limit is called maximum Lyapunov exponent. It is rather surprising result since the order of sequence seems not much important even for non-commutative matrix multiplication. However, the Furstenberg-Kesten theorem neglects the finer structure given by the lower growth rates, other than the maximum Lyapunov exponent. Later Oseledets  showed there exists Lyapunov spectrum

, like eigenvalue spectrum, from the multiplicative ergodic theorem (MET). Similarly Lyapunov spectrum doesn’t depend on the choice of sequence almost surely and thus it is a global property for this random matrix multiplication. For a given initial vector, such set of sequences that gives different asymptotic limit of growth rate has zero measure. Analog with eigenvector, it also has Lyapunov vector which describes characteristic expanding and contracting directions, but it depends on the particular ergodic sequence. The statement of the theorem is included in appendix

B.

The filtered state probability is projected onto the simplex in (4) and the dynamics of it will be an induced RDS. There is a nice theorem connecting Lyapunov spectrum of both RDS. 

###### Theorem 2.4

Lyapunov spectrum of the induced RDS is that of the corresponding RDS subtracts the maximum Lyapunov exponent, i.e, .

Specifically, when the condition in theorem 2.1 is fulfilled, then maximum Lyapunov exponent of the induced RDS, with multiplicity 1 and the next one is which is what we desire to estimate.

In the framework of RDS, the exponential forgetting property defined above is equivalent with synchronization by noise. Synchronization is the phenomenon when trajectories of random dynamical systems subjected to the same randomness, but starting from different initial vectors converge in time to a single (random) solution almost surely, like in eq (6). So these trajectories are not independent. Synchronization has been widely discovered as a relevant property in modeling of external noises. In neurosciences, one observes this synchronization by common noise as a reliable response of one single neural oscillator on a repeatedly applied external pre-recorded input, which may be seen as a dynamical system driven by the same noise path but different initial conditions. [21, 18]

However, we note not every RDS processes this property. Specifically, Newman  showed the necessary and sufficient conditions for stable synchronization in continuous state RDS. Crudely speaking, in order to see synchronization, one needs two ingredients: local contraction (negative maximum Lyapunov exponents) so that nearby points approach each other; along with a global irreducibility condition. In discrete state RDS, conditions for synchronization are discussed as well . In HMM, the global irreducibility holds since the transition matrix is primitive and the local contraction is guaranteed by this gap . It recovers the results previously obtained. Much intuitive picture will be presented in the next section. So the 2-norm of difference for two nearby trajectories has the following behavior,

 ∥ρn−ρ′n∥≤Cexp((λ2−λ1)n)∥p0−p′0∥ (10)

where is some constant. If one would like to have the error within the radius of , then the length of the subsequence should be . However, from the previous literature [20, 19, 2], the explicit analytical estimate of the gap for a given parameter is either too loose or still difficult to find. So the numerical algorithm of efficient estimation is on demand.

## 3 Algorithm

In fact, one could sample two sequences of and with the same matrices sequence and monitor the maximum length needed to achieve

error. However, it suffers numerical instability and lack of robustness, such that some rare cases could deviate the estimate. Or one use QR decomposition directly to find the Lyapunov spectrum for (

2) which takes about order of multiplications per each iteration. But what needed is merely the second largest one instead of the whole spectrum. Then it is possible to have a more efficient algorithm and may provide some new insight that why this gap governs the exponential forgetting rate. In realistic scenario, one may possibly access the forward probability or the filtered state probability , or at least some portion of them. We would like to take advantage of these information without redoing this time-consuming filtering process.

If , define a projection from simplex to as the log ratio relative to the last component. The projection is illustrated for the example in figure 1.

 Π:ρ=[a1,a2,…,aK]→r=[log(a1aK)r1,log(a2aK)r2,…,log(aK−1aK)rK−1,0] (11) Figure 1: (a) Starting with every point in the simplex, apply the same sequence of random matrices, and the triangle is contained within a small circle with radius ϵ after 40 steps. (b) Diagram of the projection from a point in the simplex S2 to R2.

Denote as convention, such that is embedded in . Since is primitive, cannot be 0 except for at most initial steps. In the mean time, will be in the interior of the simplex. Such projection from compact space to non-constraint space, illustrated in figure 1, is relatively common in numerical optimization which is called soft-max parametrization. It directly implies the constraint condition and . The inverse of the projection, is

 Π−1:r=[r1,r2,…,rK−1,0]→ρ=[exp(r1)∑iexp(ri)a1,exp(r2)∑iexp(ri)a2,…,exp(rK)∑iexp(ri)aK] (12)

The index of the summation is from 1 to . This projection naturally defines an induced RDS for the dynamics of . Furthermore,

###### Theorem 3.1

If the coordinate transformation is one-to-one, the derivative and its inverse exists, then Lyapunov spectrum is invariant under such coordinate transformation.

Then the projection preserves the Lyapunov spectrum. It also means the synchronization with the variable implies the synchronization with and vice versa. Heuristically understanding, is due to the constraint condition and after the parametrization, the condition is inherited in the last component . If we only study the dynamics for the first unconstrained coordinates, it removes this particular Lyapunov exponent of the induced RDS but keeps the rest of the spectrum the same. Now the maximum Lyapunov exponent is the desired difference .

In addition, the dynamics of has the following nice property. The random map for has the form as

 rn+1 =Gdn(rn)=dn+F(rn) (13)

It is composed with random translation and deterministic map . Each component of the map is explicitly given as

 Fi(r) =ln(∑Kj=1exp(rj)Mji∑Kj=1exp(rj)MjK), 1≤i≤K−1 (14)

If we denote as the -th column of the transition matrix and as component-wise exponent, eq (14) can be rewritten by inner product form

 Fi(r)=ln(exp(r)⋅miexp(r)⋅mK), 1≤i≤K−1 (15)

The random translation is similarly defined as the log ratio of diagonal of relative to the last component, . Since the emission distribution is positive, the log ratio is well defined. The random map is the translation of the deterministic smooth map by the i.i.d random variable and is solely dependent on the transition matrix . It is even more interesting to notice the Jacobian of this random map is independent with , it is since the random translation will not affect the local contraction or expansion.

The -by- Jacobian matrix can be explicitly expressed as follows,

 Jij(r)=exp(rj)Mjiexp(r)⋅mi−exp(rj)MjKexp(r)⋅mK, 1≤i,j≤K−1 (16)

Then we will have the corollary following by Theorem 2.4 and Theorem 3.1

###### Corollary 1
 λ2−λ1=limsupn→+∞1nlog∥J(rn)J(rn−1)⋯J(r1))∥ (17)

Now the maximum Lyapunov exponent is approximated by the finite time Lyapunov exponent. Instead of using QR decomposition, the maximum Lyapunov exponent can be estimated by averaging finite time approximations which is much faster and easier to implement. One can start with a unit test vector, apply these Jacobian matrices sequentially and renormalize the vector at each step. Averaging all these renormalization constants along the timeline will give the approximation of maximum Lyapunov exponent. It is not a concern that all vectors are alignment along the direction of maximal expansion because we are not interested in finer structure of the spectrum. The order of multiplication needed is per each iteration which is faster than QR decomposition. More importantly, if one already has partial data set of the filtered state probability, they can be projected to and estimate the Lyapunov exponent directly without further information on observed sequences. Figure 2: (a) We use algorithm 1 to estimate the gap of Lyapunov exponent with the observation sequence with length of 10000. (b) We sample 10 independent sequences for 1nlog∥ρn−ρ′n∥2 and compare with the theoretical limit (black line). (c) We average 500 independent sample sequences and compare with the theoretical limit (black line).

The maximum Lyapunov exponent for this induced random map characterizes the rate of separation of infinitesimally close trajectories in . If two vectors and are close enough, one could use their difference to approximate the 2-norm of the difference of and , . Then the rate of separation for in fact is upper bounded by the gap , which is the estimation of exponential forgetting rate. This algorithm provides some new insight for the analytical justification for the gap.

We apply this algorithm to approximate the gap of Lyapunov exponent in the previous example. In the figure 2, the estimated gap is with data of 10000 and the length needed for is about 178. On the other hand, one starts with two different initial conditions and and applies the same sequences of random matrices to obtain and after normalization. We plot along for ten independent sequences and they roughly converge to the theoretical limit . However, as increases, reaches the machine epsilon and becomes numerically unstable, such that some sequences are cut off beyond . So we are not able to visualize the strong convergence directly. If we average 500 sample sequences, then we can clearly visualize the convergence in mean. With the uniform integrability, convergence in mean is granted by the strong convergence.

Right now, it seems one needs to estimate the gap for each parameter . One related result is if matrices are nonsingular and maximum Lyapunov exponents are simple, then it depends continuously on the probability . Bocker and Viana  showed Lyapunov spectrum depend continuously on matrices and probability for 2-dimensional case, as far as all probabilities are positive. Moreover, a few of Avila’s deepest results that are still in preparation with Eskin and Viana, extend the statement to arbitrary dimension. The book  gives a nice introduction on this most recent approach. The direct consequence for this result is it doesn’t need to estimate the gap every time and it is safe to reuse the previous estimation for couple steps in parameter inference. The pseudocode of estimating the length is given in Algorithm 1.

## 4 Example

### 4.1 Inference

Traditionally, EM, variational inference or MCMC are used to perform inference over

. These algorithms have found widespread use in statistics and machine learning

[24, 8]. However, it is a computational challenge in terms of scalability and numerical stability, to marginalize all hidden state variables given a long sequence of observations. There are many other gradient based algorithms to obtain the maximum likelihood estimator(MLE) or maximum a posteriori (MAP), for instance, stochastic gradient descent method. We must be able to efficiently estimate the gradient of the log-likelihood function or log-posterior function, . With the prior function , the gradient is written as

 ∂lnp(θ|y1:n)∂θi =∂lnp(y1:n|θ)∂θi+∂lnp(θ)∂θi =n∑j=1ρj−1∂MDj∂θiβjρn⋅βn+∂lnp(θ)∂θi (18)

The complexity of matrix multiplication needed to calculate one component of the gradient is and the space needed is also . So it is prohibitively expensive to compute directly in space and time when is very large. Moreover, this direct computation is not numerically stable since the numerator and denominator are usually extremely small in such massive matrix multiplication. In fact, there are various algorithms to reduce the complexity, including the following mini-batch gradient descent method, which employs noisy estimates of the gradient based on minibatch of data [23, 7].

First, instead of summing over all index from 1 to , uniformly sample a subset of summand with cardinality at each step and use the following estimator for the direction of the full gradient. Here we assume the prior distribution as uniform for the sake of simplicity,

 ∂ln˜p(θ|y1:n)∂θi=ns∑j∈Sρj−1∂MDj∂θiβjρj−1MDjβj (19)

Then we expect . This is typically referred to mini-batch gradient descent based techniques and it is very effective in the case of large-scale problems.

Second, instead of computing normalized forward and backward probability and recursively, we introduce a buffer of length and on left and right ends of the subsequence of random matrices and both vectors are estimated by this much shorter subsequence,

 ˜pLB1=p0MDj−B1…MDj−1,  ˜ρj−1=˜pLB1/(˜pLB1⋅1) (20) ˜bRB2=MDj+1…MDj+B21,  ˜βj=˜bRB2/(˜bRB2⋅1) (21)

So the gradient (19) is approximated as

 ∂ln˜p(θ|y1:n)∂θi=ns∑j∈S˜ρj−1∂MDj∂θi˜βj˜ρj−1MDj˜βj (22)

Note that (22), the matrix multiplication required is after using the buffer and the space needed is . When , this results in significant computational speedups over the full batch inference algorithm. This techniques are exactly due to the memory decaying property and the buffer length is calculated in Algorithm 1. can be similarly calculated by using the backward probability and normalized backward probability in Algorithm 1.

In order to cooperate with this technique, we can uniformly sample the subset of summand in the domain . Moreover, one can enforce each subsequence sampled is not overlapping to have better effect. In pseudocode, the algorithm can be presented in Algorithm 2. The figure 3 illustrates the idea of the subsequence sampled from the full observation sequence.

This memory decay property not only can be taken advantage of in mini-batched gradient descent based inference on MLE or MAP, but also be used in stochastic gradient-MCMC [23, 22], stochastic variational inference , stochastic EM and online learning . There are many more algorithms could be built based on this fundamental property in HMM. No matter what mini-batched based algorithm is used, it is important to estimate the buffer length efficiently and accurately. Figure 3: Diagram of left-subsequences and right-subsequence sampling from full observation sequence. It approximates the stochastic gradients using subchains of length B1+B2+1.

### 4.2 Synthetic Example

In order to demonstrate the algorithm, we sampled a long observation sequence with length and the parameter given in example 1. We assume we know all parameters except and . In the algorithm, we use the same left and right buffer length and sample size . The learning rate starts with 0.05 and decays with the rate of 0.95 along the steps to prevent oscillations. After 25 steps, the algorithm will restart with the latest parameter, until the difference of parameter from the previous restart is within the threshold, which in this case is 0.02. From the figure 4, the parameter reaches the desire MLE after 8 starts from the initial guess . Note from the contour plot of the log-likelihood function, there is a region around which is the flip of the mean, is very flat and our algorithm is able to escape it due to the stochastic nature. Another remark is the matrix multiplication needed is about which is less than the length observation sequence . It implies the algorithm has reached MLE even before the filtering procedure is finished in single iteration in EM or gradient descent algorithms. So it significantly speed up the inference procedure. Figure 4: Apply the Algorithm 2 to Example 1. The background is the contour plot of the log-likelihood function. In the left figure, μ1 and μ2 are unknown, the algorithm converges to (0.03,0.48) starting from (0.8,−0.8). In the right figure, μ1 and σ1 are unknown, the algorithm converges to (0.01,0.94) starting from (−0.4,1.4).

## 5 Extension and Conclusion

Although traditionally the EM algorithm has monotonic convergence, ensures parameters constraints implicitly and generally easier to be implemented, the convergence of EM can be very slow in terms of time for each iteration and total iteration steps. Our method can significantly reduce the time for each iteration since we only utilize part of the data by harnessing the memory loss property, but the steps needed is still comparable with EM algorithm. In fact, one could extend our idea to efficiently estimate the Hessian matrix such that much faster quadratic convergence can be achieved with second-order method. Moreover, one can calculate observed fisher information which is the negative Hessian matrix of the log-likelihood evaluated at MLE. It will give curvature information at MLE and help to decide which MLE may be better without calculating the likelihood explicitly. Another natural extension of our method is to discrete state Kalman filter, which is continuous time version of HMM. Similar exponential forgetting property and the rate being the gap of top two Lyapunov exponents are discussed in

 for the Wonham filter, but the proof is harder and evolves different techniques, in particular, the naive time discretization may be challenging since one needs to justify the change order of the limit. Further extension to continuous state Kalman filter is not feasible at this stage because our method is based on finite dimension random matrix theory. Last but not the least, it is tempering in realistic scenario to use the same buffer length for the left and right subsequences. The numerical simulation indicates the gap of Lyapunov exponents for both forward and backward probabilities for this particular example are close and one would speculate the gaps are the same. To formulate this problem more explicitly, it is equivalent to find the connection of Lyapunov spectrum for i.i.d random matrix production and for their transpose. One result is the Birkhoff contraction coefficient for the matrix is the same as its transpose. So if the Markov transition matrix is positive matrix, then both gaps are bounded by . But the explicit connection still remains as an open question.

In the era of big data, data analysis methods in machine learning and statistics, such as hidden Markov models, play a central role in industry and science. The growth of the web and improvements in data collection technology in science have lead to a rapid increase in the magnitude and complexity of these analysis tasks. This growth is driving the need for the scalable algorithms that can handle the “Big Data”. However, we don’t need that the whole massive data, instead small portion of data could serve as good as the original. One successful examples are mini-batched based algorithms. Despite the simple chain-based dependence structure, apply such algorithms in HMM are not obvious, since subsequences are not mutually independent. However, with the data set being abundant, we are able to harness the exponential memory decay in filtered state probability and appropriately choose the length of the subchain with the controlled error, to design mini-batched based algorithms. We proposed an efficient algorithm to accurately calculate the gap of the top two Lyapunov exponents, which helps to estimate the length of the subchain. We also prove the validity of the algorithm theoretically and verified it by numerical simulations. In the example, we also proposed the mini-batched gradient descent algorithm for MLE of log-likelihood function and it significantly reduces the computation cost.

## References

•  L. Arnold. Random Dynamical Systems. Springer Monographs in Mathematics. Springer-Verlag Berlin Heidelberg, 1998.
•  R. Atar and O. Zeitouni.

Lyapunov exponents for finite state nonlinear filtering.

SIAM Journal on Control and Optimization, 35(1):36–55, 1997.
•  R. Bhar and S. Hamori. Hidden Markov Models: Applications to Financial Economics. Advanced Studies in Theoretical and Applied Econometrics. Springer US, 2006.
•  C. Bocker-Neto and M. Viana. Continuity of lyapunov exponents for random two-dimensional matrices. Ergodic Theory and Dynamical Systems, 37(5):1413–1442, 2017.
•  P. Collet and F. Leonardi. Loss of memory of hidden markov models and lyapunov exponents. Ann. Appl. Probab., 24(1):422–446, 02 2014.
•  A. Crisanti, G. Paladin, and A. Vulpiani. Products of Random Matrices in Statistical Physics. Springer Series in Solid-State Sciences. Springer-Verlag Berlin Heidelberg, 1993.
•  N. Foti, J. Xu, D. Laird, and E. Fox. Stochastic variational inference for hidden markov models. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3599–3607. Curran Associates, Inc., 2014.
•  A.M. Fraser. Hidden Markov Models and Dynamical Systems. SIAM e-books. Society for Industrial and Applied Mathematics, 2008.
•  H. Furstenberg. Noncommuting random products. Trans. Amer. Math. Soc., 108:377–428, 1963.
•  H. Furstenberg and H. Kesten. Products of random matrices. The Annals of Mathematical Statistics, 31(2):457–469, 1960.
•  D.J. Hartfiel. Nonhomogeneous Matrix Products. World Scientific, 2002.
•  F. Jelinek. Statistical Methods for Speech Recognition. Language, Speech, & Communication: A Bradford Book. MIT Press, 1997.
•  B. H. Juang and L. R. Rabiner. Hidden markov models for speech recognition. Technometrics, 33(3):251–272, 1991.
•  C. Karlof and D. Wagner. Hidden Markov Model Cryptanalysis, pages 17–34. Springer Berlin Heidelberg, Berlin, Heidelberg, 2003.
•  W. Khreich, E. Granger, A. Miri, and R. Sabourin. A survey of techniques for incremental learning of hmm parameters. Information Sciences, 197:105 – 130, 2012.
•  Anders Krogh, Michael Brown, I.Saira Mian, Kimmen Sjölander, and David Haussler. Hidden markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology, 235(5):1501 – 1531, 1994.
•  Anders Krogh, Björn Larsson, Gunnar von Heijne, and Erik L. L. Sonnhammer. Predicting transmembrane protein topology with a hidden markov model: Application to complete genomes. J. MOL. BIOL, 305:567–580, 2001.
•  Guillaume Lajoie, Kevin K. Lin, and Eric Shea-Brown. Chaos and reliability in balanced spiking networks with temporal drive. Phys. Rev. E, 87:052901, May 2013.
•  F. Le Gland and L. Mevel. Basic properties of the projective product with application to products of column-allowable nonnegative matrices. Mathematics of Control, Signals and Systems, 13(1):41–62, Feb 2000.
•  F. Le Gland and L. Mevel. Exponential forgetting and geometric ergodicity in hidden markov models. Mathematics of Control, Signals and Systems, 13(1):63–93, Feb 2000.
•  Kevin K. Lin. Stimulus-Response Reliability of Biological Networks, pages 135–161. Springer International Publishing, Cham, 2013.
•  Y.-A Ma, T. Chen, and E. B. Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems 28, pages 2899–2907. 2015.
•  Y.-A. Ma, N. J. Foti, and E. B. Fox. Stochastic Gradient MCMC Methods for Hidden Markov Models. arXiv.org, stat.ML, June 2017.
•  K.P. Murphy. Machine Learning: A Probabilistic Perspective. Adaptive computation and machine learning. MIT Press, 2012.
•  J. Newman. Necessary and Sufficient Conditions for Stable Synchronization in Random Dynamical Systems. arXiv.org, math.DS, February 2015.
•  F. J. Och and H. Ney. A comparison of alignment models for statistical machine translation. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, COLING ’00, pages 1086–1090, Stroudsburg, PA, USA, 2000. Association for Computational Linguistics.
•  F. J. Och and H. Ney. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417–449, 2004.
•  V. I. Oseledets. A multiplicative ergodic theorem: Lyapunov characteristic numbers for dynamical systems. Trans. Moscow Math. Soc., 19:197–231, 1968.
•  Y. Peres. Domains of analytic continuation for the top lyapunov exponent. Annales de l’I.H.P. Probabilit s et statistiques, 28(1):131–148, 1992.
•  Lawrence R. Rabiner. Readings in speech recognition. chapter A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, pages 267–296. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990.
•  E. Seneta. Non-negative Matrices and Markov Chains. Springer Series in Statistics. Springer New York, 1981.
•  Erik L. L. Sonnhammer, Gunnar von Heijne, and Anders Krogh. A hidden markov model for predicting transmembrane helices in protein sequences. In Proceedings of the 6th International Conference on Intelligent Systems for Molecular Biology, ISMB ’98, pages 175–182. AAAI Press, 1998.
•  M. Viana. Lectures on Lyapunov Exponents. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2014.
•  F. X.-F. Ye, Y. Wang, and H. Qian. Stochastic dynamics: Markov chains and random transformations. Discrete and Continuous Dynamical Systems - Series B, 21(7):2337–2361, 2016.

## Appendix A Birkhoff contraction coefficient

We will introduce Hilbert metric and Birkhoff contraction coefficient on positive matrices, especially on positive stochastic matrices [31, 11].

Let and be positive vectors in , the Hilbert metric is defined as . But Hilbert metric is not a metric in since one could check when for some constant , . Actually, for each positive probability vector in the interior of the simplex , determines a metric on them.

The advantage of Hilbert metric for the positive stochastic matrix

is one can show for two different positive probability row vector, and , the distance between and under monotonically decreases, . This is not guaranteed for other metrics due to the possible non-normal behavior of the matrix. The Birkhoff contraction coefficient is defined as the supreme of the contraction ratio under the matrix ,

 τ(M)=supd(xM,yM)d(x,y) (23)

This coefficient indicates how much and are drown together at least after multiplying by . Actually, there is an explicit formula for computing in terms of the entries of . Define as

 ϕ(M)=minp,q,r,sMpqMrsMrqMps (24)

The term is cross ratios of all sub matrices of and is the minimum amount of them. If there is a row with both zero and positive elements, . The formula for is

 τ(M)=1−√ϕ(M)1+√ϕ(M) (25)

As expected, for positive stochastic matrix , .

## Appendix B Random dynamical system

In this section, we review some important definitions and concepts in random dynamical system(RDS) in terms of this HMM problem. This material can be found in standard textbooks . It is presented for the convenience of the readers.

Random dynamical system defines on the metric dynamical system . is the set of all possible one sided infinitely long sequence of invertible matrices, and is the -algebra of .

 Ω={ω:(A1A2…)|Ai∈Rd×d}

The probability measure for given first values is

 P(A1A2…Ak)=f(A1)f(A2)…f(Ak)dλ1dλ2…dλk

where is the probability density function given in (3). It is clear that the random variable at each time step are i.i.d. The is a left shift operator on ,

 θ(A1A2…)=(A2A3…)

The state space is equipped with -algebra , and the mapping defined on the state space is , where is the matrix valued function and takes the first matrix in the sequence. In particular, . It is the product of i.i.d random matrices. This mapping has cocycle property, i.e, it satisfies for all . and for all , .

A skew product of

and is a measurable transformation : , defined by

 S(n):(ω,v)→(θ(n)ω,ϕ(n,ω)v)

This RDS induces a Markov process on and assume this Markov process has an invariant measure . There is a simple one-to-one correspondence between invariant measure of RDS and induced Markov process: A product measure is -invariant, moreover, if is ergodic, then is ergodic.

The multiplicative ergodic theorem states as follows,

###### Theorem B.1

Let be an ergodic measure preserving transformation of , Let be a matrix-valued function with . Then there exist ; satisfying and a measurable family of subspaces such that

1. filteration: .

2. dimension: dim ; for a.e. .

3. equivariance: for a.e. .

4. growth: If , then for a.e. .

The Lyapunov spectrum of RDS is defined as . The multiplicative ergodic theorem states for almost all and each non-zero vector , the Lyapunov spectrum exists, depends on up to different values but independent of the choice of the metric.