1 Introduction
Hidden Markov models (HMMs) are widely used in machine learning when the data samples are time dependent, for example in speech recognition, language processing, and video analysis. The graphical model of an HMM is shown in Figure 1. An HMM models a (time-dependent) sequence of data as indirect observations of an underlying Markov chain which is not available to us. Homogeneous HMMs are parsimonious models, in the sense that they are fully characterized by the transition probability and the emission probability, even though the size of the given data can be very large. Consider a homogeneous HMM such that:


a latent variable can take possible outcomes ;

an ambient variable can take possible outcomes .
Recall that (Rabiner and Juang, 1986; Ghahramani, 2001):


Given both and , the complete joint probability factors, and we can easily estimate the transition probability and the emission probability .

Given only
, but assuming we know the underlying transition and emission probabilities, we can calculate the observation likelihood using the forward algorithm, estimate the most likely hidden sequence using the Viterbi algorithm, and compute the posterior probability of the hidden states using the forward-backward algorithm.
The most natural problem setting, however, is when neither the hidden state sequence nor the underlying probabilities are known to us: we only have access to a sequence of observations, and our task is to reveal the HMM structure, characterized by the transition matrix and the emission probability, from the set of observations .
1.1 Related work
The traditional way of learning an HMM from is via expectation-maximization (EM) (Rabiner and Juang, 1986), in which the expectation step is performed by calling the forward-backward algorithm. This specific instance of EM is also called the Baum-Welch algorithm (Baum et al., 1970; Ghahramani, 2001). However, the complexity of Baum-Welch is prohibitive when is relatively large: the complexity of the forward-backward algorithm is , but EM converges slowly, so the forward-backward algorithm must be called many times. This is a critical issue, because an HMM can only be learned with high accuracy when the number of observation samples is large enough.
One way of designing scalable algorithms for learning HMMs is to work with sufficient statistics – a summary of the given observation sequence whose size does not grow with . Throughout this paper we assume that the HMM process is stationary (time-invariant), which is true almost surely if the underlying Markov process is ergodic and the process has been going on for a reasonable amount of time. With large enough, we can accurately estimate the co-occurrence probability between two consecutive emissions . According to the graphical model shown in Figure 1, it is easy to see that given the value of , is conditionally independent of all the other variables, leading to the factorization
(1) 
Let , , and , with their elements defined as
Then, equation (1) can be written compactly as
(2) 
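To make the factorization concrete, here is a minimal numerical sketch (all sizes and variable names are our own illustrative assumptions, not taken from the paper's experiments): with a column-stochastic emission matrix and a core matrix built from the stationary distribution and a row-stochastic transition matrix, the resulting co-occurrence matrix is a valid joint distribution whose row- and column-marginals coincide with the stationary distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 3, 10  # hypothetical sizes: K hidden states, N observed states

# column-stochastic emission matrix (N x K) and row-stochastic transition matrix (K x K)
M = rng.dirichlet(np.ones(N), size=K).T   # each column is a conditional distribution
A = rng.dirichlet(np.ones(K), size=K)     # each row is a conditional distribution

# stationary distribution of the hidden chain: left eigenvector of A for eigenvalue 1
w, V = np.linalg.eig(A.T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi = pi / pi.sum()

Theta = np.diag(pi) @ A        # joint distribution of two consecutive hidden states
Omega = M @ Theta @ M.T        # pairwise co-occurrence of two consecutive emissions

assert np.isclose(Omega.sum(), 1.0)
# row- and column-sums of Theta both equal the stationary distribution
assert np.allclose(Theta.sum(axis=1), pi) and np.allclose(Theta.sum(axis=0), pi)
```

Note that the core matrix here is generally not symmetric, even though its row- and column-marginals agree, matching the discussion in §3.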
Noticing that is a nonnegative matrix tri-factorization with a number of inconsequential constraints for and to properly represent probabilities, Vanluyten et al. (2008), Lakshminarayanan and Raich (2010), and Cybenko and Crespi (2011) proposed using nonnegative matrix factorization (NMF) to estimate the HMM probabilities. However, NMF-based methods have a serious shortcoming in this context: the tri-factorization (2) is in general not unique, because it is fairly easy to find a nonsingular matrix such that both and , and then and are equally good solutions in terms of reconstructing the co-occurrence matrix . When we use and to perform HMM inference, such as estimating hidden states or predicting new emissions, the two models often yield completely different results, unless is a permutation matrix.
A number of works propose to use tensor methods to overcome this identifiability issue. Instead of working with the pairwise co-occurrence probabilities, they start by estimating the joint probabilities of three consecutive observations
. Noticing that these three random variables are conditionally independent given
, the triple-occurrence probability factors into a form which admits a tensor canonical polyadic decomposition (CPD) model (Hsu et al., 2009; Anandkumar et al., 2012, 2014). Assuming , the CPD is essentially unique if two of the three factor matrices have full column rank and the other one is not rank one (Harshman, 1970); in the context of HMMs, this is equivalent to assuming that and both have linearly independent columns, which is a relatively mild condition. The CPD is known to be unique under much more relaxed conditions (Sidiropoulos et al., ), but in order to uniquely retrieve the transition probability using the relationship
is actually the best we can achieve using triple-occurrences without making further assumptions.¹ A salient feature in this case is that if the triple-occurrence probability is exactly given (meaning the rank of the triple-occurrence tensor is indeed smaller than ), the CPD can be efficiently calculated using generalized eigendecomposition and related algebraic methods (Sanchez and Kowalski, 1990; Leurgans et al., 1993; De Lathauwer et al., 2004). These methods do not work well, however, when the low-rank tensor is perturbed, e.g., due to insufficient mixing / sample averaging of the triple-occurrence probabilities.

¹In the appendix, we prove that if the emission probability is generic and the transition probability is sparse, the HMM can be uniquely identified from the triple-occurrence probability for using the latest tensor identifiability result (Chiantini and Ottaviani, 2012).
It is also possible to handle cases where . The key observation is that, given , is conditionally independent of and . Then, grouping
into a single categorical variable taking
possible outcomes, and into another one, we can construct a much bigger tensor of size , and then uniquely identify the underlying HMM structure with as long as certain linear independence requirements are satisfied for the conditional distribution of the grouped variables (Allman et al., 2009; Bhaskara et al., 2014; Huang et al., 2016b; Sharan et al., 2017). It is intuitively clear that for fixed , we need a much larger realization length in order to accurately estimate occurrence probabilities as grows, which is the price we need to pay for learning an HMM with a larger number of hidden states.

1.2 This paper
The focus of this paper is on cases where , and is large enough to obtain an accurate estimate of , but not large enough to accurately estimate triple- or higher-order occurrence probabilities. We prove that it is actually possible to recover the latent structure of an HMM from pairwise co-occurrence probabilities alone, provided that the underlying emission probability is sufficiently scattered. Compared to the existing NMF-based HMM learning approaches, our formulation employs a different (determinant-based) criterion to ensure identifiability of the HMM parameters. Our matrix factorization approach resolves cases that cannot be handled by tensor methods, namely when is insufficient to estimate third-order probabilities, under an additional condition that is quite mild: the emission probability matrix must be sufficiently scattered, rather than simply full column rank.
We apply our method to hidden topic Markov modeling (HTMM) (Gruber et al., 2007), in which case the number of hidden states (topics) is indeed much smaller than the number of ambient states (words). HTMM goes beyond the simple and widely used bag-of-words model by assuming that the (ordered) words in a document are emitted from a hidden topic sequence that evolves according to a Markov model. We show improved performance on real data when using this simple and intuitive model to take word ordering into account when learning topics, which also benefits from our identifiability-guaranteed matrix factorization method.
As an illustrative example, we showcase the inferred topic of each word in a news article (with stop words removed) in Figure 2, taken from the Reuters21578 data set obtained at (Mimaroglu, 2007). As we can see, HTMM infers much more consistent and smooth topics than those obtained from a bag-of-words model (cf. appendix for details). This result agrees with human understanding.
2 Second-order vs. Third-order Learning
We start by arguing that for the same observation data , the estimate of the pairwise co-occurrence probability is always more accurate than that of the triple co-occurrence probability .
Let us first explicitly describe the estimator we use for these probabilities. For each observation
, we define a coordinate vector
, and if . The natural estimator for the pairwise co-occurrence probability matrix is
(3) 
and similarly for the triple co-occurrence probability
(4) 
where denotes the vector outer product.²

²In some literature is written as the Kronecker product . Strictly speaking, the Kronecker product of three vectors is a very long vector, not a three-way array. For this reason, we chose to use instead of .
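Since the estimators (3) and (4) average outer products of one-hot (coordinate) vectors, they reduce to simple counting over consecutive observations. A minimal sketch of both estimators (function and variable names are our own):

```python
import numpy as np

def empirical_cooccurrence(x, N):
    """Estimate Omega[i, j] ~ P(x_t = i, x_{t+1} = j) from an observation sequence x."""
    Omega = np.zeros((N, N))
    for t in range(len(x) - 1):
        Omega[x[t], x[t + 1]] += 1.0   # outer product of two one-hot vectors
    return Omega / (len(x) - 1)

def empirical_triple(x, N):
    """Estimate Omega3[i, j, k] ~ P(x_t = i, x_{t+1} = j, x_{t+2} = k)."""
    Omega3 = np.zeros((N, N, N))
    for t in range(len(x) - 2):
        Omega3[x[t], x[t + 1], x[t + 2]] += 1.0
    return Omega3 / (len(x) - 2)

x = [0, 1, 2, 1, 0, 2, 2, 1]          # a toy observation sequence over N = 3 symbols
Om = empirical_cooccurrence(x, 3)
Om3 = empirical_triple(x, 3)
assert np.isclose(Om.sum(), 1.0) and np.isclose(Om3.sum(), 1.0)
```

Both estimators are normalized counts, so each sums to one; the sliding-window counting makes the summands dependent, which is exactly the point raised below about concentration.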
The first observation is that both and
are unbiased estimators: Obviously
and likewise for the triple-occurrences, and taking their averages does not change the expectation. However, the individual terms in the summation are not independent of each other, making it hard to determine how fast the estimates converge to their expectation. The state-of-the-art concentration result for HMMs (Kontorovich, 2006) states that, for any 1-Lipschitz function,
where is a constant that depends only on the specific HMM structure and not on the function (cf. (Kontorovich, 2006) for details). Taking as any entry in or , we can check that it is indeed 1-Lipschitz, meaning that as goes to infinity, both estimators converge to their expectation with negligible fluctuations.
We now prove that for a given set of observations , is always going to be more accurate than
. Since both represent probabilities, we use two common metrics to measure the differences between the estimators and their expectations: the Kullback-Leibler divergence
and the total-variation distance .

Proposition 1.
Let and be obtained from the same set of observations . Then
The proof of Proposition 1 is relegated to the appendix.
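The inequality in Proposition 1 reflects the data-processing behavior of KL divergence and total variation under marginalization: the pairwise quantities are marginals of the triple quantities, so the pairwise estimation error can never exceed the triple one. A small numeric sanity check with arbitrary stand-in distributions (two random joint distributions over a triple and their marginals, not an actual HMM):

```python
import numpy as np

rng = np.random.default_rng(1)

# two joint distributions over a triple of variables
# (stand-ins for the triple estimate and its expectation)
p3 = rng.random((4, 4, 4)); p3 /= p3.sum()
q3 = rng.random((4, 4, 4)); q3 /= q3.sum()

# marginalize out the third variable (stand-ins for the pairwise quantities)
p2, q2 = p3.sum(axis=2), q3.sum(axis=2)

kl = lambda p, q: np.sum(p * np.log(p / q))
tv = lambda p, q: 0.5 * np.abs(p - q).sum()

# marginals are never farther apart than the joints
assert kl(p2, q2) <= kl(p3, q3) + 1e-12
assert tv(p2, q2) <= tv(p3, q3) + 1e-12
```

The KL inequality follows from the log-sum inequality and the TV inequality from the triangle inequality, which is exactly the structure of the proof in Appendix A.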
3 Identifiability of HMMs from Pairwise Co-occurrence Probabilities
The arguments made in the previous section motivate going back to matrix factorization methods for learning an HMM when the realization length is not large enough to obtain accurate estimates of triple co-occurrence probabilities. As we explained in §1.1, the co-occurrence probability matrix admits a nonnegative matrix tri-factorization model (2). There are a number of additional equality constraints. Columns of represent conditional distributions, so . Matrix
represents the joint distribution between two consecutive Markovian variables, therefore
. Furthermore, and represent and respectively, and since we assume that the Markov chain is stationary, they are the same, i.e., . Notice that this does not imply that is symmetric; in fact, it is often not symmetric.

Huang et al. (2016a) considered a factorization model similar to (2) in a different context, and showed that identifiability can be achieved under a reasonable assumption called sufficiently scattered, defined as follows.
Definition 1 (sufficiently scattered).
Let denote the polyhedral cone , and denote the elliptical cone . Matrix is called sufficiently scattered if it satisfies that: (i) , and (ii) , where denotes the boundary of , .
The sufficiently scattered condition was first proposed in (Huang et al., 2014) to establish uniqueness conditions for the widely used nonnegative matrix factorization (NMF). For the NMF model , if both and are sufficiently scattered, then the nonnegative decomposition is unique up to column permutation and scaling. Follow-up work strengthened and extended the identifiability results based on this geometry-inspired condition (Fu et al., 2015; Huang et al., 2016a; Fu et al., 2018). A similar tri-factorization model was considered in (Huang et al., 2016a) in the context of bag-of-words topic modeling, where it was shown that among all feasible solutions of (2), if we find one that minimizes , then it recovers the ground-truth latent factors and , assuming the ground truth is sufficiently scattered. In our present context, we therefore propose the following problem formulation:
(5a)  
subject to  (5b)  
(5c)  
(5d) 
Regarding Problem (5), we have the following identifiability result.
Theorem 1.
One may notice that in (Huang et al., 2016a) there are no constraints on the core matrix as we impose in (5c). In terms of identifiability, it is easy to see that if the ground truth satisfies (5c), solving (5) even without (5c) will produce a solution that satisfies (5c), thanks to uniqueness. In practice, when we are given a less accurate , such “redundant” information helps us improve the estimation error, but that goes beyond identifiability considerations.
For the proof of Theorem 1, we refer the reader to (Huang et al., 2016a). Here we provide some insights into this geometry-inspired sufficiently scattered condition, and discuss why it is a reasonable (and thus practical) assumption. The notation comes from the convention in convex analysis that it is the dual cone of the conical hull of the row vectors of , i.e., . Similarly, we can derive that the dual cone of is . A useful property of the dual cone is that for two convex cones and , iff . Therefore, the first requirement of sufficiently scattered in Definition 1 equivalently means
We give a geometric illustration of the sufficiently scattered condition in Figure 2(b) for , focusing on the 2-dimensional plane . The intersection between this plane and the nonnegative orthant is the probability simplex, which is the triangle in Figure 2(b). The outer circle represents , and the inner circle represents , each again intersected with the plane. The rows of are scaled to sum to one, and are represented by black dots in Figure 2(b). Their conical hull is represented by the shaded region. The polygon with dashed lines represents the dual of , which is indeed a subset of , and touches the boundary of only at the coordinate vectors.
Figure 2(a) shows a special case of sufficiently scattered called separability, which first appeared in (Donoho and Stodden, 2004), also to establish uniqueness of NMF. In this case, all the coordinate vectors appear in rows of , so equals the nonnegative orthant. It makes sense that this condition makes the identification problem easier, but it is also a very restrictive assumption. The sufficiently scattered condition, on the other hand, only requires that the shaded region contain the inner circle, as shown in Figure 2(b). Intuitively, this requires that the rows of be “well scattered” in the probability simplex, but not to the extent of being “separable”. Separability-based HMM identification has been considered in (Barlier et al., 2015; Glaude et al., 2015); however, the way they construct second-order statistics is very different from ours. Figure 2(c) shows a case where is not sufficiently scattered, which also happens to be a case where is not identifiable.
As we can see, the elliptical cone is tangent to all the facets of the nonnegative orthant. As a result, for to be sufficiently scattered, it is necessary that enough rows of lie on the boundary of the nonnegative orthant, i.e., that is relatively sparse. Specifically, if is sufficiently scattered, then each column of contains at least zeros (Huang et al., 2014). This is a very important insight, since exactly checking whether a matrix is sufficiently scattered may be computationally hard. In the present paper we further show the following result.
Proposition 2.
The ratio between the volume of the hyperball obtained by intersecting and and the volume of the probability simplex is
(6) 
The proof is given in the appendix. As grows larger, the volume ratio (6) goes to zero at a super-exponential decay rate. This implies that the volume of the inner sphere quickly becomes negligible compared to the volume of the probability simplex as becomes moderately large. The take-home point is that, for a practical choice of , say , as long as each column of contains at least zeros and the positions of the zeros appear relatively random, it is very likely that is sufficiently scattered, and thus can be uniquely recovered by solving (5).
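Since exact verification of the sufficiently scattered condition may be hard, one can at least check the necessary sparsity condition above. The sketch below assumes the zero-count threshold is K−1 zeros per column, as stated in (Huang et al., 2014); the function name and test matrices are our own illustrations:

```python
import numpy as np

def passes_zero_count(M, tol=1e-12):
    """Necessary (not sufficient) condition for an N x K matrix M to be
    sufficiently scattered: every column has at least K - 1 (near-)zero
    entries (Huang et al., 2014)."""
    K = M.shape[1]
    zeros_per_col = (np.abs(M) < tol).sum(axis=0)
    return bool(np.all(zeros_per_col >= K - 1))

# a dense column-stochastic matrix fails the necessary condition
M_dense = np.full((6, 3), 1.0 / 6)
assert not passes_zero_count(M_dense)

# a matrix whose columns each contain enough zeros passes it
M_sparse = np.array([[0.5, 0.0, 0.0],
                     [0.5, 0.0, 0.0],
                     [0.0, 0.6, 0.0],
                     [0.0, 0.4, 0.0],
                     [0.0, 0.0, 0.7],
                     [0.0, 0.0, 0.3]])
assert passes_zero_count(M_sparse)
```

Passing this check does not certify the condition; it only rules out matrices that are clearly too dense to be sufficiently scattered.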
4 Algorithm
Our identifiability analysis based on the sufficiently scattered condition poses an interesting nonconvex optimization problem (5). First of all, the given co-occurrence probability may not be exact, so it may not be a good idea to impose (5b) as a hard constraint. For algorithm design, we propose the following modification of problem (5).
subject to  (7)  
In the loss function of (7), the first term is the Kullback-Leibler divergence between the empirical probability and the parameterized version (ignoring a constant), and the second term is our identifiability-driven regularization. We need to tune the parameter to yield good estimation results; intuitively, however, we should use a relatively small value. If is sufficiently accurate, the priority is to minimize the difference between and ; when there exist equally good fits, the second term comes into play and helps us pick out a solution that is sufficiently scattered.

Noticing that the constraints of (7) are all convex, but the loss function is not, we propose to design an iterative algorithm to solve (7) using successive convex approximation. At iteration , when the current updates are and , we define
(8) 
Obviously, and
, which defines a probability distribution for fixed
and . Using Jensen’s inequality (Jensen, 1906), we have that
(9) 
which defines a convex and locally tight upper bound for the first term in the loss function of (7). Regarding the second term in the loss of (7), we propose to simply take the linear approximation
(10) 
Combining (9) and (10), our successive convex approximation algorithm solves the following convex problem at iteration :
(11)  
subject to  
where we define . Problem (11) decouples with respect to and , so we can work out their updates individually.
The update of admits a simple closed-form solution, which can be derived by checking the KKT conditions. We denote the dual variable corresponding to as . Setting the gradient of the Lagrangian with respect to equal to zero, we have
and should be chosen so that the constraint is satisfied, which amounts to a simple rescaling.
The update of is not available as a closed-form expression, but it can still be obtained very efficiently. Noticing that the nonnegativity constraint is implicitly enforced by the individual functions in the loss, we propose to solve it using Newton’s method with equality constraints (Boyd and Vandenberghe, 2004, §10.2). Although Newton’s method requires solving a linear system of equations in each iteration, there is special structure we can exploit to reduce the per-iteration complexity down to : the Hessian of the loss function of (11) is diagonal, and the linear equality constraints are highly structured; using block elimination (Boyd and Vandenberghe, 2004, §10.4.2), we ultimately only need to solve a positive definite linear system with variables. Together with the quadratic convergence rate of Newton’s method, the complexity of updating is , where is the desired accuracy of the update. Since the complexity of a naive implementation of Newton’s method would be , the difference is big for moderately large . The inline implementation of this tailored Newton’s method, ThetaUpdate, and the detailed derivation can be found in the appendix.
The entire proposed algorithm to solve Problem (7) is summarized in Algorithm 1. Notice that there is an additional line-search step to ensure decrease of the loss function. The constraint set of (7) is convex, so the line-search step will not incur infeasibility. Computationally, we find that any operation that involves can be carried out succinctly by defining the intermediate matrix , where “” denotes element-wise division between two matrices of the same size. The per-iteration complexity of Algorithm 1 is completely dominated by the operations that involve computing with , notably compared with that of ThetaUpdate. In terms of initialization, which is important since we are optimizing a nonconvex problem, we propose to use the method by Huang et al. (2016a) to obtain an initialization for ; for , it is best to start with a feasible point (so that the Newton iterates remain feasible), and a simple choice is scaling the matrix to sum up to one. Finally, we show that this algorithm converges to a stationary point of Problem (7), with the proof relegated to the appendix, based on (Razaviyayn et al., 2013).
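The successive-convex-approximation pattern used here (build a convex, locally tight upper bound at the current iterate, minimize it, then line-search on the true loss) can be illustrated on a toy scalar problem. This is only a schematic analogue of Algorithm 1, not the algorithm itself; the objective and update rule below are our own illustration:

```python
import numpy as np

def f(x):
    # nonconvex toy objective: a convex term x^4 plus a concave term -3x^2
    return x**4 - 3 * x**2

def surrogate_min(xk):
    # minimize the convex upper bound x^4 - 3*xk^2 - 6*xk*(x - xk),
    # obtained by replacing the concave part with its tangent at xk
    # (tight at x = xk); setting the derivative 4x^3 - 6*xk to zero gives:
    return np.cbrt(1.5 * xk)

x = 1.0
for _ in range(100):
    x_new = surrogate_min(x)
    # backtracking line search toward x_new until the true loss decreases
    t = 1.0
    while f(x + t * (x_new - x)) > f(x) and t > 1e-8:
        t *= 0.5
    x = x + t * (x_new - x)

# the iterates converge to a stationary point of f
assert abs(4 * x**3 - 6 * x) < 1e-6
```

Because the surrogate upper-bounds the true loss and is tight at the current iterate, the full step already decreases the loss here; the line search is kept as a safeguard, mirroring its role in Algorithm 1.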
5 Validation on Synthetic Data
In this section we validate the identifiability performance on synthetic data. The underlying transition and emission probabilities are generated synthetically, and we compare them with the estimated ones to evaluate performance. The simulations are conducted in MATLAB using the HMM toolbox, which includes functions to generate observation sequences given transition and emission probabilities, as well as an implementation of the Baum-Welch algorithm (Baum et al., 1970), i.e., the EM algorithm, to estimate the transition and emission probabilities from the observations. Unfortunately, even for some of the moderate problem sizes we considered, the streamlined MATLAB implementation of the Baum-Welch algorithm was not able to execute within a reasonable amount of time, so its performance is not included here. For the baselines, we compare with the plain NMF approach using multiplicative updates (Vanluyten et al., 2008) and the tensor CPD approach (Sharan et al., 2017) using simultaneous diagonalization with Tensorlab (Vervliet et al., 2016). Since we work with empirical distributions instead of exact probabilities, the result of the simultaneous diagonalization is not going to be optimal. We therefore use it to initialize the EM algorithm for fitting a nonnegative tensor factorization with KL divergence loss (Shashanka et al., 2008) for refinement.
We focus on the cases where the number of hidden states is smaller than the number of observed states . As we explained in the introduction, even for this seemingly easier case, it was not known whether unique recovery of the HMM parameters can be guaranteed just from the pairwise co-occurrence probability. What is known is that the tensor CPD approach is able to guarantee identifiability given the exact triple-occurrence probability. We will demonstrate in this section that it is much harder to obtain an accurate triple-occurrence probability than the co-occurrence probability. As a result, if the sufficiently scattered assumption holds for the emission probability, the estimated parameters obtained from our method are always more accurate than those obtained from tensor CPD.
Fixing and , we let the number of HMM realizations go from to , and compare the estimation errors for the transition matrix and emission matrix obtained by the aforementioned methods. We show the total variation distance between the ground-truth probabilities and and their estimates and using the various methods. The result is shown in Figure 4. As we can see, the proposed method indeed works best, obtaining almost perfect recovery when the sample size is above . The CPD-based method does not work as well, since it cannot obtain accurate estimates of the third-order statistics that it needs. Initialized by CPD, EM improves over CPD, but its performance is still far from that of the proposed method. NMF also does not work well, since it does not have identifiability in this case.
6 Application: Hidden Topic Markov Model
Analyzing text data is one of the core application domains of machine learning. There are two prevailing approaches to modeling text data. The classical bag-of-words model assumes that each word is independently drawn from certain multinomial distributions. These distributions are different across documents, but can be efficiently summarized by a small number of topics, again mathematically modeled as distributions over words; this task is widely known as topic modeling (Hofmann, 2001; Blei et al., 2003). However, it is obvious that the bag-of-words representation is oversimplified. The n-gram model, on the other hand, assumes that words are conditionally dependent up to a window length of . This seems to be a much more realistic model, although the choice of is totally unclear, and is often dictated by memory and computational limitations in practice, since the size of the joint distribution grows exponentially with . What is more, it is somewhat difficult to extract “topics” from this model, despite some preliminary attempts (Wallach, 2006; Wang et al., 2007).
We propose to model a document as the realization of an HMM, in which the topics are hidden states emitting words, and the states evolve according to a Markov chain, hence the name hidden topic Markov model (HTMM). For a set of documents, this means we are working with a collection of HMMs. As in other topic modeling works, we assume that the topic matrix is shared among all documents, meaning that all the given HMMs share the same emission probability. Under the bag-of-words model, each document has a specific topic distribution , whereas in our model, each document has its own topic transition probability ; as per our previous discussion, the row-sum and column-sum of are the same, and they give the topic probability for the specific document. The difference is the Markovian assumption on the topics, rather than the oversimplifying independence assumption.
We can see some immediate advantages of the HTMM. Since the Markovian assumption is only imposed on the topics, which are not exposed to us, the observations (words) are not independent of each other, which agrees with our intuition. On the other hand, we now understand that although word dependencies extend over a wide neighborhood, we only need to work with pairwise co-occurrence probabilities, or 2-grams. This releases us from picking a window length as in the n-gram model, while maintaining dependencies between words well beyond a neighborhood of words. It also includes the bag-of-words assumption as a special case: if the topics of the words are indeed independent, this just means that the transition probability has the special form . The closest work to ours is by Gruber et al. (2007), which is also termed the hidden topic Markov model. However, they make the simplifying assumption that the transition probability takes the form , meaning the topic of each word is either the same as the previous one or independently drawn from . Both models are special cases of our general HTMM.
In order to learn the shared topic matrix , we can use the co-occurrence statistics of the entire corpus: denote the co-occurrence statistics of the th document as ; then ; consequently
which is an unbiased estimator for
where is the length of the th document, and is conceptually a weighted average of all the topic-transition matrices. Then we can apply Algorithm 1 to learn the topic matrix.
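Concretely, the corpus-level statistic can be formed by pooling per-document co-occurrence counts and normalizing by the total number of consecutive word pairs, which implicitly weights each document by its length. A hedged sketch (function and variable names are our own illustrations):

```python
import numpy as np

def corpus_cooccurrence(docs, N):
    """Aggregate per-document word co-occurrence counts into one
    corpus-level estimate; each document contributes in proportion
    to its number of consecutive word pairs."""
    total = np.zeros((N, N))
    pairs = 0
    for doc in docs:                      # each doc: a list of word indices
        for t in range(len(doc) - 1):
            total[doc[t], doc[t + 1]] += 1.0
        pairs += max(len(doc) - 1, 0)
    return total / pairs

# toy corpus over a vocabulary of N = 3 words
docs = [[0, 1, 1, 2], [2, 2, 0], [1, 0]]
Om = corpus_cooccurrence(docs, 3)
assert np.isclose(Om.sum(), 1.0)
```

The resulting matrix plays the role of the input to Algorithm 1, with the shared emission (topic) matrix factored out of it.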
We illustrate the performance of our HTMM by comparing it to three popular bag-of-words topic modeling approaches: pLSA (Hofmann, 2001), LDA (Blei et al., 2003), and FastAnchor (Arora et al., 2013), which guarantees identifiability if every topic has a characteristic anchor word. Our HTMM model guarantees identifiability if the topic matrix is sufficiently scattered, which is a more relaxed condition than the anchor-word one. On the Reuters21578 data set obtained at (Mimaroglu, 2007), we use the raw documents to construct the word co-occurrence statistics, as well as bag-of-words representations of each document for the baseline algorithms. We use the version in which the stop words have been removed, which makes the HTMM model more plausible, since any syntactic dependencies have been removed, leaving only semantic dependencies. The vocabulary size of Reuters21578 is around , making any method relying on triple-occurrences impossible to implement; that is why tensor-based methods are not compared here.
Because of page limitations, we only show the quality of the topics learned by the various methods in terms of coherence. Simply put, higher coherence means more meaningful topics; the concrete definition can be found in (Arora et al., 2013). In Figure 5, we can see that for the different numbers of topics we tried on the entire dataset, HTMM consistently produces the topics with the highest coherence. Additional evaluations can be found in the appendix.
7 Conclusion
We presented an algorithm for learning hidden Markov models in an unsupervised setting, i.e., using only a sequence of observations. Our approach is guaranteed to uniquely recover the ground-truth HMM structure using only pairwise co-occurrence probabilities of the observations, under the assumption that the emission probability is sufficiently scattered. Unlike EM, the complexity of the proposed algorithm does not grow with the length of the observation sequence. Compared to tensor-based methods for HMM learning, our approach only requires reliable estimates of pairwise co-occurrence probabilities, which are easier to obtain. We applied our method to topic modeling, assuming each document is a realization of an HMM rather than a simpler bag-of-words model, and obtained improved topic coherence results. We refer the reader to the appendix for detailed proofs of the propositions and additional experimental results.
Appendix
Appendix A Proof of Proposition 1
For categorical probabilities and , their Kullback-Leibler divergence is defined as
and their total variation distance is defined as
The key to proving Proposition 1 is the fact that the co-occurrence probability can be obtained by marginalizing the triple-occurrence probability , i.e.,
Similarly, this holds for the cumulative estimates described in §2 of the main paper as well,
Using the log-sum inequality, we have that
Summing both sides over and , we obtain
Using Hölder’s inequality with norm and norm, we have that
Summing both sides over and and then dividing by 2, we obtain
Q.E.D.
Appendix B Proof of Proposition 2
The volume of a hyperball in with radius is
The elliptical cone intersecting with the hyperplane is a hyperball in with radius . Therefore, the volume of the inner ball is
The nonnegative orthant intersecting with is a regular simplex in with side length . Its volume is
Their ratio is
Q.E.D.
This volume ratio is plotted in Figure 6. As we can see, as increases, the volume ratio indeed goes to zero at a super-exponential rate.
Appendix C Derivation of ThetaUpdate
It is described in (Boyd and Vandenberghe, 2004, §10.2) that for solving a convex equality constrained problem
subject to 
using Newton’s method, we start at a feasible point , and the iterative update takes the form , where the Newton direction is calculated by solving the KKT system
Assuming and have full row rank, the KKT system can be solved via elimination, as described in (Boyd and Vandenberghe, 2004, Algorithm 10.3). Suppose ; if is diagonal, the cost of calculating is dominated by forming and inverting the matrix , with diagonal.
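The block-elimination idea can be sketched in a few lines: with a diagonal Hessian, the Schur complement A H⁻¹ Aᵀ is cheap to form, and only a small positive definite system needs to be solved. This is an illustrative generic version, not the paper's ThetaUpdate routine:

```python
import numpy as np

def newton_direction_diag(H_diag, A, g):
    """Solve the KKT system [[H, A^T], [A, 0]] [d; nu] = [-g; 0]
    by block elimination, exploiting a diagonal Hessian H."""
    Hinv = 1.0 / H_diag                        # O(n) inverse of the diagonal Hessian
    S = (A * Hinv) @ A.T                       # Schur complement A H^{-1} A^T
    nu = np.linalg.solve(S, -(A * Hinv) @ g)   # small p x p positive definite system
    d = -Hinv * (g + A.T @ nu)                 # recover the Newton direction
    return d, nu

rng = np.random.default_rng(2)
n, p = 8, 3
H_diag = rng.random(n) + 0.5                   # positive diagonal Hessian
A = rng.random((p, n))                         # full-row-rank constraint matrix
g = rng.random(n)                              # gradient at the current feasible point

d, nu = newton_direction_diag(H_diag, A, g)

# verify against solving the full (n + p) x (n + p) KKT system directly
KKT = np.block([[np.diag(H_diag), A.T], [A, np.zeros((p, p))]])
sol = np.linalg.solve(KKT, np.concatenate([-g, np.zeros(p)]))
assert np.allclose(np.concatenate([d, nu]), sol)
```

The elimination reduces the dominant cost from solving an (n + p)-dimensional system to a p-dimensional one, which is the source of the complexity savings claimed for ThetaUpdate.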
Now we follow the steps of (Boyd and Vandenberghe, 2004, Algorithm 10.3) to derive explicit Newton iterates for solving (11). First, we rewrite the part of (11) that involves here:
subject to 
Let , then equality constraint has the form where
Matrix does not have full row rank, because the last row of is implied by the rest. Therefore, we can discard the last equality constraint. We keep it when calculating matrix multiplications for simpler expressions, and discard the corresponding entry or column/row in other operations.
Obviously has the form
which costs flops. For a slightly more complicated multiplication
which also costs flops to compute. For ,
At point , the negative gradient is where
and the inverse of the Hessian where
Let
and then delete the last column and row of , and
and then delete the last entry of . We can first solve for by
Then we append a zero at the end of and define
The Newton direction can then be obtained via
In matrix form, it is equivalent to
The inline implementation of ThetaUpdate is given here.