1 Introduction
Unsupervised learning of latent variable models is a fundamental machine learning problem. Algorithms for learning a variety of latent variable models, including topic models, hidden Markov models, and mixture models are routinely used in practical applications for solving tasks ranging from representation learning to exploratory data analysis. For practitioners faced with the problem of training a latent variable model, the decadesold ExpectationMaximization (EM) algorithm EMDempster is still the tool of choice. Despite its theoretical limitations, EM owes its appeal to (i) the robustness of the maximumlikelihood principle to model misspecification, and (ii) the need, in most cases, to tune a single parameter: the dimension of the latent variables.
On the other hand, method of moments (MoM) algorithms for learning latent variable models via efficient tensor factorization algorithms have been proposed in the last few years TensorLatent ; SpectralLatent ; SpectralLDA ; jain2014learning ; hsu2013learning ; song2011spectral ; EMSlow ; chaganty2013spectral . Compared to EM, momentbased algorithms provide a stronger theoretical foundation for learning latent variable models. In particular, it is known that in the realizable setting the output of a MoM algorithm will converge to the parameters of the true model as the amount of training data increases. Furthermore, MoM algorithms only make a single pass over the training data, are highly parallelizable, and always terminate in polynomial time. However, despite their apparent advantages over EM, the adoption of MoM algorithms in practical applications is still limited.
Empirical studies indicate that initializing EM with the output of a MoM algorithm can improve the convergence speed of EM by several orders of magnitude, yielding a very efficient strategy to accurately learn latent variable models EMSlow ; chaganty2013spectral ; bailly2011quadratic . In the case of relatively simple models this approach can be backed by intricate theoretical analyses zhang2014spectral . Nonetheless, these strategies are not widely deployed in practice either.
The main reason why MoM algorithms are not adopted by practitioners is their lack of robustness to model misspecification. Even when combined with EM, MoM algorithms fail to provide an initial estimate for the parameters of a model leading to fast convergence when the learning problem is too far from the realizable setting. For example, this happens when the number of the latent variables used in a MoM algorithm is too small to accurately represent the training data. In contrast, the model obtained by standalone EM in this case is reasonable and desirable: when asked for a small number of latent variables EM yields a model which is easy to interpret and can be useful for data visualization and exploration. For example, an important application of lowdimensional learning can be found in mixture models, where latent class assignments provided by a simple model can be used to split the training data into disjoint datasets to which EM is applied recursively to produce a hierarchical clustering
steinbach2000comparison ; savaresi2001performance . The tree produced by such clusterings procedure provides a useful aid in data exploration and visualization even if the models learned at each branching point do not accurately represent the training data.In this paper we develop a hierarchical method of moments that produces meaningful results even in misspecified settings. Our approach is different from previous attemps to design MoM algorithms for misspecified models. Instead of looking for convex relaxations of existing MoM algorithms like in DBLP:conf/icml/BalleQC12 ; DBLP:conf/nips/BalleM12 ; DBLP:conf/icml/QuattoniBCG14 or analyzing the behavior of a MoM algorithm with a misspecified number of latent states like in kulesza2014low ; kulesza2015low , we generalize wellknown simultaneous diagonalization approaches to tensor decomposition by phrasing the problem as a nonconvex optimization problem. Despite its nonconvexity, the hierarchical nature of our method allows for a fast accurate solution based on lowdimensional grid search. We test our method on synthetic and realworld datasets on the topic modeling task, showcasing the advantages of our approach and obtaining meaningful results.
2 Moments, Tensors, and Latent Variable Models
This section starts by recalling the basic ideas behind methods of moments for learning latent variable models via tensor decompositions. Then we review existing tensor decomposition algorithms and discuss the effect of model misspecification on the output of such algorithms.
For simplicity we consider first a single topic model with topics over a vocabulary with words. A single topic model defines a generative process for text documents where first a topic is drawn from some discrete distribution , and then each word , , in a document of length is independently drawn from some distribution
over words conditioned on the document topic. The model is completely specified by the vector of topic proportions
and the word distributions for each topic . We collect the word distributions of the model as the columns of a matrix .It is convenient to represent the words in a document using onehot encodings so that
is an indicator vector. With this notation, the conditional expectation of any word in a document drawn from topic is , and the random vectoris conditionally distributed as a multinomial random variable, with parameters
and . Integrating over topics drawn from we obtain the first moment of the distribution over words . Generalizing this argument to pairs and triples of distinct words in a document yields the matrix of second order moments and the tensor of third order moments of a single topic model:(1)  
(2) 
where denotes the tensor (Kronecker) product between vectors. By defining the matrix one also obtains the expression .
A method of moments for learning single topic models proceeds by (i) using a collection of documents to compute empirical estimates , , of the moments, and (ii) using matrix and tensor decomposition methods to (approximately) factor these empirical moments and extract the model parameters from their decompositions. From the algorithmic point of view, the appeal of this scheme resides in the fact that step (i) requires a single pass over the data which can be trivially parallelized using mapreduce primitives, while step (ii) only requires linear algebra operations whose running time is independent of . The specifics of step (ii) will be discussed in Section 2.1.
Estimating moments from data with the property that for is the essential requirement for step (i). In the case of single topic models, and more generally multiview models, such estimations are straightforward. For example, a simple consistent estimator takes a collection of documents and computes using the first three words from each document. More dataefficient estimators for datasets containing long documents can be found in the literature ruffini2016new .
For more complex models the method sketched above requires some modifications. Specifically, it is often necessary to correct the statistics directly observable from data in order to obtain vectors/matrices/tensors whose expectation over a training dataset exhibits precisely the relation with the parameters and described above. For example, this is the case for Latent Dirichlet Allocation and mixtures of spherical Gaussians SpectralLDA ; hsu2013learning . For models with temporal dependence between observations, e.g. hidden Markov models, the method requires a spectral projection of observables to obtain moments behaving in a multiviewlike fashion SpectralLatent ; SpectralLatentHMM . Nonetheless, methods of moments for these models and many others always reduces to the factorization of a matrix and tensor of the form and given above.
2.1 Existing Tensor Decomposition Algorithms
Mathematically speaking, methods of moments attempt to solve the polynomial equations in and arising from plugging the empirical estimates into the expressions for their expectations given above. Several approaches have been proposed to solve these nonlinear systems of equations.
A popular method for tensor decomposition is Alternating Least Squares (ALS) kolda2009tensor . Starting from a random initialization of the factors composing a tensor, ALS iteratively fixes two of the three factors, and updates the remaining one by solving an overdetermined linear least squares problem. ALS is easy to implement and to understand, but is known to be prone to local minima, needing several random restarts to yield meaningful results. These limitations fostered the research for methods with guarantees, that, in the unperturbed setting, optimally decompose a tensor like the one in Eq. (2). We now briefly analyze some of these methods.
The tensor power method (TPM) TensorLatent starts with a whitening step where, given the SVD , the whitening matrix is used to transform into a symmetric orthogonally decomposable tensor
(3) 
The weights and vectors are then recovered from using a tensor power method and inverting the whitening step.
The same whitening matrix is used in SpectralLatent ; SpectralLDA , where the authors observe that the whitened slices of are simultaneously diagonalized by the MoorePenrose pseudoinverse of . Indeed, since , there exists a unique orthonormal matrix such that . Writing for the th slice of across its second mode and for the th row of , it follows that
Thus, the problem can be reduced to searching for the common diagonalizer of the whitened slices of defined as
(4) 
In the noiseless settings it is sufficient to diagonalize any of the slices . However, one can also recover
as the eigenvectors of a random linear combination of the various
which is more robust to noise SpectralLatent .Lastly, the method proposed in kuleshov2015tensor consists in directly performing simultaneous diagonalization of random linear combinations of slices of without any whitening step. This method, which in practice is slower than the others (see Section 4.1), under an incoherence assumption on the vectors , can robustly recover the weights and vectors from the tensor , even when it is not orthogonally decomposable.
2.2 The Misspecified Setting
The methods listed in the previous section have been analyzed in the case where the algorithm only has access to noisy estimates of the moments. However, such analyses assume that the data was generated by a model from the hypothesis class, that the matrix has rank , and that this rank is known to the algorithm. In practice the dimension of the latent variable can be crossvalidated, but in many cases this is not enough: data may come from a model outside the class, or from a model with a very large true . Besides, the moment estimates might be too noisy to provide reliabe estimates for large number of latent variables. It is thus frequent to use these algorithms to estimate latent variables. However, existing algorithms are not robust in this setting, as they have not been designed to work in this regime, and there is no theoretical explanation of what their outputs will be.
The methods relying on a whitening step TensorLatent ; SpectralLatent ; SpectralLDA , will perform the whitening using the matrix obtained from the lowrank SVD truncated at rank : . TPM will use to whiten the tensor to a tensor . However, when , may not admit a symmetric orthogonal decomposition ^{1}^{1}1See the supplementary material for an example corroborating this statement.. Consequently, it is not clear what TPM will return in this case and there are no guarantees it will even converge. The methods from SpectralLatent ; SpectralLDA will compute the matrices for that may not be jointly diagonalizable, and in this case there is no theoretical justification of what what the result of these algorithms will be. Similarly, the simultaneous diagonalization method proposed in kuleshov2015tensor produces a matrix that nearly diagonalizes the slices of , but no analysis is given for this setting.
3 Simultaneous Diagonalization Based on Whitening and Optimization
This section presents the main contribution of the paper: a simultaneous diagonalization algorithms based on whitening and optimization we call SIDIWO (Simultaneous Diagonalization based on Whitening and Optimization). When asked to produce components in the noiseless setting, SIDIWO will return the same output as any of the methods discussed in Section 2.1. However, in contrast with those methods, SIDIWO will provide useful results with a clear interpretation even in a misspecified setting ().
3.1 SIDIWO in the Realizable Setting
To derive our SIDIWO algorithm we first observe that in the noiseless setting and when , the pair returned by all methods described in Section 2.1 is the solution of the optimization problem given in the following lemma^{2}^{2}2The proofs of all the results are provided in the supplementary material..
Lemma 3.1
Remark 1 (The role of the constraint)
Consider the cost function of Problem (5
): in an unconstrained setting, there may be several matrices minimizing that cost. A trivial example is the zero matrix. A less trivial example is when the rows of
belong to the orthogonal complement of the column space of the matrix . The constraint for some orthonormal matrix first excludes the zero matrix from the set of feasible solutions, and second guarantees that all feasible solutions lay in the space generated by the columns of .Problem (5) opens a new perspective on using simultaneous diagonalization to learn the parameters of a latent variable model. In fact, one could recover the pair from the relation by first finding the optimal and then individually retrieving and by solving a linear system using the vector . This approach, outlined in Algorithm 1, is an alternative to the ones presented in the literature up to now (even though in the noiseless, realizable setting, it will provide the same results). Similarly to existing methods, this approach requires to know the number of latent states. We will however see in the next section that Algorithm 1 provides meaningful results even when a misspecified number of latent states is used.
3.2 The Misspecified Setting
Algorithm 1 requires as inputs the low order moments , , along with the desired number of latent states to recover. If , it will return the exact model parameters ; we will now see that it will also provide meaningful results when . In this setting, Algorithm 1 returns a pair such that the matrix is optimal for the optimization problem
(6) 
Analyzing the space of feasible solutions (Theorem 3.1) and the optimization function (Theorem 3.2), we will obtain theoretical guarantees on what SIDIWO returns when , showing that the trivial solutions are not feasible, and that, in the space of feasible solutions, SIDIWO’s optima will approximate the true model parameters according to an intuitive geometric interpretation.
Remarks on the constraints.
The first step consists in analyzing the space of feasible solutions when . The observations outlined in Remark 1 still hold in this setting: the zero solution and the matrices laying in the orthonormal complement of are not feasible. Furthermore, the following theorem shows that other undesirable solutions will be avoided.
Theorem 3.1
Let with rows , and let denote the identity matrix. The following facts hold under the hypotheses of Lemma 3.1:

[itemsep=1pt, topsep=1pt, partopsep=0pt]

For any row , there exists at least one column of such that .

The columns of any satisfying are a linear combination of those of , laying in the bestfit dimensional subspace of the space spanned by the columns of .

Let be any permutation of , and let and be obtained by permuting the columns of and according to . If for any , then , and similarly .
The second point of Theorem 3.1 states that the feasible solutions will lay in the best dimensional subspace approximating the one spanned by the columns of . This has two interesting consequences: if the columns of are not orthogonal, point 3 guarantees that cannot simply be a subblock of the original , but rather a nontrivial linear combination of its columns laying in the best dimensional subspace approximating its column space. In the single topic model case with topics, when asked to recover topics, Algorithm 1 will not return a subset of the original topics, but a matrix whose columns gather the original topics via a non trivial linear combination: the original topics will all be represented in the columns of with different weights. When the columns of are orthogonal, this space coincides with the space of the columns of associated with the largest ; in this setting, the matrix (for some permutation ) is a feasible solution and minimizes Problem (6). Thus, Algorithm 1 will recover the top topics.
Interpreting the optima.
Let be such that is a minimizer of Problem (6). In order to better understand the relation between and the original matrix , we will show that the cost function of Problem (6) can be written in an equivalent form, that unveils a geometric interpretation.
Theorem 3.2
Let denote the rows of and introduce the following optimization problem
(7) 
where . Then this problem is equivalent to (6).
First, observe that the cost function in Equation (7) prefers ’s such that the vectors , , have disjoint support. This is a consequence of the , and requires that, for each , the entries are close zero for at least all but one of the various . Consequently, each center will be almost orthogonal to all but one row of the optimal ; however the number of centers is greater than the number of rows of , so the same row may be nonorthogonal to various centers.
For illustration, consider the single topic model: a solution to Problem (7) would have rows that should be as orthogonal as possible to some topics and as aligned as possible to the others; in other words, for a given topic , the optimization problem is trying to set for all but one of the various . Consequently, each column of the output of Algorithm 1 should be in essence aligned with some of the topics and orthogonal to the others.
It is worth mentioning that the constraint set forbids the trivial solutions such as the zero matrix, the pseudoinverse of any subset of columns of , and any subset of rows of (which all have an objective value of ).
We remark that Theorem 3.2 doesn’t require the matrix to be full rank : we only need it to have at least rank greater or equal to , in order to guarantee that the constraint set is well defined.
An optimal solution when .
While Problem (5) can be solved in general using an extension of the Jacobi technique cardoso1996jacobi ; bunse1993numerical , we provide a simple and efficient method for the case . This method will then be used to perform hierarchical topic modeling in Section 4. When , Equation (6) can be solved optimally with few simple steps; in fact, the following theorem shows that solving (6) is equivalent to minimizing a continuous function on the compact onedimensional set , which can easily be done by griding . Using this in Step 3 of Algorithm 1, one can efficiently compute an arbitrarily good approximation of the optimal matrix .
Theorem 3.3
Consider the continuous function , where the coefficients are functions of the entries of and . Let be the minimizer of on , and consider the matrix
Then, the matrix is a minimizer of Problem (6) when .
4 Case Study: Hierarchical Topic Modeling
In this section, we show how SIDIWO can be used to efficiently recover hierarchical representations of latent variable models. Given a latent variable model with states, our method allows to recover a pair from estimate of the moments , and , where the columns of offer a synthetic representation of the original centers. We will refer to these vectors as pseudocenters: each pseudocenter is representative of a group of the original centers. Consider the case . A dataset of samples can be split into two smaller subsets according to their similarity to the two pseudocenters. Formally, this assignment is done using Maximum A Posteriori (MAP) to find the pseudocenter giving maximum conditional likelihood to each sample. The splitting procedure can be iterated recursively to obtain a divisive binary tree, leading to a hierarchical clustering algorithm. While this hierarchical clustering method can be applied to any latent variable model that can be learned with the tensor method of moments (e.g. Latent Dirichlet Allocation), we present it here for the single topic model for the sake of simplicity.
We consider a corpus of texts encoded as in Section 2 and we split
into two smaller corpora according to their similarity to the two pseudocenters in two steps: project the pseudocenters on the simplex to obtain discrete probability distributions (using for example the method described in
duchi2008efficient ), and use MAP assignment to assign each text to a pseudocenter. This process is summarized in Algorithm 2.Once the corpus has been split into two subsets and , each of these subsets may still contain the full set of topics but the topic distribution will differ in the two: topics similar to the first pseudocenter will be predominant in the first subset, the others in the second. By recursively iterating this process, we obtain a binary tree where topic distributions in the nodes with higher depth are expected to be more concentrated on fewer topics.
In the next sections, we assess the validity of this approach on both synthetic and realworld data^{3}^{3}3The experiments in this section have been performed in Python 2.7, using numpy (van2011numpy, ) library for linear algebra operations, with the exception of the implementation of the method from kuleshov2015tensor , for which we used the author’s Matlab implementation: https://github.com/kuleshov/tensorfactorization. All the experiments were run on a MacBook Pro with an Intel Core i5 processor. The implementation of the described algorithms can be found a this link: https://github.com/mruffini/HierarchicalMethodsofMoments..
4.1 Experiment on Synthetic Data
reports the average and standard deviation over 10 runs of the clustering accuracy for the various methods, along with average running times.
In order to test the ability of SIDIWO to recover latent structures in data, we generate a dataset distributed as a single topic model (with a vocabulary of 100 words) whose topics have an intrinsic hierarchical structure depicted in Figure 0(a). In this figure, topics are on the axis, words on the axis, and green (resp. red) points represents high (resp low) probability. We see for example that the first 4 topics are concentrated over the 1st half of the vocabulary, and that topics 1 and 2 have high probability on the 1st and 3rd fourth of the words while for the other two it is on the 1st and 4th.
We generate samples according to this model and we iteratively run Algorithm 2 to create a hierarchical binary tree with 8 leafs. We expect leafs to contain samples from a unique topic and internal nodes to gather similar topics. Results are displayed in Figure 0(b) where each chart represents a node of the tree (child nodes lay below their parent) and contains the heatmap of the samples clustered in that node (axis corresponds to samples and axis to words, red points are infrequent words and clear points frequent ones). The results are as expected: each leaf contains samples from one of the topics and internal nodes group similar topics together.
We compare the clustering accuracy of SIDIWO with other methods using the Adjusted Rand Index hubert1985comparing of the partition of the data obtained at the leafs w.r.t the one obtained using the true topics; comparisons are with the flat clustering on topics with TPM, the method from SpectralLatent (SVD), the one from kuleshov2015tensor (Rand. Proj.) and ALS from kolda2009tensor , where ALS is applied to decompose a whitened tensor , calculated as in Equation (3). We repeat the experiment 10 times with different random samples and we report the average results in Table 0(c); SIDIWO always recovers the original topic almost perfectly, unlike competing methods. One intuition for this improvement is that each split in the divisive clustering helps remove noise in the moments.
4.2 Experiment on NIPS Conference Papers 19872015
We consider the full set of NIPS papers accepted between 1987 and 2015 , containing papers perrone2016poisson . We assume that the papers are distributed according to a single topic model, we keep the most frequent words as vocabulary and we iteratively run Algorithm 2 to create a binary tree of depth 4. The resulting tree is shown in Figure 2 where each node contains the most relevant words of the cluster, where the relevance sievert2014ldavis of a word is defined by
where the weight parameter is set to and (resp. ) is the empirical frequency of in (resp. in
). The leafs clustering and the whole hierarchy have a neat interpretation. Looking at the leaves, we can easily hypothesize the dominant topics for the 8 clusters. From left to right we have: [image processing, probabilistic models], [neuroscience, neural networks], [kernel methods, algorithms], [online optimization, reinforcement learning]. Also, each node of the lower levels gathers meaningful keywords, confirming the ability of the method to hierarchically find meaningful topics. The running time for this experiment was 59 seconds.
4.3 Experiment on Wikipedia Mathematics Pages
We consider a subset of the full Wikipedia corpus, containing all articles ( texts) from the following mathrelated categories: linear algebra, ring theory, stochastic processes and optimization. We remove a set of stopwords, keep a vocabulary of words and run SIDIWO to perform hierarchical topic modeling (using the same methodology as in the previous section). The resulting hierarchical clustering is shown in Figure 3 where we see that each leaf is characterized by one of the dominant topics: [ring theory, linear algebra], [stochastic processes, optimization] (from left to right). It is interesting to observe that the first level of the clustering has separated pure mathematical topics from applied ones. The running time for this experiment was 6 seconds.
5 Conclusions and future works
We proposed a novel spectral algorithm (SIDIWO) that generalizes recent method of moments algorithms relying on tensor decomposition. While previous algorithms lack robustness to model misspecification, SIDIWO provides meaningful results even in misspecified settings. Moreover, SIDIWO can be used to perform hierarchical method of moments estimation for latent variable models. In particular, we showed through hierarchical topic modeling experiments on synthetic and real data that SIDIWO provides meaningful results while being very computationally efficient.
A natural future work is to investigate the capability of the proposed hierarchical method to learn overcomplete latent variable models, a task that has received significant attention in recent literature anandkumar2015learning ; anandkumar2017analyzing . We are also interested in comparing the learning performance of SIDIWO the with those of other existing methods of moments in the realizable setting. On the applications side, we are interested in applying the methods developed in this paper to the healthcare analytics field, for instance to perform hierarchical clustering of patients using electronic healthcare records or more complex genetic data.
Acknowledgments
Guillaume Rabusseau acknowledges support of an IVADO postdoctoral fellowship. Borja Balle completed this work while at Lancaster University.
References
 (1) Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (methodological), pages 1–38, 1977.
 (2) Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832, 2014.
 (3) Animashree Anandkumar, Daniel Hsu, and Sham M Kakade. A method of moments for mixture models and hidden Markov models. In COLT, volume 1, page 4, 2012.
 (4) Animashree Anandkumar, Yikai Liu, Daniel J Hsu, Dean P Foster, and Sham M Kakade. A spectral algorithm for Latent Dirichlet Allocation. In NIPS, pages 917–925, 2012.
 (5) Prateek Jain and Sewoong Oh. Learning mixtures of discrete product distributions using spectral decompositions. In COLT, pages 824–856, 2014.
 (6) Daniel Hsu and Sham M Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In ITCS, pages 11–20. ACM, 2013.
 (7) Le Song, Eric P Xing, and Ankur P Parikh. A spectral algorithm for latent tree graphical models. In ICML, pages 1065–1072, 2011.
 (8) Borja Balle, William L Hamilton, and Joelle Pineau. Methods of moments for learning stochastic languages: Unified presentation and empirical comparison. In ICML, pages 1386–1394, 2014.

(9)
Arun T Chaganty and Percy Liang.
Spectral experts for estimating mixtures of linear regressions.
In ICML, pages 1040–1048, 2013.  (10) Raphael Bailly. Quadratic weighted automata: Spectral algorithm and likelihood maximization. Journal of Machine Learning Research, 20:147–162, 2011.
 (11) Yuchen Zhang, Xi Chen, Denny Zhou, and Michael I Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In NIPS, pages 1260–1268, 2014.
 (12) Michael Steinbach, George Karypis, Vipin Kumar, et al. A comparison of document clustering techniques. In KDD workshop on text mining, volume 400, pages 525–526. Boston, 2000.

(13)
Sergio M Savaresi and Daniel L Boley.
On the performance of bisecting Kmeans and PDDP.
In SDM, pages 1–14. SIAM, 2001.  (14) Borja Balle, Ariadna Quattoni, and Xavier Carreras. Local loss optimization in operator models: a new insight into spectral learning. In ICML, pages 1819–1826, 2012.
 (15) Borja Balle and Mehryar Mohri. Spectral learning of general weighted automata via constrained matrix completion. In NIPS, pages 2159–2167, 2012.
 (16) Ariadna Quattoni, Borja Balle, Xavier Carreras, and Amir Globerson. Spectral regularization for maxmargin sequence tagging. In ICML, pages 1710–1718, 2014.
 (17) Alex Kulesza, N Raj Rao, and Satinder Singh. Lowrank spectral learning. In Artificial Intelligence and Statistics, pages 522–530, 2014.

(18)
Alex Kulesza, Nan Jiang, and Satinder Singh.
Lowrank spectral learning with weighted loss functions.
In Artificial Intelligence and Statistics, pages 517–525, 2015.  (19) Matteo Ruffini, Marta Casanellas, and Ricard Gavaldà. A new spectral method for latent variable models. arXiv preprint arXiv:1612.03409, 2016.
 (20) Daniel Hsu, Sham M Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.
 (21) Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009.
 (22) Volodymyr Kuleshov, Arun Chaganty, and Percy Liang. Tensor factorization via matrix factorization. In AISTATS, pages 507–516, 2015.
 (23) JeanFrancois Cardoso and Antoine Souloumiac. Jacobi angles for simultaneous diagonalization. SIAM journal on matrix analysis and applications, 17(1):161–164, 1996.
 (24) Angelika BunseGerstner, Ralph Byers, and Volker Mehrmann. Numerical methods for simultaneous diagonalization. SIAM journal on matrix analysis and applications, 14(4):927–949, 1993.
 (25) John Duchi, Shai ShalevShwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l 1ball for learning in high dimensions. In ICML, pages 272–279, 2008.
 (26) Stefan Van Der Walt, S Chris Colbert, and Gael Varoquaux. The numpy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22–30, 2011.
 (27) Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of classification, 2(1):193–218, 1985.
 (28) Valerio Perrone, Paul A Jenkins, Dario Spano, and Yee Whye Teh. Poisson random fields for dynamic feature models. arXiv preprint arXiv:1611.07460, 2016.
 (29) Carson Sievert and Kenneth E Shirley. Ldavis: A method for visualizing and interpreting topics. In ACL workshop on interactive language learning, visualization, and interfaces, 2014.
 (30) Animashree Anandkumar, Rong Ge, and Majid Janzamin. Learning overcomplete latent variable models through tensor methods. In COLT, pages 36–112, 2015.
 (31) Animashree Anandkumar, Rong Ge, and Majid Janzamin. Analyzing tensor power method dynamics in overcomplete regime. Journal of Machine Learning Research, 18(22):1–40, 2017.
 (32) Elina Mihaylova Robeva. Decomposing Matrices, Tensors, and Images. PhD thesis, University of California, Berkeley, 2016.
Appendix A Lowrank whitenings may not admit a symmetric orthogonal decomposition.
In Section 2.2 we claimed that a symmetric tensor with CPrank , when whitened to a tensor , may not admit a symmetric orthogonal decomposition if . We give here a simple counterexample by constructing a tensor whose whitening does not admit a symmetric orthogonal decomposition. We will make use of the following Lemma.
Lemma A.1 (robeva2016decomposing , Example 1.2.3)
A symmetric tensor is orthogonally decomposable if and only if its entries satisfy the following equation:
(8) 
Consider the following parameters^{4}^{4}4This example easily generalizes to vectors in the simplex.
from which one can recover a matrix and a tensor from equations (1) and (2), both of rank . Using the top singular vectors and values of , would be whitened to a tensor with the following entries:
One can check that in this case Eq. (8) is equivalent to , which does not hold, hence is not orthogonally decomposable.
Appendix B Proof Of Lemma 3.1
Consider the SVD of :
where and are obtained from the first singular vectors and values. Define now the matrix ; then there exists a unique orthogonal matrix such that .
This implies that the slices of can be rewritten as follows:
Take now a generic matrix ; it can be written as
for a certain orthonormal matrix . So we have
This matrix is diagonal if and only if , so the problem
(9) 
is optimized by , which is the unique (up to a columns rescaling) feasible optimum.
Appendix C Proof Of Theorem 3.1
Let’s recall the notation we are going to use. Consider the matrix and its SVD:
For any , define , where and are and truncated at the th singular vector (recall: and ). We know that there exists an orthonormal matrix such that
Let’s prove the various points of the Theorem.

Consider any matrix . Then we will have, for an orthonormal ,
To prove the statement, it is enough to show that the matrix has rank . To see this, explicitly represent :
the fact that and are orthogonal proves the claim.

Consider again any matrix , then
The columns of are the left singular vectors of , that span the best fit dimensional subspace of the space generated by the columns of .

To prove this we will proceed by contradiction. Assume that ; this means that there exists an orthonormal matrix such that
But , so
This would imply that
and so, for some
Observe now that the matrix has all the entries that are different from zero, by the hypothesis that for any . However, we have that
where is the diagonal matrix with the last singular values. So has some zero entry. This contradiction proves the claim.
The proof of the fact that is identical.
Appendix D Proof Of Theorem 3.2
Recall the considered problem:
(10) 
where
Consider any , then it admits the following representation, for some with :
This allows the following chain of equalities on the cost function:
Where the vector is defined as
and the last equality has been obtained from the fact that, for any vector , we have
This last equation proves our statement; in fact,
Appendix E Proof Of Theorem 3.3
First, observe that the set of orthonormal matrices can be parametrized as
(11) 
The set can thus be rewritten in function of , as
and Problem (5) can be rewritten as
where
(12) 
and where we used the fact that is symmetric. We can then write
where the coefficients can be written as
with and . Letting for it follows that optimizing Problem (5) is equivalent to minimize the following smooth real function