Latent variable models are probabilistic models that are versatile in modeling high-dimensional complex data with hidden structure, and are an often-employed unsupervised learning method. The method of moments relates the observed data moments to the model parameters through a CP tensor decomposition. Specifically, learning latent variable models using the method of moments involves identifying the components that generated a noisy observation $\widehat{T}$ of the original tensor $T$. Here $\widehat{T} = T + E$, where $E$ is noise. (An observed tensor is often noisy due to the limited number of available samples.) Therefore, it is important to find methods that provide guarantees on the components recovered by CP decompositions executed over noisy data.
Assume that the underlying tensor $T$ is 3-order and has components $a_i$, $b_i$, $c_i$ such that $T = \sum_{i=1}^{k} \sigma_i\, a_i \otimes b_i \otimes c_i$, where $a_i$, $b_i$, and $c_i$ are the columns of $A$, $B$, and $C$, respectively. If $T$ is symmetric, i.e., $A = B = C$, then it is said to permit a symmetric CP decomposition. If $T$ is not symmetric, then it must be decomposed using an asymmetric decomposition.
Let us further restrict a symmetric tensor so that the components are not just equal across modes but also orthogonal. Decomposition of symmetric orthogonal tensors is trivial [12], as the constraints of symmetric entries and orthogonal components vastly reduce the number of parameters in the CP decomposition problem. There is much prior work [2, 3, 7, 18, 19] on decomposing symmetric tensors with identical components across modes. For symmetric tensors with orthogonal components, existing methods provide high-probability guarantees of recovery and convergence.
While the CP decomposition of symmetric tensors is relatively well understood, the existence and uniqueness of the asymmetric CP decomposition is less so. Indeed, there are multiple definitions of orthogonal CP decomposition [11], ranging from weak to strict.
In this paper, we consider the CP decomposition of asymmetric tensors with orthonormal factors. The restriction to orthonormal factors is reasonable, since common whitening procedures can be used to orthogonalize a tensor. However, the symmetry assumption required by prior methods is far too restrictive: in most applications we use multi-view models or HMMs, in which information is asymmetric along different modes.
Unlike previous schemes based on deflation methods that recover factors sequentially, our scheme recovers the components simultaneously. In numerous machine learning settings, data is generated in real time, and sequential recovery of factors may be inapplicable in such online settings. Further, many existing schemes require multiple random initializations in order to achieve a high-probability bound on the convergence of deflation algorithms. Our scheme is based on matrix subspace iteration and does not require multiple initializations. Prior work considers simultaneous subspace iteration, but is limited to symmetric tensors.
1.1 Summary of Contribution
We provide the first guaranteed decomposition algorithm, Smartly Initialized Orthogonal ALS, for orthogonal asymmetric tensors, using a Trace Based Initialization procedure. We prove a dimension-independent doubly exponential convergence rate under a spectral-gap assumption. We propose a noise model, called the benign noise model, under which Smartly Initialized Orthogonal ALS consistently recovers the factors of the orthogonal asymmetric tensor decomposition.
Our Trace Based Initialization procedure can also be applied to symmetric orthogonal tensor decomposition to improve the convergence rate. For asymmetric orthogonal tensor decomposition, we provide insight into the hardness of initialization when the rank $k$ is unknown.
Theorem 1.1 (Informal).
For a tensor $T$ that permits a CP decomposition $T = \sum_{i=1}^{k} \sigma_i\, a_i \otimes b_i \otimes c_i$, where $A$, $B$, and $C$ are orthonormal matrices, using the initialization generated by the Trace Based Initialization procedure described in Procedure 2 and a sufficient number of matrix subspace iterations in Procedure 3, the estimates $\widehat{A}$, $\widehat{B}$, and $\widehat{C}$ produced by Orthogonal ALS in Procedure 1 converge to the true components $A$, $B$, and $C$ of the tensor, respectively.
1.2 Related Work
The most popular tensor decomposition method is the power method: Anandkumar et al. [3] proved the convergence of the rank-1 power method on orthogonal symmetric tensors using deflation. This algorithm uses random initializations, and multiple runs of the algorithm are needed for convergence guarantees. Wang et al. [19] used subspace iteration and proved the simultaneous convergence of the top-$k$ singular vectors for orthogonal symmetric tensors; a matrix subspace iteration procedure is used for initialization. Another popular way to implement tensor decomposition is the alternating least squares (ALS) algorithm [5, 8, 13]. Even for symmetric tensors, the ALS algorithm optimizes an individual mode of the factors by fixing all other modes, and alternates between the modes; this process is repeated until convergence. Each ALS step requires solving a least-squares problem. Convergence of a variant of ALS using QR decomposition for symmetric tensors has been proven [18]. Stochastic gradient descent is also used to solve the tensor decomposition problem: in [6], an objective function for tensor decomposition is proposed in which all local optima are globally optimal.
2 Tensor & Subspace Iteration Preliminaries and Notations
Let $[n] := \{1, \ldots, n\}$. For a vector $v$, denote the $i$-th element as $v_i$. For a matrix $M$, denote the $i$-th row as $M^{i}$, the $j$-th column as $M_j$, and the $(i,j)$-th element as $M_{ij}$. Denote the first $k$ columns of a matrix $M$ as $M_{1:k}$. An $N$-order (number of dimensions, a.k.a. modes) tensor, denoted as $T \in \mathbb{R}^{d_1 \times \cdots \times d_N}$, is a multi-dimensional array with $N$ dimensions. For a 3-order tensor $T$, its $(i,j,k)$-th entry is denoted by $T_{ijk}$.
Matricization is the process of reordering the elements of an $N$-way tensor into a matrix. The mode-$n$ matricization of a tensor $T \in \mathbb{R}^{d_1 \times \cdots \times d_N}$ is denoted by $T_{(n)}$ and arranges the mode-$n$ fibers to be the columns of the resulting matrix. To be precise, tensor element $(i_1, \ldots, i_N)$ maps to matrix element $(i_n, j)$, where $j = 1 + \sum_{k \ne n} (i_k - 1) J_k$ with $J_k = \prod_{m < k,\, m \ne n} d_m$.
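As a concrete illustration of this index mapping, the following sketch (function name `mode_n_matricize` is ours, not from the paper) implements the mode-$n$ unfolding in NumPy; flattening the remaining modes in Fortran order reproduces the column ordering described above.

```python
import numpy as np

def mode_n_matricize(T, n):
    # Move mode n to the front, then flatten the remaining modes in Fortran
    # order so that the mode-n fibers become the columns of the result
    # (the standard Kolda-Bader unfolding convention).
    return np.reshape(np.moveaxis(T, n, 0), (T.shape[n], -1), order='F')
```

For a tensor of shape $(2, 3, 4)$, the mode-1 unfolding has shape $(3, 8)$, and column $i_1 + 2\, i_3$ (0-indexed) is the fiber $T_{i_1,\,:\,,i_3}$.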
Tensor product is also known as outer product. For $u \in \mathbb{R}^{d_1}$, $v \in \mathbb{R}^{d_2}$, and $w \in \mathbb{R}^{d_3}$, $u \otimes v \otimes w$ is a $d_1 \times d_2 \times d_3$ sized 3-way tensor with its $(i,j,k)$-th entry being $u_i v_j w_k$.
Khatri-Rao product is defined, for $A \in \mathbb{R}^{I \times R}$ and $B \in \mathbb{R}^{J \times R}$, as $A \odot B \in \mathbb{R}^{IJ \times R}$, whose $r$-th column is the Kronecker product $a_r \otimes b_r$ of the $r$-th columns of $A$ and $B$.
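A minimal sketch of the column-wise Kronecker structure (the helper name `khatri_rao` is ours):

```python
import numpy as np

def khatri_rao(A, B):
    # Column-wise Kronecker product: the r-th column of the result is
    # kron(A[:, r], B[:, r]); output shape is (I * J, R).
    I, R = A.shape
    J, R2 = B.shape
    assert R == R2, "factor matrices must share the number of columns"
    return np.einsum('ir,jr->ijr', A, B).reshape(I * J, R)
```

Each column can be checked against NumPy's `np.kron` applied to the corresponding column pair.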
Tensor Spectral Norm The spectral norm of a tensor $T$ is defined as $\|T\| := \sup_{\|u\| = \|v\| = \|w\| = 1} |T(u, v, w)|$, where $T(u, v, w) = \sum_{i,j,k} T_{ijk}\, u_i v_j w_k$.
Definition 2.1 (Subspace Similarity [20]).
Let $\mathcal{U}, \mathcal{V}$ be two $k$-dimensional proper subspaces of $\mathbb{R}^d$, spanned respectively by the columns of two orthonormal basis matrices $U, V \in \mathbb{R}^{d \times k}$. Let $V_{\perp}$ be the basis matrix for the complement subspace of $\mathcal{V}$. Then the principal angle $\theta$ formed by $\mathcal{U}$ and $\mathcal{V}$ satisfies
$\cos \theta(U, V) = \sigma_{\min}(V^{\top} U), \qquad \sin \theta(U, V) = \sigma_{\max}(V_{\perp}^{\top} U),$
where $\sigma_{\min}(\cdot)$ and $\sigma_{\max}(\cdot)$ denote the smallest and greatest singular values of a matrix.
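This similarity measure is straightforward to compute without forming $V_{\perp}$ explicitly, since $\sigma_{\max}(V_{\perp}^{\top} U)^2 = 1 - \sigma_{\min}(V^{\top} U)^2$ for orthonormal bases. A sketch (function name `sin_theta` is ours):

```python
import numpy as np

def sin_theta(U, V):
    # sin of the largest principal angle between the column spaces of the
    # orthonormal bases U and V; equals sigma_max(V_perp^T U), computed
    # via sqrt(1 - sigma_min(V^T U)^2) to avoid forming V_perp explicitly.
    s = np.linalg.svd(V.T @ U, compute_uv=False)
    return float(np.sqrt(max(0.0, 1.0 - s.min() ** 2)))
```

Identical subspaces give $\sin\theta = 0$; orthogonal subspaces give $\sin\theta = 1$.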
3 Asymmetric (Orthogonal) Tensor Decomposition Model
Consider a rank-$k$ asymmetric tensor $T$ generated from latent factors $A$, $B$, and $C$:
$T = \sum_{i=1}^{k} \sigma_i\, a_i \otimes b_i \otimes c_i,$
where $a_i$ denotes the $i$-th column of $A$, and similarly for $b_i$ and $c_i$. Without loss of generality, we assume $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_k > 0$. Our analysis applies to general order-$N$ tensors; for notational simplicity we use order-3 tensors only. In this paper, we assume that $A$, $B$, and $C$ are all orthonormal matrices, and therefore the tensor we compute the CP decomposition of has a unique orthogonal decomposition, based on Kruskal's condition [14].
Our goal is to discover a CP decomposition with orthogonal components that best approximates the observed $\widehat{T}$, which can be formulated as the following optimization problem:
$\min_{\sigma_i,\, A,\, B,\, C} \Big\| \widehat{T} - \sum_{i=1}^{k} \sigma_i\, a_i \otimes b_i \otimes c_i \Big\|_F^2 \quad \text{s.t. } A^{\top}A = B^{\top}B = C^{\top}C = I.
We denote the estimated singular values and factor matrices of the tensor as $\hat{\sigma}_i$, $\widehat{A}$, $\widehat{B}$, and $\widehat{C}$, respectively.
4 Simultaneous Asymmetric Tensor Decomposition via Orthogonalized Alternating Least Square
4.1 Simultaneous Orthogonalized ALS for Asymmetric Tensor
The ALS approach fixes $B$ and $C$ to compute a closed-form solution for $A$, then fixes $A$ and $C$ to solve for $B$, followed by fixing $A$ and $B$ to solve for $C$. The alternating updates are repeated until some convergence criterion is satisfied. Fixing all but one factor matrix, the problem reduces to a linear least-squares problem over the matricized tensor,
$\min_{\widetilde{A}} \big\| \widehat{T}_{(1)} - \widetilde{A}\, (C \odot B)^{\top} \big\|_F^2,$
which has the closed-form solution $\widetilde{A} = \widehat{T}_{(1)} \big( (C \odot B)^{\top} \big)^{+}$ using the pseudo-inverse.
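One such closed-form least-squares step can be sketched as follows (a sketch under our notation, not the paper's exact procedure; the name `als_update` is ours). For an exactly low-rank tensor with the other two factors fixed at their true values, the update recovers the remaining factor and its weights:

```python
import numpy as np

def als_update(T, B, C):
    # Closed-form least-squares update for the first factor with B, C fixed:
    # solve min over A~ of || T_(1) - A~ (C kr B)^T ||_F via the pseudo-
    # inverse, then split A~ into unit-norm columns and weights sigma.
    I, J, K = T.shape
    R = B.shape[1]
    T1 = T.reshape(I, J * K, order='F')                    # mode-1 unfolding
    KR = np.einsum('jr,kr->kjr', B, C).reshape(J * K, R)   # Khatri-Rao C kr B
    A_tilde = T1 @ np.linalg.pinv(KR.T)
    sigma = np.linalg.norm(A_tilde, axis=0)
    return A_tilde / sigma, sigma
```

With exact data, the recovered weights match the true $\sigma_i$ and the recovered columns align (up to sign) with the true factor columns.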
Intuition for the Simultaneous Orthogonalized ALS
ALS converges quickly and is usually robust to noise in practice. However, the convergence theory of ALS for asymmetric tensors is not well understood. We fill this gap by introducing an orthogonalization step after each closed-form update of the least-squares problem, as shown in Procedure 1, for orthogonally decomposable asymmetric tensors.
Under the $\epsilon$-sufficient initialization condition in Definition 4.1 (defined next), we update the components $A$, $B$, and $C$ as in lines 3, 4, 5 of Procedure 1. We save on expensive matrix inversions because $(C \odot B)^{\top}(C \odot B) = (C^{\top}C) * (B^{\top}B) = I$ due to the orthogonality of $B$ and $C$ (here $*$ denotes the Hadamard product).
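The resulting inversion-free round of updates can be sketched as follows (our illustrative reading of Procedure 1, with hypothetical names; the QR step plays the role of the re-orthonormalization):

```python
import numpy as np

def orth_als_round(T, A, B, C):
    # One round of orthogonalized ALS: because the fixed factors are
    # orthonormal, (C kr B)^T (C kr B) = I, so the least-squares solution
    # is just T_(n) (C kr B); a QR step then re-orthonormalizes it.
    I, J, K = T.shape

    def update(Tn, X, Y):
        KR = np.einsum('jr,kr->kjr', X, Y).reshape(-1, X.shape[1])  # Y kr X
        Q, _ = np.linalg.qr(Tn @ KR)
        return Q

    A = update(T.reshape(I, J * K, order='F'), B, C)
    B = update(np.moveaxis(T, 1, 0).reshape(J, I * K, order='F'), A, C)
    C = update(np.moveaxis(T, 2, 0).reshape(K, I * J, order='F'), A, B)
    return A, B, C
```

Started at the true factors of an exactly orthogonally decomposable tensor, the round is a fixed point up to column signs.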
We define the sufficient initialization condition in Definition 4.1, under which our Orthogonal ALS algorithm is guaranteed to converge to the true factors of the tensor.
Definition 4.1 (-Sufficient Initialization Condition).
The $\epsilon$-sufficient initialization condition is satisfied if the initializations $A_0$, $B_0$, and $C_0$ obey $\sin\theta(A_0, A) \le \epsilon$, $\sin\theta(B_0, B) \le \epsilon$, and $\sin\theta(C_0, C) \le \epsilon$.
With the -Sufficient Initialization Condition defined, we obtain the following conditional convergence theorem.
Theorem 4.2 (Conditional Simultaneous Convergence).
Theorem 4.2 guarantees that the estimated factors recovered using Orthogonal ALS converge to the true factors $A$, $B$, and $C$. The convergence rate of Orthogonal ALS is doubly exponential when the $\epsilon$-sufficient initialization condition is satisfied. The proof sketch is in Appendix B.
We next propose a novel initialization method in Procedure 2 which guarantees that the -Sufficient Initialization Condition is satisfied.
4.2 Initialization for Orthogonal ALS: Subspace Iteration
The objective function in Equation 6 is nonconvex; therefore, initialization of the orthogonal factors $A$, $B$, and $C$ is crucial.
It is nontrivial to produce initializations that satisfy the $\epsilon$-sufficient initialization condition in Definition 4.1. Denote the singular values of $T$ in descending order as $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_k$. We assume a gap between the $i$-th and the $(i+1)$-th singular values, i.e., $\sigma_i > \sigma_{i+1}$. The following lemma provides the key insight behind our initialization procedure.
Let $U$ and $V$ respectively be the orthonormal matrices whose column spaces are the left and right invariant subspaces corresponding to the dominant eigenvalues of a matrix $M$. Assume that for a fixed initialization $Q_0$, $U^{*} Q_0$ has full rank, where $*$ denotes the conjugate transpose. Then, independently of $Q_0$, matrix subspace iteration converges to the dominant subspace. This
suggests that as long as we can find a matrix whose dominant left eigenspace is the column space of $A$ (and similarly for $B$ and $C$), we can take advantage of matrix subspace iteration to prepare an initialization for the Orthogonal ALS.
Intuition for Matrix Subspace Iteration
Given a matrix $M$, matrix subspace iteration aims to recover the left eigenspace of dimension $k$, determined as the span of the eigenvectors of $M M^{*}$ corresponding to its $k$ largest eigenvalues. Matrix subspace iteration provides insight into how the factors should be initialized.
Using an orthonormal matrix $Q_0$, matrix subspace iteration updates as follows: $Q_{t+1} R_{t+1} = M M^{*} Q_t$, where $Q_{t+1} R_{t+1}$ denotes the QR decomposition of $M M^{*} Q_t$.
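The update above can be sketched directly (a minimal sketch; the function name `subspace_iteration` is ours, and convergence assumes a spectral gap at position $k$):

```python
import numpy as np

def subspace_iteration(M, k, iters=100, seed=0):
    # Orthogonal (subspace) iteration: Q_{t+1} R_{t+1} = (M M^T) Q_t recovers
    # an orthonormal basis of the dominant k-dimensional left singular
    # subspace of M, given a gap between the k-th and (k+1)-th singular values.
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(M.shape[0], k)))
    for _ in range(iters):
        Q, _ = np.linalg.qr(M @ (M.T @ Q))
    return Q
```

After convergence, projecting the iterate onto the true dominant left singular subspace leaves it unchanged.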
5 Trace Based Initialization Procedure 2
For matrix subspace iteration to work, we need to prepare a matrix whose dominant left eigenspace spans the desired singular vectors. We start with a vector $\theta$ which is the collection of the traces of the slice matrices along the third mode of tensor $T$, i.e., $\theta_s = \operatorname{tr}(T_{:,:,s})$. Then we take the mode-3 product of tensor $T$ with the above vector $\theta$. As a result, we have the following lemma.
The mode-3 product of tensor $T$ with vector $\theta$ has the form $T \times_3 \theta = A D B^{\top}$, where $D = \operatorname{diag}\big(\sigma_1^2 \langle a_1, b_1 \rangle, \ldots, \sigma_k^2 \langle a_k, b_k \rangle\big)$.
The proof is in Appendix F. Similar identities hold for the mode-1 and mode-2 products. To initialize for recovering $A$ and $B$, one could utilize the left and right singular subspaces of $T \times_3 \theta$; to recover $C$, one could utilize the mode-1 or mode-2 product instead.
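The trace-and-contract construction can be sketched in a few lines (the helper name `trace_based_init_matrix` is ours; the stated identity assumes the mode-3 factor $C$ has orthonormal columns):

```python
import numpy as np

def trace_based_init_matrix(T):
    # Trace Based Initialization sketch: collect the traces of the frontal
    # (mode-3) slices into theta, then contract T with theta along mode 3.
    # For T = sum_r sigma_r a_r x b_r x c_r with orthonormal C, the result
    # is M = sum_r sigma_r^2 <a_r, b_r> a_r b_r^T, whose dominant left
    # singular subspace is spanned by columns of A.
    theta = np.einsum('iik->k', T)           # theta_s = trace(T[:, :, s])
    return np.einsum('ijk,k->ij', T, theta)  # M = T x_3 theta
```

On a synthetic orthogonal tensor, the output matches $A \operatorname{diag}(\sigma_r^2 \langle a_r, b_r\rangle) B^{\top}$ exactly.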
5.1 Performance of Trace Based Initialization Procedure for Symmetric Tensor
Consider a symmetric tensor $T = \sum_{i=1}^{k} \lambda_i\, v_i \otimes v_i \otimes v_i$ with orthogonal components, where $\lambda_1 > \lambda_2 > \cdots > \lambda_k > 0$. The rank-1 power method with deflation uses a random unit vector initialization $u$ between deflations, and the power iteration converges to the tensor eigenvector $v_i$ with the largest $\lambda_i |\langle v_i, u \rangle|$ among $i \in [k]$. A drawback of this property is that random initialization does not guarantee convergence to the eigenvector with the largest eigenvalue.
Procedure 2 uses $u = \theta = \sum_i \lambda_i v_i$, and thus $\lambda_i \langle v_i, \theta \rangle = \lambda_i^2$. Therefore the power method converges to the eigenvector $v_1$, which corresponds to the largest eigenvalue $\lambda_1$. Inductively, by deflating the tensor, we recover the eigenvectors in descending order of their eigenvalues.
Procedure 2 also improves the simultaneous power method [19] for symmetric tensors.
5.2 Hardness of Initialization for Asymmetric Tensor Decomposition
Trace Based Initialization Procedure 2 works well for symmetric tensors even when the rank is unknown. However, for asymmetric tensors, we assume knowledge of the rank (as specified in Procedure 2). In practice, we suggest initializing the algorithm with a matrix with a sufficiently large number of columns.
Let $Z = (z_{ij})$ be an $n \times n$ random matrix whose entries $z_{ij}$ are independent standard Gaussians, let $\Gamma$ be the matrix obtained from performing the Gram-Schmidt procedure on the columns of $Z$, and let $(m_n)$ be a sequence of positive integers.
We then have:
the matrix $\Gamma$ is Haar distributed on the orthogonal group $O(n)$;
$\max_{1 \le i \le n,\, 1 \le j \le m_n} |\sqrt{n}\, \Gamma_{ij} - z_{ij}| \to 0$ in probability, provided $m_n = o(n / \log n)$ as $n \to \infty$;
conversely, if $m_n$ grows at the order of $n / \log n$ or faster, this maximum no longer vanishes in probability as $n \to \infty$.
This proposition states that for a matrix from the Haar distribution, the first $m_n$ columns, scaled by $\sqrt{n}$, asymptotically behave like a matrix with independent standard Gaussian entries, and $o(n / \log n)$ is the largest order for the number of columns we can approximate simultaneously.
We treat the factor $A$ as the left sub-block of some orthonormal matrix. Thus if $k = o(d / \log d)$, $\sqrt{d}\, A$ can be approximated by a matrix of i.i.d. $\mathcal{N}(0, 1)$ entries.
Now the tensor is asymmetric, and the factors $A$, $B$, and $C$
are independent; so are their Gaussian approximations. Using the Trace Based Initialization Procedure, each eigenvalue of the prepared matrix is the corresponding ordered tensor singular value multiplied by an independent, identically distributed random factor.
In case $k$ is unknown and we aim to recover the subspace corresponding to the first few leading singular values of the tensor using the Trace Based Initialization Procedure,
a correct ordering of the eigenvalues of the initialization matrix is required, i.e., the ranking of the singular values of the initialization matrix must remain in the same descending order as the singular values of the tensor.
Because we do not have a probabilistic characterization of the $\sigma_i$'s, it is hard to characterize the probability of success. Nevertheless, we gain insight by studying the extreme case when all the singular values are identically distributed.
In our Trace Based Initialization Procedure, if we do not know the rank, the ordering of the eigenvalues in Equation 13 is crucial for recovering the singular vectors corresponding to the top $k$ largest singular values of the tensor. Proposition 5.6
provides an asymptotic analysis of the hardness of the initialization with $k$ unknown.
In Equation 13, the factors
are asymptotically i.i.d. continuous random variables distributed as the absolute value of a standard Gaussian; the ranking vector of the eigenvalues is the same as the ranking vector of these factors, which are asymptotically i.i.d. continuous random variables distributed as the absolute value of the inner product of two standard multivariate Gaussians.
Remark (Hardness Results).
In the extreme case where all singular values are identically distributed, the probability of obtaining a particular ranking decreases exponentially with respect to $k$ (the number of random variables). Therefore, the probability that we recover every dimension of the subspace also decreases exponentially with the rank of the tensor.
6 Robustness of the Convergence Result
In this section, we extend the convergence result to noisy asymmetric tensors. For symmetric tensors, there are a number of prior efforts [18, 19, 2] showing that their decomposition algorithms are robust to noise. Such robustness depends on restrictions on the tensor or on the structure of the noise, such as low column correlations of the factor matrices (in [18]) or symmetry of the noise along with the true tensor (in [19]).
Let $\widehat{T} = T + E$ be the observed tensor, where $T$ is the true tensor with the CP decomposition form $T = \sum_{i=1}^{k} \sigma_i\, a_i \otimes b_i \otimes c_i$
and $E$ is a noise tensor. We provide a result on the robustness of our algorithm under the following benign noise model.
Definition 6.1 (Benign Noise).
We define a class of benign noise as follows
where $A$, $B$, and $C$ are the true factors of the tensor $T$.
Under the benign noise model, we have the following robustness result.
Theorem 6.2 (Conditional Robustness).
The proof follows from the main convergence result, Theorem B.1.
Our robustness result is guaranteed under the benign noise model. In the general case, the noise can be “malicious” if there is a sharp angle between the subspace of $T$ and the subspace of $E$ for every mode. Under such a malicious noise model, not only is the robustness of our algorithm no longer guaranteed, it is also highly questionable whether the observed tensor still permits an orthogonal CP decomposition. Hillar and Lim stated in [9] that, in general, the best rank-$k$ approximation to a tensor does not necessarily exist. Consequently, we have no guarantee that an orthogonal decomposition exists in general.
Discovering latent variable models over large datasets can be cast as a tensor decomposition problem. Existing theory for tensor decompositions guarantees results when the tensor is noise-free and symmetric. However, in practice, the tensors that have to be decomposed are noisy due to sampling limitations, and also inherently asymmetric due to the underlying model. In this paper, we present the first algorithm with guaranteed recovery of tensor factors for an orthogonal asymmetric noisy tensor. We also introduce a new initialization procedure that guarantees convergence of Orthogonal ALS for asymmetric tensors and improves convergence for existing methods for symmetric tensors.
-  Anima Anandkumar, Rong Ge, and Majid Janzamin. Guaranteed non-orthogonal tensor decomposition via alternating rank-1 updates. arXiv preprint arXiv:1402.5180, 2014.
-  Anima Anandkumar, Prateek Jain, Yang Shi, and Uma Naresh Niranjan. Tensor vs. matrix methods: Robust tensor decomposition under block sparse perturbations. In Artificial Intelligence and Statistics, pages 268–276, 2016.
-  Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832, 2014.
-  Peter Arbenz, Daniel Kressner, and DME Zürich. Lecture notes on solving large scale eigenvalue problems. D-MATH, EHT Zurich, 2, 2012.
-  J Douglas Carroll and Jih-Jie Chang. Analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition. Psychometrika, 35(3):283–319, 1970.
-  Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.
-  Navin Goyal, Santosh Vempala, and Ying Xiao. Fourier PCA and robust tensor decomposition. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 584–593. ACM, 2014.
-  Richard A Harshman. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multimodal factor analysis. UCLA Working Papers in Phonetics, 16:1–84, 1970.
-  Christopher J Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. Journal of the ACM (JACM), 60(6):45, 2013.
-  Tiefeng Jiang. How many entries of a typical orthogonal matrix can be approximated by independent normals? The Annals of Probability, 34(4):1497–1529, 07 2006.
-  Tamara G Kolda. Orthogonal tensor decompositions. SIAM Journal on Matrix Analysis and Applications, 23(1):243–255, 2001.
-  Tamara G Kolda. Symmetric orthogonal tensor decomposition is trivial. arXiv preprint arXiv:1503.01375, 2015.
-  Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009.
-  Joseph B Kruskal. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear algebra and its applications, 18(2):95–138, 1977.
-  Shuangzhe Liu and Götz Trenkler. Hadamard, Khatri-Rao, Kronecker and other matrix products. Int. J. Inf. Syst. Sci., 4(1):160–177, 2008.
-  F. Mezzadri. How to generate random matrices from the classical compact groups. ArXiv Mathematical Physics e-prints, Sep 2006.
-  Ronald H Randles and Douglas A Wolfe. Introduction to the Theory of Nonparametric Statistics. Wiley Series in Probability and Mathematical Statistics. Wiley, New York, 1979.
-  Vatsal Sharan and Gregory Valiant. Orthogonalized als: A theoretically principled tensor decomposition algorithm for practical use. arXiv preprint arXiv:1703.01804, 2017.
-  Po-An Wang and Chi-Jen Lu. Tensor decomposition via simultaneous power iteration. In International Conference on Machine Learning, pages 3665–3673, 2017.
-  Peizhen Zhu and Andrew V Knyazev. Angles between subspaces and their tangents. Journal of Numerical Mathematics, 21(4):325–340, 2013.
Appendix A A Naive Initialization Procedure
Based on the CP decomposition model in Equation (5), it is easy to see that the frontal slices share the mode-$A$ and mode-$B$ singular vectors with the tensor $T$: the $s$-th frontal slice is $T_{:,:,s} = A\, \operatorname{diag}(\sigma_1 C_{s1}, \ldots, \sigma_k C_{sk})\, B^{\top}$. It is natural to consider naively implementing singular value decompositions on the frontal slices to obtain estimates of $A$ and $B$.
Failure of Naive Initialization
Consider the simpler scenario of finding a good initialization for a symmetric tensor which permits the CP decomposition $T = \sum_{i=1}^{k} \lambda_i\, v_i \otimes v_i \otimes v_i$ with orthonormal components $v_i$.
Specifically, the $s$-th frontal slice is
$T_{:,:,s} = V\, \operatorname{diag}(\lambda_1 V_{s1}, \ldots, \lambda_k V_{sk})\, V^{\top},$
where $V = [v_1, \ldots, v_k]$. However, this first (slicing) method gives us a matrix without any improvement on the diagonal of the decomposition: each eigenvalue $\lambda_i V_{si}$ of the slice matrix
contains not only the factor of a tensor singular value, which we care about, but also some unknown entry of the unitary matrix. This causes trouble when one wants to recover the subspace relative to only some leading singular values of the tensor, especially when the rank $k$ is believed to grow with the dimension $d$. Although the analogous statement in matrix subspace iteration holds almost surely (with probability one), in tensor subspace iteration we indeed need to do more work than simply taking a slice: it is highly probable that the unknown entries permute the eigenvalues into an unfavorable sequence. Meanwhile, since $\theta$ is ideally clean, we see success when we use the second (trace-based) method to recover the subspace relative to a few dominant singular values of a symmetric tensor.
They are all qualified in the sense that they have the column space of $A$ as the left eigenspace exactly. However, we can generalize this scheme to a greater extent: frontal slicing is just a specific realization of multiplying the tensor on the third mode by a unit vector. The mode-$n$ product of a tensor with a vector returns the collection of inner products of each mode-$n$ fiber with the vector; the mode-3 product of tensor $T$ with the standard basis vector $e_s$ gives the $s$-th frontal slice of $T$.
Appendix B Procedure 1 Convergence Result
B.1 Conditional Simultaneous Convergence
Theorem B.1 (Main Convergence).
To prove the main convergence result, we simply combine the results below.
Let $A_0$, $B_0$, and $C_0$ be orthonormal initialization matrices for the specified subspace iteration. Then after $t$ iterations, the principal angle between the estimated and the true subspace satisfies the stated bound; similarly for $B$ and $C$.
The proof is in Appendix C.
Given that the initialization matrices satisfy the $\epsilon$-sufficient initialization condition, the angles between the approximate subspaces and the true subspaces decrease doubly exponentially fast. Therefore, only $O(\log \log (1/\epsilon))$ iterations are needed to achieve accuracy $\epsilon$.
The following result shows that if the angle between the subspaces is small enough, the column vectors of the approximate matrix converge simultaneously to the true vectors of the corresponding tensor components, position by position.
Lemma B.3 (Simultaneous Convergence).
For any , if for some matrix , then
Similarly for $B$ and $C$.
The proof is in Appendix D.
Appendix C Proof for Lemma B.2
We only prove the result for the recovery of $A$; the proofs for the other two modes are the same.
For the rank-$k$ tensor $T$, its mode-1 matricization is $T_{(1)} = A\, \operatorname{diag}(\sigma)\, (C \odot B)^{\top}$. So in each iteration,
We can expand the matrices $A$, $B$, and $C$ to form full bases for $\mathbb{R}^d$; for example, for $A$, let $A_{\perp}$ be the matrix consisting of the remaining columns in the expanded matrix. The column space of $A_{\perp}$ is then the complement of the column space of $A$ in $\mathbb{R}^d$, and $[A, A_{\perp}]$ is an orthonormal matrix.
With that notation, we have for
Now fix the other factors and focus on a single iteration step,