Tensor decompositions for learning latent variable models

This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models---including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation---which exploits a certain tensor structure in their low-order observable moments (typically, of second- and third-order). Specifically, parameter estimation is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric tensor derived from the moments; this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices. Although tensor decompositions are generally intractable to compute, the decomposition of these specially structured tensors can be efficiently obtained by a variety of approaches, including power iterations and maximization approaches (similar to the case of matrices). A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of matrices. This implies a robust and computationally tractable estimation approach for several popular latent variable models.


1 Introduction

The method of moments is a classical parameter estimation technique [Pea94] from statistics which has proved invaluable in a number of application domains. The basic paradigm is simple and intuitive: (i) compute certain statistics of the data — often empirical moments such as means and correlations — and (ii) find model parameters that give rise to (nearly) the same corresponding population quantities. In a number of cases, the method of moments leads to consistent estimators which can be efficiently computed; this is especially relevant in the context of latent variable models, where standard maximum likelihood approaches are typically computationally prohibitive, and heuristic methods can be unreliable and difficult to validate with high-dimensional data. Furthermore, the method of moments can be viewed as complementary to the maximum likelihood approach; simply taking a single step of Newton-Raphson on the likelihood function starting from the moment-based estimator [Le 86] often leads to the best of both worlds: a computationally efficient estimator that is (asymptotically) statistically optimal.

The primary difficulty in learning latent variable models is that the latent (hidden) state of the data is not directly observed; only observed variables correlated with the hidden state are available. As such, it is not evident that the method of moments should fare any better than maximum likelihood in terms of computational performance: matching the model parameters to the observed moments may involve solving computationally intractable systems of multivariate polynomial equations. Fortunately, for many classes of latent variable models, there is rich structure in low-order moments (typically second- and third-order) which allows this inverse moment problem to be solved efficiently [Cat44, Car91, Cha96, MR06, HKZ12, AHK12, AFH12, HK13]. What is more, these decomposition problems are often amenable to simple and efficient iterative methods, such as gradient descent and the power iteration method.

1.1 Contributions

This work observes that a number of important and well-studied latent variable models—including Gaussian mixture models, hidden Markov models, and Latent Dirichlet allocation—share a certain structure in their low-order moments, and this permits certain tensor decomposition approaches to parameter estimation. In particular, this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices.

While much of this (or similar) structure was implicit in several previous works [Cha96, MR06, HKZ12, AHK12, AFH12, HK13], here we make the decomposition explicit under a unified framework. Specifically, we express the observable moments as sums of rank-one terms, and reduce the parameter estimation task to the problem of extracting a symmetric orthogonal decomposition of a symmetric tensor derived from these observable moments. The problem can then be solved by a variety of approaches, including fixed-point and variational methods.

One approach for obtaining the orthogonal decomposition is the tensor power method of [LMV00, Remark 3]. We provide a convergence analysis of this method for orthogonally decomposable symmetric tensors, as well as a detailed perturbation analysis for a robust (and computationally tractable) variant (Theorem 5.1). This perturbation analysis can be viewed as an analogue of Wedin’s perturbation theorem for singular vectors of matrices [Wed72], providing a bound on the error of the recovered decomposition in terms of the operator norm of the tensor perturbation. This analysis is subtle in at least two ways. First, unlike for matrices (where every matrix has a singular value decomposition), an orthogonal decomposition need not exist for the perturbed tensor. Our robust variant uses random restarts and deflation to extract an approximate decomposition in a computationally tractable manner. Second, the analysis of the deflation steps is non-trivial; a naïve argument would entail error accumulation in each deflation step, which we show can in fact be avoided. When this method is applied for parameter estimation in the latent variable models previously discussed, improved sample complexity bounds (over previous work) can be obtained using this perturbation analysis.

Finally, we also address computational issues that arise when applying the tensor decomposition approaches to estimating latent variable models. Specifically, we show that the basic operations of simple iterative approaches (such as the tensor power method) can be efficiently executed in time linear in the dimension of the observations and the size of the training data. For instance, in a topic modeling application, the proposed methods require time linear in the number of words in the vocabulary and in the number of non-zero entries of the term-document matrix. The combination of this computational efficiency and the robustness of the tensor decomposition techniques makes the overall framework a promising approach to parameter estimation for latent variable models.

1.2 Related work

Tensor decompositions

The role of tensor decompositions in the context of latent variable models dates back to early uses in psychometrics [Cat44]. These ideas later gained popularity in chemometrics, and more recently in numerous science and engineering disciplines, including neuroscience, phylogenetics, signal processing, data mining, and computer vision. A thorough survey of these techniques and applications can be found in [KB09]. Below, we discuss a few specific connections to two applications in machine learning and statistics: independent component analysis and latent variable models (between which there is also significant overlap).

Tensor decompositions have been used in signal processing and computational neuroscience for blind source separation and independent component analysis (ICA) [CJ10]. Here, statistically independent non-Gaussian sources are linearly mixed in the observed signal, and the goal is to recover the mixing matrix (and ultimately, the original source signals). A typical solution is to locate projections of the observed signals that correspond to local extrema of so-called “contrast functions” which distinguish Gaussian variables from non-Gaussian variables. This method can be effectively implemented using fast descent algorithms [Hyv99]. When the excess kurtosis (i.e., the fourth-order cumulant) is used as the contrast function, this method reduces to a generalization of the power method for symmetric tensors [LMV00, ZG01, KR02]. This case is particularly important, since all local extrema of the kurtosis objective correspond to the true sources (under the assumed statistical model) [DL95]; the descent methods can therefore be rigorously analyzed, and their computational and statistical complexity can be bounded [FJK96, NR09, AGMS12, HK13].

Higher-order tensor decompositions have also been used for estimating latent variable models, often via the algebraic technique of [Har70] (see also [Kru77, LRA93]). This was employed for parameter estimation of discrete Markov models [Cha96] using pair-wise and triple-wise probability tables, and later for other latent variable models such as hidden Markov models (HMMs), latent trees, Gaussian mixture models, and topic models such as latent Dirichlet allocation (LDA) [MR06, HKZ12, AHK12, AFH12, HK13]. Simultaneous diagonalization is also used for many other applications, including blind source separation and ICA (as discussed above), and a number of efficient algorithms have been developed for this problem [BGBM93, CS93, Car94, CC96, CGT97, ZLNM04]. Another reduction from tensors to matrices called flattening has also been used to solve tensor decomposition problems via matrix eigenvalue techniques [Car91, DLCC07]. One advantage of these methods is that they can be used to estimate under-determined mixtures, where the number of sources is larger than the observed dimension.

The relevance of tensor analysis to latent variable modeling has been long recognized in the field of algebraic statistics [PS05], and many works characterize the algebraic varieties corresponding to the moments of various classes of latent variable models [DSS07, SZ13]. These works typically do not address computational or finite sample issues, but rather are concerned with basic questions of identifiability.

The specific tensor structure considered in the present work is the symmetric orthogonal decomposition. This decomposition expresses a tensor as a linear combination of simple tensor forms; each form is the tensor power of a vector (i.e., a rank-1 tensor), and the collection of vectors form an orthonormal basis. An important property of tensors with such decompositions is that they have eigenvectors corresponding to these basis vectors. Although the concepts of eigenvalues and eigenvectors of tensors are generally significantly more complicated than their matrix counterparts (both algebraically [Qi05, CS13, Lim05] and computationally [HL13, KR02]), the special symmetric orthogonal structure we consider permits simple algorithms to efficiently and stably recover the desired decomposition. In particular, a generalization of the matrix power method to symmetric tensors, introduced in [LMV00, Remark 3] and analyzed in [KR02], provides such a decomposition. This is in fact implied by the characterization in [ZG01], which shows that iteratively obtaining the best rank-1 approximation of such orthogonally decomposable tensors also yields the exact decomposition. We note that in general, obtaining such approximations for general (symmetric) tensors is NP-hard [HL13].
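As a concrete sketch of this power-method idea (not the robust variant analyzed later; the function names, restart scheme, and toy parameters below are our own), one can build an orthogonally decomposable tensor in numpy and recover its components by power iterations with deflation:

```python
import numpy as np

def tensor_power_iteration(T, n_iters=100, seed=0):
    # One run of the symmetric tensor power update: v <- T(I, v, v) / ||T(I, v, v)||.
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        v = np.einsum('ijk,j,k->i', T, v, v)
        v /= np.linalg.norm(v)
    lam = np.einsum('ijk,i,j,k->', T, v, v, v)  # eigenvalue estimate T(v, v, v)
    return lam, v

def orthogonal_decomposition(T, k, n_restarts=10):
    # Run power iterations from several random starts, keep the run with the
    # largest eigenvalue estimate, then deflate and repeat k times.
    pairs = []
    for _ in range(k):
        lam, v = max((tensor_power_iteration(T, seed=s) for s in range(n_restarts)),
                     key=lambda p: p[0])
        pairs.append((lam, v))
        T = T - lam * np.einsum('i,j,k->ijk', v, v, v)  # deflation step
    return pairs

# Build an orthogonally decomposable tensor T = sum_i lam_i v_i^{(3)} and recover it.
rng = np.random.default_rng(1)
V, _ = np.linalg.qr(rng.standard_normal((5, 3)))  # orthonormal columns v_1, v_2, v_3
lams = np.array([3.0, 2.0, 1.0])
T = np.einsum('r,ir,jr,kr->ijk', lams, V, V, V)
recovered = sorted(lam for lam, _ in orthogonal_decomposition(T, 3))
```

Each run converges to one of the basis vectors (for orthogonally decomposable tensors), so over the deflation rounds all eigenvalue/eigenvector pairs are recovered, in some order.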

Latent variable models

This work focuses on the particular application of tensor decomposition methods to estimating latent variable models, a significant departure from many previous approaches in the machine learning and statistics literature. By far the most popular heuristic for parameter estimation for such models is the Expectation-Maximization (EM) algorithm [DLR77, RW84]. Although EM has a number of merits, it may suffer from slow convergence and poor quality local optima [RW84], requiring practitioners to employ many additional heuristics to obtain good solutions. For some models such as latent trees [Roc06] and topic models [AGM12], maximum likelihood estimation is NP-hard, which suggests that other estimation approaches may be more attractive. More recently, algorithms from theoretical computer science and machine learning have addressed computational and sample complexity issues related to estimating certain latent variable models such as Gaussian mixture models and HMMs [Das99, AK01, DS07, VW02, KSV05, AM05, CR08, BV08, KMV10, BS10, MV10, HK13, Cha96, MR06, HKZ12, AHK12, AGM12, AFH12]. See [AHK12, HK13] for a discussion of these methods, together with the computational and statistical hardness barriers that they face. The present work reviews a broad range of latent variable models where a mild non-degeneracy condition implies the symmetric orthogonal decomposition structure in the tensors of low-order observable moments.

Notably, another class of methods, based on subspace identification [OM96] and observable operator models/multiplicity automata [Sch61, Jae00, LSS01], have been proposed for a number of latent variable models. These methods were successfully developed for HMMs in [HKZ12], and subsequently generalized and extended for a number of related sequential and tree Markov models [SBG10, Bai11, BSG11, PSX11, RFWU13, BQC12, BM12], as well as certain classes of parse tree models [LQBC12, CSC12, DRC12]. These methods use low-order moments to learn an “operator” representation of the distribution, which can be used for density estimation and belief state updates. While finite sample bounds can be given to establish the learnability of these models [HKZ12], the algorithms do not actually give parameter estimates (e.g., of the emission or transition matrices in the case of HMMs).

1.3 Organization

The rest of the paper is organized as follows. Section 2 reviews some basic definitions of tensors. Section 3 provides examples of a number of latent variable models which, after appropriate manipulations of their low-order moments, share a certain natural tensor structure. Section 4 reduces the problem of parameter estimation to that of extracting a certain (symmetric orthogonal) decomposition of a tensor. Section 5 then provides a detailed analysis of a robust tensor power method and establishes an analogue of Wedin’s perturbation theorem for the singular vectors of matrices. The discussion in Section 6 addresses a number of practical concerns that arise when dealing with moment matrices and tensors.

2 Preliminaries

We introduce some tensor notations borrowed from [Lim05]. A real p-th order tensor A ∈ ⨂_{i=1}^p R^{n_i} is a member of the tensor product of Euclidean spaces R^{n_i}, i ∈ [p]. We generally restrict to the case where n_1 = n_2 = ⋯ = n_p = n, and simply write A ∈ ⨂^p R^n. For a vector v ∈ R^n, we use v^{⊗p} := v ⊗ v ⊗ ⋯ ⊗ v ∈ ⨂^p R^n to denote its p-th tensor power. As is the case for vectors (where p = 1) and matrices (where p = 2), we may identify a p-th order tensor with the p-way array of real numbers [A_{i_1,i_2,…,i_p} : i_1, i_2, …, i_p ∈ [n]], where A_{i_1,i_2,…,i_p} is the (i_1, i_2, …, i_p)-th coordinate of A (with respect to a canonical basis).

We can consider A to be a multilinear map in the following sense: for a set of matrices {V_t ∈ R^{n×m_t} : t ∈ [p]}, the (i_1, i_2, …, i_p)-th entry in the p-way array representation of A(V_1, V_2, …, V_p) ∈ R^{m_1×m_2×⋯×m_p} is

 [A(V_1, V_2, …, V_p)]_{i_1,i_2,…,i_p} := ∑_{j_1,j_2,…,j_p ∈ [n]} A_{j_1,j_2,…,j_p} [V_1]_{j_1,i_1} [V_2]_{j_2,i_2} ⋯ [V_p]_{j_p,i_p}.

Note that if A is a matrix (p = 2), then

 A(V_1, V_2) = V_1^⊤ A V_2.

Similarly, for a matrix A and vector v ∈ R^n, we can express Av as

 A(I, v) = Av ∈ R^n,

where I is the n×n identity matrix. As a final example of this notation, observe

 A(e_{i_1}, e_{i_2}, …, e_{i_p}) = A_{i_1,i_2,…,i_p},

where {e_1, e_2, …, e_n} is the canonical basis for R^n.
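A minimal numpy sketch of this multilinear action (the helper name `multilinear_map` is ours, not from the text):

```python
import numpy as np

def multilinear_map(A, Vs):
    # [A(V_1, ..., V_p)]_{i_1,...,i_p} = sum over j_1,...,j_p of
    # A_{j_1,...,j_p} [V_1]_{j_1,i_1} ... [V_p]_{j_p,i_p}.
    for V in Vs:
        # Contract the current leading mode with V; tensordot appends the new
        # index at the end, so after p steps the modes are in original order.
        A = np.tensordot(A, V, axes=([0], [0]))
    return A

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
V1, V2 = rng.standard_normal((4, 2)), rng.standard_normal((4, 3))
v = rng.standard_normal(4)
```

For a matrix this reproduces the identities above: `multilinear_map(A, [V1, V2])` equals `V1.T @ A @ V2`, and contracting with the identity in the first slot and `v` (as a single-column matrix) in the second recovers `A @ v`.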

Most tensors considered in this work will be symmetric (sometimes called supersymmetric), which means that their p-way array representations are invariant to permutations of the array indices: i.e., for all indices i_1, i_2, …, i_p ∈ [n], A_{i_1,i_2,…,i_p} = A_{i_{π(1)},i_{π(2)},…,i_{π(p)}} for any permutation π on [p]. It can be checked that this reduces to the usual definition of a symmetric matrix for p = 2.

The rank of a p-th order tensor A is the smallest non-negative integer k such that A = ∑_{j=1}^k u_{1,j} ⊗ u_{2,j} ⊗ ⋯ ⊗ u_{p,j} for some vectors u_{i,j} ∈ R^n, and the symmetric rank of a symmetric p-th order tensor A is the smallest non-negative integer k such that A = ∑_{j=1}^k u_j^{⊗p} for some vectors u_j ∈ R^n (for even p, the definition is slightly different [CGLM08]). The notion of rank readily reduces to the usual definition of matrix rank when p = 2, as revealed by the singular value decomposition. Similarly, for symmetric matrices, the symmetric rank is equivalent to the matrix rank as given by the spectral theorem.

The notion of tensor (symmetric) rank is considerably more delicate than matrix (symmetric) rank. For instance, it is not clear a priori that the symmetric rank of a tensor should even be finite [CGLM08]. In addition, removal of the best rank-1 approximation of a (general) tensor may increase the tensor rank of the residual [SC10].

Throughout, we use ∥v∥ to denote the Euclidean norm of a vector v, and ∥M∥ to denote the spectral (operator) norm of a matrix M. We also use ∥T∥ to denote the operator norm of a tensor, which we define later.

3 Tensor structure in latent variable models

In this section, we give several examples of latent variable models whose low-order moments can be written as symmetric tensors of low symmetric rank; many of these examples can be deduced using the techniques developed in [McC87]. The basic form is demonstrated in Theorem 3.1 for the first example, and the general pattern will emerge from subsequent examples.

3.1 Exchangeable single topic models

We first consider a simple bag-of-words model for documents in which the words in the document are assumed to be exchangeable. Recall that a collection of random variables x_1, x_2, …, x_ℓ are exchangeable if their joint probability distribution is invariant to permutation of the indices. The well-known theorem of De Finetti [Aus08] implies that such exchangeable models can be viewed as mixture models in which there is a latent variable h such that x_1, x_2, …, x_ℓ are conditionally i.i.d. given h (see Figure 1(a) for the corresponding graphical model), and the conditional distributions are identical at all the nodes.

In our simplified topic model for documents, the latent variable h is interpreted as the (sole) topic of a given document, and it is assumed to take only a finite number of distinct values. Let k be the number of distinct topics in the corpus, d be the number of distinct words in the vocabulary, and ℓ ≥ 3 be the number of words in each document. The generative process for a document is as follows: the document’s topic is drawn according to the discrete distribution specified by the probability vector w := (w_1, w_2, …, w_k). This is modeled as a discrete random variable h such that

 Pr[h = j] = w_j,  j ∈ [k].

Given the topic h, the document’s ℓ words are drawn independently according to the discrete distribution specified by the probability vector μ_h ∈ Δ^{d−1}. It will be convenient to represent the ℓ words in the document by d-dimensional random vectors x_1, x_2, …, x_ℓ ∈ R^d. Specifically, we set

 x_t = e_i  if and only if  the t-th word in the document is i,  t ∈ [ℓ],

where e_1, e_2, …, e_d is the standard coordinate basis for R^d.

One advantage of this encoding of words is that the (cross) moments of these random vectors correspond to joint probabilities over words. For instance, observe that

 E[x_1 ⊗ x_2] = ∑_{1≤i,j≤d} Pr[x_1 = e_i, x_2 = e_j] e_i ⊗ e_j = ∑_{1≤i,j≤d} Pr[1st word = i, 2nd word = j] e_i ⊗ e_j,

so the (i, j)-th entry of the matrix E[x_1 ⊗ x_2] is Pr[1st word = i, 2nd word = j]. More generally, the (i, j, l)-th entry in the tensor E[x_1 ⊗ x_2 ⊗ x_3] is Pr[1st word = i, 2nd word = j, 3rd word = l]. This means that estimating cross moments, say, of x_1 ⊗ x_2 ⊗ x_3, is the same as estimating joint probabilities of the first three words over all documents. (Recall that we assume that each document has at least three words.)

The second advantage of the vector encoding of words is that the conditional expectation of x_t given h = j is simply μ_j, the vector of word probabilities for topic j:

 E[x_t | h = j] = ∑_{i=1}^d Pr[t-th word = i | h = j] e_i = ∑_{i=1}^d [μ_j]_i e_i = μ_j,  j ∈ [k]

(where [μ_j]_i is the i-th entry in the vector μ_j). Because the words are conditionally independent given the topic, we can use this same property with conditional cross moments, say, of x_1 and x_2:

 E[x_1 ⊗ x_2 | h = j] = E[x_1 | h = j] ⊗ E[x_2 | h = j] = μ_j ⊗ μ_j,  j ∈ [k].

This and similar calculations lead one to the following theorem.

Theorem 3.1 ([AHK12]).

If

 M_2 := E[x_1 ⊗ x_2],
 M_3 := E[x_1 ⊗ x_2 ⊗ x_3],

then

 M_2 = ∑_{i=1}^k w_i μ_i ⊗ μ_i,
 M_3 = ∑_{i=1}^k w_i μ_i ⊗ μ_i ⊗ μ_i.

As we will see in Section 4.3, the structure of M_2 and M_3 revealed in Theorem 3.1 implies that the topic vectors μ_1, μ_2, …, μ_k can be estimated by computing a certain symmetric tensor decomposition. Moreover, due to exchangeability, all triples (resp., pairs) of words in a document—and not just the first three (resp., two) words—can be used in forming M_3 (resp., M_2); see Section 6.1.
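To make this structure concrete, the following sketch builds the population moments of Theorem 3.1 for a made-up model with d = k (so the topic matrix is square and invertible) and recovers the topic vectors via the simultaneous-diagonalization idea attributed above to [Cha96]; all parameter values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
k = d = 4
w = rng.dirichlet(np.ones(k))              # topic probabilities w_i
mu = rng.dirichlet(np.ones(d), size=k).T   # columns are topic vectors mu_i

# Population moments from Theorem 3.1.
M2 = np.einsum('r,ir,jr->ij', w, mu, mu)
M3 = np.einsum('r,ir,jr,kr->ijk', w, mu, mu, mu)

# For a generic probe vector eta, M3(I, I, eta) M2^{-1} = A diag(mu_i' eta) A^{-1}
# with A = [mu_1 | ... | mu_k], so its eigenvectors are the topic vectors up to scale.
eta = rng.standard_normal(d)
B = np.einsum('ijk,k->ij', M3, eta) @ np.linalg.inv(M2)
_, vecs = np.linalg.eig(B)
vecs = vecs.real                           # eigenvalues are real for generic eta
mu_hat = vecs / vecs.sum(axis=0)           # rescale columns to probability vectors
```

This simple route needs distinct eigenvalues of B and a square, invertible topic matrix; the tensor power method developed later avoids these restrictions and tolerates perturbations.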

3.2 Beyond raw moments

In the single topic model above, the raw (cross) moments of the observed words directly yield the desired symmetric tensor structure. In some other models, the raw moments do not explicitly have this form. Here, we show that the desired tensor structure can be found through various manipulations of different moments.

Spherical Gaussian mixtures

We now consider a mixture of Gaussian distributions with spherical covariances. We start with the simpler case where all of the covariances are identical; this probabilistic model is closely related to the (non-probabilistic) k-means clustering problem [Mac67]. We then consider the case where the spherical variances may differ.

Common covariance.

Let w_i be the probability of choosing component i, μ_1, μ_2, …, μ_k ∈ R^d be the component mean vectors, and σ²I be the common covariance matrix. An observation in this model is given by

 x := μ_h + z,

where h is the discrete random variable with Pr[h = i] = w_i for i ∈ [k] (similar to the exchangeable single topic model), and z is an independent multivariate Gaussian random vector in R^d with zero mean and spherical covariance σ²I.

The Gaussian mixture model differs from the exchangeable single topic model in the way observations are generated. In the single topic model, we observe multiple draws (words in a particular document) x_1, x_2, …, x_ℓ given the same fixed h (the topic of the document). In contrast, for the Gaussian mixture model, every realization of x corresponds to a different realization of h.

Theorem 3.2 ([HK13]).

Assume d ≥ k. The variance σ² is the smallest eigenvalue of the covariance matrix E[x ⊗ x] − E[x] ⊗ E[x]. Furthermore, if

 M_2 := E[x ⊗ x] − σ²I,
 M_3 := E[x ⊗ x ⊗ x] − σ² ∑_{i=1}^d (E[x] ⊗ e_i ⊗ e_i + e_i ⊗ E[x] ⊗ e_i + e_i ⊗ e_i ⊗ E[x]),

then

 M_2 = ∑_{i=1}^k w_i μ_i ⊗ μ_i,
 M_3 = ∑_{i=1}^k w_i μ_i ⊗ μ_i ⊗ μ_i.
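A Monte Carlo sanity check of these corrections on a hypothetical two-component mixture (all parameter values invented; sampling error makes the comparison approximate):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 3, 0.5
w = np.array([0.3, 0.7])
mu = np.array([[1.0, 0.0, -1.0],
               [-1.0, 1.0, 0.0]]).T        # columns are the component means

# Draw samples x = mu_h + z with z ~ N(0, sigma^2 I).
n = 500_000
h = rng.choice(2, size=n, p=w)
x = mu[:, h].T + sigma * rng.standard_normal((n, d))

# Empirical raw moments.
E1 = x.mean(axis=0)
E2 = np.einsum('ni,nj->ij', x, x) / n
E3 = np.einsum('ni,nj,nk->ijk', x, x, x) / n

# Moment corrections from Theorem 3.2.
I = np.eye(d)
M2 = E2 - sigma**2 * I
M3 = E3 - sigma**2 * (np.einsum('a,bc->abc', E1, I)
                      + np.einsum('b,ac->abc', E1, I)
                      + np.einsum('c,ab->abc', E1, I))
```

Up to sampling error, M2 and M3 match the rank-one sums over the component means, exactly as in the single topic model.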
Differing covariances.

The general case is where each component may have a different spherical covariance σ_i²I for i ∈ [k]. An observation in this model is again x = μ_h + z, but now z is a random vector whose conditional distribution given h = i (for some i ∈ [k]) is a multivariate Gaussian with zero mean and spherical covariance σ_i²I.

Theorem 3.3 ([HK13]).

Assume d ≥ k. The average variance σ̄² := ∑_{i=1}^k w_i σ_i² is the smallest eigenvalue of the covariance matrix E[x ⊗ x] − E[x] ⊗ E[x]. Let v be any unit norm eigenvector corresponding to the eigenvalue σ̄². If

 M_1 := E[x (v^⊤(x − E[x]))²],
 M_2 := E[x ⊗ x] − σ̄²I,
 M_3 := E[x ⊗ x ⊗ x] − ∑_{i=1}^d (M_1 ⊗ e_i ⊗ e_i + e_i ⊗ M_1 ⊗ e_i + e_i ⊗ e_i ⊗ M_1),

then

 M_2 = ∑_{i=1}^k w_i μ_i ⊗ μ_i,
 M_3 = ∑_{i=1}^k w_i μ_i ⊗ μ_i ⊗ μ_i.

As shown in [HK13], M_1 = ∑_{i=1}^k w_i σ_i² μ_i. Note that for the common covariance case, where σ_i² = σ², we have that M_1 = σ² E[x] (cf. Theorem 3.2).

Independent component analysis (ICA)

The standard model for ICA [Com94, CC96, HO00, CJ10], in which independent signals are linearly mixed and corrupted with Gaussian noise before being observed, is specified as follows. Let h ∈ R^k be a latent random vector with independent coordinates, A ∈ R^{d×k} the mixing matrix, and z be a multivariate Gaussian random vector. The random vectors h and z are assumed to be independent. The observed random vector is

 x := Ah + z.

Let μ_i denote the i-th column of the mixing matrix A.

Theorem 3.4 ([CJ10]).

Define

 M_4 := E[x ⊗ x ⊗ x ⊗ x] − T

where T is the fourth-order tensor with

 [T]_{i_1,i_2,i_3,i_4} := E[x_{i_1} x_{i_2}] E[x_{i_3} x_{i_4}] + E[x_{i_1} x_{i_3}] E[x_{i_2} x_{i_4}] + E[x_{i_1} x_{i_4}] E[x_{i_2} x_{i_3}]

(i.e., T is the fourth derivative tensor of the function v ↦ 8^{−1} E[(v^⊤x)²]², so M_4 is the fourth cumulant tensor). Let κ_i := E[h_i⁴] − 3 for each i ∈ [k]. Then

 M_4 = ∑_{i=1}^k κ_i μ_i ⊗ μ_i ⊗ μ_i ⊗ μ_i.

Note that κ_i corresponds to the excess kurtosis, a measure of non-Gaussianity, as κ_i = 0 if h_i is a standard normal random variable. Furthermore, note that A is not identifiable if h is a multivariate Gaussian.

We may derive forms similar to that of M_2 and M_3 from Theorem 3.1 using M_4 by observing that

 M_4(I, I, u, v) = ∑_{i=1}^k κ_i (μ_i^⊤u)(μ_i^⊤v) μ_i ⊗ μ_i,
 M_4(I, I, I, v) = ∑_{i=1}^k κ_i (μ_i^⊤v) μ_i ⊗ μ_i ⊗ μ_i

for any vectors u, v.
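These contraction identities are easy to verify numerically by building M_4 directly from made-up κ_i and μ_i (all values below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 3
mu = rng.standard_normal((d, k))           # columns of a made-up mixing matrix
kappa = np.array([1.5, -1.0, 2.0])         # hypothetical excess kurtoses

# M4 = sum_i kappa_i mu_i^{(4)}, as in Theorem 3.4.
M4 = np.einsum('r,ir,jr,kr,lr->ijkl', kappa, mu, mu, mu, mu)
u, v = rng.standard_normal(d), rng.standard_normal(d)

# M4(I, I, u, v): contract the last two modes with u and v.
lhs = np.einsum('ijkl,k,l->ij', M4, u, v)
rhs = np.einsum('r,ir,jr->ij', kappa * (mu.T @ u) * (mu.T @ v), mu, mu)

# M4(I, I, I, v): contract only the last mode with v.
lhs3 = np.einsum('ijkl,l->ijk', M4, v)
rhs3 = np.einsum('r,ir,jr,kr->ijk', kappa * (mu.T @ v), mu, mu, mu)
```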

Latent Dirichlet Allocation (LDA)

An increasingly popular class of latent variable models are mixed membership models, where each datum may belong to several different latent classes simultaneously. LDA is one such model for the case of document modeling; here, each document corresponds to a mixture over topics (as opposed to just a single topic). The distribution over such topic mixtures is a Dirichlet distribution Dir(α) with parameter vector α := (α_1, α_2, …, α_k) with strictly positive entries; its density over the probability simplex Δ^{k−1} is given by

 p_α(h) = (Γ(α_0)/∏_{i=1}^k Γ(α_i)) ∏_{i=1}^k h_i^{α_i−1},  h ∈ Δ^{k−1},

where

 α_0 := α_1 + α_2 + ⋯ + α_k.

As before, the k topics are specified by probability vectors μ_1, μ_2, …, μ_k ∈ Δ^{d−1}. To generate a document, we first draw the topic mixture h = (h_1, h_2, …, h_k) ∼ Dir(α), and then, conditioned on h, we draw ℓ words x_1, x_2, …, x_ℓ independently from the discrete distribution specified by the probability vector ∑_{i=1}^k h_i μ_i (i.e., for each word x_t, we independently sample a topic j according to h and then sample x_t according to μ_j). Again, we encode a word x_t by setting x_t = e_i iff the t-th word in the document is i.

The parameter α_0 (the sum of the “pseudo-counts”) characterizes the concentration of the distribution. As α_0 → 0, the distribution degenerates to a single topic model (i.e., the limiting density has, with probability 1, exactly one entry of h being 1 and the rest being 0). At the other extreme, if α = (c, c, …, c) for some scalar c > 0, then as c → ∞, the distribution of h becomes peaked around the uniform vector (1/k, 1/k, …, 1/k) (furthermore, the distribution behaves like a product distribution). We are typically interested in the case where α_0 is small (e.g., a constant independent of k), whereupon h typically has only a few large entries. This corresponds to the setting where the documents are mainly comprised of just a few topics.

Theorem 3.5 ([AFH12]).

Define

 M_1 := E[x_1],
 M_2 := E[x_1 ⊗ x_2] − (α_0/(α_0 + 1)) M_1 ⊗ M_1,
 M_3 := E[x_1 ⊗ x_2 ⊗ x_3]
    − (α_0/(α_0 + 2)) (E[x_1 ⊗ x_2 ⊗ M_1] + E[x_1 ⊗ M_1 ⊗ x_2] + E[M_1 ⊗ x_1 ⊗ x_2])
    + (2α_0²/((α_0 + 2)(α_0 + 1))) M_1 ⊗ M_1 ⊗ M_1.

Then

 M_2 = ∑_{i=1}^k (α_i/((α_0 + 1)α_0)) μ_i ⊗ μ_i,
 M_3 = ∑_{i=1}^k (2α_i/((α_0 + 2)(α_0 + 1)α_0)) μ_i ⊗ μ_i ⊗ μ_i.

Note that α_0 needs to be known to form M_2 and M_3 from the raw moments. This, however, is a much weaker requirement than assuming that the entire distribution of h is known (i.e., knowledge of the whole parameter vector α).
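A small closed-form check of the second-moment identity, using the exact first and second moments of the Dirichlet distribution (the parameter values below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 3, 5
alpha = np.array([0.2, 0.5, 0.3])          # hypothetical Dirichlet parameters
a0 = alpha.sum()
mu = rng.dirichlet(np.ones(d), size=k).T   # columns are topic vectors mu_i

# Exact Dirichlet moments: E[h] = alpha/a0 and
# E[h h'] = (diag(alpha) + alpha alpha') / (a0 (a0 + 1)).
Eh = alpha / a0
Ehh = (np.diag(alpha) + np.outer(alpha, alpha)) / (a0 * (a0 + 1))

# Word moments: E[x1] = A E[h] and E[x1 (x) x2] = A E[h h'] A', since x1 and x2
# are conditionally independent given h, with A = [mu_1 | ... | mu_k].
M1 = mu @ Eh
Ex1x2 = mu @ Ehh @ mu.T

# The correction in Theorem 3.5 cancels the alpha alpha' part, leaving diag(alpha).
M2 = Ex1x2 - (a0 / (a0 + 1)) * np.outer(M1, M1)
M2_true = np.einsum('r,ir,jr->ij', alpha / ((a0 + 1) * a0), mu, mu)
```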

3.3 Multi-view models

Multi-view models (also sometimes called naïve Bayes models) are a special class of Bayesian networks in which observed variables x_1, x_2, …, x_ℓ are conditionally independent given a latent variable h. This is similar to the exchangeable single topic model, but here we do not require the conditional distributions of the x_t, t ∈ [ℓ], to be identical. Techniques developed for this class can be used to handle a number of widely used models including hidden Markov models (HMMs) [MR06, AHK12], phylogenetic tree models [Cha96, MR06], certain tree mixtures [AHHK12], and certain probabilistic grammar models [HKL12].

As before, we let h ∈ [k] be a discrete random variable with Pr[h = j] = w_j for all j ∈ [k]. Now consider random vectors x_1 ∈ R^{d_1}, x_2 ∈ R^{d_2}, and x_3 ∈ R^{d_3} which are conditionally independent given h, and

 E[x_t | h = j] = μ_{t,j},  j ∈ [k], t ∈ {1, 2, 3},

where the μ_{t,j} ∈ R^{d_t} are the conditional means of the x_t given h = j. Thus, we allow the observations x_1, x_2, x_3 to be random vectors, parameterized only by their conditional means. Importantly, these conditional distributions may be discrete, continuous, or even a mix of both.

We first note the form for the raw (cross) moments.

Proposition 3.1.

We have that:

 E[x_t ⊗ x_{t′}] = ∑_{i=1}^k w_i μ_{t,i} ⊗ μ_{t′,i},  {t, t′} ⊂ {1, 2, 3}, t ≠ t′,
 E[x_1 ⊗ x_2 ⊗ x_3] = ∑_{i=1}^k w_i μ_{1,i} ⊗ μ_{2,i} ⊗ μ_{3,i}.

The cross moments do not possess a symmetric tensor form when the conditional distributions are different. Nevertheless, the moments can be “symmetrized” via a simple linear transformation of x_1 and x_2 (roughly speaking, this relates x_1 and x_2 to x_3); this leads to an expression from which the conditional means of x_3 (i.e., μ_{3,1}, μ_{3,2}, …, μ_{3,k}) can be recovered. For simplicity, we assume d_1 = d_2 = d_3 = k; the general case (with d_t ≥ k) is easily handled using low-rank singular value decompositions.

Theorem 3.6 ([AFH12]).

Assume that the vectors {μ_{t,1}, μ_{t,2}, …, μ_{t,k}} are linearly independent for each t ∈ {1, 2, 3}. Define

 x̃_1 := E[x_3 ⊗ x_2] E[x_1 ⊗ x_2]^{−1} x_1,
 x̃_2 := E[x_3 ⊗ x_1] E[x_2 ⊗ x_1]^{−1} x_2,
 M_2 := E[x̃_1 ⊗ x̃_2],
 M_3 := E[x̃_1 ⊗ x̃_2 ⊗ x_3].

Then

 M_2 = ∑_{i=1}^k w_i μ_{3,i} ⊗ μ_{3,i},
 M_3 = ∑_{i=1}^k w_i μ_{3,i} ⊗ μ_{3,i} ⊗ μ_{3,i}.
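A quick exact check of the symmetrization, computing population cross moments directly from made-up parameters via Proposition 3.1:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
w = rng.dirichlet(np.ones(k))
A1, A2, A3 = (rng.standard_normal((k, k)) for _ in range(3))  # columns are mu_{t,i}
W = np.diag(w)

def P(Aa, Ab):
    # Population cross moment E[x_a (x) x_b] = sum_i w_i mu_{a,i} mu_{b,i}'.
    return Aa @ W @ Ab.T

# Symmetrizing operators from Theorem 3.6, applied to views 1 and 2.
B1 = P(A3, A2) @ np.linalg.inv(P(A1, A2))   # maps x1 to x1~
B2 = P(A3, A1) @ np.linalg.inv(P(A2, A1))   # maps x2 to x2~
M2 = B1 @ P(A1, A2) @ B2.T                  # = E[x1~ (x) x2~]

M2_true = np.einsum('r,ir,jr->ij', w, A3, A3)
```

Algebraically, B1 = A3 A1^{-1} and B2 = A3 A2^{-1}, so the transformed views have the third view's conditional means, which is why the symmetrized moment involves only the μ_{3,i}.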

We now discuss three examples (taken mostly from [AHK12]) where the above observations can be applied. The first two concern mixtures of product distributions, and the last one is the time-homogeneous hidden Markov model.

Mixtures of axis-aligned Gaussians and other product distributions

The first example is a mixture of product distributions in R^d under a mild incoherence assumption [AHK12]. Here, we allow each of the k component distributions to have a different product distribution (e.g., a Gaussian distribution with an axis-aligned covariance matrix), but require the matrix of component means A := [μ_1 | μ_2 | ⋯ | μ_k] to satisfy a certain (very mild) incoherence condition. The role of the incoherence condition is explained below.

For a mixture of product distributions, any partitioning of the dimensions {1, 2, …, d} into three groups creates three (possibly asymmetric) “views” which are conditionally independent once the mixture component is selected. However, recall that Theorem 3.6 requires that for each view, the k conditional means be linearly independent. In general, this may not be achievable; consider, for instance, the case μ_i = e_i for each i ∈ [k]. Such cases, where the component means are very aligned with the coordinate basis, are precluded by the incoherence condition.

Define coherence(A) to be the largest diagonal entry of the orthogonal projector to the range of A, and assume A has rank k. The coherence lies between k/d and 1; it is largest when the range of A is spanned by the coordinate axes, and it is k/d when the range is spanned by a subset of the Hadamard basis of cardinality k. The incoherence condition requires, for some ε, δ ∈ (0, 1),

 coherence(A) ≤ (ε²/6)/ln(3k/δ).

Essentially, this condition ensures that the non-degeneracy of the component means is not isolated in just a few of the dimensions. Operationally, it implies the following.

Proposition 3.2 ([AHK12]).

Assume A has rank k and coherence(A) ≤ (ε²/6)/ln(3k/δ) for some ε, δ ∈ (0, 1). With probability at least 1 − δ, a random partitioning of the dimensions {1, 2, …, d} into three groups (for each i ∈ [d], independently pick t ∈ {1, 2, 3} uniformly at random and put i in group t) has the following property. For each t ∈ {1, 2, 3} and j ∈ [k], let μ_{t,j} be the entries of μ_j put into group t, and let A_t := [μ_{t,1} | μ_{t,2} | ⋯ | μ_{t,k}]. Then for each t ∈ {1, 2, 3}, A_t has full column rank, and the k-th largest singular value of A_t is at least √((1 − ε)/3) times that of A.

Therefore, three asymmetric views can be created by randomly partitioning the observed random vector x ∈ R^d into three subvectors, such that the resulting component means for each view satisfy the conditions of Theorem 3.6.

Spherical Gaussian mixtures, revisited

Consider again the case of spherical Gaussian mixtures (cf. Section 3.2). As we shall see in Section 4.3, the previous techniques (based on Theorem 3.2 and Theorem 3.3) lead to estimation procedures when the dimension of is or greater (and when the component means are linearly independent). We now show that when the dimension is slightly larger, say greater than , a different (and simpler) technique based on the multi-view structure can be used to extract the relevant structure.

We again use a randomized reduction. Specifically, we create three views by (i) applying a random rotation to x, and then (ii) partitioning x ∈ ℝ^n into three views x̃_1, x̃_2, x̃_3 ∈ ℝ^d for d := n/3. By the rotational invariance of the multivariate Gaussian distribution, the distribution of x after random rotation is still a mixture of spherical Gaussians (i.e., a mixture of product distributions), and thus x̃_1, x̃_2, x̃_3 are conditionally independent given h. What remains to be checked is that, for each view t ∈ {1, 2, 3}, the matrix of conditional means of x̃_t has full column rank. This is true with probability 1 as long as the matrix of conditional means A := [μ_1 | μ_2 | ⋯ | μ_k] ∈ ℝ^{n×k} has rank k and n ≥ 3k. To see this, observe that a random rotation in ℝ^n followed by a restriction to d coordinates is simply a random projection from ℝ^n to ℝ^d, and that a random projection of a linear subspace of dimension k to ℝ^d is almost surely injective as long as d ≥ k. Applying this observation to the range of A implies the following.

Proposition 3.3 ([HK13]).

Assume A has rank k and that n ≥ 3k. Let R ∈ ℝ^{n×n} be chosen uniformly at random among all orthogonal n × n matrices, and set x̃ := Rx ∈ ℝ^n and Ã := RA ∈ ℝ^{n×k}. Partition [n] into three groups of sizes d_1, d_2, d_3 with d_t ≥ k for each t ∈ {1, 2, 3}. Furthermore, for each t, define x̃_t ∈ ℝ^{d_t} (respectively, Ã_t ∈ ℝ^{d_t×k}) to be the subvector of x̃ (resp., submatrix of Ã) obtained by selecting the d_t entries (resp., rows) in the t-th group. Then x̃_1, x̃_2, x̃_3 are conditionally independent given h; E[x̃_t | h = j] = Ã_t e_j for each t ∈ {1, 2, 3} and j ∈ [k]; and with probability 1, the matrices Ã_1, Ã_2, Ã_3 have full column rank.

It is possible to obtain a quantitative bound on the k-th largest singular value of each Ã_t in terms of the k-th largest singular value of A (analogous to Proposition 3.2). One avenue is to show that a random rotation in fact causes Ã to have low coherence, after which we can apply Proposition 3.2. With this approach, it is sufficient to require n to be of order k ln k (for constant ε and δ), which results in the k-th largest singular value of each Ã_t being a constant fraction of the k-th largest singular value of A. We conjecture that, in fact, n ≥ c·k for some constant c > 3 suffices.
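The randomized reduction above can be sketched as follows (NumPy; the mean matrix A is a random stand-in). A Haar-random rotation is obtained from the QR factorization of a Gaussian matrix, and restriction to a block of d coordinates then acts as a random projection, which almost surely preserves the rank of A.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 12, 3                      # n >= 3k
d = n // 3

A = rng.standard_normal((n, k))   # hypothetical rank-k matrix of component means

# Sample a uniformly random rotation R: QR of a Gaussian matrix,
# with column signs fixed so the distribution is exactly Haar.
G = rng.standard_normal((n, n))
Q, Rfac = np.linalg.qr(G)
R = Q * np.sign(np.diag(Rfac))

A_rot = R @ A
views = [A_rot[t * d:(t + 1) * d, :] for t in range(3)]  # the three restricted mean matrices

for A_t in views:
    # Each d x k block has full column rank with probability 1 since d >= k.
    assert np.linalg.matrix_rank(A_t) == k
```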

Hidden Markov models

Our last example is the time-homogeneous HMM for sequences of vector-valued observations x_1, x_2, … ∈ ℝ^d. Consider a Markov chain of discrete hidden states y_1 → y_2 → y_3 → ⋯ over k possible states [k]; given a state y_t at time t, the observation x_t at time t (a random vector taking values in ℝ^d) is independent of all other observations and hidden states. See Figure 1(b).

Let π ∈ Δ^{k−1} be the initial state distribution (i.e., the distribution of y_1), and T ∈ ℝ^{k×k} be the stochastic transition matrix for the hidden state Markov chain: for all times t,

 Pr[y_{t+1} = i | y_t = j] = T_{i,j},   i, j ∈ [k].

Finally, let O ∈ ℝ^{d×k} be the matrix whose j-th column is the conditional expectation of x_t given y_t = j: for all times t,

 E[x_t | y_t = j] = O e_j,   j ∈ [k].

Proposition 3.4 ([AHK12]).

Define h := y_2, where y_2 is the second hidden state in the Markov chain. Then

• x_1, x_2, x_3 are conditionally independent given h;

• the distribution of h is given by the vector w := Tπ ∈ Δ^{k−1};

• for all j ∈ [k],

 E[x_1 | h = j] = O diag(π) T^⊤ diag(w)^{−1} e_j,
 E[x_2 | h = j] = O e_j,
 E[x_3 | h = j] = O T e_j.

Note the matrix of conditional means of x_t has full column rank, for each t ∈ {1, 2, 3}, provided that: (i) O has full column rank, (ii) T is invertible, and (iii) π and Tπ have positive entries.
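A small sketch of Proposition 3.4 with randomly generated HMM parameters (NumPy; the column-stochastic T, positive π, and generic O are all hypothetical). It forms the three conditional-mean matrices and checks the formula for E[x_1 | h = j] against a direct Bayes-rule computation of Pr[y_1 = i | y_2 = j].

```python
import numpy as np

rng = np.random.default_rng(2)
k, d = 3, 5

# Hypothetical HMM parameters.
T = rng.random((k, k))
T /= T.sum(axis=0, keepdims=True)       # column-stochastic: T[i, j] = Pr[y_{t+1}=i | y_t=j]
pi = rng.random(k); pi /= pi.sum()      # initial state distribution (distribution of y_1)
O = rng.standard_normal((d, k))         # conditional means: E[x_t | y_t = j] = O e_j

w = T @ pi                              # distribution of h := y_2

# Conditional-mean matrices of the three views given h.
B1 = O @ np.diag(pi) @ T.T @ np.diag(1.0 / w)
B2 = O
B3 = O @ T

# Check column j of B1 via Bayes' rule:
# Pr[y_1 = i | y_2 = j] = T[j, i] * pi[i] / w[j].
j = 0
backward = T[j, :] * pi / w[j]
assert np.allclose(B1[:, j], O @ backward)

# Law of total expectation: E[x_1] computed either way must agree.
assert np.allclose(B1 @ w, O @ pi)
```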

4 Orthogonal tensor decompositions

We now show how recovering the μ_i's in our aforementioned problems reduces to the problem of finding a certain orthogonal tensor decomposition of a symmetric tensor. We start by reviewing the spectral decomposition of symmetric matrices, and then discuss a generalization to the higher-order tensor case. Finally, we show how orthogonal tensor decompositions can be used for estimating the latent variable models from the previous section.

4.1 Review: the matrix case

We first build intuition by reviewing the matrix setting, where the desired decomposition is the eigendecomposition of a symmetric rank-k matrix M = VΛV^⊤, where V = [v_1 | v_2 | ⋯ | v_k] ∈ ℝ^{n×k} is the matrix with orthonormal eigenvectors as columns, and Λ = diag(λ_1, λ_2, …, λ_k) ∈ ℝ^{k×k} is the diagonal matrix of non-zero eigenvalues. In other words,

 M = ∑_{i=1}^k λ_i v_i v_i^⊤ = ∑_{i=1}^k λ_i v_i^{⊗2}. (1)

Such a decomposition is guaranteed to exist for every symmetric matrix.

Recovery of the v_i's and λ_i's can be viewed in at least two ways. First, each v_i is fixed under the mapping u ↦ Mu, up to the scaling factor λ_i:

 Mv_i = ∑_{j=1}^k λ_j (v_j^⊤ v_i) v_j = λ_i v_i

as v_j^⊤ v_i = 0 for all j ≠ i by orthogonality. The v_i's are not necessarily the only such fixed points. For instance, with the multiplicity λ_1 = λ_2 = λ, any linear combination of v_1 and v_2 is similarly fixed under M. However, in this case, the decomposition in (1) is not unique, as λ_1 v_1 v_1^⊤ + λ_2 v_2 v_2^⊤ is equal to λ(v′_1 v′_1^⊤ + v′_2 v′_2^⊤) for any pair of orthonormal vectors v′_1 and v′_2 spanning the same subspace as v_1 and v_2. Nevertheless, the decomposition is unique when λ_1, λ_2, …, λ_k are distinct, whereupon the v_j's are the only directions fixed under u ↦ Mu up to non-trivial scaling.

The second view of recovery is via the variational characterization of the eigenvalues. Assume λ_1 > λ_2 > ⋯ > λ_k; the case of repeated eigenvalues again leads to similar non-uniqueness as discussed above. Then the Rayleigh quotient

 u ↦ (u^⊤ M u) / (u^⊤ u)

is maximized over non-zero vectors by v_1. Furthermore, for any s ∈ [k], the maximizer of the Rayleigh quotient, subject to being orthogonal to v_1, v_2, …, v_{s−1}, is v_s. Another way of obtaining this second statement is to consider the deflated Rayleigh quotient

 u ↦ (u^⊤ (M − ∑_{j=1}^{s−1} λ_j v_j v_j^⊤) u) / (u^⊤ u)

and observe that v_s is the maximizer.

Efficient algorithms for finding these matrix decompositions are well studied [GvL96, Section 8.2.3], and iterative power methods are one effective class of algorithms.
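As a sketch of the power-method approach (NumPy; a small synthetic M with distinct eigenvalues, not from the paper), repeated application of the fixed-point map u ↦ Mu followed by normalization converges to the top eigenvector, and deflation then exposes the remaining pairs:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 6, 3

# Synthetic symmetric rank-k matrix M = sum_i lambda_i v_i v_i^T,
# with orthonormal v_i and distinct eigenvalues.
V, _ = np.linalg.qr(rng.standard_normal((n, k)))
lam_true = np.array([3.0, 2.0, 1.0])
M = (V * lam_true) @ V.T

def power_iteration(M, n_iter=500, seed=4):
    u = np.random.default_rng(seed).standard_normal(M.shape[0])
    for _ in range(n_iter):
        u = M @ u                       # the fixed-point map u -> M(I, u)
        u /= np.linalg.norm(u)
    return u, float(u @ M @ u)          # eigenvector estimate and its Rayleigh quotient

eigvals = []
D = M.copy()
for _ in range(k):
    v, lam = power_iteration(D)
    eigvals.append(lam)
    D = D - lam * np.outer(v, v)        # deflate the recovered component

print(np.round(eigvals, 6))             # approximately [3, 2, 1]
```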

We remark that in our multilinear tensor notation, we may write the maps u ↦ Mu and u ↦ (u^⊤ M u)/(u^⊤ u) as

 u ↦ Mu ≡ u ↦ M(I, u), (2)
 u ↦ (u^⊤ M u)/(u^⊤ u) ≡ u ↦ M(u, u)/(u^⊤ u). (3)

4.2 The tensor case

Decomposing general tensors is a delicate issue; tensors may not even have unique decompositions. Fortunately, the orthogonal tensors that arise in the aforementioned models have a structure which permits a unique decomposition under a mild non-degeneracy condition. We focus our attention on the case p = 3, i.e., a third-order tensor; the ideas extend to general p with minor modifications.

An orthogonal decomposition of a symmetric tensor T ∈ ⊗^3 ℝ^n is a collection of orthonormal (unit) vectors {v_1, v_2, …, v_k} together with corresponding positive scalars λ_i > 0 such that

 T = ∑_{i=1}^k λ_i v_i^{⊗3}. (4)

Note that since we are focusing on odd-order tensors (p = 3), we have added the requirement that the λ_i be positive. This convention can be followed without loss of generality since −λ_i v_i^{⊗p} = λ_i (−v_i)^{⊗p} whenever p is odd. Also, it should be noted that orthogonal decompositions do not necessarily exist for every symmetric tensor.

In analogy to the matrix setting, we consider two ways to view this decomposition: a fixed-point characterization and a variational characterization. Related characterizations based on optimal rank-1 approximations can be found in [ZG01].

Fixed-point characterization

For a tensor T ∈ ⊗^3 ℝ^n, consider the vector-valued map

 u ↦ T(I, u, u)