Efficient Orthogonal Tensor Decomposition, with an Application to Latent Variable Model Learning

Franz J. Király et al., UCL · 09/12/2013

Decomposing tensors into orthogonal factors is a well-known task in statistics, machine learning, and signal processing. We study orthogonal outer product decompositions in which the factors of the summands are required to be orthogonal across summands, by relating this orthogonal decomposition to the singular value decompositions of the flattenings. We show that having such an orthogonal decomposition is a non-trivial assumption on a tensor, that the decomposition is unique (up to natural symmetries) whenever it exists, and that in this case it can be efficiently and reliably obtained by a sequence of singular value decompositions. We demonstrate how the factoring algorithm can be applied for parameter identification in latent variable and mixture models.


1 Introduction

Decomposing a tensor into its components, and determining their number (= the rank), is a multidimensional generalization of the singular value decomposition and the matrix rank, and a recurring task in all practical sciences, appearing many times under different names: first discovered by Hitchcock [9] and then re-discovered under names such as PARAFAC [8] or CANDECOMP [4], it has been applied in many fields such as chemometrics, psychometrics, and signal processing [3, 16, 14]. An extensive survey of applications can be found in [15, 6].

Recently, motivated by real-world applications, orthogonality constraints on the decomposition have been studied in the literature, such as the orthogonal rank decomposition and the combinatorial orthogonal rank decomposition, which can be traced back to [7, 12], and the orthogonal decomposition in [13] and [10], the latter of which occurs for example in the identification of latent variable models from empirical moments, and in several other statistical estimation tasks; see [2] for a survey. The orthogonality constraints imposed in these two branches of the literature are not the same: [7, 12] impose summand-wise orthogonality, while in [13, 10, 2], factor-wise orthogonality can be deduced from the model constraints. In [13], a Jacobi-like heuristic algorithm was described to obtain an approximate orthogonal decomposition via Jacobi angle optimization for general tensors; in [2], the authors describe a second-order fixed point method for obtaining the decomposition.

In [11, 17], hierarchical tensor decomposition models are discussed in the context of latent tree graphical models, and algorithms for the identification of this decomposition are described. While this is not explicitly done in the language of orthogonal tensor decompositions, the idea of using flattenings is similar to the one presented here, and, in the specific context of tree models, it is a specific instance of orthogonal tensor decomposition, as described in [2].

In this paper, we study the orthogonal decomposition model as it occurs in [10, 2], namely with factor-wise orthogonality constraints. We show that this kind of decomposition can be directly transformed into a set of singular value decompositions, both theoretically and practically. We give identifiability results for this kind of orthogonal decomposition, showing that it is unique (up to natural symmetries) in case of existence, and we provide algorithms to obtain the orthogonal decomposition by reducing it to a sequence of singular value decompositions. We apply these algorithms to a latent variable identification problem which was discussed in [10, 2], reducing it to a series of eigenvalue problems. In particular, by performing the reduction to singular value decomposition, we show that all existing theory on the singular value decomposition, concerning theoretical as well as numerical and algorithmic issues, can be readily applied to the orthogonal decomposition problem.

2 Theoretical Background

2.1 Tensors

2.1.1 Definition of a Tensor

While tensors are common objects, their notation diverges throughout the literature. For ease of reading, we provide the basic definitions.

Definition 2.1

A real tensor of size and of degree is an element of the set

If we also write

2.1.2 Linear Transformation

Let us introduce a useful shorthand notation for linearly transforming tensors.

Definition 2.2

Let be a matrix. For a tensor , we denote by the application of to along all tensor dimensions, that is, the tensor defined as

Remark 2.3

For and , note that

2.1.3 Flattening

A flattening of a tensor is the tensor obtained by regarding several of its indices as a single index.

Definition 2.4

Denote by [d] the set {1, …, d}. A surjective map is called a -to- flattening map.

Definition 2.5

Let be a tensor, and let be a -to- flattening map. Then, the -flattening of is the degree tensor with defined as

Conversely, if , then we write and call the unflattening of .

Note that the indices of are, as defined, tuples of indices of ; however, this does not contradict the definition of a tensor, since the tuples can be bijectively mapped onto a range of single indices. It is convenient to choose the lexicographical ordering for this bijection, but it is mathematically not necessary to fix any such bijection.

For unflattening, if only and are given, it is not clear what should be without further specification, since the same flattening can arise from different tensors even if is fixed. Therefore, we will use it only in contexts where a given flattening is being reversed, or partially reversed, which makes the unflattening well-defined.

Example 2.6

Let be a tensor, let . The -flattening of is a -matrix . The columns of are all the sub--tensors of where the second and third indices are fixed. The columns of are indexed by the pairs , or, alternatively, by bijection, by the lexicographical index number . Taking any -submatrix of , we can unflatten to obtain a -tensor .
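Concretely, a flattening is just a re-grouping of indices. The following short numpy sketch (shapes chosen arbitrarily for illustration; the names are ours) realizes the flattening of Example 2.6 as a reshape, using the lexicographical column ordering mentioned above.

```python
import numpy as np

# A minimal sketch of Example 2.6: a 2x3x4 tensor flattened to a 2x12 matrix whose
# columns are the mode-1 fibers T[:, i2, i3], ordered lexicographically in (i2, i3).
n1, n2, n3 = 2, 3, 4
T = np.arange(n1 * n2 * n3, dtype=float).reshape(n1, n2, n3)

F = T.reshape(n1, n2 * n3)          # the {1} vs {2,3} flattening
i2, i3 = 1, 2
col = i2 * n3 + i3                  # lexicographic column index of the pair (i2, i3)
assert np.array_equal(F[:, col], T[:, i2, i3])

# Unflattening reverses the reshaping and recovers the original tensor.
assert np.array_equal(F.reshape(n1, n2, n3), T)
```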

2.1.4 Outer Product

Furthermore, we introduce notation for creating tensors of higher order out of tensors of lower order:

Definition 2.7

Let . The outer product of the is the tensor defined by

In case that , we also write

Similarly, if and are tensors, the outer product of and is the tensor defined as

Outer products of several tensors are defined by induction on , namely:
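As an illustration, the following numpy sketch (our own toy vectors, not from the paper) spells out Definition 2.7 for three vectors, and for the outer product of a tensor with a vector.

```python
import numpy as np

# A minimal sketch of Definition 2.7: the outer product of vectors u, v, w is the
# degree-3 tensor with entries (u ∘ v ∘ w)[i, j, k] = u[i] * v[j] * w[k].
u, v, w = np.array([1., 2.]), np.array([3., 4., 5.]), np.array([6., 7.])

T = np.einsum('i,j,k->ijk', u, v, w)        # outer product of three vectors
assert T.shape == (2, 3, 2)
assert np.isclose(T[1, 2, 0], u[1] * v[2] * w[0])

# The outer product of two tensors concatenates their index sets in the same way.
A = np.einsum('i,j->ij', u, v)              # a degree-2 tensor (matrix)
S = np.einsum('ij,k->ijk', A, w)            # outer product of a tensor and a vector
assert np.allclose(S, T)
```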

A useful calculation rule for linear transformation is the following:

Lemma 2.8.

Let and let . Then,

Similarly, if , then

Outer products are also compatible with flattenings:

Lemma 2.9.

Let and let be a -to--flattening, let be the restriction of to , and let be the -to--flattening defined by . Then,

2.2 Orthogonality and Duality

We briefly review the notions of scalar product and some results, which can also be found in [12] in a slightly different formulation and with slightly less generality.

Definition 2.10

A scalar product is defined on by

As usual, are called orthogonal to each other if , and is called normal if . A set is called orthonormal if , where is the Kronecker-delta.

By identification of with , where , the scalar product on tensors inherits all properties of the real scalar product.

Remark 2.11

It is seen by checking definitions that the scalar product on matrices is identical to the trace product, i.e., for .

An important property of the scalar product is compatibility with flattenings:

Lemma 2.12.

Let , let be a -to- flattening map. Then,

In particular, and are orthogonal to each other if and only if and are.

Proof.

A flattening is a bijection on the set of entries, therefore the result of the entry-wise scalar product is not changed by flattening. ∎
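A quick numerical sanity check of Lemma 2.12 (arbitrary illustrative shapes, not from the paper): flattening only relabels entries, so the entry-wise scalar product is unchanged.

```python
import numpy as np

# Lemma 2.12 numerically: the scalar product of two degree-3 tensors equals the
# scalar product of their flattenings, since a flattening only relabels entries.
rng = np.random.default_rng(0)
S = rng.standard_normal((2, 3, 4))
T = rng.standard_normal((2, 3, 4))

inner = np.sum(S * T)                                        # <S, T> on tensors
inner_flat = np.sum(S.reshape(2, 12) * T.reshape(2, 12))     # <flat(S), flat(T)> on matrices
assert np.isclose(inner, inner_flat)
```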

Proposition 2.13.

Let , for . Then,

In particular, if there exists such that are orthogonal to each other, then the outer products and are orthogonal to each other.

Proof.

By performing induction on , it suffices to prove the statement for : let and . Then,

We proceed to prove this statement. Let be the -to--flattening, let be the -to--flattening. Let , and for . By Lemma 2.12, it holds that

Let be the -to--flattening defined by . Let . By Lemma 2.12, it holds that

By Lemma 2.9, it holds that

Using that the scalar product on matrices is the trace product (see Remark 2.11), we obtain

The cyclic property of the trace product for matrices yields

All equalities put together yield the claim. ∎

Corollary 2.14.

Let , and , such that . Then,

Definition 2.15

Let , let be a partition. A decomposition

with , and , such that the set of with fixed is orthonormal, is called a rank- orthogonal atomic decomposition of , with signature . If and , then the decomposition is called an orthogonal CP-decomposition.

An orthogonal atomic decomposition need not exist. However, if it does exist, it is compatible with flattenings, as Proposition 2.17 will show. We first introduce notation that allows a more concise statement of this compatibility:

Definition 2.16

Let be a partition of . We say a -to--flattening is compatible with the partition , if it holds that for some implies . We say that is strictly compatible with the partition , if it holds that for some if and only if .

Proposition 2.17.

Let . Let

be an orthogonal atomic decomposition with signature let be compatible with the signature. Then,

is an orthogonal atomic decomposition of . In particular, if is strictly compatible with the signature, then the decomposition is also an orthogonal CP-decomposition.

Proof.

This is a direct consequence of Lemma 2.12, checking compatibility of scalar product and orthogonality with the flattening at each of the sets of indices . ∎

2.3 Identifiability of the Orthogonal Atomic Decomposition

The orthogonal decomposition, as given in Definition 2.15, need not exist for a given tensor, nor need it be unique. We will show that, due to the compatibility with flattenings, it is unique whenever it exists, provided the rank is chosen minimal.

The main ingredient, besides flattenings, is the uniqueness of the singular value decomposition [18], a classical result, which we state in a convenient form:

Theorem 1.

Let , let . Then, there is a singular value decomposition (= orthogonal CP-decomposition)

such that the are orthonormal, and the are orthonormal. In particular, there is no singular value decomposition of rank strictly smaller than . Moreover, the singular value decomposition of is unique, up to:

(a)

the sequence of summation, i.e., up to arbitrary permutation of the indices

(b)

the choice of sign of , i.e., up to changing the sign in any two of for fixed

(c)

unitary transformations of the span of or such that

Condition (c) includes (b) as a special case, and (c) can be removed as a condition if no two distinct have the same absolute value.
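For a matrix, Theorem 1 says that the SVD is exactly an orthogonal CP-decomposition into rank-one outer products with orthonormal factors. The following numpy sketch (a random illustrative matrix, names are ours) makes this explicit.

```python
import numpy as np

# Theorem 1 in code: the SVD of A is an orthogonal CP-decomposition
# A = sum_i s_i * u_i ∘ v_i with orthonormal families {u_i} and {v_i}.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3)) @ rng.standard_normal((3, 5))   # a rank-3 matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10))                                       # numerical rank
assert r == 3

A_rec = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r))
assert np.allclose(A, A_rec)

# The u_i (columns of U) and v_i (rows of Vt) are orthonormal.
assert np.allclose(U[:, :r].T @ U[:, :r], np.eye(r))
assert np.allclose(Vt[:r] @ Vt[:r].T, np.eye(r))
```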

Theorem 2.

Let , and assume that has an orthogonal atomic decomposition

of signature , such that for all . Then:

(i)

Denote for . Then, for all .

(ii)

There is no orthogonal atomic decomposition of with signature , and of rank strictly smaller than .

(iii)

The orthogonal atomic decomposition of of rank is unique, up to:

(a)

the sequence of summation, i.e., up to arbitrary permutation of the indices

(b)

the choice of sign of , i.e., up to changing the sign in any two of and the for fixed and arbitrary

(c)

transformations of factors and their respective tensor products, such that , which induce unitary transformations in all flattenings compatible with the signature .

Condition (c) includes (b) as a special case, and (c) can be removed as a condition if no two distinct have the same absolute value.

Proof.

Fix some arbitrary . Consider the -to--flattening for , note that is compatible with the signature. Let , and . Note that is a -matrix. Let

be the orthogonal atomic decomposition of , and let , and for all . Note that is an

-vector, and

is an -vector. By Proposition 2.17,

is a singular value decomposition of .

(i) In particular, the are a system of orthonormal vectors in . Therefore, . Since was arbitrary, statement (i) follows.
(ii) Since the are non-zero, it holds that . If there were an orthogonal atomic decomposition of with signature and rank strictly smaller than , there would be a singular value decomposition of of rank strictly smaller than , contradicting Proposition 2.17.
(iii) Observe that the flattening by induces a bijection between the orthogonal atomic decompositions of , of rank , and the singular value decompositions of , of rank . The statement in (iii) then follows directly from the uniqueness asserted in Proposition 2.17 for the singular value decomposition of . ∎

Again, we would like to stress that the present orthogonal decomposition model is different from the one in [12]: ours is factor-wise orthogonal between different summands, while the orthogonal rank decomposition in [12] is summand-wise orthogonal, and the combinatorial orthogonal rank decomposition enforces orthogonality of factors within the same summand. Therefore, Theorem 2 does not contradict Lemma 3.5 in [12].

Another result which seems to be folklore, but is not available in the literature, is that assuming a tensor has an orthogonal decomposition is a strong restriction. Since it is almost implied by the identifiability result, Theorem 2, we state a quantitative version:

Proposition 2.18.

The set of tensors , with , and for all , for which has an orthogonal CP-decomposition, is a Lebesgue zero set.

Proof.

The CP-decomposition can be viewed as an algebraic map

Since the left hand side is an irreducible variety, the image of the map also is. The orthogonal CP-decompositions form an algebraic subset of the left hand side. Therefore the statement follows from the fact that

is not surjective. This follows from a degrees-of-freedom, i.e., dimension, count. One has

Theorem 2 (i) implies

An explicit computation shows that , which proves the statement.

The proof above can be rephrased in terms of the CP-rank (see [5] for an introduction): the statement can also be obtained by observing that the generic CP-rank of the tensors in question must be strictly larger than , then arguing again that the algebraic set of tensors with orthogonal CP-decompositions must be a proper subset of all tensors of that format, and thus a Lebesgue zero set. ∎
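The formulas of the original count are not reproduced above; as a hedged sketch of one possible dimension count, assume for illustration the symmetric format with degree d ≥ 3 and all side lengths equal to some n ≥ 2 (the exact assumptions are those of Proposition 2.18).

```latex
% A hedged, illustrative dimension count (our own format assumptions, not the
% paper's verbatim computation). A rank-r orthogonal CP-decomposition is
% parametrized by the r coefficients and, for each of the d tensor positions,
% an orthonormal r-frame in R^n, i.e. a point of a Stiefel manifold of
% dimension rn - r(r+1)/2. Hence
\[
\dim(\text{parameters}) \;=\; r + d\left(rn - \tfrac{r(r+1)}{2}\right)
\;\le\; n + d\,\tfrac{n(n-1)}{2}
\qquad \text{(using } r \le n \text{ from Theorem 2 (i))},
\]
\[
\dim\bigl(\mathbb{R}^{n \times \dots \times n}\bigr) \;=\; n^{d},
\qquad\text{and}\qquad
n + d\,\tfrac{n(n-1)}{2} \;<\; n^{d} \quad\text{for } d \ge 3,\ n \ge 2 .
\]
% The image of the algebraic decomposition map is therefore contained in a proper
% algebraic subset of strictly smaller dimension, hence a Lebesgue zero set.
```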

Proposition 2.18 can be extended to orthogonal atomic decompositions with signature , by considering suitable unflattenings.

2.4 Tensors and Moments

We briefly show how tensors relate to moments of multivariate real random variables:

Definition 2.19

Let be a real -dimensional random variable. Then, define:

where is a formal vector of variables. The -th moment (or moment tensor) of , and the -th cumulant (or cumulant tensor) of are defined (in case of convergence) as the coefficients in the multivariate Taylor expansions

In the following, we will always assume that the moments and cumulants in question exist.

The moments and cumulants of a linearly transformed random variable are the multilinearly transformed moments and cumulants.

Proposition 2.20.

Let be a real -dimensional random variable and let Then,

Proof.

We prove the statement for cumulants; the proof for moments is completely analogous. For the cumulant generating functions of and of , it holds that

The last equality follows from the definition of . But by definition, it also holds that

therefore the statement follows from comparing coefficient tensors. ∎
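The statement can be checked numerically for empirical third moments. The sketch below (arbitrary dimensions and sample size; names are ours) verifies that transforming the samples by a matrix and then taking moments agrees exactly with applying the matrix along every mode of the moment tensor.

```python
import numpy as np

# Proposition 2.20 for (empirical) third moments: if Y = A X, then
# E[Y ⊗ Y ⊗ Y] equals E[X ⊗ X ⊗ X] with A applied along every tensor dimension.
rng = np.random.default_rng(2)
n, m, N = 3, 4, 500
X = rng.standard_normal((N, n))          # N samples of an n-dimensional variable
A = rng.standard_normal((m, n))
Y = X @ A.T                              # the corresponding samples of Y = A X

M3_X = np.einsum('ti,tj,tk->ijk', X, X, X) / N     # empirical 3rd moment of X
M3_Y = np.einsum('ti,tj,tk->ijk', Y, Y, Y) / N     # empirical 3rd moment of Y

# Apply A along all three tensor dimensions of M3_X (the multilinear transform).
M3_X_transformed = np.einsum('ai,bj,ck,ijk->abc', A, A, A, M3_X)
assert np.allclose(M3_Y, M3_X_transformed)
```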

3 Relation to Mixture Models

3.1 The Estimation Problem

Throughout the paper, we will consider the following independent rank mixture model:

Generative Model: are independent, -valued random variables, with , and probability/mass density functions . Let be arbitrary such that , and let be the corresponding mixture of the . Assume that there are with , and random variables , such that . Assume that the are linearly independent, and for .

Estimation Task: Given , or estimators thereof, determine/estimate and for .

While the above scenario seems very restrictive, several important problems can be reduced to this setting; see for example [10], or Chapter 3 of [2]. We refer the interested reader to the exposition there.

3.2 Algebraic Formulation via Moments

The estimation problem presented above can be reformulated as a purely algebraic problem, see [2]. Namely, the are explicitly calculable in terms of the expectations of the and . Then, Proposition 2.20 implies that for all , therefore for all , thus yielding the following algebraic version of the estimation problem.

Algebraic Problem: Let , let be linearly independent and arbitrary such that . Given (exact or noisy estimators for)

determine the and .

4 Algorithms

4.1 Orthogonal Decomposition of Tensors

A special case of orthogonal decomposition is the singular value decomposition (SVD). There is a huge number of well-studied methods for obtaining the singular value decomposition, which we will not discuss. Instead, we will use the SVD routine, described in Algorithm 1, as a black box.

Algorithm 1 SVD. Singular Value Decomposition of Matrices.
Input: A matrix . Output: The singular value decomposition , with orthogonal, diagonal, and the rank

First, for completeness, we treat the trivial case in Algorithm 2.

1:Return rank coefficients factors .
Algorithm 2 OTD1. Orthogonal Tensor Decomposition in one factor.
Input: A tensor , a signature . Output: The orthogonal atomic decomposition .

Now we explicitly describe how to compute the orthogonal decomposition if each summand has two tensor factors. Algorithm 3 computes the decomposition by a proper reformatting of the entries, computing the singular value decomposition, then reformatting again.

1:Define
2:Set . Note that , with
3:Compute the SVD of , see Algorithm 1.
4:Return rank .
5:Return coefficients for .
6:For all , let be the -th column of , let be the -th columns of .
7:Return factors for .
Algorithm 3 OTD2. Orthogonal Tensor Decomposition in two factors.
Input: A tensor , a signature . Output: The orthogonal atomic decomposition (assumed to exist), including the rank
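The following numpy sketch (our own naming, not the paper's verbatim pseudocode, with a hypothetical tolerance parameter) implements the flatten, SVD, unflatten idea of Algorithm 3, and recovers a planted two-summand decomposition; like the algorithm, it assumes that the orthogonal atomic decomposition exists.

```python
import numpy as np

def otd2(T, k, tol=1e-10):
    """Sketch of OTD2 for the signature ({1,..,k}, {k+1,..,d}): flatten T into a
    matrix, take its SVD, and unflatten the singular vectors into the factors."""
    left, right = T.shape[:k], T.shape[k:]
    M = T.reshape(int(np.prod(left)), int(np.prod(right)))   # the compatible flattening
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    r = int(np.sum(s > tol))                                  # numerical rank
    A = [U[:, i].reshape(left) for i in range(r)]             # unflattened left factors
    B = [Vt[i, :].reshape(right) for i in range(r)]           # unflattened right factors
    return s[:r], A, B

# Usage: plant a decomposition with orthonormal factors and recover the coefficients.
rng = np.random.default_rng(3)
Q1, _ = np.linalg.qr(rng.standard_normal((6, 2)))   # orthonormal 6-vectors, to be reshaped 2x3
Q2, _ = np.linalg.qr(rng.standard_normal((4, 2)))   # orthonormal 4-vectors
A1, A2 = Q1[:, 0].reshape(2, 3), Q1[:, 1].reshape(2, 3)
b1, b2 = Q2[:, 0], Q2[:, 1]
T = 5.0 * np.einsum('ij,k->ijk', A1, b1) + 2.0 * np.einsum('ij,k->ijk', A2, b2)

sigma, A, B = otd2(T, k=2)
assert np.allclose(sigma, [5.0, 2.0])
```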

The algorithm for the general case, Algorithm 4, likewise consists of repeated applications of reindexing and singular value decomposition. Variants of the singular value decomposition with adjustable noise tolerance or singular value thresholding exist, and can therefore be employed to obtain thresholding and numerically stable variants of Algorithm 4. Furthermore, step 1 allows for an arbitrary choice of -to--flattening in each recursion. Since, in the presence of noise, the results might differ when taking a different sequence of flattenings, the numerical stability can be improved by clustering the results of all possible choices, then averaging.

1: Choose any -to--flattening map .
2:Set for .
3:Set .
4:Use OTD2, Algorithm 3, to compute the orthogonal atomic decomposition with signature .
5:Return the as coefficients and as the rank for the decomposition of .
6: For , use the suitable one of OTD1,OTD2,OTD, i.e., Algorithm 2,3, or 4, to compute the orthogonal atomic decomposition , noting that rank is one, and using the signature
7: For , use the suitable one of OTD1,OTD2,OTD, i.e., Algorithm 2,3, or 4, to compute the orthogonal atomic decomposition , noting that rank is one, and using the signature
8:Return the as factors for .
Algorithm 4 OTD. Orthogonal Tensor Decomposition.
Input: A tensor , a signature . Output: The orthogonal atomic decomposition (assumed to exist), including the rank

Termination of Algorithm 4 is implied by the observation that in each recursion, the partition of is made strictly finer. Since has finite cardinality, there is only a finite number of recursions. The fact that the decompositions in steps 6 and 7 have rank one, and coefficients , follows from the uniqueness of the orthogonal decomposition guaranteed in Theorem 2. Correctness of Algorithm 4 follows from repeated application of Proposition 2.17, and the uniqueness of singular value decomposition.
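To make the recursion concrete, the sketch below (our own naming and a hypothetical tolerance parameter) carries out the idea of Algorithm 4 for a degree-3 tensor with signature ({1},{2},{3}): one SVD of the {1}-vs-{2,3} flattening splits off the first factors, and a rank-one SVD of each unflattened right singular vector then splits the remaining two.

```python
import numpy as np

def orthogonal_cp3(T, tol=1e-10):
    """Sketch of the recursion for a degree-3 tensor with signature ({1},{2},{3}),
    assuming an orthogonal CP-decomposition of T exists."""
    n1, n2, n3 = T.shape
    U, s, Vt = np.linalg.svd(T.reshape(n1, n2 * n3), full_matrices=False)
    r = int(np.sum(s > tol))
    sigmas, us, vs, ws = s[:r], [], [], []
    for i in range(r):
        us.append(U[:, i])
        # Each right singular vector unflattens to a rank-one n2 x n3 tensor v_i ∘ w_i.
        Uv, sv, Vvt = np.linalg.svd(Vt[i, :].reshape(n2, n3))
        vs.append(sv[0] * Uv[:, 0])      # sv[0] is 1 up to numerical error (rank one, unit norm)
        ws.append(Vvt[0, :])
    return sigmas, us, vs, ws

# Usage: recover a planted decomposition with orthonormal factors in every position.
rng = np.random.default_rng(4)
U1, _ = np.linalg.qr(rng.standard_normal((4, 2)))
U2, _ = np.linalg.qr(rng.standard_normal((5, 2)))
U3, _ = np.linalg.qr(rng.standard_normal((3, 2)))
T = 7.0 * np.einsum('i,j,k->ijk', U1[:, 0], U2[:, 0], U3[:, 0]) \
  + 3.0 * np.einsum('i,j,k->ijk', U1[:, 1], U2[:, 1], U3[:, 1])

sigmas, us, vs, ws = orthogonal_cp3(T)
assert np.allclose(sigmas, [7.0, 3.0])
```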

4.2 An Estimator for the Mixture Model

For illustrative purposes, we write out Algorithm 4 for the problem introduced in Section 3, which has also been extensively studied in [2]:

Example: Let , let be linearly independent and arbitrary such that . Given (exact or noisy estimators for)

determine the and .

Algorithm 5 solves the problem by reducing it to a sequence of singular value decompositions:

1:Set .
2:Compute the SVD (note: since is symmetric, the SVD also is).
3:Set .
4:Set .
5:Define the flattening map .
6:Set
7:Compute the rank SVD .
8: Return for .
9:Set for .
10: Compute the rank SVD .
11: Set for .
12: Compute the pseudo-inverse of . Return , for
Algorithm 5 Model identification.
Input: Output: .

Theorem 4.3 in [2] implies that the tensor obtained in step 4 has an orthogonal CP-decomposition, and it implies the correctness of steps 8 and 12. The fact that in step 8 has rank one, with coefficients , follows from the uniqueness of the decomposition guaranteed in Theorem 2.
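As a hedged end-to-end sketch (our own variable names and sign-fixing convention, not the verbatim pseudocode of Algorithm 5), the following numpy code applies the whitening strategy to exact population moments of a synthetic two-component mixture and recovers the weights and component means.

```python
import numpy as np

def identify_mixture(M2, M3, k):
    """Sketch of the whitening strategy: given M2 = sum_i w_i mu_i mu_i^T and
    M3 = sum_i w_i mu_i⊗mu_i⊗mu_i with k linearly independent mu_i and positive
    w_i, recover the w_i and mu_i."""
    # Whiten using the rank-k SVD of the symmetric matrix M2.
    E, s2, _ = np.linalg.svd(M2)
    E, s2 = E[:, :k], s2[:k]
    W = E / np.sqrt(s2)                                 # W^T M2 W = I_k
    # Whitened third moment: T = sum_i lambda_i nu_i^{⊗3}, lambda_i = w_i^{-1/2},
    # with orthonormal nu_i = sqrt(w_i) W^T mu_i.
    T = np.einsum('ia,jb,kc,ijk->abc', W, W, W, M3)
    # Orthogonal CP-decomposition of T via an SVD of its flattening.
    U, _, _ = np.linalg.svd(T.reshape(k, k * k), full_matrices=False)
    B = np.linalg.pinv(W.T)                             # undoes the whitening
    w, mus = np.zeros(k), []
    for i in range(k):
        u = U[:, i]
        lam = np.einsum('abc,a,b,c->', T, u, u, u)      # = ±lambda_i; fixes the sign of nu_i
        nu = np.sign(lam) * u
        lam = abs(lam)
        w[i] = 1.0 / lam ** 2
        mus.append(lam * (B @ nu))
    return w, mus

# Usage on exact population moments of a small synthetic two-component mixture.
rng = np.random.default_rng(5)
dim, k = 4, 2
A = rng.standard_normal((dim, k))                       # true component means as columns
w_true = np.array([0.3, 0.7])
M2 = A @ np.diag(w_true) @ A.T
M3 = sum(w_true[i] * np.einsum('a,b,c->abc', A[:, i], A[:, i], A[:, i]) for i in range(k))

w_est, mus_est = identify_mixture(M2, M3, k)
M3_rec = sum(w_est[i] * np.einsum('a,b,c->abc', mus_est[i], mus_est[i], mus_est[i]) for i in range(k))
assert np.allclose(np.sort(w_est), np.sort(w_true))
assert np.allclose(M3_rec, M3)
```

In the noisy case, the hard rank choices in this sketch would be replaced by singular value thresholding, in line with the remarks on numerical stability below.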

Note that the explicit presentation of the algorithm could be substantially abbreviated by applying OTD directly to in step 4, with signature , and then performing the analogues of steps 8 and 12. Furthermore, the accuracy of the estimator in step 11 can be improved by repeating the procedure for the three possible signatures and , then averaging, or weighted averaging, over the nine estimates for each , making use of the symmetry of the problem.

Also, similarly to Algorithm 4, the presented Algorithm 5, while already numerically stable, can be modified to cope better with noise, e.g., by introducing thresholding in the singular value decompositions and rank computations. The numerical stability with respect to noise is governed by the numerical stability of the SVDs performed and by the pseudo-inversion of in step 12.

Algorithm 5 is also related to Algorithm 1 proposed in [1]. Namely, , as defined in Section 4.1, is a degree -projection of the tensor , and therefore can also be understood as a random projection of the flattening .

Furthermore, an estimator for the hierarchical models described in [11, 17] can be constructed in a similar way.

5 Conclusion

We have demonstrated that computing the orthogonal decomposition of an arbitrary degree tensor, symmetric or not, can be reduced to a series of singular value decompositions, and we have described efficient algorithms to do so. This makes orthogonal tensor decomposition approachable by the wealth of theoretical results and existing methods for eigenvalue problems and singular value decomposition. Moreover, we have exemplified our method in the case of identifying components in a low-rank mixture model.

Acknowledgments

I thank Arthur Gretton, Zoltán Szabó, and Andreas Ziehe for interesting discussions. I thank the Mathematisches Forschungsinstitut Oberwolfach for support.

References

  • Anandkumar et al. [2012a] Anima Anandkumar, Dean P. Foster, Daniel Hsu, Sham M. Kakade, and Yi-Kai Liu. A spectral algorithm for latent Dirichlet allocation. ArXiv e-print, 2012a.
  • Anandkumar et al. [2012b] Anima Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decomposition for learning latent variable models. ArXiv e-print, 2012b.
  • Bro [1997] Rasmus Bro. PARAFAC. Tutorial and applications. Chemometrics and Intelligent Laboratory Systems, 38(2):149–171, 1997.
  • Carroll and Chang [1970] J. Douglas Carroll and Jih-Jie Chang. Analysis of individual differences in multidimensional scaling via an n-way generalization of 'Eckart–Young' decomposition. Psychometrika, 35:283–319, 1970.
  • Catalisano et al. [2002] Maria V. Catalisano, Anthony V. Geramita, and Alessandro Gimigliano. Ranks of tensors, secant varieties of Segre varieties and fat points. Linear Algebra and its Applications, pages 263–285, 2002.
  • De Lathauwer et al. [2008] Lieven De Lathauwer, Pierre Comon, and Nicola Mastronardi. Special issue on tensor decompositions and applications. SIAM Journal on Matrix Analysis and Applications, 30(3), September 2008. ISSN 0895-4798.
  • Denis and Dhorne [1989] J. B. Denis and T. Dhorne. Orthogonal tensor decomposition of 3-way tables. In R. Coppi and S. Bolasco, editors, Multiway data analysis., pages 31–37. Elsevier, Amsterdam, 1989.
  • Harshman [1970] Richard A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. UCLA Working Papers in Phonetics, 16(84):1–84, 1970.
  • Hitchcock [1927] Frank L. Hitchcock. The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6:164–189, 1927.
  • Hsu and Kakade [2012] Daniel Hsu and Sham M. Kakade. Learning mixtures of spherical Gaussians: Moment methods and spectral decompositions. ArXiv e-print, 2012.
  • Ishteva et al. [2013] Maria Ishteva, Haesun Park, and Le Song. Unfolding latent tree structures using 4th order tensors. JMLR Workshop and Conference Proceedings, ICML 2013, pages 316–324, 2013.
  • Kolda [2001] Tamara G. Kolda. Orthogonal tensor decompositions. SIAM Journal on Matrix Analysis and Applications, 23(1):243–255, 2001.
  • Martin and Van Loan [2006] Carla D. Moravitz Martin and Charles F. Van Loan. A Jacobi-type method for computing orthogonal tensor decompositions. SIAM Journal on Matrix Analysis and Applications, pages 1219–1232, 2006.
  • Nion and Sidiropoulos [2009] Dimitri Nion and Nikos D. Sidiropoulos. A PARAFAC-based technique for detection and localization of multiple targets in a mimo radar system. In Proc. ICASSP ’09, pages 2077–2080, 2009.
  • Sidiropoulos [2004] Nikos D. Sidiropoulos. Low-rank decomposition of multi-way arrays: a signal processing perspective. In Sensor Array and Multichannel Signal Processing Workshop Proceedings, pages 52 – 58, 2004.
  • Sidiropoulos et al. [2000] Nikos D. Sidiropoulos, Rasmus Bro, and Georgios B. Giannakis. Parallel factor analysis in sensor array processing. IEEE Transactions on Signal Processing, 48:2377–2388, August 2000.
  • Song et al. [2013] Le Song, Maria Ishteva, Ankur Parikh, Eric Xing, and Haesun Park. Hierarchical tensor decomposition of latent tree graphical models. JMLR Workshop and Conference Proceedings, ICML 2013, pages 334–342, 2013.
  • Young and Eckart [1936] G. Young and C. Eckart. The approximation of one matrix by another of lower rank. Psychometrika, 1:211–218, 1936.