When Does Non-Orthogonal Tensor Decomposition Have No Spurious Local Minima?

11/22/2019 ∙ by Maziar Sanjabi, et al. ∙ University of Southern California ∙ Princeton University

We study the optimization problem for decomposing d-dimensional fourth-order tensors with k non-orthogonal components. We derive deterministic conditions under which such a problem has no spurious local minima. In particular, we show that if κ = λ_max/λ_min < 5/4 and the incoherence coefficient is of order O(1/√d), then all local minima are globally optimal. Using standard techniques, these conditions can easily be transformed into conditions that hold with high probability in high dimensions when the components are generated randomly. Finally, we prove that the tensor power method with deflation and restarts can efficiently extract all the components within a tolerance level of O(κ√(kτ^3)), which appears to be the noise floor of non-orthogonal tensor decomposition.


1 Introduction

Tensor decomposition approaches have been demonstrated to be effective tools for modeling and solving a wide range of problems in signal processing, statistical inference, and machine learning. In particular, many unsupervised learning problems such as Gaussian Mixture Models (Gauss_Mix_ge_Kakade_2015), Latent Dirichlet Allocation (tensor_decomp_latent_anandkumar2014), Topic Modeling (cheng2015model; anandkumar2012two), Hidden Markov Models (RL_azizzadenesheli_2016), Latent Graphical Models (song2013hierarchical; graph_chaganty_Liang_2014), and Community Detection (al2017tensor; anandkumar2014tensor) can be modeled as a canonical decomposition (CANDECOMP) problem, which is also known as parallel factorization (PARAFAC).

It has been shown that under mild assumptions, such as Kruskal's condition k_A + k_B + k_C ≥ 2k + 2, where k is the number of components and k_A, k_B, and k_C are the k-ranks of the component matrices A, B, and C, respectively, the CANDECOMP/PARAFAC (CP) decomposition exists and is unique (harshman1970foundations; kruskal1977three). Finding such a decomposition is an NP-hard problem in general (haastad1990tensor; hillar2013most). Despite the hardness results, many of the algorithms proposed in the literature work well on practical problems (leurgans1993decomposition; tensor_decomp_latent_anandkumar2014; kolda2009tensor). In fact, for a wide range of these algorithms there are theoretical local and global guarantees under realistic assumptions (uschmajew2012local; tensor_decomp_latent_anandkumar2014). One of the important and well-studied cases is the one where the components are orthogonal. (tensor_decomp_latent_anandkumar2014) demonstrates that many practical problems can be reduced to an orthogonal tensor decomposition through a pre-processing phase known as data whitening. While data whitening helps us transform the problem to the orthogonal case, it is computationally expensive, especially in high-dimensional settings. Besides, it can degrade the performance of the model for problems such as Independent Component Analysis (le2011ica).

Practical drawbacks of data whitening, along with theoretical concerns such as instability in high-dimensional settings (anandkumar2014guaranteed), have motivated researchers to investigate the CP tensor decomposition problem in the non-orthogonal scenario. (anandkumar2014guaranteed) provide local and global guarantees for recovering the components of CP under mild non-orthogonality assumptions. However, their result relies on a proper initialization of the algorithm close to the global optimum. (ge2017optimization) analyze the non-convex landscape of the non-orthogonal tensor decomposition problem and characterize the local minima of the problem in the over-complete regime (where the rank of the tensor is much higher than the dimension of the components). (non-orthogona_ALS_sharan_valiant_17) show that the orthogonalized alternating least squares approach can globally recover the components of the tensor decomposition problem when the rank of the tensor scales suitably with the dimension d. In this paper, we aim to show the global convergence of a variation of the Tensor Power Method (TPM), augmented with deflation and restarts, for fourth-order tensors. This algorithm can recover all components of a given non-orthogonal decomposition problem when κ = λ_max/λ_min < 5/4 and the incoherence coefficient is of order O(1/√d).

Before proceeding to the main results, let us define some notation. Let u_i denote the i-th coordinate of a vector u. The Kronecker (outer) product of vectors u^(1), …, u^(n), denoted by u^(1) ⊗ ⋯ ⊗ u^(n), is defined as an n-th-order tensor T such that T_{i_1 i_2 … i_n} = u^(1)_{i_1} u^(2)_{i_2} ⋯ u^(n)_{i_n}. Moreover, a fourth-order tensor T can be seen as a multi-linear transformation, defined for given d-dimensional vectors u, v, w, and x as

T(u, v, w, x) = Σ_{i,j,k,l=1}^{d} T_{ijkl} u_i v_j w_k x_l.    (1)
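To make the multilinear view in (1) concrete, here is a minimal NumPy sketch (the function name `multilinear_form` and all variable names are illustrative, not from the paper) that evaluates T(u, v, w, x) by contracting the tensor against the four vectors.

```python
import numpy as np

def multilinear_form(T, u, v, w, x):
    """Evaluate T(u, v, w, x) = sum_{i,j,k,l} T[i, j, k, l] * u_i * v_j * w_k * x_l."""
    return np.einsum('ijkl,i,j,k,l->', T, u, v, w, x)

# Quick check of the definition on a small random tensor.
d = 3
rng = np.random.default_rng(0)
T = rng.standard_normal((d, d, d, d))
u, v, w, x = (rng.standard_normal(d) for _ in range(4))
print(multilinear_form(T, u, v, w, x))
```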

To understand the ideas behind the proposed algorithm, let us first consider the tensor decomposition of a fourth-order tensor in the orthogonal scenario. Suppose that the tensor of interest, T, is a fourth-order tensor which can be decomposed as

T = Σ_{i=1}^{k} a_i ⊗ a_i ⊗ a_i ⊗ a_i,    (2)

where the a_i's are d-dimensional orthogonal unit vectors, i.e., ⟨a_i, a_j⟩ = 0 for i ≠ j and ‖a_i‖ = 1. Here, for simplicity of presentation, we assume that the different components have the same weight. In this case, the aim of tensor decomposition is to find the orthogonal decomposition vectors efficiently given the tensor T. (tensor_decomp_latent_anandkumar2014) proves that in this case the non-convex optimization problem

min_{‖u‖ = 1} −T(u, u, u, u)    (3)

does not have any spurious local minima. Consequently, all the local minima of this problem correspond to u = ±a_i, i = 1, …, k. This property implies that most simple first-order methods, such as randomly initialized manifold gradient descent, which are proven to converge to local minima (GD_local_min_lee2016), would be able to find the components almost surely. However, in practice, algorithms such as gradient descent are known to be slow due to their conservative and static step-size choices. On the other hand, algorithms such as the tensor power method (TPM) (tensor_decomp_latent_anandkumar2014) are practically faster. Moreover, (tensor_decomp_latent_anandkumar2014) shows that TPM with multiple random restarts and deflation is capable of finding all components with high probability. Unfortunately, the results in (tensor_decomp_latent_anandkumar2014) assume the orthogonality of the components.
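As an illustration of (2) and (3), the sketch below (NumPy; the dimensions d = 8, k = 4 and all variable names are illustrative choices) constructs a fourth-order tensor with equally weighted orthonormal components and verifies that each true component attains the optimal objective value −1 of problem (3).

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 4

# Orthonormal components a_1, ..., a_k as the rows of A (via QR of a random matrix).
A = np.linalg.qr(rng.standard_normal((d, k)))[0].T

# Fourth-order tensor T = sum_i a_i (x) a_i (x) a_i (x) a_i, as in (2).
T = np.einsum('ri,rj,rk,rl->ijkl', A, A, A, A)

def objective(u):
    # Objective of problem (3): -T(u, u, u, u) over unit vectors u.
    return -np.einsum('ijkl,i,j,k,l->', T, u, u, u, u)

# Each true component attains the optimal value -1 in the equally weighted orthonormal case.
print([round(float(objective(a)), 6) for a in A])
```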

In this work, we extend the results of (tensor_decomp_latent_anandkumar2014) to the non-orthogonal case. To establish our result, we first analyze the optimization landscape of problem (3) in Section 3 under an incoherence condition, a restricted isometry property (RIP), and a certain upper bound on the ratio of the weights of the different components. We show that any local minimizer of problem (3) is close to one of the actual components. In other words, for any local minimizer u of (3), there exists an index i such that ‖u − a_i‖ is small (up to a sign ambiguity in u). In Section 5, we show that the Tensor Power Method (TPM) with deflation and restarting can recover all components of a given tensor with high probability.

2 Optimality Conditions for the Tensor Decomposition Problem

In order to solve problem (3), we can use the manifold gradient descent method. It is well known that manifold gradient descent with random initialization converges to local minima almost surely (GD_local_min_lee2016). Thus, if we prove that every local minimum of the above problem is close to one of the components a_i, we can use manifold gradient descent to recover the components. This gradient descent method for solving tensor decomposition has been used before in the case of orthogonal tensors; see, e.g., (tensor_decomp_latent_anandkumar2014). In fact, in the orthogonal case, the gradient descent method coincides with the Tensor Power Method (tensor_decomp_latent_anandkumar2014).
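The following sketch shows one way such a manifold gradient descent step can be implemented on the unit sphere, assuming the formulation min_{‖u‖=1} −T(u, u, u, u) as in (3); the step size, iteration count, and function name are illustrative choices, not the paper's.

```python
import numpy as np

def manifold_gradient_descent(T, u0, step=0.1, iters=500):
    """Riemannian gradient descent on the unit sphere for min -T(u, u, u, u).

    For a symmetric fourth-order tensor T, the Euclidean gradient of
    -T(u, u, u, u) is -4 * T(I, u, u, u); each step projects it onto the
    tangent space at u and retracts back to the sphere by normalization.
    """
    u = u0 / np.linalg.norm(u0)
    for _ in range(iters):
        g = -4.0 * np.einsum('ijkl,j,k,l->i', T, u, u, u)  # Euclidean gradient
        g_tan = g - (g @ u) * u                            # project onto tangent space at u
        u = u - step * g_tan                               # gradient step
        u /= np.linalg.norm(u)                             # retract to the sphere
    return u
```

With a random u0 and the orthonormal tensor from the earlier sketch, the iterate typically converges to one of the ±a_i.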

To study the landscape of (3), let us first present the first- and second-order optimality conditions. Write f(u) = −T(u, u, u, u) for the objective of (3), and let the projection matrix onto the tangent space of the manifold {u : ‖u‖ = 1} at a point u be P_u = I − uu^T. Then, for the optimization problem (3), any local minimizer u has to satisfy the following two optimality conditions:

  • First-order optimality condition¹:

    ∇_M f(u) = −4 P_u T(I, u, u, u) = 0.    (4)

  • Second-order optimality condition: for every direction w with ⟨w, u⟩ = 0,

    w^T ∇²_M f(u) w = −12 T(w, w, u, u) + 4 T(u, u, u, u) ‖w‖² ≥ 0.    (5)

¹The subscript M refers to the fact that the corresponding derivative is calculated while projecting the directions onto the manifold {u : ‖u‖ = 1}.

The first-order optimality condition implies that for any locally optimal point u, we have

T(I, u, u, u) = T(u, u, u, u) u.    (6)

Consequently, u has to be in the span of the a_i's if T(u, u, u, u) ≠ 0. Based on these optimality conditions, we study the landscape of the tensor decomposition problem (3) in two steps. In the first step, we show that if a local minimum exists close to one of the components, it must in fact be very close to the true component. In other words, within a region around any true component, the local minima are all very close to that component; thus, the landscape is locally well behaved around the true components. In the second step, we make our result global by showing that any local minimizer of the tensor decomposition problem (3) is relatively close to one of the true components (up to sign ambiguity).
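As a sanity check of the fixed-point relation (6), the snippet below (NumPy; all names and dimensions are illustrative) verifies numerically that a true component of an orthonormal fourth-order tensor satisfies T(I, u, u, u) = T(u, u, u, u) u.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 8, 4
A = np.linalg.qr(rng.standard_normal((d, k)))[0].T      # orthonormal components a_i
T = np.einsum('ri,rj,rk,rl->ijkl', A, A, A, A)           # T = sum_i a_i (x) a_i (x) a_i (x) a_i

u = A[0]                                                 # a true component
Tu3 = np.einsum('ijkl,j,k,l->i', T, u, u, u)             # T(I, u, u, u)
lam = np.einsum('ijkl,i,j,k,l->', T, u, u, u, u)         # T(u, u, u, u)

# First-order condition (6): T(I, u, u, u) = T(u, u, u, u) * u at a stationary point.
print(np.allclose(Tu3, lam * u))                         # expected: True
```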

To proceed, let us make the following standard assumptions:

Assumption 2.1.

The a_i's are all unit norm and satisfy the following incoherence condition with constant τ, i.e.,

|⟨a_i, a_j⟩| ≤ τ for all i ≠ j.    (7)

Moreover, we assume that for any vector w = Σ_{i=1}^{k} c_i a_i in the span of {a_1, …, a_k}, we have

(1 − δ) ‖c‖² ≤ ‖w‖² ≤ (1 + δ) ‖c‖²,    (8)

where c = (c_1, …, c_k) and δ ∈ [0, 1). This condition is known in the literature as the Restricted Isometry Property (RIP) and is usually satisfied with high probability when d is large for many forms of random matrices.

Furthermore, for matrices satisfying the incoherence condition (7), it is easy to prove that δ ≤ (k − 1)τ.
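One way to check Assumption 2.1 numerically for a given set of components is sketched below (NumPy; the use of the Gram-matrix eigenvalues to measure the distortion in (8), as well as all variable names, are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 200, 10
A = rng.standard_normal((k, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # unit-norm components a_1, ..., a_k

G = A @ A.T                                     # Gram matrix of the components
tau = np.max(np.abs(G - np.eye(k)))             # incoherence: max_{i != j} |<a_i, a_j>|

# For w = sum_i c_i a_i we have ||w||^2 = c^T G c, so the deviation of G's
# eigenvalues from 1 controls the distortion delta in the RIP condition (8).
eigs = np.linalg.eigvalsh(G)
delta = max(1.0 - eigs.min(), eigs.max() - 1.0)

print(f"tau ~ {tau:.3f}  (1/sqrt(d) ~ {1/np.sqrt(d):.3f}),  delta ~ {delta:.3f}")
```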

3 Geometric Analysis

Throughout this section, for any given point u, we define c_i(u) = ⟨a_i, u⟩. For simplicity of notation, and since it is clear from the context, we write c_i instead of c_i(u). We also assume, without loss of generality, that |c_1| ≥ |c_2| ≥ ⋯ ≥ |c_k|. Using this definition, the following lemma shows that there is always a gap between c_1 and the remaining coefficients c_i, i ≥ 2.

Lemma 3.1.

Let be a local minimizer of (3), then .

Proof.

We prove the claim by contradiction. Suppose, to the contrary, that . Take a unit vector in the span of and which is orthogonal to . First of all, we have that

Based on the RIP assumption and since , we have . Furthermore, based on the contrary assumption we made. Thus, when we have

(9)

On the other hand, since is a local minimizer of (3), the second-order optimality condition implies that , which contradicts (9). ∎

The following lemma shows that any local minimizer of problem (3) is very close to the component with the highest value of |c_i|, namely a_1. In other words, the projection of the minimizer onto the space spanned by the remaining components is very small compared to its projection onto a_1.

Lemma 3.2.

If is a local minimizer of (3) and then

(10)
Proof.

Assume that . Based on (6), we have . Thus:

(11)

According to the previous lemma, for any . Therefore,

(12)

Moreover,

Note that . To find an upper bound for , we use the following lemma.

Lemma 3.3.

For any two real numbers and , we have .

(13)

Since for , we have:

(14)

Combining (12) and (14), we obtain:

(15)

Now note that

Thus,

(16)

which completes the proof.

Theorem 3.4.

If , then for any local minimizer of (3) there exists an index such that .

Proof.

Based on the above lemma and without loss of generality, assume that . Thus, . Therefore,

Now we have . ∎

Corollary 3.5.

Note that in the case where the components are randomly generated with dimension d, our result shows that when , there are no spurious local minima.

4 Extension to the Non-equally Weighted Scenario

In this section, we extend the result of the previous section to the case where the tensor decomposition components do not have equal weights. In this scenario, the tensor of interest is of the form

T = Σ_{i=1}^{k} λ_i a_i ⊗ a_i ⊗ a_i ⊗ a_i, with λ_i > 0.

Thus, the optimization problem (3) turns into

min_{‖u‖ = 1} −T(u, u, u, u) = min_{‖u‖ = 1} −Σ_{i=1}^{k} λ_i ⟨a_i, u⟩^4.    (17)

Let λ_max and λ_min be the maximum and minimum values among the λ_i's, respectively. The following theorem demonstrates that, under an additional assumption on the ratio of λ_max to λ_min, any local minimizer of the optimization problem (17) is very close to one of the actual components a_i.

Theorem 4.1.

If κ = λ_max/λ_min < 5/4 and the incoherence coefficient τ is of order O(1/√d), then for any local minimizer u of (17) there is an index i such that u is close to ±a_i.

We defer the optimality conditions of problem (17), the extensions of Lemma 3.1 and Lemma 3.2, and the proof of the above theorem to Appendix D.
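For the non-equally weighted objective in (17), one can evaluate −Σ_i λ_i ⟨a_i, u⟩^4 directly from the components without forming the d^4 tensor, as in the sketch below (NumPy; the function `weighted_objective`, the example dimensions, and the example weights are illustrative).

```python
import numpy as np

def weighted_objective(A, lam, u):
    """Objective of problem (17): -sum_i lambda_i * <a_i, u>^4.

    A is the k x d matrix whose rows are the components a_i, and lam is the
    vector of weights lambda_i; working with inner products avoids the d^4 tensor.
    """
    return -np.sum(lam * (A @ u) ** 4)

rng = np.random.default_rng(3)
d, k = 50, 5
A = rng.standard_normal((k, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # unit-norm components
lam = np.linspace(1.0, 1.2, num=k)              # kappa = 1.2 < 5/4, as in Theorem 4.1
u = A[0]
print(lam.max() / lam.min(), weighted_objective(A, lam, u))
```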

5 Recovery via Tensor Power Method and Deflation

The landscape analysis in Section 3 demonstrates that, to recover at least one of the true components of a given tensor T with a small error, it suffices to find any local optimum of problem (17). To recover the other components of T, we can deflate the obtained component from T and find a local optimizer of problem (17) for the deflated tensor. However, the introduced deflation error could potentially make finding the remaining components difficult. In Section 5.1 we show that TPM can tolerate the residual errors from deflation if it is well initialized. In other words, TPM can recover all the a_i's within some noise floor. Similar to the geometric result, our convergence guarantee for TPM with "good" initialization is deterministic.

Finally, in Section 5.2, we provide a probabilistic argument to show that, when the number of random restarts is large enough, restarting TPM randomly yields a "good" initialization with high probability. This implies that TPM with restarts and deflation can efficiently find all the a_i's within some noise floor.

5.1 Convergence of Well-initialized TPM with Deflation

The Tensor Power Method is one of the most widely used algorithms for solving tensor decomposition problems. For a given tensor T and vectors u, v, and w, consider the following mapping:

T(I, u, v, w) = Σ_{i=1}^{d} T(e_i, u, v, w) e_i,    (18)

where e_i is the i-th standard unit vector.

Assume that a tensor T and an initialization u_0 are given. At each iteration, TPM applies the vector-valued mapping (18) to the current iterate u_t, which is initialized to u_0. It then normalizes the resulting vector and updates the eigenvalue estimate by applying the mapping (1) to the normalized iterate. The details of the Tensor Power Method are summarized in Algorithm 1.

  for t = 1, …, N do
     ũ_t ← T(I, u_{t−1}, u_{t−1}, u_{t−1})
     u_t ← ũ_t / ‖ũ_t‖
  end for
  return  u_N, λ_N = T(u_N, u_N, u_N, u_N)
Algorithm 1 Tensor Power Method (TPM)
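A minimal NumPy sketch of Algorithm 1 is given below; the function name, argument names, and default iteration count are illustrative, and it assumes the symmetric fourth-order tensor is given as a d×d×d×d array.

```python
import numpy as np

def tensor_power_method(T, u0, n_iters=100):
    """Tensor power method (Algorithm 1), a minimal sketch.

    Repeatedly applies the mapping (18), u <- T(I, u, u, u), followed by
    normalization, and finally returns the iterate together with the
    eigenvalue estimate lambda = T(u, u, u, u) from the mapping (1).
    """
    u = u0 / np.linalg.norm(u0)
    for _ in range(n_iters):
        u = np.einsum('ijkl,j,k,l->i', T, u, u, u)        # mapping (18)
        u /= np.linalg.norm(u)                            # normalization
    lam = np.einsum('ijkl,i,j,k,l->', T, u, u, u, u)      # mapping (1)
    return u, lam
```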

The following theorem shows that, when the deflation error is small enough, applying Algorithm 1 to a deflated version of the tensor T can recover all of the components.

Theorem 5.1.

Fix . Assume that we have access to and , such that there are appropriate absolute constants and where:

  • and .

Moreover, assume that we have an initial unit vector , for which and . Define the deflated tensor . Suppose that we apply TPM for an appropriate number of iterations to obtain . Then, we have

(19)
Corollary 5.2.

A similar result can be obtained when the original tensor is of the form T = Σ_{i=1}^{k} λ_i a_i ⊗ a_i ⊗ a_i ⊗ a_i, with non-equal weights λ_i.

5.2 Obtaining good initialization by random restarts

In Section 5.1 we proved that TPM is effective at finding the components of T when applied sequentially to deflated tensors with good enough initialization. In this section we prove that such a good initialization can be obtained via multiple random restarts in each iteration. Algorithm 2 describes the Tensor Power Method augmented with multiple random restarts at each iteration (TPMR).

  for i = 1, …, k do
     if i = 1 then
        T_1 ← T
     else
        T_i ← T_{i−1} − λ̂_{i−1} â_{i−1} ⊗ â_{i−1} ⊗ â_{i−1} ⊗ â_{i−1}
     end if
     Generate u_0^(1), …, u_0^(L) uniformly on the unit sphere
     Find (û^(j), λ̂^(j)) = TPM(T_i, u_0^(j)), j = 1, …, L
     Find j* = argmax_j λ̂^(j)
     Set â_i ← û^(j*) & λ̂_i ← λ̂^(j*)
  end for
  return  {(â_i, λ̂_i)}_{i=1}^{k}
Algorithm 2 Tensor Power Method with Restarts (TPMR)
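The sketch below mirrors Algorithm 2 (NumPy); it reuses the `tensor_power_method` sketch given after Algorithm 1, and the restart and iteration counts are illustrative defaults, not the paper's.

```python
import numpy as np

def tpm_with_restarts(T, k, n_restarts=10, n_iters=100, rng=None):
    """TPMR (Algorithm 2), a minimal sketch with deflation and random restarts.

    Relies on the tensor_power_method sketch given after Algorithm 1. For each
    of the k components: deflate the previously recovered rank-one terms, run
    TPM from several random unit vectors, and keep the run with the largest
    eigenvalue estimate.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = T.shape[0]
    T_work = T.copy()
    components, weights = [], []
    for _ in range(k):
        best_u, best_lam = None, -np.inf
        for _ in range(n_restarts):
            u0 = rng.standard_normal(d)          # random direction; uniform on the sphere once normalized
            u, lam = tensor_power_method(T_work, u0, n_iters)
            if lam > best_lam:
                best_u, best_lam = u, lam
        components.append(best_u)
        weights.append(best_lam)
        # Deflation: subtract the recovered rank-one term lambda * a (x) a (x) a (x) a.
        T_work = T_work - best_lam * np.einsum('i,j,k,l->ijkl', best_u, best_u, best_u, best_u)
    return np.array(components), np.array(weights)
```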

TPMR calls TPM with multiple restarts in its inner loop. Thus, to prove the effectiveness of the TPMR algorithm, it suffices to show that random restarts provide good initialization points for TPM. The following theorem states such a result.

Theorem 5.3.

For any small threshold , if we sample uniform vectors from such that satisfies

(20)
(21)

where

(22)

then, with probability , at least one of the samples will satisfy:

(23)
(24)
Corollary 5.4.

To make sure that the good initialization condition is satisfied throughout the for loop in TPMR Algorithm 2 with probability , we need to plug in in Theorem 5.3.

Corollary 5.5.

It is easy to verify that one can find that satisfies (20) and (21).

Remark 5.6.

Note that in the case where ’s are randomly generated dimensional vectors, then . Thus, . Thus, .

References

Appendix A Helper Lemmas for Proving TPM Convergence

Lemma A.1.

If and , then for any such that ,

(25)

where and

(26)

where .

Proof.

Let us use the following definitions: , and . Then, we have

(27)

Also, note that . In addition, and . Note that we can easily prove that . Now for we have

(28)

where the last step follows from .

For bounding we use the fact that for .

(29)

Lemma A.2.

If and for any , then for any such that , we have:

(30)

Thus, we can conclude that

(31)

where if we further have small enough, i.e. and , we will have

(32)
Proof.

The proof of the first part is very simple and only uses the results of Lemma A.1 and the fact that the a_i's satisfy the RIP condition.

The rest of the proof is also simple arithmetic. ∎

Lemma A.3.

If is a unit vector, , , and moreover , and , then if

(33)
Proof.

We assume that . Let us look at each term in the right hand side of (31). For the first term

(34)

For the second term

(35)

For the third term we have

(36)

For the fourth term

(37)

And for the last term we have

(38)

Finally, the bound is obtained by adding all of these bounds. ∎

Now let us prove a recursive bound that we will use to establish our final result.

Lemma A.4.

Assume that for a unit-norm vector , , and moreover . Also assume that the conditions of Lemma A.3 are satisfied and . Then for we have

(39)

Moreover, , .

Proof.

Let us first lower bound using Lemma A.3 and the fact that we have:

(40)

Let us define . Then, .

(41)

Note that . And . Also note that