1 Introduction
Nonconvex optimization is a critical component in modern machine learning. Unfortunately, theoretical guarantees for nonconvex optimization have been mostly negative, and the problems are computationally hard in the worst case. Nevertheless, simple localsearch algorithms such as stochastic gradient descent have enjoyed great empirical success in areas such as deep learning. As such, recent research efforts have attempted to bridge this gap between theory and practice.
For example, one property that can guarantee the success of local search methods over nonconvex functions is when all local minima are also the global minima. Interestingly, it has been recently proven that many well known nonconvex problems do have this property, under mild conditions. Consequently, localsearch methods, which are designed to find a local optimum, automatically achieve global optimality. Examples of such problems include matrix completion [1], orthogonal tensor decomposition [2, 3], phase retrieval [4], complete dictionary learning [5], and so on. However, such a class of nonconvex problems is limited, and there are many practical problems with poor local optima, where local search methods can fail.
The above property, while very helpful, imposes a strong assumption on the nonconvex problem. A less restrictive requirement for the success of local search methods is the ability to initialize local search in the basin of attraction of the global optimum using another polynomialtime algorithm. This approach does not require all the local optima to be of good quality, and thus can cover a broader set of problems. Efficient initialization strategies have recently been developed for many nonconvex problems such as overcomplete dictionary learning [6, 7], tensor decomposition [8], robust PCA [9]
, mixed linear regression
[10] and so on.Although the list of such tractable nonconvex problems is growing, currently, the initialization algorithms are problemspecific and as such, cannot be directly extended to new problems. An interesting question is whether there exist common principles that can be used in designing efficient initialization schemes for local search methods. In this paper, we demonstrate how a class of homotopy continuation methods may provide such a framework for efficient initialization of local search schemes.
1.1 Homotopy Method
The homotopy method is a general and a problem independent technique for tackling nonconvex problems. It starts from an objective function that is efficient to optimize (e.g. convex function), and progressively transforms it to the required objective [11]. Throughout this progression, the solution of each intermediate objective is used to initialize a local search on the next one. A particular approach for constructing this progression is to smooth the objective function. Precisely, the objective function is convolved with the Gaussian kernel and the amount of smoothing is varied to obtain the set of transformations. Intuitively, smoothing “erases wiggles” on the objective surface (which can lead to poor local optima), thereby resulting in a function that is easier to optimize. Global optimality guarantees for the homotopy method have been recently established [12, 13]. However, the assumptions in these results are either too restrictive [12] or extremely difficult to check [13]. In addition, homotopy algorithms are generally slow since local search is repeated within each instantiation of the smoothed objective.
In this paper, we address all the above issues for the nonconvex tensor PCA problem. We analyze the homotopy method and guarantee convergence to global optimum under a set of transparent conditions. Additionally, we demonstrate how the homotopy method can be drastically simplified without sacrificing the theoretical guarantees. Specifically, by taking advantage of the phase transitions in the homotopy path, we can avoid the intermediate transformations of the objective function. In fact, we can start from the extreme case of “easy” (convex) function of the homotopy, and use its solution to initialize local search on the original objective. Thus, we show that the homotopy method can serve as a problem independent principle for obtaining a smart initialization which is then employed in local search methods. Although we limit ourselves to the problem of tensor PCA in this paper, we expect the developed techniques to be applicable for a broader set of nonconvex problems.
1.2 Tensor PCA
Tensor PCA problem is an extension of the matrix PCA. The statistical model for tensor PCA was first introduced by [14]. This is a single spike model where the input tensor is a combination of an unknown rank tensor and a Gaussian noise tensor with for
(1) 
where is the signal that we would like to recover.
Tensor PCA belongs to the class of “needle in a haystack” or high dimensional denoising problems, where the goal is to separate the unknown signal from a large amount of random noise. Recovery in the high noise regime has intimate connections to computational hardness, and has been extensively studied in a number of settings. For instance, in the spiked random matrix model, the input is an additive combination of an unknown rank
matrix and a random noise matrix. The requirement on the signaltonoise ratio for simple algorithms, such as principal component analysis (PCA), to recover the unknown signal has been studied under various noise models
[15, 16]and sparsity assumptions on the signal vector
[17].Previous algorithms for tensor PCA belong to two classes: local search methods such as tensor power iterations [14], and global methods such as sum of squares [18]. Currently, the best signaltonoise guarantee is achieved by the sumofsquares algorithm and the flattening algorithm, which are more expensive compared to power iterations (see Table 1). In this paper, we analyze the Gaussian homotopy method for tensor PCA, and prove that it matches the best known signaltonoise performance. [18] also showed a lowerbound that no degree (or lower) sumofsquares algorithm can achieve better signaltonoise ratio, implying that our analysis is likely to be tight.
Method  Bound on  Time  Space 

Power method + initialization + noise injection (ours)  
Power method, random initialization  
SumofSquares  
Recover and Certify  
Eigendecomposition of flattened matrix  
Informationtheoretic  Exp 
1.3 Contributions
We analyze a simple variant of the popular tensor power method, which is a local search method for finding the best rank approximation of the input tensor. We modify it by introducing a specific initialization and injecting appropriate random noise in each iteration. This runs almost in linear time; see Table 1 for more details.
Theorem 1.1 (informal).
There is an almost linear time algorithm for tensor PCA that finds the signal as long as the signal strength .
Our algorithm achieves the best possible tradeoffs among all known algorithms (see Table 1).
Our algorithm is inspired by the homotopy framework. In particular, we establish a phase transition along the homotopy path.
Theorem 1.2 (informal).
Under a plausible independence conjecture, there is a threshold such that if the radius of smoothing is significantly larger than , the smoothed function will have a unique local and global maximum. If the radius of smoothing is smaller, then the smoothed function can have multiple local maxima, but one of them is close to the signal vector .
The above result allows us to skip the intermediate steps in the homotopy path. We only need two end points of the homotopy path: the original objective function with no smoothing and with an infinite amount of smoothing. The optimal solution for the latter can be obtained through any local search method; in fact, in our case, it has a closed form. This serves as initialization for the original objective function. In the proof we also design a new noise injection procedure that breaks the dependency between the steps. This allows for simpler analysis and our algorithm does not rely on the independence conjecture. We discuss this in more detail in Section 3.1.
The comparison of all the current algorithms for tensor PCA is given in Table 1. Note that the space in the table does not include the space for storing the tensor, this is because the more practical algorithms only access the tensor for a very small number of passes, which allows the algorithms to be implemented online and do not need to keep the whole tensor in the memory. We see that our algorithm has the best performance across all the measures. In our synthetic experiments (see Section 5
, we find that our method significantly outperforms the other methods: it converges to a better solution faster and with a lower variance.
2 Preliminaries
In this section, we formally define the tensor PCA problem and its associated objective function. Then we show how to compute the smoothed versions of these objective functions.
2.1 Tensors and Polynomials
Tensors are higher dimensional generalization of matrices. In this paper we focus on 3rd order tensors, which correspond to a 3 dimensional arrays. Given a vector , similar to rank one matrices , we consider rank 1 tensors to be a array whose th entry is equal to .
For a matrix , we often consider the quadratic form it defines: . Similarly, for a tensor , we define a degree 3 polynomial . This polynomial is just a special trilinear form defined by the tensor. Given three vectors , the trilinear form . Using this trilinear form, we can also consider the tensor as an operator that maps vectors to matrices, or two vectors into a single vector. In particular, is a matrix whose th entry is equal to where is the th basis vector. Similarly, is a vector whose th coordinate is equal to .
Since the tensor we consider is not symmetric ( is not necessarily equal to or other permutations), we also define the symmetric operator
2.2 Objective Functions for Tensor PCA
We first define the tensor PCA problem formally.
Definition 1 (Tensor PCA).
Given input tensor , where is an arbitrary unit vector, is the signaltonoise ratio, and is a random noise tensor with iid standard Gaussian entries, recover the signal approximately (find a vector such that ).
Similar to the Matrix PCA where we maximize the quadratic form, for tensor PCA we can focus on optimizing the degree 3 polynomial over the unit sphere.
(2)  
The optimal value of this program is known as the spectral norm of the tensor. It is often solved in practice by tensor power method. [14] noticed that:
Theorem 2.1.
When for large constant , the global optimum of (2) is close to the signal .
Unfortunately, solving this optimization problem is NPhard in the worstcase [19]. Currently, the best known algorithm uses sumofsquares hierarchy and works when . There is a huge gap between what’s achievable information theoretically () and what can be achieved algorithmically ().
2.3 Gaussian Smoothing for the Objective Function
Guaranteed homotopy methods rely on smoothing the objective function by the Gaussian kernel [11, 12]. More precisely, smoothing the objective (2) requires convolving it with the Gaussian kernel. Let be a mapping such that
Here, is the Gaussian density function for , satisfying
It is known that convolution of polynomials with the Gaussian kernel has a closed form expression [20]. In particular, the objective function of the Tensor PCA has the following smoothed form.
Lemma 1 (Smoothed Tensor PCA Objective).
The smoothed objective has the form
where the vector is defined by . Moreover, it is easy to compute vector given just the tensor , as .
The proof of this Lemma is based on interpreting the convolution as an expectation . We defer the detailed calculation to Appendix A.1
3 Tensor PCA by Homotopy Initialization
In this section we give a simple smart initialization algorithm for tensor PCA. Our algorithm only uses two points in homotopy path – the infinite smoothing and the no smoothing . This is inspired by our full analysis of the homotopy path (see Section 4), where we show there is a phase transition in the homotopy path. When the smoothing parameter is larger than a threshold, the function behaves like the infinite smoothing case; when the smoothing parameter is smaller than the threshold, the function behaves like the no smoothing case.
Recall that the smoothed function is:
(3) 
with as a vector such that . When , the solution of the smoothed problem has the special form . That is because the term dominates and thus its maximizer under yields .
Note that by Lemma 1, we can compute vector , and we know . Therefore we know can be computed just from the tensor. We use this point as an initialization, and then run power method on the original function. The resulting algorithm is described in Algorithm LABEL:algo:simple.
In order to analyze the algorithm, we use the following independence condition, which states that the “random”looking vectors and indeed have some properties satisfied by random vectors:
Condition 3.1.
[Independence Condition] The norm and correlation with for the vectors and are not far from expectation. More precisely: (1) and ; (2) for the sequence computed by Algorithm LABEL:algo:simple, , , and .
Note that if in every step of the algorithm, the noise tensor is resampled to be a fresh random tensor, independent of the previous step , then is just a random Gaussian vector. In this case the condition is trivially satisfied. Of course, in reality ’s are dependent on . However, we are able to modify the algorithm by a noise injection procedure, that adds more noise to the tensor , and make the noise tensor “look” as if they were independent. The extra dependency on in Condition 3.1 comes from noise injection procedure and will be more clear in Section 3.1. We will first show the correctness of the algorithm assuming independence condition here, and in Section 3.1 we discuss the noise injection procedure and prove the independence condition.
Theorem 3.1.
When for a large enough constant , under the Independence Condition (Condition 3.1), Algorithm LABEL:algo:simple finds a vector such that in iterations.
algocf[h]
The main idea is to show the correlation of and increases in every step. In order to do this, first notice that the initial point itself is equal to a normalization of , where the norm of and its correlation with are all bounded by the Independence Condition. It is easy to check that , which is already nontrivial because a random vector would only have correlation around . For the later iterations, let be the vector before normalization and we have . Notice that the first term is in the direction , and the Independence Condition bounds the norm and correlation with for the second term. We can show that the correlation with increases in every iteration, because the initial point already has a large inner product with . The detailed proof is deferred to Appendix A.2.
3.1 Noise Injection Procedure
algocf[h]
In order to prove the Independence Condition, we slightly modify the algorithm (see Algorithm LABEL:algo:resample). In particular, we add more noise in every step as follows

Get the input tensor ;

Draw a sequence of such that ;

Let with , run Algorithm LABEL:algo:resample by using in the th iteration;
Intuitively, by adding more noise the new noise will overwhelm the original noise , and every time it looks like a fresh random noise. We prove this formally by the following lemma:
Lemma 2.
Let the sequence be generated according to Algorithm LABEL:algo:resample. Let , where ’s are tensors with independent Gaussian entries. Each entry in is distributed as . The two sets of variables and has the same distribution.
This Lemma states that after our noise injection procedure, the tensors look exactly the same as tensors where the noise
is sampled independently. The basic idea for this lemma is that for two multivariate Gaussians to have the same distribution, we only need to show that they have the same first and second moments. We defer the details to Appendix
A.2.Using Lemma 2 we can create a sequence of such that its noise tensor is redrawn independently and each element is according to . Now, because each behave as if it is drawn independently, we can prove the Independence Condition:
Lemma 3 (Noise Injection).
Let be generated according to Algorithm LABEL:algo:resample and . Let be a vector such that , and
. With high probability
^{1}^{1}1Throughout this paper by “with high probability” we mean the probability is at least for a large constant ., (1) and ; (2) for the sequence computed by Algorithm LABEL:algo:resample, , , and . As a result Condition 3.1 is satisfied.This Lemma is now true because by Lemma 2, we know the noise tensors is independent of . As a result is independent of ! This lemma then follows immediately from standard concentration inequalities. We defer the full proof to Appendix A.2.
The noise injection technique is mostly a technicality that we need in order to make sure different steps are independent. This is standard in analyzing nonconvex optimization algorithms. As an example, previous works on alternating minimization for matrix completion [21] relied on the availability of different subsamples in different iterations to obtain the theoretical guarantees. Our noise injection procedure is very similar, however this is the first application of this idea for the case of Gaussian noise. The main usage of the noise injection is to get rid of the dependence of the noise matrix between different iterations. Moreover, this technique is designed to simplify the proof and rarely used in the real applications. In practice, an algorithm without noise injection, like Algorithm LABEL:algo:simple, usually performs well enough.
Combining Lemma 3 and Theorem 3.1, we know Algorithm LABEL:algo:resample solves the tensor PCA problem when .
Remark 1 (Estimation of the variance in practice).
In the above analysis, we assume the variance of entries of is
. In practice, we can estimate the variance
of entries of from by computing its Frobenius norm. Note that when is large, the simple power method already performs well. The interesting case is when is small, say . In this case, the square of the Frobenius norm of while the square of the Frobenius norm of the noise matrix in expectation is with variance . Therefore, we can get a good estimation of by computing the square of the Frobenius norm of divided by .4 Characterizing the Homotopy Path
This section analyzes the behavior of the smoothed objective function as varies. Under a plausible conjecture, we prove that a phase transition occurs: when is large behaves very similarly to and when is small behaves very similarly to . This motivates the algorithms in the previous section, as the phase transition suggests the most important regimes are very large and .
In this section we first describe how the homotopy method works in more details. Then we present an alternative objective function of Tensor PCA and derive its smoothed version. Finally, we prove that when , the smoothed function retains its maximizer around . However, when , the configuration of critical points change, with only one of the critical points being close to the solution . Importantly, we can find our way from the vicinity of toward this critical point via the dominant curvature direction of the function.
4.1 Homotopy
In the homotopy method, we start from the maximizer of the function with large amount of smoothing . We earlier denoted this maximizer as . Then we continuously decrease the amount of smoothing , while following the maximizer throughout this process, until reaching . We call the path taken by the maximizer the homotopy path. It is formally defined as follows.
Definition 2 (Homotopy Path).
A homotopy path is a continuous function satisfying and , , where the gradient is w.r.t. to the first argument of .
In practice, to search a homotopy path, one computes the initial point by analytical derivation or numerical approximation as and then successively minimizes the smoothed functions over a finite sequence of decreasing numbers to , where is sufficiently large, and . The resulted procedure is listed in Algorithm LABEL:algo:homotopy.
algocf[h]
4.2 Alternative Objective Function and Its Smoothing
Turning a constrained problem into an unconstrained problem can facilitate the computation of the effective gradient and Hessian of . In this section, we consider the alternative objective function: we modify by adding the penalty term :
Thus we consider the following unconstrained optimization problem,
(4) 
If we fix the magnitude , the function is . The optimizer of this is an increasing function of . Therefore the maximizer of (4) is exactly in the same direction as the constrained problem (2). The factor here is just to make sure the optimal solution has roughly unit norm; in practice we can choose any coefficient in front of and the solution will only differ by scaling.
Moreover, note that if in the absence of noise tensor , then
To get the stationary point, we have
Therefore, the new function is defined on and the maximizer of is close to . We also compute the smoothed version of this problem:
Lemma 4 (Smoothed Alternative Objective).
The smoothed version of the alternative objective is
Its gradient and Hessian are equal to
(5) 
and
(6) 
Here is the projection to symmetric matrices.
4.3 Phase Transition on the Homotopy Path
Notice that when , the dominating terms in are terms (the only term is a constant). Therefore, forms a quadratic function, so it has a unique global maximizer equal to , denoted as . Notice that this vector has different norm compared to the in previous section.
Before we state the Theorem, we need a counterpart of the Independence Condition. We call this the Strong Independence Conjecture:
Conjecture 1.
[Strong Independence Conjecture] Suppose and , be defined as before. With high probability, (1) and ; (2) for all on the homotopy path, and ,
Intuitively, this assumes that the noise is not adversarially correlated with the signal on the entire homotopy path. The main difference between the strong independence conjecture and the weak independence conjecture is that they apply to different algorithms with different number of iterations. The strong independence conjecture applies to the general Homotopy method, which may have a large number of iterations, and thus a conjecture that depends on the number of iterations does not provide us any useful properties. We use the strong independence conjecture to analyze the general Homotopy method to gain intuitions in order to design our algorithm. The weak conjecture is for our Algorithm LABEL:algo:simple, which only has rounds, and can be satisfied using the noise injection technique. Although we cannot use noise injection to prove the strong independence conjecture, similar conjectures are often used to get intuitions about optimization problems [22, 23, 24].
Theorem 4.1.
Assuming the Strong Independence Conjecture (Conjecture 1), when ,

When for a large enough constant , there exists a local maximizer of such that ;

When , we know there are two types of local maximizers :

and . This corresponds to a local maximizer near the true signal .

and . These local maximizers have poor correlation with the true signal.

Intuitively, this theorem shows that in the process of homotopy method, if we consider a continuous path in the sense that is close to for all , then (1) at the beginning, is close to ; (2) at some point , is a saddle point in the function and from the saddle point we are very likely to follow the Hessian direction to actually converge to the good local maximizer near the signal. This phenomenon is illustrated in Figure 1:
Figure 1(a) has large smoothing parameter, and the function has a unique local/global maximizer. Figure 1(b) has medium smoothing parameter, the original global maximizer now behaves like a local minimizer in one dimension, but it in general could be a saddle point in high dimensions. The Hessian at this point leads the direction of the homotopy path. In Figure 1(c) the smoothing is small and the algorithm should go to a different maximizer.
5 Experiments
For brevity we refer to our Tensor PCA with homotopy initialization method (Algorithm LABEL:algo:simple) as HomotopyPCA. We compare that with two other algorithms: the Flatten algorithm and the Power method. The Flatten algorithm was originally proposed by [14], where they show it works when . [18] accelerated the Flatten algorithm to nearlinear time, and improved the analysis to show it works when . The Power method is similar to our algorithm, except it does not use intuitions from homotopy, and initialize at a random vector. Note that there are other algorithms proposed in [18], however they are based on the SumofSquares SDP hierarchy, and even the fastest version runs in time (much worse than the algorithms compared here).
We first compare how often these algorithms successfully find the signal vector , given different values of and . The results are in Figure 2, in which axis represents and axis represents . We run 50 experiments for each values of , and the grayness in each grid shows how frequent each algorithm succeeds: black stands for “always fail” and white stands “always succeed”. For every algorithm, we say it fails if (1) when it converges, i.e., the result at two consecutive iterations are very close, the correlation with the signal is less than ; (2) the number of iterations exceeds 100. In the experiments for Power Method, we observe there are many cases where situation (1) is true, although our new algorithms can always find the correct solution. In these cases the function indeed have a local maximizer. From Figure 2, our algorithm outperforms both Power Method and the Flatten algorithm in practice. This suggests the constant hiding in our algorithm is possibly smaller.
Next we compare the number of iterations to converge with and , where varies in . In Figure 3, the xaxis is the number of iterations, and the axis is the correlation with the signal (error bars shows the distribution from 50 independent runs). For all , Homotopy PCA performs well — converges in less than iterations and finds the signal . The Power Method converges to a result with good correlations with the signal , but has large variance because it sometimes gets trapped in local optima. As for the Flatten algorithm, the algorithm always converges. However, it takes more iterations compared to our algorithm. Also when is small, the converged result has bad correlation with .
References
 [1] Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. arXiv preprint arXiv:1605.07272, 2016.
 [2] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832, 2014.
 [3] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797–842, 2015.
 [4] Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. Arxiv, 1602.06664, 2016.
 [5] Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery over the sphere. In Sampling Theory and Applications (SampTA), 2015 International Conference on, pages 407–410. IEEE, 2015.
 [6] Sanjeev Arora, Rong Ge, and Ankur Moitra. New algorithms for learning incoherent and overcomplete dictionaries. In COLT, pages 779–806, 2014.
 [7] Alekh Agarwal, Animashree Anandkumar, Prateek Jain, Praneeth Netrapalli, and Rashish Tandon. Learning sparsely used overcomplete dictionaries. In Proceedings of The 27th Conference on Learning Theory, pages 123–137, 2014.
 [8] Animashree Anandkumar, Rong Ge, and Majid Janzamin. Learning overcomplete latent variable models through tensor methods. In Proceedings of The 28th Conference on Learning Theory, pages 36–112, 2015.
 [9] Praneeth Netrapalli, UN Niranjan, Sujay Sanghavi, Animashree Anandkumar, and Prateek Jain. Nonconvex robust pca. In Advances in Neural Information Processing Systems, pages 1107–1115, 2014.
 [10] Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Solving a mixture of many random linear equations by tensor decomposition and alternating minimization. Arxiv, abs/1608.05749, 2016.

[11]
Hossein Mobahi and John W Fisher III.
On the link between gaussian homotopy continuation and convex
envelopes.
In
International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition
, pages 43–56. Springer, 2015.  [12] Hossein Mobahi and John W Fisher III. A theoretical analysis of optimization by gaussian continuation. In AAAI, pages 1205–1211. Citeseer, 2015.
 [13] Elad Hazan, Kfir Y Levy, and Shai ShalevShwartz. On graduated optimization for stochastic nonconvex problems. In International Conference on Machine Learning (ICML), 2016.
 [14] Emile Richard and Andrea Montanari. A statistical model for tensor pca. In Advances in Neural Information Processing Systems, pages 2897–2905, 2014.
 [15] Amelia Perry, Alexander S Wein, Afonso S Bandeira, and Ankur Moitra. Optimality and suboptimality of pca for spiked random matrices and synchronization. arXiv preprint arXiv:1609.05573, 2016.
 [16] Alex Bloemendal and Bálint Virág. Limits of spiked random matrices i. Probability Theory and Related Fields, 156(34):795–825, 2013.
 [17] Quentin Berthet, Philippe Rigollet, et al. Optimal detection of sparse principal components in high dimension. The Annals of Statistics, 41(4):1780–1815, 2013.
 [18] Samuel B Hopkins, Jonathan Shi, and David Steurer. Tensor principal component analysis via sumofsquare proofs. In Proceedings of The 28th Conference on Learning Theory, COLT, pages 3–6, 2015.
 [19] Christopher J Hillar and LekHeng Lim. Most tensor problems are nphard. Journal of the ACM (JACM), 60(6):45, 2013.
 [20] Hossein Mobahi. Closed form for some gaussian convolutions. arXiv:1602.05610, 2016.

[21]
Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi.
Lowrank matrix completion using alternating minimization.
In
Proceedings of the fortyfifth annual ACM symposium on Theory of computing
, pages 665–674. ACM, 2013.  [22] David L Donoho, Arian Maleki, and Andrea Montanari. Messagepassing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919, 2009.
 [23] Adel Javanmard and Andrea Montanari. State evolution for general approximate message passing algorithms, with applications to spatial coupling. Information and Inference, page iat004, 2013.
 [24] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.
 [25] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.
 [26] Ryota Tomioka and Taiji Suzuki. Spectral norm of random tensors. arXiv preprint arXiv:1407.1870, 2014.

[27]
Michel Ledoux.
Deviation inequalities on largest eigenvalues.
In Geometric aspects of functional analysis, pages 167–219. Springer, 2007.
Appendix A Omitted Proofs
a.1 Omitted Proof in Section 2
Lemma 5 (Lemma 1 restated).
where the vector is defined by . Moreover, let be a vector where , then we have .
Proof.
We can write as an expectation
Since
is just a degree 3 polynomial, we can expand it and use the lower moments of Gaussian distributions:
Therefore the first part of the lemma holds if we define to be the vector . In order to compute the vector , notice that the term is the linear term on , and it is equal to
This means
∎
a.2 Omitted Proof in Section 3
Theorem A.1 (Theorem 3.1 restated).
When for a large enough constant , under the Independence Condition (Condition 3.1), Algorithm LABEL:algo:simple finds a vector such that in iterations.
algocf[h]
Proof.
We first show the initial maximizer already has a nontrivial correlation with . Recall . Note that if is very large such that , then we already have . Later we will show that whenever all later iterations have the same property.
Therefore, we are left with the case when (this implies ). In this case, by Condition 3.1 we know and , therefore
Therefore, . Assume for large enough (where we will later show suffices)
Now let us consider the first step of power method. Let be the vector before normalization. Observe that . By Condition 3.1 we have bounds on and , therefore we have
Note that when and , the first term is much larger than . Hence for the first iteration, we have .
Similar as before, when , we already have . On the other hand, if , in this case, by Condition 3.1 we know . We again have . Therefore, . Combining the bounds for the norm of and its correlation with ,
Therefore, when , the correlation between and is larger than the correlation between and . This shows the first step makes an improvement.
In order to show this for the future steps, we do induction over . The induction hypothesis is for every , either or
Initially, for , we have already proved the induction hypothesis.
Now assume the induction hypothesis is true for . In the next iteration, let be the vector before normalization. Similar as before we have .
When , by Condition 3.1 we know the norm of is much smaller than . Therefore we still have .
In the other case, we follow the same strategy as the first step. By Condition 3.1 we can compute the correlation between and :
For the norm of , notice that the first term has norm , and the second term has norm . Note that these two terms are almost orthogonal by Independence Condition, therefore
If , then , where is a constant that is smaller than when is large enough. Therefore in this case . Thus we successfully recover in the next step.
Otherwise, we know . Then,
If we select , after rounds, we have , therefore we must always be in the first case. As a result . ∎
Lemma 6 (Lemma 2 restated).
Let the sequence be generated according to Section 3.1. Let , where ’s are tensors with independent Gaussian entries. Each entry in is distributed as . The two sets of variables and has the same distribution.
Proof.
Note that both distributions are multivariate Gaussians. Therefore we only need to show that they have the same first and second moments.
For the first moment, this is easy, we have and for all .
For the second moment (covariance), we consider the covariance between and . Note that for the distribution , as long as the 4 tuple the correlation is 0. We first show when