1 Introduction
Extracting the top eigenvalues/eigenvectors of a large symmetric matrix is a fundamental step in various machine learning algorithms. One prominent example of this problem is principal component analysis (PCA), in which we extract the top eigenvectors of the data covariance matrix, and there has been continuous effort in developing efficient stochastic/randomized algorithms for large-scale PCA (e.g., Garber et al., 2016; Shamir, 2015; Allen-Zhu & Li, 2016). The more general eigenvalue problem for large matrices without the covariance structure is relatively less studied. The method of choice for this problem has been the power method, or the faster but often lesser-known Lanczos algorithm (Golub & van Loan, 1996), both of which are based on iteratively computing matrix-vector multiplications with the input matrix until the component of the vector that lies in the trailing eigenspace vanishes. However, for very large-scale and dense matrices, even computing a single matrix-vector product is expensive.
An alternative is to consider much cheaper vector-vector products: instead of updating all the entries of the vector on each iteration by a full matrix-vector product, we consider the possibility of updating only one coordinate, by computing the inner product of a single row of the matrix with the vector. Such operations do not even require storing the entire matrix in memory. Intuitively, this may result in a significant overall speedup in certain likely scenarios, since certain coordinates in the matrix-vector product are more valuable than others for making local progress towards the leading eigenvector. Indeed, this is the precise rationale behind coordinate-descent methods, which have been extensively studied and are widely applied to convex optimization problems; see Wright (2015) for a comprehensive survey. Thus, given the structure of the eigenvalue problem, which is extremely suitable for coordinate-wise updates, and the celebrated success of coordinate-descent methods for convex optimization, a natural question is whether such updates can be applied to eigenvector computation with provable global convergence guarantees, despite the inherent nonconvexity of the problem.
Recently, Lei et al. (2016) have proposed two such methods, Coordinate-wise Power Method (CPM) and Symmetric Greedy Coordinate Descent (SGCD). Both methods update on each iteration only $k$ entries of the vector, for some fixed $k$. CPM updates on each iteration the $k$ coordinates that would change the most under one step of the power method, while SGCD applies a greedy heuristic for choosing the coordinates to be updated. The authors show that CPM enjoys a global convergence rate similar to that of the classical power iterations algorithm, provided that $k$, the number of coordinates to be updated on each iteration, is sufficiently large (or equivalently, the "noise" outside the selected coordinates is sufficiently small). In principle, this might force $k$ to be as large as the dimension $d$, and indeed in their experiments they set $k$ to grow linearly with $d$, which is overall not significantly faster than standard power iterations, and does not truly capture the concept of coordinate updates that proved useful for convex problems. The second algorithm proposed in Lei et al. (2016), SGCD, is shown to converge locally with a linear rate already for $k=1$ (i.e., only a single coordinate is updated on each iteration); however, this result assumes that the method is initialized with a vector that is already sufficiently close (in a nontrivial way) to the leading eigenvector. The dependence of CPM and SGCD on the inverse relative eigengap^{1} of the input matrix is similar to that of the power iterations algorithm, i.e., linear dependence, which in principle is suboptimal, since square-root dependence can be obtained by methods such as the Lanczos algorithm.

^{1}Defined as $\lambda_1/(\lambda_1 - \lambda_2)$ for a positive semidefinite matrix $\mathbf{A}$, where $\lambda_i$ is the $i$th largest eigenvalue of $\mathbf{A}$.

We present globally-convergent coordinate-wise algorithms for the leading eigenvector problem which resolve the above-mentioned concerns and significantly improve over previous algorithms. Our algorithms update only a single entry at each step and enjoy linear convergence. Furthermore, for a particular variant, the convergence rate depends only on the square root of the inverse relative eigengap, yielding a total runtime that dominates that of the standard power method and competes with that of the Lanczos method. In Section 2, we discuss the basis of our algorithms, the shift-and-invert power method (Garber & Hazan, 2015; Garber et al., 2016), which transforms the eigenvalue problem into a series of convex least squares problems.
In Section 3, we show that the least squares problems can be solved using efficient coordinate descent methods that are well-studied for convex problems. This allows us to make use of principled coordinate selection rules for each update, all of which have established convergence guarantees.
We provide a summary of the time complexities of different globally-convergent methods in Table 1. In particular, in cases where either the spectrum of the input matrix or the magnitude of its diagonal entries decays slowly, our methods can yield provable and significant improvements over previous methods. For example, for a spiked covariance model, our algorithm (SI-GSL) can have a runtime independent of the eigengap, while the runtime of the Lanczos method depends on it. We also verify this intuition empirically via numerical experiments.
Method  Time complexity

Power method
Lanczos
Ours (SI-GSL)
Ours (SI-ACDM)

Table 1: Time complexity (total number of coordinate updates) for finding an estimate satisfying the alignment guarantee with at least constant probability, where $\mathbf{v}_1$ is the leading eigenvector of a positive semidefinite matrix $\mathbf{A}$, with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$ in descending order. Since all methods are randomized, we assume all are initialized with a random unit vector.

Notations
We use boldface uppercase letters (e.g., $\mathbf{A}$) to denote matrices, and boldface lowercase letters (e.g., $\mathbf{x}$) to denote vectors. For a positive definite matrix $\mathbf{M}$, the vector norm $\|\mathbf{x}\|_{\mathbf{M}} = \sqrt{\mathbf{x}^\top \mathbf{M} \mathbf{x}}$ is defined for any $\mathbf{x}$. We use $A_{ij}$ to denote the element in row $i$ and column $j$ of the matrix $\mathbf{A}$, and $x_i$ to denote the $i$th element of the vector $\mathbf{x}$, unless stated otherwise. Additionally, $\mathbf{A}_{i:}$ and $\mathbf{A}_{:j}$ denote the $i$th row and $j$th column of the matrix $\mathbf{A}$, respectively.
Problem formulation
We consider the task of extracting the top eigenvector of a symmetric positive definite matrix $\mathbf{A} \in \mathbb{R}^{d \times d}$ (extensions to other settings are discussed in Section 5). Let the complete set of eigenvalues of $\mathbf{A}$ be $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d > 0$, with corresponding eigenvectors $\mathbf{v}_1, \dots, \mathbf{v}_d$, which form an orthonormal basis of $\mathbb{R}^d$. Without loss of generality, we assume $\lambda_1 \le 1$ (which can always be obtained by rescaling the matrix). Furthermore, we assume the existence of a positive eigenvalue gap $\lambda_1 - \lambda_2 > 0$, so that the top eigenvector is unique.
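As a quick illustration of this setup, the following numpy sketch builds a small symmetric positive definite matrix with hypothetical eigenvalues, rescales it so that $\lambda_1 \le 1$ as assumed above, and reads off the eigengap. All the specific numbers are illustrative, not taken from the paper.

```python
import numpy as np

# Hypothetical small test instance matching the assumptions above:
# a symmetric positive definite matrix, rescaled so lambda_1 <= 1,
# with a positive eigengap lambda_1 - lambda_2.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # random orthonormal basis
A = Q @ np.diag([2.0, 1.0, 0.5, 0.25]) @ Q.T       # chosen eigenvalues
A = (A + A.T) / 2                                  # symmetrize against round-off

eigvals = np.linalg.eigvalsh(A)                    # ascending order
A = A / eigvals[-1]                                # rescale so lambda_1 = 1

eigvals = np.linalg.eigvalsh(A)
gap = eigvals[-1] - eigvals[-2]                    # eigengap lambda_1 - lambda_2
```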
2 Shift-and-invert power method
In this section, we introduce the shiftandinvert approach to the eigenvalue problem and review its analysis, which will be the basis for our algorithms.
The most popular iterative algorithm for the leading eigenvalue problem is the power method, which iteratively performs the following matrix-vector multiplication and normalization steps:
$$\mathbf{w}_t = \frac{\mathbf{A}\mathbf{w}_{t-1}}{\|\mathbf{A}\mathbf{w}_{t-1}\|}, \qquad t = 1, 2, \dots$$
It can be shown that the iterates become increasingly aligned with the eigenvector corresponding to the largest eigenvalue in magnitude, and the number of iterations needed to achieve $\epsilon$ suboptimality in alignment is $\mathcal{O}\!\left(\frac{\lambda_1}{\lambda_1 - \lambda_2}\log\frac{1}{\epsilon}\right)$ (Golub & van Loan, 1996)^{2}. We see that the computational complexity depends linearly on $\frac{\lambda_1}{\lambda_1 - \lambda_2}$, and thus the power method converges slowly if the gap $\lambda_1 - \lambda_2$ is small.

^{2}We can always guarantee that $\langle \mathbf{w}_0, \mathbf{v}_1 \rangle \neq 0$ (with high probability) by taking $\mathbf{w}_0$ to be a random unit vector.
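The power iterations above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's algorithm; the fixed iteration budget is a hypothetical choice, whereas the analysis would pick it from the eigengap and the target accuracy.

```python
import numpy as np

def power_method(A, num_iters=500, seed=0):
    """Plain power iterations: w_t = A w_{t-1} / ||A w_{t-1}||.

    Returns a unit-norm estimate of the leading eigenvector of the
    symmetric matrix A.  `num_iters` is an illustrative fixed budget.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(A.shape[0])
    w /= np.linalg.norm(w)          # random unit-vector initialization
    for _ in range(num_iters):
        w = A @ w                   # full matrix-vector product each step
        w /= np.linalg.norm(w)
    return w
```

Each step costs a full matrix-vector product, which is exactly the expense the coordinate-wise methods below avoid.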
Shift-and-invert (Garber & Hazan, 2015; Garber et al., 2016) can be viewed as a preconditioning approach which improves the dependence of the time complexity on the eigenvalue gap. The main idea behind this approach is that, instead of running the power method on $\mathbf{A}$ directly, we can equivalently run the power method on the matrix $\mathbf{B} = (\sigma \mathbf{I} - \mathbf{A})^{-1}$, where $\sigma > \lambda_1$ is a shifting parameter. Observe that $\mathbf{B}$ has exactly the same set of eigenvectors as $\mathbf{A}$, and its eigenvalues are
$$\frac{1}{\sigma - \lambda_1} > \frac{1}{\sigma - \lambda_2} \ge \dots \ge \frac{1}{\sigma - \lambda_d} > 0.$$
If we have access to a $\sigma$ that is only slightly larger than $\lambda_1$, and in particular if $\lambda_1 < \sigma \le \lambda_1 + (\lambda_1 - \lambda_2)$, then the inverse relative eigengap of $\mathbf{B}$ is $\frac{\sigma - \lambda_2}{\lambda_1 - \lambda_2} \le 2$, which means that the power method, applied to this shifted-and-inverted matrix, will converge to the top eigenvector in only a polylogarithmic number of iterations, in particular without linear dependence on $\frac{\lambda_1}{\lambda_1 - \lambda_2}$.

In the shift-and-invert power method, the matrix-vector multiplications have the form $(\sigma \mathbf{I} - \mathbf{A})^{-1} \mathbf{w}_{t-1}$, which is equivalent to solving the convex least squares problem
$$\min_{\mathbf{x} \in \mathbb{R}^d}\; f_t(\mathbf{x}) = \frac{1}{2}\mathbf{x}^\top (\sigma \mathbf{I} - \mathbf{A})\mathbf{x} - \mathbf{w}_{t-1}^\top \mathbf{x}. \qquad (1)
Solving such least squares problems exactly could itself be costly if $d$ is large. Fortunately, the power method with approximate matrix-vector multiplications still converges, provided that the error in each step is controlled; analyses of the inexact power method, together with applications to PCA and CCA, can be found in Hardt & Price (2014); Garber et al. (2016); Ge et al. (2016); Wang et al. (2016).
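The outer loop of shift-and-invert can be sketched as below. For illustration only, the inner least squares problem is solved exactly with a dense solver; the point of the paper is to replace this inner solve with cheap, approximate coordinate-wise updates. The matrix and shift values in the test are hypothetical.

```python
import numpy as np

def shift_invert_power(A, sigma, num_iters=20, seed=0):
    """Shift-and-invert power iterations on B = (sigma*I - A)^{-1}.

    Each step computes w <- (sigma*I - A)^{-1} w, i.e. the minimizer of
    the least squares problem (1), then normalizes.  The exact dense
    solve here is a stand-in for an approximate iterative solver.
    """
    d = A.shape[0]
    M = sigma * np.eye(d) - A        # positive definite when sigma > lambda_1
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for _ in range(num_iters):
        w = np.linalg.solve(M, w)    # inner linear solve (least squares (1))
        w /= np.linalg.norm(w)
    return w
```

With $\sigma$ only slightly above $\lambda_1$, a handful of outer iterations suffices even when the gap $\lambda_1 - \lambda_2$ is tiny, which is the whole point of the preconditioning.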
We give the inexact shift-and-invert preconditioned power method in Algorithm 1, which consists of two phases. In Phase I, starting from a lower estimate of $\lambda_1$ and an upper bound on it (recall that by assumption, $\lambda_1 \le 1$), the repeat-until loop locates an estimate of the eigenvalue whose shift is only slightly above $\lambda_1$, and it does so by estimating the top eigenvalue of the shifted-and-inverted matrix (through a small number of power iterations in the for loop) and shrinking the gap between the lower and upper estimates. In Phase II, the algorithm fixes the shift-and-invert parameter $\sigma$ and runs many power iterations in the last for loop to achieve an accurate estimate of the top eigenvector $\mathbf{v}_1$.
We now provide a convergence analysis of Algorithm 1, following that of Garber & Hazan (2015) closely, but improving their rate (removing an extra factor) with the warm-start strategy for least squares problems by Garber et al. (2016), and making the initial versus final error ratio explicit in the exposition. We discuss the least-squares solver in Algorithm 1 in the next section.
Measure of progress Since $\mathbf{v}_1, \dots, \mathbf{v}_d$ form an orthonormal basis of $\mathbb{R}^d$, we can write each normalized iterate $\mathbf{w}_t$ as a linear combination of the eigenvectors. Our goal is to have high alignment between the estimate $\mathbf{w}_t$ and $\mathbf{v}_1$, i.e., $|\langle \mathbf{w}_t, \mathbf{v}_1 \rangle|$ close to 1. Equivalently, we would like the components of $\mathbf{w}_t$ along the remaining eigenvectors to be small (see Lemma 5 in Appendix A).
2.1 Iteration complexity of inexact power method
Consider the inexact shift-and-invert power iterations, in which each iterate $\mathbf{w}_t$ is a normalized approximate solution of the linear system $(\sigma \mathbf{I} - \mathbf{A})\mathbf{x} = \mathbf{w}_{t-1}$. We can divide the convergence behavior of this method into two regimes: in the crude regime, our goal is to estimate the top eigenvalue (as in Phase I of Algorithm 1) regardless of the existence of an eigengap; whereas in the accurate regime, our goal is to estimate the top eigenvector (as in Phase II of Algorithm 1).
Following Garber & Hazan (2015); Garber et al. (2016), we provide convergence guarantees for inexact shift-and-invert power iterations in Lemma 7 (Appendix B), quantifying the accuracy in solving each least squares problem that is sufficient for the overall method to converge. Note that the iteration complexity for the crude regime is independent of the eigengap, while in the accurate regime the rate at which the error decreases does depend on the gap between the first two eigenvalues.
2.2 Bounding the initial error for each least squares problem
For each least squares problem in the inexact shift-and-invert power method, one can show that the optimal initialization along the direction of the previous iterate is an appropriately scaled copy of it (Garber et al., 2016). We can bound the suboptimality of this initialization, defined as the difference between its objective value and the optimum. See the full analysis in Appendix C.
Lemma 1.
Initialize each least squares problem in the inexact shift-and-invert power method as above. Then the initial suboptimality can be bounded in terms of the required final suboptimality as follows.
2.3 Iteration complexity of Algorithm 1
Based on the iteration complexity of the inexact power method and the warm-start strategy, we derive the total number of iterations (least squares problems) needed by Algorithm 1.
Lemma 2 (Iteration complexity of the repeat-until loop in Algorithm 1).
Set the shift parameters and initialize each least squares problem as in Lemma 1, maintaining the prescribed ratio between initial and final error throughout the repeat-until loop. Then upon exiting this loop, the final shift estimate satisfies
(2)
and the number of iterations of the repeat-until loop is logarithmic in the inverse eigengap.
Lemma 3 (Iteration complexity of the final for loop in Algorithm 1).
As shown in the next section, we will use linearly convergent solvers for the least squares problems, whose runtime depends on the ratio between the initial and final suboptimality. Lemma 2 bounds the number of least squares problems we need to solve in Phase I, together with the required accuracy of each, and Lemma 3 does the same for Phase II.
3 Coordinate descent for least squares
Unlike in PCA, for general eigenvalue problems the matrix $\mathbf{A}$ may not have the structure of a data covariance. As a result, fast stochastic gradient methods for finite sums (such as SVRG, Johnson & Zhang, 2013, used in Garber et al., 2016) do not apply. However, we can instead apply efficient coordinate descent (CD) methods for convex problems in our setting. In this section, we review CD methods and study their complexity for solving the least squares problems created by Algorithm 1.
There is a rich literature on CD methods, and they have attracted resurgent interest recently due to their simplicity and efficiency for big-data problems; we refer the reader to Wright (2015) for a comprehensive survey. In each update of CD, we pick one coordinate and take a step along the direction of the negative coordinate-wise gradient. To state the convergence properties of CD methods, we need the definitions of a few key quantities. Recall that we would like to solve least squares problems of the form (1), whose optimal solution is unique since the Hessian is positive definite.
Smoothness
The gradient is coordinate-wise Lipschitz continuous: for all $\mathbf{x} \in \mathbb{R}^d$, $i \in \{1, \dots, d\}$, and $\gamma \in \mathbb{R}$, we have
$$|\nabla_i f(\mathbf{x} + \gamma \mathbf{e}_i) - \nabla_i f(\mathbf{x})| \le L_i |\gamma|,$$
where $\mathbf{e}_i$ is the $i$th standard basis vector. Note that for least squares problems, the coordinate-wise gradient Lipschitz constants $L_i$ are the diagonal entries of the Hessian. Denote by $L_{\max}$ and $\bar{L}$ the largest and the average Lipschitz constant over all coordinates, respectively. These two Lipschitz constants are to be distinguished from the "global" smoothness constant $L$, which satisfies
$$\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \le L\|\mathbf{x} - \mathbf{y}\| \quad \text{for all } \mathbf{x}, \mathbf{y}.$$
Since $f$ is quadratic, $L$ equals the largest eigenvalue of the Hessian. Observe that (Wright, 2015)
$$L_{\max} \le L \le d\,\bar{L}, \qquad (3)$$
with the upper bound achieved by the all-ones Hessian matrix $\mathbf{1}\mathbf{1}^\top$.
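For the quadratic objective (1), all of these constants can be read off the Hessian $\sigma \mathbf{I} - \mathbf{A}$ directly. The following numpy sketch (with a hypothetical test matrix and shift) computes them and checks the sandwich relation between $L_{\max}$, $L$, and $d\bar{L}$; the specific numbers are illustrative only.

```python
import numpy as np

# Smoothness and strong-convexity constants for f(x) = 0.5 x^T M x - b^T x,
# whose Hessian is M = sigma*I - A.  A and sigma below are hypothetical.
A = np.diag([1.0, 0.6, 0.2])
sigma = 1.1
M = sigma * np.eye(3) - A

L_coord = np.diag(M).copy()       # coordinate-wise constants L_i = M_ii
L_max = L_coord.max()             # largest coordinate-wise constant
L_bar = L_coord.mean()            # average coordinate-wise constant
L = np.linalg.eigvalsh(M)[-1]     # global smoothness: top Hessian eigenvalue
mu = np.linalg.eigvalsh(M)[0]     # l2 strong convexity: sigma - lambda_1
```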
Strong convexity
As mentioned before, the matrices in our least squares problems are always positive definite, with smallest eigenvalue $\sigma - \lambda_1 > 0$. As a result, $f$ is strongly convex with respect to the $\ell_2$ norm, with strong-convexity parameter $\mu = \sigma - \lambda_1$. Another strong-convexity parameter we will need, $\mu_1$, is defined with respect to the $\ell_1$ norm. It is shown by Nutini et al. (2015) that $\mu/d \le \mu_1 \le \mu$.
We collect the relevant parameters for CD methods in Table 2.

Notation  Description
$L_i$  smoothness parameter for the $i$th coordinate
$\bar{L}$  average smoothness parameter over all coordinates
$L_{\max}$  largest smoothness parameter over all coordinates
$L$  global smoothness parameter
$\mu$  strong-convexity parameter in the $\ell_2$ norm
$\mu_1$  strong-convexity parameter in the $\ell_1$ norm
3.1 Coordinate selection
The choice of coordinate to update is a central research topic for CD methods. We now discuss several coordinate selection rules, along with convergence rates, that are most relevant in our setting.
Gauss-Southwell-Lipschitz (GSL, Nutini et al., 2015): a greedy rule that selects the coordinate with the largest gradient magnitude (relative to the square root of its Lipschitz constant). The greedy rule tends to work well for high-dimensional sparse problems (Dhillon et al., 2011).
Cyclic (Beck & Tetruashvili, 2013): given an ordering of the coordinates, cyclic coordinate descent processes each coordinate exactly once per pass, according to the ordering. In the extended "essentially cyclic" rule, each coordinate is guaranteed to be chosen at least once within every fixed number of consecutive updates. This rule is particularly useful when the data cannot fit into memory, as we can iteratively load a portion of the matrix and update the corresponding coordinates, and still enjoy convergence guarantees.
Accelerated randomized coordinate descent methods (ACDM): in each step, a coordinate is selected randomly, according to a certain (possibly nonuniform) distribution based on the coordinate-wise Lipschitz constants. In accelerated versions of randomized CD (Nesterov, 2012; Lee & Sidford, 2013; Lu & Xiao, 2015; Allen-Zhu et al., 2016), one maintains auxiliary sequences of iterates to better approximate the objective function and obtain a faster convergence rate, at the cost of a (slightly) more involved algorithm. We implement the variant NU-ACDM of Allen-Zhu et al. (2016), which samples each coordinate $i$ with probability proportional to $\sqrt{L_i}$ and has the fastest convergence rate to date.
We provide pseudocode for coordinate descent with the above rules in Algorithm 2.^{3} Note that the step size for the chosen coordinate $i$ is $1/L_i$, the inverse Lipschitz constant; for least squares problems, where $f$ is quadratic in each dimension, this step size exactly minimizes the function over coordinate $i$ given the rest of the coordinates. In some sense, the GSL rule is the "optimal myopic coordinate update" (Nutini et al., 2015).

^{3}The pseudocode for accelerated randomized CD is more involved, and we refer the reader to Allen-Zhu et al. (2016).
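A minimal, hypothetical sketch of such coordinate-wise updates for the least squares objective is given below (not Algorithm 2 verbatim, and the accelerated variant is omitted). Each update takes the exact minimizing step $\nabla_i f(\mathbf{x})/L_i$ and refreshes the gradient with one column of the Hessian, so a single update costs one row/column inner product rather than a full matrix-vector product.

```python
import numpy as np

def cd_least_squares(M, b, rule="gsl", num_updates=400, seed=0):
    """Coordinate descent for f(x) = 0.5 x^T M x - b^T x, M positive definite.

    `rule` picks the coordinate: greedy Gauss-Southwell-Lipschitz,
    cyclic, or (plain, unaccelerated) uniform random.  Illustrative
    sketch, not the paper's exact pseudocode.
    """
    d = M.shape[0]
    L_coord = np.diag(M)                 # coordinate-wise Lipschitz constants
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    grad = -b.astype(float).copy()       # gradient M x - b at x = 0
    for t in range(num_updates):
        if rule == "gsl":
            i = int(np.argmax(np.abs(grad) / np.sqrt(L_coord)))
        elif rule == "cyclic":
            i = t % d
        else:                            # uniform random coordinate
            i = int(rng.integers(d))
        step = grad[i] / L_coord[i]      # exact coordinate minimization
        x[i] -= step
        grad -= step * M[:, i]           # rank-one gradient refresh, O(d)
    return x
```

For sparse $\mathbf{A}$, the gradient refresh touches only the nonzeros of the selected column, which is where the per-update savings come from.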
Connection to the greedy selection rule of Lei et al. (2016)
Given the current estimate of the eigenvector, the coordinate-wise power method of Lei et al. (2016) selects for update the coordinate that would change the most if a full power iteration were performed. Their selection rule amounts to choosing the largest element of the residual between the power-method update and a rescaled copy of the current iterate, where the rescaling uses the current (lower) estimate of the eigenvalue $\lambda_1$. On the other hand, the greedy Gauss-Southwell rule for our method chooses the largest element of the gradient of the least squares objective, which involves an earlier estimate of the eigenvector and an upper estimate of $\lambda_1$.
3.2 Convergence properties
We quote the convergence rates of CD methods from the literature.
Lemma 4 (Iteration complexity of CD methods).
The numbers of coordinate updates needed by Algorithm 2 to reach the target suboptimality, using the different coordinate selection rules, are as follows.
Lemma 4 implies that the coordinate descent methods converge linearly for our strongly convex objective: the suboptimality decreases at a different geometric rate for each method. Most remarkable is the time complexity of ACDM, which has square-root dependence on the strong-convexity parameter: as we mentioned earlier, the strong-convexity parameter of our least squares problems (in the accurate regime) is of the order of the eigengap, and thus instantiating the shift-and-invert power method with ACDM leads to a total time complexity with square-root dependence on the inverse eigengap. In comparison, the total time complexity of the standard power method depends linearly on the inverse eigengap. Thus we expect our algorithm to be much faster when the eigengap is small.
It is also illuminating to compare ACDM with full gradient descent methods. To achieve the same accuracy, accelerated gradient descent (AGD, Nesterov, 2004) has an iteration complexity governed by the global smoothness constant $L$, but each full gradient calculation costs $d$ coordinate-update operations (not taking sparsity into account), whereas each update of ACDM touches only a single coordinate. In view of (3), ACDM is therefore typically more efficient than AGD. It has also been observed empirically that ACDM outperforms AGD and the celebrated conjugate gradient method for solving least squares problems (Lee & Sidford, 2013; Allen-Zhu et al., 2016).
Although the convergence rates of the GSL and cyclic rules are inferior to that of ACDM, we often observe competitive empirical performance from them. They are easier to implement and can be the method of choice in certain scenarios (e.g., cyclic coordinate descent may be employed when the data cannot fit in memory). Finally, we remark that our general scheme and conclusions carry over to block and parallel coordinate descent methods (Bradley et al., 2011; Richtarik & Takac, 2014), where more than one coordinate is updated in each step.
4 Putting it all together
Our proposed approach consists of instantiating Algorithm 1 with the different least squares optimizers (LSO) presented in Algorithm 2. Our analysis of the total time complexity thus combines the iteration complexity of shift-and-invert with the time complexity of the CD methods. First, by Lemma 2, the time complexity of Phase I is not dominant because it is independent of the final error. Second, by Lemma 3, we need to solve a sequence of subproblems, each with a constant ratio between initial and final suboptimality. Third, Lemma 4 gives the number of coordinate updates for solving each subproblem with GSL and with ACDM. Since each coordinate update costs time proportional to one row of the matrix (or its nonzeros), we obtain the time complexity of our algorithms given in Table 1.
5 Extensions
Our algorithms can be easily extended to several other settings. Although our analysis has assumed that $\mathbf{A}$ is positive semidefinite, given an estimate of the maximum eigenvalue, the shift-and-invert algorithm can always convert the problem into a series of convex least squares problems, regardless of whether $\mathbf{A}$ is positive semidefinite, and so CD methods can be applied with convergence guarantees. To extract the trailing (smallest) eigenvector of $\mathbf{A}$, one can equivalently extract the top eigenvector of $c\mathbf{I} - \mathbf{A}$ for a suitable constant $c$. Furthermore, extracting the top singular vectors of a nonsymmetric matrix $\mathbf{B}$ is equivalent to extracting the top eigenvector of the symmetric dilation $\begin{pmatrix} \mathbf{0} & \mathbf{B} \\ \mathbf{B}^\top & \mathbf{0} \end{pmatrix}$.
To extend our algorithms to extracting multiple top eigenvectors, we can use the "peeling" procedure: we iteratively extract one eigenvector at a time, removing the already-extracted component from the data matrix before extracting the next direction. For example, to extract the second eigenvector $\mathbf{v}_2$, we can equivalently extract the top eigenvector of the deflated matrix $\mathbf{A} - \lambda_1 \mathbf{v}_1 \mathbf{v}_1^\top$. Note that we do not need to explicitly compute and store the deflated matrix, which may be less sparse than $\mathbf{A}$; all we need in CD methods are its rows, which can be evaluated by vector operations, e.g., by storing $\mathbf{v}_1$ separately. We refer the reader to Allen-Zhu & Li (2016) for a careful analysis of the accuracy required in solving for each direction and of the runtime.
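The implicit deflation described above can be sketched as a matrix-vector routine that never forms the deflated matrix. This is an illustrative sketch (shown with full matvecs for brevity; in the CD setting the same trick applies row by row):

```python
import numpy as np

def deflated_matvec(A, v1, lam1, x):
    """Multiply x by the deflated matrix A - lam1 * v1 v1^T without ever
    forming it: one multiplication by (possibly sparse) A plus O(d)
    vector work.  (v1, lam1) is a previously extracted eigenpair.
    """
    return A @ x - lam1 * v1 * (v1 @ x)
```

Plugging this matvec (or its coordinate-wise analogue) into any of the solvers extracts the second direction while preserving the sparsity of $\mathbf{A}$.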
6 Experiments
We now present experimental results to demonstrate the efficiency of our algorithms, denoted SI (shift-and-invert) + least squares solver. Besides CD methods, we include AGD as a solver, since SI+AGD gives the same time complexity as Lanczos. We compare our algorithms with the coordinate-wise power method (CPM) and SGCD of Lei et al. (2016), as well as the standard power method.
6.1 Synthetic datasets
We first generate synthetic datasets to validate the fast convergence of our algorithms. Our test matrix is constructed by applying a random rotation to a diagonal matrix whose spectrum decays at a controlled rate; the decay parameter also determines the eigengap. We use several passes of coordinate updates for SI-Cyclic, SI-GSL, and SI-ACDM, and a corresponding number of accelerated gradient updates for SI-AGD, to approximately solve the least squares problems (1) in Algorithm 1.
We test several settings of the dimension and the eigengap, and the results are summarized in Figure 1. We observe that in most cases, CPM/SGCD indeed converge faster than the standard power method, justifying the intuition that some coordinates are more important than others. Furthermore, our algorithms significantly improve over CPM/SGCD for accurate estimation of the top eigenvector, especially when the eigengap is small, validating our convergence analysis, which has improved dependence on the gap.
6.2 Realworld datasets
A major source of large-scale sparse eigenvalue problems in machine learning is graph-related applications (Shi & Malik, 2000; Ng et al., 2002; Fowlkes et al., 2004). We now examine the efficiency of the proposed algorithms on several large-scale network datasets with up to a few million nodes^{4}, whose statistics are summarized in Table 3 (see Appendix F). Let $\mathbf{W}$ be the (binary) adjacency matrix of each network; our goal is to compute the top eigenvectors of the corresponding normalized graph Laplacian matrix (von Luxburg, 2007), defined as $\mathbf{I} - \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}$, where $\mathbf{D}$ is a diagonal matrix containing the degree of each node on the diagonal. This task can be extended to computing a low-rank approximation of the normalized graph Laplacian (Gittens & Mahoney, 2013).^{5} We discuss implementation details in Appendix F.

^{4}http://snap.stanford.edu/data/index.html

^{5}We have also experimented with the original task of Lei et al. (2016). For their task, our algorithms do not achieve significant improvement over CPM/SGCD, because the gap of the unnormalized adjacency matrix is quite large.
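The normalized Laplacian above can be assembled from an adjacency matrix as follows. A dense numpy sketch for illustration only; the experiments on million-node graphs would use sparse row operations instead, and the isolated-node guard is our own assumption:

```python
import numpy as np

def normalized_laplacian(W):
    """Normalized graph Laplacian I - D^{-1/2} W D^{-1/2}
    (von Luxburg, 2007) for a symmetric binary adjacency matrix W.
    Dense for illustration; real graphs would use sparse matrices.
    """
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))  # guard isolated nodes
    return np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
```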
The results are summarized in Figure 2. We observe that the proposed methods usually improve over CPM and SGCD, with a substantial speedup in runtime over CPM/SGCD depending on the dataset. Unlike in the simulation study, although SI-ACDM has faster convergence in theory, it is often slightly slower than SI-GSL due to more maintenance per update (it needs to maintain two vector sequences instead of one for plain CD methods).
References
 Allen-Zhu & Li (2016) Allen-Zhu, Zeyuan and Li, Yuanzhi. LazySVD: Even faster SVD decomposition yet without agonizing pain. In NIPS, 2016.
 Allen-Zhu et al. (2016) Allen-Zhu, Zeyuan, Qu, Zheng, Richtarik, Peter, and Yuan, Yang. Even faster accelerated coordinate descent using non-uniform sampling. In ICML, 2016.
 Beck & Tetruashvili (2013) Beck, Amir and Tetruashvili, Luba. On the convergence of block coordinate descent type methods. SIAM Journal on Optimization, 23(4):2037–2060, 2013.
 Bradley et al. (2011) Bradley, Joseph, Kyrola, Aapo, Bickson, Daniel, and Guestrin, Carlos. Parallel coordinate descent for L1-regularized loss minimization. In ICML, 2011.
 Dhillon et al. (2011) Dhillon, Inderjit, Ravikumar, Pradeep, and Tewari, Ambuj. Nearest neighbor based greedy coordinate descent. In NIPS, 2011.
 Fowlkes et al. (2004) Fowlkes, Charless, Belongie, Serge, Chung, Fan, and Malik, Jitendra. Spectral grouping using the Nyström method. IEEE Trans. Pattern Analysis and Machine Intelligence, 26(2):214–225, 2004.
 Garber & Hazan (2015) Garber, Dan and Hazan, Elad. Fast and simple PCA via convex optimization. arXiv:1509.05647 [math.OC], 2015.
 Garber et al. (2016) Garber, Dan, Hazan, Elad, Jin, Chi, Kakade, Sham M., Musco, Cameron, Netrapalli, Praneeth, and Sidford, Aaron. Faster eigenvector computation via shiftandinvert preconditioning. In ICML, 2016.
 Ge et al. (2016) Ge, Rong, Jin, Chi, Kakade, Sham M., Netrapalli, Praneeth, and Sidford, Aaron. Efficient algorithms for largescale generalized eigenvector computation and canonical correlation analysis. In ICML, 2016.
 Gittens & Mahoney (2013) Gittens, Alex and Mahoney, Michael. Revisiting the Nyström method for improved large-scale machine learning. In ICML, 2013.
 Golub & van Loan (1996) Golub, Gene H. and van Loan, Charles F. Matrix Computations. third edition, 1996.
 Hardt & Price (2014) Hardt, Moritz and Price, Eric. The noisy power method: A meta algorithm with applications. In NIPS, 2014.

 Johnson & Zhang (2013) Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, 2013.
 Lee & Sidford (2013) Lee, Yin Tat and Sidford, Aaron. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In Proc. of the 54th Annual Symposium on Foundations of Computer Science (FOCS), 2013.
 Lei et al. (2016) Lei, Qi, Zhong, Kai, and Dhillon, Inderjit S. Coordinatewise power method. In NIPS, 2016.
 Lu & Xiao (2015) Lu, Zhaosong and Xiao, Lin. On the complexity analysis of randomized blockcoordinate descent methods. Math. Prog., 152(1):615–642, 2015.
 Nesterov (2004) Nesterov, Y. Introductory Lectures on Convex Optimization. A Basic Course. Number 87 in Applied Optimization. 2004.
 Nesterov (2012) Nesterov, Yu. Efficiency of coordinate descent methods on hugescale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

Ng et al. (2002)
Ng, A. Y., Jordan, M. I., and Weiss, Y.
On spectral clustering: Analysis and an algorithm.
In NIPS, 2002.  Nutini et al. (2015) Nutini, Julie, Schmidt, Mark, Laradji, Issam H., Friedlander, Michael, and Koepke, Hoyt. Coordinate descent converges faster with the gausssouthwell rule than random selection. In ICML, 2015.
 Richtarik & Takac (2014) Richtarik, Peter and Takac, Martin. Parallel coordinate descent methods for big data optimization. Math. Prog., pp. 1–52, 2014.
 Shamir (2015) Shamir, Ohad. A stochastic PCA and SVD algorithm with an exponential convergence rate. In ICML, 2015.
 Shi & Malik (2000) Shi, Jianbo and Malik, Jitendra. Normalized cuts and image segmentation. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
 von Luxburg (2007) von Luxburg, Ulrike. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, December 2007.
 Wang et al. (2016) Wang, Weiran, Wang, Jialei, Garber, Dan, and Srebro, Nathan. Globally convergent stochastic optimization for canonical correlation analysis. In NIPS, 2016.
 Wright (2015) Wright, Stephen J. Coordinate descent algorithms. Math. Prog., 2015.
Appendix A Auxiliary Lemmas
Lemma 5.
Consider a positive semidefinite matrix with eigenvalues and corresponding eigenvectors . For a unit vector satisfying where , we have
Proof.
By direct calculation, we obtain
∎
Lemma 6.
For two nonzero vectors , we have
Proof.
By direct calculation, we obtain
where we have used the triangle inequality in the second inequality. ∎
Appendix B Analysis of inexact power method in different regimes
Lemma 7 (Iteration complexity of inexact power method).
Consider the following inexact shift-and-invert power iterations: at each step, the new iterate is a normalized approximate minimizer of the corresponding least squares problem.

(Crude regime) Define . Assume that for all , the error in minimizing satisfies
Then we have .

(Accurate regime) Define . Assume that for all , the error in minimizing satisfies
Let and . Then we have for all .
Proof.
In the following, we use the shorthand . Observe that
Let the exact solution to the least squares problem be . If we obtain an approximate solution such that , it follows from the quadratic objective that
(4) 
Furthermore, we have for the exact solution that
(5) 
Crude regime
For the crude regime, we denote by the number of eigenvalues of that are greater than or equal to .
We will relate the iterates of the inexact power method to those of the exact power method from the same initialization.
On the one hand, we can show that exact power iterations converge quickly for eigenvalue estimation; in particular, after sufficiently many iterations the eigenvalue estimate is accurate, and it remains so for all subsequent iterations.
On the other hand, we can lower bound