 # Efficient coordinate-wise leading eigenvector computation

We develop and analyze efficient "coordinate-wise" methods for finding the leading eigenvector, where each step involves only a vector-vector product. We establish global convergence with overall runtime guarantees that are at least as good as Lanczos's method and dominate it for slowly decaying spectrum. Our methods are based on combining a shift-and-invert approach with coordinate-wise algorithms for linear regression.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Extracting the top eigenvalues/eigenvectors of a large symmetric matrix is a fundamental step in various machine learning algorithms. One prominent example of this problem is principal component analysis (PCA), in which we extract the top eigenvectors of the data covariance matrix, and there has been continuous effort in developing efficient stochastic/randomized algorithms for large-scale PCA (e.g.,

Garber et al., 2016; Shamir, 2015; Allen-Zhu & Li, 2016). The more general eigenvalue problems for large matrices without the covariance structure is relatively less studied. The method of choice for this problem has been the power method, or the faster but often less known Lanczos algorithm (Golub & van Loan, 1996)

, which are based on iteratively computing matrix-vector multiplications with the input matrix until the component of the vector that lies in the tailing eigenspace vanishes. However, for very large-scale and dense matrices, even computing a single matrix-vector product is expensive.

An alternative is to consider much cheaper vector-vector products, i.e., instead of updating all the entries in the vector on each iteration by a full matrix-vector product, we consider the possibility of only updating one coordinate, by only computing the inner product of single row of the matrix with the vector. Such operations do not even require to store the entire matrix in memory. Intuitively, this may result in an overall significant speedup, in certain likely scenarios, since certain coordinates in the matrix-vector product are more valuable than others for making local progress towards converging to the leading eigenvector. Indeed, this is the precise rational behind coordinate-descent methods that were extensively studied and are widely applied to convex optimization problems, see Wright (2015) for a comprehensive survey. Thus, given the structure of the eigenvalue problem which is extremely suitable for coordinate-wise updates, and the celebrated success of coordinate-descent methods for convex optimization, a natural question is whether such updates can be applied for eigenvector computation with provable global convergence guarantees, despite the inherent non-convexity of the problem.

Recently, Lei et al. (2016) have proposed two such methods, Coordinate-wise Power Method (CPM) and Symmetric Greedy Coordinate Descent (SGCD). Both methods update on each iteration only entries in the vector, for some fixed . CPM updates on each iteration the

coordinates that would change the most under one step of power method, while SGCD applies a greedy heuristic for choosing the coordinates to be updated. The authors show that CPM enjoys a global convergence rate, with rate similar to the classical power iterations algorithm provided that

, the number of coordinates to be updated on each iteration, is sufficiently large (or equivalently, the ”noise” outside the selected coordinate is sufficiently small). In principle, this might force to be as large as the dimension , and indeed in their experiments they set to grow linearly with , which is overall not significantly faster than standard power iterations, and does not truly capture the concept of coordinate updates proved useful to convex problems. The second algorithm proposed in Lei et al. (2016), SGCD, is shown to converge locally with a linear rate already for (i.e., only a single coordinate is updated on each iteration), however this results assume that the method is initialized with a vector that is already sufficiently close (in a non-trivial way) to the leading eigenvector. The dependence of CPM and SGCD on the inverse relative eigengap111Defined as for a positive semidefinite matrix , where is the -th largest eigenvalue of . of the input matrix is similar to that of the power iterations algorithm, i.e. linear dependence, which in principal is suboptimal, since square-root dependence can be obtained by methods such as the Lanczos algorithm.

We present globally-convergent coordinate-wise algorithms for the leading eigenvector problem which resolves the abovementioned concerns and significantly improve over previous algorithms. Our algorithms update only a single entry at each step and enjoy linear convergence. Furthermore, for a particular variant, the convergence rate depends only on the square-root of the inverse relative eigengap, yielding a total runtime that dominates that of the standard power method and competes with that of the Lanczos’s method. In Section 2, we discuss the basis of our algorithm, the shift-and-invert power method (Garber & Hazan, 2015; Garber et al., 2016), which transforms the eigenvalue problem into a series of convex least squares problems. In Section 3, we show the least squares problems can be solved using efficient coordinate descent methods that are well-studied for convex problems. This allow us to make use of principled coordinate selection rules for each update, all of which have established convergence guarantees.

We provide a summary of the time complexities of different globally-convergent methods in Table 1. In particular, it is observable that in cases where either the spectrum of the input matrix or the magnitude of its diagonal entries is slowly decaying, our methods can yield provable and significant improvements over previous methods. For example, for a spiked covariance model whose eigenvalues are , our algorithm (SI-GSL) can have a runtime independent of the eigengap , while the runtime of Lanczos’s method depends on . We also verify this intuition empirically via numerical experiments.

#### Notations

We use boldface uppercase letters (e.g., ) to denote matrices, and boldface lowercase letters (e.g., ) to denote vectors. For a positive definite matrix , the vector norm is defined as for any . We use to denote the element in row and column of the matrix , and use to denote the -th element of the vector unless stated otherwise. Additionally, and denote the -th row and -th column of the matrix respectively.

#### Problem formulation

We consider the task of extracting the top eigenvector of a symmetric positive definite matrix (extensions to other setting are discussed in Section 5). Let the complete set of eigenvalues of be , with corresponding eigenvectors which form an orthonormal basis of . Without loss of generality, we assume (which can always be obtained by rescaling the matrix). Furthermore, we assume the existence of a positive eigenvalue gap so that the top eigenvector is unique.

## 2 Shift-and-invert power method

In this section, we introduce the shift-and-invert approach to the eigenvalue problem and review its analysis, which will be the basis for our algorithms.

The most popular iterative algorithm for the leading eigenvalue problem is the power method, which iteratively performs the following matrix-vector multiplications and normalization steps

 ~wt←Awt−1,wt←~wt∥~wt∥,fort=1,….

It can be shown that the iterates become increasingly aligned with the eigenvector corresponding to the largest eigenvalue in magnitude, and the number of iterations needed to achieve -suboptimality in alignment is  (Golub & van Loan, 1996)222We can always guarantee that by taking to be a random unit vector.. We see that the computational complexity depends linearly on , and thus power method converges slowly if the gap is small.

Shift-and-invert (Garber & Hazan, 2015; Garber et al., 2016) can be viewed as a pre-conditioning approach which improves the dependence of the time complexity on the eigenvalue gap. The main idea behind this approach is that, instead of running power method on directly, we can equivalently run power method on the matrix where is a shifting parameter. Observe that has exactly the same set of eigenvectors as , and its eigenvalues are

 β1≥β2≥⋯≥βd>0,whereβi=1λ−ρi.

If we have access to a that is slightly larger than , and in particular if , then the inverse relative eigengap of is

, which means that the power method, applied to this shift and inverted matrix, will converge to the top eigenvector in only a poly-logarithmic number of iterations, in particular, without linear dependence on

.

In shift-and-invert power method, the matrix-vector multiplications have the form , which is equivalent to solving the convex least squares problem

 (1)

Solving such least squares problems exactly could be costly itself if is large. Fortunately, power method with approximate matrix-vector multiplications still converges, provided that the errors in each step is controlled; analysis of inexact power method together with applications to PCA and CCA can be found in Hardt & Price (2014); Garber et al. (2016); Ge et al. (2016); Wang et al. (2016).

We give the inexact shift-and-invert preconditioning power method in Algorithm 1, which consists of two phases. In Phase I, starting from an lower estimate of and an upper bound of , namely (recall that by assumption, ), the repeat-until loop locates an estimate of the eigenvalue such that , and it does so by estimating the top eigenvalue of (through a small number of power iterations in the for loop), and shrinking the gap between and . In Phase II, the algorithm fixes the shift-and-invert parameter , and runs many power iterations in the last for loop to achieve an accurate estimate of the top eigenvector .

We now provide convergence analysis of Algorithm 1 following that of Garber & Hazan (2015) closely, but improving their rate (removing an extra factor) with the warm-start strategy for least squares problems by Garber et al. (2016), and making the initial versus final error ratio explicit in the exposition. We discuss the least-squares solver in Algorithm 1 in the next section.

Measure of progress Since form an orthonormal basis of , we can write each normalized iterate as a linear combination of the eigenvectors as

 wt=d∑i=1ξtipi,% whereξti=w⊤tpi,fori=1,…,d,

and . Our goal is to have high alignment between the estimate and , i.e., for . Equivalently, we would like to have (see Lemma 5 in Appendix A).

### 2.1 Iteration complexity of inexact power method

Consider the inexact shift-and-invert power iterations: , , where , for . We can divide the convergence behavior of this method into two regimes: in the crude regime, our goal is to estimate the top eigenvalue (as in Phase I of Algorithm 1) regardless of the existence of an eigengap; whereas in the accurate regime, our goal is to estimate the top eigenvector (as in Phase II of Algorithm 1).

Following Garber & Hazan (2015); Garber et al. (2016), we provide convergence guarantees for inexact shift-and-invert power iterations in Lemma  7 (Appendix B), quantifying the sufficient accuracy in solving each least squares problem for the overall method to converge. Note that the iteration complexity for the crude regime is independent of eigengap, while in the accurate regime the rate in which the error decreases does depend on the eigengap between the first two eigenvalues of .

### 2.2 Bounding initial error for each least squares

For each least squares problem in inexact shift-and-invert power method, one can show that the optimal initialization based on the previous iterate is (Garber et al., 2016)

 winitt=wt−1w⊤t−1(λI−A)wt−1.

We can bound the suboptimality of this initialization, defined as . See full analysis in Appendix C.

###### Lemma 1.

Initialize each least squares problem in inexact shift-and-invert power method from . Then the initial suboptimality in can be bounded by the necessary final suboptimality as follows.

• In the crude regime,

 ϵinitt≤64β21ϵ2β2d((2β1/βd)T1−1(2β1/βd)−1)2⋅ϵt

where is the total number of iterations used (see precise definition in Lemma 7).

• In the accurate regime,

 ϵinitt≤max(G0,1)⋅16β21(β1−β2)2⋅ϵt

where is the initial condition for shift-and-invert power iterations (see precise definition in Lemma 7).

### 2.3 Iteration complexity of Algorithm 1

Based on the iteration complexity of inexact power method and the warm-start strategy, we derive the total number of iterations (least squares problems) needed by Algorithm 1.

###### Lemma 2 (Iteration complexity of the repeat-until loop in Algorithm 1).

Suppose that where . Set where , initialize each least squares problem as in Lemma 1, and maintain the ratio between initial and final error to be and throughout the repeat-until loop. Then for all it holds that

 12(λ(s−1)−ρ1)≤Δs≤λ(s−1)−ρ1.

Upon exiting this loop, the satisfies

 ρ1+~Δ4≤λ(f)≤ρ1+3~Δ2, (2)

and the number of iterations by repeat-until is .

###### Lemma 3 (Iteration complexity of the final for loop in Algorithm 1).

Suppose that where . Set where , initialize each least squares problem as in Lemma 1, and maintain the ratio between initial and final error to be throughout the final for loop. Then the output of Algorithm 1 satisfies

 w⊤sm1+m2p1≥1−ϵ.

As shown in the next section, we will be using linearly convergent solvers for the least squares problems, whose runtime depends on . Lemma 2 implies that we need to solve least squares problems in Phase I, each with . And Lemma 3 implies that we need to solve least squares problems in Phase II, each with .

## 3 Coordinate descent for least squares

Different from PCA, for general eigenvalue problems, the matrix may not have the structure of data covariance. As a result, fast stochastic gradient methods for finite-sums (such as SVRG Johnson & Zhang, 2013 used in Garber et al. 2016) does not apply. However, we can instead apply efficient coordinate descent (CD) methods for convex problems in our setting. In this section, we review the CD methods and study their complexity for solving the least squares problems created by Algorithm 1.

There is a rich literature on CD methods and it has attracted resurgent interests recently due to its simplicity and efficiency for big data problems; we refer the readers to Wright (2015) for a comprehensive survey. In each update of CD, we pick one coordinate and take a step along the direction of negative coordinate-wise gradient. To state the convergence properties of CD methods, we need the definitions of a few key quantities. Recall that we would like to solve the least squares problems of the form

 minxf(x)=12x⊤(λI−A)x−y⊤x.

with the optimal solution .

#### Smoothness

The gradient is coordinate-wise Lipschitz continuous: for all , , and , we have

 |∇if(x+αei)−∇if(x)|≤Li|α|% forLi:=λ−A[ii]

where is the -th standard basis. Note that for least squares problems, the coordinate-wise gradient Lipschitz constanst are the diagonal entries of its Hessian. Denote by and the largest and average Lipschitz constant over all coordinates respectively. These two Lipschitz constants are to be distinguished from the “global” smoothness constant , which satisfies

 ∥∇f(x)−∇f(y)∥≤~L⋅∥x−y∥,∀x,y.

Since is quadratic, . Observe that (Wright, 2015)

 1≤~L/Lmax≤d, (3)

with upper bound achieved by a Hessian matrix .

#### Strong-convexity

As mentioned before, the matrices in our least squares problems are always positive definite, with smallest eigenvalue . As a result, is -strongly convex with respect to the norm. Another strong convexity parameters we will need is defined with respect to the norm . It is shown by Nutini et al. (2015) that .

We collect relevant parameters for CD methods in Table 2.

### 3.1 Coordinate selection

The choice of coordinate to update is a central research topic for CD methods. We now discuss several coordinate selection rules, along with convergence rates, that are most relevant in our setting.

Gauss-Southwell-Lipschitz (GSL, Nutini et al., 2015): A greedy rule for selecting the coordinate where the coordinate with largest gradient magnitude (relative to the square root of Lipschitz constant) is chosen. The greedy rule tends to work well for high dimensional sparse problems (Dhillon et al., 2011).

Cyclic (Beck & Tetruashvili, 2013): Given an ordering of the coordinates, cyclic coordinate descent processes each coordinate exactly once according to the ordering in each pass. In the extension “essentially cyclic” rule, each coordinate is guaranteed to be chosen at least once in every updates. This rule is particularly useful when the data can not fit into memory, as we can iteratively load a fraction of and update the corresponding coordinates, and still enjoy convergence guarantee.

Accelerated randomized coordinate descent methods (ACDM)

: In each step, a coordinate is selected randomly, according to certain (possibly non-uniform) distribution based on the coordinate-wise Lipschitz constant. In accelerated versions of randomized CD

(Nesterov, 2012; Lee & Sidford, 2013; Lu & Xiao, 2015; Allen-Zhu et al., 2016), one can maintain auxiliary sequences of iterates to better approximate the objective function and obtain faster convergence rate, at the cost of (slightly) more involved algorithm. We implement the variant NU-ACDM by (Allen-Zhu et al., 2016), which samples each coordinate with probability proportional to and has the fastest convergence rate to date.

We provide the pseudocode of coordinate descent of the above rules in Algorithm 2.333The pseudo code for acclerated randomized CD is more involved and we refer the readers to (Allen-Zhu et al., 2016). Note that the stepsize for the chosen coordinate is , the inverse Lipschitz constant; for least squares problems where is quadratic in each dimension, this stepsize exactly minimizes the function over given the rest coordinates. In some sense, the GSL rule is the “optimal myopic coordinate update” (Nutini et al., 2015).

#### Connection to the greedy selection rule of Lei et al. (2016)

Given the current estimate , the coordinate-wise power method of Lei et al. (2016) selects the following coordinate to be updated

i.e., the coordinate that would be updated the most if a full power iteration were performed. Their selection rule is equivalent to choosing the largest element of , where is the current (lower) estimate of the eigenvalue . On the other hand, the greedy Gauss-Southwell rule for our method chooses the largest element of the gradient , where is an earlier estimate of the eigenvector and is an upper estimate of .

### 3.2 Convergence properties

We quote the convergence of CD from the literature.

###### Lemma 4 (Iteration complexity of CD methods).

The numbers of coordinate updates for Algorithm 2 to achieve where , using different coordinate selection rules, are

 GSL:O(1μL⋅logf(x0)−f∗ϵ′), Cyclic:O⎛⎝dLmax(1+d~L2/L2max)μ⋅logf(x0)−f∗ϵ′⎞⎠, NU-ACDM:O(∑di=1√Li√μ⋅logf(x0)−f∗ϵ′).

Lemma 4 implies that the coordinate descent methods converges linearly for our strongly convex objective: the suboptimality decreases at different geometric rates for each method. Most remarkable is the time complexity of ACDM which has dependence on : as we mentioned earlier, the strong convexity parameter for our least squares problems (in the accurate regime) are of the order , thus instantiating the shift-and-invert power method with ACDM leads to a total time complexity of which depends on . In comparison, the total time complexity of standard power method depends on ). Thus we expect our algorithm to be much faster when the eigengap is small.

It is also illuminating to compare ACDM with full gradient descent methods. The iteration complexity to achieve the same accuracy is for accelerated gradient descent (AGD, Nesterov, 2004). However, each full gradient calculation cost (not taking into account the sparsity) and thus the total time complexity for AGD is . In contrast, each update of ACDM costs , giving a total time complexity of . In view of (3), we have

 d∑di=1√Li√μ≤d2√Lmax√μ≤d2√~L√μ,

and thus ACDM is typically more efficient than AGD (and in the extreme case of all but one being , the rate of ACDM is times better). It is also observed empirically that ACDM outperforms AGD and the celebrated conjugate gradient method for solving least squares problems (Lee & Sidford, 2013; Allen-Zhu et al., 2016).

Although the convergence rates of GSL and cyclic rules are inferior than that of ACDM, we often observe competitive empirical performance from them. They are easier to implement and can be the method of choice in certain scenariors (e.g., cyclic coordinate descent may be employed when the data can not fit in memory). Finally, we remark that our general scheme and conclusion carries over to the case of block and parallel coordinate descent methods (Bradley et al., 2011; Richtarik & Takac, 2014), where more than one coordinate are updated in each step.

## 4 Putting it all together

Our proposed approach consists of instantiating Algorithm 1 with the different least squares optimizers (LSO) presented in Algorithm 2. Our analysis of the total time complexity thus combines the iteration complexity of shift-and-invert and the time complexity of the CD methods. First, by Lemma 2 the time complexity of Phase I is not dominant because it is independent of the final error . Second, by Lemma 3 we need to solve subproblem, each with constant ratio between initial and final suboptimality. Third, according to Lemma 4 and noting , the number of coordinate updates for solving each subproblem is for GSL, and for ACDM. Since each coordinate update costs time, we obtain the time complexity of our algorithms given in Table 1.

## 5 Extensions

Our algorithms can be easily extended to several other settings. Although our analysis have assumed that is positive semidefinite, given an estimate of the maximum eigenvalue, the shift-and-invert algorithm can always convert the problem into a series of convex least squares problems, regardless of whether is positive semidefinite, and so CD methods can be applied with convergence guarantee. To extract the tailing eigenvector of , one can equivalently extract the top eigenvector of . Futhermore, extracting the top singular vectors of a non-symmetric matrix is equivalent to extracting the top eigenvector of .

To extend our algorithms to extracting multiple top eigenvectors setting, we can use the “peeling” procedure: we iteratively extract one eigenvector at a time, removing the already extracted component from the data matrix before extracting the new direction. For example, to extract the second eigenvector , we can equivalently extract the top eigenvector of . And note that we do not need to explicitly compute and store which may be less sparse than ; all we need in CD methods are columns of which can be evaluated by vector operations, e.g., by storing . We refer the readers to Allen-Zhu & Li (2016) for a careful analysis of the accuracy in solving each direction and the runtime.

## 6 Experiments

We now present experimental results to demonstrate the efficiency of our algorithms, denoted as SI (shift-and-invert) + least squares solvers. Besides CD methods, we include AGD as a solver since SI+AGD gives the same time complexity as Lanczos. We compare our algorithms with the coordinate-wise power method (CPM) and SGCD of Lei et al. (2016), and also the standard power method.

### 6.1 Synthetic datasets

We first generate synthetic datasets to validate the fast convergence of our algorithms. Our test matrix has the form , where is a random rotation matrix and . The parameter (which is also the eigengap of ) controls the decay of the spectrum of . We use passes of coordinate updates (each pass has coordinate updates) for SI-Cyclic, SI-GSL and SI-ACDM, and accelerated gradient updates for SI-AGD, to approximately solve the least squares problems (1), and set in Algorithm 1.

We test several settings of dimension and , and the results are summarized in Figure 1. We observe that in most cases, CPM/SGCD indeed converge faster than the standard power method, justifying the intuition that some coodinates are more important than others. Furthermore, our algorithm significantly improve over CPM/SGCD for accurate estimate of the top eigenvector, especially when the eigengap is small, validating our convergence analysis which has improved dependence on the gap. Figure 1: Comparison of the convergence behavior (ρ1−w⊤tAwt vs. number of coordinate passes) of various algorithms for computing the leading eigenvector on synthetic datasets.

### 6.2 Real-world datasets

A major source of large-scale sparse eigenvalue problem in machine learning comes from graph-related applications (Shi & Malik, 2000; Ng et al., 2002; Fowlkes et al., 2004). We now examine the efficiency of the proposed algorithms on several large-scale networks datasets with up to a few million nodes, whose the statistics are summarized in Table 3 (see Appendix F). Let be the (binary) adjacency matrix of each network, our goal is to compute the top eigenvectors for the corresponding normalized graph Laplacian matrix (von Luxburg, 2007) defined as , where is a diagonal matrix containing the degrees of each node on the diagonal. This task can be extended to computing a low-rank approximation of the normalized graph Laplacian (Gittens & Mahoney, 2013)555We have also experimented with the original task of  Lei et al. (2016). For their task, our algorithm do not achieve significant improvement over CPM/SGCD because the gap of the unnormalized adjacency matrix is quite large. We discuss implementation details in Appendix F. Figure 2: Comparison of the runtime efficiency (ρ1−w⊤tAwt vs. run time in seconds) of various algorithms for computing the leading eigenvector on real-world network datasets.

The results are summarized in Figure 2. We observe that the proposed methods usually improves over CPM and SGCD, with up to x speedup in runtime over CPM/SGD depending on the dataset. Different from the simulation study, though SI-ACDM have faster convergence in theory, it is often slightly slower than SI-GSL due to more maintainance for each update (it needs to maintain two vector sequences instead of one for plain CD methods).

## References

• Allen-Zhu & Li (2016) Allen-Zhu, Zeyuan and Li, Yuanzhi. LazySVD: Even faster SVD decomposition yet without agonizing pain. In NIPS, 2016.
• Allen-Zhu et al. (2016) Allen-Zhu, Zeyuan, Qu, Zheng, Richtarik, Peter, and Yuan, Yang. Even faster accelerated coordinate descent using non-uniform sampling. In ICML, 2016.
• Beck & Tetruashvili (2013) Beck, Amir and Tetruashvili, Luba. On the convergence of block coordinate descent type methods. SIAM J. Imaging Sciences, 23(4):2037–2060, 2013.
• Bradley et al. (2011) Bradley, Joseph, Kyrola, Aapo, Bickson, Daniel, and Guestrin, Carlos. Parallel coordinate descent for l1-regularized loss minimization. In ICML, 2011.
• Dhillon et al. (2011) Dhillon, Inderjit, Ravikumar, Pradeep, and Tewari, Ambuj. Nearest neighbor based greedy coordinate descent. In NIPS, 2011.
• Fowlkes et al. (2004) Fowlkes, Charless, Belongie, Serge, Chung, Fan, and Malik, Jitendra. Spectral grouping using the Nyström method. IEEE Trans. Pattern Analysis and Machine Intelligence, 26(2):214–225, 2004.
• Garber & Hazan (2015) Garber, Dan and Hazan, Elad. Fast and simple PCA via convex optimization. arXiv:1509.05647 [math.OC], 2015.
• Garber et al. (2016) Garber, Dan, Hazan, Elad, Jin, Chi, Kakade, Sham M., Musco, Cameron, Netrapalli, Praneeth, and Sidford, Aaron. Faster eigenvector computation via shift-and-invert preconditioning. In ICML, 2016.
• Ge et al. (2016) Ge, Rong, Jin, Chi, Kakade, Sham M., Netrapalli, Praneeth, and Sidford, Aaron. Efficient algorithms for large-scale generalized eigenvector computation and canonical correlation analysis. In ICML, 2016.
• Gittens & Mahoney (2013) Gittens, Alex and Mahoney, Michael. Revisiting the Nystrom method for improved large-scale machine learning. In ICML, 2013.
• Golub & van Loan (1996) Golub, Gene H. and van Loan, Charles F. Matrix Computations. third edition, 1996.
• Hardt & Price (2014) Hardt, Moritz and Price, Eric. The noisy power method: A meta algorithm with applications. In NIPS, 2014.
• Johnson & Zhang (2013) Johnson, Rie and Zhang, Tong.

Accelerating stochastic gradient descent using predictive variance reduction.

In NIPS, 2013.
• Lee & Sidford (2013) Lee, Yin Tat and Sidford, Aaron. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In Proc. of the 54th Annual Symposium on Foundations of Computer Science (FOCS 2013), 2013.
• Lei et al. (2016) Lei, Qi, Zhong, Kai, and Dhillon, Inderjit S. Coordinate-wise power method. In NIPS, 2016.
• Lu & Xiao (2015) Lu, Zhaosong and Xiao, Lin. On the complexity analysis of randomized block-coordinate descent methods. Math. Prog., 152(1):615–642, 2015.
• Nesterov (2004) Nesterov, Y. Introductory Lectures on Convex Optimization. A Basic Course. Number 87 in Applied Optimization. 2004.
• Nesterov (2012) Nesterov, Yu. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
• Ng et al. (2002) Ng, A. Y., Jordan, M. I., and Weiss, Y.

On spectral clustering: Analysis and an algorithm.

In NIPS, 2002.
• Nutini et al. (2015) Nutini, Julie, Schmidt, Mark, Laradji, Issam H., Friedlander, Michael, and Koepke, Hoyt. Coordinate descent converges faster with the gauss-southwell rule than random selection. In ICML, 2015.
• Richtarik & Takac (2014) Richtarik, Peter and Takac, Martin. Parallel coordinate descent methods for big data optimization. Math. Prog., pp. 1–52, 2014.
• Shamir (2015) Shamir, Ohad. A stochastic PCA and SVD algorithm with an exponential convergence rate. In ICML, 2015.
• Shi & Malik (2000) Shi, Jianbo and Malik, Jitendra. Normalized cuts and image segmentation. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
• von Luxburg (2007) von Luxburg, Ulrike. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, December 2007.
• Wang et al. (2016) Wang, Weiran, Wang, Jialei, Garber, Dan, and Srebro, Nathan. Globally convergent stochastic optimization for canonical correlation analysis. In NIPS, 2016.
• Wright (2015) Wright, Stephen J. Coordinate descent algorithms. Math. Prog., 2015.

## Appendix A Auxiliary Lemmas

###### Lemma 5.

Consider a positive semidefinite matrix with eigenvalues and corresponding eigenvectors . For a unit vector satisfying where , we have

 v⊤Av≥ρ1(1−2ϵ).
###### Proof.

By direct calculation, we obtain

 v⊤Av =v⊤(d∑i=1ρipip⊤i)v =d∑i=1ρi(v⊤pi)2 ≥ρ1(v⊤p1)2 ≥ρ1(1−ϵ)2 ≥ρ1(1−2ϵ).

###### Lemma 6.

For two nonzero vectors , we have

###### Proof.

By direct calculation, we obtain

 ∥∥∥a∥a∥−b∥b∥∥∥∥ =2∥a−b∥∥a∥

where we have used the triangle inequality in the second inequality. ∎

## Appendix B Analysis of inexact power method in different regimes

###### Lemma 7 (Iteration complexity of inexact power method).

Consider the following inexact shift-and-invert power iterations: for

 ~wt≈argminwft(w)=12w⊤(λI−A)w−w⊤t−1w, wt←~wt∥~wt∥,

where .

• (Crude regime) Define . Assume that for all , the error in minimizing satisfies

 ϵt≤ϵ2β2d128β1((2β1/βd)−1(2β1/βd)T1−1)2.

Then we have .

• (Accurate regime) Define . Assume that for all , the error in minimizing satisfies

 ϵt≤min(d∑i=2ξ2(t−1)i/βi, ξ2(t−1)1/β1)⋅(β1−β2)232.

Let and . Then we have for all .

###### Proof.

In the following, we use the shorthand . Observe that

 p⊤iMλpi =βi,fori=1,…,d, p⊤iMλpj =0,fori≠j.

Let the exact solution to the least squares problem be . If we obtain an approximate solution such that , it follows from the quadratic objective that

 ϵt =12(~wt−~w∗t)⊤M−1λ(wt−w∗t)=12∥~wt−~w∗t∥2M−1λ. (4)

Furthermore, we have for the exact solution that

 ~w∗t=Mλwt−1=Mλd∑i=1ξ(t−1)ipi=d∑i=1βiξ(t−1)ipi. (5)

#### Crude regime

For the crude regime, we denote by the number of eigenvalues of that are greater than or equal to .

We will relate the iterates of inexact power method to those of the exact power method

 ~vt←Mλvt−1,vt←~vt∥~vt∥,t=1,…

from the same initialization .

On the one hand, we can show that exact power iterations converges quickly for eigenvalue estimation. Since and , we have

 d∑i=n+1(v⊤tpi)2 =∑di=n+1β2tiξ20i∑di=1β2tiξ20i≤β2tn+1∑di=n+1ξ20iβ2t1ξ201 ≤1ξ201(1−ϵ4)2t≤e−ϵt/2ξ201.

This indicates that after iterations, we have for all . And as a result, for all , it holds that

 v⊤tMλvt = v⊤t(d∑i=1βipip⊤i)vt=d∑i=1βi(v⊤tpi)2 ≥ βnn∑i=1(v⊤tpi)2≥(1−ϵ4)β1⋅(1−d∑i=n+1(v⊤tpi)2) ≥ (1−