 # Efficient Algorithms for Large-scale Generalized Eigenvector Computation and Canonical Correlation Analysis

This paper considers the problem of canonical-correlation analysis (CCA) (Hotelling, 1936) and, more broadly, the generalized eigenvector problem for a pair of symmetric matrices. These are two fundamental problems in data analysis and scientific computing with numerous applications in machine learning and statistics (Shi and Malik, 2000; Hardoon et al., 2004; Witten et al., 2009). We provide simple iterative algorithms, with improved runtimes, for solving these problems that are globally linearly convergent with moderate dependencies on the condition numbers and eigenvalue gaps of the matrices involved. We obtain our results by reducing CCA to the top-k generalized eigenvector problem. We solve this problem through a general framework that simply requires black box access to an approximate linear system solver. Instantiating this framework with accelerated gradient descent we obtain a running time of O(z k √(κ)/ρ(1/ϵ) (kκ/ρ)) where z is the total number of nonzero entries, κ is the condition number and ρ is the relative eigenvalue gap of the appropriate matrices. Our algorithm is linear in the input size and the number of components k up to a (k) factor. This is essential for handling large-scale matrices that appear in practice. To the best of our knowledge this is the first such algorithm with global linear convergence. We hope that our results prompt further research and ultimately improve the practical running time for performing these important data analysis procedures on large data sets.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Canonical-correlation analysis (CCA) and the generalized eigenvector problem are fundamental problems in scientific computing, data analysis, and statistics (Barnett and Preisendorfer, 1987; Friman et al., 2001).

These problems arise naturally in statistical settings. Let denote two large sets of data points, with empirical covariance matrices , , and and suppose we wish to find features that best encapsulate the similarity or dissimilarity of the data sets. CCA is the problem of maximizing the empirical correlation

 maxx⊤Sxxx=1 and y⊤Syyy=1x⊤Sxyy (1)

and thereby extracts common features of the data sets. On the other hand the generalized eigenvalue problems

 maxx≠0x⊤Sxxxx⊤Syyx and maxy≠0y⊤Syyyy⊤Sxxy

compute features that maximizes discrepancies between the data sets. Both these problems are easily extended to the -feature case (See Section 3). Algorithms for solving them are commonly used to extract features to compare and contrast large data sets and are used commonly in regression (Kakade and Foster, 2007), clustering (Chaudhuri et al., 2009), classification (Karampatziakis and Mineiro, 2013), word embeddings (Dhillon et al., 2011) and more.

Despite the prevalence of these problems and the breadth of research on solving them in practice ((Barnett and Preisendorfer, 1987; Barnston and Ropelewski, 1992; Sherry and Henson, 2005; Karampatziakis and Mineiro, 2013) to name a few), there are relatively few results on obtaining provably efficient algorithms. Both problems can be reduced to performing principle component analysis (PCA), albeit on complicated matrices e.g for CCA and for generalized eigenvector. However applying PCA to these matrices traditionally involves the formation of and

which is prohibitive for sufficiently large datasets if we only want to estimate top-

eigenspace.

A natural open question in this area is to what degree can the formation of and can be bypassed to obtain efficient scalable algorithms in the case where the number of features is much smaller than the dimensions of the problem and . Can we develop simple iterative practical methods that solve this problem in close to linear time when is small and the condition number and eigenvalue gaps are bounded? While there has been recent work on solving these problems using iterative methods (Avron et al., 2014; Paul, 2015; Lu and Foster, 2014; Ma et al., 2015) we are unaware of previous provable global convergence results and more strongly, linearly convergent scalable algorithms.

The central goal of this paper is to answer this question in the affirmative. We present simple globally linearly convergent iterative methods that solve these problems. The running time of these problems scale well as the number of features and conditioning of the problem stay fixed and the size of the datasets grow. Moreover, we implement the method and perform experiments demonstrating that the techniques may be effective for large scale problems.

Specializing our results to the single feature case we show how to solve the problems all in time , where is the maximum of condition numbers of and and is the eigengap of appropriate matrices and mentioned above, and is the number of nonzero entries in and . To the best of our knowledge this is the first such globally linear convergent algorithm for solving these problems.

We achieve our results through a general and versatile framework that allows us to utilize fast linear system solvers in various regimes. We hope that by initiating this theoretical and practical analysis of CCA and the generalized eigenvector problem we can promote further research on the problem and ultimately advance the state-of-the-art for efficient data analysis.

### 1.1 Our Approach

To solve the problems motivated in the previous section we first directly reduce CCA to a generalized eigenvector problem (See Section 5). Consequently, for the majority of the paper we focus on the following:

###### Definition 1 (Top-k Generalized Eigenvector666We use the term generalized eigenvector to refer to a non-zero vector v such that Av=λBv for symmetric A and B, not the general notion of eigenvectors for asymmetric matrices.).

Given symmetric matrices where is positive definite compute defined for all by

 wi∈argmaxw∣∣w⊤Aw∣∣ s.t.~{}w⊤Bw=1 and w⊤Bwj=0∀j∈[i−1].

The generalized eigenvector is equivalent to the problem of computing the PCA of in the norm. Consequently, it is the same as computing the top eigenvectors of largest absolute value of the symmetric matrix and then multiplying by .

Unfortunately, as we have discussed, explicitly computing is prohibitively expensive when is large and therefore we wish to avoid forming explicitly. One natural approach is to develop an iterative methods to approximately apply

to a vector and then use that method as a subroutine to perform the power method on

. Even if we could perform the error analysis to make this work, such an approach would likely require at least a suboptimal iterations to achieve error .

To bypass these difficulties, we take a closer look at the power method. For some initial vector , let be the result of iterations of power method on followed by multiplying . Clearly . Furthermore, since we typically initialize the power method by a random vector and since is positive definite, if we instead we computed for random we would likely converge at the same rate as the power method at the cost of just a slightly worse initialization quality.

Consequently, we can compute our desired eigenvectors by simply alternating between applying and to a random initial vector. Unfortunately, computing exactly is again outside our computational budget. At best we should only attempt to apply approximately by linear system solvers.

One of our main technical contributions is to argue about the effect of inexact solvers in this method. Whereas solving every linear system to target accuracy would again require time per linear system, which leads to a sub-optimal overall running time, i.e. sublinear convergence, we instead show how to warm start the linear system solvers and obtain a faster rate. We exploit the fact that as we perform many iterations of power methods, points at time converge to eigenvectors and therefore we can initialize our linear system solver at time carefully using our points at time . Ultimately we show that we only need to make fixed multiplicative progress in solving the linear system in every iteration of the power method, thus the runtime for solving each linear system is independent of .

Putting these pieces together with careful error analysis yields our main result. Our algorithm only requires the ability to apply to a vector and an approximate linear system solver for , which in turn can be obtained by just applying to vectors. Consequently, our framework is versatile, scalable, and easily adaptable to take advantage of faster linear system solvers.

### 1.2 Previous Work

While there has been limited previous work on provably solving CCA and generalized eigenvectors, we note that there is an impressive body of literature on performing PCA(Rokhlin et al., 2009; Halko et al., 2011; Musco and Musco, 2015; Garber and Hazan, 2015; Jin et al., 2015) and solving positive semidefinite linear systems(Hestenes and Stiefel, 1952; Nesterov, 1983; Spielman and Teng, 2004). Our analysis in this paper draws on this work extensively and our results should be viewed as the principled application of them to the generalized eigenvector problem.

There has been much recent interest in designing scalable algorithms for CCA(Ma et al., 2015; Wang et al., 2015; Wang and Livescu, 2015; Michaeli et al., 2015)

. To our knowledge, there are no provable guarantees for approximate methods for this problem. Heuristic-based approachs

(Witten et al., 2009; Lu and Foster, 2014) compute efficiently, but only give suboptimal result due to coarse approximation. The work in (Ma et al., 2015) provides one natural iterative procedure, where the per iterate computational complexity is low. This work only provides local convergence guarantees and does not provide guarantees of global convergence.

Also of note is that many recent algorithms (Ma et al., 2015; Wang et al., 2015) have mini-batch variations, but there’s no guarantees for mini-batch style algorithm for CCA yet. Our algorithm can also be easily extends to a mini-batch version. While we do not explicitly analyze this variation, and we believe our analysis and techniques are helpful for extensions to this setting. We also view this as an important direction for future work.

We hope that by establishing the generalized eigenvector problem and providing provable guarantees under moderate regularity assumptions that our results may be further improved and ultimately this may advance the state-of-the-art in practical algorithms for performing data analysis.

### 1.3 Our Results

Our main result in this paper is a linearly convergent algorithm for computing the top generalized eigenvectors (see Definition 1). In order to be able to state our results we introduce some notation. Let be the eigenvalues of (their existence is guaranteed by Lemma 9 in the appendix). The eigengap and . Let denote the number of nonzero entries in and .

###### Theorem 2 (Informal version of Theorem 6).

Given two matrices and , there is an algorithm that computes the top- generalized eigenvectors up to an error in time , where is the condition number of and hides logarithmic terms in , , and , and nothing else.

Here is a comparison of our result with previous work.

Turning to the problem of CCA, cf. (1), the relevant parameters are i.e., the maximum of the condition numbers of and , where are the eigenvalues of in decreasing absolute value. Let denote the number of nonzeros in and . Our main results are a reduction from CCA to the generalized eigenvector problem.

###### Theorem 3 (Informal version of Theorem 7).

Given two data matrices and , there is an algorithm that performs top- CCA up to an error in time , where and hides logarithmic terms in , , and , and nothing else.

Table 2 compares our result with existing results. 777(Ma et al., 2015) only shows local convergence for S-AppGrad. Starting within this radius of convergence requires us to already solve the problem to a high accuracy.

We should note that the actual bounds we obtain are somewhat stronger than the above informal bounds. Some of the terms in logarithm also appear only as additive terms. Finally we also give natural stochastic extensions of our algorithms where the cost of each iteration may be much smaller than the input size. The key idea behind our approach is to use an approximate linear system solver as a black box inside power method on an appropriate matrix. We show that this dependence on a linear system solver is in some sense essential. In Section 6 we show that the generalized eigenvector problem is strictly more general than the problem of solving positive semidefinite linear systems and consequently our dependence on the condition number of is in some cases optimal.

Subsequent to the submission of this paper, we learned of the closely related work in (Wang et al., 2016), which presents a number of additional interesting results. We think it is worthwhile to point out that our algorithm only requires black box access to any linear solver. Although the result in Theorem 3 was stated by instantiating the linear system solver by accelerated gradient descent (AGD), it is immediate to apply Theorem 7

and give the corresponding rates if we instantiate it by other popular algorithms, including gradient descent (GD), stochastic variance reduction (SVRG)

(Johnson and Zhang, 2013), and its accelerated version (ASVRG) (Frostig et al., 2015; Lin et al., 2015). We summarize the corresponding runtime in Table 3. There , and , are -th column of matrix and . Note by definition we always have . For generalized eigenvector problem, results of similar flavor as in Table 3 can also be easily derived.

Finally, we also run experiments to demonstrate the practical effectiveness of our algorithm on both small and large scale datasets.

### 1.4 Paper Overview

In Section 2, we present our notation. In Section 3, we formally define the problems we solve and their relevant parameters. In Section 4, we present our results for the generalized eigenvector problem. In Section 5, we present our results for the CCA problem. In Section 6 we argue that generalized eigenvector computation is as hard as linear system solving and that our dependence on is near optimal. In Section 7, we present experimental results of our algorithms on some real world data sets. Due to space limitations, proofs are deferred to the appendix.

## 2 Notation

We use bold capital letters to denote matrices and bold lowercase letters for vectors. For symmetric positive semidefinite (PSD) matrix , we let denote the -norm of and we let denotes the inner product of and in the -norm. We say that a matrix is -orthonormal if . We let denotes the

largest singular value of

, and denote the smallest and largest singular values of respectively. Similarly we let refers to the largest eigenvalue of in magnitude. We let denotes the number of nonzeros in . We also let denote the condition number of (i.e., the ratio of the largest to smallest eigenvalue).

## 3 Problem Statement

In this section, we recall the generalized eigenvalue problem, define our error metric, and introduce all relevant parameters. Recall that the generalized eigenvalue problem is to find vectors , such that

Using stationarity conditions, it can be shown that the vectors are given by , where is an eigenvector of with eigenvalue such that . Our goal is to recover the top-k eigen space i.e., . In order to quantify the error in estimating the eigenspace, we use largest principal angle, which is a standard notion of distance between subspaces (Golub and Van Loan, 2012).

###### Definition 4 (Largest principal angle).

Let and be two dimensional subspaces, and their -orthonormal basis respectively. The largest principal angle in the -norm is defined to be

 θ(W,V)def=arccos(σmin(V⊤BW)),

Intuitively, the largest principal angle corresponds to the largest angle between any vector in the span of and its projection onto the span of . In the special case where , the above definition reduces to our choice in the top- setting. Given two matrices and , we use to denote the largest principle angle between the subspaces spanned by the columns of and . We say that achieves an error of if and , where is the matrix whose columns are . The relevant parameters for us are the eigengap, i.e. the relative difference between and eigenvalues, , and , the condition number of .

## 4 Our Results

In this section, we provide our algorithms and results for solving the generalized eigenvector problem. We present our results for the special case of computing the top generalized eigenvector (Section 4.1) followed by the general case of computing the top- generalized eigenvectors (Section 4.2). However, first we formally define a linear system solver as follows:

Linear system solver: In each of our main results (Theorems 5 and 6) we assume black box access to an approximate linear system solver. Given a PSD matrix , a vector , an initial estimate , and an error parameter , we require to decrease the error by a multiplicative , i.e. output with . We let denote the time needed for this operation. Since the error metric is equivalent to function error on minimizing the convex quadratic up to constant scaling, an approximate linear system solver is equivalent to an optimization algorithm for . We also specialize our results using Nesterov’s accelerated gradient descent to state our bounds. Stating our results using linear system solver as a blackbox allows the user to choose an efficient solver depending on the structure of and helps pass any improvements in linear system solvers on to the problem of generalized eigenvectors.

### 4.1 Top-1 Setting

Our algorithm for computing the top generalized eigenvector, GenELin is given in Algorithm 1.

The algorithm implements an approximate power method where each iteration consists of approximately multiplying a vector by . In order to do this, GenELin solves a linear system in and then scales the resulting vector to have unit -norm. Our main result states that given an oracle for solving the linear systems,101010For example, we could use Nesterov’s accelerated gradient descent, Algorithm 4 the number of iterations taken by Algorithm 1 to compute the top eigenvector up to an accuracy of is at most where .

###### Theorem 5.

Recall that the linear system solver takes time to reduce the error by a factor . Given matrices and , GenELin (Algorithm 1) computes a vector achieving an error of in iterations, where . The running time of the algorithm is at most

 O(1ρ(log1cosθ0⋅T(ρ2cos2θ016)+log1ϵ⋅T(ρ216))+1ρ(nnz(A)+nnz(B)+d)log1ϵcosθ0).

Furthermore, if we use Nesterov’s accelerated gradient descent (Algorithm 4) to solve the linear systems in Algorithm 1, the time can be bounded as

 O(nnz(B)√κ(B)ρ(log1cosθ0log1ρcosθ0+log1ϵlog1ρ)+1ρnnz(A)log1ϵcosθ0).

Remarks:

• Since GenELin chooses randomly, Lemma 13 tells us that

with probability greater than

.

• Note that GenELin exploits the sparsity of input matrices since we only need to apply them as operators.

• Depending on computational restrictions, we can also use a subset of samples in each iteration of GenELin. In some large scale learning applications using minibatches of data in each iteration helps make the method scalable while still maintaining the quality of performance.

### 4.2 Top-k Setting

In this section, we give an extension of our algorithm and result for computing the top- generalized eigenvectors. Our algorithm, GenELinK is formally given as Algorithm 2.

GenELinK is a natural generalization of GenELin from the previous section. Given an initial set of vectors , the algorithm proceeds by doing approximate orthogonal iteration. Each iteration involves solving independent linear systems111111Similarly, as before, we could use Nesterov’s accelerated gradient descent, i.e. Algorithm 4. and orthonormalizing the iterates. The following theorem is the main result of our paper which gives runtime bounds for Algorithm 2. As before, we assume access to a blackbox linear system solver and also give a result instantiating the theorem with Nesterov’s accelerated gradient descent algorithm.

###### Theorem 6.

Suppose the linear system solver takes time to reduce the error by a factor . Given input matrices and , GenELinK computes a matrix which is an estimate of the top generalized eigenvectors with an error of i.e., and , where in iterations where . The run time of this algorithm is at most

 O(1ρ(log1cosθ0⋅T(ρ2cos4θ064kγ2)+T(ρ264kγ2)log1ϵ)+1ρ(nnz(A)k+nnz(B)k+dk2)log1ϵcosθ0),

where , being the top- eigenvalues of . Furthermore, if we use Nesterov’s accelerated gradient descent (Algorithm 4) to solve the linear systems in Algorithm 2, the time above can be bounded as

 O(nnz(B)k√κ(B)ρ(log1cosθ0logkγρcosθ0+log1ϵlogkγρ)+(nnz(A)k+dk2)ρlog1ϵcosθ0).

Remarks:

• Lemma 13 again tells us that since is chosen to be normalized after choosing uniformly at random from the unit sphere, with probability greater than .

• This result recovers Theorem 5 as a special case, since when , we also have .

## 5 Application to CCA

We now outline how the CCA problem can be reduced to computing generalized eigenvectors. The CCA problem is as follows. Given two sets of data points and , let , , and . We wish to find vectors and which are defined recursively as

 (ϕi,ψi)∈argmaxϕ,ψϕ⊤Sxyψ s.t.~{}∥ϕ∥Sx=1 and ϕ⊤Sxϕj=0∀j≤i−1∥ψ∥Sy=1 and ψ⊤Syψj=0∀j≤i−1.

where the values of are called canonical correlations between and .

For reduction, we know any stationary point of this optimization problem satisfies , and , where and are two constants. Combined with the constraints, we also see that . This can be written in matrix form as . Suppose the generalized eigenvalues of the above matrices are . The top -dimensional eigen-space of this generalized eigenvalue problem corresponds to the linear subspace spanned by the eigenvectors of and , which are . Once we solve the top- generalized eigenvector problem for the matrices and , we can pick any orthonormal basis that spans the output subspace and choose a random -dimensional projection of those vectors. The formal algorithm is given in Algorithm 3. Combining this with our results for computing generalized eigenvectors, we obtain the following result.

###### Theorem 7.

Suppose the linear system solver takes time to reduce the error by a factor . Given inputs and , with probability greater than , then there is some universal constant , so that Algorithm 3 outputs and such that , and , in time

 O(1ρ(logdκζ⋅T(cζ6ρ2d2k5κ2γ2)+T(cζ2ρ2k3γ2)log1ϵ)+1ρ(nnz(X,Y)k+dk2)logdκζϵ),

where and and . If we use Nesterov’s accelerated gradient descent (Algorithm 4) to solve the linear systems in GenELink, then the total runtime is

 O(nnz(X,Y)k√κρ(logdκζlogdκγζρ+log1ϵlogkγρ)+dk2ρlogdκζϵ),

Remarks:

• Note that we depend on the maximum of the condition numbers of and since the linear systems that arise in GenELinK decompose into two separate linear systems, one in and the other in .

• We can also exploit sparsity in the data matrices and since we only need to apply or only as operators, which can be done by applying and in appropriate order. Exploiting sparsity is crucial for any large scale algorithm since there are many data sets (e.g., URL dataset in our experiments) where dense operations are impractical.

## 6 Reduction to Linear System

Here we show that solving linear systems in is inherent in solving the top- generalized eigenvector problem in the worst case and we provide evidence a factor in the running time is essential for a broad class of iterative methods for the problem.

Let be a symmetric positive definite matrix and suppose we wish to solve the linear system , i.e. compute with . If we set and then

and consequently computing the top-1 generalized eigenvector yields the solution to the linear system. Therefore, the problem of computing top- generalized eigenvectors is in general harder than the problem of solving symmetric positive definite linear systems.

Moreover, it is well known that any method which starts at and iteratively applies to linear combinations of the points computed so far must apply at least in order to halve the error in the standard norm for the problem (Shewchuk, 1994). Consequently, methods that solve the top- generalized eigenvector problem by simply applying and , which is the same as applying and taking linear combinations with , must apply at least times to achieve small error, unless they exploit more structure of or the initialization.

## 7 Simulations

In this section, we present our experiment results performing CCA on three benchmark datasets which are summarized in Table 4. We wish to demonstrate two things via these simulations: 1) the behavior of CCALin verifies our theoretical result on relatively small-scale dataset, and 2) scalability of CCALin comparing it with other existing algorithms on a large-scale dataset.

Let us now specify the error metrics we use in our experiments. The first ones are the principal angles between the estimated subspaces and the true ones. Let and be the estimated subspaces and , be the true canonical subspaces. We will use principle angles under -norm, under -norm and 131313See Algorithm 3 for definition of , under the norm. Unfortunately, we cannot compute these error metrics for large-scale datasets since they require knowledge of the true canonical components. Instead we will use Total Correlations Captured (TCC), which is another metric widely used by practitioners, defined to be the sum of canonical correlation between two matrices. Also, Proportion of Correlations Captured is given as

 PCC=TCC(XWx,YWy)/%TCC(XVx,YVy)

For a fair comparison with other algorithms (which usually call highly optimized matrix inversion subroutines), we use number of FLOPs instead of wall clock time to measure the performance.

### 7.1 Small-scale Datasets

MNIST dataset(LeCun et al., 1998) consists of 60,000 handwritten digits from 0 to 9. Each digit is a image represented by real values in [0,1]. Here CCA is performed between left half images and right half images. The data matrix is dense but the dimension is fairly small.

Penn Tree Bank (PTB) dataset comes from full Wall Street Journal Part of Penn Tree Bank which consists of 1.17 million tokens and a vocabulary size of 43k(Marcus et al., 1993), which has already been used to successfully learn the word embedding by CCA(Dhillon et al., 2011). Here, the task is to learn correlated components between two consecutive words. We only use the top 10,000 most frequent words. Each row of data matrix is an indicator vector and hence it is very sparse and is diagonal.

Since the input matrices are very ill conditioned, we add some regularization and replace by (and similarly with ). In CCALin, we run GenELinK with and accelerated gradient descent (Algorithm 4 in the supplementary material) to solve the linear systems. The results are presented in Figure 1 and Figure 2. Figure 1: Global convergence of PCC and principle angles on MNIST and PTB Datasets

Figure 1 shows a typical run of CCALin from random initialization on both MNIST and PTB dataset. We see although may be even 90 degree at some point respectively, is always monotonically decreasing (as monotonically increasing) as predicted by our theory. In the end, as goes to zero, it will push both and go to zero, and PCC go to 1. This demonstrates that our algorithm indeed converges to the true canonical space. Figure 2: Linear convergence of principle angles on MNIST and PTB Datasets

Furthermore, by a more detailed examination of experimental data in Figure 1, we observe in Figure 2 that is indeed linearly convergent as we predicted in the theory. In the meantime, and may initially converge a bit slower than , but in the end they will be upper bounded by times a constant factor, thus will eventually converge at a linear rate at least as fast as .

### 7.2 Large-scale Dataset

URL Reputation dataset contains 2.4 million URLs and 3.2 million features including both host-based features and lexical based features. Each feature is either real valued or binary. For experiments in this section, we follow the setting of (Ma et al., 2015). We use the first 2 million samples, and run CCA between a subset of host based features and a subset of lexical based features to extract the top components. Although the data matrix is relatively sparse, unlike PTB, it has strong correlations among different coordinates, which makes much denser ().

Classical algorithms are impractical for this dataset on a typical computer, either running out of memory or requiring prohibitive amount of time. Since we cannot estimate the principal angles, we will evaluate TCC performance of CCALin.

We compare our algorithm to S-AppGrad (Ma et al., 2015) which is an iterative algorithm and PCA-CCA (Ma et al., 2015), NW-CCA (Witten et al., 2009) and DW-CCA (Lu and Foster, 2014) which are one-shot estimation procedures.

In CCALin, we employ GenELinK using stochastic accelerated gradient descent for solving linear systems using minibatches in each of the gradient steps and also leverage sparsity of the data to deal with the large data size. The result is shown in Figure 3. It is clear from the plot that our algorithm takes fewer computations than the other algorithms to achieve the same accuracy.

## 8 Conclusion

In summary, we have provided the first provable globally linearly convergent algorithms for solving canonical correlation analysis and the generalized eigenvector problems. We have shown that for recovering the top

components our algorithms are much faster than traditional methods based on fast matrix multiplication and singular value decomposition when

and the condition numbers and eigenvalue gaps of the matrices involved are moderate. Moreover, we have provided empirical evidence that our algorithms may be useful in practice. We hope these results serve as the basis for further improvements in performing large scale data analysis both in theory and in practice.

## Appendix A Solving Linear System via Accelerated Gradient Descent

Since we use accelerated gradient descent in our main theorems, for completeness, we put the algorithm and cite its result about iteration complexity here without proof.

###### Theorem 8 ((Nesterov, 1983)).

Let be -strongly convex and -smooth, then accelerated gradient descent with learning rate and satisfies:

 f(xt)−f(x⋆)≤2(f(x0)−f(x⋆))exp(−t√Q) (2)

## Appendix B Proofs of Main Theorem

In this section we will prove Theorems 5,  6 and  7.

### b.1 Rank-1 Setting

We first prove our claim that has an eigenbasis.

###### Lemma 9.

Let be the eigenpairs of the symmetric matrix . Then is an eigenvector of with eigenvalue .

###### Proof.

The proof is straightforward.

 B−1A(B−1/2ui) =B−1/2(B−1/2AB−1/2ui)=σiB−1/2ui.

Denote the eigenpairs of by , the above lemma further tells us that .

Recall that we defined the angle between and in the -norm: .

To measure the distance from optimality, we use the following potential function for normalized vector ():

 tanθ(w,v1)=√1−|v⊤1Bw|2|v⊤1Bw|. (3)
###### Lemma 10.

Consider any such that and . Then, we have:

 cos2θ(w,v1)=(v⊤1Bw)2≥1−ϵ2 and w⊤Aw≥λ1(1−ϵ2).
###### Proof.

Clearly,

 (v⊤1Bw)2=cos2θ(w,v1)=11+tan2θ(w,v1)≥11+ϵ2≥1−ϵ2,

proving the first part. For the second part, we have the following:

 w⊤Aw =∑i,j(v⊤iBw)(v⊤jBw)v⊤iAvj=∑i,jλj(v⊤iBw)(v⊤jBw)v⊤iBvj =∑iλi(v⊤iBw)2≥λ1(v⊤1Bw)2≥(1−ϵ2)λ1,

proving the lemma. ∎

###### Proof of Theorem 5.

We will show that the potential function decreases geometrically with . This will directly provides an upper bound for . For simplicity, through out the proof we will simply denote as .

Recall the updates in Algorithm 1, suppose at time , we have such that . Let us say

 wt+1=1Z(B−1Awt+ξ) (4)

where is some normalization factor, and is the error in solving the least squares. We will first prove the geometric convergence claim assuming

 ∥ξ∥B≤|λ1|−|λ2|4min{cosθt,sinθt}, (5)

and then bound the time taken by black-box linear system solver to provide such an accuracy. Since can be written as , we know . Since