1 Introduction
Canonicalcorrelation analysis (CCA) and the generalized eigenvector problem are fundamental problems in scientific computing, data analysis, and statistics (Barnett and Preisendorfer, 1987; Friman et al., 2001).
These problems arise naturally in statistical settings. Let denote two large sets of data points, with empirical covariance matrices , , and and suppose we wish to find features that best encapsulate the similarity or dissimilarity of the data sets. CCA is the problem of maximizing the empirical correlation
(1) 
and thereby extracts common features of the data sets. On the other hand the generalized eigenvalue problems
compute features that maximizes discrepancies between the data sets. Both these problems are easily extended to the feature case (See Section 3). Algorithms for solving them are commonly used to extract features to compare and contrast large data sets and are used commonly in regression (Kakade and Foster, 2007), clustering (Chaudhuri et al., 2009), classification (Karampatziakis and Mineiro, 2013), word embeddings (Dhillon et al., 2011) and more.
Despite the prevalence of these problems and the breadth of research on solving them in practice ((Barnett and Preisendorfer, 1987; Barnston and Ropelewski, 1992; Sherry and Henson, 2005; Karampatziakis and Mineiro, 2013) to name a few), there are relatively few results on obtaining provably efficient algorithms. Both problems can be reduced to performing principle component analysis (PCA), albeit on complicated matrices e.g for CCA and for generalized eigenvector. However applying PCA to these matrices traditionally involves the formation of and
which is prohibitive for sufficiently large datasets if we only want to estimate top
eigenspace.A natural open question in this area is to what degree can the formation of and can be bypassed to obtain efficient scalable algorithms in the case where the number of features is much smaller than the dimensions of the problem and . Can we develop simple iterative practical methods that solve this problem in close to linear time when is small and the condition number and eigenvalue gaps are bounded? While there has been recent work on solving these problems using iterative methods (Avron et al., 2014; Paul, 2015; Lu and Foster, 2014; Ma et al., 2015) we are unaware of previous provable global convergence results and more strongly, linearly convergent scalable algorithms.
The central goal of this paper is to answer this question in the affirmative. We present simple globally linearly convergent iterative methods that solve these problems. The running time of these problems scale well as the number of features and conditioning of the problem stay fixed and the size of the datasets grow. Moreover, we implement the method and perform experiments demonstrating that the techniques may be effective for large scale problems.
Specializing our results to the single feature case we show how to solve the problems all in time , where is the maximum of condition numbers of and and is the eigengap of appropriate matrices and mentioned above, and is the number of nonzero entries in and . To the best of our knowledge this is the first such globally linear convergent algorithm for solving these problems.
We achieve our results through a general and versatile framework that allows us to utilize fast linear system solvers in various regimes. We hope that by initiating this theoretical and practical analysis of CCA and the generalized eigenvector problem we can promote further research on the problem and ultimately advance the stateoftheart for efficient data analysis.
1.1 Our Approach
To solve the problems motivated in the previous section we first directly reduce CCA to a generalized eigenvector problem (See Section 5). Consequently, for the majority of the paper we focus on the following:
Definition 1 (Top Generalized Eigenvector^{6}^{6}6We use the term generalized eigenvector to refer to a nonzero vector such that for symmetric and , not the general notion of eigenvectors for asymmetric matrices.).
Given symmetric matrices where is positive definite compute defined for all by
The generalized eigenvector is equivalent to the problem of computing the PCA of in the norm. Consequently, it is the same as computing the top eigenvectors of largest absolute value of the symmetric matrix and then multiplying by .
Unfortunately, as we have discussed, explicitly computing is prohibitively expensive when is large and therefore we wish to avoid forming explicitly. One natural approach is to develop an iterative methods to approximately apply
to a vector and then use that method as a subroutine to perform the power method on
. Even if we could perform the error analysis to make this work, such an approach would likely require at least a suboptimal iterations to achieve error .To bypass these difficulties, we take a closer look at the power method. For some initial vector , let be the result of iterations of power method on followed by multiplying . Clearly . Furthermore, since we typically initialize the power method by a random vector and since is positive definite, if we instead we computed for random we would likely converge at the same rate as the power method at the cost of just a slightly worse initialization quality.
Consequently, we can compute our desired eigenvectors by simply alternating between applying and to a random initial vector. Unfortunately, computing exactly is again outside our computational budget. At best we should only attempt to apply approximately by linear system solvers.
One of our main technical contributions is to argue about the effect of inexact solvers in this method. Whereas solving every linear system to target accuracy would again require time per linear system, which leads to a suboptimal overall running time, i.e. sublinear convergence, we instead show how to warm start the linear system solvers and obtain a faster rate. We exploit the fact that as we perform many iterations of power methods, points at time converge to eigenvectors and therefore we can initialize our linear system solver at time carefully using our points at time . Ultimately we show that we only need to make fixed multiplicative progress in solving the linear system in every iteration of the power method, thus the runtime for solving each linear system is independent of .
Putting these pieces together with careful error analysis yields our main result. Our algorithm only requires the ability to apply to a vector and an approximate linear system solver for , which in turn can be obtained by just applying to vectors. Consequently, our framework is versatile, scalable, and easily adaptable to take advantage of faster linear system solvers.
1.2 Previous Work
While there has been limited previous work on provably solving CCA and generalized eigenvectors, we note that there is an impressive body of literature on performing PCA(Rokhlin et al., 2009; Halko et al., 2011; Musco and Musco, 2015; Garber and Hazan, 2015; Jin et al., 2015) and solving positive semidefinite linear systems(Hestenes and Stiefel, 1952; Nesterov, 1983; Spielman and Teng, 2004). Our analysis in this paper draws on this work extensively and our results should be viewed as the principled application of them to the generalized eigenvector problem.
There has been much recent interest in designing scalable algorithms for CCA(Ma et al., 2015; Wang et al., 2015; Wang and Livescu, 2015; Michaeli et al., 2015)
. To our knowledge, there are no provable guarantees for approximate methods for this problem. Heuristicbased approachs
(Witten et al., 2009; Lu and Foster, 2014) compute efficiently, but only give suboptimal result due to coarse approximation. The work in (Ma et al., 2015) provides one natural iterative procedure, where the per iterate computational complexity is low. This work only provides local convergence guarantees and does not provide guarantees of global convergence.Also of note is that many recent algorithms (Ma et al., 2015; Wang et al., 2015) have minibatch variations, but there’s no guarantees for minibatch style algorithm for CCA yet. Our algorithm can also be easily extends to a minibatch version. While we do not explicitly analyze this variation, and we believe our analysis and techniques are helpful for extensions to this setting. We also view this as an important direction for future work.
We hope that by establishing the generalized eigenvector problem and providing provable guarantees under moderate regularity assumptions that our results may be further improved and ultimately this may advance the stateoftheart in practical algorithms for performing data analysis.
1.3 Our Results
Our main result in this paper is a linearly convergent algorithm for computing the top generalized eigenvectors (see Definition 1). In order to be able to state our results we introduce some notation. Let be the eigenvalues of (their existence is guaranteed by Lemma 9 in the appendix). The eigengap and . Let denote the number of nonzero entries in and .
Theorem 2 (Informal version of Theorem 6).
Given two matrices and , there is an algorithm that computes the top generalized eigenvectors up to an error in time , where is the condition number of and hides logarithmic terms in , , and , and nothing else.
Here is a comparison of our result with previous work.
GenELinK(this paper)  
Fast matrix inversion 
Turning to the problem of CCA, cf. (1), the relevant parameters are i.e., the maximum of the condition numbers of and , where are the eigenvalues of in decreasing absolute value. Let denote the number of nonzeros in and . Our main results are a reduction from CCA to the generalized eigenvector problem.
Theorem 3 (Informal version of Theorem 7).
Given two data matrices and , there is an algorithm that performs top CCA up to an error in time , where and hides logarithmic terms in , , and , and nothing else.
Table 2 compares our result with existing results. ^{7}^{7}7(Ma et al., 2015) only shows local convergence for SAppGrad. Starting within this radius of convergence requires us to already solve the problem to a high accuracy.
This paper  

SAppGrad (Ma et al., 2015)  
Fast matrix inversion 
We should note that the actual bounds we obtain are somewhat stronger than the above informal bounds. Some of the terms in logarithm also appear only as additive terms. Finally we also give natural stochastic extensions of our algorithms where the cost of each iteration may be much smaller than the input size. The key idea behind our approach is to use an approximate linear system solver as a black box inside power method on an appropriate matrix. We show that this dependence on a linear system solver is in some sense essential. In Section 6 we show that the generalized eigenvector problem is strictly more general than the problem of solving positive semidefinite linear systems and consequently our dependence on the condition number of is in some cases optimal.
Our result + GD  

Our result + AGD  
Our result + SVRG  
Our result + ASVRG 
Subsequent to the submission of this paper, we learned of the closely related work in (Wang et al., 2016), which presents a number of additional interesting results. We think it is worthwhile to point out that our algorithm only requires black box access to any linear solver. Although the result in Theorem 3 was stated by instantiating the linear system solver by accelerated gradient descent (AGD), it is immediate to apply Theorem 7
and give the corresponding rates if we instantiate it by other popular algorithms, including gradient descent (GD), stochastic variance reduction (SVRG)
(Johnson and Zhang, 2013), and its accelerated version (ASVRG) (Frostig et al., 2015; Lin et al., 2015). We summarize the corresponding runtime in Table 3. There , and , are th column of matrix and . Note by definition we always have . For generalized eigenvector problem, results of similar flavor as in Table 3 can also be easily derived.Finally, we also run experiments to demonstrate the practical effectiveness of our algorithm on both small and large scale datasets.
1.4 Paper Overview
In Section 2, we present our notation. In Section 3, we formally define the problems we solve and their relevant parameters. In Section 4, we present our results for the generalized eigenvector problem. In Section 5, we present our results for the CCA problem. In Section 6 we argue that generalized eigenvector computation is as hard as linear system solving and that our dependence on is near optimal. In Section 7, we present experimental results of our algorithms on some real world data sets. Due to space limitations, proofs are deferred to the appendix.
2 Notation
We use bold capital letters to denote matrices and bold lowercase letters for vectors. For symmetric positive semidefinite (PSD) matrix , we let denote the norm of and we let denotes the inner product of and in the norm. We say that a matrix is orthonormal if . We let denotes the
largest singular value of
, and denote the smallest and largest singular values of respectively. Similarly we let refers to the largest eigenvalue of in magnitude. We let denotes the number of nonzeros in . We also let denote the condition number of (i.e., the ratio of the largest to smallest eigenvalue).3 Problem Statement
In this section, we recall the generalized eigenvalue problem, define our error metric, and introduce all relevant parameters. Recall that the generalized eigenvalue problem is to find vectors , such that
Using stationarity conditions, it can be shown that the vectors are given by , where is an eigenvector of with eigenvalue such that . Our goal is to recover the topk eigen space i.e., . In order to quantify the error in estimating the eigenspace, we use largest principal angle, which is a standard notion of distance between subspaces (Golub and Van Loan, 2012).
Definition 4 (Largest principal angle).
Let and be two dimensional subspaces, and their orthonormal basis respectively. The largest principal angle in the norm is defined to be
Intuitively, the largest principal angle corresponds to the largest angle between any vector in the span of and its projection onto the span of . In the special case where , the above definition reduces to our choice in the top setting. Given two matrices and , we use to denote the largest principle angle between the subspaces spanned by the columns of and . We say that achieves an error of if and , where is the matrix whose columns are . The relevant parameters for us are the eigengap, i.e. the relative difference between and eigenvalues, , and , the condition number of .
4 Our Results
In this section, we provide our algorithms and results for solving the generalized eigenvector problem. We present our results for the special case of computing the top generalized eigenvector (Section 4.1) followed by the general case of computing the top generalized eigenvectors (Section 4.2). However, first we formally define a linear system solver as follows:
Linear system solver: In each of our main results (Theorems 5 and 6) we assume black box access to an approximate linear system solver. Given a PSD matrix , a vector , an initial estimate , and an error parameter , we require to decrease the error by a multiplicative , i.e. output with . We let denote the time needed for this operation. Since the error metric is equivalent to function error on minimizing the convex quadratic up to constant scaling, an approximate linear system solver is equivalent to an optimization algorithm for . We also specialize our results using Nesterov’s accelerated gradient descent to state our bounds. Stating our results using linear system solver as a blackbox allows the user to choose an efficient solver depending on the structure of and helps pass any improvements in linear system solvers on to the problem of generalized eigenvectors.
4.1 Top1 Setting
Our algorithm for computing the top generalized eigenvector, GenELin is given in Algorithm 1.
The algorithm implements an approximate power method where each iteration consists of approximately multiplying a vector by . In order to do this, GenELin solves a linear system in and then scales the resulting vector to have unit norm. Our main result states that given an oracle for solving the linear systems,^{10}^{10}10For example, we could use Nesterov’s accelerated gradient descent, Algorithm 4 the number of iterations taken by Algorithm 1 to compute the top eigenvector up to an accuracy of is at most where .
Theorem 5.
Recall that the linear system solver takes time to reduce the error by a factor . Given matrices and , GenELin (Algorithm 1) computes a vector achieving an error of in iterations, where . The running time of the algorithm is at most
Furthermore, if we use Nesterov’s accelerated gradient descent (Algorithm 4) to solve the linear systems in Algorithm 1, the time can be bounded as
Remarks:

Note that GenELin exploits the sparsity of input matrices since we only need to apply them as operators.

Depending on computational restrictions, we can also use a subset of samples in each iteration of GenELin. In some large scale learning applications using minibatches of data in each iteration helps make the method scalable while still maintaining the quality of performance.
4.2 Topk Setting
In this section, we give an extension of our algorithm and result for computing the top generalized eigenvectors. Our algorithm, GenELinK is formally given as Algorithm 2.
GenELinK is a natural generalization of GenELin from the previous section. Given an initial set of vectors , the algorithm proceeds by doing approximate orthogonal iteration. Each iteration involves solving independent linear systems^{11}^{11}11Similarly, as before, we could use Nesterov’s accelerated gradient descent, i.e. Algorithm 4. and orthonormalizing the iterates. The following theorem is the main result of our paper which gives runtime bounds for Algorithm 2. As before, we assume access to a blackbox linear system solver and also give a result instantiating the theorem with Nesterov’s accelerated gradient descent algorithm.
Theorem 6.
Suppose the linear system solver takes time to reduce the error by a factor . Given input matrices and , GenELinK computes a matrix which is an estimate of the top generalized eigenvectors with an error of i.e., and , where in iterations where . The run time of this algorithm is at most
where , being the top eigenvalues of . Furthermore, if we use Nesterov’s accelerated gradient descent (Algorithm 4) to solve the linear systems in Algorithm 2, the time above can be bounded as
5 Application to CCA
We now outline how the CCA problem can be reduced to computing generalized eigenvectors. The CCA problem is as follows. Given two sets of data points and , let , , and . We wish to find vectors and which are defined recursively as
where the values of are called canonical correlations between and .
For reduction, we know any stationary point of this optimization problem satisfies , and , where and are two constants. Combined with the constraints, we also see that . This can be written in matrix form as . Suppose the generalized eigenvalues of the above matrices are . The top dimensional eigenspace of this generalized eigenvalue problem corresponds to the linear subspace spanned by the eigenvectors of and , which are . Once we solve the top generalized eigenvector problem for the matrices and , we can pick any orthonormal basis that spans the output subspace and choose a random dimensional projection of those vectors. The formal algorithm is given in Algorithm 3. Combining this with our results for computing generalized eigenvectors, we obtain the following result.
Theorem 7.
Suppose the linear system solver takes time to reduce the error by a factor . Given inputs and , with probability greater than , then there is some universal constant , so that Algorithm 3 outputs and such that , and , in time
where and and . If we use Nesterov’s accelerated gradient descent (Algorithm 4) to solve the linear systems in GenELink, then the total runtime is
Remarks:

Note that we depend on the maximum of the condition numbers of and since the linear systems that arise in GenELinK decompose into two separate linear systems, one in and the other in .

We can also exploit sparsity in the data matrices and since we only need to apply or only as operators, which can be done by applying and in appropriate order. Exploiting sparsity is crucial for any large scale algorithm since there are many data sets (e.g., URL dataset in our experiments) where dense operations are impractical.
6 Reduction to Linear System
Here we show that solving linear systems in is inherent in solving the top generalized eigenvector problem in the worst case and we provide evidence a factor in the running time is essential for a broad class of iterative methods for the problem.
Let be a symmetric positive definite matrix and suppose we wish to solve the linear system , i.e. compute with . If we set and then
and consequently computing the top1 generalized eigenvector yields the solution to the linear system. Therefore, the problem of computing top generalized eigenvectors is in general harder than the problem of solving symmetric positive definite linear systems.
Moreover, it is well known that any method which starts at and iteratively applies to linear combinations of the points computed so far must apply at least in order to halve the error in the standard norm for the problem (Shewchuk, 1994). Consequently, methods that solve the top generalized eigenvector problem by simply applying and , which is the same as applying and taking linear combinations with , must apply at least times to achieve small error, unless they exploit more structure of or the initialization.
7 Simulations
In this section, we present our experiment results performing CCA on three benchmark datasets which are summarized in Table 4. We wish to demonstrate two things via these simulations: 1) the behavior of CCALin verifies our theoretical result on relatively smallscale dataset, and 2) scalability of CCALin comparing it with other existing algorithms on a largescale dataset.
Dataset  sparsity ^{12}^{12}12Sparsity is given by .  

MNIST  
Penn Tree Bank  
URL Reputation 
Let us now specify the error metrics we use in our experiments. The first ones are the principal angles between the estimated subspaces and the true ones. Let and be the estimated subspaces and , be the true canonical subspaces. We will use principle angles under norm, under norm and ^{13}^{13}13See Algorithm 3 for definition of , under the norm. Unfortunately, we cannot compute these error metrics for largescale datasets since they require knowledge of the true canonical components. Instead we will use Total Correlations Captured (TCC), which is another metric widely used by practitioners, defined to be the sum of canonical correlation between two matrices. Also, Proportion of Correlations Captured is given as
For a fair comparison with other algorithms (which usually call highly optimized matrix inversion subroutines), we use number of FLOPs instead of wall clock time to measure the performance.
7.1 Smallscale Datasets
MNIST dataset(LeCun et al., 1998) consists of 60,000 handwritten digits from 0 to 9. Each digit is a image represented by real values in [0,1]. Here CCA is performed between left half images and right half images. The data matrix is dense but the dimension is fairly small.
Penn Tree Bank (PTB) dataset comes from full Wall Street Journal Part of Penn Tree Bank which consists of 1.17 million tokens and a vocabulary size of 43k(Marcus et al., 1993), which has already been used to successfully learn the word embedding by CCA(Dhillon et al., 2011). Here, the task is to learn correlated components between two consecutive words. We only use the top 10,000 most frequent words. Each row of data matrix is an indicator vector and hence it is very sparse and is diagonal.
Since the input matrices are very ill conditioned, we add some regularization and replace by (and similarly with ). In CCALin, we run GenELinK with and accelerated gradient descent (Algorithm 4 in the supplementary material) to solve the linear systems. The results are presented in Figure 1 and Figure 2.
Figure 1 shows a typical run of CCALin from random initialization on both MNIST and PTB dataset. We see although may be even 90 degree at some point respectively, is always monotonically decreasing (as monotonically increasing) as predicted by our theory. In the end, as goes to zero, it will push both and go to zero, and PCC go to 1. This demonstrates that our algorithm indeed converges to the true canonical space.
Furthermore, by a more detailed examination of experimental data in Figure 1, we observe in Figure 2 that is indeed linearly convergent as we predicted in the theory. In the meantime, and may initially converge a bit slower than , but in the end they will be upper bounded by times a constant factor, thus will eventually converge at a linear rate at least as fast as .
7.2 Largescale Dataset
URL Reputation dataset contains 2.4 million URLs and 3.2 million features including both hostbased features and lexical based features. Each feature is either real valued or binary. For experiments in this section, we follow the setting of (Ma et al., 2015). We use the first 2 million samples, and run CCA between a subset of host based features and a subset of lexical based features to extract the top components. Although the data matrix is relatively sparse, unlike PTB, it has strong correlations among different coordinates, which makes much denser ().
Classical algorithms are impractical for this dataset on a typical computer, either running out of memory or requiring prohibitive amount of time. Since we cannot estimate the principal angles, we will evaluate TCC performance of CCALin.
We compare our algorithm to SAppGrad (Ma et al., 2015) which is an iterative algorithm and PCACCA (Ma et al., 2015), NWCCA (Witten et al., 2009) and DWCCA (Lu and Foster, 2014) which are oneshot estimation procedures.
In CCALin, we employ GenELinK using stochastic accelerated gradient descent for solving linear systems using minibatches in each of the gradient steps and also leverage sparsity of the data to deal with the large data size. The result is shown in Figure 3. It is clear from the plot that our algorithm takes fewer computations than the other algorithms to achieve the same accuracy.
8 Conclusion
In summary, we have provided the first provable globally linearly convergent algorithms for solving canonical correlation analysis and the generalized eigenvector problems. We have shown that for recovering the top
components our algorithms are much faster than traditional methods based on fast matrix multiplication and singular value decomposition when
and the condition numbers and eigenvalue gaps of the matrices involved are moderate. Moreover, we have provided empirical evidence that our algorithms may be useful in practice. We hope these results serve as the basis for further improvements in performing large scale data analysis both in theory and in practice.References
 Avron et al. (2014) Avron, H., Boutsidis, C., Toledo, S., and Zouzias, A. (2014). Efficient dimensionality reduction for canonical correlation analysis. SIAM Journal on Scientific Computing, 36(5):S111–S131.
 Barnett and Preisendorfer (1987) Barnett, T. and Preisendorfer, R. (1987). Origins and levels of monthly and seasonal forecast skill for united states surface air temperatures determined by canonical correlation analysis. Monthly Weather Review, 115(9):1825–1850.
 Barnston and Ropelewski (1992) Barnston, A. G. and Ropelewski, C. F. (1992). Prediction of enso episodes using canonical correlation analysis. Journal of climate, 5(11):1316–1345.
 Chaudhuri et al. (2009) Chaudhuri, K., Kakade, S. M., Livescu, K., and Sridharan, K. (2009). Multiview clustering via canonical correlation analysis. In Proceedings of the 26th annual international conference on machine learning, pages 129–136. ACM.
 Dhillon et al. (2011) Dhillon, P., Foster, D. P., and Ungar, L. H. (2011). Multiview learning of word embeddings via cca. In Advances in Neural Information Processing Systems, pages 199–207.
 Friman et al. (2001) Friman, O., Cedefamn, J., Lundberg, P., Borga, M., and Knutsson, H. (2001). Detection of neural activity in functional mri using canonical correlation analysis. Magnetic Resonance in Medicine, 45(2):323–330.
 Frostig et al. (2015) Frostig, R., Ge, R., Kakade, S. M., and Sidford, A. (2015). Unregularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In ICML2015.
 Garber and Hazan (2015) Garber, D. and Hazan, E. (2015). Fast and simple pca via convex optimization. arXiv preprint arXiv:1509.05647.
 Golub and Van Loan (2012) Golub, G. H. and Van Loan, C. F. (2012). Matrix computations, volume 3. JHU Press.
 Halko et al. (2011) Halko, N., Martinsson, P.G., and Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288.
 Hardoon et al. (2004) Hardoon, D. R., Szedmak, S., and ShaweTaylor, J. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural computation, 16(12):2639–2664.
 Hestenes and Stiefel (1952) Hestenes, M. R. and Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems, volume 49. NBS.
 Hotelling (1936) Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4):321–377.
 Jin et al. (2015) Jin, C., Kakade, S. M., Musco, C., Netrapalli, P., and Sidford, A. (2015). Robust shiftandinvert preconditioning: Faster and more sample efficient algorithms for eigenvector computation. CoRR, abs/1510.08896.

Johnson and Zhang (2013)
Johnson, R. and Zhang, T. (2013).
Accelerating stochastic gradient descent using predictive variance reduction.
In NIPS2013, pages 315–323.  Kakade and Foster (2007) Kakade, S. M. and Foster, D. P. (2007). Multiview regression via canonical correlation analysis. In Learning theory, pages 82–96. Springer.
 Karampatziakis and Mineiro (2013) Karampatziakis, N. and Mineiro, P. (2013). Discriminative features via generalized eigenvectors. arXiv preprint arXiv:1310.1934.

LeCun et al. (1998)
LeCun, Y., Cortes, C., and Burges, C. J. (1998).
The mnist database of handwritten digits.
 Lin et al. (2015) Lin, H., Mairal, J., and Harchaoui, Z. (2015). A universal catalyst for firstorder optimization. arXiv preprint arXiv:1506.02186.
 Lu and Foster (2014) Lu, Y. and Foster, D. P. (2014). Large scale canonical correlation analysis with iterative least squares. In Advances in Neural Information Processing Systems, pages 91–99.
 Ma et al. (2015) Ma, Z., Lu, Y., and Foster, D. P. (2015). Finding Linear Structure in Large Datasets with Scalable Canonical Correlation Analysis. In Proceedings of the 32nd International Conference on Machine Learning, JMLR Proceedings.
 Marcus et al. (1993) Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. (1993). Building a large annotated corpus of english: The penn treebank. Computational linguistics, 19(2):313–330.
 Michaeli et al. (2015) Michaeli, T., Wang, W., and Livescu, K. (2015). Nonparametric Canonical Correlation Analysis. CoRR, abs/1511.04839.
 Musco and Musco (2015) Musco, C. and Musco, C. (2015). Stronger approximate singular value decomposition via the block lanczos and power methods. arXiv preprint arXiv:1504.05477.
 Nesterov (1983) Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate o (1/k2). In Soviet Mathematics Doklady, volume 27, pages 372–376.
 Paul (2015) Paul, S. (2015). Coresets for canonical correlation analysis. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 1887–1890. ACM.

Rokhlin et al. (2009)
Rokhlin, V., Szlam, A., and Tygert, M. (2009).
A randomized algorithm for principal component analysis.
SIAM Journal on Matrix Analysis and Applications, 31(3):1100–1124.  Rudelson and Vershynin (2010) Rudelson, M. and Vershynin, R. (2010). Nonasymptotic theory of random matrices: extreme singular values. arXiv preprint arXiv:1003.2990.
 Sherry and Henson (2005) Sherry, A. and Henson, R. K. (2005). Conducting and interpreting canonical correlation analysis in personality research: A userfriendly primer. Journal of personality assessment, 84(1):37–48.
 Shewchuk (1994) Shewchuk, J. R. (1994). An introduction to the conjugate gradient method without the agonizing pain. Technical report, Pittsburgh, PA, USA.
 Shi and Malik (2000) Shi, J. and Malik, J. (2000). Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(8):888–905.

Spielman and Teng (2004)
Spielman, D. A. and Teng, S.H. (2004).
Nearlylinear time algorithms for graph partitioning, graph
sparsification, and solving linear systems.
In
Proceedings of the thirtysixth annual ACM symposium on Theory of computing
, pages 81–90. ACM.  Wang et al. (2015) Wang, W., Arora, R., Livescu, K., and Srebro, N. (2015). Stochastic optimization for deep CCA via nonlinear orthogonal iterations. volume abs/1510.02054.
 Wang and Livescu (2015) Wang, W. and Livescu, K. (2015). LargeScale Approximate Kernel Canonical Correlation Analysis. CoRR, abs/1511.04773.
 Wang et al. (2016) Wang, W., Wang, J., and Srebro, N. (2016). Globally convergent stochastic optimization for canonical correlation analysis. arXiv preprint arXiv:1604.01870.
 Witten et al. (2009) Witten, D. M., Tibshirani, R., and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, page kxp008.
Appendix A Solving Linear System via Accelerated Gradient Descent
Since we use accelerated gradient descent in our main theorems, for completeness, we put the algorithm and cite its result about iteration complexity here without proof.
Theorem 8 ((Nesterov, 1983)).
Let be strongly convex and smooth, then accelerated gradient descent with learning rate and satisfies:
(2) 
Appendix B Proofs of Main Theorem
b.1 Rank1 Setting
We first prove our claim that has an eigenbasis.
Lemma 9.
Let be the eigenpairs of the symmetric matrix . Then is an eigenvector of with eigenvalue .
Proof.
The proof is straightforward.
∎
Denote the eigenpairs of by , the above lemma further tells us that .
Recall that we defined the angle between and in the norm: .
To measure the distance from optimality, we use the following potential function for normalized vector ():
(3) 
Lemma 10.
Consider any such that and . Then, we have:
Proof.
Clearly,
proving the first part. For the second part, we have the following:
proving the lemma. ∎
Proof of Theorem 5.
We will show that the potential function decreases geometrically with . This will directly provides an upper bound for . For simplicity, through out the proof we will simply denote as .
Recall the updates in Algorithm 1, suppose at time , we have such that . Let us say
(4) 
where is some normalization factor, and is the error in solving the least squares. We will first prove the geometric convergence claim assuming
(5) 
and then bound the time taken by blackbox linear system solver to provide such an accuracy. Since can be written as , we know . Since and , we have