Let and. The objective of CCA (Hotelling, 1936) in the population setting is to find and
such that projections of the random variables onto these directions are maximally correlated:111For simplicity (especially for the streaming setting), we assume that and . Nonzero means can be easily handled in the ERM approach (see Remark 3).
This objective can be written in the equivalent constrained form
where the cross- and auto-covariance matrices are defined as
The global optimum of (2), denoted by , can be computed in closed-form. Define
and let be the (unit-length) top left and right singular vector pair associated with
’s largest singular value. Then the optimal objective value, i.e., the canonical correlation between and , is , achieved by .
In practice, we do not have access to the population covariance matrices, but observe samples pairs drawn from . In this paper, we are concerned with both the number of samples needed to approximately solve (2), and the time complexity for obtaining the approximate solution. Note that the CCA objective is not a stochastic convex program due to the ratio form (1), and standard stochastic approximation methods do not apply (Arora et al., 2012). Globally convergent stochastic optimization of CCA has long been a challenge, until the recent breakthrough by Ge et al. (2016); Wang et al. (2016) for solving the empirical objective.
The contributions of our paper are summarized as follows.
First, we provide the ERM sample complexity of CCA. We show that in order to achieve
-suboptimality in the alignment between the estimated canonical directions and the population solution (relative to the population covariances, see Section2), we can solve the empirical objective exactly with samples where is the singular value gap of the whitened cross-covariance and
is a upper bound of the condition number of the auto-covariance, for several general classes of distributions widely used in statistics and machine learning.
Second, to alleviate the high computational complexity of exactly solving the empirical objective, we show that we can achieve the same learning accuracy by drawing the same level of samples and solving the empirical objective approximately with the stochastic optimization algorithm of Wang et al. (2016). This algorithm is based on the shift-and-invert power iterations. We provide tightened analysis of the algorithm’s time complexity, removing an extra factor from the complexity given by Wang et al. (2016). Our analysis shows that asymptotically it suffices to process the sample set for passes. While near-linear runtime in the required number of samples is known and achieved for convex learning problems using SGD, no such result was estabilished for the nonconvex CCA objective previously.
Third, we show that the streaming version of shift-and-invert power iterations achieves the same learning accuracy with the same level of sample complexity, given a good estimate of the canonical correlation. This approach requires only memory and thus further alleviates the memory cost of solving the empirical objective. This addresses the challenge of the existence of a stochastic algorithm for CCA proposed by Arora et al. (2012).
Notations We use to denote the -th largest singular value of a matrix , and use and to denote the largest and smallest singular values of respectively. We use to denote the spectral norm of a matrix or the -norm of a vector. For a positive definite matrix , the vector norm is defined as for any . Denote . We use and to denote universal constants that are independent of the problem parameters, and their specific values may vary among appearances.
2 Problem setup
We assume the following properties of the input random variables.
Concentration property: For sufficiently large sample sizes , the input variables satisfy the following inequality with high probability: 333We refrain ourselves from specifying the failure probability as it only adds additional mild dependences to our results.
where the empirical covariance matrices are defined as
Singular value gap: For the purpose of learning the canonical directions , we assume that there exists a positive singular value gap , such that the top left- and right-singular vector pair of is uniquely defined.
Measure of error
For an estimate which need not be correctly normalized (i.e., they may not satisfy the constraints of (2)), we can define as the correctly normalized version. And we can measure the quality of these directions by the alignment (cosine of the angle) between , or the sum of alignment between and alignment between (all vectors have unit length):
This measure of alignment is invariant to the lengths of and , and achieves the maximum of if lie in the same direction as . Intuitively, this measure respects the geometry imposed by the CCA constraints that the projections of each view have unit length. As we will show later, this measure is also closely related to the learning guarantee we can achieve with power iterations. Moreover, high alignment implies accurate estimate of the canonical correlation.
Let . If , then .
All proofs are deferred to the appendix.
3 The sample complexity of ERM
One approach to address this problem is empirical risk minization (ERM): We draw samples from and solve the empirical version of (2):
Similarly, define the empirical version of as .
In the following, we analyze three general distributions commonly used in the statistics and machine learning literature: Denote .
(Sub-Gaussian) Let be isotropic and sub-Gaussian, that is, and there exists constant such that for any unit vector .
(Regular polynomial-tail) Let be isotropic and regular polynomial-tail, that is, and there exist constants such that for any orthogonal projection in and any . Note that this class is general and only implies the existence of a
(Bounded) Let and be bounded and in particular (which implies as in Assumption 1).
3.1 Approximating the canonical correlation
We first discuss the error of approximating by
. Observe that, although the empirical covariance matrices are unbiased estimates of their population counterparts, we donot have due to the nonlinear operations (matrix multiplication, inverse, and square root) involved in computing . Nonetheless, we can provide approximation guanrantee based on concentrations.
We will separate the probabilistic property of data—Assumption 2—from the deterministic error analysis, and we show below that it is satisfied by the three classes considered here.
In our decomposition of the error , we need to bound terms of the form . Such bounds can be derived from our assumption on using the lemma below.
Lemma 4 (Perturbation of matrix square root, main result of Mathias (1997))
Let be positive definite, let be Hermitian, and suppose that . Then we have
where is independent of .
We can now bound the perturbation and the approximation error in canonical correlation.
where is same constant in Lemma 4.
Let . Then for , i.e,
we have with high probability that .
Due to better concentration properties, the sample complexity for the sub-Gaussian and regular polynomial-tail classes are independent of the condition number of the auto-covariances.
3.2 Approximating the canonical directions
We now discuss the error in learning by ERM, when has a singular value gap . Let the nonzero singular values of be , where , and the corresponding (unit-length) singular vector pairs be . Define
The eigenvalues of are
, with corresponding eigenvectorsTherefore, learning canonical directions reduces to learning the top eigenvector of .
We denote the empirical version of by , and the singular vector pairs of by . Due to the block structure of and , we have . Let the ERM solution be , which satisfy . We now state the sample complexity for learning the canonical directions by ERM.
Let . Then for , i.e.,
we have with high probability that .
Proof sketch The proof of Theorem 7 consists of two steps. We first bound the error between ’s top eigenvector and ’s top eigenvector using a standard result on perturbation of eigenvectors, namely the Davis-Kahan theorem (Davis and Kahan, 1970). We then show that is very close to the “correctly normalized” , so the later still aligns well with the population solution.
Comparison to prior analysis For the sub-Gaussian class, the tightest analysis of the sample complexity upper bound we are aware of was by Gao et al. (2015). However, their proof relies on the assumption that . In contrast, we do not require this assumption, and our bound is sharp in terms of the gap . Up to the factor, our ERM sample complexity for the same loss matches the minimax lower bound given by Gao et al. (2015) (see also Section 6).
4 Stochastic optimization for ERM
A disadvantage of the empirical risk minimization approach is that it can be time and memory consuming. To obtain the exact solution to (7
), we need to explicitly form and store the covariance matrixs and computing their singular value decompositions (SVDs); these steps have a time complexity ofand a memory complexity of .
In this section, we study the stochastic optimization of the empirical objective, and show that the computational complexity is low: We just need to process a large enough dataset (with the same level of samples as ERM requires) nearly constant times in order to achieve small error with respect to the population objective. The algorithm we use here is the shift-and-invert meta-algorithm proposed by Wang et al. (2016). However, in this section we provide refined analysis of the algorithm’s time complexity than that provided by Wang et al. (2016). We show that, using a better measure of progress and careful initializations for each least squares problem, the algorithm enjoys linear convergence (see Theorem 12), i.e., the time complexity for achieving -suboptimalilty in the empirical objective depends on , whereas the result of Wang et al. (2016) has a dependence of .
4.1 Shift-and-invert power iterations
Our algorithm runs the shift-and-invert power iterations on the following matrix
where . It is obvious that is positive definite and its eigenvalues are , with the same set of eigenvectors as .
Assume that there exists a singular value gap for (this can be guaranteed by drawing sufficiently many samples so that the singular values of are within a fraction of the gap of ), denoted as . The key observation is that, as opposed to running power iterations on (which is essentially done by Ge et al. 2016), has a large eigenvalue gap when with , and thus power iterations on converge more quickly. In particular, we assume for now the availability of an estimated eigenvalue such that where ; locating such a is discussed later in Remark 14.
Define , , and we have . Due to the block structure, ’s eigenvalues of can be bounded: we have and . And by the relationship , the eigenvalues of can be bounded:
It is convenient to study the convergence in the concatenated variables
Denote and using the ERM solution, which satisfy and respectively.
4.2 Error analysis of one iteration
Our algorithm iteratively applies the approximate matrix-vector multiplications
This equivalence allows us to directly work with and avoids computing or explicitly. Note that we do not perform normalizations of the form at each iteration as done by Wang et al. (2016) (Phase-I of their SI meta-algorithm); the length of each iterate is irrelevant for the purpose of optimizing the alignment between vectors and we could always perform the normalization in the end to satisfy the length constants. Exact power iterations is known to converge linearly when there exist an eigenvalue gap (Golub and van Loan, 1996).
The matrix-vector multiplication is equivalent to solving the least squares problem
whose unique solution is with the optimal objective . Of course, solving the problem exactly is costly and we will apply stochastic gradient methods to it. We will show that, when the least squares problems are solved accurately enough, the iterates are of the same quality as those of the exact solutions and enjoys linear convergence of power iterations.
We begin by introducing the measure of progress for the iterates. Denote the eigenvalues of by , with corresponding eigenvectors forming an an orthonormal basis of . Recall that , for , and for .
We therefore can write each iterate as a linear combination of the eigenvectors:
The potential function we use to evaluate the progress of each iteration is
where and denote the projections onto the subspaces that are perpendicular and parallel to respectively. The same potential function was used by Garber et al. (2016) for analyzing the convergence of shift-and-invert for PCA. The potential function is invariant to the length of , and is equivalent to the criterion where is the angle between and :
The lemma below shows that under the iterative scheme (10), converges linearly to .
Let . Assume that for each approximate matrix-vector multiplication, we solve the least squares problem so accurately that the approximate solution satisfies
Let . Then we have for all .
4.3 Bounding the initial error for each least squares
On the other hand, we can minimize the initial suboptimality for the least squares problem for reducing the time complexity of its solver. It is natural to use an initialization of the form , a scaled version of the previous iterate, which gives the following objective
This is a quadratic function of , and minimizing over gives the optimal scaling (and this quantity is also invariant to the length of ). Observe that naturally measures the quality of : As converges to , converges to . This initialization technique plays an important role in showing the linear convergence of our algorithm, and was used by Ge et al. (2016) for their standard power iterations (alternating least squares) scheme for CCA.
Lemma 9 (Warm start for least squares)
Initializing with , it suffices to set the ratio between the initial and the final error to be .
This result indicates that in the converging stage (), we just need to set the ratio between the initial and the final error to the constant 64 (and set it to be the constant before that). This will ensure that the time complexity of least squares has no dependence on the final error .
4.4 Solving the least squares by SGD
The least squares objective (11) can be further written as the sum of functions: where
There has been much recent progress on developping linearly convergent stochastic algorithms for solving finite-sum problems. We use SVRG (Johnson and Zhang, 2013) here due to its memory efficiency. Although is convex, each component may not be convex. We have the following time complexity of SVRG for this case (see, e.g., Garber and Hazan, 2015, Appendix B).
Lemma 10 (Time complexity of SVRG for (13))
With the initialization , SVRG outputs an such that in time
where with being the gradient Lipschitz constant of , and is the strongly-convex constant of . Futhermore, if we sample each component non-uniformly with probability proportional to for the SVRG stochastic updates, we have instead .444Although not explicitly stated by Garber and Hazan (2015), the result for non-uniform sampling is straightforward by a careful investigation of their analysis, and the effect of improved dependence on ’s through non-uniform sampling agrees with related work (Xiao and Zhang, 2014). The purpose of the non-uniform sampling variant is to bound with high probability for sub-Gaussian/regular polynomial-tail inputs.
The next lemma upper-bounds the “condition number” .
Solving using SVRG with non-uniform sampling, we have for the sub-Gaussian/regular polynomial-tail classes,555Strictly speaking, this holds with high probability over the sample set. and for the bounded class.
4.5 Putting everything together
We first provide the time complexity for solving the empirical objective using the (offline) shift-and-invert CCA algorithm, regardless of the number of samples used.
Let . For the ERM objective with samples, offline shift-and-invert outputs an satisfying in total time
We have already shown in Theorem 7 that the ERM solution aligns well with the population solution. By drawing slighly more samples and requiring our algorithm to find an approximate solution that aligns well with the ERM solution, we can guarantee high alignment for the approximate solution.
Let . Draw samples for the ERM objective. Then the total time for offline shift-and-invert to output with is
The -dependent term is near-linear in the ERM sample complexity and is also the dominant term in the total runtime (when for the first two classes). For sub-Gaussian/regular polynomial-tail classes, we incur an undesirable dependence for the condition number , mainly due to weak concentration regarding the data norm (in fact we have stronger concentration for the streaming setting discussed next). One can alleviate the issue of large using accelerated SVRG (Lin et al., 2015), or a sample splitting scheme. 666That is, we draw times more samples and solve each least squares on a different split. In such a way, the initialization for each least squares problem only depends on previous splits and not the current split, so that we have stronger concentration. The least squares condition number then depends on linearly.
We have assumed so far the availability of with for shift-and-invert to work. There exists an efficient algorithm for locating such an , see the repeat-until loop of Algorithm 3 in Wang et al. (2016). This procedure computes approximate matrix-vector multiplications, and its time complexity does not depend on as we only want to achieve good estimate of the top eigenvalue (and not the top eigenvector). So the cost of locating is not dominant in the total time complexity.
5 Streaming shift-and-invert CCA
A disadvantage of the ERM approach is that we need to store all the samples in order to go through the dataset multiple times. We now study the shift-and-invert algorithms in the streaming setting in which we draw samples from the underlying distribution and process them once. Clearly, the streaming approach requires only memory.
We assume the availability of a , where . Our algorithm is the same as in the ERM case, except that we now directly work with the population covariances through fresh samples instead of their empirical estimates. With slight abuse of notation, we use to denote the population version of :
use to denote the eigensystem of , and use as well as
to denote the iterates of our algorithm. Also, define , and similarly as in Section 4.
Our goal is to achieve high alignment between and , which implies high alignment between and . The following lemma makes this intuition precise.
Lemma 15 (Conversion from joint alignment to separate alignment)
Let . If the output of our online shift-and-invert algorithm satisfy that