1 Introduction
Canonical Correlation Analysis (CCA) is a fundamental statistical technique for characterizing the linear relationships between two multidimensional variables.^1 ^2 First introduced in 1936 by Hotelling [10], it has found numerous applications. Those more familiar to the machine learning community include learning with privileged information [15, 3, 14], monolingual [5] and multilingual [7] word representation learning, locality-sensitive hashing [8], and clustering [2]. Because these applications involve unlabeled or partially labeled data, the amount of data available for analysis can be vast, motivating the need for scalable approaches.

^1 CCA can be extended to more than two views, but we do not pursue this here.
^2 Furthermore, nonlinear relationships between variables can be uncovered using kernel CCA or, for large-scale data sets, a primal approximation, e.g., with randomized feature maps [12] or the Nyström method.

2 Background
Given two-view data, CCA finds a projection of each view into a common latent space which maximizes the cross-correlation, subject to each view projection having unit variance and each projection dimension being uncorrelated with the other projection dimensions. In matrix form, given two views X ∈ R^{n×d_x} and Y ∈ R^{n×d_y}, the CCA projections U and V are the solution to

    max_{U,V}  tr( U^T X^T Y V )                                  (1)
    s.t.       U^T X^T X U = I,    V^T Y^T Y V = I.               (2)

The KKT conditions, expressed in terms of the QR decompositions X = Q_x R_x and Y = Q_y R_y, lead to the following multivariate eigenvalue problem [4]

    Q_x^T Q_y Φ_y = Φ_x Λ_x,    Q_y^T Q_x Φ_x = Φ_y Λ_y,          (3)

subject to Φ_x^T Φ_x = I, Φ_y^T Φ_y = I, Λ_x diagonal, and Λ_y diagonal, where U = R_x^{-1} Φ_x and V = R_y^{-1} Φ_y.
Equation (3) leads to several solution strategies. For moderately sized design matrices, an SVD of Q_x^T Q_y directly reveals the solution in the (Q_x, Q_y) coordinate system [1]; the singular values are the canonical correlations. The transformation from the Q coordinate system back to the original (X, Y) coordinate system can be obtained from either the SVD or the QR decompositions of X and Y.
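For moderately sized dense data, this direct strategy is compact enough to sketch in full. The following is a minimal NumPy sketch (the function name and the internal mean centering are our own choices for illustration, not taken verbatim from the paper): QR-factorize each centered view, take an SVD of Q_x^T Q_y, and map the singular vectors back through the triangular factors.

```python
import numpy as np

def cca_qr_svd(X, Y, k):
    """Exact CCA for moderately sized design matrices via thin QR and SVD.

    Returns projections U, V of dimension k and the canonical correlations;
    each projected (centered) view has identity covariance.
    """
    # Thin QR decompositions of the centered design matrices.
    Qx, Rx = np.linalg.qr(X - X.mean(0))
    Qy, Ry = np.linalg.qr(Y - Y.mean(0))
    # SVD of Qx^T Qy reveals the solution in the Q coordinate system;
    # the singular values are the canonical correlations.
    Phi_x, rho, Phi_yT = np.linalg.svd(Qx.T @ Qy)
    # Map back from Q coordinates to the original coordinate system
    # via the triangular factors (U = Rx^{-1} Phi_x, V = Ry^{-1} Phi_y).
    U = np.linalg.solve(Rx, Phi_x[:, :k])
    V = np.linalg.solve(Ry, Phi_yT.T[:, :k])
    return U, V, rho[:k]
```

As the text notes, this is only viable when the QR and SVD factorizations fit comfortably in memory.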
For larger design matrices lacking special structure, SVD and QR decompositions are prohibitively expensive, necessitating other techniques. Large-scale solutions are possible via Horst iteration [4], the analog of orthogonal power iteration for the multivariate eigenvalue problem, in which each block of variables is individually normalized following a matrix multiplication [17]. For CCA, the matrix multiplication step of Horst iteration can be done directly in the original coordinate system by solving a least-squares problem. Furthermore, the least-squares solves need only be done approximately to ensure convergence [13]. Unfortunately, Horst iteration still requires many passes over the data to achieve good results.
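The scheme above can be sketched for dense data as follows. This is our own simplified rendition, not the paper's implementation (which uses a Gauss-Seidel variant with approximate least-squares solves): each half-step applies the cross-correlation operator via a least-squares solve in the original coordinates, then re-orthonormalizes the block under the view's covariance metric.

```python
import numpy as np

def horst_cca(X, Y, k, iters=100, seed=0):
    """Horst-style alternating iteration for CCA (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)

    def normalize(M, W):
        # Re-orthonormalize the block: afterwards M @ W has orthonormal
        # columns, i.e. identity covariance under the view's metric.
        Q = np.linalg.qr(M @ W)[0]
        return np.linalg.lstsq(M, Q, rcond=None)[0]

    V = normalize(Yc, rng.standard_normal((Y.shape[1], k)))
    for _ in range(iters):
        # Multiplication by the cross-correlation operator, performed in
        # the original coordinates via a least-squares solve.
        U = normalize(Xc, np.linalg.lstsq(Xc, Yc @ V, rcond=None)[0])
        V = normalize(Yc, np.linalg.lstsq(Yc, Xc @ U, rcond=None)[0])
    # A final small SVD aligns the converged subspaces with the
    # individual canonical directions.
    A, rho, BT = np.linalg.svd((Xc @ U).T @ (Yc @ V))
    return U @ A, V @ BT.T, rho
```

Each loop iteration corresponds to a pass over the data, which is exactly the cost the paper seeks to reduce.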
3 Algorithm
Our proposal is RandomizedCCA, outlined in Algorithm 1. For ease of exposition, we elide mean centering of each design matrix, which is a rank-one update and can be done in modest extra space without introducing additional data passes or destroying sparsity. Lines 3 through 13 constitute a standard randomized range finder [9] with power iteration for the left and right singular spaces of Q_x^T Q_y. If we consider the recovered bases as providing a low-rank approximation to the top range of Q_x^T Q_y, then the analysis of randomized range finding bounds the approximation error in terms of the first neglected singular value, and analogously for the right singular space [9]. Intuition about the relevant rank can be obtained by considering the effect of regularization. To prevent overfitting, equations (1) and (2) are regularized, hence the canonical correlations achievable in the space orthogonal to the top range of Q_x^T Q_y are bounded by the neglected spectrum. When this quantity is below the smallest canonical correlation of interest, the top range of Q_x^T Q_y is the only relevant subspace, and the question becomes the extent to which the randomized range finder approximates this space well.
In practice the spectrum is unknown, and thus, relative to the desired embedding dimension, RandomizedCCA effectively requires a large amount of oversampling to achieve good results. Nonetheless, when iterations over the data are expensive, this level of oversampling can be more computationally attractive than alternative approaches. This is because CCA is typically used to find a low-dimensional embedding, whereas the final exact SVD and Cholesky factorizations in lines 20 through 23 can be done on a single commodity machine even with substantial oversampling. Therefore there is computational headroom available for large oversampling. Ultimately the binding constraint is the feasibility of storing the sketched factors in main memory.
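The core primitive here, a randomized range finder with power iteration and oversampling [9], can be sketched generically as follows. This is the textbook primitive rather than the paper's Algorithm 1 verbatim, and the function name is our own; re-orthonormalization between products guards against loss of precision during power iteration.

```python
import numpy as np

def randomized_range(A, k, p, q, seed=0):
    """Randomized range finder with oversampling p and q power
    iterations [9]; returns an orthonormal basis whose span
    approximates the top-(k + p) range of A."""
    rng = np.random.default_rng(seed)
    # Sketch the range with a Gaussian test matrix, oversampled by p.
    Q = np.linalg.qr(A @ rng.standard_normal((A.shape[1], k + p)))[0]
    for _ in range(q):
        # Power iteration sharpens the spectrum; re-orthonormalize
        # after each product to preserve numerical accuracy.
        Q = np.linalg.qr(A.T @ Q)[0]
        Q = np.linalg.qr(A @ Q)[0]
    return Q
```

On a matrix with power-law singular value decay, increasing the oversampling p drives the residual well below the error of a bare rank-k projection, which is the effect the text relies on.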
4 Experimental Results
Europarl is a collection of simultaneously translated documents extracted from the proceedings of the European Parliament [11]. Multilingual alignment is available at the individual sentence level. We used a single random 9:1 split of sentences into train and test sets for all experiments. We processed each sentence into a fixed-dimensional representation using a bag-of-words representation composed with inner-product-preserving hashing [16], with a fixed number of hash slots.^3 We used English for the X design matrix and Greek for the Y design matrix. Note the ultimate embedding produced by this procedure is the composition of the hashing strategy with the projections found by RandomizedCCA.

^3 The hashing strategy generates a feature space in which many features never occur. To reduce memory requirements, we lazily materialize the rows of X and Y.
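The hashed bag-of-words preprocessing can be illustrated with a small sketch. The hash functions below (CRC32 for the slot, Adler-32 for the sign) are arbitrary stand-ins for illustration, not the ones used in the experiments; the signed-hashing construction itself follows [16].

```python
import numpy as np
import zlib

def hash_bow(tokens, n_slots):
    """Inner-product-preserving feature hashing [16] (minimal sketch):
    each token is hashed to a slot, and a second hash supplies a +/-1
    sign so that inner products between hashed vectors are unbiased
    estimates of inner products between the original bag-of-words
    vectors."""
    x = np.zeros(n_slots)
    for tok in tokens:
        b = tok.encode("utf-8")
        slot = zlib.crc32(b) % n_slots          # which coordinate
        sign = 1.0 if zlib.adler32(b) % 2 == 0 else -1.0  # +/-1 sign
        x[slot] += sign
    return x
```

The map is linear in the token multiset, so sentence representations can be accumulated token by token, and rows need only be materialized lazily, as in the experiments.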
The top-2000 spectrum of Q_x^T Q_y, as estimated by a two-pass randomized SVD, is shown in Figure 1. This provides some intuition as to why the top range of Q_x^T Q_y should generate an excellent approximation to the optimal CCA solution: the spectrum exhibits power-law decay and ultimately decreases to a point comparable to a plausible regularization parameter setting.

Figure 1(a) shows the sum of the first 60 canonical correlations found by RandomizedCCA as the hyperparameters of the algorithm (the oversampling and the number of passes) are varied. The regularization parameters are set using a scale-free parameterization. Figure 1(a) indicates that with sufficient oversampling RandomizedCCA can achieve an objective value close to that achieved with Horst iteration. Note that in all cases the solutions found are feasible to machine precision, i.e., each projection has (regularized) identity covariance and the cross-covariance is diagonal.

Table 1(a) shows single-node running times^4 and objective values for RandomizedCCA with selected hyperparameter values and for Horst iteration. This table indicates that, when iteration is inexpensive (such as when all data fits in core on a single node), Horst iteration^5 is more efficient when a high-precision result is desired. Under these conditions RandomizedCCA is complementary to Horst iteration, as it provides an inexpensive initialization strategy, indicated in the table as Horst+rcca, where we initialized Horst iteration using a solution from RandomizedCCA. The overall running time to achieve the same accuracy, including the time for computing the initializer, is lower for Horst+rcca. Furthermore, the total number of data passes to achieve the same accuracy is reduced from 120 to 34.

^4 Not including I/O (all data fits in core) and preprocessing.
^5 Gauss-Seidel variant with approximate least-squares solves and a Gaussian random initializer.
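Feasibility in the sense used above can be checked directly from a candidate solution. The helper below is our own (hypothetical, not from the paper): it measures how far each projected view is from identity covariance and how far the cross-covariance is from diagonal.

```python
import numpy as np

def feasibility_gap(X, Y, U, V):
    """Distance of a candidate CCA solution from feasibility: each
    projected (centered) view should have identity covariance and the
    cross-covariance should be diagonal, its diagonal holding the
    canonical correlations. Returns the three Frobenius-norm gaps."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    A, B = Xc @ U, Yc @ V
    C = A.T @ B
    return (np.linalg.norm(A.T @ A - np.eye(U.shape[1])),
            np.linalg.norm(B.T @ B - np.eye(V.shape[1])),
            np.linalg.norm(C - np.diag(np.diag(C))))
```

For an exact solution computed via QR and SVD, all three gaps are at machine precision, matching the feasibility claim in the text.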
If we view RandomizedCCA as a learning algorithm rather than an optimization algorithm, then the additional precision that Horst iteration provides may no longer be relevant, as it may not generalize to novel data. Alternatively, if sufficiently strong regularization is required for good generalization, the approximations inherent in RandomizedCCA become more accurate. Table 1(a) shows both training and test set objectives. When Horst iteration is run with the same regularization as RandomizedCCA, the training objective is better but the test objective is dramatically worse. Increasing the regularization mitigates this, but empirically Horst iteration is more sensitive to the choice of regularization, as indicated in Figure 3. This suggests that RandomizedCCA provides inherent regularization by virtue of focusing the optimization on the top range of Q_x^T Q_y, analogous to the difference between ridge regression and PCA regression [6].

5 Conclusion
We have presented RandomizedCCA, a fast approximate CCA solver which optimizes over the top range of the cross-correlation matrix. RandomizedCCA is highly amenable to distributed implementation, delivering accuracy comparable to Horst iteration while requiring far fewer data passes. Furthermore, for configurations where iteration is inexpensive, RandomizedCCA provides an inexpensive initializer for Horst iteration. Finally, when generalization is considered, preliminary experiments suggest RandomizedCCA provides beneficial regularization.
References
[1] Åke Björck and Gene H. Golub. Numerical methods for computing angles between linear subspaces. Mathematics of Computation, 27(123):579–594, 1973.
[2] Matthew B. Blaschko and Christoph H. Lampert. Correlational spectral clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2008.
[3] Kamalika Chaudhuri, Sham M. Kakade, Karen Livescu, and Karthik Sridharan. Multi-view clustering via canonical correlation analysis. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 129–136. ACM, 2009.
[4] Moody T. Chu and J. Loren Watterson. On a multivariate eigenvalue problem, part I: Algebraic theory and a power method. SIAM Journal on Scientific Computing, 14(5):1089–1106, 1993.
[5] Paramveer Dhillon, Dean P. Foster, and Lyle H. Ungar. Multi-view learning of word embeddings via CCA. In Advances in Neural Information Processing Systems, pages 199–207, 2011.
[6] Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, and Lyle H. Ungar. A risk comparison of ordinary least squares vs ridge regression. The Journal of Machine Learning Research, 14(1):1505–1511, 2013.
[7] Manaal Faruqui and Chris Dyer. Improving vector space word representations using multilingual correlation. In Proceedings of EACL. Association for Computational Linguistics, 2014.
[8] Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. Iterative quantization: A Procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2916–2929, 2013.
[9] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[10] Harold Hotelling. Relations between two sets of variates. Biometrika, pages 321–377, 1936.
[11] Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79–86, 2005.
[12] David Lopez-Paz, Suvrit Sra, Alex Smola, Zoubin Ghahramani, and Bernhard Schölkopf. Randomized nonlinear component analysis. arXiv preprint arXiv:1402.0119, 2014.
[13] Yichao Lu and Dean P. Foster. Large scale canonical correlation analysis with iterative least squares. arXiv preprint arXiv:1407.4508, 2014.
[14] Brian McWilliams, David Balduzzi, and Joachim Buhmann. Correlated random features for fast semi-supervised learning. In Advances in Neural Information Processing Systems, pages 440–448, 2013.
[15] Vladimir Vapnik and Akshay Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5):544–557, 2009.
[16] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1113–1120. ACM, 2009.
[17] Lei-Hong Zhang and Moody T. Chu. Computing absolute maximum correlation. IMA Journal of Numerical Analysis, page drq029, 2011.