1 Introduction
As the volume of data collected in the social and natural sciences increases, the computational cost of learning from large datasets has become an important consideration. For learning nonlinear relationships, kernel methods achieve excellent performance but naïvely require operations cubic in the number of training points.
Randomization has recently been considered as an alternative to optimization that, surprisingly, can yield comparable generalization performance at a fraction of the computational cost [1, 2]. Random features have been introduced to approximate kernel machines when the number of training examples is very large, rendering exact kernel computation intractable. Among several different approaches, the Nyström method for low-rank kernel approximation [1] exhibits good theoretical properties and empirical performance [3, 4, 5].
A second problem arising with large datasets concerns obtaining labels. Labeling often requires a domain expert to manually assign a label to each instance, which becomes very expensive, in both time and money, as the size of the dataset increases. Semisupervised learning aims to improve prediction by extracting useful structure from the unlabeled data points and using this in conjunction with a function learned on a small number of labeled points.
Contribution.
This paper proposes a new semisupervised algorithm for regression and classification, Correlated Nyström Views (XNV), that addresses both problems simultaneously. The method consists of essentially two steps. First, we construct two “views” using random features. We investigate two ways of doing so: one based on the Nyström method and another based on random Fourier features (so-called kitchen sinks) [6, 2]. It turns out that the Nyström method almost always outperforms Fourier features by a large margin, so we only report these results in the main text.
The second step, following [7], uses Canonical Correlation Analysis (CCA, [8, 9]) to bias the optimization procedure towards features that are correlated across the views. Intuitively, if both views contain accurate estimators, then penalizing uncorrelated features reduces variance without increasing the bias by much. Recent theoretical work by Bach [5] shows that Nyström views can be expected to contain accurate estimators.
We perform an extensive evaluation of XNV on 18 real-world datasets, comparing against a modified version of the SSSL (simple semisupervised learning) algorithm introduced in [10]. We find that XNV outperforms SSSL by around 10-15% on average, depending on the number of labeled points available, see §3. We also find that the performance of XNV exhibits dramatically less variability than SSSL, with a typical reduction of 30%.
We chose SSSL since it was shown in [10] to outperform a state-of-the-art algorithm, Laplacian Regularized Least Squares [11]. However, since SSSL does not scale up to large sets of unlabeled data, we modify SSSL by introducing a Nyström approximation to improve runtime performance. This reduces runtime substantially, with further improvements as the number of points $N$ increases. Our approximate version of SSSL outperforms kernel ridge regression (KRR) on average over the 18 datasets, in line with the results reported in [10], suggesting that we lose little by replacing the exact SSSL with our approximate implementation.
Related work.
Multiple view learning was first introduced in the co-training method of [12] and has also recently been extended to unsupervised settings [13, 14]. Our algorithm builds on an elegant proposal for multiview regression introduced in [7]. Surprisingly, despite guaranteeing improved prediction performance under a relatively weak assumption on the views, CCA regression has not been widely used since its proposal; to the best of our knowledge this is the first empirical evaluation of multiview regression’s performance. A possible reason for this is the difficulty in obtaining naturally occurring data equipped with multiple views that can be shown to satisfy the multiview assumption. We overcome this problem by constructing random views that satisfy the assumption by design.
2 Method
This section introduces XNV, our semisupervised learning method. The method builds on two main ideas. First, given two equally useful but sufficiently different views on a dataset, penalizing regression using the canonical norm (computed via CCA) can substantially improve performance [7]. The second is the Nyström method for constructing random features [1], which we use to construct the views.
2.1 Multiview regression
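Suppose we have data $T = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ for $x_i \in \mathbb{R}^D$ and $y_i \in \mathbb{R}$, sampled according to joint distribution $P(x, y)$. Further suppose we have two views on the data
$$z^{(\nu)} : \mathbb{R}^D \to \mathcal{H}^{(\nu)} = \mathbb{R}^M : x \mapsto z^{(\nu)}(x) =: z^{(\nu)} \quad \text{for } \nu \in \{1, 2\}.$$
We make the following assumption about linear regressors which can be learned on these views.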
Assumption 1 (Multiview assumption [7]).
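Define mean-squared error loss function $\ell(g(x), y) = (g(x) - y)^2$ and let $\mathrm{loss}(g) := \mathbb{E}_P\, \ell(g(x), y)$. Further let $L(\mathcal{Z})$ denote the space of linear maps from a linear space $\mathcal{Z}$ to the reals, and define:
$$f^{(\nu)} := \operatorname{argmin}_{g \in L(\mathcal{H}^{(\nu)})} \mathrm{loss}(g) \ \text{ for } \nu \in \{1, 2\} \quad \text{and} \quad f := \operatorname{argmin}_{g \in L(\mathcal{H}^{(1)} \oplus \mathcal{H}^{(2)})} \mathrm{loss}(g).$$
The multiview assumption is that
$$\mathrm{loss}(f^{(\nu)}) - \mathrm{loss}(f) \le \epsilon \quad \text{for } \nu \in \{1, 2\}. \tag{1}$$
In short, the best predictor in each view is within $\epsilon$ of the best overall predictor.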
Canonical correlation analysis.
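Canonical correlation analysis [8, 9] extends principal component analysis (PCA) from one to two sets of variables. CCA finds bases for the two sets of variables such that the correlation between projections onto the bases is maximized.
The first pair of canonical basis vectors, $(b_1^{(1)}, b_1^{(2)})$, is found by solving
$$\operatorname{argmax}_{b^{(1)}, b^{(2)} \in \mathbb{R}^M} \mathrm{corr}\big(b^{(1)\top} z^{(1)}, b^{(2)\top} z^{(2)}\big). \tag{2}$$
Subsequent pairs are found by maximizing correlations subject to being orthogonal to previously found pairs. The result of performing CCA is two sets of bases, $B^{(\nu)} = [b_1^{(\nu)}, \ldots, b_M^{(\nu)}]$ for $\nu \in \{1, 2\}$, such that the projection of $z^{(\nu)}$ onto $B^{(\nu)}$, which we denote $\bar{z}^{(\nu)}$, satisfies

Orthogonality: $\mathbb{E}_T\big[\bar{z}_j^{(\nu)} \bar{z}_k^{(\nu)}\big] = \delta_{jk}$, where $\delta_{jk}$ is the Kronecker delta, and

Correlation: $\mathbb{E}_T\big[\bar{z}_j^{(1)} \bar{z}_k^{(2)}\big] = \lambda_j \cdot \delta_{jk}$, where w.l.o.g. we assume $1 \ge \lambda_1 \ge \lambda_2 \ge \cdots \ge 0$.
$\lambda_j$ is referred to as the $j$th canonical correlation coefficient.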
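As an illustration of the two properties above, the following is a minimal NumPy sketch of CCA computed by whitening each view and taking an SVD of the whitened cross-covariance. The function name `cca`, the symbols `B1`, `B2`, `lam`, and the small stabilising term `reg` are assumptions made for this sketch; it uses centered features and is not the randomized algorithm of [16] used later in the paper.

```python
import numpy as np

def cca(Z1, Z2, reg=1e-8):
    """Return CCA bases B1, B2 and canonical correlations for two views.

    Z1, Z2: arrays of shape (n_samples, n_features), one row per data point.
    reg: small ridge term added to the covariances for numerical stability
         (an assumption of this sketch, not part of the definition of CCA).
    """
    n = Z1.shape[0]
    Z1 = Z1 - Z1.mean(axis=0)
    Z2 = Z2 - Z2.mean(axis=0)
    C11 = Z1.T @ Z1 / n + reg * np.eye(Z1.shape[1])
    C22 = Z2.T @ Z2 / n + reg * np.eye(Z2.shape[1])
    C12 = Z1.T @ Z2 / n

    def inv_sqrt(C):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(C)
        return (V / np.sqrt(w)) @ V.T

    W1, W2 = inv_sqrt(C11), inv_sqrt(C22)
    # Singular values of the whitened cross-covariance are the canonical
    # correlations, returned in decreasing order.
    U, s, Vt = np.linalg.svd(W1 @ C12 @ W2)
    B1 = W1 @ U
    B2 = W2 @ Vt.T
    return B1, B2, np.clip(s, 0.0, 1.0)
```

Projecting the centered views onto `B1` and `B2` gives coordinates whose within-view covariance is (up to `reg`) the identity and whose cross-view covariance is diagonal with the canonical correlations on the diagonal.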
Definition 1 (canonical norm).
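Given vector $\bar{z}^{(\nu)}$ in the canonical basis, define its canonical norm as
$$\|\bar{z}^{(\nu)}\|_{CCA} := \sqrt{\sum_{j=1}^{M} \frac{1 - \lambda_j}{\lambda_j} \big(\bar{z}_j^{(\nu)}\big)^2}.$$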
Canonical ridge regression.
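Assume we observe $n$ pairs of views coupled with real-valued labels $\{z_i^{(1)}, z_i^{(2)}, y_i\}_{i=1}^n$. Canonical ridge regression finds coefficients $\hat\beta^{(\nu)} = [\hat\beta_1^{(\nu)}, \ldots, \hat\beta_M^{(\nu)}]^\top$ such that
$$\hat\beta^{(\nu)} := \operatorname{argmin}_{\beta} \frac{1}{n} \sum_{i=1}^n \big(y_i - \beta^\top \bar{z}_i^{(\nu)}\big)^2 + \|\beta\|_{CCA}^2. \tag{3}$$
The resulting estimator, referred to as the canonical shrinkage estimator, is
$$\hat\beta_j^{(\nu)} = \frac{\lambda_j}{n} \sum_{i=1}^n \bar{z}_{i,j}^{(\nu)} y_i. \tag{4}$$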
Penalizing with the canonical norm biases the optimization towards features that are highly correlated across the views. Good regressors exist in both views by Assumption 1. Thus, intuitively, penalizing uncorrelated features significantly reduces variance, without increasing the bias by much. More formally:
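Theorem 1 (canonical ridge regression, [7]).
Assume $\mathbb{E}[y^2 \mid x] \le 1$ and that Assumption 1 holds. Let $f_{\hat\beta}^{(\nu)}$ denote the estimator constructed with the canonical shrinkage estimator, Eq. (4), on training set $T$, and let $f$ denote the best linear predictor across both views. For $\nu \in \{1, 2\}$ we have
$$\mathbb{E}_T\big[\mathrm{loss}(f_{\hat\beta}^{(\nu)})\big] - \mathrm{loss}(f) \le 5\epsilon + \frac{\sum_{j=1}^M \lambda_j^2}{n},$$
where the expectation is with respect to training sets $T$ sampled from $P(x, y)$.
The first term, $5\epsilon$, bounds the bias of the canonical estimator, whereas the second, $\frac{1}{n}\sum_j \lambda_j^2$, bounds the variance. The $\sum_j \lambda_j^2$ can be thought of as a measure of the “intrinsic dimensionality” of the unlabeled data, which controls the rate of convergence. If the canonical correlation coefficients decay sufficiently rapidly, then the increase in bias is more than made up for by the decrease in variance.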
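The relationship between canonical ridge regression, Eq. (3), and the canonical shrinkage estimator, Eq. (4), can be checked numerically. The sketch below assumes, following [7], that the canonical norm weights the $j$th coordinate by $(1-\lambda_j)/\lambda_j$ and that the features in the CCA basis have identity empirical second moment; under these assumptions the ridge solution reduces to scaling each least-squares coefficient by $\lambda_j$. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def canonical_shrinkage(Zbar, y, lam):
    """Closed-form estimator: beta_j = (lam_j / n) * sum_i Zbar[i, j] * y[i]."""
    n = Zbar.shape[0]
    return lam * (Zbar.T @ y) / n

def canonical_ridge(Zbar, y, lam):
    """Minimise (1/n) * ||y - Zbar @ beta||^2 + sum_j (1 - lam_j) / lam_j * beta_j^2."""
    n = Zbar.shape[0]
    penalty = np.diag((1.0 - lam) / lam)
    return np.linalg.solve(Zbar.T @ Zbar / n + penalty, Zbar.T @ y / n)

# Synthetic features with identity empirical second moment (Zbar.T @ Zbar / n = I).
rng = np.random.default_rng(0)
n, M = 200, 3
Q, _ = np.linalg.qr(rng.normal(size=(n, M)))
Zbar = np.sqrt(n) * Q
y = rng.normal(size=n)
lam = np.array([0.9, 0.5, 0.1])  # hypothetical canonical correlations

beta_closed = canonical_shrinkage(Zbar, y, lam)
beta_ridge = canonical_ridge(Zbar, y, lam)
```

Coordinates with canonical correlation close to 1 are left almost untouched, while weakly correlated coordinates are shrunk towards zero.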
2.2 Constructing random views
We construct two views satisfying Assumption 1 in expectation, see Theorem 3 below. To ensure our method scales to large sets of unlabeled data, we use random features generated using the Nyström method [1].
Suppose we have data $\{x_i\}_{i=1}^N$. When $N$ is very large, constructing and manipulating the $N \times N$ Gram matrix $[K]_{ii'} = \langle \phi(x_i), \phi(x_{i'}) \rangle = \kappa(x_i, x_{i'})$ is computationally expensive. Here, $\phi(x)$ defines a mapping from $\mathbb{R}^D$ to a high-dimensional feature space and $\kappa(\cdot, \cdot)$ is a positive semidefinite kernel function.
The idea behind random features is to instead define a lower-dimensional mapping, $z(x_i) : \mathbb{R}^D \to \mathbb{R}^M$, through a random sampling scheme such that $[K]_{ii'} \approx z(x_i)^\top z(x_{i'})$ [15, 6]. Thus, using random features, nonlinear functions in $x$ can be learned as linear functions in $z(x)$, leading to significant computational speed-ups. Here we give a brief overview of the Nyström method, which uses random subsampling to approximate the Gram matrix.
The Nyström method.
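Fix an $M \ll N$ and randomly (uniformly) sample a subset $\mathcal{M} = \{\hat{x}_i\}_{i=1}^M$ of $M$ points from the data $\{x_i\}_{i=1}^N$. Let $\hat{K}$ denote the Gram matrix $[\hat{K}]_{ii'}$ where $i, i' \in \mathcal{M}$. The Nyström method [1, 3] constructs a low-rank approximation to the Gram matrix as
$$[\tilde{K}]_{ii'} := k_i^\top \hat{K}^\dagger k_{i'}, \qquad k_i := [\kappa(x_i, \hat{x}_1), \ldots, \kappa(x_i, \hat{x}_M)]^\top, \tag{5}$$
where $\hat{K}^\dagger \in \mathbb{R}^{M \times M}$ is the pseudo-inverse of $\hat{K}$. Vectors of random features can be constructed as
$$z(x_i) = \hat{D}^{-1/2} \hat{V}^\top k_i,$$
where the columns of $\hat{V}$ are the eigenvectors of $\hat{K}$, with $\hat{D}$ the diagonal matrix whose entries are the corresponding eigenvalues. Constructing features in this way reduces the time complexity of learning a nonlinear prediction function from $O(N^3)$ to $O(N)$ [15].
An alternative perspective on the Nyström approximation, that will be useful below, is as follows. Consider integral operators
$$L_N[f](\cdot) := \frac{1}{N} \sum_{i=1}^N \kappa(x_i, \cdot) f(x_i) \quad \text{and} \quad L_M[f](\cdot) := \frac{1}{M} \sum_{i=1}^M \kappa(\hat{x}_i, \cdot) f(\hat{x}_i), \tag{6}$$
and introduce Hilbert space $\hat{\mathcal{H}} = \mathrm{span}\{\hat\varphi_1, \ldots, \hat\varphi_r\}$ where $r$ is the rank of $\hat{K}$ and the $\hat\varphi_i$ are the first $r$ eigenfunctions of $L_M$. Then the following proposition shows that using the Nyström approximation is equivalent to performing linear regression in the feature space (“view”) $z : \mathcal{X} \to \hat{\mathcal{H}}$ spanned by the eigenfunctions of linear operator $L_M$ in Eq. (6):
Proposition 2 (random Nyström view, [3]).
Solving
$$\min_{w \in \mathbb{R}^r} \frac{1}{N} \sum_{i=1}^N \ell(w^\top z(x_i), y_i) + \frac{\gamma}{2} \|w\|_2^2 \tag{7}$$
is equivalent to solving
$$\min_{f \in \hat{\mathcal{H}}} \frac{1}{N} \sum_{i=1}^N \ell(f(x_i), y_i) + \frac{\gamma}{2} \|f\|_{\mathcal{H}_\kappa}^2. \tag{8}$$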
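The Nyström feature construction described in this section can be sketched in a few lines of NumPy. The sketch assumes a Gaussian kernel and uniform sampling of $M$ landmark points; the helper names `gaussian_kernel` and `nystrom_features`, the kernel width `sigma`, and the eigenvalue cut-off `tol` (which implements the pseudo-inverse by dropping near-zero eigenvalues) are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian kernel matrix between the rows of X and the rows of Y."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def nystrom_features(X, M, sigma=1.0, rng=None, tol=1e-10):
    """Sample M landmarks uniformly from X and return (feature map, landmark indices)."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(X.shape[0], size=M, replace=False)
    landmarks = X[idx]
    # Eigendecomposition of the M x M Gram matrix on the landmarks.
    w, V = np.linalg.eigh(gaussian_kernel(landmarks, landmarks, sigma))
    keep = w > tol * w.max()          # discard numerically zero eigenvalues
    proj = V[:, keep] / np.sqrt(w[keep])

    def transform(X_new):
        # z(x) = D^{-1/2} V^T [k(x, landmark_1), ..., k(x, landmark_M)]^T
        return gaussian_kernel(X_new, landmarks, sigma) @ proj

    return transform, idx

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
transform, idx = nystrom_features(X, M=20, sigma=1.0, rng=1)
Z = transform(X)                      # one row of random features per point
K_approx = Z @ Z.T                    # low-rank approximation to the Gram matrix
```

By construction the approximation reproduces the Gram matrix exactly on the sampled landmarks, and only the $N \times M$ block of kernel evaluations is ever formed.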
2.3 The proposed algorithm: Correlated Nyström Views (XNV)
Algorithm 1 details our approach to semisupervised learning based on generating two views consisting of Nyström random features and penalizing features which are weakly correlated across views. The setting is that we have labeled data $\{x_i, y_i\}_{i=1}^n$ and a large amount of unlabeled data $\{x_i\}_{i=n+1}^N$.
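Algorithm 1 Correlated Nyström Views (XNV).
Input: Labeled data: $\{x_i, y_i\}_{i=1}^n$ and unlabeled data: $\{x_i\}_{i=n+1}^N$
1: Generate features. Sample $\hat{x}_1, \ldots, \hat{x}_{2M}$ uniformly from the dataset, compute the eigendecompositions of the sub-sampled kernel matrices $\hat{K}^{(1)}$ and $\hat{K}^{(2)}$, which are constructed from the samples $1, \ldots, M$ and $M+1, \ldots, 2M$ respectively, and featurize the input: $z^{(\nu)}(x_i) \leftarrow \hat{D}^{(\nu),-1/2} \hat{V}^{(\nu)\top} k_i^{(\nu)}$ for $\nu \in \{1, 2\}$, where $k_i^{(\nu)}$ is the vector of kernel evaluations between $x_i$ and the $M$ samples of view $\nu$.
2: Unlabeled data. Compute CCA bases $B^{(1)}, B^{(2)}$ and canonical correlations $\lambda_1, \ldots, \lambda_M$ for the two views and set $\bar{z}_i \leftarrow B^{(1)\top} z^{(1)}(x_i)$.
3: Labeled data. Solve
$$\hat\beta = \operatorname{argmin}_{\beta} \frac{1}{n} \sum_{i=1}^n \ell(\beta^\top \bar{z}_i, y_i) + \|\beta\|_{CCA}^2 + \gamma \|\beta\|_2^2. \tag{9}$$
Output: $\hat\beta$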
Step 1 generates a set of random features. The next two steps implement multiview regression using the randomly generated views $z^{(1)}(x)$ and $z^{(2)}(x)$. Eq. (9) yields a solution for which unimportant features are heavily downweighted in the CCA basis without introducing an additional tuning parameter. The further penalty on the $\ell_2$ norm (in the CCA basis) is introduced as a practical measure to control the variance of the estimator $\hat\beta$, which can become large if there are many highly correlated features (i.e. the ratio $\frac{1-\lambda_j}{\lambda_j} \approx 0$ for large $j$). In practice most of the shrinkage is due to the CCA norm, as indicated by the optimal values of $\gamma$ obtained by cross-validation.
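The three steps described above can be sketched end to end as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: it uses a Gaussian kernel, exact (whitening plus SVD) CCA instead of the randomized algorithm of [16], centered features, squared error loss, and a penalty of $(1-\lambda_j)/\lambda_j + \gamma$ on the $j$th coefficient. All function and parameter names (`xnv_fit`, `nystrom_map`, `reg`, the clipping floor `1e-6`) are hypothetical.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def nystrom_map(landmarks, sigma, tol=1e-10):
    """Return the map x -> D^{-1/2} V^T [k(x, landmark_1), ..., k(x, landmark_M)]."""
    w, V = np.linalg.eigh(gaussian_kernel(landmarks, landmarks, sigma))
    keep = w > tol * w.max()
    proj = V[:, keep] / np.sqrt(w[keep])
    return lambda X: gaussian_kernel(X, landmarks, sigma) @ proj

def inv_sqrt(C):
    w, V = np.linalg.eigh(C)
    return (V / np.sqrt(w)) @ V.T

def xnv_fit(X_lab, y, X_unlab, M=20, sigma=1.0, gamma=1e-3, seed=0, reg=1e-8):
    rng = np.random.default_rng(seed)
    X_all = np.vstack([X_lab, X_unlab])
    N, n = len(X_all), len(X_lab)

    # Step 1: two Nystrom views built from 2M uniformly sampled points.
    idx = rng.choice(N, size=2 * M, replace=False)
    view1 = nystrom_map(X_all[idx[:M]], sigma)
    view2 = nystrom_map(X_all[idx[M:]], sigma)
    Z1, Z2 = view1(X_all), view2(X_all)

    # Step 2: CCA on labeled and unlabeled points (labels are not needed).
    mu1 = Z1.mean(axis=0)
    Z1c, Z2c = Z1 - mu1, Z2 - Z2.mean(axis=0)
    W1 = inv_sqrt(Z1c.T @ Z1c / N + reg * np.eye(Z1.shape[1]))
    W2 = inv_sqrt(Z2c.T @ Z2c / N + reg * np.eye(Z2.shape[1]))
    U, s, _ = np.linalg.svd(W1 @ (Z1c.T @ Z2c / N) @ W2)
    lam = np.clip(s, 1e-6, 1.0)       # canonical correlations, floored to avoid 1/0
    B1 = W1 @ U[:, :len(lam)]         # CCA basis for the first view

    # Step 3: regression on the labeled points with CCA norm plus l2 penalty.
    Zbar = Z1c[:n] @ B1
    penalty = np.diag((1.0 - lam) / lam + gamma)
    y_mean = y.mean()
    beta = np.linalg.solve(Zbar.T @ Zbar / n + penalty, Zbar.T @ (y - y_mean) / n)

    def predict(X_new):
        return (view1(X_new) - mu1) @ B1 @ beta + y_mean

    return predict
```

A typical call is `predict = xnv_fit(X_lab, y, X_unlab)` followed by `predict(X_test)`; the unlabeled points only enter through the landmark sampling and the CCA step.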
Computational complexity.
XNV is extremely fast. Nyström sampling, step 1, reduces the $O(N^3)$ operations required for kernel learning to $O(N)$. Computing the CCA basis, step 2, using standard algorithms is in $O(NM^2)$. However, we reduce the runtime to $O(NM)$ by applying a recently proposed randomized CCA algorithm of [16]. Finally, step 3 is a computationally cheap linear program on $n$ samples and $M$ features.
Performance guarantees.
The quality of the kernel approximation in (5) has been the subject of detailed study in recent years leading to a number of strong empirical and theoretical results [3, 4, 5, 15]. Recent work of Bach [5] provides theoretical guarantees on the quality of Nyström estimates in the fixed design setting that are relevant to our approach. (Extending to a random design requires techniques from [17].)
Theorem 3 (Nyström generalization bound, [5]).
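Let $\xi \in \mathbb{R}^N$ be a random vector with finite variance and zero mean, $y = [y_1, \ldots, y_N]^\top$, and define smoothed estimate $\hat{y}_{\text{kernel}} := (K + N\gamma I)^{-1} K (y + \xi)$ and smoothed Nyström estimate $\hat{y}_{\text{Nyström}} := (\tilde{K} + N\gamma I)^{-1} \tilde{K} (y + \xi)$, both computed by minimizing the MSE with ridge penalty $\gamma$. Let $\eta \in (0, 1)$. For sufficiently large $M$ (depending on $\eta$, see [5]), we have
$$\mathbb{E}_{\mathcal{M}} \mathbb{E}_{\xi} \big[ \|y - \hat{y}_{\text{Nyström}}\|_2^2 \big] \le (1 + 4\eta) \cdot \mathbb{E}_{\xi} \big[ \|y - \hat{y}_{\text{kernel}}\|_2^2 \big],$$
where $\mathbb{E}_{\mathcal{M}}$ refers to the expectation over subsampled columns used to construct $\tilde{K}$.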
Random Fourier Features.
An alternative approach to constructing random views is to use Fourier features instead of Nyström features in Step 1. We refer to this approach as Correlated Kitchen Sinks (XKS) after [2]. It turns out that the performance of XKS is consistently worse than XNV, in line with the detailed comparison presented in [3]. We therefore do not discuss Fourier features in the main text, see §SI.3 for details on implementation and experimental results.
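For comparison, random Fourier features for a Gaussian kernel can be generated with the standard cosine construction of [6]. The sketch below is an assumption about how such features would be produced for Step 1; the function name `random_fourier_features` and its parameters are hypothetical, and the details of the XKS implementation are those given in the supplementary material, not this block.

```python
import numpy as np

def random_fourier_features(X, M, sigma=1.0, seed=0):
    """Map rows of X to M random features approximating a Gaussian kernel.

    The inner product z(x)^T z(x') approximates exp(-||x - x'||^2 / (2 sigma^2)).
    """
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    # Frequencies drawn from the Fourier transform of the Gaussian kernel.
    W = rng.normal(scale=1.0 / sigma, size=(D, M))
    # Random phases, uniform on [0, 2*pi).
    b = rng.uniform(0.0, 2.0 * np.pi, size=M)
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)
```

Unlike Nyström features, these features are drawn independently of the data, which is one explanation offered in [3] for the performance gap between the two approaches.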
2.4 A fast approximation to SSSL
The SSSL (simple semisupervised learning) algorithm proposed in [10] finds the first $s$ eigenfunctions $\varphi_i$ of the integral operator $L_N$ in Eq. (6) and then solves
$$\operatorname{argmin}_{w \in \mathbb{R}^s} \sum_{i=1}^n \Big( \sum_{j=1}^s w_j \varphi_j(x_i) - y_i \Big)^2, \tag{10}$$
where $s$ is set by the user. SSSL outperforms Laplacian Regularized Least Squares [11], a state-of-the-art semisupervised learning method, see [10]. It also has good generalization guarantees under reasonable assumptions on the distribution of eigenvalues of $L_N$. However, since SSSL requires computing the full $N \times N$ Gram matrix, it is extremely computationally intensive for large $N$. Moreover, tuning $s$ is difficult since it is discrete.
We therefore propose SSSL$_M$, an approximation to SSSL. First, instead of constructing the full Gram matrix, we construct a Nyström approximation by sampling $M$ points from the labeled and unlabeled training set. Second, instead of thresholding eigenfunctions, we use the easier-to-tune ridge penalty, which penalizes directions proportional to the inverse square of their eigenvalues [18].
As justification, note that Proposition 2 states that the Nyström approximation to kernel regression actually solves a ridge regression problem in the span of the eigenfunctions of $L_M$. As $M$ increases, the span of the eigenfunctions of $L_M$ tends towards that of $L_N$ [15]. We will also refer to the Nyström approximation to SSSL using $2M$ features as SSSL$_{2M}$. See experiments below for further discussion of the quality of the approximation.
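The equivalence invoked above, between ridge regression on random features and kernel ridge regression with the corresponding low-rank Gram matrix, can be verified numerically. The sketch below assumes squared error loss and a penalty $\gamma \|w\|_2^2$ (without the factor $1/2$), and uses a generic feature matrix `Z` standing in for Nyström features; the function names are hypothetical.

```python
import numpy as np

def ridge_primal(Z, y, gamma):
    """Weights minimising (1/n) * ||Z @ w - y||^2 + gamma * ||w||^2."""
    n, d = Z.shape
    return np.linalg.solve(Z.T @ Z + n * gamma * np.eye(d), Z.T @ y)

def ridge_dual_fitted(K_tilde, y, gamma):
    """Fitted values of kernel ridge regression with Gram matrix K_tilde."""
    n = K_tilde.shape[0]
    return K_tilde @ np.linalg.solve(K_tilde + n * gamma * np.eye(n), y)

rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 10))   # stand-in for an N x M matrix of random features
y = rng.normal(size=100)
gamma = 1e-2

fitted_primal = Z @ ridge_primal(Z, y, gamma)          # solves an M x M system
fitted_dual = ridge_dual_fitted(Z @ Z.T, y, gamma)     # solves an N x N system
```

Both routes give the same fitted values, but the primal form only requires solving a system whose size is the number of features, which is the source of the computational saving.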
3 Experiments
Setup.
We evaluate the performance of XNV on 18 real-world datasets, see Table 1. The datasets cover a variety of regression (denoted by R) and two-class classification (C) problems. The sarcos dataset involves predicting the joint position of a robot arm; following convention we report results on the 1st, 5th and 7th joint positions.
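Table 1: Datasets used for evaluation.
Set  |  Name  |  Task  |  Source
1  |  abalone  |  C  |  UCI repository, http://archive.ics.uci.edu/ml/datasets.html
2  |  adult  |  C  |  UCI repository
3  |  ailerons  |  R  |  http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html
4  |  bank8  |  C  |  http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html
5  |  bank32  |  C  |  http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html
6  |  cal housing  |  R  |  http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html
7  |  census  |  R  |  UCI repository
8  |  CPU  |  R  |  UCI repository
9  |  CT  |  R  |  UCI repository
10  |  elevators  |  R  |  http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html
11  |  HIVa  |  C  |  http://www.causality.inf.ethz.ch/activelearning.php
12  |  house  |  R  |  http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html
13  |  ibn Sina  |  C  |  http://www.causality.inf.ethz.ch/activelearning.php
14  |  orange  |  C  |  http://www.causality.inf.ethz.ch/activelearning.php
15  |  sarcos 1  |  R  |  http://www.gaussianprocess.org/gpml/data/
16  |  sarcos 5  |  R  |  http://www.gaussianprocess.org/gpml/data/
17  |  sarcos 7  |  R  |  http://www.gaussianprocess.org/gpml/data/
18  |  sylva  |  C  |  http://www.causality.inf.ethz.ch/activelearning.php
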
The SSSL algorithm was shown to exhibit state-of-the-art performance over fully and semisupervised methods in scenarios where few labeled training examples are available [10]. However, as discussed in §2.4, due to its computational cost we compare the performance of XNV to the Nyström approximations SSSL$_M$ and SSSL$_{2M}$.
We used a Gaussian kernel for all datasets. We set the kernel width $\sigma$ and the regularisation strength $\gamma$ for each method using 5-fold cross-validation on labeled training examples. We trained all methods using a squared error loss function, $\ell(f(x_i), y_i) = (f(x_i) - y_i)^2$, with $M$ random features and $n$ randomly selected training examples.
Runtime performance.
The SSSL algorithm of [10] is not computationally feasible on large datasets, since it has time complexity $O(N^3)$. For illustrative purposes, we report run times in seconds (computed in Matlab 7.14 on a Core i5 with 4GB memory) of the SSSL algorithm against its Nyström approximation and XNV on three datasets of different sizes.
runtimes  |  bank8  |  cal housing  |  sylva
SSSL  |  s  |  s  |  
SSSL (Nyström)  |  s  |  s  |  s
XNV  |  s  |  s  |  s
For the cal housing dataset, XNV exhibits a substantial speed-up over SSSL. For the largest dataset, sylva, exact SSSL is computationally intractable. Importantly, the computational overhead of XNV over the Nyström approximation to SSSL is small.
Generalization performance.
We report on the prediction performance averaged over 100 experiments. For regression tasks we report on the mean squared error (MSE) on the testing set normalized by the variance of the test output. For classification tasks we report the percentage of the test set that was misclassified.
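The two evaluation measures described above can be written down directly. The sketch below assumes that classification labels are coded as $\pm 1$ and that a real-valued prediction is converted to a class by taking its sign; both conventions, and the function names, are assumptions of this illustration.

```python
import numpy as np

def normalized_mse(y_true, y_pred):
    """Mean squared error on the test set divided by the variance of the test output."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def misclassification_rate(y_true, scores):
    """Percentage of test points whose predicted sign differs from the +/-1 label."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    return 100.0 * np.mean(np.sign(scores) != y_true)
```

With this normalization, always predicting the mean of the test outputs gives a normalized MSE of exactly 1, so values below 1 indicate an improvement over that trivial baseline.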
The table below shows the improvement in performance of XNV over SSSL$_M$ and SSSL$_{2M}$ (taking whichever performs better out of SSSL$_M$ or SSSL$_{2M}$ on each dataset), averaged over all 18 datasets. Observe that XNV is considerably more accurate and more robust than SSSL.
XNV vs SSSL  

Avg reduction in error  11%  16%  15%  12%  9% 
Avg reduction in std err  15%  30%  31%  33%  30% 
The reduced variability is to be expected from Theorem 1.
Figure 1: Comparison of mean prediction error and standard deviation on a selection of datasets.
Table 2 presents a more detailed comparison of performance for individual datasets. The plots in Figure 1 show a representative comparison of mean prediction errors for several datasets. Error bars represent one standard deviation. Observe that XNV almost always improves prediction accuracy and reduces variance compared with SSSL$_M$ and SSSL$_{2M}$ when the labeled training set contains between 100 and 500 labeled points. A complete set of results is provided in §SI.1.
Discussion of SSSL$_M$.
Our experiments show that going from SSSL$_M$ to SSSL$_{2M}$ does not improve generalization performance in practice. This suggests that when there are few labeled points, obtaining a more accurate estimate of the eigenfunctions of the kernel does not necessarily improve predictive performance. Indeed, when more random features are added, stronger regularization is required to reduce the influence of uninformative features; this also has the effect of downweighting informative features. This suggests that the low-rank approximation SSSL$_M$ to SSSL suffices.
Random Fourier features.
Nyström features significantly outperform Fourier features, in line with observations in [3]. The table below shows the relative improvement of XNV over XKS:
XNV vs XKS  

Avg reduction in error  30%  28%  26%  25%  24% 
Avg reduction in std err  36%  44%  34%  37%  36% 
Further results and discussion for XKS are included in the supplementary material.
Table 2: Performance (normalized MSE/classification error rate) of SSSL$_M$, SSSL$_{2M}$ and XNV on datasets 1-18. Standard errors in parentheses.
4 Conclusion
We have introduced the XNV algorithm for semisupervised learning. By combining two randomly generated views of Nyström features via an efficient implementation of CCA, XNV outperforms the prior state-of-the-art, SSSL, by 10-15% (depending on the number of labeled points) on average over 18 datasets. Furthermore, XNV is over 3 orders of magnitude faster than SSSL on medium-sized datasets, with further gains as $N$ increases. An interesting research direction is to investigate using the recently developed deep CCA algorithm, which extracts higher order correlations between views [19], as a preprocessing step.
In this work we use a uniform sampling scheme for the Nyström method for computational reasons, since it has been shown to perform well empirically relative to more expensive schemes [20]. Since CCA gives us a criterion by which to measure the importance of random features, in the future we aim to investigate active sampling schemes based on canonical correlations, which may yield better performance by selecting the most informative indices to sample.
Acknowledgements.
We thank Haim Avron for help with implementing randomized CCA and Patrick Pletscher for drawing our attention to the Nyström method.
References
 [1] Williams C, Seeger M: Using the Nyström method to speed up kernel machines. In NIPS 2001.
 [2] Rahimi A, Recht B: Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Adv in Neural Information Processing Systems (NIPS) 2008.
 [3] Yang T, Li YF, Mahdavi M, Jin R, Zhou ZH: Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison. In NIPS 2012.
 [4] Gittens A, Mahoney MW: Revisiting the Nyström method for improved large-scale machine learning. In ICML 2013.
 [5] Bach F: Sharp analysis of low-rank kernel approximations. In COLT 2013.
 [6] Rahimi A, Recht B: Random Features for Large-Scale Kernel Machines. In Adv in Neural Information Processing Systems 2007.
 [7] Kakade S, Foster DP: Multi-view Regression Via Canonical Correlation Analysis. In Computational Learning Theory (COLT) 2007.
 [8] Hotelling H: Relations between two sets of variates. Biometrika 1936, 28:312–377.
 [9] Hardoon DR, Szedmak S, Shawe-Taylor J: Canonical Correlation Analysis: An Overview with Application to Learning Methods. Neural Comp 2004, 16(12):2639–2664.
 [10] Ji M, Yang T, Lin B, Jin R, Han J: A Simple Algorithm for Semi-supervised Learning with Improved Generalization Error Bound. In ICML 2012.
 [11] Belkin M, Niyogi P, Sindhwani V: Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR 2006, 7:2399–2434.
 [12] Blum A, Mitchell T: Combining labeled and unlabeled data with co-training. In COLT 1998.
 [13] Chaudhuri K, Kakade SM, Livescu K, Sridharan K: Multi-view clustering via Canonical Correlation Analysis. In ICML 2009.
 [14] McWilliams B, Montana G: Multi-view predictive partitioning in high dimensions. Statistical Analysis and Data Mining 2012, 5:304–321.
 [15] Drineas P, Mahoney MW: On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning. JMLR 2005, 6:2153–2175.
 [16] Avron H, Boutsidis C, Toledo S, Zouzias A: Efficient Dimensionality Reduction for Canonical Correlation Analysis. In ICML 2013.
 [17] Hsu D, Kakade S, Zhang T: An Analysis of Random Design Linear Regression. In COLT 2012.
 [18] Dhillon PS, Foster DP, Kakade SM, Ungar LH: A Risk Comparison of Ordinary Least Squares vs Ridge Regression. Journal of Machine Learning Research 2013, 14:1505–1511.
 [19] Andrew G, Arora R, Bilmes J, Livescu K: Deep Canonical Correlation Analysis. In ICML 2013.
 [20] Kumar S, Mohri M, Talwalkar A: Sampling methods for the Nyström method. JMLR 2012, 13:981–1006.
SI.1 Complete XNV results
Tables: performance of SSSL$_M$, SSSL$_{2M}$ and XNV on datasets 1-18.
SI.2 Comparison with Kernel Ridge Regression
W