Correlated random features for fast semi-supervised learning

06/24/2013, by Brian McWilliams et al. (ETH Zurich)

This paper presents Correlated Nyström Views (XNV), a fast semi-supervised algorithm for regression and classification. The algorithm draws on two main ideas. First, it generates two views consisting of computationally inexpensive random features. Second, XNV applies multi-view regression using Canonical Correlation Analysis (CCA) on unlabeled data to bias the regression towards useful features. It has been shown that, if the views contain accurate estimators, CCA regression can substantially reduce variance with a minimal increase in bias. Random views are justified by recent theoretical and empirical work showing that regression with random features closely approximates kernel regression, implying that random views can be expected to contain accurate estimators. We show that XNV consistently outperforms a state-of-the-art algorithm for semi-supervised learning: it substantially improves predictive performance and reduces the variability of performance on a wide variety of real-world datasets, whilst also reducing runtime by orders of magnitude.


1 Introduction

As the volume of data collected in the social and natural sciences increases, the computational cost of learning from large datasets has become an important consideration. For learning non-linear relationships, kernel methods achieve excellent performance but naïvely require operations cubic in the number of training points.

Randomization has recently been considered as an alternative to optimization that, surprisingly, can yield comparable generalization performance at a fraction of the computational cost [1, 2]. Random features have been introduced to approximate kernel machines when the number of training examples is very large, rendering exact kernel computation intractable. Among several different approaches, the Nyström method for low-rank kernel approximation [1] exhibits good theoretical properties and empirical performance [3, 4, 5].

A second problem arising with large datasets concerns obtaining labels: this often requires a domain expert to manually assign a label to each instance, which becomes very expensive in both time and money as the size of the dataset increases. Semi-supervised learning aims to improve prediction by extracting useful structure from the unlabeled data points and using it in conjunction with a function learned on a small number of labeled points.

Contribution.

This paper proposes a new semi-supervised algorithm for regression and classification, Correlated Nyström Views (XNV), that addresses both problems simultaneously. The method consists of two main steps. First, we construct two “views” using random features. We investigate two ways of doing so: one based on the Nyström method and another based on random Fourier features (so-called kitchen sinks) [6, 2]. It turns out that the Nyström method almost always outperforms Fourier features, often by a large margin, so we report only those results in the main text.

The second step, following [7], uses Canonical Correlation Analysis (CCA, [8, 9]) to bias the optimization procedure towards features that are correlated across the views. Intuitively, if both views contain accurate estimators, then penalizing uncorrelated features reduces variance without increasing the bias by much. Recent theoretical work by Bach [5] shows that Nyström views can be expected to contain accurate estimators.

We perform an extensive evaluation of XNV on 18 real-world datasets, comparing against a modified version of the SSSL (simple semi-supervised learning) algorithm introduced in [10]. We find that XNV outperforms SSSL by around 10-15% on average, depending on the number of labeled points available (see §3). We also find that the performance of XNV exhibits dramatically less variability than that of SSSL, with a typical reduction of 30%.

We chose SSSL since it was shown in [10] to outperform a state-of-the-art algorithm, Laplacian Regularized Least Squares [11]. However, since SSSL does not scale up to large sets of unlabeled data, we modify SSSL by introducing a Nyström approximation to improve runtime performance. This dramatically reduces runtime, with further improvements as the amount of unlabeled data increases. Our approximate version of SSSL outperforms kernel ridge regression (KRR) by a wide margin on the 18 datasets on average (see §SI.2), in line with the results reported in [10], suggesting that we lose little by replacing the exact SSSL with our approximate implementation.

Related work.

Multi-view learning was first introduced in the co-training method of [12] and has recently been extended to unsupervised settings [13, 14]. Our algorithm builds on an elegant proposal for multi-view regression introduced in [7]. Surprisingly, despite guaranteeing improved prediction performance under a relatively weak assumption on the views, CCA regression has not been widely used since its proposal; to the best of our knowledge, this is the first empirical evaluation of multi-view regression’s performance. A possible reason for this is the difficulty of obtaining naturally occurring data equipped with multiple views that can be shown to satisfy the multi-view assumption. We overcome this problem by constructing random views that satisfy the assumption by design.

2 Method

This section introduces XNV, our semi-supervised learning method. The method builds on two main ideas. First, given two equally useful but sufficiently different views on a dataset, penalizing regression using the canonical norm (computed via CCA) can substantially improve performance [7]. The second is the Nyström method for constructing random features [1], which we use to construct the views.

2.1 Multi-view regression

Suppose we observe pairs $(\mathbf{x}, y)$ with $\mathbf{x} \in \mathcal{X}$ and $y \in \mathbb{R}$, sampled according to a joint distribution $P(\mathbf{x}, y)$. Further suppose we have two views on the data,

  $\mathbf{x} \;\mapsto\; \big(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}\big) \in \mathcal{X}^{(1)} \times \mathcal{X}^{(2)}.$

We make the following assumption about the linear regressors that can be learned on these views.

Assumption 1 (Multi-view assumption [7]).

Define the mean-squared-error loss $\mathcal{L}(g) := \mathbb{E}_{(\mathbf{x}, y)}\big[(g(\mathbf{x}) - y)^2\big]$ and let $L(\mathcal{V})$ denote the space of linear maps from a linear space $\mathcal{V}$ to the reals. Define

  $g^{(v)} := \operatorname*{argmin}_{g \in L(\mathcal{X}^{(v)})} \mathcal{L}(g) \ \text{ for } v \in \{1, 2\}, \qquad g^{*} := \operatorname*{argmin}_{g \in L(\mathcal{X}^{(1)} \oplus \mathcal{X}^{(2)})} \mathcal{L}(g).$

The multi-view assumption is that

  $\mathcal{L}\big(g^{(v)}\big) - \mathcal{L}\big(g^{*}\big) \le \epsilon \quad \text{for } v \in \{1, 2\}. \qquad (1)$

In short, the best predictor in each view is within $\epsilon$ of the best overall predictor.

Canonical correlation analysis.

Canonical correlation analysis [8, 9] extends principal component analysis (PCA) from one to two sets of variables. CCA finds bases for the two sets of variables such that the correlation between projections onto the bases is maximized.

The first pair of canonical basis vectors, $\big(\mathbf{b}_1^{(1)}, \mathbf{b}_1^{(2)}\big)$, is found by solving

  $\operatorname*{argmax}_{\mathbf{b}^{(1)},\, \mathbf{b}^{(2)}} \ \operatorname{corr}\Big(\big\langle \mathbf{x}^{(1)}, \mathbf{b}^{(1)} \big\rangle,\ \big\langle \mathbf{x}^{(2)}, \mathbf{b}^{(2)} \big\rangle\Big). \qquad (2)$

Subsequent pairs are found by maximizing correlations subject to being orthogonal to previously found pairs. The result of performing CCA is two sets of bases, $\mathbf{B}^{(v)} = \big(\mathbf{b}_1^{(v)}, \dots, \mathbf{b}_M^{(v)}\big)$ for $v \in \{1, 2\}$, such that the projection of $\mathbf{x}^{(v)}$ onto $\mathbf{B}^{(v)}$, which we denote $\bar{\mathbf{x}}^{(v)}$, satisfies

  1. Orthogonality: $\mathbb{E}\big[\bar{x}^{(v)}_j\, \bar{x}^{(v)}_k\big] = \delta_{jk}$, where $\delta_{jk}$ is the Kronecker delta, and

  2. Correlation: $\mathbb{E}\big[\bar{x}^{(1)}_j\, \bar{x}^{(2)}_k\big] = \lambda_j\, \delta_{jk}$, where w.l.o.g. we assume $1 \ge \lambda_1 \ge \cdots \ge \lambda_M \ge 0$.

The value $\lambda_j$ is referred to as the $j$th canonical correlation coefficient.
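As a concrete illustration, the CCA bases and canonical correlations can be computed directly from two feature matrices. The sketch below is not the paper's implementation (a scalable randomized CCA [16] is used later); it solves the problem via an SVD of the whitened cross-covariance, and the small regularizer `reg` is our own addition for numerical stability.

```python
import numpy as np

def cca_bases(Z1, Z2, reg=1e-6):
    """Return CCA bases B1, B2 and canonical correlations lambda_1 >= ... >= lambda_k.

    Z1, Z2: (N, M1) and (N, M2) views, rows are samples.  Columns of B1, B2 are the
    canonical directions b_j: the projections Z1 @ B1 and Z2 @ B2 are (empirically)
    orthonormal within each view and maximally correlated across views.
    """
    N = Z1.shape[0]
    Z1 = Z1 - Z1.mean(axis=0)                             # centre both views
    Z2 = Z2 - Z2.mean(axis=0)
    C11 = Z1.T @ Z1 / N + reg * np.eye(Z1.shape[1])       # within-view covariances
    C22 = Z2.T @ Z2 / N + reg * np.eye(Z2.shape[1])
    C12 = Z1.T @ Z2 / N                                   # cross-covariance
    # Whitening transforms from Cholesky factors: W_v^T C_vv W_v = I.
    W1 = np.linalg.inv(np.linalg.cholesky(C11)).T
    W2 = np.linalg.inv(np.linalg.cholesky(C22)).T
    U, lambdas, Vt = np.linalg.svd(W1.T @ C12 @ W2, full_matrices=False)
    B1, B2 = W1 @ U, W2 @ Vt.T                            # canonical bases
    return B1, B2, np.clip(lambdas, 0.0, 1.0)
```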

Definition 1 (canonical norm).

Given a coefficient vector $\boldsymbol\beta$ expressed in the canonical basis, define its canonical norm as

  $\big\|\boldsymbol\beta\big\|_{\text{CCA}} := \sqrt{\sum_{j=1}^{M} \frac{1 - \lambda_j}{\lambda_j}\, \beta_j^{2}}.$

Canonical ridge regression.

Assume we observe $n$ pairs of views coupled with real-valued labels, $\big\{\big(\bar{\mathbf{x}}^{(1)}_i, \bar{\mathbf{x}}^{(2)}_i, y_i\big)\big\}_{i=1}^{n}$. Canonical ridge regression finds coefficients $\widehat{\boldsymbol\beta} = \big(\widehat\beta_1, \dots, \widehat\beta_M\big)$ such that

  $\widehat{\boldsymbol\beta} := \operatorname*{argmin}_{\boldsymbol\beta} \ \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \big\langle\boldsymbol\beta,\, \bar{\mathbf{x}}^{(1)}_i\big\rangle\Big)^{2} + \big\|\boldsymbol\beta\big\|_{\text{CCA}}^{2}. \qquad (3)$

The resulting estimator, referred to as the canonical shrinkage estimator, is

  $\widehat\beta_j = \frac{\lambda_j}{n}\sum_{i=1}^{n} \bar{x}^{(1)}_{ij}\, y_i. \qquad (4)$
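As a concrete illustration, Eq. (4) amounts to computing the per-coordinate covariance of the (whitened) canonical features with the labels and shrinking coordinate $j$ by its canonical correlation $\lambda_j$. A minimal sketch under these assumptions:

```python
import numpy as np

def canonical_shrinkage(Zbar, y, lambdas):
    """Canonical shrinkage estimator, Eq. (4): beta_j = (lambda_j / n) * sum_i zbar_ij * y_i.

    Zbar:    (n, M) labeled points in the canonical basis of one view,
             assumed (approximately) whitened by the CCA step.
    y:       (n,) real-valued labels.
    lambdas: (M,) canonical correlation coefficients.
    """
    n = Zbar.shape[0]
    return lambdas * (Zbar.T @ y) / n   # weakly correlated coordinates are shrunk heavily
```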

Penalizing with the canonical norm biases the optimization towards features that are highly correlated across the views. Good regressors exist in both views by Assumption 1. Thus, intuitively, penalizing uncorrelated features significantly reduces variance, without increasing the bias by much. More formally:

Theorem 1 (canonical ridge regression, [7]).

Assume $\mathbb{E}\big[y^2 \mid \mathbf{x}\big] \le 1$ and that Assumption 1 holds. Let $\widehat g$ denote the estimator constructed with the canonical shrinkage estimator, Eq. (4), on training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, and let $g^{*}$ denote the best linear predictor across both views. Then we have

  $\mathbb{E}\big[\mathcal{L}(\widehat g)\big] - \mathcal{L}\big(g^{*}\big) \;\le\; c\,\epsilon \;+\; \frac{\sum_{j=1}^{M}\lambda_j^{2}}{n},$

where $c$ is a small universal constant given explicitly in [7], and the expectation is with respect to training sets sampled from $P^{n}$.

The first term, $c\,\epsilon$, bounds the bias of the canonical estimator, whereas the second, $\frac{1}{n}\sum_{j}\lambda_j^{2}$, bounds the variance. The quantity $\sum_{j}\lambda_j^{2}$ can be thought of as a measure of the “intrinsic dimensionality” of the unlabeled data, which controls the rate of convergence. If the canonical correlation coefficients decay sufficiently rapidly, then the increase in bias is more than made up for by the decrease in variance.

2.2 Constructing random views

We construct two views satisfying Assumption 1 in expectation, see Theorem 3 below. To ensure our method scales to large sets of unlabeled data, we use random features generated using the Nyström method [1].

Suppose we have data $\{\mathbf{x}_i\}_{i=1}^{N} \subset \mathcal{X}$. When $N$ is very large, constructing and manipulating the $N \times N$ Gram matrix $\mathbf{K}$, with entries $K_{ij} = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, is computationally expensive. Here, $\phi$ defines a mapping from $\mathcal{X}$ to a high-dimensional feature space and $\kappa$ is a positive semi-definite kernel function.

The idea behind random features is to instead define a lower-dimensional mapping $\mathbf{z}: \mathcal{X} \to \mathbb{R}^{M}$ through a random sampling scheme such that $\langle \mathbf{z}(\mathbf{x}_i), \mathbf{z}(\mathbf{x}_j) \rangle \approx \kappa(\mathbf{x}_i, \mathbf{x}_j)$ [15, 6]. Thus, using random features, non-linear functions in $\mathbf{x}$ can be learned as linear functions in $\mathbf{z}(\mathbf{x})$, leading to significant computational speed-ups. Here we give a brief overview of the Nyström method, which uses random subsampling of the data to approximate the Gram matrix.

The Nyström method.

Fix an $M \ll N$ and sample $M$ points uniformly at random from the data; collect them as $\widehat{\mathbf{X}} = \{\hat{\mathbf{x}}_1, \dots, \hat{\mathbf{x}}_M\}$. Let $\widehat{\mathbf{K}}$ denote the $M \times M$ Gram matrix with $\widehat{K}_{ij} = \kappa(\hat{\mathbf{x}}_i, \hat{\mathbf{x}}_j)$. The Nyström method [1, 3] constructs a low-rank approximation to the full Gram matrix as

  $\widetilde{\mathbf{K}} = \mathbf{K}_{NM}\, \widehat{\mathbf{K}}^{+}\, \mathbf{K}_{NM}^{\top}, \qquad (5)$

where $\mathbf{K}_{NM}$ is the $N \times M$ matrix with entries $\kappa(\mathbf{x}_i, \hat{\mathbf{x}}_j)$ and $\widehat{\mathbf{K}}^{+}$ is the pseudo-inverse of $\widehat{\mathbf{K}}$. Vectors of random features can be constructed as

  $\mathbf{z}(\mathbf{x}) = \widehat{\mathbf{D}}^{-1/2}\, \widehat{\mathbf{V}}^{\top} \big(\kappa(\mathbf{x}, \hat{\mathbf{x}}_1), \dots, \kappa(\mathbf{x}, \hat{\mathbf{x}}_M)\big)^{\top},$

where the columns of $\widehat{\mathbf{V}}$ are the eigenvectors of $\widehat{\mathbf{K}}$ and $\widehat{\mathbf{D}}$ is the diagonal matrix whose entries are the corresponding eigenvalues. Constructing features in this way reduces the time complexity of learning a non-linear prediction function from $O(N^3)$ to $O(NM^2)$ [15].
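A minimal sketch of this feature map, assuming a Gaussian kernel with width `sigma` (the kernel choice, parameter names, and tolerance are ours); near-zero eigenvalues are dropped in place of a full pseudo-inverse.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def nystrom_features(X, landmarks, sigma, tol=1e-10):
    """Map X to Nystrom features z(x) = D^{-1/2} V^T (k(x, x_hat_1), ..., k(x, x_hat_M))."""
    K_hat = gaussian_kernel(landmarks, landmarks, sigma)   # M x M subsampled Gram matrix
    evals, evecs = np.linalg.eigh(K_hat)                   # eigendecomposition of K_hat
    keep = evals > tol                                     # drop near-zero directions
    D_inv_sqrt = 1.0 / np.sqrt(evals[keep])
    K_nm = gaussian_kernel(X, landmarks, sigma)            # N x M cross-kernel matrix
    return K_nm @ evecs[:, keep] * D_inv_sqrt              # N x M' feature matrix
```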

An alternative perspective on the Nyström approximation, which will be useful below, is as follows. Consider the integral operators

  $[\mathcal{K} g](\mathbf{x}) = \int \kappa(\mathbf{x}, \mathbf{x}')\, g(\mathbf{x}')\, dP(\mathbf{x}') \qquad \text{and} \qquad [\widehat{\mathcal{K}} g](\mathbf{x}) = \frac{1}{M}\sum_{i=1}^{M} \kappa(\mathbf{x}, \hat{\mathbf{x}}_i)\, g(\hat{\mathbf{x}}_i), \qquad (6)$

and introduce the Hilbert space $\widehat{\mathcal{H}} := \operatorname{span}\{\hat\varphi_1, \dots, \hat\varphi_r\}$, where $r$ is the rank of $\widehat{\mathbf{K}}$ and the $\hat\varphi_j$ are the first $r$ eigenfunctions of $\widehat{\mathcal{K}}$. Then the following proposition shows that using the Nyström approximation is equivalent to performing linear regression in the feature space (“view”) spanned by the eigenfunctions of the linear operator $\widehat{\mathcal{K}}$ in Eq. (6):

Proposition 2 (random Nyström view, [3]).

Solving the kernel ridge regression problem over the approximating space $\widehat{\mathcal{H}}$,

  $\min_{f \in \widehat{\mathcal{H}}}\ \frac{1}{N}\sum_{i=1}^{N}\big(y_i - f(\mathbf{x}_i)\big)^{2} + \lambda\,\|f\|_{\mathcal{H}}^{2}, \qquad (7)$

is equivalent to solving the linear ridge regression problem in the Nyström feature space,

  $\min_{\boldsymbol\beta \in \mathbb{R}^{M}}\ \frac{1}{N}\sum_{i=1}^{N}\big(y_i - \langle\boldsymbol\beta, \mathbf{z}(\mathbf{x}_i)\rangle\big)^{2} + \lambda\,\|\boldsymbol\beta\|_{2}^{2}. \qquad (8)$
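The equivalence can be checked numerically: ridge regression on the Nyström features and kernel ridge regression with the approximate Gram matrix $\widetilde{\mathbf{K}} = \mathbf{Z}\mathbf{Z}^{\top}$ yield identical fitted values. A small sketch, using the `nystrom_features` helper above together with toy data and parameter values of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                     # toy inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)  # toy targets
landmarks = X[rng.choice(200, size=20, replace=False)]
Z = nystrom_features(X, landmarks, sigma=1.0)     # N x M' Nystrom features
lam, N = 0.1, len(y)

# Linear ridge regression in the Nystrom feature space (Eq. 8).
beta = np.linalg.solve(Z.T @ Z + N * lam * np.eye(Z.shape[1]), Z.T @ y)
pred_linear = Z @ beta

# Kernel ridge regression with the approximate Gram matrix K_tilde = Z Z^T (Eq. 7).
K_tilde = Z @ Z.T
alpha = np.linalg.solve(K_tilde + N * lam * np.eye(N), y)
pred_kernel = K_tilde @ alpha

assert np.allclose(pred_linear, pred_kernel, atol=1e-6)   # identical fitted values
```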

2.3 The proposed algorithm: Correlated Nyström Views (XNV)

Algorithm 1 details our approach to semi-supervised learning, based on generating two views consisting of Nyström random features and penalizing features which are weakly correlated across the views. The setting is that we have a small set of labeled points and a large amount of unlabeled data.

Input: Labeled data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ and unlabeled data $\{\mathbf{x}_i\}_{i=n+1}^{N}$.

1:  Generate features. Sample $2M$ points uniformly from the dataset, compute the eigendecompositions of the sub-sampled kernel matrices $\widehat{\mathbf{K}}^{(1)}$ and $\widehat{\mathbf{K}}^{(2)}$, which are constructed from the first and second $M$ samples respectively, and featurize the input: $\mathbf{z}^{(1)}(\mathbf{x}_i)$ and $\mathbf{z}^{(2)}(\mathbf{x}_i)$ for $i = 1, \dots, N$.
2:  Unlabeled data. Compute CCA bases $\mathbf{B}^{(1)}$, $\mathbf{B}^{(2)}$ and canonical correlations $\lambda_1, \dots, \lambda_M$ for the two views and set $\bar{\mathbf{z}}^{(v)}(\mathbf{x}_i) := \mathbf{B}^{(v)\top}\mathbf{z}^{(v)}(\mathbf{x}_i)$.
3:  Labeled data. Solve

  $\widehat{\boldsymbol\beta} := \operatorname*{argmin}_{\boldsymbol\beta}\ \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \big\langle\boldsymbol\beta,\, \bar{\mathbf{z}}^{(1)}(\mathbf{x}_i)\big\rangle\Big)^{2} + \big\|\boldsymbol\beta\big\|_{\text{CCA}}^{2} + \gamma\,\big\|\boldsymbol\beta\big\|_{2}^{2}. \qquad (9)$

Output: $\widehat{\boldsymbol\beta}$

Algorithm 1: Correlated Nyström Views (XNV).

Step 1 generates a set of random features. The next two steps implement multi-view regression using the randomly generated views $\mathbf{z}^{(1)}(\mathbf{x})$ and $\mathbf{z}^{(2)}(\mathbf{x})$. Eq. (9) yields a solution for which unimportant features are heavily downweighted in the CCA basis without introducing an additional tuning parameter. The further penalty on the $\ell_2$ norm (in the CCA basis) is introduced as a practical measure to control the variance of the estimator, which can become large if there are many highly correlated features (i.e. $\lambda_j \approx 1$ even for large $j$). In practice most of the shrinkage is due to the CCA norm: cross-validation consistently selects small values of $\gamma$.
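Putting the pieces together, the sketch below assembles Algorithm 1 from the helpers defined earlier (`nystrom_features`, `cca_bases`). It is a simplified illustration under our own default parameters, not the paper's implementation: in particular it uses exact CCA rather than the randomized CCA of [16] discussed next.

```python
import numpy as np

def xnv_fit(X_lab, y_lab, X_unlab, M=100, sigma=1.0, gamma=1e-4, seed=0):
    """Correlated Nystrom Views (Algorithm 1), simplified sketch."""
    X_all = np.vstack([X_lab, X_unlab])
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_all), size=2 * M, replace=False)
    # Step 1: two Nystrom views built from disjoint landmark sets.
    Z1 = nystrom_features(X_all, X_all[idx[:M]], sigma)
    Z2 = nystrom_features(X_all, X_all[idx[M:]], sigma)
    # Step 2: CCA on all (mostly unlabeled) points.
    B1, B2, lambdas = cca_bases(Z1, Z2)
    mean1 = Z1.mean(axis=0)
    Zbar1 = (Z1 - mean1) @ B1                        # canonical coordinates of view 1
    # Step 3: penalized regression on the labeled points only, Eq. (9).
    n = len(y_lab)
    Zl = Zbar1[:n]
    cca_weights = (1.0 - lambdas) / np.clip(lambdas, 1e-12, None)   # canonical norm weights
    A = Zl.T @ Zl / n + np.diag(cca_weights) + gamma * np.eye(Zl.shape[1])
    beta = np.linalg.solve(A, Zl.T @ y_lab / n)

    def predict(X_new):
        Z_new = nystrom_features(X_new, X_all[idx[:M]], sigma)
        return ((Z_new - mean1) @ B1) @ beta

    return beta, predict
```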

Computational complexity.

XNV is extremely fast. Nyström sampling, step 1, reduces the cost of kernel learning from $O(N^3)$ to $O(NM^2)$ operations. Computing the CCA basis, step 2, using standard algorithms requires $O(NM^2)$ operations; we reduce this runtime further by applying the randomized CCA algorithm recently proposed in [16]. Finally, step 3 is a computationally cheap regularized linear regression on $n$ samples and $M$ features.

Performance guarantees.

The quality of the kernel approximation in (5) has been the subject of detailed study in recent years, leading to a number of strong empirical and theoretical results [3, 4, 5, 15]. Recent work of Bach [5] provides theoretical guarantees on the quality of Nyström estimates in the fixed design setting that are relevant to our approach. (Extending to a random design requires techniques from [17].)

Theorem 3 (Nyström generalization bound, [5]).

Let $\boldsymbol\xi \in \mathbb{R}^{N}$ be a random vector with finite variance and zero mean, and set $\mathbf{y} = \mathbf{z} + \boldsymbol\xi$ for a fixed vector $\mathbf{z} \in \mathbb{R}^{N}$. Define the smoothed estimate $\hat{\mathbf{z}}_{\mathbf{K}} = \mathbf{K}(\mathbf{K} + N\lambda\mathbf{I})^{-1}\mathbf{y}$ and the smoothed Nyström estimate $\hat{\mathbf{z}}_{\widetilde{\mathbf{K}}} = \widetilde{\mathbf{K}}(\widetilde{\mathbf{K}} + N\lambda\mathbf{I})^{-1}\mathbf{y}$, both computed by minimizing the MSE with ridge penalty $\lambda$. Let $R(\hat{\mathbf{z}}) := \frac{1}{N}\,\mathbb{E}_{\boldsymbol\xi}\big\|\hat{\mathbf{z}} - \mathbf{z}\big\|_{2}^{2}$ denote the fixed-design risk. For $M$ sufficiently large (depending on $\lambda$, see [5]), we have

  $\mathbb{E}_{\widehat{\mathbf{X}}}\Big[R\big(\hat{\mathbf{z}}_{\widetilde{\mathbf{K}}}\big)\Big] \;\le\; (1 + c)\, R\big(\hat{\mathbf{z}}_{\mathbf{K}}\big),$ for a small constant $c$ made explicit in [5],

where $\mathbb{E}_{\widehat{\mathbf{X}}}$ refers to the expectation over the subsampled columns used to construct $\widetilde{\mathbf{K}}$.

In short, the best smoothed estimators in the Nyström views are close to the optimal smoothed estimator. Since the kernel estimate is consistent, $R(\hat{\mathbf{z}}_{\mathbf{K}}) \to 0$ as $N \to \infty$. Thus, Assumption 1 holds in expectation, and the generalization performance of XNV is controlled by Theorem 1.

Random Fourier Features.

An alternative approach to constructing random views is to use random Fourier features instead of Nyström features in Step 1. We refer to this approach as Correlated Kitchen Sinks (XKS) after [2]. It turns out that the performance of XKS is consistently worse than that of XNV, in line with the detailed comparison presented in [3]. We therefore do not discuss Fourier features in the main text; see §SI.3 for implementation details and experimental results.
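For reference, a minimal sketch of the random Fourier feature map for a Gaussian kernel that could be dropped in for `nystrom_features` in Step 1 to obtain XKS; the function name and parameters are ours, and the configuration actually used in the paper is described in §SI.3.

```python
import numpy as np

def fourier_features(X, M, sigma, seed=0):
    """Random Fourier features [6] approximating a Gaussian kernel of width sigma:
    z(x)^T z(x') ~= exp(-||x - x'||^2 / (2 * sigma^2))."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(D, M))   # frequencies drawn from N(0, sigma^-2 I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=M)        # random phases
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)
```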

2.4 A fast approximation to SSSL

The SSSL (simple semi-supervised learning) algorithm proposed in [10] finds the first $s$ eigenfunctions $\varphi_1, \dots, \varphi_s$ of the integral operator $\mathcal{K}$ in Eq. (6) and then solves

  $\min_{\boldsymbol\beta \in \mathbb{R}^{s}}\ \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \big\langle\boldsymbol\beta,\, \boldsymbol\varphi(\mathbf{x}_i)\big\rangle\Big)^{2}, \qquad (10)$

where $\boldsymbol\varphi(\mathbf{x}) := \big(\varphi_1(\mathbf{x}), \dots, \varphi_s(\mathbf{x})\big)$ and $s$ is set by the user. SSSL outperforms Laplacian Regularized Least Squares [11], a state of the art semi-supervised learning method, see [10]. It also has good generalization guarantees under reasonable assumptions on the distribution of eigenvalues of $\mathcal{K}$. However, since SSSL requires computing the full $N \times N$ Gram matrix, it is extremely computationally intensive for large $N$. Moreover, tuning $s$ is difficult since it is a discrete parameter.

We therefore propose a fast Nyström approximation to SSSL, which we denote SSSL-M. First, instead of constructing the full Gram matrix, we construct a Nyström approximation by sampling $M$ points from the labeled and unlabeled training set. Second, instead of thresholding eigenfunctions, we use the easier-to-tune ridge penalty, which penalizes directions proportional to the inverse square of their eigenvalues [18].

As justification, note that Proposition 2 states that the Nyström approximation to kernel regression actually solves a ridge regression problem in the span of the eigenfunctions of $\widehat{\mathcal{K}}$. As $M$ increases, the span of $\widehat{\mathcal{K}}$ tends towards that of $\mathcal{K}$ [15]. We will also refer to the Nyström approximation to SSSL constructed from $2M$ sampled points as SSSL-2M. See the experiments below for further discussion of the quality of the approximation.
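A minimal sketch of this approximation (Nyström features followed by ridge regression on the labeled points), reusing the `nystrom_features` helper from §2.2; parameter values are illustrative only.

```python
import numpy as np

def sssl_nystrom_fit(X_lab, y_lab, X_unlab, M=200, sigma=1.0, lam=1e-3, seed=0):
    """Fast approximation to SSSL: ridge regression on M Nystrom features
    whose landmarks are sampled from the labeled and unlabeled training data."""
    rng = np.random.default_rng(seed)
    X_all = np.vstack([X_lab, X_unlab])
    landmarks = X_all[rng.choice(len(X_all), size=M, replace=False)]
    Z = nystrom_features(X_lab, landmarks, sigma)       # featurize the labeled points
    n = len(y_lab)
    beta = np.linalg.solve(Z.T @ Z / n + lam * np.eye(Z.shape[1]), Z.T @ y_lab / n)

    def predict(X_new):
        return nystrom_features(X_new, landmarks, sigma) @ beta

    return beta, predict
```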

3 Experiments

Setup.

We evaluate the performance of XNV on 18 real-world datasets, see Table 1. The datasets cover a variety of regression (denoted by R) and two-class classification (C) problems. The sarcos dataset involves predicting the joint position of a robot arm; following convention we report results on the 1st, 5th and 7th joint positions.

Set  Name          Task  Source     Set  Name        Task  Source
1    abalone       C     UCI        10   elevators   R     LT
2    adult         C     UCI        11   HIVa        C     AL
3    ailerons      R     LT         12   house       R     LT
4    bank8         C     LT         13   ibn Sina    C     AL
5    bank32        C     LT         14   orange      C     AL
6    cal housing   R     LT         15   sarcos 1    R     GP
7    census        R     UCI        16   sarcos 5    R     GP
8    CPU           R     UCI        17   sarcos 7    R     GP
9    CT            R     UCI        18   sylva       C     AL

Sources: UCI = UCI repository (http://archive.ics.uci.edu/ml/datasets.html); AL = active learning challenge (http://www.causality.inf.ethz.ch/activelearning.php); LT = http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html; GP = http://www.gaussianprocess.org/gpml/data/.

Table 1: Datasets used for evaluation (R = regression, C = two-class classification).

The SSSL algorithm was shown to exhibit state-of-the-art performance over fully and semi-supervised methods in scenarios where few labeled training examples are available [10]. However, as discussed in §2.2, due to its computational cost we compare the performance of XNV against the Nyström approximations SSSL-M and SSSL-2M.

We used a Gaussian kernel for all datasets. We set the kernel width and the regularisation strength for each method using 5-fold cross validation on the labeled training examples. We trained all methods using a squared error loss function, with randomly generated features and randomly selected training examples.

Runtime performance.

The SSSL algorithm of [10] is not computationally feasible on large datasets, since it requires an eigendecomposition of the full $N \times N$ Gram matrix, which takes $O(N^3)$ operations. For illustrative purposes, we report run times in seconds (computed in Matlab 7.14 on a Core i5 with 4GB memory) of the exact SSSL algorithm against its Nyström approximation and XNV on three datasets of different sizes.

(Runtimes in seconds for exact SSSL, its Nyström approximation, and XNV on bank8, cal housing, and sylva; exact SSSL could not be run on sylva.)

For the cal housing dataset, XNV is orders of magnitude faster than exact SSSL. For the largest dataset, sylva, exact SSSL is computationally intractable. Importantly, the computational overhead of XNV over the approximate SSSL is small.

Generalization performance.

We report on the prediction performance averaged over 100 experiments. For regression tasks we report on the mean squared error (MSE) on the testing set normalized by the variance of the test output. For classification tasks we report the percentage of the test set that was misclassified.
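Concretely, the two error measures are computed as follows (a small sketch; the ±1 label convention for classification is our assumption):

```python
import numpy as np

def normalized_mse(y_true, y_pred):
    """Test-set MSE normalized by the variance of the test outputs."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def classification_error(y_true, scores):
    """Fraction of the test set misclassified, assuming labels in {-1, +1}
    and real-valued scores thresholded at zero."""
    return np.mean(np.sign(scores) != y_true)
```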

The table below shows the improvement in performance of XNV over the SSSL approximations (taking whichever of SSSL-M or SSSL-2M performs better on each dataset), averaged over all 18 datasets. Observe that XNV is considerably more accurate and more robust than SSSL.

XNV vs SSSL (for increasing numbers of labeled training points)
Avg reduction in error      11%   16%   15%   12%    9%
Avg reduction in std err    15%   30%   31%   33%   30%

The reduced variability is to be expected from Theorem 1.

Figure 1: Comparison of mean prediction error and standard deviation on a selection of datasets: (a) adult, (b) cal housing, (c) census, (d) elevators, (e) ibn Sina, (f) sarcos 5.

Table 2 presents a more detailed comparison of performance on the individual datasets, and the plots in Figure 1 show a representative comparison of mean prediction errors for several datasets. Error bars represent one standard deviation. Observe that XNV almost always improves prediction accuracy and reduces variance compared with SSSL-M and SSSL-2M when the labeled training set contains between 100 and 500 labeled points. A complete set of results is provided in §SI.1.

Discussion of the SSSL approximation.

Our experiments show that going from SSSL-M to SSSL-2M does not improve generalization performance in practice. This suggests that, when there are few labeled points, obtaining a more accurate estimate of the eigenfunctions of the kernel does not necessarily improve predictive performance. Indeed, when more random features are added, stronger regularization is required to reduce the influence of uninformative features, which also has the effect of downweighting informative features. This suggests that the low-rank approximation SSSL-M to SSSL suffices.

Finally, §SI.2 compares the performance of the SSSL approximations and XNV to fully supervised kernel ridge regression (KRR). We observe dramatic improvements, between 48% and 63%, consistent with the results observed in [10] for the exact SSSL algorithm.

Random Fourier features.

Nyström features significantly outperform Fourier features, in line with observations in [3]. The table below shows the relative improvement of XNV over XKS:

XNV vs XKS (for increasing numbers of labeled training points)
Avg reduction in error      30%   28%   26%   25%   24%
Avg reduction in std err    36%   44%   34%   37%   36%

Further results and discussion for XKS are included in the supplementary material.

Table 2: Performance (normalized MSE / classification error rate) on datasets 1-18; columns compare SSSL-M, SSSL-2M and XNV. Standard errors in parentheses.

4 Conclusion

We have introduced the XNV algorithm for semi-supervised learning. By combining two randomly generated views of Nyström features via an efficient implementation of CCA, XNV outperforms the prior state-of-the-art, SSSL, by 10-15% (depending on the number of labeled points) on average over 18 datasets. Furthermore, XNV is over 3 orders of magnitude faster than SSSL on medium-sized datasets, with further gains as the amount of data increases. An interesting research direction is to investigate using the recently developed deep CCA algorithm, which extracts higher order correlations between views [19], as a preprocessing step.

In this work we use a uniform sampling scheme for the Nyström method, for computational reasons and because it has been shown to perform well empirically relative to more expensive schemes [20]. Since CCA gives us a criterion by which to measure the importance of random features, in future work we aim to investigate active sampling schemes based on canonical correlations, which may yield better performance by selecting the most informative indices to sample.

Acknowledgements.

We thank Haim Avron for help with implementing randomized CCA and Patrick Pletscher for drawing our attention to the Nyström method.

References

  • [1] Williams C, Seeger M: Using the Nyström method to speed up kernel machines. In NIPS 2001.
  • [2] Rahimi A, Recht B: Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Adv in Neural Information Processing Systems (NIPS) 2008.
  • [3] Yang T, Li YF, Mahdavi M, Jin R, Zhou ZH: Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison. In NIPS 2012.
  • [4] Gittens A, Mahoney MW: Revisiting the Nyström method for improved large-scale machine learning. In ICML 2013.
  • [5] Bach F: Sharp analysis of low-rank kernel approximations. In COLT 2013.
  • [6] Rahimi A, Recht B: Random Features for Large-Scale Kernel Machines. In Adv in Neural Information Processing Systems 2007.
  • [7] Kakade S, Foster DP: Multi-view Regression Via Canonical Correlation Analysis. In Computational Learning Theory (COLT) 2007.
  • [8] Hotelling H: Relations between two sets of variates. Biometrika 1936, 28:312–377.
  • [9] Hardoon DR, Szedmak S, Shawe-Taylor J: Canonical Correlation Analysis: An Overview with Application to Learning Methods. Neural Comp 2004, 16(12):2639–2664.
  • [10] Ji M, Yang T, Lin B, Jin R, Han J: A Simple Algorithm for Semi-supervised Learning with Improved Generalization Error Bound. In ICML 2012.
  • [11] Belkin M, Niyogi P, Sindhwani V: Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR 2006, 7:2399–2434.
  • [12] Blum A, Mitchell T: Combining labeled and unlabeled data with co-training. In COLT 1998.
  • [13] Chaudhuri K, Kakade SM, Livescu K, Sridharan K: Multiview clustering via Canonical Correlation Analysis. In ICML 2009.
  • [14] McWilliams B, Montana G: Multi-view predictive partitioning in high dimensions. Statistical Analysis and Data Mining 2012, 5:304–321.
  • [15] Drineas P, Mahoney MW: On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning. JMLR 2005, 6:2153–2175.
  • [16] Avron H, Boutsidis C, Toledo S, Zouzias A: Efficient Dimensionality Reduction for Canonical Correlation Analysis. In ICML 2013.
  • [17] Hsu D, Kakade S, Zhang T: An Analysis of Random Design Linear Regression. In COLT 2012.
  • [18] Dhillon PS, Foster DP, Kakade SM, Ungar LH: A Risk Comparison of Ordinary Least Squares vs Ridge Regression. Journal of Machine Learning Research 2013, 14:1505–1511.
  • [19] Andrew G, Arora R, Bilmes J, Livescu K: Deep Canonical Correlation Analysis. In ICML 2013.
  • [20] Kumar S, Mohri M, Talwalkar A: Sampling methods for the Nyström method. JMLR 2012, 13:981–1006.

SI.1 Complete XNV results

Table 3: Complete results. Performance (normalized MSE / classification error rate) on all 18 datasets for each number of labeled training points considered; columns compare SSSL-M, SSSL-2M and XNV. Standard errors in parentheses.
Figure 2: Comparison of mean prediction error and standard deviation on all 18 datasets: (a) abalone, (b) adult, (c) ailerons, (d) bank8, (e) bank32, (f) cal housing, (g) census, (h) CPU, (i) CT, (j) elevators, (k) HIVa, (l) house, (m) ibn Sina, (n) orange, (o) sarcos 1, (p) sarcos 5, (q) sarcos 7, (r) sylva.

SI.2 Comparison with Kernel Ridge Regression

We compare the SSSL approximations and XNV to kernel ridge regression (KRR). The table below reports the percentage improvement in mean error of both of these methods against KRR, averaged over the 18 datasets according to the experimental procedure detailed in §3. The kernel width and ridge penalty for KRR were chosen by 5-fold cross validation. We observe that both the approximate SSSL and XNV far outperform KRR, with improvements of between 48% and 63% on average. Importantly, this shows that our approximate implementation retains the advantage over fully supervised methods reported for the exact SSSL algorithm in [10].