As a flexible nonparametric tool, kernel ridge regression (KRR) has gained popularity in many application fields, such as machine learning and visualization. Besides the estimation of the predictive mean, an exploration of the predictive variance is also important for statistical inference. Predictive variances can be used, for example, to build confidence intervals, or to select the most informative data points in active learning. There are two sources of uncertainty in the predictive variance: the noise in the data and the uncertainty in the estimation of the target function. However, calculating the second source of uncertainty is challenging in KRR on a large data set, since the computational burden increases dramatically with the size of the training set. For example, for $n$ data points, the time and space complexity of KRR are of order $n^3$ and $n^2$, respectively. This can severely limit the applicability of KRR to big data scenarios.
An efficient way to break this computational bottleneck is low-rank approximation of kernel matrices. Existing methods include dimension reduction [16, 3, 2], Nyström approximation, and random projections of large kernel matrices. Indeed, low-rank approximation strategies effectively reduce the size of large matrices so that the reduced matrices can be conveniently stored and processed.
In this paper, we propose a randomly sketched predictive variance to reduce the computational complexity. Theoretically, we show that, provided the projection dimension exceeds a certain lower bound, our approach leads to a consistent estimator of the true variance. Furthermore, our variance estimator is optimal for a large family of kernel matrices with polynomially and exponentially decaying eigenvalues. This includes, for instance, cubic splines and Gaussian kernels.
To illustrate the applicability of our theoretical contribution, we describe an application of our variance estimator in active learning. In many scenarios, the task of manually labeling (unlabeled) data points is expensive and time-consuming. Therefore, it is very important to minimize the number of training examples needed to estimate a particular type of regression function. Suppose we have a set of training examples and labels (responses), and we are permitted to actively choose future unlabeled examples based on the data that we have previously seen. Active learning aims at solving this problem and has been used with various learners such as neural networks [9, 5], mixture models [13], and kernel ridge regression. In active learning, the predictive variance can be viewed as an uncertainty score to iteratively select the most informative unlabeled data points. That is, the largest predictive variance corresponds to the highest uncertainty for unlabeled points, which indicates that we may need more information regarding those points.
In this section, we introduce kernel ridge regression, its mean prediction, and the conditional covariance. Let $x_1, \dots, x_n$ be the training examples, drawn from a common distribution, and let $y_1, \dots, y_n$ be the corresponding training labels. Consider the following nonparametric regression model:
$y_i = f^*(x_i) + \epsilon_i, \quad i = 1, \dots, n,$
where the $\epsilon_i$'s are independent random variables with mean $0$ and variance $\sigma^2$. Hereafter, we assume that $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) associated with a reproducing kernel function $K(\cdot, \cdot)$ defined from $\mathcal{X} \times \mathcal{X}$ to $\mathbb{R}$. Let $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ denote the inner product of $\mathcal{H}$ associated with $K$; the reproducing property then states that
$f(x) = \langle f, K(x, \cdot) \rangle_{\mathcal{H}} \quad \text{for all } f \in \mathcal{H}.$
The corresponding norm is defined as $\| f \|_{\mathcal{H}} = \sqrt{\langle f, f \rangle_{\mathcal{H}}}$ for any $f \in \mathcal{H}$.
The classic kernel ridge regression (KRR) estimate is obtained via minimizing a penalized likelihood function:
$\hat f = \arg\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2 + \lambda \| f \|_{\mathcal{H}}^2.$
Let $K$ be the $n \times n$ kernel matrix with entries $K_{ij} = K(x_i, x_j)$ for $1 \le i, j \le n$. By the representer theorem, $\hat f$ has the form
$\hat f(\cdot) = \sum_{i=1}^{n} \alpha_i K(x_i, \cdot)$
for a real vector $\alpha = (\alpha_1, \dots, \alpha_n)^\top$. Equation (2) thus reduces to solving the following optimization problem:
$\hat\alpha = \arg\min_{\alpha \in \mathbb{R}^n} \frac{1}{n} \| y - K \alpha \|_2^2 + \lambda\, \alpha^\top K \alpha.$
Thus, the KRR estimator is expressed as $\hat f(\cdot) = \sum_{i=1}^{n} \hat\alpha_i K(x_i, \cdot)$, where $\hat\alpha = (K + \lambda n I_n)^{-1} y$.
For a new testing data point $x^*$, let $k^* = (K(x_1, x^*), \dots, K(x_n, x^*))^\top$. It is easy to calculate the mean prediction and variance given the training data and $x^*$ as follows:
$\hat f(x^*) = k^{*\top} (K + \lambda n I_n)^{-1} y, \qquad \mathrm{Var}\big(\hat f(x^*) \mid X, x^*\big) = \sigma^2\, k^{*\top} (K + \lambda n I_n)^{-2} k^*.$
Analyzing the conditional variance at the testing data is very important in active learning, since it can act as a guide for selecting the most informative points. However, solving for $\hat\alpha$ takes time of order $n^3$ and space of order $n^2$. This cost is expensive, especially when the kernel matrix is dense and the sample size is large.
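To make the closed-form computations above concrete, here is a minimal Python sketch of exact KRR prediction and its conditional variance; it assumes the standard solution $\hat\alpha = (K + \lambda n I)^{-1} y$ and a Gaussian kernel, and all function names and parameter choices are illustrative rather than taken from the paper.

```python
import numpy as np

def gaussian_kernel(X, Z, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Z."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def krr_mean_and_variance(X, y, x_star, lam, sigma2, bandwidth=1.0):
    """Exact KRR prediction and its conditional variance at x_star.

    Solves (K + lam * n * I) alpha = y (an O(n^3) operation); the variance
    sigma2 * ||(K + lam * n * I)^{-1} k_star||^2 follows because the KRR
    prediction is linear in the noisy responses y.
    """
    n = X.shape[0]
    K = gaussian_kernel(X, X, bandwidth)
    A = K + lam * n * np.eye(n)
    alpha = np.linalg.solve(A, y)                       # O(n^3) bottleneck
    k_star = gaussian_kernel(X, x_star[None, :], bandwidth).ravel()
    mean = float(k_star @ alpha)
    w = np.linalg.solve(A, k_star)
    var = sigma2 * float(w @ w)
    return mean, var
```

The `np.linalg.solve` calls on the dense $n \times n$ system are exactly the cubic cost discussed above; everything else is comparatively cheap.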
III Randomly Projected Variance
In this section, we introduce our randomly projected conditional covariance and our main assumptions. Note that, by the Binomial Inverse Theorem (a special case of the Woodbury matrix identity), the inverse $(K + \lambda n I_n)^{-1}$ appearing in the conditional variance can be rewritten in a form amenable to sketching.
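For reference, the Binomial Inverse Theorem states that, for conformable matrices $A$, $U$, $B$, $V$ with the indicated inverses existing,

```latex
(A + U B V)^{-1} \;=\; A^{-1} - A^{-1} U \left( B^{-1} + V A^{-1} U \right)^{-1} V A^{-1}.
```

Applied with a low-rank middle factor, it reduces an $n \times n$ inversion to a much smaller one, which is what makes a sketched variance cheap to evaluate.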
To reduce the computational cost, we now propose to replace the kernel matrix by a randomly projected version, where the projection is carried out by a random matrix whose rows are independently distributed and sub-Gaussian. The conditional variance with the randomly projected kernel matrix can then be written accordingly, yielding our randomly sketched variance.
The definition of this randomly sketched variance is itself one of our contributions; it differs from the variance that could be derived from existing results on randomly sketched kernel regression.
Note that calculating the randomly sketched variance is far cheaper than the cubic cost of the exact variance, which greatly enhances computational efficiency. In Section IV, we provide a lower bound for the projection dimension that guarantees the stability of the variance after random projection.
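Since the displayed definitions above were lost in extraction, the following Python sketch shows one plausible instantiation of a randomly sketched variance, not necessarily the paper's exact estimator: the KRR coefficients are restricted to the row space of an $m \times n$ sketch matrix $S$, and the variance of the resulting prediction (which is linear in $y$) is computed. All names are illustrative.

```python
import numpy as np

def sketched_variance(K, k_star, S, lam, sigma2):
    """Conditional variance of a sketched-KRR prediction (one plausible form).

    Restricting alpha = S^T theta (S is m x n) and minimizing
    (1/n) ||y - K S^T theta||^2 + lam * theta^T S K S^T theta gives
    theta_hat = M^{-1} S K y / n with M = S K^2 S^T / n + lam * S K S^T.
    The prediction k_star^T S^T theta_hat is linear in y, so its
    conditional variance is sigma2 * ||K S^T M^{-1} S k_star / n||^2.
    """
    n = K.shape[0]
    KS = K @ S.T                                  # n x m; the O(m n^2) cost
    M = KS.T @ KS / n + lam * (S @ KS)            # m x m
    w = KS @ np.linalg.solve(M, S @ k_star) / n   # n-vector, linear in k_star
    return sigma2 * float(w @ w)
```

After the one-time `K @ S.T` product, each new test point costs only $O(mn)$ plus an $m \times m$ solve, instead of anything cubic in $n$.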
Here we introduce some notation to study the dimension of the random matrix. Define the effective dimension as the number of eigenvalues of the kernel matrix that exceed the regularization level, where $\mu_k$ denotes the $k$-th largest eigenvalue of the kernel matrix $K$. More formally, write $K = U D U^\top$, where $U$ is an orthonormal matrix, i.e., $U^\top U = I_n$, and $D$ is a diagonal matrix with diagonal elements $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_n \ge 0$.
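The display defining this dimension was lost in extraction; as an explicit assumption, the snippet below uses one common convention from the sketching literature, counting the eigenvalues of $K$ that are at least $\lambda n$:

```python
import numpy as np

def effective_dimension(K, lam):
    """One common notion of effective dimension (an assumption here):
    the number of eigenvalues of K that are at least lam * n.
    Eigenvalues are sorted decreasingly, matching mu_1 >= mu_2 >= ...
    """
    n = K.shape[0]
    mu = np.sort(np.linalg.eigvalsh(K))[::-1]   # decreasing eigenvalues
    return int(np.sum(mu >= lam * n))
```

For fast-decaying spectra this count is much smaller than $n$, which is what makes a small projection dimension sufficient.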
In this paper, we consider random matrices with independent sub-Gaussian rows. A row of the random matrix is sub-Gaussian if all of its one-dimensional marginals (i.e., its inner products with fixed unit vectors) are sub-Gaussian random variables, i.e., their tail probabilities decay at a Gaussian rate.
Matrices fulfilling the above condition include, as a particular instance, all matrices with independent sub-Gaussian entries. The class of sub-Gaussian variates includes, for instance, Gaussian variables, any bounded random variable (e.g., Bernoulli, multinomial, uniform), any random variable with a strongly log-concave density, and any finite mixture of sub-Gaussian variables. In the remainder of the paper, we scale the random matrix by the inverse square root of the projection dimension, for analytical convenience.
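A small helper for generating such sketch matrices, scaled by $1/\sqrt{m}$ as described above; this is a sketch under the assumption that Gaussian or Rademacher entries are used (both sub-Gaussian), with illustrative names.

```python
import numpy as np

def sketch_matrix(m, n, kind="gaussian", rng=None):
    """Generate an m x n sketch matrix with independent sub-Gaussian rows,
    scaled by 1/sqrt(m) so that E[S^T S] = I_n."""
    rng = np.random.default_rng(rng)
    if kind == "gaussian":
        S = rng.normal(size=(m, n))
    elif kind == "rademacher":            # bounded, hence sub-Gaussian
        S = rng.choice([-1.0, 1.0], size=(m, n))
    else:
        raise ValueError(f"unknown sketch kind: {kind}")
    return S / np.sqrt(m)
```

With this scaling, $\mathbb{E}[S^\top S] = I_n$, so the sketch is an isometry in expectation.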
Next, we state our main assumption and some useful results related to the randomly projected kernel matrix.
Let be a sub-Gaussian random matrix with independent rows. Let and . Set the projection dimension , where is an absolute constant. For , let with , and . Let with and . We assume that satisfies the following conditions:
with probability greater than, where is an absolute constant independent of .
with probability greater than , where and are constants independent of .
In Assumption A1, the kernel matrix is partitioned into a sum of two terms: one involving the first columns of the orthonormal matrix, which correspond to the leading eigenvalues of the kernel matrix, and one involving the remaining columns, which correspond to the smallest eigenvalues. In most cases, the smallest eigenvalues are negligible due to the fast decay rate of the eigenvalues. Assumption A1 (i) ensures that the randomly projected eigenvectors corresponding to the leading eigenvalues approximately preserve the distances between each other; Assumption A1 (ii) ensures that the operator norm of the “negligible” tail part does not change too much after the random projection. Random matrices satisfying Assumption A1 include sub-Gaussian random matrices, as well as matrices constructed by randomly sub-sampling and rescaling the rows of a fixed orthonormal matrix. We refer the interested reader to the references for more details.
Our work differs from prior work on sketched kernel regression in several fundamental ways. That line of work focuses on the (mean) prediction error on a training set, and it is unclear how this relates to the prediction error on a testing set. In contrast, we target the variance of the prediction, and we focus on prediction on a test set. Additionally, note that we define the effective dimension through the tuning parameter as in eq. (7), which differs from previous definitions.
IV Main Results
Recall that, for a new testing data point, the conditional variance of the prediction given the training data was defined above. In this section, we show that, for the new data point, our proposed randomly sketched conditional variance provides a stable approximation of the original conditional variance.
Under Assumption A1, suppose as , , and the projection dimension . Then with probability at least , with respect to the random choice of , we have
where and are absolute constants independent of .
As shown in Theorem IV.1, the convergence rate involves the tuning parameter directly. Normally, we choose the tuning parameter to achieve minimax optimal estimation. Next, we provide some examples showing how to choose the lower bound on the projection dimension of the random matrix and the corresponding optimal tuning parameter.
Let with , then we have that . We denote , and let . Then can be written as
by eq.(4), where is the Euclidean norm. Furthermore
where, in the last step, we used the Cauchy–Schwarz inequality and the triangle inequality. In the above, the two remainder terms are defined as follows:
Next, we prove that
Before the proof of eq.(9), we first consider the optimization problem
which has the solution . In this case .
Therefore, to prove , we only need to find a vector , such that . This will imply
By definition of , when then and when then . Let , where , and . Let correspondingly. Also, divide into , where , are and dimension diagonal matrix, respectively. Let , with as the left block and as the right block. We construct a vector by setting . By plugging into eq.(10), we have that
Clearly, in our construction , and thus we focus on analyzing . For any , there exists a vector , such that , where , and such that is orthogonal to the span of . Therefore, , and . Thus , where is the empirical kernel matrix. Assume that , then
Then, we have the ellipse constraint that , where .
Since we have , , which implies , we have that
Therefore, we have . For the penalty term,
where is a constant. Finally, by eq.(IV), we can claim that
where is some constant.
Similarly, to prove , we can treat as an identity matrix. Consider the following optimization problem
which has the solution . In this case . Therefore, we only need to find a , such that . This will imply
Here we construct ,
For the penalty term, . Therefore, combining with eq.(12), we have
Finally, by eq.(4.1) , we have
Example 1: Consider kernels with polynomially decaying eigenvalues $\mu_k \asymp k^{-2\alpha}$ for some $\alpha > 1/2$. Such kernels include the $\alpha$-th order periodic Sobolev space; for $\alpha = 2$, this corresponds to the cubic spline. Since the optimal rate of the tuning parameter to achieve the minimax estimation error is of order $n^{-2\alpha/(2\alpha+1)}$, we get the corresponding optimal lower bound $n^{1/(2\alpha+1)}$ for the projection dimension. Furthermore, the difference between the original conditional variance and the randomly sketched conditional variance vanishes at a polynomial rate in $n$.
Example 2: Consider kernels with exponentially decaying eigenvalues, which include the Gaussian kernel. Since in this case only a logarithmic number of eigenvalues exceed any polynomially small regularization level, the effective dimension grows only polylogarithmically in $n$, and the corresponding lower bound on the projection dimension is far smaller than in Example 1.
In this section, we verify the validity of our theoretical contribution (Theorem IV.1) through synthetic experiments.
Data were generated based on eq. (1), with the predictor following a uniform distribution. We used Gaussian random projection matrices.
For the polynomial kernel, Figure 1 (a) shows the gap between the original and sketched variances as the training sample size grows, with the tuning parameter and the projection dimension held fixed. Note that as the sample size increases, the gap decreases at the rate predicted by Theorem IV.1. For Figure 1 (c), we fix the sample size and the tuning parameter while varying the projection dimension. Note that an increase in the projection dimension leads to a smaller gap, but this improvement is no longer noticeable once the projection dimension passes the optimal value demonstrated in Example 1 (equivalently, when c = 1). In Figure 1 (e), we fix the sample size and the projection dimension while varying the remaining model parameter; note that the gap increases almost linearly with respect to this parameter, which is consistent with our theory.
For the Gaussian kernel, Figure 1 (b) shows the gap as the training sample size grows, with the tuning parameter and the projection dimension held fixed. Note that the gap decreases at the rate predicted by Theorem IV.1. For Figure 1 (d), we fix the sample size and the tuning parameter while varying the projection dimension, with the scaling constant c ranging from 0.3 to 1.8. Note that an increase in the projection dimension leads to a smaller gap, but this improvement is no longer noticeable beyond c = 1, the optimal projection dimension demonstrated in Example 2. In Figure 1 (f), we fix the sample size and the projection dimension while varying the remaining model parameter. As in the previous experiment, the gap increases almost linearly with respect to this parameter, which is consistent with our theory.
In Appendix VI-B, we show additional synthetic experiments verifying our theoretical contribution. We further illustrate the use of our projected variance estimator in active learning, on synthetic data as well as two real-world datasets. In this illustrative application, using the randomly sketched predictive variance greatly reduces the computational complexity, with the savings governed by the projection dimension. Given our theoretical findings, our variance estimator does not sacrifice statistical accuracy.
VI Concluding Remarks
There are several ways of extending this research. While we focused on kernel ridge regression, it would be interesting to propose a statistically and computationally efficient variance estimator for Gaussian processes as well. Additionally, in Assumption A1 we currently only consider sub-Gaussian random matrices, for theoretical convenience. However, the property in Assumption A1 might also hold for subsampled Fourier and Hadamard random matrices, albeit with a different relationship between the projection dimension and the effective dimension. For sub-Gaussian random matrices, the projection dimension only needs to be of the order of the effective dimension, whereas for Hadamard random matrices a larger projection dimension is needed for estimating the (mean) prediction error. The same is likely to happen for our predictive variance; note, however, that our definition of the effective dimension differs from prior work. The analysis of different random matrices is an appealing direction for future work, and our general results on sub-Gaussian matrices are a necessary first step towards this endeavor.
-  Ahmed Alaoui and Michael W Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems, pages 775–783, 2015.
-  Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. Kernels as features: On kernels, margins, and low-dimensional mappings. Machine Learning, 65(1):79–94, 2006.
-  Mikio L Braun, Joachim M Buhmann, and Klaus-Robert Müller. On relevant dimensions in kernel feature spaces. Journal of Machine Learning Research, 9(Aug):1875–1908, 2008.
-  Klaus Brinker. Incorporating diversity in active learning with support vector machines. In ICML, volume 3, pages 59–66, 2003.
-  David A Cohn. Neural network exploration using optimal experiment design. Advances in neural information processing systems, pages 679–679, 1994.
-  David A Cohn, Zoubin Ghahramani, and Michael I Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4(1):129–145, 1996.
-  James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. UAI, 2013.
-  Meimei Liu, Zuofeng Shang, and Guang Cheng. Nonparametric testing under random projection. arXiv preprint arXiv:1802.06308, 2018.
-  David JC MacKay. Information-based objective functions for active data selection. Neural computation, 4(4):590–604, 1992.
-  Cameron Musco and Christopher Musco. Recursive sampling for the Nyström method. arXiv preprint arXiv:1605.07583, 2016.
-  Carlotta Orsenigo and Carlo Vercellis. Kernel ridge regression for out-of-sample mapping in supervised manifold learning. Expert Systems with Applications, 39(9):7757–7762, 2012.
-  Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, pages 1657–1665, 2015.
-  Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. Journal of machine learning research, 2(Nov):45–66, 2001.
-  Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
-  Yun Yang, Mert Pilanci, and Martin J Wainwright. Randomized sketches for kernels: Fast and optimal non-parametric regression. arXiv preprint arXiv:1501.06195, 2015.
-  Tong Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17(9):2077–2098, 2005.
VI-A Illustrative Application: Randomly Sketched Active Learning Algorithm
Active learning has been successfully applied to classification as well as regression problems. Most active learning algorithms need to iteratively compute a score for each unlabeled sample. Specifically, the kernel ridge regression approach needs to evaluate the prediction variance for the unlabeled samples. As the size of the training data increases, the cost of this computation increases cubically. An additional aspect that increases the computational cost is the use of cross validation to select the tuning parameters at each iteration, followed by computing the score for each unlabeled subject. Our randomly sketched active learning aims to reduce the computational cost of both model fitting and score calculation.
The main computational cost of randomly sketched KRR lies in the multiplication of the sketch matrix with the kernel matrix. Suppose the current training set has $n$ data points and the projection dimension of the random matrix is $m$; the matrix multiplication then costs on the order of $m n^2$ operations. In the next iteration, new data points are added to the training set. Instead of recomputing the full matrix product, we only need to calculate the entries corresponding to the newly added data points. We partition the new kernel matrix as
where the top-left block is the kernel matrix of the current training set, the off-diagonal blocks contain the kernel evaluations between current and new points, and the bottom-right block is the kernel matrix of the new points. Correspondingly, we partition the new random projection matrix so that its top-left block is the sketch matrix from the current step and the remaining blocks are newly generated, with the projection dimensions of the current and new sketch matrices summing to the total. Then the matrix multiplication can be written in block form,
Since the product of the current sketch matrix with the current kernel matrix has already been calculated in the previous step, we only need to compute the remaining block terms. The computational complexity per iteration is thus greatly reduced. The training-set size increases at each iteration, but the step size is fixed, so the newly computed blocks stay small. The reduction in computational complexity is significant for large training sets.
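The block update just described can be sketched as follows; the names are illustrative: `P00 = S0 @ K0` is the product carried over from the previous iteration, `B` and `C` are the kernel blocks involving the newly added points, and `T12`, `T21`, `T22` are the newly generated sketch blocks.

```python
import numpy as np

def update_sketch_product(P00, S0, T12, T21, T22, B, C, K0):
    """Block update of S @ K after adding new data points.

    The previous product P00 = S0 @ K0 (m0 x n0) is reused; only the terms
    involving the new kernel blocks B (n0 x dn), C (dn x dn) and the new
    sketch blocks T12 (m0 x dn), T21 (dm x n0), T22 (dm x dn) are computed.
    """
    top = np.hstack([P00 + T12 @ B.T, S0 @ B + T12 @ C])
    bottom = np.hstack([T21 @ K0 + T22 @ B.T, T21 @ B + T22 @ C])
    return np.vstack([top, bottom])
```

Only the terms touching the new blocks are multiplied; the dominant old product `S0 @ K0` is never recomputed.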
The difference between Algorithm 1 and the classical active learning algorithm is that we use the predictive-variance scores as weights to randomly sample data points from the unlabeled training data, instead of deterministically selecting the data points with the largest scores. If there is a small cluster of data with large scores, the deterministic method tends to add all of them to the training set early on. Suppose these data points are clustered together and lie outside the majority of the data. Once they are all selected in the first few iterations, they may become the majority of the training set, and since the total number of labeled points is very small in early iterations, this adds an extra bias to the prediction. Thus, we use a weighted random sampling strategy to ensure a substantial probability of selecting data points with large scores, while avoiding selecting too many of them at once.
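The weighted sampling step can be sketched as below, assuming the scores are the (nonnegative) predictive variances of the unlabeled points; the names are illustrative.

```python
import numpy as np

def select_queries(scores, k, rng=None):
    """Randomly sample k unlabeled indices with probability proportional
    to their predictive-variance scores, instead of a deterministic top-k."""
    rng = np.random.default_rng(rng)
    scores = np.asarray(scores, dtype=float)
    p = scores / scores.sum()
    return rng.choice(len(scores), size=k, replace=False, p=p)
```

Compared with a deterministic top-k rule, high-score points remain the most likely picks, but a single high-variance cluster cannot monopolize an entire batch.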
In this section, we evaluate the performance of our proposed random projection approach. We run experiments on synthetic data as well as on real-world data sets.
VI-B1 Confirming our theoretical contribution
Through synthetic experiments, we first verify the validity of our theoretical contribution (Theorem IV.1). Training samples were generated based on eq. (1), and we use the Gaussian kernel function with a fixed bandwidth. Next, we generated testing samples following the same distribution. We then generated a random projection matrix with Gaussian distributed entries, for a range of projection dimensions. We observe that, as the projection dimension increases, the randomly projected variance behaves very similarly to the original conditional variance, which confirms the validity of our approach. As for computational time, calculating the original variance for a new sample is far slower than calculating our proposed randomly projected variance, showing the practical advantage of our method.
VI-B2 Illustrative application on active learning: synthetic experiments
Next, we illustrate the use of our variance estimator in active learning with synthetic data. (Appendix VI-A provides details of a simple algorithm that uses our variance estimator, together with its computational cost.) For comparison, we simulated a training set and a testing set, and the initial training set was selected by randomly sampling data points from the full training set. In the simulation settings, we use the Gaussian kernel with a fixed bandwidth, and we report the mean squared error (MSE) at each iteration.
Simulation Setting 1. We simulate the predictor from a uniform distribution. The response was generated according to the regression model in eq. (1), where the noise terms are i.i.d. standard Gaussian.
A Gaussian random projection matrix is used in this setting, with a fixed sketch dimension. As shown in Figure 2, the randomly sketched active learning algorithm has the smallest mean squared error after the early iterations. Also, the mean squared error of the randomly sketched active learning algorithm converges as fast as that of active learning with the original KRR and of random sampling with the original KRR.
Simulation Setting 2. We simulate the predictor from an unevenly weighted mixture distribution, so that the data are unevenly distributed. The response was generated according to the regression model, where the noise terms are i.i.d. standard Gaussian.
[Figure: left panel, Simulation setting 1; right panel, Simulation setting 2.]
As in Setting 1, we use a Gaussian random matrix with the same sketch dimension. As shown in Figure 3, the original active learning method shows a faster convergence rate and achieves a lower MSE than the random sampling algorithm. For these unevenly distributed data, the random sampling strategy is unlikely to select the data points outside the majority, whereas the minority data points with large prediction variance tend to be selected by the active learning algorithm. Thus, the randomly sketched active learning algorithm is comparable with active learning based on the original KRR after the early iterations, and converges to a similar MSE.
VI-B3 Illustrative application on active learning: real-world experiments
Next, we illustrate the use of our variance estimator in active learning with real-world data. (Appendix VI-A provides details of a simple algorithm that uses our variance estimator, together with its computational cost.)
Flight Delay Data. Here, we evaluate our randomly sketched active learning algorithm on the US flight dataset, which contains up to 2 million points. We use a subset of the data with flight arrival and departure times for commercial flights in 2008. The flight delay was used as our response variable, and we included 8 of the many variables in this dataset: the age of the aircraft, the distance to be covered, airtime, departure time, arrival time, day of the week, day of the month, and month.
[Figure: left panel, flight delay data; right panel, world weather data.]
We randomly selected a subset of data points, split into a training set and a testing set. We first randomly selected a small number of data points as labeled data, and then sequentially added data points from the unlabeled training data at each iteration. We use a Gaussian random matrix with a fixed projection dimension. Here, we only use the randomly sketched KRR, since the computational cost and required RAM of the original KRR are too large. To compare the performance of active learning and uniform sampling, we calculate the root mean squared error (RMSE) over 30 repetitions, using the predictions on the testing set. In Figure 4, the active learning algorithm reaches the RMSE of the full data faster than the uniform random sampling method.
World Weather Data. Next, we examine our method on another real-world dataset. The world weather dataset contains monthly measurements of temperature, precipitation, vapor, cloud cover, wet days, and frost days from Jan 1990 to Dec 2002, on a latitude–longitude grid that covers the entire world. In our experiments, the response variable is temperature. We use a Gaussian random matrix with a fixed projection dimension, and split the data into training and testing samples. We start with an initial set of labeled points and add points at each iteration. As Figure 5 shows, our method compares favorably.