1 Introduction
Gaussian Processes (GPs) [1] are powerful tools for regression and classification problems, as these models can learn complex representations of data through expressive covariance kernels. However, the application of GPs in the real world is limited by their poor scalability at inference time. For training data of size $n$, a GP requires $\mathcal{O}(n)$ computation and $\mathcal{O}(n^2)$ storage for inference at a single test point. Previous approaches to scaling GP inference rely either on inducing point methods [2, 3, 4] or on structure exploitation [5, 6].
More recently, the structured kernel interpolation (SKI) framework and KISS-GP [7] further improve the scalability of GPs by unifying inducing point methods and structure exploitation. These methods can suffer from degraded test performance or require the input data to have a special grid structure.

All of the previous solutions for scaling GPs focus on training GPs from scratch. In this paper, we focus on a different setting in which we have enough resources to train an exact GP but want to apply the trained model for inference on resource-limited devices such as mobile phones or robots [8]. We investigate the possibility of compressing a large trained exact GP model into a smaller and faster approximate GP model while preserving the predictive power of the exact model. This paper proposes kernel distillation, a general framework for approximating a trained GP model. Kernel distillation extends inducing point methods with insights from the SKI framework and utilizes the knowledge stored in a trained model.
In particular, we approximate the exact kernel matrix with a sparse and low-rank structured matrix. We formulate kernel distillation as a constrained Frobenius-norm minimization problem, leading to more accurate kernel approximation than previous approaches. Our method is a general-purpose kernel approximation technique: it does not require the kernel function to be separable or stationary, nor the input data to have any special structure. We evaluate our approach on several real-world datasets, and the empirical results show that kernel distillation better preserves the predictive power of a fully trained GP model while improving inference speed, compared to the alternatives.
2 Kernel Distillation
Background.
We focus on the GP regression problem. Denote the dataset as $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, which consists of input feature vectors $X = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$ and real-valued targets $\mathbf{y} = (y_1, \dots, y_n)^{\top}$. A GP models a distribution over functions, where any finite set of function values forms a joint Gaussian distribution characterized by a mean function $\mu(\mathbf{x})$ and a kernel function $k_{\theta}(\mathbf{x}, \mathbf{x}')$, where $\theta$ is the set of hyper-parameters to be trained. Using standard Gaussian identities, we arrive at the posterior predictive distribution for a test point $\mathbf{x}^*$ [9]:
$$\mu(\mathbf{x}^*) = K_{\mathbf{x}^* X}(K_{XX} + \sigma^2 I)^{-1}\mathbf{y}, \qquad \sigma^2(\mathbf{x}^*) = K_{\mathbf{x}^* \mathbf{x}^*} - K_{\mathbf{x}^* X}(K_{XX} + \sigma^2 I)^{-1}K_{X \mathbf{x}^*}.$$
The matrix $K_{XX} \in \mathbb{R}^{n \times n}$ is the covariance measured between the training inputs, and $K_{\mathbf{x}^* X}$ is the covariance between the test point and $X$. The mean and variance predictions cost $\mathcal{O}(n)$ and $\mathcal{O}(n^2)$ in time, respectively, and $\mathcal{O}(n^2)$ in storage per test point.

The computational and storage bottleneck is the exact kernel matrix $K_{XX}$. KISS-GP [7] is an inducing point method for approximating the kernel matrix and thus scaling GP training. Given a set of $m$ inducing points $U$, KISS-GP approximates the kernel matrix as $K_{XX} \approx W K_{UU} W^{\top}$, where $K_{UU}$ is the covariance evaluated at the inducing points, $K_{XU}$ is locally interpolated as $W K_{UU}$, and $W$ is a sparse matrix of interpolation weights.
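As a reference point, a minimal NumPy sketch of these exact predictive equations is shown below; the RBF kernel, noise level, and function names are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel; stands in for any trained kernel k_theta.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def exact_gp_predict(X, y, X_star, noise=0.1):
    # O(n^3) factorization once, then O(n) mean / O(n^2) variance per test point.
    K = rbf_kernel(X, X)                              # K_XX, the n x n teacher kernel
    L = np.linalg.cholesky(K + noise * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    K_star = rbf_kernel(X_star, X)                    # K_{x* X}
    mean = K_star @ alpha
    v = np.linalg.solve(L, K_star.T)
    var = rbf_kernel(X_star, X_star).diagonal() - (v ** 2).sum(0)
    return mean, var
```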
Formulation.
The goal of kernel distillation is to compress a fully trained GP model into an approximate GP model to be used for inference on a resource-limited device. We assume that during distillation we have access to the trained exact GP, its full kernel matrix $K_{XX}$, and all of the training data. Algorithm 1 in Appendix A outlines our distillation procedure.
We propose to approximate the fully trained kernel matrix $K_{XX}$ with a student kernel matrix $K'_{XX} = W K_{UU} W^{\top}$ that has a sparse and low-rank structure, where $W \in \mathbb{R}^{n \times m}$ is a sparse matrix and $K_{UU}$ is the covariance evaluated at a set of $m$ inducing points $U$. Similar to KISS-GP [7], we approximate $K_{XU}$ with $W K_{UU}$. In KISS-GP, $W$ is calculated using cubic interpolation on grid-structured inducing points; the number of inducing points therefore grows exponentially with the input dimension, limiting KISS-GP to low-dimensional data. Instead of enforcing the inducing points to lie on a grid, we choose the $m$ centroids produced by K-means clustering as the inducing points $U$. In addition, we store $U$ in a KD-tree for fast nearest neighbor search, which is used in the later optimization.

In kernel distillation, we find the optimal $W$ through a constrained optimization problem. We constrain each row of $W$ to have at most $b$ nonzero entries, and set the objective to the Frobenius-norm error between the teacher kernel and the student kernel:
$$\min_{W} \; \|K_{XX} - W K_{UU} W^{\top}\|_{F}^{2} \quad \text{subject to} \quad \|W_{i}\|_{0} \le b, \quad i = 1, \dots, n,$$
where $\|W_{i}\|_{0}$ denotes the number of nonzero entries in row $i$ of $W$.
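A minimal sketch of the inducing point selection described above, assuming scikit-learn's KMeans and SciPy's cKDTree are available; the function name and parameters are ours.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial import cKDTree

def select_inducing_points(X, m, seed=0):
    # Use the m K-means centroids as inducing points U and index them with a
    # KD-tree so later steps can query the b nearest centroids quickly.
    km = KMeans(n_clusters=m, random_state=seed, n_init=10).fit(X)
    U = km.cluster_centers_
    tree = cKDTree(U)
    return U, tree

# Example: neighbor indices of each training input among the inducing points.
# U, tree = select_inducing_points(X_train, m=200)
# _, J = tree.query(X_train, k=20)   # J[i] holds the b=20 neighbor indices for x_i
```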
Initializing $W$.
The initial value of $W$ is crucial for the subsequent optimization. We initialize $W$ with the optimal solution to $K_{XU} \approx W K_{UU}$ under the sparsity constraint. More specifically, for each $\mathbf{x}_i$ in $X$, we find its $b$ nearest inducing points in $U$ by querying the KD-tree, and denote the indices of these neighbors as $J_i$. We then initialize each row of $W$ by solving the following linear least squares problem:
$$W_{i, J_i} = \arg\min_{\mathbf{w}} \; \|K_{\mathbf{x}_i U} - \mathbf{w}\, K_{UU}[J_i, :]\|_{2}^{2},$$
where $W_{i, J_i}$ denotes the entries in row $i$ of $W$ indexed by $J_i$ and $K_{UU}[J_i, :]$ denotes the rows of $K_{UU}$ indexed by $J_i$. The entries in $W_i$ whose indices are not in $J_i$ are set to zero.
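The per-row initialization can be sketched as below; the exact indexing convention of the least squares system is our reading of the text, and the helper names are ours.

```python
import numpy as np
from scipy.sparse import lil_matrix

def init_W(K_XU, K_UU, tree, X, b):
    # Initialize each sparse row of W by least squares on its b nearest
    # inducing points, so that W[i] @ K_UU approximates K_XU[i].
    n, m = K_XU.shape
    W = lil_matrix((n, m))
    _, J = tree.query(X, k=b)                # J[i]: indices of the b nearest centroids
    for i in range(n):
        A = K_UU[J[i], :].T                  # (m, b) system restricted to the support
        w_i, *_ = np.linalg.lstsq(A, K_XU[i], rcond=None)
        W[i, J[i]] = w_i
    return W.tocsr(), J
```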
Optimizing $W$.
After $W$ is initialized, we solve the Frobenius-norm minimization problem using standard gradient descent. To satisfy the sparsity constraint, in each iteration we project each row of the gradient onto the sparsity pattern given by the index sets $J_i$ (zeroing entries outside $J_i$), and then update $W$ accordingly.
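A sketch of this projected gradient step, assuming the gradient of $\|K_{XX} - W K_{UU} W^{\top}\|_F^2$ is $4(W K_{UU} W^{\top} - K_{XX})\, W K_{UU}$ and that the sparsity pattern is fixed to the index sets $J_i$; the learning rate and iteration count are placeholders.

```python
import numpy as np

def distill_W(K_XX, K_UU, W0, J, lr=1e-4, iters=100):
    # Projected gradient descent on f(W) = ||K_XX - W K_UU W^T||_F^2,
    # keeping each row of W supported on its fixed neighbor index set J[i].
    W = W0.toarray() if hasattr(W0, "toarray") else W0.copy()
    mask = np.zeros_like(W, dtype=bool)
    for i, idx in enumerate(J):
        mask[i, idx] = True
    for _ in range(iters):
        R = W @ K_UU @ W.T - K_XX          # residual of the student kernel
        grad = 4.0 * R @ W @ K_UU          # gradient of the Frobenius objective
        grad[~mask] = 0.0                  # project onto the sparsity pattern
        W -= lr * grad
    return W
```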
Fast prediction.
One direct application of kernel distillation is fast prediction with the approximated kernel matrix. Given a test point $\mathbf{x}^*$, we follow an approximation scheme similar to the distillation step, approximating $K_{\mathbf{x}^* X}$ as
$$K_{\mathbf{x}^* X} \approx \mathbf{w}^* K_{UU} W^{\top},$$
where $\mathbf{w}^* \in \mathbb{R}^{1 \times m}$ is forced to be sparse for efficiency. The mean and variance predictions can then be approximated by
$$\mu(\mathbf{x}^*) \approx \mathbf{w}^* K_{UU} W^{\top}(K'_{XX} + \sigma^2 I)^{-1}\mathbf{y}, \qquad \sigma^2(\mathbf{x}^*) \approx K_{\mathbf{x}^* \mathbf{x}^*} - \mathbf{w}^* K_{UU} W^{\top}(K'_{XX} + \sigma^2 I)^{-1} W K_{UU} \mathbf{w}^{*\top},$$
where both $K_{UU} W^{\top}(K'_{XX} + \sigma^2 I)^{-1}\mathbf{y}$ and $K_{UU} W^{\top}(K'_{XX} + \sigma^2 I)^{-1} W K_{UU}$ can be precomputed during distillation.
To compute $\mathbf{w}^*$ efficiently, we start by finding the $b$ nearest neighbors of $\mathbf{x}^*$ in $U$ (indexed by $J$) and set the entries of $\mathbf{w}^*$ whose indices are not in $J$ to zero. For entries with indices in $J$, we solve the following least squares problem to obtain the optimal values:
$$\mathbf{w}^*_{J} = \arg\min_{\mathbf{w}} \; \|K_{\mathbf{x}^* U} - \mathbf{w}\, K_{UU}[J, :]\|_{2}^{2}.$$
Querying the $b$ nearest neighbors takes $\mathcal{O}(b \log m)$ and solving the least squares problem for $\mathbf{w}^*$ takes $\mathcal{O}(m b^2)$; the mean and variance predictions with the precomputed quantities then reduce to a few sparse inner products over the $b$ nonzero entries. The total prediction cost therefore depends only on $m$ and $b$, not on $n$. As for storage, we keep the precomputed $m$-vector for mean prediction and the diagonal of the precomputed $m \times m$ matrix for variance prediction, which costs $\mathcal{O}(m)$. Table 1 compares the time and storage complexity of different GP approximation approaches.
Table 1: Time and storage complexity of approximate GP prediction per test point ($n$ training points, $m$ inducing points, sparsity $b$).

Methods | Mean Prediction | Variance Prediction | Storage
FITC [10] | $\mathcal{O}(m)$ | $\mathcal{O}(m^2)$ | $\mathcal{O}(nm)$
KISS-GP [7] | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ | $\mathcal{O}(n + m)$
Kernel distillation (this work) | $\mathcal{O}(b \log m + m b^2)$ | $\mathcal{O}(b \log m + m b^2)$ | $\mathcal{O}(m)$
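A sketch of the prediction path just described; the precomputed quantities follow our reconstruction of the formulas above, and using the full matrix $V$ (rather than only its stored diagonal) is a simplification on our part.

```python
import numpy as np

def precompute(W, K_UU, y, noise=0.1):
    # Quantities that depend only on training data; computed once at distillation time.
    A = W @ K_UU @ W.T + noise * np.eye(W.shape[0])
    d = K_UU @ W.T @ np.linalg.solve(A, y)              # m-vector for the mean
    V = K_UU @ W.T @ np.linalg.solve(A, W @ K_UU)       # m x m matrix for the variance
    return d, V

def fast_predict(x_star, U, tree, K_UU, d, V, kernel, b=20):
    # Sparse interpolation weights for the test point, then cheap lookups.
    _, J = tree.query(x_star, k=b)
    k_sU = kernel(x_star[None, :], U).ravel()           # K_{x* U}
    w, *_ = np.linalg.lstsq(K_UU[J, :].T, k_sU, rcond=None)
    mean = w @ d[J]
    var = kernel(x_star[None, :], x_star[None, :])[0, 0] - w @ V[np.ix_(J, J)] @ w
    return mean, var
```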
3 Experiments
We evaluate kernel distillation on its ability to approximate the exact kernel, its predictive power, and its speed at inference time. In particular, we compare our approach to FITC and KISS-GP, as they are the most popular approaches and the most closely related to kernel distillation. Simulation experiments on kernel reconstruction and predictive power are presented in Appendix B.
Empirical Study.
We evaluate the performance of kernel distillation on several benchmark regression data sets. A summary of the datasets is given in Table 2. The detailed setup of experiments is in Appendix C.
We start by evaluating how well kernel distillation preserves the predictive performance of the teacher kernel. The metric we use is the standardized mean squared error (SMSE), $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 / \mathrm{Var}(\mathbf{y})$, for true labels $y_i$ and model predictions $\hat{y}_i$. Table 2 summarizes the results. Exact GPs achieve the lowest errors on all datasets, and FITC attains the second lowest error on all datasets except Boston Housing. Errors with kernel distillation are very close to FITC, while KISS-GP has the largest error on every dataset. The poor performance of KISS-GP likely results from the loss of information when projecting the input data to a low dimension.
Table 2: Dataset statistics and test SMSE for exact GP, FITC, KISS-GP, and kernel distillation.

Dataset | d | # train | # test | Exact | FITC | KISS-GP | Distill
Boston Housing | 13 | 455 | 51 | 0.076 | 0.103 | 0.095 | 0.091
Abalone | 8 | 3,133 | 1,044 | 0.434 | 0.438 | 0.446 | 0.439
PUMADYM32N | 32 | 7,168 | 1,024 | 0.044 | 0.044 | 1.001 | 0.069
KIN40K | 8 | 10,000 | 30,000 | 0.013 | 0.030 | 0.386 | 0.173
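For reference, the SMSE reported in Table 2 can be computed as follows (a minimal sketch with our own variable names):

```python
import numpy as np

def smse(y_true, y_pred):
    # Mean squared error normalized by the variance of the true test targets.
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)
```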
Figure 1: Test error and variance-prediction error versus sparsity $b$. (a) Boston Housing mean; (b) Boston Housing variance; (c) Abalone mean; (d) Abalone variance.
Table 3: Average prediction time per test point (seconds).

Dataset | FITC | KISS-GP | Distill
Boston Housing | 0.0081 | 0.00061 | 0.0017
Abalone | 0.0631 | 0.00018 | 0.0020
PUMADYM32N | 1.3414 | 0.0011 | 0.0035
KIN40K | 1.7606 | 0.0029 | 0.0034
We further study the effect of sparsity on predictive performance. We vary $b$ over $\{5, 10, \dots, 40\}$ and compare the test error and variance prediction of KISS-GP and kernel distillation on the Boston Housing and Abalone datasets. The results are shown in Figure 1. As expected, the error for kernel distillation decreases as the sparsity increases, and $b$ only needs to be 15 or 20 to outperform KISS-GP. For variance prediction, we plot the error between the outputs of the exact GP and the approximate GPs. Kernel distillation provides more reliable variance outputs than KISS-GP at every level of sparsity.
Finally, we evaluate the prediction speed of kernel distillation against FITC and KISS-GP. The setup for the approximate models is the same as in the predictive performance experiment. For each dataset, we run prediction on 1,000 test points and report the average prediction time in seconds. Table 3 summarizes the results: both KISS-GP and kernel distillation are much faster than FITC on all datasets. Although kernel distillation is slightly slower than KISS-GP, the cost in prediction time is acceptable given the improvement in accuracy and the more reliable uncertainty estimates. Also, although KISS-GP claims constant prediction time complexity in theory [11], the actual implementation is still data-dependent and its speed varies across datasets. In general, kernel distillation provides a better trade-off between predictive power and scalability than its alternatives.
Conclusion.
We proposed kernel distillation, a general framework for compressing a trained exact GP kernel into a student kernel with low-rank and sparse structure. Our framework does not assume any special structure in the input data or the kernel function, and can therefore be applied "out of the box" to any dataset. Kernel distillation formulates the approximation as a constrained Frobenius-norm minimization between the exact teacher kernel and the approximate student kernel.
The distilled kernel matrix reduces the storage cost to $\mathcal{O}(nb + m^2)$, compared to $\mathcal{O}(nm)$ for other inducing point methods. Moreover, we show that one application of kernel distillation is fast and accurate GP prediction: kernel distillation produces more accurate results than KISS-GP, and its prediction time is much faster than FITC. Overall, our method provides a better balance between speed and predictive performance than other approximate GP approaches.
References
[1] Carl Edward Rasmussen. Gaussian processes in machine learning. In Advanced Lectures on Machine Learning, pages 63–71. Springer, 2004.
[2] Matthias Seeger, Christopher Williams, and Neil Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Artificial Intelligence and Statistics 9, number EPFL-CONF-161318, 2003.
[3] Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In AISTATS, volume 12, pages 567–574, 2009.
[4] Neil Lawrence, Matthias Seeger, and Ralf Herbrich. Fast sparse Gaussian process methods: The informative vector machine. In Proceedings of the 16th Annual Conference on Neural Information Processing Systems, number EPFL-CONF-161319, pages 609–616, 2003.
[5] Yunus Saatçi. Scalable inference for structured Gaussian process models. PhD thesis, University of Cambridge, 2012.
[6] Andrew Wilson, Elad Gilboa, John P. Cunningham, and Arye Nehorai. Fast kernel learning for multidimensional pattern extrapolation. In Advances in Neural Information Processing Systems, pages 3626–3634, 2014.
[7] Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In Proceedings of the 32nd International Conference on Machine Learning, pages 1775–1784, 2015.
[8] Marc Peter Deisenroth, Dieter Fox, and Carl Edward Rasmussen. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, 2015.
[9] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[10] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264, 2005.
[11] Andrew Gordon Wilson, Christoph Dann, and Hannes Nickisch. Thoughts on massively scalable Gaussian processes. arXiv preprint arXiv:1511.01870, 2015.
Appendix A Sparse Low-rank Kernel Approximation
Algorithm 1 outlines our distillation approach.
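Assuming the helper functions sketched in Section 2 (select_inducing_points, init_W, distill_W, precompute), the end-to-end procedure can be assembled roughly as follows; this is our reading of the distillation steps rather than a verbatim reproduction of the original listing.

```python
def kernel_distillation(X, y, kernel, m=200, b=20, noise=0.1):
    # End-to-end distillation: returns everything needed for fast prediction.
    U, tree = select_inducing_points(X, m)     # K-means centroids + KD-tree
    K_XX = kernel(X, X)                        # teacher kernel from the trained exact GP
    K_UU = kernel(U, U)
    K_XU = kernel(X, U)
    W0, J = init_W(K_XU, K_UU, tree, X, b)     # sparse least-squares initialization
    W = distill_W(K_XX, K_UU, W0, J)           # projected gradient refinement
    d, V = precompute(W, K_UU, y, noise)       # quantities for fast prediction
    return U, tree, K_UU, d, V
```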
Appendix B Simulation Experiment
Kernel Reconstruction.
Figure 2: (a)–(c) Absolute error matrices for KISS-GP, SoR, and kernel distillation; (d) reconstruction error versus sparsity $b$.
We first study how well kernel distillation can reconstruct the full teacher kernel matrix. We generate a 1000 × 1000 kernel matrix from an RBF kernel evaluated at (sorted) inputs randomly sampled from a fixed interval. We compare kernel distillation against KISS-GP and SoR (FITC is essentially SoR with a diagonal correction). The number of grid points for KISS-GP is set to 400, the number of inducing points is 200 for SoR and 100 for kernel distillation, and the sparsity $b$ is set to 6 for kernel distillation.
Kernel distillation achieves a lower Frobenius-norm error than both SoR and KISS-GP, even though it uses far fewer inducing points. Moreover, from the absolute error matrices (Figure 2 a–c), we can see that the errors are more evenly distributed for kernel distillation, while strong error patterns appear for the other two methods.
We also show how the sparsity parameter $b$ affects the approximation quality by evaluating the error for different choices of $b$, as shown in Figure 2 (d). We observe that the error converges once the sparsity is above 5 in this example, showing that our structured student kernel can approximate the full teacher kernel reasonably well even when $W$ is extremely sparse.
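A rough sketch of the reconstruction study setup; the sampling interval and lengthscale are assumptions of ours, and the approximate kernels plugged into the error function would come from the methods above.

```python
import numpy as np

# Illustrative setup: a 1000 x 1000 RBF kernel on sorted scalar inputs.
x = np.sort(np.random.uniform(-5.0, 5.0, size=(1000, 1)), axis=0)  # interval assumed
K_true = np.exp(-0.5 * ((x - x.T) ** 2))                            # unit lengthscale assumed

def frobenius_error(K_approx):
    # F-norm reconstruction error used to compare the approximations.
    return np.linalg.norm(K_true - K_approx, ord="fro")

# e.g. errors = {b: frobenius_error(W_b @ K_UU @ W_b.T) for b in range(2, 11)}
```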
Toy 1D Example.
Figure 3: Toy 1D example. (a) Mean prediction; (b) variance prediction.
To evaluate the distilled model's predictive ability, we set up the following experiment. We sample training inputs uniformly from $[-10, 10]$ and generate responses from a noisy nonlinear function of the inputs. We first train an exact GP with an RBF kernel as the teacher, then apply kernel distillation with the number of inducing points set to 100 and the sparsity set to 10. We compare the mean and variance predictions of kernel distillation with those of KISS-GP trained with 400 grid inducing points.
The results are shown in Figure 3. The mean predictions of kernel distillation are indistinguishable from those of the exact GP and KISS-GP. For the variance, kernel distillation's predictions are much closer to the variance outputs of the exact GP, while the variance outputs predicted by KISS-GP are far from the exact solution.
This experiment exposes a potential problem with KISS-GP: it sacrifices the ability to provide reliable uncertainty estimates, a crucial property of Bayesian modeling, in exchange for massive scalability. Kernel distillation, on the other hand, provides uncertainty predictions close to those of the exact GP model.
Appendix C Experiment Setup
We compare kernel distillation with the teacher kernel (exact GP), FITC, and KISS-GP. We use the same inducing points, selected by K-means, for both FITC and kernel distillation. For KISS-GP, since none of the datasets lie in a low-dimensional space, we project the inputs to 2D and construct a 2D grid as the inducing points. The number of inducing points (on the 2D grid) for KISS-GP is set to 4,900 (70 per grid dimension) for Boston Housing, 10K for Abalone, 90K for PUMADYM32N, and 250K for KIN40K. The number of inducing points for FITC and kernel distillation is 70 for Boston Housing, 200 for Abalone, and 1K for PUMADYM32N and KIN40K. The sparsity $b$ in kernel distillation is set to 20 for Boston Housing and 30 for the other datasets. For all methods, we use the ARD kernel, defined as
$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2}\sum_{j=1}^{d}\frac{(x_j - x'_j)^2}{\ell_j^2}\right),$$
where $d$ is the dimension of the input data and $\{\sigma_f, \ell_1, \dots, \ell_d\}$ are the hyper-parameters to learn.
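A small sketch of the ARD kernel as written above; the explicit signal variance parameter and the function name are our conventions.

```python
import numpy as np

def ard_kernel(A, B, lengthscales, signal_var=1.0):
    # ARD squared-exponential kernel with one lengthscale per input dimension.
    diff = (A[:, None, :] - B[None, :, :]) / lengthscales   # broadcast over all pairs
    return signal_var * np.exp(-0.5 * (diff ** 2).sum(-1))
```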
All the experiments were conducted on a PC laptop with an Intel Core(TM) i7-6700HQ CPU @ 2.6 GHz and 16.0 GB RAM.