Kernel Distillation for Gaussian Processes
Gaussian processes (GPs) are flexible models that can capture complex structure in large-scale datasets due to their non-parametric nature. However, the use of GPs in real-world applications is limited by their high computational cost at inference time. In this paper, we introduce a new framework, kernel distillation, for kernel matrix approximation. The idea is adapted from knowledge distillation in the deep learning community: we approximate a fully trained teacher kernel matrix of size n × n with a student kernel matrix. We combine the inducing points method with sparse low-rank approximation in the distillation procedure. The distilled student kernel matrix requires only O(m^2) storage, where m is the number of inducing points and m ≪ n. We also show that one application of kernel distillation is fast GP prediction, where we demonstrate empirically that our approximation provides a better balance between prediction time and predictive performance than the alternatives.
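To make the storage and prediction-time claims concrete, below is a minimal sketch of an inducing-point (Nystrom / subset-of-regressors style) student approximation in NumPy. It is not the authors' exact distillation procedure (in particular it omits the sparse low-rank correction), and all function names, the RBF kernel choice, and the random selection of inducing points are illustrative assumptions; it only shows how an m-point student can replace an n × n teacher kernel, with O(m^2) storage and O(m) work per test prediction.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def distill(X, y, m=50, noise=1e-2, lengthscale=1.0, variance=1.0, seed=0):
    """Illustrative 'distillation' step: compress the n x n teacher kernel into
    m inducing inputs Z plus an m-vector of regression weights.
    Only O(m^2) quantities are kept; the full teacher matrix is never stored."""
    rng = np.random.default_rng(seed)
    Z = X[rng.choice(len(X), size=m, replace=False)]        # random inducing set (assumption)
    K_mm = rbf_kernel(Z, Z, lengthscale, variance) + 1e-6 * np.eye(m)
    K_nm = rbf_kernel(X, Z, lengthscale, variance)           # n x m cross-kernel
    # Subset-of-regressors weights: (noise * K_mm + K_mn K_nm)^{-1} K_mn y
    w = np.linalg.solve(noise * K_mm + K_nm.T @ K_nm, K_nm.T @ y)
    return Z, w

def predict_mean(x_star, Z, w, lengthscale=1.0, variance=1.0):
    """Student GP posterior mean: O(m) work per test point instead of O(n)."""
    return rbf_kernel(x_star, Z, lengthscale, variance) @ w

# Toy usage on synthetic 1-D data
if __name__ == "__main__":
    X = np.linspace(0, 10, 2000)[:, None]
    y = np.sin(X[:, 0]) + 0.1 * np.random.randn(2000)
    Z, w = distill(X, y, m=30)
    print(predict_mean(np.array([[2.5], [7.0]]), Z, w))
```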