Bayesian methods are well suited for learning tasks in the low-data limit, because they offer a principled way to include prior knowledge about the problem. If the prior knowledge is suitable for the task, only very few observations may be needed to fit the posterior well. However, the acquisition and effective encoding of this prior knowledge has been a longstanding challenge in the field.
A powerful framework to acquire prior knowledge about a task is meta-learning. Meta-learning refers to a method in which knowledge is gained by solving a set of specific tasks (the meta-tasks) and subsequently used to improve the model performance on a different task (the target task). This can be seen as gaining knowledge (so-called meta-knowledge) on the meta-tasks and incorporating it as prior knowledge into the learning model to solve the target task. The incorporation of this prior knowledge into the model is called inductive transfer.
The largest benefits can be gained from meta-learning in a setting in which there is a large amount of data on the meta-tasks, but only little data on the target task. This leads to the hypothesis that successful meta-learning approaches should utilize two different kinds of models: one model that is able to handle large amounts of data, to perform the actual meta-learning (the meta-model); and one model that is able to deal with small quantities of data and into which the meta-knowledge can be effectively integrated (the target model). In this work, we propose to use deep neural networks as meta-models and Gaussian processes (GPs) as target models.
The GP lends itself particularly easily to the desired purpose of a target model, because it offers a way to perform nonlinear regression on small amounts of data, while being able to incorporate prior knowledge in a Bayesian manner. This prior knowledge (which can be acquired by a meta-learning procedure) can be encoded into the GP model by modifying the parameters of either its kernel function or its mean function. Both of these options might yield a way to successfully perform meta-learning in GPs. In this work, we present the first evidence that learning a GP’s mean function can outperform kernel learning on meta-learning tasks.
We make the following contributions:
- We formalize meta-learning in Gaussian processes via mean function learning.
- We present an analytical argument for why mean function learning can be superior to kernel learning in this setting.
- We empirically validate this argument on synthetic function regression and MNIST image reconstruction tasks.
In the following, we are going to formalize the idea of GP prior learning in a meta-learning setting (Sec. 2), present an analytical argument for why mean function learning can be superior to kernel learning under certain conditions (Sec. 3.1), discuss the potential risk of overfitting when learning a GP prior (Sec. 3.2) and motivate why deep neural networks are suitable meta-models to learn GP mean functions (Sec. 3.3). Thereafter, we are going to provide empirical evidence of our claims (Sec. 4), a review of the related literature (Sec. 5), and a discussion of our work (Sec. 6). An overview of our proposed framework is depicted in Figure 1.
2 Meta-learning in Gaussian Processes
In order to explain the setting of this work, we first define meta-learning more formally. As mentioned above, we define meta-learning as a setting where we have access to a set of meta-tasks and a related target task from the same general domain. The set of meta-tasks consists of data sets $\mathcal{D} = \{\mathcal{D}_i\}_{i=1}^{N}$, with one data set for each meta-task. Each of these data sets contains observations $\mathcal{D}_i = \{(x_{i,j}, y_{i,j})\}_{j=1}^{n_i}$, where $x_{i,j} \in \mathbb{R}^{d}$ and $y_{i,j} \in \mathbb{R}^{d'}$ for tasks in which the respective functions to be learned are defined as $f_i : \mathbb{R}^{d} \to \mathbb{R}^{d'}$. In our setting, all meta-tasks share the same input and output dimensionalities $d$ and $d'$, but they can have different numbers of observations $n_i$.
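In addition to the meta-tasks, we have a target task with a data set $\tilde{\mathcal{D}} = \{\tilde{X}, \tilde{y}, \tilde{X}^*, \tilde{y}^*\}$, where $\tilde{X}$ and $\tilde{y}$ are the training points and their respective values and $\tilde{X}^*$ and $\tilde{y}^*$ are the test points and their respective values. As mentioned above, we assume there to be much more data in the meta-tasks than in the target task, that is, $\sum_i n_i \gg \tilde{n}$, where $n_i$ is the number of observations in meta-task $i$ and $\tilde{n}$ is the number of target training points.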
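In order to predict on the target test points $\tilde{X}^*$, we want to fit a GP to the target training data $(\tilde{X}, \tilde{y})$ with a prior

$$f \sim \mathcal{GP}\big(m_{\theta_m}(\cdot),\, k_{\theta_k}(\cdot, \cdot)\big),$$

where the mean and kernel functions are parameterized by sets of parameters $\theta_m$ and $\theta_k$, respectively. These parameters can now be optimized on the meta-task set $\mathcal{D}$, that is,

$$\theta_m^*, \theta_k^* = \arg\min_{\theta_m, \theta_k} \mathcal{L}(\theta_m, \theta_k, \mathcal{D}),$$

with a suitable loss function $\mathcal{L}$ for the parameters $\theta_m$ and $\theta_k$. This approach can be seen as gaining knowledge from solving the meta-tasks and using the parameters $\theta_m^*$ and $\theta_k^*$ to encode the thus acquired meta-knowledge into the GP prior for the target task.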
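In GP regression, the loss function is often chosen to be the negative log marginal likelihood (LML). The LML on a meta-task with inputs $X_i$ and values $y_i$ can be computed as

$$\mathcal{L}(\theta, \mathcal{D}_i) = -\log p(y_i \mid X_i, \theta) = -\log \int p(y_i \mid f_i, X_i)\, p(f_i \mid X_i, \theta)\, \mathrm{d}f_i, \tag{2}$$

where all learnable parameters of the GP prior (i.e., the mean parameters, kernel parameters, or both) are denoted as $\theta$ and

$$f_i = \big(f(x_{i,1}), \dots, f(x_{i,n_i})\big)^\top$$

is a vector of the function values of the latent function $f$ evaluated at the points $X_i$.
Given this loss function over a single meta-task, we define the loss over all meta-tasks as a sum over their individual losses, that is,

$$\mathcal{L}(\theta, \mathcal{D}) = \sum_{i} \mathcal{L}(\theta, \mathcal{D}_i).$$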
This loss can then be optimized using any general-purpose optimization method. In this work, we use stochastic gradient descent (SGD). Algorithm 1 outlines the procedure to optimize the GP prior parameters $\theta$ (which can stand for the mean function parameters $\theta_m$, the kernel function parameters $\theta_k$, or both; see Eqs. (1), (2)).
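The optimization loop described above can be sketched in a few lines of NumPy. This is a minimal illustration under strong simplifying assumptions, not the paper's Algorithm 1: the mean function is a single learnable constant `c` instead of a neural network, the kernel is a fixed RBF kernel with an assumed noise level, and the meta-tasks are hypothetical synthetic curves around the value 2. All function and variable names are made up for this example.

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=1.0):
    """Squared-exponential kernel matrix between two 1-D input arrays."""
    diff = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def task_loss_and_grad(x, y, c, noise=0.1):
    """Negative log marginal likelihood of one task and its gradient w.r.t. c.

    Assumption: the GP prior has a constant mean c (a stand-in for a learned
    mean function) and a fixed RBF kernel plus observation noise.
    """
    n = len(x)
    K = rbf_kernel(x, x) + noise * np.eye(n)
    resid = y - c
    alpha = np.linalg.solve(K, resid)      # K^{-1} (y - m)
    _, logdet = np.linalg.slogdet(K)
    loss = 0.5 * resid @ alpha + 0.5 * logdet + 0.5 * n * np.log(2.0 * np.pi)
    grad = -np.sum(alpha)                  # derivative of the loss w.r.t. c
    return loss, grad

def meta_loss(tasks, c):
    """Loss over all meta-tasks: the sum of the per-task losses."""
    return sum(task_loss_and_grad(x, y, c)[0] for x, y in tasks)

# Hypothetical meta-tasks with different numbers of observations.
rng = np.random.default_rng(0)
tasks = []
for _ in range(20):
    n_i = rng.integers(8, 13)
    x = rng.uniform(-3.0, 3.0, size=n_i)
    y = 2.0 + 0.3 * np.sin(x + rng.uniform(0.0, 2.0 * np.pi))
    tasks.append((x, y))

# Stochastic gradient descent: one randomly drawn meta-task per step.
c = 0.0
initial_loss = meta_loss(tasks, c)
for _ in range(500):
    x, y = tasks[rng.integers(len(tasks))]
    _, grad = task_loss_and_grad(x, y, c)
    c -= 0.01 * grad
final_loss = meta_loss(tasks, c)
```

In a realistic setting the constant would be replaced by a neural network and the analytic gradient by automatic differentiation; the structure of the loop stays the same.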
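Once the parameters of the GP prior are optimized, we can use the prior to fit a GP to the target training data $(\tilde{X}, \tilde{y})$ and predict on the test points $\tilde{X}^*$. If we evaluate the predictive posterior of the GP on the test points, it yields

$$\begin{aligned}
p(\tilde{y}^* \mid \tilde{X}^*, \tilde{X}, \tilde{y}) &= \mathcal{N}(m^*, K^*), \\
m^* &= m(\tilde{X}^*) + K_{\tilde{X}^*\tilde{X}}\, K_{\tilde{X}\tilde{X}}^{-1}\, \big(\tilde{y} - m(\tilde{X})\big), \\
K^* &= K_{\tilde{X}^*\tilde{X}^*} - K_{\tilde{X}^*\tilde{X}}\, K_{\tilde{X}\tilde{X}}^{-1}\, K_{\tilde{X}\tilde{X}^*},
\end{aligned} \tag{4}$$

where $K_{\tilde{X}\tilde{X}}$ denotes the Gram matrix (also known as the kernel matrix) with $[K_{\tilde{X}\tilde{X}}]_{ij} = k(\tilde{x}_i, \tilde{x}_j)$, and similarly for $K_{\tilde{X}^*\tilde{X}}$ and $K_{\tilde{X}^*\tilde{X}^*}$.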
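As a rough illustration of the predictive posterior, the following sketch computes the standard noise-free GP posterior mean and covariance for an arbitrary mean function and kernel. The function names, the jitter value, and the choice of `np.sin` as a stand-in for a learned mean function are assumptions made for this example only.

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=1.0):
    """Squared-exponential kernel matrix between two 1-D input arrays."""
    diff = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def gp_predict(x_train, y_train, x_test, mean_fn, kernel_fn, jitter=1e-8):
    """Posterior mean and covariance of a GP with prior mean mean_fn."""
    K = kernel_fn(x_train, x_train) + jitter * np.eye(len(x_train))
    K_s = kernel_fn(x_test, x_train)       # test-vs-train Gram matrix
    K_ss = kernel_fn(x_test, x_test)       # test-vs-test Gram matrix
    # The prior mean is subtracted from the targets and added back at the test points.
    post_mean = mean_fn(x_test) + K_s @ np.linalg.solve(K, y_train - mean_fn(x_train))
    post_cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return post_mean, post_cov

x_train = np.array([-1.0, 0.0, 1.5])
y_train = np.array([0.2, 1.0, -0.4])
x_test = np.array([-1.0, 0.0, 1.5, 3.0])   # first three coincide with training inputs
mean, cov = gp_predict(x_train, y_train, x_test, np.sin, rbf_kernel)
```

At the training inputs the posterior mean reproduces the observed values and the posterior variance collapses, while at the unseen input the prediction falls back towards the prior mean with nonzero variance.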
3 Learning deep mean functions for meta-learning GPs
In this section, we are going to lay out the main theoretical arguments for our approach. Firstly, we are going to present an analytical treatment of mean function and kernel function learning in a meta-learning setting and show why mean function learning can be superior under certain conditions. Secondly, we are going to give an intuition for why overfitting can be a risk in conventional mean function learning, but not in a meta-learning setting. Lastly, we are going to argue why deep neural networks form a function class that is well suited for the use as a GP mean function in our meta-learning framework.
3.1 Mean functions can be superior to kernel functions
While learning kernel functions for GPs is a popular area of research, it entails a number of challenges. The most fundamental challenge is that the kernel function has to be positive definite in order to define an inner product in a suitable Hilbert space. To ensure this property, many approaches resort to learning an explicit feature mapping into such a Hilbert space. This strategy guarantees the correctness of the method, but it also sacrifices some of the computational benefits of the original kernel trick. GP mean functions, in contrast, do not suffer from any such constraints and can therefore be learned more freely without taking any special precautions.
The question remains whether we can incorporate more prior knowledge into the mean function $m$ or into the kernel function $k$, that is, whether we should rather learn the mean or the kernel function. Based on the observation from Equation (4), we make the following proposition:
Proposition 1. Given a Gaussian process prior $f \sim \mathcal{GP}(m_{\theta_m}, k_{\theta_k})$, there exist types of prior knowledge that can be encoded into the mean function parameters $\theta_m$ when a naïve kernel (e.g., ) is used, but not into the kernel parameters $\theta_k$ when a naïve mean function (i.e., $m \equiv 0$) is used.
Proof. Let us assume for simplicity that we generate data from a known noiseless process with nonzero mean, that is, $y = f(x)$ and $f \not\equiv 0$.
If we want to fit a GP to these data and want it to yield correct predictions, we need the posterior to satisfy

$$m(\tilde{X}^*) + K_{\tilde{X}^*\tilde{X}}\, K_{\tilde{X}\tilde{X}}^{-1}\, \big(\tilde{y} - m(\tilde{X})\big) = \tilde{y}^*, \tag{5}$$

where the Gram matrices $K_{\tilde{X}^*\tilde{X}}$ and $K_{\tilde{X}\tilde{X}}$ are defined as above (Sec. 2).
Let us now assume that we only happen to observe training points where the process is zero, but that it is nonzero for some test points, that is, $\tilde{y} = 0$ and $\tilde{y}^* \neq 0$ (this assumption is more likely to hold for a small number of training points $\tilde{n}$). If we try to encode all our prior knowledge into $k$ and choose an uninformative $m$, that is, $m \equiv 0$, the LHS of Equation 5 will always be 0, regardless of the choice of $k$, while the RHS will sometimes be nonzero. However, if we encode our prior knowledge into $m$, that is, choose $m = f$, Equation 5 will always hold, regardless of the choice of $k$. ∎
Note that this proposition assumes a finite (and potentially quite small) number of target training points $\tilde{n}$, such that the assumptions about the training values $\tilde{y}$ and the test values $\tilde{y}^*$ (see above) are not too unlikely. For $\tilde{n} \to \infty$, one can show that under mild assumptions we can always find a kernel that will make the posterior approach the true function arbitrarily closely. However, in the case of small $\tilde{n}$, which we assume to be more common in the meta-learning setting (see Sec. 2), we believe Proposition 1 to be more relevant than the asymptotic kernel universality results.
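The argument behind the proposition can be checked numerically on a small toy problem. In the sketch below, the true function (a ramp that is zero for negative inputs), the training and test locations, and the RBF lengthscales are all hypothetical choices made for illustration; they are not taken from the experiments.

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale):
    """Squared-exponential kernel matrix between two 1-D input arrays."""
    diff = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def posterior_mean(x_train, y_train, x_test, mean_fn, lengthscale):
    """Noise-free GP posterior mean for a given prior mean function."""
    K = rbf_kernel(x_train, x_train, lengthscale) + 1e-8 * np.eye(len(x_train))
    K_s = rbf_kernel(x_test, x_train, lengthscale)
    return mean_fn(x_test) + K_s @ np.linalg.solve(K, y_train - mean_fn(x_train))

def true_fn(x):
    """Assumed generating process: zero for x <= 0, nonzero for x > 0."""
    return np.maximum(x, 0.0)

def zero_mean(x):
    return np.zeros_like(x)

x_train = np.array([-2.0, -1.5, -1.0, -0.5])   # only zero-valued observations
y_train = true_fn(x_train)
x_test = np.array([0.5, 1.0, 2.0])             # the process is nonzero here

# Zero prior mean: the prediction is 0 for every kernel setting we try.
zero_mean_preds = [
    posterior_mean(x_train, y_train, x_test, zero_mean, ls) for ls in (0.3, 1.0, 3.0)
]
# Prior mean equal to the true function: the prediction is exact.
informed_pred = posterior_mean(x_train, y_train, x_test, true_fn, 1.0)
```

Because the residual between the observations and the prior mean is zero in both cases, the kernel term vanishes and the prediction is determined entirely by the prior mean at the test points.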
Given these insights and the fact that learning mean functions poses a less constrained problem than kernel learning, we propose that mean functions are a better and more natural choice than kernels for meta-learning in GPs under certain conditions. We will empirically validate this statement in the experimental section (Sec. 4).
3.2 On the risk of overfitting in GP prior learning
Optimizing hyperparameters in machine learning models always brings about a certain risk of overfitting. The extent of this risk depends on the data set on which different hyperparameters are compared, on the objective function that is optimized, and on the procedure that is used to optimize this function.
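One of the most principled ways of choosing the hyperparameters $\theta$ of the GP prior is Bayesian model selection

$$p(\theta \mid y, X) = \frac{p(y \mid X, \theta)\, p(\theta)}{p(y \mid X)}, \tag{6}$$

where the denominator is given by

$$p(y \mid X) = \int p(y \mid X, \theta)\, p(\theta)\, \mathrm{d}\theta.$$

Instead of this maximum a posteriori (MAP) inference over $\theta$, practitioners often resort to a maximum likelihood estimate (MLE) for reasons of tractability, by optimizing the likelihood term in the numerator of Equation (6) with respect to $\theta$. Note that the negative logarithm of this term is exactly the LML from Equation (2),

$$-\log p(y \mid X, \theta) = \frac{1}{2}\big(y - m(X)\big)^\top K^{-1} \big(y - m(X)\big) + \frac{1}{2}\log |K| + \frac{n}{2}\log 2\pi,$$

with hyperparameters $\theta$, where $K$ is the kernel matrix on the $n$ inputs $X$ and $|K|$ denotes the determinant of the matrix $K$.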
It is evident that this LML naturally decomposes into three terms. The first term depends on the kernel parameters, the mean function parameters, and the data. It can be seen as measuring the goodness-of-fit of the model to the data. The second term only depends on the kernel parameters and can be seen as a complexity penalty. The third term normalizes the likelihood and is constant with respect to the data and the parameters.
Notice that this objective function contains an automatic tradeoff between data-fit and model complexity when it comes to the kernel parameters, but it does not contain any complexity penalty on the mean function parameters. Optimizing the mean function parameters w.r.t. the LML directly on the training data can therefore lead to overfitting [cf. 17]. It has been noted that even for the kernel parameters, the penalty term might sometimes not be strong enough to consistently avoid overfitting, even though this effect can be additionally combated with regularization [22, 6].
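The three-term decomposition, and the absence of a penalty on the mean function, can be made concrete with a short sketch. The data, kernel, and noise level below are hypothetical; the helper `nlml_terms` is a name introduced for this example.

```python
import numpy as np

def nlml_terms(y, mean, K):
    """Split the negative log marginal likelihood into its three terms."""
    n = len(y)
    resid = y - mean
    L = np.linalg.cholesky(K)
    z = np.linalg.solve(L, resid)            # z @ z equals resid^T K^{-1} resid
    data_fit = 0.5 * z @ z                   # depends on mean, kernel, and data
    complexity = np.sum(np.log(np.diag(L)))  # equals 0.5 * log|K|, kernel only
    constant = 0.5 * n * np.log(2.0 * np.pi) # normalization
    return data_fit, complexity, constant

x = np.linspace(0.0, 4.0, 6)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) + 0.05 * np.eye(6)
y = np.sin(x) + 1.0

# A zero mean versus a mean that simply memorizes the training targets.
fit_zero, comp_zero, const = nlml_terms(y, np.zeros(6), K)
fit_interp, comp_interp, _ = nlml_terms(y, y.copy(), K)
```

The memorizing mean drives the data-fit term to zero while the complexity term is unchanged, which illustrates why optimizing a flexible mean function directly on the training data is not penalized by this objective.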
In this work, we hence refrain entirely from optimizing any parameters using the LML on the target training data $(\tilde{X}, \tilde{y})$, and only optimize the parameters on the meta-tasks $\mathcal{D}$. Thus, there is no way for the training data of the target task to inform the GP prior and therefore no possibility of overfitting to the training data. Our GP prior is still a strict prior in the Bayesian sense, that is, it describes our belief about the target task before actually seeing the data.
While overfitting to the target training data is impossible in our setting, it is quite possible to overfit to the meta-tasks $\mathcal{D}$. Such overfitting could lead to an increase in generalization error and therefore a lower performance on the target task, rendering the learned prior less informative. In this work, we perform the meta-learning on sufficiently heterogeneous meta-task sets $\mathcal{D}$, such that the optimization procedure (Alg. 1) is less prone to overfit on any single meta-task $\mathcal{D}_i$. Since we are using neural networks to parameterize the mean and kernel functions, overfitting to the meta-tasks can additionally be inhibited by the standard techniques (weight decay, dropout, batch normalization, early stopping, etc.).
3.3 Deep mean functions for GPs
Given the insight that meta-learning the GP’s mean function might help us in solving our target task, we are still faced with the problem of choosing a suitable parameterization, that is, a suitable function class for this mean function. Since we expect to have a reasonably large amount of data from the meta-tasks (see Sec. 2), we want to choose a function class that can scale easily to such amounts of data and that is flexible enough to incorporate the meta-knowledge. A class of parametric functions that exhibit these two properties are deep neural networks [19, 27, 13].
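As an illustrative example, let us assume that we want to parameterize the mean function as a feed-forward neural network with two hidden layers. The mean value at a point $x$ will then be given by

$$m_{\theta_m}(x) = W_3\, \sigma\big(W_2\, \sigma(W_1 x + b_1) + b_2\big) + b_3,$$

where the $W_i$ are the weight matrices of the layers, the $b_i$ are their biases, and $\sigma$ is a nonlinear activation function, that is, $\theta_m = \{W_1, W_2, W_3, b_1, b_2, b_3\}$.
Given this functional form, we can train the mean function according to Algorithm 1. We compute the gradients with respect to the parameters by automatic differentiation (as available in TensorFlow and PyTorch).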
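A forward pass of such a two-hidden-layer mean function might look as follows in plain NumPy. The layer sizes, the sigmoid activation, and the initialization scheme are assumptions for this sketch; training would additionally require gradients, which an automatic differentiation framework would provide.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_params(rng, d_in, hidden=(64, 64), d_out=1):
    """Randomly initialized weights and zero biases for each layer (assumed scheme)."""
    sizes = (d_in, *hidden, d_out)
    return [
        (rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]

def deep_mean(params, X):
    """Evaluate the mean function on a batch X of shape (n_points, d_in).

    Inputs are stored as rows, so each layer computes X @ W + b.
    """
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = sigmoid(X @ W1 + b1)
    h2 = sigmoid(h1 @ W2 + b2)
    return (h2 @ W3 + b3).squeeze(-1)   # linear output layer, one value per point

rng = np.random.default_rng(0)
params = init_params(rng, d_in=2)
X = rng.uniform(-1.0, 1.0, size=(5, 2))
prior_mean_values = deep_mean(params, X)
```

The returned vector can be used wherever the prior mean at a set of inputs is needed, for example when subtracting the mean from the training targets before computing the GP posterior.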
It has been shown that using deep neural networks as kernel functions for GPs yields neural networks with nonparametric (or equivalently, “infinitely wide”) last layers. Similarly, using deep mean functions amounts to fitting a neural network to the data and then modeling the residuals with a GP. It thus offers a natural way to combine the predictive power of neural networks with the calibrated uncertainties of GPs and can therefore be seen as an avenue of Bayesian Deep Learning.
In the following section, we will proceed by empirically testing the hypotheses derived from our theoretical analysis of the problem.
| Mean function | Likelihood | MSE | Likelihood | MSE | Likelihood | MSE |
|---|---|---|---|---|---|---|
| zero mean | -1.31 ± 0.07 | 1.21 ± 0.04 | -5.68 ± 0.13 | 1.02 ± 0.15 | 19.99 ± 0.22 | 0.00 ± 0.00 |
| true mean | -0.74 ± 0.08 | 0.79 ± 0.03 | -4.32 ± 0.14 | 0.53 ± 0.04 | 21.55 ± 0.27 | 0.00 ± 0.00 |
| learned mean | -0.75 ± 0.08 | 0.80 ± 0.03 | -4.43 ± 0.13 | 0.52 ± 0.03 | 20.76 ± 0.22 | 0.00 ± 0.00 |

Table 1: Performance comparison of GP regression with a zero mean function, a learned mean function, and the true mean function of the generating process on synthetic sinusoid function data for different numbers of training points (one likelihood/MSE column pair per setting). The performance is measured in terms of likelihood and mean squared error. The reported values are means and their respective standard errors over 200 runs.
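| Model | Likelihood | MSE | Likelihood | MSE | Likelihood | MSE |
|---|---|---|---|---|---|---|
| vanilla | 0.28 ± 0.00 | 0.335 ± 0.006 | 0.35 ± 0.00 | 0.113 ± 0.003 | 0.47 ± 0.01 | 0.026 ± 0.001 |
| learned mean | 0.29 ± 0.00 | 0.196 ± 0.002 | 0.36 ± 0.00 | 0.095 ± 0.002 | 0.46 ± 0.01 | 0.027 ± 0.001 |
| learned kernel | 0.54 ± 0.00 | 0.323 ± 0.007 | 0.61 ± 0.00 | 0.093 ± 0.004 | 0.68 ± 0.01 | 0.024 ± 0.001 |
| learned both | 0.55 ± 0.00 | 0.186 ± 0.002 | 0.61 ± 0.00 | 0.078 ± 0.002 | 0.69 ± 0.01 | 0.022 ± 0.001 |

Table 2: Likelihood and MSE of the compared models on step function regression for different numbers of training points (one likelihood/MSE column pair per setting).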
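| Model | Likelihood | MSE | Likelihood | MSE | Likelihood | MSE |
|---|---|---|---|---|---|---|
| vanilla | -0.52 ± 0.00 | 0.112 ± 0.000 | -0.50 ± 0.00 | 0.082 ± 0.000 | -0.36 ± 0.00 | 0.022 ± 0.000 |
| learned mean | -0.52 ± 0.00 | 0.090 ± 0.000 | -0.49 ± 0.00 | 0.068 ± 0.000 | -0.36 ± 0.00 | 0.021 ± 0.000 |
| learned kernel | -0.09 ± 0.00 | 0.166 ± 0.001 | -0.08 ± 0.00 | 0.075 ± 0.000 | -0.07 ± 0.00 | 0.067 ± 0.000 |
| learned both | -0.01 ± 0.00 | 0.151 ± 0.001 | 0.00 ± 0.00 | 0.073 ± 0.000 | 0.02 ± 0.00 | 0.059 ± 0.000 |

Table 3: Likelihood and MSE of the compared models on MNIST image completion for different numbers of training points (one likelihood/MSE column pair per setting).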
4 Experiments
In order to test the performance of mean function learning in general and compare it to kernel learning, we performed experiments on synthetic data and on MNIST handwritten digits. We implemented the algorithms in Python, using the GPflow and GPyTorch packages.
4.1 Performance measures
In our experiments, we assess the performance of the models on the target task with two different measures: the test mean squared error (MSE) and the test data log likelihood (often just denoted as likelihood). Note that the likelihood depends on the whole predictive posterior, while the MSE only depends on its mean. Since the mean function of the GP only affects the posterior by shifting its mean, it can be hypothesized that learning a good mean function for the GP will affect the MSE more strongly than the likelihood. Similarly, since the kernel function parameterizes the covariance of the process, it could be expected to affect the likelihood more strongly than the MSE.
When interpreting the results of our experiments, one should therefore keep in mind that the MSE slightly favors GPs with a good mean function, while the likelihood slightly favors good kernel functions. However, the decision which one of the two measures is more important depends on the intended use of the GP in the target task. If the GP is used to predict values at test points from its predictive posterior mean, the MSE is the more relevant measure. If it is instead used to draw multiple samples from the whole posterior or to estimate the probability of different outcomes, the likelihood is more relevant. We do not make any limiting assumptions on the use cases in this work and the ultimate decision for a measure (and therefore potentially a preferred model) is hence left to the prospective user of our method.
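The difference between the two measures can be illustrated with a small sketch. The joint Gaussian log density is used here as the test log likelihood, which is an assumption of this example (a per-point variant would be equally plausible); the function names and the toy numbers are hypothetical.

```python
import numpy as np

def mse_score(y_test, post_mean):
    """Mean squared error; depends only on the posterior mean."""
    return np.mean((y_test - post_mean) ** 2)

def log_likelihood_score(y_test, post_mean, post_cov):
    """Joint Gaussian log density of the test values under the posterior."""
    n = len(y_test)
    resid = y_test - post_mean
    L = np.linalg.cholesky(post_cov)
    z = np.linalg.solve(L, resid)
    return -0.5 * z @ z - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2.0 * np.pi)

y_test = np.array([0.1, -0.2, 0.3])
post_mean = np.zeros(3)
narrow_cov = 0.1 * np.eye(3)
wide_cov = 4.0 * np.eye(3)

mse_narrow = mse_score(y_test, post_mean)
mse_wide = mse_score(y_test, post_mean)
ll_narrow = log_likelihood_score(y_test, post_mean, narrow_cov)
ll_wide = log_likelihood_score(y_test, post_mean, wide_cov)
```

Both posteriors share the same mean and hence the same MSE, but their log likelihoods differ because only the likelihood takes the predictive covariance into account.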
4.2 Sinusoid function regression
In a first experiment, we aim to assess the general performance of mean function meta-learning. To this end, we simulated functions from a known generating Gaussian process. The process had a sinusoid mean and a radial basis function (RBF) covariance. For each function, we sampled the value at 50 equally spaced points in the interval. Samples from the process are depicted in Figure 2.
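Sampling functions from such a generating process can be sketched as follows. The interval $[-5, 5]$, the unit lengthscale, the plain `sin` mean, and the jitter value are assumptions of this example, since the exact settings are not stated here.

```python
import numpy as np

def sample_sinusoid_gp(n_functions, x, lengthscale=1.0, jitter=1e-6, seed=0):
    """Draw functions from a GP with sinusoid mean and RBF covariance on the grid x."""
    rng = np.random.default_rng(seed)
    diff = x[:, None] - x[None, :]
    K = np.exp(-0.5 * (diff / lengthscale) ** 2) + jitter * np.eye(len(x))
    L = np.linalg.cholesky(K)                 # K = L L^T
    z = rng.standard_normal((n_functions, len(x)))
    return np.sin(x) + z @ L.T                # each row is mean + L z

x_grid = np.linspace(-5.0, 5.0, 50)           # 50 equally spaced points (assumed interval)
functions = sample_sinusoid_gp(4, x_grid)
```

Each row of `functions` is one sampled function evaluated on the grid; a subset of its points can serve as training data and the remainder as test data.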
We trained a deep feedforward neural network on 1000 sampled functions and used it as a mean function for GP regression on 200 unseen functions that were sampled from the same process. We used a neural network with two hidden layers of size 64 each with sigmoid activation functions. It was trained for 100 epochs using stochastic gradient descent (SGD). We compared it against a GP with zero mean function and one with the true sinusoid mean function for different numbers of training points (Tab. 1).
It can be seen that the learned mean function performs comparably with the true mean function and significantly better than the zero mean function in terms of likelihood and mean squared error (MSE). It also becomes evident that this effect diminishes when the number of training points increases, following the intuition that the prior becomes less important once there are enough observations (see Sec. 3.1). These findings support the efficacy of the mean function learning approach in the low-data limit.
4.3 Step function regression
To assess the performance of the mean function learning approach on a traditionally more challenging problem for GPs and compare it to kernel learning, we chose the task of step function regression. Step functions are hard to fit for GPs due to their discontinuity. We hypothesize that this discontinuity can be modeled more easily by a mean function than by a kernel function, since the kernel function always interpolates between neighboring points to some extent and therefore implicitly assumes continuity.
A step function is simply defined as
with the step location and the two function values before and after the step, respectively. For our set of tasks, we choose a dataset of different step functions, namely the Heaviside step function  (i.e., ) and its mirrored version along the -axis (i.e., ). We sample the step location uniformly at random from the interval. The whole function consists of 50 evenly spaced points in the domain.
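A generator for such step function tasks could look like the sketch below. The input domain, the range of the step location, and the concrete value pairs are hypothetical placeholders: in particular, the pair `(1.0, 0.0)` is only one possible reading of the "mirrored" Heaviside function.

```python
import numpy as np

def sample_step_task(rng, x, values=((0.0, 1.0), (1.0, 0.0)), loc_range=(-1.0, 1.0)):
    """Sample one step function on the grid x.

    values holds (before, after) pairs; one pair is picked at random
    (assumed to correspond to the two step function variants).
    The step location is drawn uniformly from loc_range (assumed range).
    """
    before, after = values[rng.integers(len(values))]
    step_location = rng.uniform(*loc_range)
    return np.where(x < step_location, before, after)

rng = np.random.default_rng(0)
x_grid = np.linspace(-2.0, 2.0, 50)   # 50 evenly spaced points (assumed domain)
step_tasks = [sample_step_task(rng, x_grid) for _ in range(10)]
```

Since the step location lies strictly inside the assumed domain, every sampled function contains exactly one discontinuity.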
We compare a vanilla GP (zero mean and RBF kernel) with a learned deep mean function, a learned deep kernel function (following Wilson et al.) and a GP with deep mean and deep kernel function learned concurrently. The learned mean and kernel functions are parameterized by the same deep feed-forward neural network architecture (except for the dimension of the last layer). We use neural networks with two hidden layers of size 128 and 64 with sigmoid activation functions. We train each model on 10,000 randomly sampled step functions from our function space using stochastic gradient descent.
The respective performances of the methods in terms of likelihood and MSE are reported in Table 2 and some example regression outputs on the standard Heaviside step function for the different models and numbers of training points are depicted in Figure 3.
It can be seen that in the low-data regime, mean function learning outperforms kernel learning in terms of MSE, while kernel learning yields a better likelihood. However, the gap between mean function learning and kernel learning in terms of MSE narrows with an increasing number of training points, following our intuition from the theoretical analysis (Sec. 3.1).
It can also be seen that learning both the mean and the kernel function consistently performs best in all data regimes. Especially in the low-data regime, we can see that this approach combines the high likelihood of the kernel learning with the low MSE of the mean function learning. It stands to question, however, whether this effect is due to the relative simplicity of the task or whether it also holds for more complex data.
4.4 MNIST image completion
To compare mean function learning to kernel learning in a more realistic meta-learning setting, we performed an experiment with an image completion task on MNIST handwritten digits. Some example inputs and reconstructions are depicted in Figure 2. We view the task of image completion, that is, inferring the pixel values in test locations given their values in some random context locations, as a regression task (similar to Garnelo et al.). We try to learn a function that maps from pixel coordinates to the pixel value, that is, $f : \mathbb{R}^2 \to \mathbb{R}$.
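Casting an image as a regression task can be sketched as below. A random 28×28 array stands in for an MNIST digit, and the coordinate normalization to $[0, 1]$ and the number of context pixels are assumptions of this example.

```python
import numpy as np

def image_to_regression_task(image, n_context, rng):
    """Turn a 2-D image into (coordinates, value) pairs split into context and test sets."""
    h, w = image.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Inputs are (row, column) coordinates scaled to [0, 1]; outputs are pixel values.
    coords = np.stack([rows.ravel() / (h - 1), cols.ravel() / (w - 1)], axis=1)
    values = image.ravel().astype(float)
    perm = rng.permutation(h * w)             # random context locations
    ctx, tst = perm[:n_context], perm[n_context:]
    return coords[ctx], values[ctx], coords[tst], values[tst]

rng = np.random.default_rng(0)
image = rng.random((28, 28))                  # stand-in for an MNIST digit
x_ctx, y_ctx, x_tst, y_tst = image_to_regression_task(image, 50, rng)
```

The context pairs play the role of the target training data, and the remaining pixels are the test points whose values the GP has to infer.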
The learned mean and kernel functions are parameterized by the same deep feed-forward neural network architecture (except for the dimension of the last layer). We use neural networks with two hidden layers of size 128 and 64 with sigmoid activation functions. We train each model on the MNIST training set of images using stochastic gradient descent. We again compare the two approaches against a “vanilla” GP (zero mean and RBF kernel) and a GP where both functions are learned concurrently. We report performances on the MNIST test set with different numbers of training points (Tab. 3).
It can be seen that the deep kernel learning performs better than the vanilla GP and the mean function learning in terms of likelihood, but the mean function learning significantly outperforms the other methods in terms of MSE. Moreover, learning both mean function and kernel yields the best performance in terms of likelihood.
Since many pixels in MNIST images are zero-valued, the performance results fit the intuition of Proposition 1. We can therefore confirm empirically that mean function learning is superior to kernel learning in terms of MSE under certain conditions and that it can also improve the performance in terms of likelihood when combined with kernel learning.
5 Related work
Kernel learning for Gaussian processes has a long history. It can be seen as an extension to Maximum-Likelihood-II (ML-II) parameter optimization, where the kernel parameters are optimized with respect to the log marginal likelihood of the GP. The flexibility of these learned kernels grows with the number of parameters, culminating in deep kernel learning, where the kernel is parameterized by a deep neural network [32, 33]. While kernels have thus enjoyed a lot of attention, the research into learning mean functions is still in its infancy. The only documented deep mean function learning for GPs deals with standard optimization of the mean function parameters on the training set and not with a meta-learning setting.
Meta-learning has recently been explored in great detail, especially in the case of deep neural networks [2, 8]. It has been shown that these approaches can be seen as inference in a hierarchical Bayesian model. These explorations extend to GP-like neural network models [11, 12], but the setting has been underexplored when it comes to classical GPs. While there has been some work on meta-learning kernel functions [3, 31, 28], parameters for Bayesian linear regression, and parameters in hierarchical Bayesian models, the aforementioned GP mean function learning approach has not been studied in this setting yet.
It should be noted here again that using the training data itself to optimize the kernel function or mean function can lead to overfitting, since it violates the Bayesian assumption. In contrast, this assumption is not violated when using meta-tasks to learn the kernel or mean function (as considered in this work), thus minimizing the risk for overfitting (see Sec. 3.2).
6 Discussion
We have shown that meta-learning in Gaussian processes is an area that can benefit from learning the mean function of the process instead of (or combined with) the kernel function. We have provided an analytical argument for why mean function learning can be superior under certain conditions and have validated it empirically on two synthetic regression tasks and an MNIST image completion task. This extends previous work in which mean function learning has not been considered for meta-learning purposes.
It has been shown previously that some kernels are asymptotically universal approximators. In contrast, we show that mean functions seem to excel especially in the low-data limit. Our experiments have shown that this benefit seems to decay with increasing amounts of data. It would be an interesting direction for future work to explore more thoroughly under which conditions mean function learning is beneficial. Moreover, combining mean function and kernel learning in even smarter ways could be a promising avenue of research.
It will also be interesting to see in future work whether our approach can be successfully applied to challenging real world data, such as medical time series, where one often has access to a lot of prior knowledge from previous patients, but only very few measurements from any patient who is currently being treated. With some modification, the approach could also potentially be extended to other popular meta-learning scenarios, such as few-shot learning or optimal control in adaptive environments.
Given the results of this work, we would advise practitioners to consider mean function learning for meta-learning in Gaussian processes if the transfer data set is reasonably large and the target training data set is very small. Depending on the complexity of the data and the intended use case of the GP’s predictive posterior, it can be advisable to combine mean function and kernel learning for maximum performance.
References
- [1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
- [2] Yoshua Bengio. Deep learning of representations for unsupervised and transfer learning. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning, pages 17–36. jmlr.org, 2012.
- [3] Edwin V Bonilla, Kian M Chai, and Christopher Williams. Multi-task gaussian process prediction. In J C Platt, D Koller, Y Singer, and S T Roweis, editors, Advances in Neural Information Processing Systems 20, pages 153–160. Curran Associates, Inc., 2008.
- [4] Ronald Newbold Bracewell. The Fourier transform and its applications, volume 31999. McGraw-Hill New York, 1986.
- [5] Bradley P Carlin and Thomas A Louis. Bayesian methods for data analysis. CRC Press, 2008.
- [6] Gavin C Cawley and Nicola LC Talbot. Preventing over-fitting during model selection via bayesian regularisation of the hyper-parameters. Journal of Machine Learning Research, 8(Apr):841–861, 2007.
- [7] Alexander G de G. Matthews, Mark van der Wilk, Tom Nickson, Keisuke Fujii, Alexis Boukouvalas, Pablo León-Villagrá, Zoubin Ghahramani, and James Hensman. GPflow: A gaussian process library using TensorFlow. J. Mach. Learn. Res., 18(40):1–6, 2017. ISSN 1532-4435.
- [8] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-Agnostic Meta-Learning for fast adaptation of deep networks. March 2017.
- [9] Yarin Gal. Uncertainty in Deep Learning. PhD thesis, 2017.
- [10] Jacob R Gardner, Geoff Pleiss, David Bindel, Kilian Q Weinberger, and Andrew Gordon Wilson. GPyTorch: Blackbox Matrix-Matrix gaussian process inference with GPU acceleration. September 2018.
- [11] Marta Garnelo, Dan Rosenbaum, Chris J Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo J Rezende, and S M Ali Eslami. Conditional neural processes. July 2018a.
- [12] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, S M Ali Eslami, and Yee Whye Teh. Neural processes. July 2018b.
- [13] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. 2016. ISBN 9780521835688. doi: 10.1038/nmeth.3707.
- [14] Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting Gradient-Based Meta-Learning as hierarchical bayes. January 2018.
- [15] James Harrison, Apoorva Sharma, and Marco Pavone. Meta-Learning priors for efficient online bayesian regression. July 2018.
- [16] Thomas Hofmann, Bernhard Schölkopf, and Alexander J Smola. Kernel methods in machine learning. January 2007.
- [17] Tomoharu Iwata and Zoubin Ghahramani. Improving output uncertainty estimation and generalization in deep learning via neural network gaussian processes. July 2017.
- [18] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-Based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998. ISSN 0018-9219. doi: 10.1109/5.726791.
- [19] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015. ISSN 0028-0836. doi: 10.1038/nature14539.
- [20] Daniel McNeish. On using bayesian methods to address small sample problems. Structural Equation Modeling: A Multidisciplinary Journal, 23(5):750–773, 2016.
- [21] James Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philos. Trans. R. Soc. Lond. A, 209(441-458):415–446, January 1909. ISSN 0080-4614. doi: 10.1098/rsta.1909.0016.
- [22] Charles A Micchelli and Massimiliano Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6(Jul):1099–1125, 2005.
- [23] Charles A Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. J. Mach. Learn. Res., 7(Dec):2651–2667, 2006. ISSN 1532-4435, 1533-7928.
- [24] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
- [25] John C Platt, Christopher J C Burges, Steven Swenson, Christopher Weare, and Alice Zheng. Learning a gaussian process prior for automatically generating music playlists. In T G Dietterich, S Becker, and Z Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 1425–1432. MIT Press, 2002.
- [26] Carl Edward Rasmussen and Christopher K I Williams. Gaussian Processes for Machine Learning. MIT Press, new edition, January 2006. ISBN 9780262182539.
- [27] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Netw., 61:85–117, 2015. ISSN 0893-6080, 1879-2782. doi: 10.1016/j.neunet.2014.09.003.
- [28] Grigorios Skolidis. Transfer learning with Gaussian processes. PhD thesis, June 2012.
- [29] Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media, 2012.
- [30] Ricardo Vilalta and Youssef Drissi. A perspective view and survey of Meta-Learning. Artificial Intelligence Review, 18(2):77–95, June 2002. ISSN 1573-7462. doi: 10.1023/A:1019956318069.
- [31] Christian Widmer, Nora C Toussaint, Yasemin Altun, and Gunnar Rätsch. Inferring latent task structure for multitask learning by multiple kernel learning. BMC Bioinformatics, 11(8):S5, 2010.
- [32] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. November 2015.
- [33] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Stochastic variational deep kernel learning. November 2016.
- [34] K Yu, V Tresp, and A Schwaighofer. Learning gaussian processes from multiple tasks. International Conference on Machine Learning, 2005.