1 Introduction
Recent studies exploring few-shot learning have led to the creation of new algorithms that learn efficiently with very small samples and generalize beyond the training data [Wang and Yao, 2019, Chen et al., 2019]. Most of these algorithms have adopted the meta-learning paradigm [Thrun and Pratt, 1998, Vilalta and Drissi, 2002], where some prior knowledge is learned across a large collection of diverse tasks and then transferred to new tasks for efficient learning with a limited amount of data. Despite many few-shot learning algorithms reporting improved performance over the state of the art, considerable progress must still be made before such approaches can be adopted on a larger scale and in more practical settings. Notably, much of the work in this field has focused on classification and reinforcement learning [Wang and Yao, 2019], leaving the problem of few-shot regression largely unaddressed [Li et al., 2017, Kim et al., 2018, Yi Loo, 2019]. To the best of our knowledge, no real-world few-shot regression (FSR) benchmarks have been established in the literature. FSR methods, however, show great promise in addressing important problems encountered in fields plagued by small sample sizes such as drug discovery, where data acquisition is expensive and time consuming, limiting the number of examples available to each study.
Meta-learning-based few-shot algorithms differ on two crucial aspects: the nature of the meta-knowledge captured and the amount of adaptation performed at test-time for new tasks/datasets. First, metric learning methods [Koch et al., 2015, Vinyals et al., 2016, Snell et al., 2017, Garcia and Bruna, 2017, Bertinetto et al., 2018] accumulate meta-knowledge in high-capacity covariance/distance functions and then combine them with simple base learners to produce the outputs. However, they do not adapt these covariance functions at test-time. Hence, few of the currently used base learners have enough capacity to truly adapt [Bertinetto et al., 2018, Triantafillou et al., 2019]. Second, initialization- and optimization-based methods [Finn et al., 2017, Kim et al., 2018, Ravi and Larochelle, 2016], which learn the initialization point for gradient descent algorithms, allow for more adaptation on new tasks but remain time consuming and memory inefficient. To ensure optimal performance on FSR problems, it is crucial to combine the strengths of both types of methods.
In this study, we frame FSR as a deep kernel learning (DKL) problem, as opposed to a metric learning problem, which allows us to derive new algorithms. DKL methods combine the non-parametric flexibility of kernel methods with the structural properties of deep neural networks, which yields a more powerful way of learning input covariance functions and more adaptation capacity at test-time. We further improve over the general DKL algorithm for FSR by learning a family of covariance functions instead of a single one, resulting in greater adaptability. Our method selects an appropriate covariance function from this family at test-time, which makes it as adaptive as optimization- and initialization-based methods.
Our Contributions: We first frame few-shot regression as a deep kernel learning problem and show why and how it is more expressive than classical metric learning methods. Next, we derive two DKL algorithms by combining set embedding techniques [Zaheer et al., 2017] and kernel methods [Scholkopf and Smola, 2001, Williams and Rasmussen, 1996] to learn a family of kernels, allowing more adaptation at test-time while being sample efficient. We then propose two new real-world datasets for FSR, drawn from the drug discovery domain. Performance on these datasets, as well as on synthetic data, shows that our model allows greater test-time adaptation than classical methods.
2 Preliminaries
In this section, we describe the DKL framework (introduced by Wilson et al. [2016]) in more depth and show that it can be adapted to learn a covariance/kernel function for few-shot learning tasks.
DKL Framework:
Let , a training dataset available for learning the regression task ( is the input space and is the output space). A DKL algorithm aims to obtain a nonlinear embedding of inputs in the embedding space , using a deep neural network of parameters . Then, it finds the minimal norm regressor in the reproducing kernel Hilbert space (RKHS) on , that fits the training data, i.e.
(1) 
where is a nonnegative loss function that measures the loss of a regressor ; and weighs the importance of the norm minimization against the training loss. Following the representer theorem [Scholkopf and Smola, 2001, Steinwart and Christmann, 2008], can be written as a finite linear combination of kernel evaluations on training inputs, i.e.
(2) 
where are the combination weights; and is a reproducing kernel of with hyperparameters . Candidates include the radial basis, polynomial, and linear kernels. Depending on the loss function , the weights can be obtained with a differentiable kernel method, which makes it possible to compute the gradients of the loss w.r.t. the parameters. Such methods include Gaussian Process (GP), Kernel Ridge Regression (KRR), and Logistic Regression (LR).
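For reference, the regularized problem and its representer-theorem solution are commonly written as below. This is a sketch in assumed notation: the symbols $\phi_\theta$ (embedding network), $\mathcal{R}_\rho$ (RKHS with reproducing kernel $k_\rho$), $\ell$ (loss), $\lambda$ (trade-off weight), $\alpha_i$ (combination weights), and $D_{trn}$ (training set) are chosen here for illustration and may differ from those of Equations 1 and 2, as may the choice of a squared norm.

```latex
% Assumed notation, chosen for illustration only.
\begin{align}
  h^{*} &= \operatorname*{arg\,min}_{h \in \mathcal{R}_\rho}
           \; \lambda \,\lVert h \rVert_{\mathcal{R}_\rho}^{2}
           + \ell\left(h, D_{trn}\right), \\
  h^{*}(x) &= \sum_{(x_i, y_i) \in D_{trn}}
           \alpha_i \, k_\rho\left(\phi_\theta(x), \phi_\theta(x_i)\right).
\end{align}
```

The second line makes explicit why only kernel evaluations against the training inputs are needed at prediction time, which is what keeps within-task adaptation cheap when the training set is small.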
DKL inherits benefits and drawbacks from both deep learning and kernel methods. Gradient descent algorithms are required to optimize the network parameters, which can be high dimensional, so seeing a significant number of training samples is essential to avoid overfitting. However, once that condition is met, the scalability of the kernel method becomes limiting, as the running time of exact kernel methods scales approximately cubically with the number of training samples. Some approximations of the kernel [Williams and Seeger, 2001, Wilson and Nickisch, 2015] are thus needed for the DKL method to scale.
Few-Shot DKL:
In the setup of episodic meta-learning, also known as few-shot learning, one has access to a meta-training collection , of tasks to learn how to learn from few datapoints. Each task has its own training (or support) set and validation (or query) set . A meta-testing collection is also available to assess the generalization performance of the few-shot algorithm across unseen tasks. To learn a few-shot DKL method for FSR in such settings, one can share the parameters of across all tasks, similar to metric learning algorithms. Hence, for a given task , the inputs are first transformed by the function and then a kernel method is used to obtain the regressor , which will be evaluated on .
KRR:
Using the squared loss and the L2-norm to compute , KRR gives the optimal regressor for a task and its validation loss as follows:
(3)  
(4) 
where is the matrix of kernel evaluations and each entry for . An equivalent definition applies to .
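As a concrete reading of the KRR step, the following minimal NumPy sketch computes the standard closed-form ridge solution on a support set and the mean squared error on a query set. It assumes the inputs have already been embedded by the network; the function names, the `lam` values, and the random stand-in embeddings are hypothetical and serve only to illustrate the computation.

```python
import numpy as np

def linear_kernel(a, b):
    """Gram matrix of inner products between rows of a and rows of b."""
    return a @ b.T

def krr_fit_predict(z_support, y_support, z_query, lam=0.1, kernel=linear_kernel):
    """Closed-form kernel ridge regression on pre-computed embeddings.

    z_support: (n, d) embedded support inputs, y_support: (n,) targets,
    z_query: (m, d) embedded query inputs, lam: ridge weight (assumed value).
    """
    k_ss = kernel(z_support, z_support)   # (n, n) support Gram matrix
    k_qs = kernel(z_query, z_support)     # (m, n) query/support kernel
    # Solve (K + lam * I) alpha = y rather than forming an explicit inverse.
    alpha = np.linalg.solve(k_ss + lam * np.eye(len(y_support)), y_support)
    return k_qs @ alpha

def query_mse(y_pred, y_true):
    """Mean squared error on the query (validation) set."""
    return float(np.mean((y_pred - y_true) ** 2))

# Toy check on noiseless linear targets (stand-ins for network embeddings).
rng = np.random.default_rng(0)
w_true = np.array([1.5, -2.0, 0.5])
z_s = rng.normal(size=(10, 3))
z_q = rng.normal(size=(20, 3))
y_s, y_q = z_s @ w_true, z_q @ w_true
loss = query_mse(krr_fit_predict(z_s, y_s, z_q, lam=1e-6), y_q)
```

Because every operation is a differentiable linear-algebra primitive, the same computation can be written in an automatic-differentiation framework so that the query loss back-propagates into the embedding network.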
GP:
Using the negative log likelihood loss function instead, the GP algorithm gives a probabilistic regressor whose predictive mean is given by Equation 3 and whose loss for a task is:
(5)  
(6)  
(7) 
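The GP variant can be sketched in the same spirit. The code below assumes that the task loss is the negative log-density of the query targets under the posterior predictive distribution of a zero-mean GP, with the observation-noise variance playing the role of the ridge weight; this reading, the function names, and the `noise` value are assumptions made for illustration.

```python
import numpy as np

def gp_posterior(k_ss, k_qs, k_qq, y_support, noise=0.1):
    """Posterior predictive mean and covariance of noisy query targets.

    k_ss: (n, n) support Gram matrix, k_qs: (m, n) query/support kernel,
    k_qq: (m, m) query Gram matrix, noise: observation-noise variance
    (assumed value).
    """
    a = k_ss + noise * np.eye(k_ss.shape[0])
    mean = k_qs @ np.linalg.solve(a, y_support)
    # Latent posterior covariance plus observation noise on the diagonal.
    cov = k_qq - k_qs @ np.linalg.solve(a, k_qs.T) + noise * np.eye(k_qq.shape[0])
    return mean, cov

def gaussian_nll(y, mean, cov):
    """Negative log-density of y under N(mean, cov), via a Cholesky factor."""
    m = y.shape[0]
    chol = np.linalg.cholesky(cov)
    resid = np.linalg.solve(chol, y - mean)        # whitened residual
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return float(0.5 * (resid @ resid + log_det + m * np.log(2.0 * np.pi)))

# Toy usage with a linear kernel on stand-in embeddings.
rng = np.random.default_rng(0)
z_s, z_q = rng.normal(size=(8, 4)), rng.normal(size=(5, 4))
y_s, y_q = z_s.sum(axis=1), z_q.sum(axis=1)
mean, cov = gp_posterior(z_s @ z_s.T, z_q @ z_s.T, z_q @ z_q.T, y_s)
nll = gaussian_nll(y_q, mean, cov)
```

Note that the predictive mean coincides with the KRR prediction when `noise` equals the ridge weight; the two methods differ only in the loss used to train the shared parameters and in the availability of a predictive covariance.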
Finally, the parameters of the neural network, along with and the hyperparameters of the kernel, are optimized using the expected loss on all tasks:
(8) 
For tractability of this expectation, we use a Monte Carlo approximation with tasks. Unless otherwise specified, we use and , yielding a minibatch of 320 samples. In our experiments, we have fixed to .
To summarize, this algorithm finds a representation common to all tasks such that the kernel method (in our case, GP and KRR) will generalize well from a small number of samples. Interestingly, this alleviates two of the main limitations of single-task DKL: i) the scalability of the kernel method is no longer an issue since we are in the few-shot learning regime (even with several hundred samples, the computational cost of embedding each example is usually higher than that of inverting the Gram matrix), and ii) the parameters (and ) are learned across a potentially large number of tasks and samples, providing the opportunity to learn a complex representation without overfitting.
It is worth mentioning that using the linear kernel and the KRR algorithm, we recover the few-shot classification algorithm R2D2 proposed by Bertinetto et al. [2018]. Their intent was to show that KRR can be used for fast adaptation at test-time in classification settings as it is differentiable. In contrast, our intent is to formalize and adapt the DKL framework to FSR and justify how this powerful combination of kernel methods and deep networks can learn covariance functions.
3 Proposed Method
3.1 Adaptive Deep Kernel Method
As described earlier, a DKL algorithm for FSR learns a fixed kernel function shared across all tasks of interest. While a regressor on a given task can obtain an arbitrarily small loss as the training set size increases, a fixed kernel might not learn well in the few-shot regime. To verify this hypothesis, we propose an adaptive deep kernel learning (ADKL) algorithm, illustrated in Figure 1. It learns an adaptive, task-dependent kernel for few-shot learning instead of a single fixed kernel. We define the adaptive kernel as follows:
(9) 
where represents a task embedding obtained by transforming the training set with the task encoding network . We now describe in more detail the task encoding network and the architecture of the adapted kernel for a given task .
Task Encoding
The challenge of the network is to capture complex dependencies in the training set to provide a useful task encoding . Furthermore, the task encoder should be invariant to permutations of the training set and be able to encode a variable number of samples. After exploring a variety of architectures, we found that more complex ones such as Transformers [Vaswani et al., 2017] tend to underperform. This is possibly due to overfitting or to the sensitivity of training such architectures.
Consequently, we introduce slight modifications to DeepSets, an order-invariant network proposed by Zaheer et al. [2017]. It begins with the computation of the representation of each input-target pair for all , using neural networks . The captures nonlinear interactions between the inputs and the targets if is a nonlinear transformation. Then, by computing and , the empirical mean and standard deviation of the set , respectively, we obtain the task representation as follows:
(10) 
where is also a neural network. As and are invariant to permutations in , it follows that is also permutation invariant. Overall, is just a nonlinear mapping of the first and second moments of the sample representations, which were themselves nonlinear transformations of the original inputs and targets. The learnable parameters of the task encoder include all the parameters of the networks , and are shared across all tasks.
Adapted Kernel Computation
Once the task representation is obtained, we compute the conditional input embedding using the function . Let be the nonconditional embedding of the input using a neural network , whose parameters are shared with the network within the task encoder. We simply compute the conditional embedding of inputs as:
(11) 
where is a nonlinear neural network that captures complex interactions between the task and the input representations. The adapted kernel for a given task is then obtained by combining Equations 9, 10, and 11. The learnable parameters of and together constitute and are shared across all tasks. Alternatively, different architectures such as Feature-wise Linear Modulation (FiLM) [Perez et al., 2018], hypernetworks [Ha et al., 2016], or bilinear transformations [Tenenbaum and Freeman, 2000] could be used to compute , though we found that a simple concatenation was sufficient for our applications.
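The task encoding and adapted kernel described above can be sketched as follows. This is a minimal NumPy illustration with randomly initialized networks standing in for trained ones; the layer sizes, the way input and target embeddings are merged before pooling, and all function names are assumptions. Only the mean/standard-deviation pooling and the concatenation of the task code with the input embedding are taken from the description.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """Random-weight tanh MLP used as a stand-in for a trained network."""
    weights = [rng.normal(scale=0.5, size=(i, o))
               for i, o in zip(sizes[:-1], sizes[1:])]
    def forward(x):
        for w in weights[:-1]:
            x = np.tanh(x @ w)
        return x @ weights[-1]
    return forward

d_in, d_emb, d_task = 3, 8, 4                # hypothetical dimensions
input_net = mlp([d_in, 16, d_emb])           # input embedding (shared with the encoder)
target_net = mlp([1, 16, d_emb])             # embedding of scalar targets
pair_net = mlp([2 * d_emb, 16, d_emb])       # mixes input and target embeddings
pool_net = mlp([2 * d_emb, 16, d_task])      # maps pooled moments to the task code
cond_net = mlp([d_emb + d_task, 16, d_emb])  # conditional embedding network

def encode_task(x_support, y_support):
    """Permutation-invariant task code from mean/std pooling over pairs."""
    joined = np.concatenate(
        [input_net(x_support), target_net(y_support[:, None])], axis=1)
    pairs = pair_net(joined)
    moments = np.concatenate([pairs.mean(axis=0), pairs.std(axis=0)])
    return pool_net(moments)

def conditional_embed(x, task_code):
    """Concatenate each input embedding with the task code, then transform."""
    tiled = np.tile(task_code, (x.shape[0], 1))
    return cond_net(np.concatenate([input_net(x), tiled], axis=1))

def adapted_linear_kernel(x_a, x_b, task_code):
    """Task-conditioned kernel: inner products of conditional embeddings."""
    return conditional_embed(x_a, task_code) @ conditional_embed(x_b, task_code).T

x = rng.normal(size=(6, d_in))
y = rng.normal(size=6)
z = encode_task(x, y)
gram = adapted_linear_kernel(x, x, z)
```

Since the task code depends on the support set only through pooled statistics, reordering the support samples leaves the code, and therefore the adapted kernel, unchanged.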
Kernel Methods
3.2 Meta-Regularization
To help the training of the ADKL algorithm, we maximize the mutual information between and . This serves as a regularizer that helps the encoder learn a useful representation for describing the task. This is done using the MINE algorithm [Belghazi et al., 2018], which optimizes a lower bound on the mutual information. For two random variables and a similarity measure between and , parameterized by , the following inequality holds:
(12) 
Using a minibatch approximation of the expectations (which yields a small bias on the gradient, since the right-hand side takes the log of the expectations; as we are not interested in the precise value of the mutual information, this does not constitute a problem) and the cosine distance as the similarity measure between the two sets, this yields , where
(13) 
When adding this to our training objective with as a tradeoff hyperparameter, we have:
(14) 
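A minibatch estimate of this kind of bound can be sketched as below, assuming the Donsker-Varadhan form used by MINE (expected score on matched pairs minus the log of the expected exponentiated score on mismatched pairs) and a cosine similarity score. The pairing convention, in which row i of each array holds two embeddings of the same task, and the function names are assumptions made for illustration.

```python
import numpy as np

def cosine_scores(a, b):
    """Pairwise cosine similarity between rows of a (n, d) and rows of b (n, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def mi_lower_bound(scores):
    """Minibatch Donsker-Varadhan estimate of a mutual-information lower bound.

    scores[i, j] compares item i of the first variable with item j of the
    second. Diagonal entries are matched (joint) pairs; off-diagonal entries
    stand in for samples from the product of the marginals.
    """
    n = scores.shape[0]
    joint_term = np.mean(np.diag(scores))
    off_diag = scores[~np.eye(n, dtype=bool)]
    # Taking the log of a minibatch mean is what introduces the small bias.
    marginal_term = np.log(np.mean(np.exp(off_diag)))
    return float(joint_term - marginal_term)

# Orthogonal embeddings: matched pairs score 1, mismatched pairs score 0.
emb = np.eye(4)
bound = mi_lower_bound(cosine_scores(emb, emb))
```

In training, the negative of this estimate would be added to the task loss with the trade-off weight, so that maximizing the bound and minimizing the regression loss happen jointly.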
4 Related Work
Our study spans the research areas of deep kernel learning and few-shot learning. For a comprehensive overview of few-shot learning methods, we refer the reader to Wang and Yao [2019] and Chen et al. [2019], as we focus on work related to DKL herein.
Across the spectrum of learning approaches, DKL methods lie between neural networks and kernel methods. While neural networks can learn from a very large amount of data without much prior knowledge, kernel methods learn from fewer data when given an appropriate covariance function that accounts for prior knowledge of the relevant task. In the first DKL attempt, Wilson et al. [2016] combined GPs with CNNs to learn a covariance function adapted to a task from large amounts of data, though the large time and space complexity of kernel methods forced the approximation of the exact kernel using KISS-GP [Wilson and Nickisch, 2015]. Dasgupta et al. have demonstrated that such an approximation is unnecessary when using finite-rank kernels. Here, we also show that learning from a collection of tasks (FSR mode) does not require any approximation when the covariance function is shared across tasks. This is an important distinction between our study and other existing studies in DKL, which learn their kernel from single tasks instead of task collections.
On the spectrum between NNs and kernel methods, metric learning also bears mention. Metric learning algorithms learn an input covariance function shared across tasks, but rely only on the expressive power of DNNs. First, stochastic kernels are built out of shared feature extractors, simple pairwise metrics (e.g. cosine similarity [Vinyals et al., 2016], Euclidean distance [Snell et al., 2017]), or parametric functions (e.g. relation modules [Sung et al., 2018], graph neural networks [Garcia and Bruna, 2017]). Then, within tasks, the predictions consist of a distance-weighted combination of the training sample labels with the stochastic kernel evaluations; no adaptation is done. The recently introduced ProtoMAML method [Triantafillou et al., 2019], which captures the best of Prototypical Networks [Snell et al., 2017] and MAML [Finn et al., 2017], allows within-task adaptation using MAML on a network built on top of the kernel function. Similarly, Kim et al. [2018] have proposed a Bayesian version of MAML where a feature extractor is shared across tasks, while multiple MAML particles are used for the task-level adaptation. Bertinetto et al. [2018] have also tackled this lack of adaptation for new tasks by using KRR and Logistic Regression to find the appropriate weighting of the training samples. This study can be considered the first application of DKL to few-shot learning. However, its contribution was limited to showing that simple differentiable learning algorithms can increase adaptation in the metric learning framework. Our work goes further by formalizing few-shot learning in the deep kernel learning framework, where test-time adaptation is achieved through kernel methods. We also create another layer of adaptation by allowing task-specific kernels that are created at test-time.
Since ADKL-GP uses GPs, it is related to neural processes [Garnelo et al., 2018a], which propose a scalable alternative to learning regression functions by performing inference on stochastic processes. Furthermore, in this family of methods, Conditional Neural Processes (CNP) [Garnelo et al., 2018b] and Attentive Neural Processes (ANP) [Kim et al., 2019] are even more relevant to our study, as both methods learn conditional stochastic processes parameterized by conditions derived from training data points. While ANP imposes consistency with respect to some prior process, CNP does not and thus does not have the mathematical guarantees associated with stochastic processes. By comparison, our proposed ADKL-GP algorithm also learns conditional stochastic processes, but within the GP framework, thus benefiting from the associated mathematical guarantees.
5 Experiments
5.1 Datasets
Previous work on few-shot regression has relied on toy datasets to evaluate performance. We instead introduce two real-world benchmarks drawn from the field of drug discovery. These benchmarks will allow us to measure the ability of few-shot learning algorithms to adapt in settings where tasks require a considerably different measure of similarity between the inputs. For instance, when predicting binding affinities between small molecules, the covariance function must learn characteristics of a binding site that changes from task to task. We describe each dataset used in our experiments below and, unless stated otherwise, their meta-training, meta-validation, and meta-testing splits contain, respectively, 56.25%, 18.75%, and 25% of all of their tasks. A pre-processed version of these datasets is available with this work (URL to appear in camera-ready).

Sinusoids: This meta-dataset was recently proposed by Kim et al. [2018] as a challenging few-shot synthetic regression benchmark. It consists of 5,000 tasks defined by sinusoidal functions of the form: . The parameters characterize each task and are drawn from the following intervals: , , . Samples for each task are generated by sampling inputs and observational noise from . Every model on this task collection uses a fully connected network of 2 layers of 120 hidden units as its input feature extractor. Visuals of ground truths and predictions for some functions are shown in Appendix C.
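A task generator for a collection of this kind can be sketched as follows. The amplitude/frequency/phase parameterization, the uniform sampling, and every numeric range in the usage lines are illustrative placeholders, not the intervals used for the benchmark.

```python
import numpy as np

def sample_sinusoid_task(rng, n_samples, amp_range, freq_range, phase_range,
                         x_range, noise_std):
    """Draw one task y = A * sin(w * x + b) + noise with task-specific (A, w, b)."""
    amp = rng.uniform(*amp_range)
    freq = rng.uniform(*freq_range)
    phase = rng.uniform(*phase_range)
    x = rng.uniform(*x_range, size=n_samples)
    noise = rng.normal(scale=noise_std, size=n_samples)
    y = amp * np.sin(freq * x + phase) + noise
    return x, y, (amp, freq, phase)

# Placeholder ranges, chosen only to make the example run.
rng = np.random.default_rng(0)
tasks = [
    sample_sinusoid_task(rng, n_samples=20, amp_range=(1.0, 2.0),
                         freq_range=(0.5, 1.5), phase_range=(0.0, np.pi),
                         x_range=(-5.0, 5.0), noise_std=0.05)
    for _ in range(5)
]
```

Each element of `tasks` can then be split into a support set and a query set for episodic training.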

Binding: The goal here is to predict the binding affinity of small, drug-like molecules to proteins. This task collection was extracted from the public database BindingDB (original data available at www.bindingdb.org) and encompasses 7,620 tasks, each containing between 7 and 9,000 samples. Each task corresponds to a protein sequence, which thus defines a separate distribution on the input space of molecules and the output space of (real-valued) binding readouts.

Antibacterial: The goal here is to predict the antimicrobial activity of small molecules for various bacteria. The task collection was extracted from the public database PubChem (available at https://pubchem.ncbi.nlm.nih.gov/) and contains 3,842 tasks, each consisting of 5 to 225 samples. A task corresponds to a combination of a bacterial strain and an experimental setting, which define different data distributions.
For both real-world datasets, the molecules are represented by their SMILES encodings (see https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system), which describe the molecular structure using short ASCII strings. All models evaluated on these collections share the same input feature extractor configuration: a 1D CNN of 2 layers of 128 hidden units each and a kernel size of 5. We use a CNN instead of an LSTM or advanced graph convolution methods for scalability reasons. Moreover, the targets were scaled linearly between 0 and 1.
5.2 Benchmarking analysis
We evaluate model performance against R2D2 [Bertinetto et al., 2018], CNP [Garnelo et al., 2018b], and MAML [Finn et al., 2017]. R2D2 is a natural comparison to ADKL-KRR (when the latter uses the linear kernel) to show whether the adapted deep kernel provides more test-time adaptation. CNP is also a natural comparison to ADKL-GP and will help measure performance differences between the task-level Bayesian models generated within the GP and CNP frameworks. MAML is considered herein for its fast adaptation at test-time and as the representative of initialization- and optimization-based models. In the following experiments, all DKL methods use the linear kernel.
Our first set of experiments evaluates performance on both the real-world and toy tasks. We train each method using support and query sets of size . During meta-testing, the support set size is also , but the query set consists of the remaining samples of each task. For the datasets lacking sufficient samples (the Binding and Antibacterial collections), we use half of the samples in the support set and the remaining samples in the query set. For each task, during meta-testing, we average the Mean Squared Error (MSE) over 20 random partitions of the query and support sets. We refer to this value as the task MSE. Figure 2 illustrates the task MSE distributions over tasks for each collection and algorithm (the best hyperparameters were chosen on the meta-validation set). In general, we observe that the real-world datasets are challenging for all methods, but ADKL methods consistently outperform R2D2 and CNP. The gap between ADKL-KRR and R2D2 shows the importance of adapting the kernel to each task rather than sharing a single kernel. Furthermore, the fact that ADKL-GP outperforms CNP shows the effectiveness of the ADKL approach in comparison with conditional neural processes. Finally, both adaptive deep kernel methods (ADKL-KRR and ADKL-GP) seem to perform comparably, despite different objective functions.
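The task MSE described above can be computed along the following lines. This is a sketch under stated assumptions: `fit_predict` is a hypothetical callable that adapts a model on the support set and returns predictions for the query inputs, and the rule used to decide when a task is too small (falling back to a half/half split) is an assumption.

```python
import numpy as np

def task_mse(x, y, fit_predict, support_size, n_partitions=20, seed=0):
    """Average query-set MSE over random support/query partitions of one task."""
    rng = np.random.default_rng(seed)
    n = len(y)
    # Assumed rule: use half of the samples when the task is too small.
    k = support_size if n > support_size else n // 2
    errors = []
    for _ in range(n_partitions):
        order = rng.permutation(n)
        s, q = order[:k], order[k:]
        pred = fit_predict(x[s], y[s], x[q])
        errors.append(np.mean((pred - y[q]) ** 2))
    return float(np.mean(errors))

def mean_predictor(x_support, y_support, x_query):
    """Trivial baseline: predict the support-set mean for every query input."""
    return np.full(len(x_query), y_support.mean())

x = np.linspace(-1.0, 1.0, 30)
y = 2.0 * x
baseline = task_mse(x, y, mean_predictor, support_size=10)
```

Collecting this value for every meta-test task gives the per-collection distributions that are compared across algorithms.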
A second experimental set is used to measure the across-task generalization and within-task adaptation capabilities of our methods relative to others. We do so by controlling the number of training tasks () and the size of the support sets during meta-training and meta-testing (). Only the Sinusoids collection was used, as experiments with the real-world collections were deemed too time-consuming for the scope of this study. One would expect the algorithms to generalize poorly to new tasks for lower values of , and their task-level models to adapt poorly to new samples for small values of . However, as illustrated in Figure 3, all DKL methods generalize better across tasks than others, as their overall performance is robust to the number of training tasks. They also demonstrate improved within-task generalization using as few as 15 samples, while other methods require more samples to achieve the same. Moreover, for small support sets, ADKL-KRR shows better within-task generalization than ADKL-GP and R2D2. Once again, the difference in performance between ADKL-KRR and R2D2 can be attributed to the kernel adaptation at test-time, as it is the only difference between the two methods. The difference for small between ADKL-GP and ADKL-KRR can be attributed to larger predictive uncertainty in GP as the number of samples gets smaller.
5.3 Active Learning
Here we report the results of active learning experiments. Our intent is to measure the effectiveness of the uncertainty captured by the predictive distribution of ADKL-GP for active learning. The comparison with CNP serves to measure which of CNP and GP better captures the data uncertainty for improving FSR under active sample selection. For this purpose, we meta-train both algorithms using support and query sets of size (and for the Sinusoids collection). During meta-test time, five samples are randomly selected to constitute the support set and build the initial hypothesis for each task. Then, from a pool of unlabeled data, we choose the input of maximum predictive entropy, i.e., . The latter is removed from and added to with its predicted label. The within-task adaptation is performed on the new support set to obtain a new hypothesis, which is evaluated on the query set of the task. This process is repeated until we reach the allowed budget of queries.
Figure 4 highlights that, in the active learning setting, ADKL-GP consistently outperforms CNP. Very few samples are queried by ADKL-GP to capture the data distribution, while CNP performance is far from optimal, even when allowed the maximum number of queries. Also, since using the maximum predictive entropy strategy is better than querying samples at random for ADKL-GP (solid vs. dashed line), these results suggest that the predictive uncertainty obtained with GP is informative and more accurate than that of CNP. Moreover, when the number of queries is greater than , we observe a performance degradation for CNP, while ADKL-GP remains consistent. This observation highlights the generalization capacity of DKL methods, even outside the few-shot regime where they have been trained; this same property does not hold true for CNP. We attribute this property of DKL methods to their use of kernel methods. In fact, their role in adaptation and generalization increases as we move away from the few-shot training regime.
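The selection step of this procedure can be sketched as follows for a GP predictive distribution. Since the entropy of a univariate Gaussian increases monotonically with its variance, picking the pool input of maximum predictive entropy amounts to picking the one of maximum predictive variance. The RBF kernel on raw one-dimensional inputs, the `noise` value, and the function names are assumptions made to keep the example self-contained; only the selection rule is illustrated, not the full query loop.

```python
import numpy as np

def rbf(a, b):
    """RBF kernel on 1-D inputs with unit length scale (assumed for the example)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

def predictive_variance(k_ss, k_ps, k_pp_diag, noise=0.1):
    """Per-point GP predictive variance for the pool inputs."""
    a = k_ss + noise * np.eye(k_ss.shape[0])
    # diag(K_ps A^{-1} K_sp), computed without forming the full matrix product.
    reduction = np.sum(k_ps * np.linalg.solve(a, k_ps.T).T, axis=1)
    return k_pp_diag - reduction + noise

def gaussian_entropy(variance):
    """Differential entropy of a univariate Gaussian with the given variance."""
    return 0.5 * np.log(2.0 * np.pi * np.e * variance)

def select_query(pool_variance):
    """Index of the pool input with maximum predictive entropy."""
    return int(np.argmax(gaussian_entropy(np.asarray(pool_variance))))

support_x = np.array([0.0, 0.1])
pool_x = np.array([0.05, 3.0])     # the second point is far from the support set
var = predictive_variance(rbf(support_x, support_x), rbf(pool_x, support_x),
                          np.ones(len(pool_x)))
chosen = select_query(var)
```

Here the point far from the support set is selected, because the posterior variance there has barely been reduced by the observed samples.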
5.4 Regularization and Kernel Impact
In our final set of experiments, we take a closer look at the impact of the base kernel and the meta-regularization factor on the generalization during meta-testing. We do so by evaluating ADKL-KRR on the Sinusoids collection with different hyperparameter combinations. Figure 5 (left) shows the performance obtained by varying the meta-regularization parameter over different hyperparameter configurations (listed in Appendix A). Looking at , one can observe that non-zero values help in most cases, meaning that our added regularizer slightly improves the task encoder learning as intended. We also measure how different base kernels impact the learning process, as choosing the appropriate kernel function is crucial for kernel methods. We test the linear and RBF kernels and their normalized versions, where the normalized version of a kernel is given by: . Over different hyperparameter combinations, one observes that the linear kernel yields better generalization performance than the RBF kernel (see Figure 5, right). Such a result is within expectation, as the scaling parameter of the RBF kernel is shared across tasks, making it more difficult to adapt the deep kernel. We could explore learning the base kernel hyperparameters using a network similar to in future work. It is also worth noting that although the kernel normalization impacts outcomes, there is no clear conclusion to be drawn. We therefore advise treating the kernel function and its normalization as hyperparameters of the DKL methods.
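The four base-kernel variants compared here can be sketched as below. The normalization shown, which divides each kernel value by the square root of the product of the two self-similarities, is the usual definition of a normalized kernel and is assumed here; the function names and the `length_scale` default are hypothetical.

```python
import numpy as np

def linear_kernel(a, b):
    """Inner products between rows of a and rows of b."""
    return a @ b.T

def rbf_kernel(a, b, length_scale=1.0):
    """RBF kernel with a single length scale shared by all tasks."""
    sq = np.sum(a ** 2, axis=1)[:, None] + np.sum(b ** 2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale ** 2)

def normalized(kernel):
    """Wrap a kernel k into k(x, x') / sqrt(k(x, x) * k(x', x'))."""
    def k_norm(a, b):
        diag_a = np.diag(kernel(a, a))
        diag_b = np.diag(kernel(b, b))
        return kernel(a, b) / np.sqrt(np.outer(diag_a, diag_b))
    return k_norm

rng = np.random.default_rng(0)
z_a, z_b = rng.normal(size=(4, 3)), rng.normal(size=(5, 3))
cosine_like = normalized(linear_kernel)(z_a, z_b)   # values in [-1, 1]
```

Normalizing the linear kernel turns it into a cosine similarity between embeddings, whereas the RBF kernel already has unit self-similarity and is therefore left unchanged by this particular normalization.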
6 Conclusion
In this paper, we investigate the benefits of DKL methods for FSR. By comparing methods on both real-world and toy task collections, we have demonstrated the effectiveness of the DKL framework in FSR. Both ADKL-GP and ADKL-KRR outperform the single-kernel DKL method, providing evidence that they add more adaptation capacity at test-time through adaptation of the kernel. Given its Bayesian nature, ADKL-GP also allows for improvement of the learned models at test-time, providing great value in settings such as drug discovery. By making our drug discovery task collections publicly available, we hope that the community will leverage these advances to propose FSR algorithms that are ready to be deployed in real-life settings, in turn having a positive impact on the drug discovery process.
References
 Wang and Yao [2019] Yaqing Wang and Quanming Yao. Few-shot learning: A survey. CoRR, abs/1904.05046, 2019. URL http://arxiv.org/abs/1904.05046.
 Chen et al. [2019] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. arXiv preprint arXiv:1904.04232, 2019.
 Thrun and Pratt [1998] Sebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. In Learning to learn, pages 3–17. Springer, 1998.
 Vilalta and Drissi [2002] Ricardo Vilalta and Youssef Drissi. A perspective view and survey of metalearning. Artificial intelligence review, 18(2):77–95, 2002.
 Li et al. [2017] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.
 Kim et al. [2018] Taesup Kim, Jaesik Yoon, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning. arXiv preprint arXiv:1806.03836, 2018.
 Yi Loo [2019] Yi Loo, Swee Kiat Lim, Gemma Roig, and Ngai-Man Cheung. Few-shot regression via learned basis functions. OpenReview preprint: r1ldYi9rOV, 2019.
 Koch et al. [2015] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, volume 2, 2015.
 Vinyals et al. [2016] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638, 2016.
 Snell et al. [2017] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
 Garcia and Bruna [2017] Victor Garcia and Joan Bruna. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043, 2017.
 Bertinetto et al. [2018] Luca Bertinetto, Joao F Henriques, Philip HS Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136, 2018.
 Triantafillou et al. [2019] Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, and Hugo Larochelle. Meta-dataset: A dataset of datasets for learning to learn from few examples. arXiv preprint arXiv:1903.03096, 2019.
 Finn et al. [2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1126–1135. JMLR.org, 2017.
 Ravi and Larochelle [2016] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. 2016.
 Zaheer et al. [2017] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov, and Alexander J Smola. Deep sets. In Advances in neural information processing systems, pages 3391–3401, 2017.

 Scholkopf and Smola [2001] Bernhard Scholkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2001.
 Williams and Rasmussen [1996] Christopher KI Williams and Carl Edward Rasmussen. Gaussian processes for regression. In Advances in neural information processing systems, pages 514–520, 1996.
 Wilson et al. [2016] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378, 2016.
 Steinwart and Christmann [2008] Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008.
 Williams and Seeger [2001] Christopher KI Williams and Matthias Seeger. Using the nyström method to speed up kernel machines. In Advances in neural information processing systems, pages 682–688, 2001.

 Wilson and Nickisch [2015] Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pages 1775–1784, 2015.
 Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
 Perez et al. [2018] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 Ha et al. [2016] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
 Tenenbaum and Freeman [2000] Joshua B Tenenbaum and William T Freeman. Separating style and content with bilinear models. Neural computation, 12(6):1247–1283, 2000.
 Belghazi et al. [2018] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. Mine: mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.
 Dasgupta et al. Sambarta Dasgupta, Kumar Sricharan, and Ashok Srivastava. Finite rank deep kernel learning.
 Sung et al. [2018] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208, 2018.
 Garnelo et al. [2018a] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, SM Eslami, and Yee Whye Teh. Neural processes. arXiv preprint arXiv:1807.01622, 2018a.
 Garnelo et al. [2018b] Marta Garnelo, Dan Rosenbaum, Chris J Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo J Rezende, and SM Eslami. Conditional neural processes. arXiv preprint arXiv:1807.01613, 2018b.
 Kim et al. [2019] Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum, Oriol Vinyals, and Yee Whye Teh. Attentive neural processes. arXiv preprint arXiv:1901.05761, 2019.
7 Appendix
Appendix A Regularization impact
Table 1 presents the hyperparameter combinations used in the experiments to assess the impact of the joint training parameter .
Note that the performance is significantly worse when using the RBF kernel.
Appendix B Kernel impact
Table 2 shows the hyperparameter combinations we used to assess the effect of using different kernels, as well as the impact of normalizing them.
| Run | encoder arch | target FE | input FE | Linear (normalized: False) | Linear (normalized: True) | RBF (normalized: False) | RBF (normalized: True) |
|---|---|---|---|---|---|---|---|
| 1 | CNP | [32, 32] | [128, 128, 128] | 0.279 | 0.274 | 0.786 | 0.795 |
| 2 | CNP | [32, 32] | [128, 128] | 0.299 | 0.300 | 0.789 | 0.799 |
| 3 | CNP | [] | [128, 128, 128] | 0.304 | 0.289 | 0.761 | 0.755 |
| 4 | CNP | [] | [128, 128] | 0.295 | 0.293 | 0.742 | 0.757 |
| 5 | DeepSet | [32, 32] | [128, 128, 128] | 0.313 | 0.269 | 0.804 | 0.877 |
| 6 | DeepSet | [32, 32] | [128, 128] | 0.301 | 0.277 | 0.849 | 0.856 |
| 7 | DeepSet | [] | [128, 128, 128] | 0.292 | 0.273 | 0.764 | 0.754 |
| 8 | DeepSet | [] | [128, 128] | 0.303 | 0.298 | 0.788 | 0.763 |
| 9 | KRR | [32, 32] | [128, 128, 128] | 0.296 | 0.246 | 0.815 | 0.815 |
| 10 | KRR | [32, 32] | [128, 128] | 0.289 | 0.290 | 0.824 | 0.824 |
| 11 | KRR | [] | [128, 128, 128] | 0.279 | 0.291 | 0.690 | 0.694 |
| 12 | KRR | [] | [128, 128] | 0.275 | 0.299 | 0.685 | 0.622 |
Appendix C Prediction curves on the Sinusoids collection
Figure C.6 presents a visualization of the results obtained by each model on three tasks taken from the meta-test set. We provide the model with ten examples from an unseen task consisting of a slightly noisy sine function (shown in blue), and present in orange the approximation made by the network based on these ten examples.
Note that, contrary to the others, CNP and ADKL-GP give us access to the uncertainty.