Recent studies exploring few-shot learning have led to the creation of new algorithms that learn efficiently with very small samples and generalize beyond the training data [Wang and Yao, 2019, Chen et al., 2019]. Most of these algorithms have adopted the meta-learning paradigm [Thrun and Pratt, 1998, Vilalta and Drissi, 2002], where some prior knowledge is learned across a large collection of diverse tasks and then transferred to new tasks for efficient learning with a limited amount of data. Despite many few-shot learning algorithms reporting improved performance over the state of the art, considerable progress must still be made before such approaches can be adopted on a larger scale and in more practical settings. Notably, much of the work in this field has focused on classification and reinforcement learning [Wang and Yao, 2019], leaving the problem of few-shot regression largely unaddressed [Li et al., 2017, Kim et al., 2018, Yi Loo, 2019]. To the best of our knowledge, no real-world few-shot regression (FSR) benchmarks have been established in the literature. FSR methods, however, show great promise in addressing important problems encountered in fields plagued by small sample sizes such as drug discovery, where data acquisition is expensive and time consuming, limiting the number of examples available to each study.
Meta-learning-based few-shot algorithms differ on two crucial aspects: the nature of the meta-knowledge captured and the amount of adaptation performed at test-time for new tasks/datasets. First, metric learning methods [Koch et al., 2015, Vinyals et al., 2016, Snell et al., 2017, Garcia and Bruna, 2017, Bertinetto et al., 2018] accumulate meta-knowledge in high-capacity covariance/distance functions and then combine them with simple base learners to produce the outputs. However, they do not adapt these covariance functions at test-time. Hence, few of the currently used base learners have enough capacity to truly adapt [Bertinetto et al., 2018, Triantafillou et al., 2019]. Second, initialization- and optimization-based methods [Finn et al., 2017, Kim et al., 2018, Ravi and Larochelle, 2016] that learn the initialization point for gradient descent algorithms allow for more adaptation on new tasks, but remain time consuming and memory inefficient. To ensure optimal performance on FSR problems, it is crucial to combine the strengths of both types of methods.
In this study, we frame FSR as a deep kernel learning (DKL) problem, as opposed to a metric learning problem, allowing us to derive new algorithms. DKL methods combine the non-parametric flexibility of kernel methods with the structural properties of deep neural networks, which yields a more powerful way of learning input covariance functions and more adaptation capacity at test-time. We further improve over the general DKL algorithm for FSR by learning a family of covariance functions instead of a single one, resulting in greater adaptability. Our method selects an appropriate covariance function from this family at test-time, which makes it as adaptive as optimization- and initialization-based methods.
Our Contributions: We first frame few-shot regression as a deep kernel learning problem and show why and how it is more expressive than classical metric learning methods. Next, we derive two DKL algorithms by combining set embedding techniques [Zaheer et al., 2017] and kernel methods [Scholkopf and Smola, 2001, Williams and Rasmussen, 1996] to learn a family of kernels, allowing more adaptation at test-time while being sample efficient. We then propose two new real-world datasets for FSR, drawn from the drug discovery domain. Performance on these datasets as well as synthetic data shows that our model allows greater test-time adaptation than classical methods.
In this section, we describe the DKL framework (introduced by Wilson et al. [2016]) in more depth and show that it can be adapted to learn a covariance/kernel function for few-shot learning tasks.
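Let $D_{trn} = \{(x_i, y_i)\}_{i=1}^{m} \subset \mathcal{X} \times \mathcal{Y}$ be a training dataset available for learning the regression task ($\mathcal{X}$ is the input space and $\mathcal{Y}$ is the output space). A DKL algorithm aims to obtain a non-linear embedding of inputs in the embedding space $\mathcal{H}$, using a deep neural network $\phi_\theta : \mathcal{X} \to \mathcal{H}$ of parameters $\theta$. Then, it finds the minimal norm regressor $h^*$ in the reproducing kernel Hilbert space (RKHS) $\mathcal{R}$ on $\mathcal{H}$ that fits the training data, i.e.
$$h^* = \arg\min_{h \in \mathcal{R}} \; \lambda \|h\|_{\mathcal{R}} + \ell(h, D_{trn}),$$
where $\ell$ is a non-negative loss function that measures the loss of a regressor $h$, and $\lambda$ weighs the importance of the norm minimization against the training loss. Following the representer theorem [Scholkopf and Smola, 2001, Steinwart and Christmann, 2008], $h^*$ can be written as a finite linear combination of kernel evaluations on training inputs, i.e.
$$h^*(x) = \sum_{i=1}^{m} \alpha_i \, k_\rho(\phi_\theta(x), \phi_\theta(x_i)),$$
where $\alpha = (\alpha_1, \dots, \alpha_m)$ are the combination weights and $k_\rho : \mathcal{H} \times \mathcal{H} \to \mathbb{R}$ is a reproducing kernel of $\mathcal{R}$ with hyperparameters $\rho$. Candidates include the radial basis, polynomial, and linear kernels. Depending on the loss function $\ell$, the weights $\alpha$ can be obtained by using a differentiable kernel method, which makes it possible to compute the gradients of the loss with respect to the parameters $\theta$.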
DKL inherits benefits and drawbacks from deep learning and kernel methods. Gradient descent algorithms are required to optimize the network parameters $\theta$, which can be high dimensional, so seeing a significant amount of training samples is essential to avoid overfitting. However, once the latter condition is met, scalability of the kernel method becomes limiting, as the running time of kernel methods scales approximately in $O(m^3)$ for a training set of $m$ samples. Some approximations of the kernel [Williams and Seeger, 2001, Wilson and Nickisch, 2015] are thus needed for the scalability of the DKL method.
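In the setup of episodic meta-learning, also known as few-shot learning, one has access to a meta-training collection $\mathcal{D}_{meta\text{-}trn} = \{(D^j_{trn}, D^j_{val})\}_{j=1}^{T}$ of $T$ tasks to learn how to learn from few datapoints. Each task $t_j$ has its own training (or support) set $D^j_{trn}$ and validation (or query) set $D^j_{val}$. A meta-testing collection $\mathcal{D}_{meta\text{-}tst}$ is also available to assess the generalization performance of the few-shot algorithm across unseen tasks. To learn a few-shot DKL method for FSR in such settings, one can share the parameters of the embedding network $\phi_\theta$ across all tasks, similar to metric learning algorithms. Hence, for a given task $t_j$, the inputs are first transformed by the function $\phi_\theta$ and then a kernel method is used to obtain the regressor $h^j$, which will be evaluated on $D^j_{val}$.
Using the squared loss and the L2-norm to compute $\|h\|$, KRR gives the optimal regressor for a task $t_j$ and its validation loss as follows:
$$h^j(x) = K_{x,trn} \, (K_{trn,trn} + \lambda I)^{-1} \, y_{trn} \tag{3}$$
$$\mathcal{L}^j = \frac{1}{|D^j_{val}|} \sum_{(x, y) \in D^j_{val}} \left(h^j(x) - y\right)^2$$
where $y_{trn}$ is the vector of training targets, $K_{trn,trn}$ is the matrix of kernel evaluations, and each entry $(K_{trn,trn})_{il} = k_\rho(\phi_\theta(x_i), \phi_\theta(x_l))$ for $x_i, x_l \in D^j_{trn}$. An equivalent definition applies to $K_{x,trn}$.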
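As a minimal sketch of the KRR step just described, the snippet below fits the closed-form regressor on a support set and scores it on a query set. The `embed` function is a hypothetical fixed feature map standing in for the learned embedding network, and the linear kernel, the regularization value, and the toy sine task are assumptions of this example rather than the settings used in the experiments.

```python
import numpy as np

def embed(x):
    # Hypothetical stand-in for the shared embedding network (not learned here).
    return np.stack([x, np.sin(x), np.cos(x)], axis=-1)

def linear_kernel(a, b):
    # Gram matrix of inner products between two sets of embeddings.
    return a @ b.T

def krr_fit_predict(x_trn, y_trn, x_val, lam=0.1):
    # Closed-form kernel ridge regression: alpha = (K + lam * I)^-1 y.
    phi_trn, phi_val = embed(x_trn), embed(x_val)
    k_tt = linear_kernel(phi_trn, phi_trn)
    k_vt = linear_kernel(phi_val, phi_trn)
    alpha = np.linalg.solve(k_tt + lam * np.eye(len(x_trn)), y_trn)
    return k_vt @ alpha

rng = np.random.default_rng(0)
x_trn = rng.uniform(-3, 3, size=10)   # support inputs
y_trn = 2.0 * np.sin(x_trn)           # support targets
x_val = rng.uniform(-3, 3, size=10)   # query inputs
y_val = 2.0 * np.sin(x_val)           # query targets

pred = krr_fit_predict(x_trn, y_trn, x_val, lam=1e-3)
val_mse = float(np.mean((pred - y_val) ** 2))  # validation loss of the task
print(val_mse)
```

Because every operation is a differentiable matrix computation, the same steps can be written in an automatic differentiation framework so that the validation loss backpropagates into the embedding network.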
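Using the negative log likelihood loss function instead, the GP algorithm gives a probabilistic regressor whose predictive mean is given by Equation 3, and the loss for a task $t_j$ is:
$$\mathcal{L}^j = -\ln \mathcal{N}\!\left(y_{val}; \; \mathbb{E}[h^j(x_{val})], \; \mathrm{cov}(h^j(x_{val}))\right)$$
where $\mathbb{E}[h^j(x_{val})] = K_{val,trn} (K_{trn,trn} + \lambda I)^{-1} y_{trn}$ and $\mathrm{cov}(h^j(x_{val})) = K_{val,val} - K_{val,trn} (K_{trn,trn} + \lambda I)^{-1} K_{trn,val}$ are the predictive mean and covariance of the GP on the validation inputs.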
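The GP loss can be sketched in a few lines under stated assumptions: the embeddings are given as plain arrays (standing in for the output of the embedding network), the kernel is linear, and the regularization value doubles as the observation noise variance, which is also added to the predictive covariance to keep it well conditioned. The function name `gp_task_loss` is hypothetical.

```python
import numpy as np

def gp_task_loss(phi_trn, y_trn, phi_val, y_val, lam=0.1):
    # Gram matrices with a linear kernel on pre-computed embeddings.
    k_tt = phi_trn @ phi_trn.T
    k_vt = phi_val @ phi_trn.T
    k_vv = phi_val @ phi_val.T
    a = k_tt + lam * np.eye(len(y_trn))
    # Predictive mean: same expression as the KRR regressor.
    mean = k_vt @ np.linalg.solve(a, y_trn)
    # Predictive covariance of the latent function on the query inputs.
    cov = k_vv - k_vt @ np.linalg.solve(a, k_vt.T)
    # Assumption of this sketch: add observation noise to the predictive covariance.
    cov = cov + lam * np.eye(len(y_val))
    resid = y_val - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = resid @ np.linalg.solve(cov, resid)
    # Negative log density of a multivariate normal.
    nll = 0.5 * (quad + logdet + len(y_val) * np.log(2.0 * np.pi))
    return mean, float(nll)

rng = np.random.default_rng(0)
phi_trn = rng.normal(size=(8, 3))
phi_val = rng.normal(size=(5, 3))
w = np.array([1.0, -2.0, 0.5])
y_trn, y_val = phi_trn @ w, phi_val @ w

mean, nll = gp_task_loss(phi_trn, y_trn, phi_val, y_val)
_, nll_shifted = gp_task_loss(phi_trn, y_trn, phi_val, y_val + 5.0)
print(nll, nll_shifted)
```

Query targets that are far from the predictive mean receive a larger loss, which is the signal that drives the embedding toward representations under which the GP generalizes from the support set.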
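Finally, the parameters $\theta$ of the neural network, along with $\lambda$ and the hyperparameters $\rho$ of the kernel, are optimized using the expected loss on all tasks:
$$\arg\min_{\theta, \rho, \lambda} \; \mathbb{E}_{t_j \sim \mathcal{D}_{meta\text{-}trn}} \left[\mathcal{L}^j\right]$$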
For tractability of this expectation, we use a Monte-Carlo approximation with tasks. Unless otherwise specified, we use and , yielding a mini-batch of 320 samples. In our experiments, we have fixed to .
To summarize, this algorithm finds a representation common to all tasks such that the kernel method (in our case, GP and KRR) will generalize well from a small amount of samples. Interestingly, this alleviates two of the main limitations of single-task DKL: i) the scalability of the kernel method is no longer an issue since we are in the few-shot learning regime (even with several hundred samples, the computational cost of embedding each example is usually higher than inverting the Gram matrix), and ii) the network parameters (and kernel hyperparameters) are learned across a potentially large amount of tasks and samples, providing the opportunity to learn a complex representation without overfitting.
It is worth mentioning that using the linear kernel and the KRR algorithm, we recover the few-shot classification algorithm R2-D2 proposed by Bertinetto et al. [2018]. Their intent was to show that KRR can be used for fast adaptation at test-time in classification settings as it is differentiable. In contrast, our intent is to formalize and adapt the DKL framework to FSR and justify how this powerful combination of kernel methods and deep networks can learn covariance functions.
3 Proposed Method
3.1 Adaptive Deep Kernel Method
As described earlier, a DKL algorithm for FSR learns a fixed kernel function shared across all tasks of interest. While a regressor on a given task can obtain an arbitrarily small loss as the training set size increases, a fixed kernel might not learn well in the few-shot regime. To verify this hypothesis, we propose an adaptive deep kernel learning (ADKL) algorithm illustrated by Figure 1. It learns an adaptive, task-dependent kernel for few-shot learning instead of a single fixed kernel. We define the adaptive kernel as follows:
$$k_{t_j}(x, x') = k_\rho\left(\phi_\theta(x, z_j), \phi_\theta(x', z_j)\right) \tag{9}$$
where $z_j = \psi_\eta(D^j_{trn})$ represents a task embedding obtained by transforming the training set $D^j_{trn}$ with the task encoding network $\psi_\eta$. We now describe in more detail the task encoding network $\psi_\eta$ and the architecture of the adapted kernel for a given task $t_j$.
The challenge of the task encoding network $\psi_\eta$ is to capture complex dependencies in the training set $D^j_{trn}$ to provide a useful task encoding $z_j$. Furthermore, the task encoder should be invariant to permutations of the training set and be able to encode a variable amount of samples. After exploring a variety of architectures, we found that more complex ones such as Transformers [Vaswani et al., 2017] tend to underperform. This is possibly due to overfitting or the sensitivity of training such architectures.
Consequently, we introduce slight modifications to DeepSets, an order-invariant network proposed by Zaheer et al. [2017]. It begins with the computation of the representation $r_i = r(u(x_i), v(y_i))$ of each input-target pair $(x_i, y_i)$ for all $i$, using neural networks $u$, $v$, and $r$. The representation $r_i$ captures nonlinear interactions between the inputs and the targets if $r$ is a nonlinear transformation. Then, by computing $\mu_r$ and $\sigma_r$, the empirical mean and standard deviation of the set $\{r_i\}$, respectively, we obtain the task representation as follows:
$$z_j = \psi_\eta(D^j_{trn}) = q(\mu_r, \sigma_r) \tag{10}$$
where $q$ is also a neural network. As $\mu_r$ and $\sigma_r$ are invariant to permutations in $D^j_{trn}$, it follows that $z_j$ is also permutation invariant. Overall, $\psi_\eta$ is just a nonlinear mapping of the first and second moments of the sample representations, which were themselves nonlinear transformations of the original inputs and targets. The learnable parameters $\eta$ of the task encoder include all the parameters of the networks $u$, $v$, $r$, and $q$, and are shared across all tasks.
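The pooling scheme described above can be sketched as follows. The single-layer `tanh` networks with random, untrained weights, the layer sizes, and the function name `encode_task` are all assumptions made for illustration; only the structure (per-pair representations, mean and standard deviation pooling, final mapping) follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical single-layer networks with fixed random weights (untrained).
W_u = rng.normal(size=(1, 8))    # input network
W_v = rng.normal(size=(1, 8))    # target network
W_r = rng.normal(size=(16, 8))   # pair representation network
W_q = rng.normal(size=(16, 4))   # final mapping to the task encoding

def encode_task(x, y):
    # Embed inputs and targets separately, then combine each pair nonlinearly.
    u = np.tanh(x[:, None] @ W_u)
    v = np.tanh(y[:, None] @ W_v)
    r = np.tanh(np.concatenate([u, v], axis=1) @ W_r)
    # Pool with the empirical mean and standard deviation over the set.
    pooled = np.concatenate([r.mean(axis=0), r.std(axis=0)])
    return np.tanh(pooled @ W_q)

x = np.linspace(-1.0, 1.0, 7)
y = np.sin(x)
z = encode_task(x, y)

# Shuffling the support set leaves the encoding unchanged.
perm = np.array([3, 0, 6, 2, 5, 1, 4])
z_shuffled = encode_task(x[perm], y[perm])
# A support set of a different size yields an encoding of the same dimension.
z_small = encode_task(x[:3], y[:3])
print(z, z_shuffled, z_small)
```

Both properties required of the encoder hold by construction: the pooled statistics ignore sample order, and their dimension does not depend on the number of samples.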
Adapted Kernel Computation
Once the task representation $z_j$ is obtained, we compute the conditional input embedding using the function $\phi_\theta$. Let $u(x)$ be the non-conditional embedding of the input $x$ using a neural network $u$, whose parameters are shared with the input network within the task encoder. We simply compute the conditional embedding of inputs as:
$$\phi_\theta(x, z_j) = f([u(x), z_j]) \tag{11}$$
where $[\cdot, \cdot]$ denotes concatenation and $f$ is a nonlinear neural network that allows for capturing complex interactions between the task and the input representations. The adapted kernel for a given task is then obtained by combining Equations 9, 10, 11. The learnable parameters of $u$ and $f$ together constitute $\theta$ and are shared across all tasks. Alternatively, different architectures such as Feature-wise Linear Modulation (FiLM) [Perez et al., 2018], Hypernetwork [Ha et al., 2016], or Bilinear transformation [Tenenbaum and Freeman, 2000] could be used to compute the conditional embedding, though we found that a simple concatenation was sufficient for our applications.
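A minimal sketch of the concatenation-based conditioning, assuming single-layer `tanh` networks with random weights and a linear base kernel; the names `conditional_embedding` and `adapted_kernel` and all sizes are hypothetical. It shows that the same inputs produce a different Gram matrix when the task encoding changes.

```python
import numpy as np

rng = np.random.default_rng(1)
W_u = rng.normal(size=(1, 8))       # hypothetical non-conditional input network
W_f = rng.normal(size=(8 + 4, 6))   # hypothetical network applied after concatenation

def conditional_embedding(x, z):
    # Embed the inputs, append the task encoding to every row, then mix nonlinearly.
    u = np.tanh(x[:, None] @ W_u)
    z_tiled = np.tile(z, (len(x), 1))
    return np.tanh(np.concatenate([u, z_tiled], axis=1) @ W_f)

def adapted_kernel(x_a, x_b, z):
    # Linear base kernel evaluated on task-conditioned embeddings.
    return conditional_embedding(x_a, z) @ conditional_embedding(x_b, z).T

x = np.linspace(-1.0, 1.0, 5)
z_task_1 = np.zeros(4)   # two illustrative task encodings
z_task_2 = np.ones(4)

K1 = adapted_kernel(x, x, z_task_1)
K2 = adapted_kernel(x, x, z_task_2)
print(np.abs(K1 - K2).max())
```

Since the kernel is an inner product of embeddings, each task-specific Gram matrix remains symmetric and positive semi-definite, so KRR and GP inference apply unchanged.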
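To help the training of the ADKL algorithm, we maximize the mutual information between the task encodings $\psi_\eta(D^j_{trn})$ and $\psi_\eta(D^j_{val})$. This serves as a regularizer that helps the encoder learn a useful representation for describing the task. This is done using the MINE algorithm [Belghazi et al., 2018], which optimizes a lower bound on the mutual information. For two random variables $X$ and $Y$ and a similarity measure $T_\omega$ between $X$ and $Y$, parameterized by $\omega$, the following inequality holds:
$$I(X; Y) \geq \mathbb{E}_{p(x, y)}\left[T_\omega(x, y)\right] - \log \mathbb{E}_{p(x) p(y)}\left[e^{T_\omega(x, y)}\right]$$
Using a mini-batch approximation of the expectations, and the cosine distance as the similarity measure between the two sets, this yields a mini-batch estimate $\hat{I}$ of the right-hand side. This approximation yields a small bias on the gradient since the right-hand side takes the log of the expectations. Since we are not interested in the precise value of the mutual information, this does not constitute a problem.
When adding this to our training objective with $\gamma$ as a tradeoff hyperparameter, we have:
$$\arg\min_{\theta, \eta, \rho, \lambda} \; \mathbb{E}_{t_j \sim \mathcal{D}_{meta\text{-}trn}} \left[\mathcal{L}^j\right] - \gamma \, \hat{I}$$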
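The mini-batch estimate of the mutual information bound can be sketched as below, under assumptions of this example: the similarity measure is the plain cosine similarity with no learned parameters, matching rows of the two arrays are encodings of the same task (samples from the joint), and mismatched rows serve as samples from the product of marginals. The function name `mi_lower_bound` is hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    # Pairwise cosine similarity between rows of a and rows of b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def mi_lower_bound(z_trn, z_val):
    # Donsker-Varadhan style bound estimated on one mini-batch of tasks.
    t = cosine_similarity_matrix(z_trn, z_val)
    n = t.shape[0]
    joint_term = np.mean(np.diag(t))             # same-task pairs
    off_diag = t[~np.eye(n, dtype=bool)]         # cross-task pairs
    marginal_term = np.log(np.mean(np.exp(off_diag)))
    return float(joint_term - marginal_term)

rng = np.random.default_rng(0)
z_trn = rng.normal(size=(16, 4))
z_val_matched = z_trn + 0.01 * rng.normal(size=(16, 4))  # encodings agree within a task
z_val_indep = rng.normal(size=(16, 4))                   # encodings unrelated to z_trn

bound_matched = mi_lower_bound(z_trn, z_val_matched)
bound_indep = mi_lower_bound(z_trn, z_val_indep)
print(bound_matched, bound_indep)
```

The estimate is larger when the support and query encodings of a task agree with each other and differ from those of other tasks, which is the behaviour the regularizer rewards.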
4 Related Work
Our study spans the research areas of deep kernel learning and few-shot learning. For a comprehensive overview of few-shot learning methods, we refer the reader to Wang and Yao [2019] and Chen et al. [2019], as we focus on work related to DKL herein.
Across the spectrum of learning approaches, DKL methods lie between neural networks and kernel methods. While neural networks can learn from a very large amount of data without much prior knowledge, kernel methods learn from fewer data when given an appropriate covariance function that accounts for prior knowledge of the relevant task. In the first DKL attempt, Wilson et al. [2016] combined GPs with CNNs to learn a covariance function adapted to a task from large amounts of data, though the large time and space complexity of kernel methods forced the approximation of the exact kernel using KISS-GP [Wilson and Nickisch, 2015]. Dasgupta et al. have demonstrated that such approximation is not necessary using finite rank kernels. Here, we also show that learning from a collection of tasks (FSR mode) does not require any approximation when the covariance function is shared across tasks. This is an important distinction between our study and other existing studies in DKL, which learn their kernel from single tasks instead of task collections.
On the spectrum between NNs and kernel methods, metric learning also bears mention. Metric learning algorithms learn an input covariance function shared across tasks, but rely only on the expressive power of DNNs. First, stochastic kernels are built out of shared feature extractors, simple pairwise metrics (e.g. cosine similarity [Vinyals et al., 2016], Euclidean distance [Snell et al., 2017]), or parametric functions (e.g. relation modules [Sung et al.], graph neural networks [Garcia and Bruna, 2017]). Then, within tasks, the predictions consist of a distance-weighted combination of the training sample labels with the stochastic kernel evaluations; no adaptation is done. The recently introduced Proto-MAML method [Triantafillou et al., 2019], which captures the best of Prototypical Networks [Snell et al., 2017] and MAML [Finn et al., 2017], allows within-task adaptation using MAML on a network built on top of the kernel function. Similarly, Kim et al. [2018] have proposed a Bayesian version of MAML where a feature extractor is shared across tasks, while multiple MAML particles are used for the task-level adaptation. Bertinetto et al. [2018] have also tackled this lack of adaptation for new tasks by using KRR and Logistic Regression to find the appropriate weighting of the training samples. This study can be considered the first application of DKL to few-shot learning. However, its contribution was limited to showing that simple differentiable learning algorithms can increase adaptation in the metric learning framework. Our work extends beyond this by formalizing few-shot learning in the deep kernel learning framework where test-time adaptation is achieved through kernel methods. We also create another layer of adaptation by allowing task-specific kernels that are created at test-time.
Since ADKL-GP uses GPs, it has relations to neural processes [Garnelo et al., 2018a], which proposes a scalable alternative to learning regression functions by performing inference on stochastic processes. Furthermore, in this family of methods, Conditional Neural Processes (CNP) [Garnelo et al., 2018b] and Attentive Neural Processes (ANP) [Kim et al., 2019] are even more relevant to our study as both methods learn conditional stochastic processes parameterized by conditions derived from training data points. While ANP imposes consistency with respect to some prior process, CNP does not and thus does not have the mathematical guarantees associated with stochastic processes. By comparison, our proposed ADKL-GP algorithm also learns conditional stochastic processes, but within the GP framework, thus benefiting from the associated mathematical guarantees.
Previous work on few-shot regression has relied on toy datasets to evaluate performance. We instead introduce two real-world benchmarks drawn from the field of drug discovery. These benchmarks will allow us to measure the ability of few-shot learning algorithms to adapt in settings where tasks require a considerably different measure of similarity between the inputs. For instance, when predicting binding affinities between small molecules, the covariance function must learn characteristics of a binding site that changes from task to task. We describe each dataset used in our experiments below and, unless stated otherwise, their meta-training, meta-validation, and meta-testing contain, respectively, 56.25%, 18.75% and 25% of all of their tasks. A pre-processed version of these datasets is available with this work (URL to appear in camera-ready).
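The split proportions correspond to holding out 25% of the tasks for meta-testing and then 25% of the remainder for meta-validation (0.75 × 0.75 = 56.25% and 0.75 × 0.25 = 18.75%). The sketch below reproduces these fractions on a hypothetical collection of 1,600 task identifiers; the seeded shuffle and the function name `split_tasks` are assumptions, not the procedure used to build the released splits.

```python
import random

def split_tasks(task_ids, seed=0):
    # Shuffle task identifiers reproducibly, then cut by the stated fractions.
    ids = list(task_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_test = round(0.25 * n)      # 25% meta-testing
    n_valid = round(0.1875 * n)   # 18.75% meta-validation
    test = ids[:n_test]
    valid = ids[n_test:n_test + n_valid]
    train = ids[n_test + n_valid:]  # remaining 56.25% meta-training
    return train, valid, test

train, valid, test = split_tasks(range(1600))
print(len(train), len(valid), len(test))
```

Splitting at the task level, rather than at the sample level, keeps every meta-testing task entirely unseen during meta-training.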
Sinusoids: This meta-dataset was recently proposed by Kim et al. [2018] as a challenging few-shot synthetic regression benchmark. It consists of 5,000 tasks, each defined by a sinusoidal function whose parameters characterize the task and are drawn from fixed intervals. Samples for each task are generated by sampling inputs and adding observational noise. Every model on this task collection uses a fully connected network of 2 layers of 120 hidden units as its input feature extractor. Visuals of ground truths and predictions for some functions are shown in Appendix C.
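To illustrate how such a task collection can be generated, the sketch below samples one noisy sinusoid task. The functional form (amplitude, frequency, phase), every interval, the input range, and the noise level are placeholder assumptions chosen for this example; they are not the values of the benchmark.

```python
import numpy as np

def sample_sinusoid_task(rng, n_samples=20,
                         amp_range=(0.1, 5.0),           # placeholder interval
                         freq_range=(0.5, 2.0),          # placeholder interval
                         phase_range=(0.0, 2 * np.pi),   # placeholder interval
                         x_range=(-5.0, 5.0),            # placeholder input range
                         noise_std=0.1):                 # placeholder noise level
    # Draw the parameters that define the task.
    amp = rng.uniform(*amp_range)
    freq = rng.uniform(*freq_range)
    phase = rng.uniform(*phase_range)
    # Draw inputs and noisy targets for this task.
    x = rng.uniform(*x_range, size=n_samples)
    y = amp * np.sin(freq * x + phase) + rng.normal(0.0, noise_std, size=n_samples)
    return x, y, (amp, freq, phase)

rng = np.random.default_rng(0)
x, y, (amp, freq, phase) = sample_sinusoid_task(rng)
print(x.shape, y.shape, amp, freq, phase)
```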
Binding: The goal here is to predict the binding affinity of small, drug-like molecules to proteins. This task collection was extracted from the public database BindingDB (original data available at www.bindingdb.org) and encompasses 7,620 tasks, each containing between 7 and 9,000 samples. Each task corresponds to a protein sequence, which thus defines a separate distribution on the input space of molecules and the output space of (real-valued) binding readouts.
Antibacterial: The goal here is to predict the antimicrobial activity of small molecules for various bacteria. The task collection was extracted from the public database PubChem (available at https://pubchem.ncbi.nlm.nih.gov/) and contains 3,842 tasks, each consisting of 5 to 225 samples. A task corresponds to a combination of a bacterial strain and an experimental setting, which define different data distributions.
For both real-world datasets, the molecules are represented by their SMILES encodings (see https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system), which are descriptions of the molecular structure using short ASCII strings. All models evaluated on these collections share the same input feature extractor configuration: a 1-D CNN of 2 layers of 128 hidden units each and a kernel size of 5. We use a CNN instead of an LSTM or advanced graph convolution methods for scalability reasons. Moreover, the targets were scaled linearly between 0 and 1.
5.2 Benchmarking analysis
We evaluate model performance against R2-D2 [Bertinetto et al., 2018], CNP [Garnelo et al., 2018b], and MAML [Finn et al., 2017]. R2-D2 is a natural comparison to ADKL-KRR (when the latter uses the linear kernel) to show whether the adapted deep kernel provides more test-time adaptation. CNP is also a natural comparison to ADKL-GP and will help measure performance differences between the task-level Bayesian models generated within the GP and CNP frameworks. MAML is considered herein for its fast adaptation at test-time and as the representative of initialization- and optimization-based models. In the following experiments, all DKL methods use the linear kernel.
Our first set of experiments evaluates performance on both the real-world and toy tasks. We train each method using support and query sets of the same fixed size. During meta-testing, the support set keeps that size, but the query set consists of the remaining samples of each task. For the datasets lacking sufficient samples (the Binding and Antibacterial collections), we use half of the samples in the support set and the remaining in the query set. For each task, during meta-testing, we average the Mean Squared Error (MSE) over 20 random partitions of the query and support sets. We refer to this value as the task MSE. Figure 2 illustrates the task MSE distributions over tasks for each collection and algorithm (the best hyperparameters were chosen on the meta-validation set). In general, we observe that the real-world datasets are challenging for all methods, but ADKL methods consistently outperform R2-D2 and CNP. The gap between ADKL-KRR and R2-D2 shows the importance of adapting the kernel to each task rather than sharing a single kernel. Furthermore, the fact that ADKL-GP outperforms CNP shows the effectiveness of the ADKL approach in comparison with conditional neural processes. Finally, both adaptive deep kernel methods (ADKL-KRR and ADKL-GP) seem to perform comparably, despite different objective functions.
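The task MSE protocol can be sketched as follows. The `fit_predict` callable is a hypothetical interface (support inputs, support targets, query inputs, returning query predictions), and the `mean_predictor` baseline is only a placeholder for a real few-shot regressor.

```python
import numpy as np

def task_mse(x, y, fit_predict, support_size, n_partitions=20, seed=0):
    # Average the query-set MSE over random support/query partitions of one task.
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_partitions):
        perm = rng.permutation(len(x))
        support, query = perm[:support_size], perm[support_size:]
        pred = fit_predict(x[support], y[support], x[query])
        scores.append(np.mean((pred - y[query]) ** 2))
    return float(np.mean(scores))

def mean_predictor(x_support, y_support, x_query):
    # Placeholder model: predicts the mean support target for every query input.
    return np.full(len(x_query), y_support.mean())

x = np.linspace(0.0, 1.0, 30)
y_constant = np.full(30, 3.0)
y_linear = 2.0 * x

score_constant = task_mse(x, y_constant, mean_predictor, support_size=10)
score_linear = task_mse(x, y_linear, mean_predictor, support_size=10)
print(score_constant, score_linear)
```

Averaging over partitions reduces the dependence of a task's score on which samples happen to land in the support set.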
A second experimental set is used to measure the across-task generalization and within-task adaptation capabilities of our methods relative to others. We do so by controlling the number of training tasks ($T$) and the size of the support sets during meta-training and meta-testing ($m$). Only the Sinusoids collection was used, as experiments with the real-world collections were deemed too time-consuming for the scope of this study. One would expect the algorithms to generalize poorly to new tasks for lower values of $T$, and their task-level models to adapt poorly to new samples for small values of $m$. However, as illustrated in Figure 3, all DKL methods generalize better across tasks than others, as their overall performance is robust against the number of training tasks. They also demonstrate improved within-task generalization using as few as 15 samples, while other methods require more samples to achieve the same. Moreover, for small support sets, ADKL-KRR shows better within-task generalization than ADKL-GP and R2-D2. Once again, the difference in performance between ADKL-KRR and R2-D2 can be attributed to the kernel adaptation at test-time, as it is the only difference between both methods. The difference for small $m$ between ADKL-GP and ADKL-KRR can be attributed to larger predictive uncertainty in GP as the number of samples gets smaller.
5.3 Active Learning
Here we report the results of active learning experiments. Our intent is to measure the effectiveness of the uncertainty captured by the predictive distribution of ADKL-GP for active learning. The comparison with CNP serves to measure which of CNP and GP better captures the data uncertainty for improving FSR under active sample selection. For this purpose, we meta-train both algorithms using support and query sets of a fixed size (with a separate setting for the Sinusoids collection). During meta-test time, five samples are randomly selected to constitute the support set $D_{trn}$ and build the initial hypothesis for each task. Then, from a pool $\mathcal{U}$ of unlabeled data, we choose the input of maximum predictive entropy, i.e., $x^* = \arg\max_{x \in \mathcal{U}} H[y \mid x, D_{trn}]$. The latter is removed from $\mathcal{U}$ and added to $D_{trn}$ with its predicted label. The within-task adaptation is performed on the new support set to obtain a new hypothesis, which is evaluated on the query set of the task. This process is repeated until we reach the allowed budget of queries.
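The selection step of this procedure can be sketched with a small GP. For a Gaussian predictive distribution the entropy is an increasing function of the predictive variance, so the maximum-entropy input is the one with the largest variance. The RBF kernel on raw scalar inputs, the noise level, and the function names are assumptions of this example (the experiments use a kernel on learned embeddings).

```python
import numpy as np

def rbf(a, b, length=1.0):
    # RBF kernel on scalar inputs; k(x, x) = 1.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def predictive_variance(x_support, x_pool, noise=1e-2):
    # GP predictive variance of a noisy observation at each pool input.
    k_ss = rbf(x_support, x_support) + noise * np.eye(len(x_support))
    k_ps = rbf(x_pool, x_support)
    reduction = np.sum(k_ps * np.linalg.solve(k_ss, k_ps.T).T, axis=1)
    return 1.0 + noise - reduction

def select_query(x_support, x_pool, noise=1e-2):
    # Entropy of a univariate Gaussian: 0.5 * log(2 * pi * e * variance).
    var = predictive_variance(x_support, x_pool, noise)
    entropy = 0.5 * np.log(2.0 * np.pi * np.e * var)
    return int(np.argmax(entropy))

x_support = np.array([0.0, 0.1, 0.2])
x_pool = np.array([0.05, 0.15, 4.0])
variances = predictive_variance(x_support, x_pool)
chosen = select_query(x_support, x_pool)
print(variances, chosen)
```

In this toy pool the selected input is the one far from the support set, since it is where the GP is least certain.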
Figure 4 highlights that, in the active learning setting, ADKL-GP consistently outperforms CNP. Very few samples are queried by ADKL-GP to capture the data distribution, while CNP performance is far from optimal, even when allowed the maximum number of queries. Also, since using the maximum predictive entropy strategy is better than querying samples at random for ADKL-GP (solid vs. dashed line), these results suggest that the predictive uncertainty obtained with GP is informative and more accurate than that of CNP. Moreover, when the number of queries is greater than , we observe a performance degradation for CNP, while ADKL-GP remains consistent. This observation highlights the generalization capacity of DKL methods, even outside the few-shot regime where they have been trained — this same property does not hold true for CNP. We attribute this property of DKL methods to their use of kernel methods. In fact, their role in adaptation and generalization increases as we move away from the few-shot training regime.
5.4 Regularization and Kernel Impact
In our final set of experiments, we take a closer look at the impact of the base kernel and the meta-regularization factor on the generalization during meta-testing. We do so by evaluating ADKL-KRR on the Sinusoids collection with different hyperparameter combinations. Figure 5 (left) shows the performances obtained by varying the meta-regularization parameter over different hyperparameter configurations (listed in Appendix A). One can observe that non-zero values of this parameter help in most cases, meaning that our added regularizer slightly improves the task encoder learning as intended. We also measure how different base kernels impact the learning process, as choosing the appropriate kernel function is crucial for kernel methods. We test the linear and RBF kernels and their normalized versions, where the normalized version of a kernel $k$ is given by: $k'(x, x') = k(x, x') / \sqrt{k(x, x) \, k(x', x')}$. Over different hyperparameter combinations, one observes that the linear kernel yields better generalization performances than the RBF kernel (see Figure 5 right). Such a result is within expectation as the scaling parameter of the RBF kernel is shared across tasks, making it more difficult to adapt the deep kernel. We could explore learning the base kernel hyperparameters using a network similar to the task encoder in future work. It is also worth noting that although the kernel normalization impacts outcomes, there is no clear conclusion to be drawn. We therefore advise treating the kernel function and its normalization as hyperparameters of the DKL methods.
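A short sketch of kernel normalization, assuming the standard definition in which each kernel value is divided by the geometric mean of the two self-similarities. The helper names and the random inputs are hypothetical. For the linear kernel this recovers the cosine similarity, while the RBF kernel is unchanged because its self-similarity is already one.

```python
import numpy as np

def linear_kernel(a, b):
    return a @ b.T

def rbf_kernel(a, b, length=1.0):
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length ** 2)

def normalized(kernel, a, b):
    # Divide k(x, x') by sqrt(k(x, x) * k(x', x')).
    k_ab = kernel(a, b)
    k_aa = np.diag(kernel(a, a))
    k_bb = np.diag(kernel(b, b))
    return k_ab / np.sqrt(np.outer(k_aa, k_bb))

rng = np.random.default_rng(0)
a = rng.normal(size=(5, 3))

k_lin_norm = normalized(linear_kernel, a, a)
unit = a / np.linalg.norm(a, axis=1, keepdims=True)
cosine = unit @ unit.T
k_rbf_norm = normalized(rbf_kernel, a, a)
print(np.diag(k_lin_norm))
```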
In this paper, we investigate the benefits of DKL methods for FSR. By comparing methods on both real-world and toy task collections, we have demonstrated the effectiveness of the DKL framework in FSR. Both ADKL-GP and ADKL-KRR outperform the single kernel DKL method, providing evidence that they add more adaptation capacity at test-time through adaptation of the kernel. Given its Bayesian nature, ADKL-GP also allows for improvement of the learned models at test-time, providing great value in settings such as drug discovery. By making our drug discovery task collections publicly available, we hope that the community will leverage these advances to propose FSR algorithms that are ready to be deployed in real-life settings, in turn having a positive impact on the drug discovery process.
- Wang and Yao  Yaqing Wang and Quanming Yao. Few-shot learning: A survey. CoRR, abs/1904.05046, 2019. URL http://arxiv.org/abs/1904.05046.
- Chen et al.  Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. arXiv preprint arXiv:1904.04232, 2019.
- Thrun and Pratt  Sebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. In Learning to learn, pages 3–17. Springer, 1998.
- Vilalta and Drissi  Ricardo Vilalta and Youssef Drissi. A perspective view and survey of meta-learning. Artificial intelligence review, 18(2):77–95, 2002.
- Li et al.  Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-sgd: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.
- Kim et al.  Taesup Kim, Jaesik Yoon, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning. arXiv preprint arXiv:1806.03836, 2018.
- Yi Loo  Yi Loo, Swee Kiat Lim, Gemma Roig, and Ngai-Man Cheung. Few-shot regression via learned basis functions. open review preprint:r1ldYi9rOV, 2019.
- Koch et al.  Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, volume 2, 2015.
- Vinyals et al.  Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638, 2016.
- Snell et al.  Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
- Garcia and Bruna  Victor Garcia and Joan Bruna. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043, 2017.
- Bertinetto et al.  Luca Bertinetto, Joao F Henriques, Philip HS Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136, 2018.
- Triantafillou et al.  Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, and Hugo Larochelle. Meta-dataset: A dataset of datasets for learning to learn from few examples. arXiv preprint arXiv:1903.03096, 2019.
- Finn et al.  Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR. org, 2017.
- Ravi and Larochelle  Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. 2016.
- Zaheer et al.  Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov, and Alexander J Smola. Deep sets. In Advances in neural information processing systems, pages 3391–3401, 2017.
- Scholkopf and Smola  Bernhard Scholkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2001.
- Williams and Rasmussen  Christopher KI Williams and Carl Edward Rasmussen. Gaussian processes for regression. In Advances in neural information processing systems, pages 514–520, 1996.
- Wilson et al.  Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378, 2016.
- Steinwart and Christmann  Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008.
- Williams and Seeger  Christopher KI Williams and Matthias Seeger. Using the nyström method to speed up kernel machines. In Advances in neural information processing systems, pages 682–688, 2001.
- Wilson and Nickisch  Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured gaussian processes (kiss-gp). In International Conference on Machine Learning, pages 1775–1784, 2015.
- Vaswani et al.  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
- Perez et al.  Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Ha et al.  David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
- Tenenbaum and Freeman  Joshua B Tenenbaum and William T Freeman. Separating style and content with bilinear models. Neural computation, 12(6):1247–1283, 2000.
- Belghazi et al.  Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. Mine: mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.
- Dasgupta et al.  Sambarta Dasgupta, Kumar Sricharan, and Ashok Srivastava. Finite rank deep kernel learning.
- Sung et al.  Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In
- Garnelo et al. [2018a] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, SM Eslami, and Yee Whye Teh. Neural processes. arXiv preprint arXiv:1807.01622, 2018a.
- Garnelo et al. [2018b] Marta Garnelo, Dan Rosenbaum, Chris J Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo J Rezende, and SM Eslami. Conditional neural processes. arXiv preprint arXiv:1807.01613, 2018b.
- Kim et al.  Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum, Oriol Vinyals, and Yee Whye Teh. Attentive neural processes. arXiv preprint arXiv:1901.05761, 2019.
Appendix A Regularization impact
Table 1 presents the hyperparameter combinations used in the experiments to assess the impact of the joint training parameter .
Note that the performance is significantly worse when using the RBF kernel.
Appendix B Kernel impact
Table 2 shows the hyperparameter combinations we used to assess the effect of using different kernels, as well as the impact of normalizing them.
|Run||encoder arch||target FE||input FE|
|1||CNP||[32, 32]||[128, 128, 128]||0.279||0.274||0.786||0.795|
|2||CNP||[32, 32]||[128, 128]||0.299||0.300||0.789||0.799|
|3||CNP||||[128, 128, 128]||0.304||0.289||0.761||0.755|
|5||DeepSet||[32, 32]||[128, 128, 128]||0.313||0.269||0.804||0.877|
|6||DeepSet||[32, 32]||[128, 128]||0.301||0.277||0.849||0.856|
|7||DeepSet||||[128, 128, 128]||0.292||0.273||0.764||0.754|
|9||KRR||[32, 32]||[128, 128, 128]||0.296||0.246||0.815||0.815|
|10||KRR||[32, 32]||[128, 128]||0.289||0.290||0.824||0.824|
|11||KRR||||[128, 128, 128]||0.279||0.291||0.690||0.694|
Appendix C Prediction curves on the Sinusoids collection
Figure C.6 presents a visualization of the results obtained by each model on three tasks taken from the meta-test set. We provide the model with ten examples from an unseen task consisting of a slightly noisy sine function (shown in blue), and present in orange the approximation made by the network based on these ten examples.
Note that, contrary to the other methods, CNP and ADKL-GP give us access to the predictive uncertainty.