1 Introduction
Sequential decision making encompasses a large range of problems, with many decades of research targeted at problems such as Bayesian optimisation, multi-armed contextual bandits or reinforcement learning (RL). Recent years have brought great advances, particularly in RL (e.g. Schulman et al.; 2015, Silver et al.; 2016), allowing the successful application of RL agents to increasingly complex domains. Yet, most modern machine learning algorithms require several orders of magnitude more experience than humans to solve even relatively simple problems. In this paper we argue that such systems ought to leverage past experience acquired by tackling related problem instances, allowing fast progress on a target task in the limited-data setting.
Consider the task of designing a motor controller for an array of robot arms in a large factory. The robots vary in age, size and proportions. The objective of the controller is to send motor commands to the robot arms in a way that allows each one to achieve its designated task. Most current methods would tackle the control of each arm as a separate problem, despite the similarity between the arms and the tasks they may be assigned to. Instead, we argue that the availability of data on related problems ought to be harnessed, and discuss how learning data-driven priors can allow a controller for additional robot arms to be learned in a fraction of the time.
A second question that immediately arises in the design of such a controller is how to deal with uncertainty: uncertainty about the proportions of each robot, uncertainty about the physics of their motor movements, and the state of the environment. We thus argue that a suitable method should allow reasoning about predictive uncertainty, rapidly adjusting these estimates once more data becomes available.
Much recent research has studied the problem of decision making under such uncertainty. Frameworks such as Bayesian Optimisation (e.g. Močkus et al.; 1978, Schonlau et al.; 1998) and Contextual Bandits (e.g. Cesa-Bianchi and Lugosi; 2006) offer ways of efficiently balancing exploration and exploitation to simultaneously resolve uncertainties while making progress towards each robot arm’s objective. This is often achieved by specifying a prior belief model over the dynamics of the system in advance (e.g. a Gaussian Process), which is then used to balance exploration and exploitation. Importantly, significant domain knowledge may be required to make an appropriate choice of prior.
In this paper, we propose a strategy for combining the efficiency of traditional techniques for sequential decision making with the flexibility of learning-based approaches. We employ the framework of Neural Processes (NPs) (Garnelo et al.; 2018b) to instead learn suitable priors in a data-driven manner, and utilise them as part of the inner loop of established sequential decision making algorithms. For instance, in the factory, a Neural Process learns a prior over the dynamics of robot arms and rapidly adapts this prior at runtime to each individual arm; the adapted model is then used by a Bayesian Optimisation algorithm to control the robots. Thus, the Neural Process falls within the ‘Learning to Learn’ or ‘Meta-Learning’ paradigm, and allows for sharing of knowledge across the robot arms.
The contributions of the paper are:

We argue for and demonstrate the advantage of data-driven priors obtained through Meta-Learning in the sequential decision making paradigm.

The introduction of Neural Processes as a single model family that can be used to tackle a diverse set of such problems.

A range of novel problem instances that can be approached through meta-learning, showing new approaches to recommender systems and adversarial task search.

We empirically demonstrate the advantage of learning NPs for decision making, obtaining strong results on the experiments considered.
The remainder of the paper is structured as follows. In Section 2 we introduce Neural Processes (NPs). In Section 3 we show how NPs can be used to tackle general problems in (a) Bayesian optimisation (BO), (b) contextual multi-armed bandits (CMBs) and (c) model-based reinforcement learning (MBRL). In Section 4 we describe the three disjoint problem instances that we consider in this paper: (a) we use BO and NPs to identify failure cases in pretrained RL agents, (b) we use CMBs and NPs to train recommender systems, and (c) we use MBRL with NPs to learn solutions to continuous control problems from only a small amount of experience. We report the results of these experiments in Section 5, present related work in Section 6 and conclude with a discussion.
2 Neural Processes
Neural processes (NPs) are a family of models for few-shot learning (Garnelo et al.; 2018b). Given a number of realisations from some unknown stochastic process $f$, NPs can be used to predict the values of $f$ at some new, unobserved locations. In contrast to standard approaches to supervised learning such as linear regression or standard neural networks, NPs model a distribution over functions that agree with the observations provided so far (similar to e.g. Gaussian Processes (Rasmussen; 2003)). This is reflected in how NPs are trained: we require a dataset of evaluations of similar functions over the same input space $\mathcal{X}$ and output space $\mathcal{Y}$. Note, however, that we do not assume each function to be evaluated at the same inputs. Examples of such datasets could be the temperature profile over a day in different cities around the world or evaluations of functions generated from a Gaussian process with a fixed kernel. We provide further examples throughout the paper.
In order to allow NPs to learn distributions over functions, we split the evaluations for each function into two disjoint subsets: a set of context points $C$ and a set of targets $T$ that contains unobserved points. These data points are then processed by a neural network as follows:
(1) $r_i = h_\theta(x_i, y_i) \quad \text{for each } (x_i, y_i) \in C$
(2) $r = \bigoplus_i r_i$
(3) $z \sim \mathcal{N}(\mu_\psi(r), \sigma_\psi(r))$
(4) $y_t \sim p(y \mid g_\phi(x_t, z)) \quad \text{for each target input } x_t$
First, we use an encoder $h_\theta$ with parameters $\theta$, transforming all pairs $(x_i, y_i)$ in the context set $C$ to obtain representations $r_i$. We then aggregate all $r_i$ into a single representation $r$ using a permutation-invariant operator $\bigoplus$ (such as addition) that captures the information about the underlying function provided by the context points. Next, we parameterise a distribution over a latent variable $z$, here assumed to be Normal with mean $\mu_\psi(r)$ and standard deviation $\sigma_\psi(r)$ estimated by an encoder network with parameters $\psi$. Note that this latent variable is introduced to model uncertainty in function space, extending the Conditional Neural Process (Garnelo et al.; 2018a).
Thereafter, a decoder $g_\phi$ with parameters $\phi$ is used to obtain predictive distributions at the target positions $x_t$. Specifically, we have $y_t \sim p(y \mid g_\phi(x_t, z))$, with the form of $p$ depending on the data modelled. In practice, we might decide to share parameters between these networks. To reduce notational clutter, we suppress dependencies on parameters from now on.
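The encoding, aggregation, latent sampling and decoding steps described above can be sketched in NumPy as follows. This is a minimal illustration with randomly initialised, untrained weights; the layer sizes, the `tanh` non-linearity, the mean aggregator and all function names are assumptions made for the example and are not taken from the paper.

```python
import numpy as np


def init_mlp(sizes, rng):
    """Random (untrained) weights for a small fully connected network."""
    return [(rng.normal(0.0, 0.5, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def mlp(params, x):
    """Apply the network: tanh on hidden layers, linear output layer."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x


def np_forward(x_ctx, y_ctx, x_tgt, encoder, latent_net, decoder, rng):
    """One stochastic forward pass of a toy Neural Process."""
    # Step 1: encode every context pair (x_i, y_i) into a representation r_i.
    r_i = mlp(encoder, np.concatenate([x_ctx, y_ctx], axis=-1))
    # Step 2: aggregate with a permutation-invariant operator (here: the mean).
    r = r_i.mean(axis=0)
    # Step 3: parameterise a Normal over the latent z and draw one sample.
    mu, log_sigma = np.split(mlp(latent_net, r), 2)
    z = mu + np.exp(log_sigma) * rng.normal(size=mu.shape)
    # Step 4: decode each target input together with z into a predictive Normal.
    z_tiled = np.tile(z, (x_tgt.shape[0], 1))
    out = mlp(decoder, np.concatenate([x_tgt, z_tiled], axis=-1))
    y_mean, y_std = out[:, :1], np.exp(out[:, 1:])
    return y_mean, y_std


init_rng = np.random.default_rng(0)
d_r, d_z = 8, 4                                      # hypothetical sizes
encoder = init_mlp([2, 16, d_r], init_rng)           # input: (x, y) pairs
latent_net = init_mlp([d_r, 16, 2 * d_z], init_rng)  # output: mean, log-std of z
decoder = init_mlp([1 + d_z, 16, 2], init_rng)       # output: mean, log-std of y

x_ctx = np.array([[-1.0], [0.0], [1.0]])
y_ctx = np.sin(x_ctx)
x_tgt = np.linspace(-2.0, 2.0, 5)[:, None]
y_mean, y_std = np_forward(x_ctx, y_ctx, x_tgt, encoder, latent_net, decoder,
                           np.random.default_rng(1))
print(y_mean.shape, y_std.shape)
```

Because the aggregation step is permutation invariant, reordering the context points leaves the prediction unchanged for a fixed latent sample, and a single pass touches each context and target point once.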
Since the resulting objective is intractable, approximate inference techniques such as variational inference are used, leading to the following evidence lower bound:
(5) 
which is optimised with mini-batch stochastic gradient descent, using a different function $f$ for each element in the batch and re-sampling the context and target sets at each iteration.
There are two interesting properties following from these equations: (i) The complexity of a Neural Process is linear in the number of context and target points, allowing its evaluation on large context and target sets. (ii) No gradient steps are required at test time (in contrast to many other popular meta-learning techniques, e.g. Finn et al.; 2017). We will discuss examples where these properties are desirable when turning to specific problems later in the text.
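As a rough illustration of how such a bound is typically estimated, the sketch below computes a single-sample estimate as a Gaussian log-likelihood of the targets minus a KL term between two diagonal Normals. It assumes the form commonly used for Neural Processes (an approximate posterior over the latent conditioned on context and targets, and a conditional prior conditioned on the context only); that form, and all function and argument names, are assumptions of this example rather than a transcription of the paper's objective.

```python
import numpy as np


def kl_diag_normal(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ), summed over dimensions."""
    return float(np.sum(np.log(sigma_p / sigma_q)
                        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
                        - 0.5))


def gaussian_log_lik(y, mean, std):
    """Log-density of y under independent Normals, summed over all entries."""
    return float(np.sum(-0.5 * np.log(2.0 * np.pi * std ** 2)
                        - 0.5 * ((y - mean) / std) ** 2))


def elbo_estimate(y_tgt, y_mean, y_std, mu_q, sigma_q, mu_p, sigma_p):
    """Single-sample estimate: target log-likelihood minus KL(posterior || prior).

    y_mean and y_std are decoder outputs computed with a latent sampled from
    the approximate posterior (mu_q, sigma_q); (mu_p, sigma_p) parameterise
    the context-only conditional prior.
    """
    return (gaussian_log_lik(y_tgt, y_mean, y_std)
            - kl_diag_normal(mu_q, sigma_q, mu_p, sigma_p))


# Hypothetical values for one function in a mini-batch.
y_tgt = np.array([0.1, -0.2])
value = elbo_estimate(y_tgt, np.zeros(2), np.ones(2),
                      mu_q=np.zeros(3), sigma_q=np.ones(3),
                      mu_p=np.zeros(3), sigma_p=np.ones(3))
print(value)
```

In training, this quantity would be averaged over the functions in a mini-batch and maximised by gradient ascent, which requires an automatic differentiation framework rather than plain NumPy.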
3 Sequential Decision Making With Neural Process Surrogate Models
We now discuss how to apply NPs to various instances of the sequential decision making problem. In all cases, we make choices under uncertainty to optimise some notion of utility. A popular strategy for such problems involves fitting a surrogate model to the observed data at hand, making predictions about the problem and constantly refining the model as more information becomes available. The resulting predictions in unobserved areas of the input space allow for planning and more principled techniques for exploration, making such models useful when data efficiency is a particular concern.
3.1 Bayesian Optimisation
We first consider the problem of optimising black-box functions without gradient information. A popular approach is Bayesian Optimisation (BO) (e.g. Shahriari et al.; 2016), where we wish to find the minimiser of a (possibly noisy) function $f$ on some space $\mathcal{X}$ without requiring access to its derivatives. The BO approach consists of fitting a probabilistic surrogate model to approximate $f$ on the small set of evaluations observed thus far. Examples of a surrogate are Gaussian Processes, Tree-structured Parzen (density) estimators, Bayesian Neural Networks or, for the purpose of this paper, NPs. The decision involved in the process is the choice of the point at which we next evaluate $f$. This evaluation is typically assumed to be costly, e.g. when the optimisation of a machine learning algorithm is involved (Snoek et al.; 2012). Thus, in addition to providing a good function approximation from limited data, we require a suitable model to provide good uncertainty estimates, which help to address the inherent exploration/exploitation trade-off during decision making.
In addition, we require an acquisition function to guide decision making, designed such that its optimiser is taken as the next point for evaluation. Model uncertainty is typically incorporated into the acquisition function, as is done in some popular choices such as expected improvement (Močkus et al.; 1978) or the UCB algorithm (Srinivas et al.; 2009). Given our surrogate model of choice for this paper, the Thompson sampling (Thompson; 1933, Chapelle and Li; 2011) criterion is particularly convenient (although others could be used). That is, a point is chosen for evaluation with probability
(6) 
where the predictive distribution is given by the NP decoder (4). This is usually approximated by drawing a single function sample from the model and choosing its minimiser as the next evaluation. We show this procedure in Algorithm 1.
(7) 
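The Thompson sampling loop described above (draw one function sample from the surrogate, evaluate the true function at the sample's minimiser, add the result to the observed data) can be sketched as follows over a finite candidate set. The `toy_sampler` is a deliberately crude stand-in for a trained NP, and every name in the block is hypothetical; any model that returns one sampled function value per candidate could be plugged in instead.

```python
import numpy as np


def toy_sampler(x_obs, y_obs, candidates, rng):
    """Stand-in for a surrogate: one sampled function value per candidate."""
    if len(x_obs) == 0:
        return rng.normal(0.0, 1.0, size=len(candidates))
    dist = np.abs(candidates[:, None] - x_obs[None, :])
    mean = y_obs[dist.argmin(axis=1)]          # value at the nearest observation
    std = np.minimum(1.0, dist.min(axis=1))    # more spread far away from data
    return rng.normal(mean, std)


def thompson_search(objective, candidates, sample_function, n_iters, rng):
    """Minimise `objective` over `candidates` with Thompson sampling."""
    xs, ys = [], []
    for _ in range(n_iters):
        # Draw a single function sample conditioned on the data seen so far.
        f_sample = sample_function(np.array(xs), np.array(ys), candidates, rng)
        # Evaluate the true (costly) function at the sample's minimiser.
        x_next = float(candidates[int(np.argmin(f_sample))])
        xs.append(x_next)
        ys.append(float(objective(x_next)))
    best = int(np.argmin(ys))
    return xs[best], ys[best], list(zip(xs, ys))


candidates = np.linspace(0.0, 1.0, 21)
best_x, best_y, history = thompson_search(
    lambda x: (x - 0.3) ** 2, candidates, toy_sampler,
    n_iters=15, rng=np.random.default_rng(0))
print(best_x, best_y)
```

With an NP as the surrogate, `sample_function` would correspond to sampling one latent variable and decoding all candidates, so no model refitting is needed between iterations.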
3.2 Contextual Multi-Armed Bandits
Closely related to Bayesian Optimisation, the decision problem known as a contextual multi-armed bandit is formulated as follows. At each trial:

Some informative context is revealed. This could for instance be features describing a user of an online content provider (e.g. Li et al.; 2010). Crucially, we assume the context to be independent of past trials.

Next, we are to choose one of the available arms and receive some reward given our choice at the current trial. The current context, past actions and past rewards are available to guide this choice. As the reward function is unknown, we face the same exploration/exploitation trade-off as in the BO case.

The arm-selection strategy is updated given access to the newly acquired reward. Importantly, no reward is provided for any of the arms that were not chosen.
Neural Processes can be applied to this problem in a relatively straightforward manner. Decision making proceeds as in Algorithm 1 (Thompson sampling being a reasonable choice), with the main difference being that we evaluate the NP separately for each arm. Past data is easily incorporated in the context set by providing a one-hot vector indicating which arm has been chosen, along with the corresponding context and reward.
However, in practice it might be significantly more difficult to find or design related bandit problems that are available for pretraining. We will discuss a natural real-world contextual bandit application suitable for NP pretraining in the experimental section.
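A minimal sketch of the one-hot encoding and per-arm evaluation described above is given below. The model is again a toy stand-in for an NP (it averages past rewards of the queried arm and ignores the context features), and all function names, the reward definition and the number of arms are assumptions made for illustration.

```python
import numpy as np


def encode_observation(context, arm, n_arms):
    """Concatenate the context features with a one-hot indicator of the arm."""
    one_hot = np.zeros(n_arms)
    one_hot[arm] = 1.0
    return np.concatenate([np.asarray(context, dtype=float), one_hot])


def choose_arm(context, history, n_arms, sample_reward, rng):
    """Thompson-style choice: sample one reward per arm, pick the largest."""
    samples = [sample_reward(history, encode_observation(context, a, n_arms), rng)
               for a in range(n_arms)]
    return int(np.argmax(samples))


def make_toy_sampler(n_arms):
    """Stand-in for an NP conditioned on `history` (the context set)."""
    def sample_reward(history, query, rng):
        # Past rewards of rows whose one-hot arm indicator matches the query.
        rewards = [r for x, r in history
                   if np.array_equal(x[-n_arms:], query[-n_arms:])]
        mean = np.mean(rewards) if rewards else 0.0
        std = 1.0 / np.sqrt(len(rewards) + 1)   # shrinks as an arm is tried more
        return rng.normal(mean, std)
    return sample_reward


rng = np.random.default_rng(0)
n_arms, history = 3, []
sampler = make_toy_sampler(n_arms)
for _ in range(20):
    context = rng.normal(size=2)                    # independent of past trials
    arm = choose_arm(context, history, n_arms, sampler, rng)
    reward = float(arm == 2) + 0.1 * rng.normal()   # only the chosen arm is observed
    history.append((encode_observation(context, arm, n_arms), reward))
print(len(history))
```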
3.3 Model-Based Reinforcement Learning
Allowing for dependence between subsequently provided contexts (referred to as states in the RL literature), we arrive at reinforcement learning (RL) (Sutton and Barto; 2018). An RL problem is defined by a (possibly stochastic) transition function (defining the transitions between states given an agent’s actions) and a reward function. Together, these functions are often referred to as an environment. We obtain the necessary distribution over functions for NP training by changing the properties of the environment, so that each task induces its own transition and reward functions. The objective of the RL algorithm for a fixed task is to maximise the expected discounted sum of rewards $\mathbb{E}_{\pi}\left[\sum_t \gamma^t r_t\right]$, for rewards $r_t$ obtained by acting in the environment of that task. Here $\pi$ is a parameterised policy, i.e. the agent’s decision making process, $\gamma$ is a discounting factor and $t$ is a time index in the current episode.
In this paper, we focus our attention on a particular set of techniques referred to as model-based algorithms. Model-based RL methods assume the existence of some approximation to the dynamics of the problem at hand (typically learned online). Examples of this technique can be found in the literature (e.g. Peng and Williams; 1993, Browne et al.; 2012). We apply Neural Processes to this problem by first meta-learning an environment model using some exploratory policy (e.g. temporally-correlated random exploration) on samples from the task distribution. This gives us an environment model capable of quickly adapting to the dynamics of problem instances within the task distribution.
Thereafter, we use the model in conjunction with any RL algorithm to learn a policy for a specific task. This can be done by autoregressively sampling rollouts from the NP (i.e. by acting according to the policy and sampling transitions using the NP’s approximation to the environment). These rollouts are then used to update the policy using the chosen RL algorithm (sometimes referred to as indirect RL). Optionally, we may also update the policy using the real environment rollouts (direct RL). Algorithm 2 shows the proposed approach in more detail.
Note that the linear complexity of an NP is particularly useful in this problem: as we allow for additional episodes on the real environment, the number of transitions that could be added to a context set grows quickly (as the product of the number of episodes and the number of steps per episode). In complex environments, this may quickly become prohibitive for other methods (e.g. Gaussian Process environment models (Deisenroth and Rasmussen; 2011)).
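The autoregressive rollouts described above can be sketched as follows: the policy acts on the current imagined state, the learned model samples the next state and reward, and the sampled state is fed back in. The one-dimensional `toy_model_step` and `toy_policy` are hypothetical placeholders for an NP environment model and a learned policy.

```python
import numpy as np


def imagined_rollout(model_step, policy, state, horizon, rng):
    """Sample a rollout from a learned model instead of the real environment."""
    states, actions, rewards = [state], [], []
    for _ in range(horizon):
        action = policy(state, rng)
        # The sampled next state becomes the input of the next step.
        state, reward = model_step(state, action, rng)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
    return states, actions, rewards


def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over the time steps of one episode."""
    return float(sum(gamma ** t * r for t, r in enumerate(rewards)))


def toy_model_step(state, action, rng):
    """Placeholder dynamics: noisy integrator, reward for staying near zero."""
    next_state = state + action + 0.01 * rng.normal()
    return next_state, -abs(next_state)


def toy_policy(state, rng):
    """Placeholder policy: push the state halfway back towards zero."""
    return -0.5 * state


states, actions, rewards = imagined_rollout(
    toy_model_step, toy_policy, state=1.0, horizon=10,
    rng=np.random.default_rng(0))
print(discounted_return(rewards, gamma=0.99))
```

In the indirect RL setting, batches of such imagined rollouts would be passed to the chosen RL algorithm to update the policy, while real transitions are only appended to the model's context set.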
4 Problem Setup
4.1 Adversarial Task-Search for RL Agents
As modern machine learning methods are approaching sufficient maturity to be applied in the real world, understanding failure cases of intelligent systems has become an important topic in our field, leading to the improvement of robustness and understanding of complex algorithms.
One class of approaches towards improving the robustness of such systems uses adversarial attacks. The objective of an attack is to find a perturbation of the input such that the predictions of the method being tested change dramatically in an unexpected fashion. Much of the rapidly growing body of work on adversarial examples (e.g. Szegedy et al.; 2013, Goodfellow et al.; 2014, Madry et al.; 2017) has studied the supervised learning case and shown concerning vulnerabilities. Recently, this analysis has been extended to Reinforcement Learning (e.g. Behzadan and Munir; 2017, Huang et al.; 2017, Uesato et al.; 2018).
Inspired by this line of work, we consider the recent study of Ruderman et al. (2018) concerning failures of pretrained RL agents. The authors show that supposedly superhuman agents trained on simple navigation problems in 3D mazes catastrophically fail when challenged with adversarially designed task instances that are trivially solvable by human players. The authors use an evolutionary search technique that modifies previous examples based on the agent’s episode reward. A crucial limitation of this technique is that it produces out-of-distribution examples, weakening the significance of the results.
Figure 2(a) shows an example of a procedurally generated maze considered by the authors. An agent starts with random orientation on a starting position (held fixed for this graphic only) indicated by a green star. A goal object randomly placed in one of the coloured squares is then to be found. The colour intensity indicates an estimate of a trained agent’s expected mean episode reward if the goal was to appear at the position. Black and white ink show occupied and empty spaces respectively. We can observe this to be a difficult optimisation problem, as many of the possible goal locations are close to the minimum.
Thus, we propose to tackle the worst-case search through a Bayesian Optimisation approach on a fixed set of possible candidate maps using an NP surrogate model. More formally, we study the adversarial task search problem on mazes as follows. For a given maze, agent and goal positions, and the obtained episode reward, we consider functions of the form:
(8) 
We can thus consider the following problems of interest:

(i) The worst-case position search for a fixed map:

(ii) The worst-case search over a set of maps, including positions:
We consider a fixed set of maps , such that for each map , there is only a finite set (of capacity ) of possible agent and goal positions. For a fixed number of iterations , the complexity of solving problem (ii) scales as , where corresponds to evaluation of the acquisition function on all possible inputs. However, assuming a solution to (i) has already been found, we can reduce the complexity to , for some small as follows: We use an additional map model , which for a given map directly predicts the minimum reward over all possible agent and goal positions, explaining the term . Then, given the map , we run our available solution to (i) for iterations to find agent and goal positions. This corresponds to the term . We refer to this model as position model.
4.2 Recommender Systems
Considering next the contextual multi-armed bandit problem discussed in Section 3.2, we apply NPs to recommender systems. Decisions made in this context are recommendations to a user. As we aim to learn more about a user’s preferences, we can think of certain recommendations as more exploratory when they are dissimilar to previously rated items. Indeed, the problem has previously been modelled as a contextual multi-armed bandit, e.g. by Li et al. (2010), using linear models.
The application of an NP to this problem is natural: we can think of each user as a function from items to ratings, where each user is possibly evaluated on a different subset of items. Thus most available datasets for recommender systems naturally fit the requirements for NP pretraining. Connecting this to the general formulation of a contextual bandit problem, we can think of each arm as a particular item recommendation and consider the user id and/or any additional information that may be provided as the context. Rewards are likely to be domain-specific, but may be as simple as the rating given by a user.
The sequential decision process in this case can be explicitly treated by finding a suitable acquisition function for recommender systems. While this choice most likely depends on the goals of a particular business, we provide a proof-of-concept analysis, explicitly maximising coverage over the input space to obtain the best possible function approximation. This is motivated by the root-mean-squared-error metric used in the literature on the experiments we will consider. Inspired by work on decision trees, a natural criterion for evaluation could be the information gain at a particular candidate item/arm. Considering the rewards of all arms in the target set except the candidate arm (and likewise their inputs), and suppressing dependence on latents and the context for clarity, we can thus define the information gain at the candidate arm:
(9)
Note that this involves using samples of the model’s predictive distribution at the candidate arm to estimate the entropy given an additional item of information. Assuming the predictive distribution at each arm is a univariate normal, we arrive at an intuitive equation to determine the expected optimal next arm for evaluation:
(10) 
where we made use of conditional independence, the analytic form of the entropy of a multivariate normal and the determinant of a diagonal matrix. We thus seek to recommend the next item such that the product of variances of all other items in the target set given the user’s expected response is minimised. At this point it is important to mention that the successful application of this idea depends on the quality of the uncertainty estimates used. This is a strength of the NP family.
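The selection rule stated above (pick the item for which the product of predictive variances of all other target items, after conditioning on that item, is smallest) can be illustrated with a small stand-in model. The sketch below swaps the NP for Bayesian linear regression over hypothetical item features, purely because its posterior variances are available in closed form. In this linear-Gaussian stand-in the variances do not depend on the value of the observed rating, so no sampling of the user's response is needed, unlike the NP case described in the text. All names and the feature matrix are assumptions.

```python
import numpy as np


def posterior_cov(features, observed, prior_var=1.0, noise_var=0.1):
    """Posterior weight covariance after observing ratings for `observed` items."""
    d = features.shape[1]
    precision = np.eye(d) / prior_var
    for i in observed:
        precision += np.outer(features[i], features[i]) / noise_var
    return np.linalg.inv(precision)


def log_variance_product(features, observed, candidate, targets, noise_var=0.1):
    """Log of the product of predictive variances of all other target items."""
    cov = posterior_cov(features, list(observed) + [candidate], noise_var=noise_var)
    others = [i for i in targets if i != candidate]
    x = features[others]
    variances = np.einsum("id,de,ie->i", x, cov, x) + noise_var
    return float(np.sum(np.log(variances)))   # sum of logs = log of product


def select_arm(features, observed, targets):
    """Return the target item that minimises the criterion, plus all scores."""
    scores = {a: log_variance_product(features, observed, a, targets)
              for a in targets}
    return min(scores, key=scores.get), scores


# Three near-identical items and one unrelated item (hypothetical features).
features = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
arm, scores = select_arm(features, observed=[], targets=[0, 1, 2, 3])
print(arm, scores)
```

Here rating any one of the three similar items reduces uncertainty about the other two, so the criterion prefers them over the unrelated item, which is the coverage-maximising behaviour the text describes.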
4.3 Model-Based RL
As a classic RL problem, we consider the control task “cartpole” (Barto et al.; 1983), fully defined by the physical parameters of the system. We obtain a distribution over tasks (i.e. state transition and reward functions) by uniformly sampling the pole mass and cart mass for each episode, allowing pretraining of an NP. For the exploration policy required during pretraining, we use the following random walk:
(11) 
where , are fixed for the entire episode and . Another option would be to use a pretrained expert to provide meaningful trajectory rollouts, especially for more complicated tasks. This, however, requires a solution to at least one task within the distribution.
As the RL algorithm of choice, we use on-policy SVG(1) (Heess et al.; 2015) without replay. We provide a comparison to the related model-free RL algorithm SVG(0) (Heess et al.; 2015) with Retrace off-policy correction (Munos et al.; 2016) as a competitive baseline. For the experiments considered, we found that it was not necessary to update the policy using real environment trajectories.
5 Results and Analysis
Figure 1: Search performance (scaled in [0,1]) as a function of the number of iterations. Bold lines show the mean performance over 4 unseen agents on a set of held-out maps. We also show 20% of the standard deviation.
5.1 Adversarial Task-Search
We consider a set of 1000 randomly chosen maps with 1620 agent and goal positions each, given a total agent population of 16 independently trained agents. We divide the maps into an 80% training and a 20% holdout set. In order to encourage population diversity (both in terms of behaviour and performance), we trained each agent on both the task of interest, which is explore_goal_locations_large, and four auxiliary tasks from DMLab-30 (Beattie et al.; 2016), using a total of four different RL algorithms across the population. 12 agents (3 of each type) are used during pretraining, while we reserve the remaining 4 agents (1 of each type) for evaluation. As baselines we consider methods that are learned online: Gaussian Processes with a linear and Matern 3/2 product kernel (Bonilla et al.; 2008), Bayes by Backprop (BBB) (Blundell et al.; 2015), Alpha-Divergence (AlphaDiv) (Hernández-Lobato et al.; 2016), Deep Kernel Learning (DKL) (Wilson et al.; 2016) and random search. In order to ensure a correct implementation, we use the thoroughly tested code provided by Riquelme et al. (2018). In order to account for pretraining, we reuse embeddings (see eq. 2) from the NP.
Addressing the question of performance of the position model first, we show results in Figure 1(a), indicating strong performance of our method. Indeed, we find agent and goal positions close to the minimum after evaluating only approximately 5% of the possible search space. Most iterations are spent on determining the global minimum among a relatively large number of points of similar magnitude. In practice, if a point close enough to the minimum is sufficient to determine the existence of an adversarial map, the search can be terminated much earlier.
In order to explain the sources of this improvement, we show an analysis of NP uncertainty in function space in Figure 2(b) for varying context sizes. The graphic should be understood as the equivalent of Figure 3 in (Garnelo et al.; 2018b) for the adversarial task search problem. More specifically, we plot functions of the form (8) drawn from a neural process by sampling the latent variable using varying context sizes.
As we would expect, uncertainty in function space decreases significantly as additional context points are introduced. Furthermore, we observe an interesting change in predictions once context points two and three are introduced (blue and orange lines). The mean prediction of the model increases noticeably, indicating that the agent being evaluated outperforms the mean agent encountered during pretraining.
Showing the advantage of pretraining our method, we illustrate episode reward predictions on maps given small context sets in Figure 2(a). Note that the model has learned to assign higher scores to points closer to the starting location, taking obstacles into account, without any explicitly defined distance metric being provided. Thus, predictions incorporate this valuable prior information and are fairly close to the ground truth after a single observation. Indeed, subsequently added points appear to be mainly adjusting the scale of the predicted reward and the relative ordering of low-reward points.
Finally, we test our model on the full search problem on holdout maps, using the proposed two-stage approach to reduce search complexity as outlined above. From Figure 1(b), we continue to observe superior performance for this significantly more difficult problem.
5.2 Recommender Systems
We apply NPs to the MovieLens 100k & 20m datasets (Harper and Konstan; 2016). While the specific format of the datasets varies slightly, in both cases we face the basic problem of recommending movies to a user, given side-information such as movie genre and tags (20m only) or certain user features (100k only) such as occupation, age and sex. Importantly, while discrete ratings warrant treatment as an ordinal regression problem, we leave this for future work to allow our results to be directly comparable to Chen et al. (2018).
Discussing first the smaller MovieLens 100k dataset, we closely follow the experimental setup suggested in (Chen et al.; 2018). Importantly, 20% of the users are explicitly withheld from the training dataset to test for few-shot adaptation. This is non-standard compared to the mainstream literature, which typically uses a fraction of ratings for known users as a test set. This is particularly interesting for NPs, recalling their application at test time without gradient steps (see discussion in Section 2). Provided this works to a satisfactory degree, this property may be particularly desirable for fast on-device recommendation on mobile devices. Finally, similar to the argument made in (Chen et al.; 2018), NPs can be trained in a federated learning setting, which may be an important advantage of our method when data privacy is a concern.
Table 1 (left): MovieLens 100k, using 20%/50%/80% of each test user’s ratings as context.
Model                                 20% of user data  50%     80%
SVD++ (Koren; 2008)                   1.0517            1.0217  1.0124
Baseline Neural Network               0.9831            0.9679  0.9507
MAML (Finn et al.; 2017)              0.9593            0.9441  0.9295
NP (random)                           0.9381            0.9148  0.9050
NP (info. gain)                       0.9370            0.8751  0.8060

Table 1 (right): MovieLens 20m.
Model                                 90%
BPMF (Salakhutdinov and Mnih; 2008)   0.8123
SVDFeature (Chen et al.; 2012)        0.7852
LLORMA (Lee et al.; 2013)             0.7843
ALS-WR (Zhou et al.; 2008)            0.7746
I-Autorec (Sedhain et al.; 2015)      0.7742
U-CFN (Strub et al.; 2016)            0.7856
I-CFN (Strub et al.; 2016)            0.7663
NP (random)                           0.7957
Results for random context sets of 20%/50%/80% of test users’ ratings (as suggested by the authors) are shown in Table 1 (left). While these results are encouraging, the treatment as a decision making process using our acquisition function (denoted info. gain) leads to much stronger improvements.
For completeness, we also provide results on the much larger MovieLens 20m dataset, for which more competitive baselines are available. Unfortunately, we are unable to show results using our acquisition function due to a lack of baselines, and thus only evaluate our method with a random context set. Nevertheless, Table 1 (right) shows results comparable to state-of-the-art recommendation systems, despite this limitation. We also discuss several suggestions for improvements in the conclusion.
5.3 Model-Based RL
Results of the cart-pole experiment are shown in Figure 3(a). We observe strong results, showing that a model-based RL algorithm with an NP model can successfully learn a task in about 10-15 episodes. We show an example video for a particular run at https://goo.gl/9yKav3. Testing our method on the full support of the task distribution, we show the mean episode reward in Figure 3(c) (compared to a random policy in blue). We observe that the same method generalises to all considered tasks. As expected, the reward decreases slightly for particularly heavy carts. We also provide a comparison of NP rollouts to the real environment rollouts in Figure 3(b).
6 Related Work
There has been a recent surge of interest in Meta-Learning or Learning to Learn, resulting in a large array of methods (e.g. Koch et al.; 2015, Andrychowicz et al.; 2016, Wang et al.; 2016, Reed et al.; 2017), many of which may be applied to the problems we study (as we merely assume the existence of a general method for regression). However, predictive uncertainty may not be available for the majority of these methods.
Several recent publications, however, focus on probabilistic ideas or re-interpretations of popular methods (e.g. Bauer et al.; 2017, Rusu et al.; 2018, Bachman et al.; 2018), and could thus be suitable for the problems we study. An example is Probabilistic MAML (Finn et al.; 2018), which forms an extension of the popular model-agnostic meta-learning (MAML) algorithm (Finn et al.; 2017) that can be learned with variational inference. Other recent works cast meta-learning as hierarchical Bayesian inference (e.g. Edwards and Storkey; 2016, Hewitt et al.; 2018, Grant et al.; 2018, Ravi and Beatson; 2019).
Gaussian Processes (GPs) are popular candidates due to closed-form Bayesian inference and have been used for several of the problems we study (e.g. Rasmussen; 2003, Krause and Ong; 2011, Deisenroth and Rasmussen; 2011). While GPs provide excellent uncertainty estimates, the scale of modern datasets can make their application difficult, thus often requiring approximations (e.g. Titsias; 2009). Furthermore, their performance strongly depends on the choice of the most suitable kernel (and thus prior over functions), which may in practice require careful design or compositional kernel search (e.g. Duvenaud et al.; 2013).
Deep Kernel Learning (Wilson et al.; 2016) provides an alternative, addressing scalability concerns while exploiting the structural properties of deep learning architectures. The network weights are learned by considering them part of the kernel hyperparameters.
Moreover, much of the recent work on Bayesian Neural Networks (e.g. Blundell et al.; 2015, Gal and Ghahramani; 2016, Hernández-Lobato et al.; 2016, Louizos and Welling; 2017) serves as a reasonable alternative, also benefiting from the flexibility and power of modern deep learning architectures.
Finally, the approach in (Chen et al.; 2017) tackles similar problems, applying meta-learning to black-box optimisation. It skips the uncertainty estimation and is trained to directly suggest the next point for evaluation.
7 Discussion
In this paper, we have demonstrated the use of Neural Processes to learn data-driven priors for a diverse set of decision problems, showing competitive results. At this point we would like to remind the reader that no aspect of the models used in the experiments has been tailored towards the problems we study. Indeed, we choose among the same hyperparameters considered by Kim et al. (2019).
Our experiments on adversarial task search indicate that such a system may for instance be used within an agent evaluation pipeline to test for exploits. Moving towards the more complex case of adversarial task discovery, one faces the problem of the dimensionality of the input space. A simple approach would be to train a generative latent variable model on the map layouts, performing gradient steps on the NP predictions with respect to the latent variables. However, building generative models with such hard constraints on the input space is itself a difficult problem. A second avenue of future work could utilise the presented method to train more robust agents, using the NP to suggest problems the agent is currently unable to solve (cf. curriculum learning).
The presented results for recommender systems in particular are encouraging, noting that many of the standard bells and whistles used for such systems are orthogonal to NPs and could thus be easily incorporated. Results may be further improved by also considering the item-specific function mapping from users to ratings. Ideally, both movie-specific and user-specific functions could be combined in a single network.
Our RL experiments showed significant improvements in terms of data efficiency when a large set of related tasks is available. In future work, it would be interesting to consider more complex problems, which may require a more sophisticated policy during pretraining.
References
 Andrychowicz et al. (2016) M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
 Bachman et al. (2018) P. Bachman, R. Islam, A. Sordoni, and Z. Ahmed. Vfunc: a deep generative model for functions. arXiv preprint arXiv:1807.04106, 2018.
 Barto et al. (1983) A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics, SMC-13(5):834–846, 1983.
 Bauer et al. (2017) M. Bauer, M. Rojas-Carulla, J. B. Świątkowski, B. Schölkopf, and R. E. Turner. Discriminative k-shot learning using probabilistic models. arXiv preprint arXiv:1706.00326, 2017.
 Beattie et al. (2016) C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, J. Schrittwieser, K. Anderson, S. York, M. Cant, A. Cain, A. Bolton, S. Gaffney, H. King, D. Hassabis, S. Legg, and S. Petersen. Deepmind lab. arXiv preprint arXiv:1612.03801, 2016. URL https://arxiv.org/abs/1612.03801.

 Behzadan and Munir (2017) V. Behzadan and A. Munir. Vulnerability of deep reinforcement learning to policy induction attacks. In International Conference on Machine Learning and Data Mining in Pattern Recognition, pages 262–275. Springer, 2017.
 Blundell et al. (2015) C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
 Bonilla et al. (2008) E. V. Bonilla, K. M. A. Chai, and C. K. Williams. Multi-task gaussian process prediction. In Advances in Neural Information Processing Systems, 2008.
 Browne et al. (2012) C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1–43, 2012.
 Cesa-Bianchi and Lugosi (2006) N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
 Chapelle and Li (2011) O. Chapelle and L. Li. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, pages 2249–2257, 2011.
 Chen et al. (2018) F. Chen, Z. Dong, Z. Li, and X. He. Federated meta-learning for recommendation. arXiv preprint arXiv:1802.07876, 2018.
 Chen et al. (2012) T. Chen, W. Zhang, Q. Lu, K. Chen, Z. Zheng, and Y. Yu. Svdfeature: a toolkit for feature-based collaborative filtering. Journal of Machine Learning Research, 13(Dec):3619–3622, 2012.
 Chen et al. (2017) Y. Chen, M. W. Hoffman, S. G. Colmenarejo, M. Denil, T. P. Lillicrap, M. Botvinick, and N. de Freitas. Learning to learn without gradient descent by gradient descent. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 748–756. JMLR.org, 2017.
 Deisenroth and Rasmussen (2011) M. Deisenroth and C. E. Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472, 2011.
 Duvenaud et al. (2013) D. Duvenaud, J. R. Lloyd, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani. Structure discovery in nonparametric regression through compositional kernel search. arXiv preprint arXiv:1302.4922, 2013.
 Edwards and Storkey (2016) H. Edwards and A. Storkey. Towards a neural statistician. arXiv preprint arXiv:1606.02185, 2016.
 Espeholt et al. (2018) L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv:1802.01561, 2018. URL https://arxiv.org/abs/1802.01561.
 Finn et al. (2017) C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
 Finn et al. (2018) C. Finn, K. Xu, and S. Levine. Probabilistic model-agnostic meta-learning. arXiv preprint arXiv:1806.02817, 2018.
 Gal and Ghahramani (2016) Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.
 Garnelo et al. (2018a) M. Garnelo, D. Rosenbaum, C. Maddison, T. Ramalho, D. Saxton, M. Shanahan, Y. W. Teh, D. Rezende, and S. A. Eslami. Conditional neural processes. In ICML, 2018a.
 Garnelo et al. (2018b) M. Garnelo, J. Schwarz, D. Rosenbaum, F. Viola, D. J. Rezende, S. Eslami, and Y. W. Teh. Neural processes. In ICML Workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018b.
 Goodfellow et al. (2014) I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
 Grant et al. (2018) E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths. Recasting gradient-based meta-learning as hierarchical bayes. arXiv preprint arXiv:1801.08930, 2018.
 Harper and Konstan (2016) F. M. Harper and J. A. Konstan. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis), 5(4):19, 2016.
 Heess et al. (2015) N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pages 2944–2952, 2015.
 Hernández-Lobato et al. (2016) J. M. Hernández-Lobato, Y. Li, M. Rowland, D. Hernández-Lobato, T. Bui, and R. Turner. Black-box α-divergence minimization. In International Conference on Machine Learning, 2016.
 Hessel et al. (2018) M. Hessel, H. Soyer, L. Espeholt, W. Czarnecki, S. Schmitt, and H. v. Hasselt. Multi-task deep reinforcement learning with popart. arXiv:1809.04474, 2018. URL https://arxiv.org/abs/1809.04474.
 Hewitt et al. (2018) L. B. Hewitt, M. I. Nye, A. Gane, T. Jaakkola, and J. B. Tenenbaum. The variational homoencoder: Learning to learn high capacity generative models from few examples. arXiv preprint arXiv:1807.08919, 2018.
 Huang et al. (2017) S. Huang, N. Papernot, I. Goodfellow, Y. Duan, and P. Abbeel. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.
 Kapturowski et al. (2019) S. Kapturowski, G. Ostrovski, W. Dabney, J. Quan, and R. Munos. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=r1lyTjAqYX.
 Kim et al. (2019) H. Kim, A. Mnih, J. Schwarz, M. Garnelo, A. Eslami, D. Rosenbaum, O. Vinyals, and Y. W. Teh. Attentive neural processes. In International Conference on Learning Representations, 2019.
 Koch et al. (2015) G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
 Koren (2008) Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 426–434. ACM, 2008.
 Krause and Ong (2011) A. Krause and C. S. Ong. Contextual gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447–2455, 2011.
 Le et al. (2018) T. A. Le, H. Kim, M. Garnelo, D. Rosenbaum, J. Schwarz, and Y. W. Teh. Empirical evaluation of neural process objectives. In NeurIPS workshop on Bayesian Deep Learning, 2018.
 Lee et al. (2013) J. Lee, S. Kim, G. Lebanon, and Y. Singer. Local low-rank matrix approximation. In International Conference on Machine Learning, pages 82–90, 2013.
 Li et al. (2010) L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010.
 Louizos and Welling (2017) C. Louizos and M. Welling. Multiplicative normalizing flows for variational bayesian neural networks. arXiv preprint arXiv:1703.01961, 2017.
 Madry et al. (2017) A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
 Moćkus et al. (1978) J. Moćkus, V. Tiesis, and A. Źilinskas. The application of bayesian methods for seeking the extremum. vol. 2, 1978.
 Munos et al. (2016) R. Munos, T. Stepleton, A. Harutyunyan, and M. G. Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 1046–1054, 2016.
 Nagabandi et al. (2018) A. Nagabandi, C. Finn, and S. Levine. Deep online learning via meta-learning: Continual adaptation for model-based rl. arXiv preprint arXiv:1812.07671, 2018.
 Peng and Williams (1993) J. Peng and R. J. Williams. Efficient learning and planning within the dyna framework. Adaptive Behavior, 1(4):437–454, 1993.
 Rasmussen (2003) C. E. Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2003.
 Ravi and Beatson (2019) S. Ravi and A. Beatson. Amortized bayesian meta-learning. In International Conference on Learning Representations, 2019.
 Reed et al. (2017) S. Reed, Y. Chen, T. Paine, A. v. d. Oord, S. Eslami, D. Rezende, O. Vinyals, and N. de Freitas. Few-shot autoregressive density estimation: Towards learning to learn distributions. arXiv preprint arXiv:1710.10304, 2017.
 Riquelme et al. (2018) C. Riquelme, G. Tucker, and J. Snoek. Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=SyYe6kCW.
 Ruderman et al. (2018) A. Ruderman, R. Everett, B. Sikder, H. Soyer, J. Uesato, A. Kumar, C. Beattie, and P. Kohli. Uncovering surprising behaviors in reinforcement learning via worst-case analysis. 2018. URL https://openreview.net/forum?id=SkgZNnR5tX.
 Rusu et al. (2018) A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960, 2018.

 Salakhutdinov and Mnih (2008) R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using markov chain monte carlo. In Proceedings of the 25th international conference on Machine learning, pages 880–887. ACM, 2008.
 Schonlau et al. (1998) M. Schonlau, W. J. Welch, and D. R. Jones. Global versus local search in constrained optimization of computer models. Lecture Notes-Monograph Series, pages 11–25, 1998.
 Schulman et al. (2015) J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

 Sedhain et al. (2015) S. Sedhain, A. K. Menon, S. Sanner, and L. Xie. Autorec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web, pages 111–112. ACM, 2015.
 Shahriari et al. (2016) B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
 Silver et al. (2016) D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
 Snoek et al. (2012) J. Snoek, H. Larochelle, and R. P. Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
 Srinivas et al. (2009) N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
 Strub et al. (2016) F. Strub, R. Gaudel, and J. Mary. Hybrid recommender system based on autoencoders. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pages 11–16. ACM, 2016.
 Sutton and Barto (2018) R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
 Szegedy et al. (2013) C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
 Thompson (1933) W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
 Titsias (2009) M. Titsias. Variational learning of inducing variables in sparse gaussian processes. In Artificial Intelligence and Statistics, pages 567–574, 2009.
 Uesato et al. (2018) J. Uesato, B. O’Donoghue, A. v. d. Oord, and P. Kohli. Adversarial risk and the dangers of evaluating against weak attacks. In Proceedings of the 35th International Conference on machine learning (ICML), 2018.
 Wang et al. (2016) J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
 Wayne et al. (2018) G. Wayne, C. Hung, D. Amos, M. Mirza, A. Ahuja, A. Grabska-Barwinska, J. W. Rae, P. Mirowski, J. Z. Leibo, A. Santoro, M. Gemici, M. Reynolds, T. Harley, J. Abramson, S. Mohamed, D. J. Rezende, D. Saxton, A. Cain, C. Hillier, D. Silver, K. Kavukcuoglu, M. Botvinick, D. Hassabis, and T. P. Lillicrap. Unsupervised predictive memory in a goal-directed agent. CoRR, abs/1803.10760, 2018. URL http://arxiv.org/abs/1803.10760.
 Wilson et al. (2016) A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378, 2016.
 Zhou et al. (2008) Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the netflix prize. In International conference on algorithmic applications in management, pages 337–348. Springer, 2008.
Appendix A Acknowledgements
We would like to thank Yutian Chen, Avraham Ruderman, Nicolas Heess, Arthur Guez and Raia Hadsell for insightful discussions and feedback on an earlier version of this manuscript.
Appendix B Adversarial Task Search
We consider four types of agents: IMPALA [Espeholt et al., 2018], PopArt-IMPALA [Hessel et al., 2018], MERLIN [Wayne et al., 2018] and R2D2 [Kapturowski et al., 2019]. Each agent is trained on the explore_goal_locations_large level included in the DMLab-30 task suite [Espeholt et al., 2018], in addition to four auxiliary levels chosen at random. The hyperparameters for the NP models are specified in Table 2. For the baselines, we take the default parameters from [Riquelme et al., 2018]. More specific parameters are given in Table 3. DKL and GP use the same kernel. For random search in the full case, we make 2 random position choices for each map.
Model type  Context size  (Latent) Encoder  Attention  
Position Model  200  128  Squared Exponential (scale = 0.5)  
Map Model  100  64  No attention  
Model type  Pos iterations ()  Decoder  Batch size  Fixed or Maximum variance 
Position Model  1620  32  Maximum 30  
Full Model  5  32  Fixed 0.1  
Method  Encoder  Learning rate  Training Frequency  Training epochs 
Random Search All.  
BBB All  3  0.01  5  1000 
Div All  3  0.01  5  1000 
DKL Pos.  0.005  5  1000  
GP Pos.  0.001  5  1000  
DKL Full.  0.001  5  200  
GP Full.  0.1  5  200  
Method  Decoder variance  Initial variance  Prior variance  
BBB All  
Div All 
Appendix C Recommender Systems
We now discuss details of the experiment on recommender systems. In addition to the common choices for Neural Processes, we embed user (20m only) and movie ids (both) in a learnable matrix. The size of each embedding vector is shown as and respectively. Note that for each provided rating, we are also given a timestamp, which we normalise by the empirical mean and standard deviation. As these statistics are unknown at test time, we use estimates from the training set as the best approximation. Choices for architecture and hyperparameters are provided in Table 4. Note that for both datasets, only training and test set size are specified in Chen et al. [2018] and Strub et al. [2016]. We introduce an additional validation set and train on the union of train and validation sets before running a final evaluation. In both cases, we are given a minimum of 20 ratings per user.
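The timestamp normalisation described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function names and the toy timestamps are hypothetical, and the population standard deviation is an assumed choice. The point it shows is that the mean and standard deviation are estimated on the training set only and then reused unchanged at test time.

```python
from statistics import mean, pstdev


def fit_normaliser(train_timestamps):
    # Estimate the statistics on the training set only.
    mu = mean(train_timestamps)
    sigma = pstdev(train_timestamps)
    return mu, sigma


def normalise(timestamps, mu, sigma):
    # Apply the training-set statistics to any split (train, validation or test).
    return [(t - mu) / sigma for t in timestamps]


# Toy data (hypothetical values, for illustration only).
train_ts = [100.0, 200.0, 300.0, 400.0]
test_ts = [250.0, 500.0]

mu, sigma = fit_normaliser(train_ts)
train_norm = normalise(train_ts, mu, sigma)
test_norm = normalise(test_ts, mu, sigma)
```

Note that the normalised test timestamps need not have zero mean or unit variance, since the statistics come from the training set.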
Dataset  (Latent) Encoder  Attention  
100k  /8  Multihead (Identity)  
20m  128/128  
Dataset  ?  Decoder  Learning rate  Batch size 
100k  32  
20m  32  
Indicates whether contexts were included in the target set during training. Note that the architecture for encoder and latent encoder (if a latent variable is used) is identical except for a final linear transform (in parentheses). Missing entries for indicate that no latent variable was used. Likewise, missing entries for Attention indicate that no attention was used. denotes concatenation.
c.1 MOVIELENS100k
This dataset consists of 100k ratings from 943 users on 1682 movies. For our model, we use the following user features: age, sex and occupation. In addition, we provide information about the movie genres as a k-hot vector. As discussed in the main text, the dataset split is non-standard: as opposed to withholding a fraction of ratings for known users, we reserve 70% of users (including all their ratings) as training users, 10% as validation users and all remaining users for the test set. These splits are chosen at random.
In order to allow for reproducibility in future work, we report the test set user ids used in our experiments here: 5, 7, 10, 12, 15, 16, 29, 41, 43, 47, 49, 55, 61, 63, 64, 67, 73, 74, 77, 78, 79, 85, 87, 93, 120, 123, 130, 139, 148, 150, 158, 163, 176, 184, 185, 189, 191, 194, 195, 210, 212, 215, 217, 223, 226, 227, 228, 232, 234, 243, 245, 249, 253, 257, 258, 259, 264, 266, 267, 269, 270, 271, 277, 284, 291, 293, 298, 303, 305, 319, 321, 324, 329, 330, 339, 341, 344, 346, 350, 354, 363, 365, 369, 372, 381, 386, 387, 389, 391, 400, 403, 410, 412, 414, 423, 427, 434, 435, 439, 441, 443, 457, 461, 467, 469, 481, 490, 495, 498, 507, 508, 511, 516, 517, 524, 530, 544, 561, 563, 580, 591, 597, 607, 624, 629, 633, 635, 638, 641, 644, 662, 668, 678, 680, 692, 703, 711, 723, 730, 731, 740, 743, 751, 758, 764, 768, 770, 783, 785, 788, 789, 790, 793, 794, 798, 800, 801, 802, 804, 810, 812, 813, 814, 821, 827, 834, 836, 841, 843, 850, 851, 853, 860, 861, 868, 873, 887, 889, 893, 896, 901, 902, 903, 905, 906, 916, 921, 922, 923.
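The user-level split described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the seed and the rounding of split sizes are hypothetical, so the sketch will not reproduce the test user ids listed above. It shows that whole users, rather than individual ratings, are assigned to the training, validation and test sets.

```python
import random


def split_users(user_ids, train_frac=0.7, valid_frac=0.1, seed=0):
    # Shuffle the users with a fixed (hypothetical) seed, then cut into three groups.
    ids = sorted(user_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_train = int(train_frac * len(ids))
    n_valid = int(valid_frac * len(ids))
    train = ids[:n_train]
    valid = ids[n_train:n_train + n_valid]
    test = ids[n_train + n_valid:]  # all remaining users
    return train, valid, test


# MovieLens-100k user ids run from 1 to 943.
train_users, valid_users, test_users = split_users(range(1, 944))
```

All ratings of a user then follow that user into the corresponding set, so test users are entirely unseen during training.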
For the results shown using the information gain acquisition function, we used random draws from the conditional prior to estimate .
c.2 MOVIELENS20m
The 20m version of the MovieLens dataset consists of 20000263 ratings from 138493 users on 27278 movies. Unlike the 100k version, it provides no user information. In addition to movie genres, we also use tags provided by users for all movies considered, to which we apply dimensionality reduction to obtain feature vectors of dimension 50. Details are given in Strub et al. [2016]. Note that for this dataset, we found CNPs [Garnelo et al., 2018a] to work slightly better.
Appendix D Model-Based RL
All results are reported over 10 random seeds for both the baseline method and the NP-based model-based RL method. We report the best hyperparameters in Table 5.
Method  Training steps/episode  Actor network  Rollout length  Batch size  Learning rate 
Model free  500  10  512  
Model based  500  100  512  
Method  Entropy bonus  Critic Network  Min sigma  Max sigma  Activation 
Model free  0.01  3 1  0.01  No  ELU 
Model based  0.001  2 1  0.01  0.6  ReLU 