Meta-Learning surrogate models for sequential decision making

by Alexandre Galashov et al.

Meta-learning methods leverage past experience to learn data-driven inductive biases from related problems, increasing learning efficiency on new tasks. This ability renders them particularly suitable for sequential decision making with limited experience. Within this problem family, we argue for the use of such methods in the study of model-based approaches to Bayesian Optimisation, contextual bandits and Reinforcement Learning. We approach the problem by learning distributions over functions using Neural Processes (NPs), a recently introduced probabilistic meta-learning method. This allows the treatment of model uncertainty to tackle the exploration/exploitation dilemma. We show that NPs are suitable for sequential decision making on a diverse set of domains, including adversarial task search, recommender systems and model-based reinforcement learning.



1 Introduction

Sequential decision making encompasses a broad range of problems, with many decades of research targeting problems such as Bayesian optimisation, multi-armed contextual bandits and reinforcement learning. Recent years have brought great advances, particularly in RL (e.g. Schulman et al., 2015; Silver et al., 2016), allowing the successful application of RL agents to increasingly complex domains. Yet most modern machine learning algorithms require orders of magnitude more experience than humans to solve even relatively simple problems. In this paper we argue that such systems ought to leverage past experience acquired by tackling related problem instances, allowing fast progress on a target task in the limited-data setting.

Consider the task of designing a motor controller for an array of robot arms in a large factory. The robots vary in age, size and proportions. The objective of the controller is to send motor commands to the robot arms such that each one achieves its designated task. The majority of current methods would tackle the control of each arm as a separate problem, despite the similarity between the arms and the tasks they may be assigned. Instead, we argue that the availability of data on related problems ought to be harnessed, and we discuss how learning data-driven priors allows a controller for additional robot arms to be learned in a fraction of the time.

A second question that immediately arises in the design of such a controller is how to deal with uncertainty: uncertainty about the proportions of each robot, about the physics of their motor movements, and about the state of the environment. We thus argue that a suitable method should allow reasoning about predictive uncertainty, rapidly adjusting its estimates as more data becomes available.

Much recent research has studied the problem of decision making under such uncertainty. Frameworks such as Bayesian Optimisation (e.g. Mockus et al., 1978; Schonlau et al., 1998) and Contextual Bandits (e.g. Cesa-Bianchi and Lugosi, 2006) offer ways of efficiently balancing exploration and exploitation, simultaneously resolving uncertainties while making progress towards each robot arm's objective. This is often achieved by specifying a prior belief model over the dynamics of the system in advance (e.g. a Gaussian Process), which is then used to balance exploration and exploitation. Importantly, note that significant domain knowledge may be required to make an appropriate choice of prior.

In this paper, we propose a strategy for combining the efficiency of traditional techniques for sequential decision making with the flexibility of learning-based approaches. We employ the framework of Neural Processes (NPs) (Garnelo et al., 2018b) to instead learn suitable priors in a data-driven manner, and utilise them within the inner loop of established sequential decision making algorithms. For instance, in the factory, a Neural Process learns a prior over the dynamics of robot arms and rapidly adapts this prior at run-time to each individual arm; the adapted model is then used by a Bayesian Optimisation algorithm to control the robots. Thus, the Neural Process falls within the 'Learning to Learn' or 'Meta-Learning' paradigm, and allows knowledge to be shared across the robot arms.

The contributions of the paper are:

  • We argue for and demonstrate the advantage of data-driven priors obtained through Meta-Learning in the sequential decision making paradigm.

  • We introduce Neural Processes as a single model family that can be used to tackle a diverse set of such problems.

  • We present a range of novel problem instances that can be approached through meta-learning, showing new approaches to recommender systems and adversarial task search.

  • We empirically demonstrate the advantage of learning NPs for decision making, obtaining strong results on the experiments considered.

The remainder of the paper is structured as follows. In Section 2 we introduce Neural Processes (NPs). In Section 3 we show how NPs can be used to tackle general problems in (a) Bayesian optimisation (BO), (b) contextual multi-armed bandits (CMBs) and (c) model-based reinforcement learning (MBRL). In Section 4 we describe the three disjoint problem instances that we consider in this paper: (a) we use BO and NPs to identify failure cases in pretrained RL agents, (b) we use CMBs and NPs to train recommender systems, and (c) we use MBRL with NPs to learn solutions to continuous control problems from only a small amount of experience. We report the results of these experiments in Section 5, present related work in Section 6 and conclude with a discussion.

2 Neural Processes

Neural processes (NPs) are a family of models for few-shot learning (Garnelo et al., 2018b). Given a number of realisations from some unknown stochastic process $f$, NPs can be used to predict the values of $f$ at new, unobserved locations. In contrast to standard approaches to supervised learning such as linear regression or standard neural networks, NPs model a distribution over functions that agree with the observations provided so far (similar to e.g. Gaussian Processes (Rasmussen, 2003)). This is reflected in how NPs are trained: we require a dataset of evaluations of similar functions over the same input and output spaces. Note, however, that we do not assume each function to be evaluated at the same inputs. Examples of such datasets are the temperature profile over a day in different cities around the world, or evaluations of functions generated from a Gaussian process with a fixed kernel. We provide further examples throughout the paper.

In order to allow NPs to learn distributions over functions, we split the evaluations of each function into two disjoint subsets: a context set $C = \{(x_i, y_i)\}_{i=1}^n$ and a target set $T$ that contains the unobserved points. These data points are then processed by a neural network as follows:


First, we use an encoder $h$, transforming all pairs $(x_i, y_i)$ in the context set to obtain representations $r_i = h(x_i, y_i)$. We then aggregate all $r_i$ into a single representation $r$ using a permutation-invariant operator (such as addition) that captures the information about the underlying function provided by the context points. Next, we parameterise a distribution over a latent variable $z$, here assumed to be Normal, with mean and variance estimated by an encoder network from $r$. Note that this latent variable is introduced to model uncertainty in function space, extending the Conditional Neural Process (Garnelo et al., 2018a).

Thereafter, a decoder $g$ is used to obtain predictive distributions $p(y_t \mid x_t, z)$ at target positions $x_t$, with the exact parameterisation depending on the data modelled. In practice, we might decide to share parameters between the encoder networks. To reduce notational clutter, we suppress dependencies on network parameters from now on.
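The encode-aggregate-decode pipeline above can be sketched in a few lines of numpy. This is a minimal illustration with random (untrained) weights and illustrative layer sizes; the mean is used here as the permutation-invariant aggregator, and all names (`np_forward`, `init_mlp`) are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(d_in, d_hid, d_out, rng):
    """Random (untrained) weights for a two-layer tanh MLP."""
    return (rng.normal(0, 0.1, (d_in, d_hid)), np.zeros(d_hid),
            rng.normal(0, 0.1, (d_hid, d_out)), np.zeros(d_out))

def mlp(params, x):
    w1, b1, w2, b2 = params
    return np.tanh(x @ w1 + b1) @ w2 + b2

d_x, d_y, d_r, d_z = 1, 1, 8, 4
enc = init_mlp(d_x + d_y, 32, d_r, rng)      # per-point encoder h
lat = init_mlp(d_r, 32, 2 * d_z, rng)        # latent encoder: r -> (mu_z, sigma_z)
dec = init_mlp(d_x + d_z, 32, 2 * d_y, rng)  # decoder g: (x_t, z) -> (mu_y, sigma_y)

def np_forward(x_c, y_c, x_t, rng):
    r_i = mlp(enc, np.concatenate([x_c, y_c], axis=-1))   # representations r_i
    r = r_i.mean(axis=0)                  # permutation-invariant aggregation
    stats = mlp(lat, r)
    mu_z = stats[:d_z]
    sigma_z = np.logaddexp(0.0, stats[d_z:])              # softplus keeps it positive
    z = mu_z + sigma_z * rng.normal(size=d_z)             # one sample of the latent
    inp = np.concatenate([x_t, np.tile(z, (len(x_t), 1))], axis=-1)
    out = mlp(dec, inp)
    return out[:, :d_y], np.logaddexp(0.0, out[:, d_y:])  # predictive mean and std
```

Repeated calls with different latent samples yield different function draws agreeing with the context, which is exactly the behaviour exploited by Thompson sampling later in the paper.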

As the resulting objective is intractable, approximate inference techniques such as variational inference are used for learning, leading to the following evidence lower bound:

$$\log p(y_T \mid x_T, x_C, y_C) \;\geq\; \mathbb{E}_{q(z \mid x_T, y_T)}\Big[ \sum_{t \in T} \log p(y_t \mid z, x_t) + \log \frac{q(z \mid x_C, y_C)}{q(z \mid x_T, y_T)} \Big],$$

which is optimised with mini-batch stochastic gradient descent, using a different function for each element in the batch and sampling $z$ at each iteration.
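The two terms of the training objective (a Gaussian reconstruction term and a KL regulariser between the latent posterior given context plus targets and the one given context only) can be sketched as follows. This is a Monte-Carlo sketch under the assumption of diagonal Gaussian posteriors; the function names are ours.

```python
import numpy as np

def gaussian_log_lik(y, mu, sigma):
    """Sum of independent Gaussian log-densities log p(y_t | z, x_t)."""
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                        - 0.5 * ((y - mu) / sigma) ** 2))

def kl_diag_gauss(mu_q, sig_q, mu_p, sig_p):
    """KL(q || p) between diagonal Gaussians -- the regulariser obtained from
    the log-ratio in the bound."""
    return float(np.sum(np.log(sig_p / sig_q)
                        + (sig_q**2 + (mu_q - mu_p) ** 2) / (2 * sig_p**2)
                        - 0.5))

def np_elbo(y_t, mu_y, sigma_y, q_full, q_ctx):
    """One-sample NP objective: reconstruction of the targets under a sampled z,
    minus KL between the two latent posteriors. `q_*` are (mu, sigma) tuples."""
    return gaussian_log_lik(y_t, mu_y, sigma_y) - kl_diag_gauss(*q_full, *q_ctx)
```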

There are two interesting properties following from these equations: (i) the complexity of a Neural Process is linear in the number of context and target points, allowing its evaluation on large context and target sets; (ii) no gradient steps are required at test time (in contrast to many other popular meta-learning techniques, e.g. (Finn et al., 2017)). We will discuss examples where these properties are desirable when turning to specific problems later in the text.

Recently, attention has been successfully applied to NPs (Kim et al., 2019), improving predictions at observed points. For various alternatives to the loss in (5), we refer the interested reader to (Le et al., 2018).

3 Sequential Decision Making With Neural Process Surrogate Models

We now discuss how to apply NPs to various instances of the sequential decision making problem. In all cases, we will make choices under uncertainty to optimise some notion of utility. A popular strategy for such problems involves fitting a surrogate model to observed data at hand, making predictions about the problem and constantly refining it when more information becomes available. Resulting predictions in unobserved areas of the input-space allow for planning and more principled techniques for exploration, making them useful when data efficiency is a particular concern.

3.1 Bayesian Optimisation

We first consider the problem of optimising black-box functions without gradient information. A popular approach is Bayesian Optimisation (BO) (e.g. Shahriari et al., 2016), where we wish to find the minimiser of a (possibly noisy) function $f$ on some space $\mathcal{X}$ without requiring access to its derivatives. The BO approach consists of fitting a probabilistic surrogate model that approximates the function based on the small set of evaluations observed thus far. Examples of surrogates are Gaussian Processes, tree-structured Parzen (density) estimators, Bayesian Neural Networks or, for the purpose of this paper, NPs. The decision involved in the process is the choice of some $x \in \mathcal{X}$ at which to next evaluate the function $f$. This evaluation is typically assumed to be costly, e.g. when it involves the optimisation of a machine learning algorithm (Snoek et al., 2012). Thus, in addition to providing a good function approximation from limited data, we require a suitable model to provide good uncertainty estimates, which help address the inherent exploration/exploitation trade-off during decision making.

In addition, we require an acquisition function $\alpha$ to guide decision making, designed such that its maximiser is considered as the next point for evaluation. Model uncertainty is typically incorporated into $\alpha$, as is done in popular choices such as expected improvement (Mockus et al., 1978) or the UCB algorithm (Srinivas et al., 2009). Given our surrogate model of choice for this paper, the Thompson sampling (Thompson, 1933; Chapelle and Li, 2011) criterion is particularly convenient (although others could be used). That is, $x$ is chosen for evaluation with probability

$$p(x) = \int \mathbb{1}\Big[ x = \arg\min_{x' \in \mathcal{X}} g(x', z) \Big]\, q(z \mid C)\, dz,$$

where $g$ is the NP decoder (4). This is usually approximated by drawing a single $z \sim q(z \mid C)$ and choosing the minimum of the resulting function sample as the next evaluation. We show this procedure in Algorithm 1.

  Input:
   - Function $f$ to evaluate
   - Initial randomly drawn context set $C$
   - Maximum number of function evaluations $N$
   - Neural process pretrained on evaluations of similar functions
  for n = 1, …, N do
     Infer conditional prior $q(z \mid C)$
     Thompson sampling: Draw $z \sim q(z \mid C)$, find $x_n = \arg\min_x g(x, z)$
     Evaluate $y_n = f(x_n)$ and add $(x_n, y_n)$ to the context set $C$
  end for
Algorithm 1 Bayesian Optimisation with NPs and Thompson sampling.
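The loop above can be sketched generically. Here `sample_posterior(context, X)` stands in for one NP draw (sample a latent given the context, decode at all candidates); the toy surrogate below, which returns the true function plus shrinking noise, is purely illustrative and not the paper's model.

```python
import numpy as np

def thompson_bo(f, candidates, sample_posterior, n_iters):
    """Sketch of Algorithm 1 with Thompson sampling over a candidate grid."""
    context = []                                      # the growing context set C
    for _ in range(n_iters):
        f_sample = sample_posterior(context, candidates)  # one function draw
        x = candidates[int(np.argmin(f_sample))]          # minimiser of the draw
        context.append((x, f(x)))                         # evaluate, extend C
    return min(context, key=lambda p: p[1])               # best point found

rng = np.random.default_rng(0)
f = lambda x: (x - 0.3) ** 2                  # toy target with minimum at 0.3
xs = np.linspace(0.0, 1.0, 101)
# Stand-in surrogate: noise shrinks as the context grows, mimicking how a
# trained NP's posterior contracts with more observations.
surrogate = lambda ctx, X: f(X) + rng.normal(0, 1.0 / (1 + len(ctx)), size=len(X))
x_best, y_best = thompson_bo(f, xs, surrogate, n_iters=30)
```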

3.2 Contextual Multi-Armed Bandits

Closely related to Bayesian Optimisation, the decision problem known as a contextual multi-armed bandit is formulated as follows. At each trial $t$:

  1. Some informative context $c_t$ is revealed. This could, for instance, be features describing a user of an online content provider (e.g. Li et al., 2010). Crucially, we assume $c_t$ to be independent of past trials.

  2. Next, we are to choose one of $K$ arms and receive some reward $r_t$ given our choice $a_t$ at the current iteration. The current context $c_t$, past actions and rewards are available to guide this choice. As the reward function is unknown, we face the same exploration/exploitation trade-off as in the BO case.

  3. The arm-selection strategy is updated given access to the newly acquired triplet $(c_t, a_t, r_t)$. Importantly, no reward is provided for any of the unchosen arms.

Neural Processes can be applied to this problem relatively straightforwardly. Decision making proceeds as in Algorithm 1 (Thompson sampling being a reasonable choice), with the main difference being that we evaluate the NP separately for each arm. Past data is easily incorporated in the context set by providing a one-hot vector indicating which arm has been chosen, along with the observed context and reward.

However, in practice it might be significantly more difficult to find or design related bandit problems that are available for pretraining. We will discuss a natural real-world contextual bandit application suitable for NP pretraining in the experimental section.
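The per-arm evaluation and context encoding just described can be sketched as follows. `predict_reward` stands in for one sample from the surrogate's predictive distribution (for an NP: sample a latent given the context, decode the arm's feature vector); all function names here are illustrative.

```python
import numpy as np

def one_hot(k, K):
    v = np.zeros(K)
    v[k] = 1.0
    return v

def choose_arm(context, predict_reward, c_t, K, rng):
    """Thompson-style choice: one surrogate sample per arm on the feature
    [c_t, one_hot(arm)], then pick the arm with the best sampled reward."""
    samples = [predict_reward(context, np.concatenate([c_t, one_hot(a, K)]), rng)
               for a in range(K)]
    return int(np.argmax(samples))

def update_context(context, c_t, arm, reward, K):
    """Past data enters the context set as ([c_t, one_hot(arm)], reward)."""
    context.append((np.concatenate([c_t, one_hot(arm, K)]), reward))
    return context
```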

3.3 Model-Based Reinforcement Learning

Allowing for dependence between subsequently provided contexts (referred to as states in the RL literature), we arrive at reinforcement learning (RL) (Sutton and Barto, 2018). An RL problem is defined by a (possibly stochastic) transition function (defining the transitions between states given an agent's actions) and a reward function. Together, these functions are often referred to as an environment. We obtain the necessary distribution over functions for NP training by changing the properties of the environment, writing $p(\mathcal{T})$ to denote the resulting distribution over tasks $\mathcal{T}$. The objective of the RL algorithm for a fixed task is to maximise the expected discounted return $\mathbb{E}\big[\sum_t \gamma^t r_t\big]$, for rewards $r_t$ obtained by acting on the environment with a policy $\pi_\theta$ with parameters $\theta$, i.e. the agent's decision making process. Here $\gamma$ is a discounting factor and $t$ indicates a time index in the current episode.

In this paper, we focus our attention on a particular set of techniques referred to as model-based algorithms. Model-based RL methods assume the existence of some approximation to the dynamics of the problem at hand (typically learned online); see e.g. (Peng and Williams, 1993; Browne et al., 2012). We apply Neural Processes to this problem by first meta-learning an environment model using some exploratory policy (e.g. temporally-correlated random exploration) on samples from the task distribution $p(\mathcal{T})$. This gives us an environment model capable of quickly adapting to the dynamics of problem instances within the task distribution.

Thereafter, we use the model in conjunction with any RL algorithm to learn a policy for a specific task. This can be done by autoregressively sampling rollouts from the NP (i.e. by acting according to the policy and sampling transitions using the NP's approximation to the environment). These rollouts are then used to update the policy using the chosen RL algorithm (sometimes referred to as indirect RL). Optionally, we may also update the policy using the real environment rollouts (direct RL). Algorithm 2 shows the proposed approach in more detail.

  Training Input:
   - Neural process with parameters $\theta$ estimating the transition and reward functions
   - Task distribution $p(\mathcal{T})$
   - Exploratory policy $\pi_e$.
  while Pretraining not finished do
     Sample a task $\mathcal{T} \sim p(\mathcal{T})$.
     Obtain transitions from $\mathcal{T}$ using $\pi_e$.
     Randomly split the transitions into context and target sets.
     Take a gradient step on the NP objective wrt. $\theta$
  end while
  Evaluation Input:
   - Policy parameters $\phi$ for target task $\mathcal{T}^*$, Replay buffer
   - Rollout length $H$
  while true do
     Generate trajectory $\tau$ on target task $\mathcal{T}^*$ (i.e. the real environment) using $\pi_\phi$.
     for n = 1, …, N do
        Sample a state observed on $\tau$ to initialise a trajectory.
        Generate a trajectory of length $H$ using $\pi_\phi$ and autoregressive sampling from the NP (given $\tau$ as context).
        Update policy $\pi_\phi$ using any RL algorithm.
     end for
  end while
Algorithm 2 Model-based Reinforcement Learning with NPs
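The evaluation-phase inner loop, in which imagined trajectories are generated from the learned model, can be sketched generically. `model_step(context, s, a) -> (s_next, r)` stands in for sampling one transition from the NP given the real trajectory as context; the interface is our illustrative assumption, not the paper's API.

```python
def imagined_rollouts(model_step, policy, start_states, horizon, context):
    """Indirect-RL data generation: roll the learned model forward
    autoregressively from states observed on the real environment."""
    rollouts = []
    for s in start_states:
        traj = []
        for _ in range(horizon):
            a = policy(s)
            s_next, r = model_step(context, s, a)
            traj.append((s, a, r, s_next))   # transitions later fed to the RL update
            s = s_next                       # autoregressive: feed prediction back in
        rollouts.append(traj)
    return rollouts
```

The transitions returned here would be consumed by the chosen RL algorithm exactly as real experience would be, which is what makes the approach agnostic to the policy-learning method.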

Note that the linear complexity of an NP is particularly useful in this problem: as we allow for additional episodes on the real environment, the number of transitions that could be added to a context set grows quickly ($E \cdot S$ transitions for $E$ episodes of $S$ steps). In complex environments, this may quickly become prohibitive for other methods (e.g. Gaussian Process environment models (Deisenroth and Rasmussen, 2011)).

4 Problem Setup

4.1 Adversarial Task Search for RL Agents

As modern machine learning methods are approaching sufficient maturity to be applied in the real world, understanding failure cases of intelligent systems has become an important topic in our field, leading to the improvement of robustness and understanding of complex algorithms.

One class of approaches towards improving the robustness of such systems uses adversarial attacks. The objective of an attack is to find a perturbation of the input such that the predictions of the method being tested change dramatically in an unexpected fashion. Much of the rapidly growing body of work on adversarial examples (e.g. Szegedy et al., 2013; Goodfellow et al., 2014; Madry et al., 2017) has studied the supervised learning case and shown concerning vulnerabilities. Recently, this analysis has been extended to Reinforcement Learning (e.g. Behzadan and Munir, 2017; Huang et al., 2017; Uesato et al., 2018).

Inspired by this line of work, we consider the recent study by Ruderman et al. (2018) concerning failures of pretrained RL agents. The authors show that supposedly superhuman agents trained on simple navigation problems in 3D mazes catastrophically fail when challenged with adversarially designed task instances that are trivially solvable by human players. The authors use an evolutionary search technique that modifies previous examples based on the agent's episode reward. A crucial limitation is that this technique produces out-of-distribution examples, weakening the significance of the results.

Figure 2(a) shows an example of a procedurally generated maze considered by the authors. An agent starts with random orientation on a starting position (held fixed for this graphic only) indicated by a green star. A goal object randomly placed in one of the coloured squares is then to be found. The colour intensity indicates an estimate of a trained agent’s expected mean episode reward if the goal was to appear at the position. Black and white ink show occupied and empty spaces respectively. We can observe this to be a difficult optimisation problem, as many of the possible goal locations are close to the minimum.

Thus, we propose to tackle the worst-case search through a Bayesian Optimisation approach on a fixed set of possible candidate maps, using an NP surrogate model. More formally, we study the adversarial task search problem on mazes as follows. For a given maze $m$, agent and goal positions $(p_a, p_g)$ and obtained episode reward $r$, we consider functions of the form:

$$r = f(m, p_a, p_g) \qquad (8)$$
We can thus consider the following problems of interest:

  1. The worst-case position search for a fixed map $m$: $\min_{p_a, p_g} f(m, p_a, p_g)$

  2. The worst-case search over a set of maps $\mathcal{M}$, including positions: $\min_{m \in \mathcal{M}} \min_{p_a, p_g} f(m, p_a, p_g)$

We consider a fixed set of $M$ maps, such that for each map there is only a finite set (of capacity $P$) of possible agent and goal positions. For a fixed number of iterations $N$, the complexity of solving problem (ii) scales as $\mathcal{O}(N \cdot M \cdot P)$, where $M \cdot P$ corresponds to evaluation of the acquisition function on all possible inputs. However, assuming a solution to (i) has already been found, we can reduce the complexity to $\mathcal{O}(N(M + kP))$, for some small $k$, as follows: We use an additional map model, which for a given map directly predicts the minimum reward over all possible agent and goal positions, explaining the term $M$. Then, given a candidate map, we run our available solution to (i) for $k$ iterations to find agent and goal positions, which corresponds to the term $kP$. We refer to this second model as the position model.

4.2 Recommender Systems

Turning next to the contextual multi-armed bandit problem discussed in Section 3.2, we apply NPs to recommender systems. Decisions made in this context are recommendations to a user. As we aim to learn more about a user's preferences, we can think of certain recommendations as more exploratory if they are dissimilar to previously rated items. Indeed, the problem has previously been modelled as a contextual multi-armed bandit, e.g. by Li et al. (2010), using linear models.

The application of an NP to this problem is natural: we can think of each user as a function from items to ratings, where each user is possibly evaluated on a different subset of items. Thus most available datasets for recommender systems naturally fit the requirements for NP pretraining. Connecting this to the general formulation of a contextual bandit problem, we can think of each arm as a particular item recommendation and consider the user id and/or any additional information that may be provided as the context. Rewards are likely to be domain-specific, but may be as simple as the rating given by a user.

The sequential decision process in this case can be explicitly treated by finding a suitable acquisition function for recommender systems. While this choice most likely depends on the goals of a particular business, we provide a proof-of-concept analysis, explicitly maximising coverage over the input space to provide as good a function approximation as possible. This is motivated by the root-mean-squared-error metric used in the literature on the experiments we consider. Inspired by work on decision trees, a natural criterion for evaluation is the information gain at a particular candidate item/arm. Writing $r_{-a}$ to denote the rewards of each arm in the target set except $a$, and suppressing dependence on latents and the context for clarity, we can thus define the information gain at arm $a$:

$$\mathrm{IG}(a) = H(r_{-a}) - \mathbb{E}_{p(r_a)}\big[ H(r_{-a} \mid r_a) \big]$$
Note that this involves using samples of the model's predictive distribution at arm $a$ to estimate the entropy given an additional item of information. Assuming each predictive marginal is a univariate normal, we arrive at an intuitive equation to determine the expected optimal next arm for evaluation:

$$a^\star = \arg\min_a \prod_{a' \neq a} \sigma^2\big(r_{a'} \mid r_a = \mathbb{E}[r_a]\big)$$
where we made use of conditional independence, the analytic form of the entropy of a multivariate normal and the determinant of a diagonal matrix. We thus seek to recommend the next item such that the product of variances of all other items in the target set given the user’s expected response is minimised. At this point it is important to mention that the successful application of this idea depends on the quality of the uncertainty estimates used. This is a strength of the NP family.
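The variance-product criterion can be sketched as follows. `predict(context, a) -> (mu, sigma)` stands in for the surrogate's predictive marginal at arm `a`; a real implementation would re-encode the extended context with the NP. The names and interface here are our illustrative assumptions.

```python
import numpy as np

def next_item(arms, context, predict):
    """Pick the arm whose expected observation most shrinks the predictive
    variance of all other arms (log of the variance product = sum of logs)."""
    scores = []
    for a in arms:
        mu_a, _ = predict(context, a)
        extended = context + [(a, mu_a)]      # condition on the expected response
        score = sum(2 * np.log(predict(extended, b)[1])
                    for b in arms if b != a)  # log prod of remaining variances
        scores.append(score)
    return arms[int(np.argmin(scores))]
```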

4.3 Model-Based RL

As a classic RL problem, we consider the control task "cartpole" (Barto et al., 1983), fully defined by the physical parameters of the system. We obtain a distribution over tasks (i.e. state transition and reward functions) by uniformly sampling the pole mass and cart mass for each episode, allowing pretraining of an NP. For the exploration policy required during pretraining, we use a random walk of the form

$$a_t = \alpha\, a_{t-1} + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2),$$

where $\alpha$, $\sigma$ are fixed for the entire episode. Another option would be to use a pretrained expert to give meaningful trajectory unrolls, especially for more complicated tasks. This, however, requires a solution to at least one task within the distribution.
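A temporally correlated random walk of this kind is a few lines of code. The exact form used in the paper is not fully recoverable from the text, so the autoregressive coefficient and clipping range below are our assumptions, as noted in the comments.

```python
import numpy as np

def exploration_actions(T, alpha, sigma, rng):
    """Temporally correlated exploration: a_t = alpha * a_{t-1} + sigma * eps_t,
    eps_t ~ N(0, 1), with alpha and sigma fixed for the episode (assumed form).
    Clipping to [-1, 1] assumes a bounded control range."""
    actions = np.empty(T)
    prev = 0.0
    for t in range(T):
        prev = alpha * prev + sigma * rng.normal()
        actions[t] = prev
    return np.clip(actions, -1.0, 1.0)
```

Because consecutive actions are correlated, such a policy produces smoother, more informative trajectories for system identification than i.i.d. noise.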

As the RL algorithm of choice, we use on-policy SVG(1) (Heess et al., 2015) without replay. We provide a comparison to the related model-free RL algorithm SVG(0) (Heess et al., 2015) with Retrace off-policy correction (Munos et al., 2016) as a competitive baseline. For the experiments considered, we found that it was not necessary to update the policy using real environment trajectories.

5 Results and Analysis

Figure 1: Bayesian Optimisation results. (a) Position model results (position search); (b) full search problem results (full map search). We report the minimum found up to iteration $n$ (scaled to [0, 1]) as a function of the number of iterations. Bold lines show the mean performance over 4 unseen agents on a set of held-out maps. We also show 20% of the standard deviation.

5.1 Adversarial Task-Search

We consider a set of 1000 randomly chosen maps with 1620 agent and goal positions each, given a total agent population of 16 independently trained agents. We divide the maps into an 80% training and a 20% holdout set. In order to encourage population diversity (both in terms of behaviour and performance), we trained each agent on the task of interest, explore_goal_locations_large, as well as four auxiliary tasks from DMLab-30 (Beattie et al., 2016), using a total of four different RL algorithms across the population. 12 agents (3 of each type) are used during pretraining, while we reserve the remaining 4 agents (1 of each type) for evaluation. As baselines we consider methods that are learned online: Gaussian Processes with a linear and Matern 3/2 product kernel (Bonilla et al., 2008), Bayes by Backprop (BBB) (Blundell et al., 2015), AlphaDivergence (AlphaDiv) (Hernández-Lobato et al., 2016), Deep Kernel Learning (DKL) (Wilson et al., 2016) and random search. In order to ensure a correct implementation, we use the thoroughly tested code provided by the authors of Riquelme et al. (2018). In order to account for pretraining, we reuse embeddings (see eq. 2) from the NP.

Addressing the question of performance of the position model first, we show results in Figure 1(a) indicating strong performance of our method. Indeed, we find agent and goal positions close to the minimum after evaluating only approx. 5% of the possible search space. Most iterations are spent on determining the global minimum among a relatively large number of points of similar magnitude. In practice, if a point close enough to the minimum is sufficient to determine the existence of an adversarial map, search can be terminated much earlier.

In order to explain the sources of this improvement, we show an analysis of NP uncertainty in function space in Figure 2(b) for varying context sizes. The graphic should be understood as the equivalent of Figure 3 in (Garnelo et al., 2018b) for the adversarial task search problem. More specifically, we plot functions of the form (8) drawn from a neural process by sampling the latent $z$ for varying context sizes.

Figure 2: (a) Few-shot predictions: held-out agent episode returns for a fixed start (green star) and varying goal positions, showing the ground truth and few-shot predictions by an NP. (b) Uncertainty analysis: expected episode reward for an unseen agent over 1000 functions drawn from an NP. Positions are sorted by the absolute distance between agent and goal positions.

As we would expect, uncertainty in function space decreases significantly as additional context points are introduced. Furthermore, we observe an interesting change in predictions once context points two and three are introduced (blue and orange lines). The mean prediction of the model increases noticeably, indicating that the agent being evaluated performs better than the mean agent encountered during pretraining.

To show the advantage of pretraining, we illustrate episode reward predictions on maps given small context sets in Figure 2(a). Note that the model has learned to assign higher scores to points closer to the starting location while taking obstacles into account, without any explicitly defined distance metric being provided. Thus, predictions incorporate this valuable prior information and are fairly close to the ground truth after a single observation. Indeed, subsequently added points appear mainly to adjust the scale of the predicted reward and the relative ordering of low-reward points.

Finally, we test our model on the full search problem on holdout maps, using the proposed two-stage approach to reduce search complexity as outlined above. From Figure 1(b), we continue to observe superior performance on this significantly more difficult problem.

5.2 Recommender Systems

We apply NPs to the MovieLens 100k & 20m datasets (Harper and Konstan, 2016). While the specific formats of the datasets vary slightly, in both cases we face the basic problem of recommending movies to a user, given side-information such as movie genre and tags (20m only) or certain user features (100k only) such as occupation, age and sex. Importantly, while discrete ratings warrant treatment as an ordinal regression problem, we leave this for future work so that our results remain directly comparable to (Chen et al., 2018).

Discussing first the smaller MovieLens 100k dataset, we closely follow the experimental setup suggested in (Chen et al., 2018). Importantly, 20% of the users are explicitly withheld from the training dataset to test for few-shot adaptation. This is non-standard compared to the mainstream literature, which typically uses a fraction of ratings of known users as a test set. It is particularly interesting for NPs, recalling that they can be applied at test time without gradient steps (see the discussion in Section 2). Provided this works to a satisfactory degree, this property may be particularly desirable for fast on-device recommendation on mobile devices. Finally, similar to the argument made in (Chen et al., 2018), NPs can be trained in a federated learning setting, which may be an important advantage of our method should data privacy be a concern.

Model                                   20% of user data   50%      80%
SVD++ (Koren, 2008)                     1.0517             1.0217   1.0124
Baseline Neural Network                 0.9831             0.9679   0.9507
MAML (Finn et al., 2017)                0.9593             0.9441   0.9295
NP (random)                             0.9381             0.9148   0.9050
NP (info. gain)                         0.9370             0.8751   0.8060

Model                                   90%
BPMF (Salakhutdinov and Mnih, 2008)     0.8123
SVDFeature (Chen et al., 2012)          0.7852
LLORMA (Lee et al., 2013)               0.7843
ALS-WR (Zhou et al., 2008)              0.7746
I-Autorec (Sedhain et al., 2015)        0.7742
U-CFN (Strub et al., 2016)              0.7856
I-CFN (Strub et al., 2016)              0.7663
NP (random)                             0.7957
Table 1: Results on MovieLens 100k (left) and 20m (right). For the 100k dataset, we report results for varying fractions of observed ratings of new (unseen) users. On MovieLens 20m, we report results on 10% unseen ratings of known users. In both cases, we report the RMSE. Baseline results taken from (Chen et al., 2018) and (Strub et al., 2016).

Results for random context sets consisting of 20%/50%/80% of each test user's ratings (as suggested by the authors) are shown in Table 1 (left). While these results are already encouraging, treating the problem as a decision making process using our acquisition function (denoted info. gain) leads to much stronger improvements.

For completeness, we also provide results on the much larger MovieLens 20m dataset, for which more competitive baselines are available. Unfortunately, we are unable to show results using our acquisition function due to a lack of comparable baselines, and thus only evaluate our method with a random context set. Nevertheless, Table 1 (right) shows results comparable to state-of-the-art recommendation systems despite this limitation. We also discuss several suggestions for improvement in the conclusion.

5.3 Model-Based RL

Figure 3: (a) Example learning curves (mean and standard deviation over 10 random repetitions) for default task parameters. (b) Environment and model rollouts from a random initial state. (c) Mean episode reward at convergence for varying cart and pole masses.

Results of the cartpole experiment are shown in Figure 3(a). We observe strong results: a model-based RL algorithm with an NP model can successfully learn the task in about 10-15 episodes. We also show an example video for a particular run. Testing our method on the full support of the task distribution, we show the mean episode reward in Figure 3(c) (compared to a random policy in blue). We observe that the same method generalises across all considered tasks. As expected, the reward decreases slightly for particularly heavy carts. We also compare NP rollouts to real environment rollouts in Figure 3(b).
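To make the control loop concrete, here is a minimal sketch of such a model-based controller, assuming a learned one-step model `model(state, action) -> (next_state, reward)` and a simple random-shooting planner (both illustrative; the paper does not specify this exact planner):

```python
import numpy as np

def random_shooting(model, state, actions, horizon=10, n_cand=64, seed=0):
    """Return the first action of the best sampled action sequence under
    a learned one-step dynamics model (an illustrative planner, not
    necessarily the exact one used with the NP dynamics model)."""
    rng = np.random.default_rng(seed)
    best_ret, best_first = -np.inf, actions[0]
    for _ in range(n_cand):
        seq = rng.choice(actions, size=horizon)  # candidate action sequence
        s, ret = state, 0.0
        for a in seq:
            s, r = model(s, a)                   # roll out inside the model
            ret += r
        if ret > best_ret:
            best_ret, best_first = ret, seq[0]
    return best_first

# Toy cartpole-flavoured dynamics: a scalar state pushed left or right,
# rewarded for staying near the origin.
step = lambda s, a: (s + 0.1 * a, -abs(s + 0.1 * a))
a0 = random_shooting(step, 1.0, np.array([-1.0, 1.0]))
print(a0)
```

Because all rollouts happen inside the learned model, only the single executed action touches the real environment per step, which is the source of the data efficiency observed above.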

6 Related Work

There has been a recent surge of interest in Meta-Learning or Learning to Learn, resulting in a large array of methods (e.g. Koch et al.; 2015, Andrychowicz et al.; 2016, Wang et al.; 2016, Reed et al.; 2017), many of which may be applied to the problems we study (as we merely assume the existence of a general method for regression). However, predictive uncertainty is not available for the majority of these methods.

That said, several recent publications focus on probabilistic ideas or re-interpretations of popular methods (e.g. Bauer et al.; 2017, Rusu et al.; 2018, Bachman et al.; 2018), and could thus be suitable for the problems we study. An example is Probabilistic MAML (Finn et al.; 2018), which extends the popular model-agnostic meta-learning (MAML) algorithm (Finn et al.; 2017) so that it can be learned with variational inference. Other recent works cast meta-learning as hierarchical Bayesian inference (e.g. Edwards and Storkey; 2016, Hewitt et al.; 2018, Grant et al.; 2018, Ravi and Beatson; 2019).

Gaussian Processes (GPs) are popular candidates due to closed-form Bayesian inference and have been used for several of the problems we study (e.g. Rasmussen; 2003, Krause and Ong; 2011, Deisenroth and Rasmussen; 2011). While providing excellent uncertainty estimates, the scale of modern datasets can make their application difficult, often requiring approximations (e.g. Titsias; 2009). Furthermore, their performance strongly depends on the choice of a suitable kernel (and thus the prior over functions), which may in practice require careful design or compositional kernel search (e.g. Duvenaud et al.; 2013).

Deep Kernel Learning (Wilson et al.; 2016) provides an alternative that addresses scalability concerns while retaining the structural advantages of deep learning architectures. The network weights are learned by treating them as part of the kernel hyperparameters.
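As a sketch of the idea (weights, shapes, and function names below are illustrative, not those of Wilson et al.), the network's output features are fed into a standard RBF kernel, and the weights would be optimised alongside the kernel hyperparameters:

```python
import numpy as np

def feature_map(x, W1, W2):
    """Tiny two-layer network acting as the learned feature extractor."""
    return np.tanh(np.tanh(x @ W1) @ W2)

def deep_rbf_kernel(xa, xb, W1, W2, lengthscale=1.0):
    """RBF kernel evaluated on network features: the weights W1, W2 are
    treated as additional kernel hyperparameters, as in deep kernel
    learning (this is a conceptual sketch, not a full implementation)."""
    fa, fb = feature_map(xa, W1, W2), feature_map(xb, W1, W2)
    d2 = ((fa[:, None, :] - fb[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 16)), rng.normal(size=(16, 8))
X = rng.normal(size=(5, 3))
K = deep_rbf_kernel(X, X, W1, W2)
# A valid kernel matrix: symmetric with ones on the diagonal.
print(np.allclose(K, K.T), np.allclose(np.diag(K), 1.0))
```

In a full treatment, `W1` and `W2` would be fit by maximising the GP marginal likelihood together with the lengthscale, rather than fixed at random as here.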

Moreover, much of the recent work on Bayesian Neural Networks (e.g. Blundell et al.; 2015, Gal and Ghahramani; 2016, Hernández-Lobato et al.; 2016, Louizos and Welling; 2017) serves as a reasonable alternative, also benefiting from the flexibility and power of modern deep learning architectures.

Finally, the approach in (Chen et al.; 2017) tackles similar problems, applying meta-learning to black-box optimisation. It forgoes uncertainty estimation and is trained to directly suggest the next point for evaluation.

7 Discussion

In this paper, we have demonstrated the use of Neural Processes to learn data-driven priors for decision problems across a diverse set of domains, showing competitive results. At this point, we would like to remind the reader that no aspect of the models used in the experiments was tailored to the problems we study. Indeed, we chose among the same hyperparameters considered by Kim et al. (2019).

Our experiments on adversarial task search indicate that such a system may, for instance, be used within an agent evaluation pipeline to test for exploits. Moving towards the more complex case of adversarial task discovery, one faces the problem of the dimensionality of the input space. A simple approach would be to train a generative latent variable model on the map layouts, performing gradient steps on the NP predictions with respect to the latent variables. However, building generative models with such hard constraints on the input space is itself a difficult problem. A second avenue of future work could utilise the presented method to train more robust agents, using the NP to suggest problems the agent is currently unable to solve (cf. curriculum learning).

The presented results for recommender systems are particularly encouraging, noting that many of the standard bells and whistles used for such systems are orthogonal to NPs and could thus be easily incorporated. Results may be further improved by also considering the item-specific model, i.e. the function mapping from users to ratings for a fixed movie. Ideally, both movie-specific and user-specific functions could be combined in a single network.

Our RL experiments showed significant improvements in terms of data efficiency when a large set of related tasks is available. In future work, it would be interesting to consider more complex problems, which may require a more sophisticated policy during pretraining.


Appendix A Acknowledgements

We would like to thank Yutian Chen, Avraham Ruderman, Nicolas Heess, Arthur Guez and Raia Hadsell for insightful discussions and feedback on an earlier version of this manuscript.

Appendix B Adversarial Task Search

We consider four types of agents: IMPALA [Espeholt et al., 2018], PopArt-IMPALA [Hessel et al., 2018], MERLIN [Wayne et al., 2018], and R2D2 [Kapturowski et al., 2019]. Each agent is trained on the explore_goal_locations_large level included in the DMLab-30 task suite [Espeholt et al., 2018], in addition to four auxiliary levels chosen at random. The hyperparameters for the NP models are specified in Table 2. For the baselines, we take the default parameters from [Riquelme et al., 2018]; more specific parameters are given in Table 3. DKL and GP use the same kernel. For random search in the full case, we make two random position choices for each map.

Model type | Context size | (Latent) Encoder | Attention
Position Model | 200 | 128 | Squared Exponential (scale = 0.5)
Map Model | 100 | 64 | No attention

Model type | Pos. iterations | Decoder | Batch size | Fixed or maximum variance
Position Model | 1620 | | 32 | Maximum (30)
Full Model | 5 | | 32 | Fixed (0.1)
Table 2: Hyperparameters for the adversarial task search problem.
Method | Encoder | Learning rate | Training frequency | Training epochs
Random Search | All | | |
BBB | All 3 | 0.01 | 5 | 1000
α-Div | All 3 | 0.01 | 5 | 1000
DKL | Pos. | 0.005 | 5 | 1000
GP | Pos. | 0.001 | 5 | 1000
DKL | Full | 0.001 | 5 | 200
GP | Full | 0.1 | 5 | 200

Method | Decoder variance | Initial variance | Prior variance
α-Div | All | |
Table 3: Hyperparameters for baseline models for the adversarial task search.

Appendix C Recommender Systems

We now discuss details of the experiments on recommender systems. In addition to the common choices for Neural Processes, we embed user ids (20m only) and movie ids (both datasets) in a learnable matrix; the user and movie embedding sizes are listed in Table 4. Note that for each provided rating we are also given a timestamp, which we normalise by the empirical mean and standard deviation. As these statistics are unknown at test time, we use estimates from the training set as the best approximation. Choices for architecture and hyperparameters are provided in Table 4. Note that for both datasets, only the training and test set sizes are specified in Chen et al. [2018] and Strub et al. [2016]. We therefore introduce an additional validation set and train on the union of the train and validation sets before running a final evaluation. In both cases, we are given a minimum of 20 ratings per user.
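The timestamp normalisation described above can be sketched as follows (the function name is ours; only the reuse of training-set statistics at test time is taken from the text):

```python
import numpy as np

def normalise_timestamps(train_ts, test_ts):
    """Normalise rating timestamps by the training-set mean and standard
    deviation. The same training statistics are reused at test time,
    since test-set statistics are unknown when the model is deployed."""
    mu, sigma = train_ts.mean(), train_ts.std()
    return (train_ts - mu) / sigma, (test_ts - mu) / sigma

train = np.array([100.0, 200.0, 300.0])
test = np.array([250.0])
tr, te = normalise_timestamps(train, test)
print(tr.round(3), te.round(3))
```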

Dataset | Embedding | (Latent) Encoder | Attention
100k | -/8 | | Multihead (Identity)
20m | 128/128 | |

Dataset | ? | Decoder | Learning rate | Batch size
100k | | | | 32
20m | | | | 32
Table 4: Neural Process architecture for the recommender system experiments. Embedding: user/movie embedding size.

? indicates whether contexts were included in the target set during training. Note that the architecture for the encoder and latent encoder (if a latent variable is used) is identical except for a final linear transform (in parentheses). Missing entries for the latent encoder indicate that no latent variable was used. Likewise, missing entries for Attention indicate that no attention was used.

C.1 MovieLens 100k

This dataset consists of 100k ratings from 943 users on 1682 movies. For our model, we use the following user features: age, sex, and occupation. In addition, we provide information about the movie genres as a k-hot vector. As discussed in the main text, the dataset split is non-standard: as opposed to withholding a fraction of ratings for known users, we reserve 70% of users (including all their ratings) as training users, 10% as validation users, and all remaining users for the test set. These splits are chosen at random.
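A minimal sketch of this user-level split (as opposed to the usual rating-level split); the seed and function name are illustrative, not the ones used to produce the reported split:

```python
import numpy as np

def split_users(user_ids, seed=0):
    """Random 70/10/20 split over *users* (each keeping all their
    ratings), rather than over individual ratings."""
    rng = np.random.default_rng(seed)
    ids = rng.permutation(user_ids)
    n = len(ids)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# MovieLens 100k has 943 users, ids 1..943.
train, val, test = split_users(np.arange(1, 944))
print(len(train), len(val), len(test))  # -> 660 94 189
```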

In order to allow for reproducibility in future work, we report the test set user ids used in our experiments here: 5, 7, 10, 12, 15, 16, 29, 41, 43, 47, 49, 55, 61, 63, 64, 67, 73, 74, 77, 78, 79, 85, 87, 93, 120, 123, 130, 139, 148, 150, 158, 163, 176, 184, 185, 189, 191, 194, 195, 210, 212, 215, 217, 223, 226, 227, 228, 232, 234, 243, 245, 249, 253, 257, 258, 259, 264, 266, 267, 269, 270, 271, 277, 284, 291, 293, 298, 303, 305, 319, 321, 324, 329, 330, 339, 341, 344, 346, 350, 354, 363, 365, 369, 372, 381, 386, 387, 389, 391, 400, 403, 410, 412, 414, 423, 427, 434, 435, 439, 441, 443, 457, 461, 467, 469, 481, 490, 495, 498, 507, 508, 511, 516, 517, 524, 530, 544, 561, 563, 580, 591, 597, 607, 624, 629, 633, 635, 638, 641, 644, 662, 668, 678, 680, 692, 703, 711, 723, 730, 731, 740, 743, 751, 758, 764, 768, 770, 783, 785, 788, 789, 790, 793, 794, 798, 800, 801, 802, 804, 810, 812, 813, 814, 821, 827, 834, 836, 841, 843, 850, 851, 853, 860, 861, 868, 873, 887, 889, 893, 896, 901, 902, 903, 905, 906, 916, 921, 922, 923.

For the results shown using the information gain acquisition function, we used random draws from the conditional prior to estimate the expected information gain.
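Assuming each conditional-prior draw yields a Gaussian predictive distribution, such a Monte Carlo estimate can be sketched as below, with the mixture entropy approximated by a moment-matched Gaussian (this estimator is our assumption; the paper does not spell out the exact computation):

```python
import numpy as np

def mc_information_gain(mu, sigma):
    """Monte Carlo information-gain estimate for one candidate.

    mu, sigma: arrays of length S -- means and stds of the Gaussian
    predictives obtained from S conditional-prior draws.  We estimate
    IG ~= H[mixture] - mean_s H[component], approximating the mixture
    entropy by that of a moment-matched Gaussian.
    """
    comp_entropy = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)
    mix_var = (sigma ** 2).mean() + mu.var()     # moment matching
    mix_entropy = 0.5 * np.log(2 * np.pi * np.e * mix_var)
    return mix_entropy - comp_entropy.mean()

# Identical draws -> no disagreement -> zero information gain.
print(mc_information_gain(np.zeros(10), np.ones(10)))  # -> 0.0
# Disagreeing means -> positive information gain.
print(mc_information_gain(np.array([0.0, 2.0]), np.ones(2)) > 0)  # -> True
```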


C.2 MovieLens 20m

The 20m version of the MovieLens dataset consists of 20,000,263 ratings from 138,493 users on 27,278 movies. Distinct from the 100k version, we are not given any user information. In addition to movie genres, we also use tags provided by users for all movies considered, on which we apply dimensionality reduction to obtain feature vectors of dimension 50; details are given in Strub et al. [2016]. Note that for this dataset, we found CNPs (Garnelo et al. [2018a]) to work slightly better.

Appendix D Model-Based RL

All results are reported over 10 random seeds for both the baseline method and the NP-based model-based RL method. We report the best hyperparameters in Table 5.

Method | Training steps/episode | Actor network | Rollout length | Batch size | Learning rate
Model free | 500 | | 10 | 512 |
Model based | 500 | | 100 | 512 |

Method | Entropy bonus | Critic network | Min sigma | Max sigma | Activation
Model free | 0.01 | 3, 1 | 0.01 | No | ELU
Model based | 0.001 | 2, 1 | 0.01 | 0.6 | ReLU
Table 5: Hyperparameters for the model-based RL experiments.