Sequential decision making encompasses a large range of problems, with many decades of research targeting problems such as Bayesian optimisation, contextual multi-armed bandits or reinforcement learning. Recent years have brought great advances particularly in RL (e.g. Schulman et al.; 2015, Silver et al.; 2016), allowing the successful application of RL agents to increasingly complex domains. Yet, most modern machine learning algorithms require multiple orders of magnitude more experience than humans to solve even relatively simple problems. In this paper we argue that such systems ought to leverage past experience acquired by tackling related problem instances, allowing fast progress on a target task in the limited-data setting.
Consider the task of designing a motor controller for an array of robot arms in a large factory. The robots vary in age, size and proportions. The objective of the controller is to send motor commands to the robot arms such that each one achieves its designated task. Most current methods would tackle the control of each arm as a separate problem, despite the similarity between the arms and the tasks they may be assigned. Instead, we argue that the availability of data on related problems ought to be harnessed, and we discuss how learning data-driven priors can allow a controller for additional robot arms to be learned in a fraction of the time.
A second question that immediately arises in the design of such a controller is how to deal with uncertainty: uncertainty about the proportions of each robot, uncertainty about the physics of their motor movements, and the state of the environment. We thus argue that a suitable method should allow reasoning about predictive uncertainty, rapidly adjusting these estimates once more data becomes available.
Much recent research has studied the problem of decision making under such uncertainty. Frameworks such as Bayesian Optimisation (e.g. Močkus et al.; 1978, Schonlau et al.; 1998) and Contextual Bandits (e.g. Cesa-Bianchi and Lugosi; 2006) offer ways of efficiently balancing exploration and exploitation to resolve uncertainties while making progress towards each robot arm’s objective. This is often achieved by specifying some prior belief model over the dynamics of the system in advance (e.g. a Gaussian Process), which then guides the exploration/exploitation trade-off. Importantly, note that significant domain knowledge may be required to make an appropriate choice of prior.
In this paper, we propose a strategy for combining the efficiency of traditional techniques for sequential decision making with the flexibility of learning-based approaches. We employ the framework of Neural Processes (NPs) (Garnelo et al.; 2018b) to instead learn suitable priors in a data-driven manner, and utilise them as part of the inner loop of established sequential decision making algorithms. For instance, in the factory, a Neural Process learns a prior over the dynamics of robot arms, rapidly adapts this prior at run-time to each individual arm, which is then used by a Bayesian Optimisation algorithm to control the robots. Thus, the Neural Process falls within the ‘Learning to Learn’ or ‘Meta-Learning’ paradigm, and allows for sharing of knowledge across the robot arms.
The contributions of the paper are:
We argue for and demonstrate the advantage of data-driven priors obtained through Meta-Learning in the sequential decision making paradigm.
The introduction of Neural Processes as a single model family that can be used to tackle a diverse set of such problems.
A range of novel problem instances that can be approached through meta-learning, showing new approaches to recommender systems and adversarial task search.
We empirically demonstrate the advantage of learning NPs for decision making, obtaining strong results on the experiments considered.
The remainder of the paper is structured as follows. In Section 2 we introduce Neural Processes (NPs). In Section 3 we show how NPs can be used to tackle general problems in (a) Bayesian optimisation (BO), (b) contextual multi-armed bandits (CMBs) and (c) model-based reinforcement learning (MBRL). In Section 4 we describe the three disjoint problem instances that we consider in this paper: (a) we use BO with NPs to identify failure cases in pretrained RL agents, (b) we use CMBs with NPs to train recommender systems, and (c) we use MBRL with NPs to learn solutions to continuous control problems from only a handful of episodes of experience. We report the results of these experiments in Section 5, present related work in Section 6 and conclude with a discussion.
2 Neural Processes
Neural processes (NPs) are a family of models for few-shot learning (Garnelo et al.; 2018b). Given a number of realisations from some unknown stochastic process f : X → Y, NPs can be used to predict the values of f at some new, unobserved locations. In contrast to standard approaches to supervised learning such as linear regression or standard neural networks, NPs model a distribution over functions that agree with the observations provided so far (similar to e.g. Gaussian Processes (Rasmussen; 2003)). This is reflected in how NPs are trained: we require a dataset of evaluations of similar functions over the same spaces X and Y. Note, however, that we do not assume each function to be evaluated at the same inputs x. Examples of such datasets could be the temperature profile over a day in different cities around the world, or evaluations of functions generated from a Gaussian process with a fixed kernel. We provide further examples throughout the paper.
In order to allow NPs to learn distributions over functions, we split evaluations for each function into two disjoint subsets: a set of context points and a set of targets that contains unobserved points. These data points are then processed by a neural network as follows:
First, we use an encoder h with parameters θ, transforming each pair (x_i, y_i) in the context set to obtain representations r_i = h(x_i, y_i). We then aggregate all r_i to a single representation r using a permutation-invariant operator (such as addition) that captures the information about the underlying function provided by the context points. Later on, we parameterise a distribution over a latent variable z, here assumed to be Normal with moments (μ_z(r), σ_z(r)) estimated by an encoder network using parameters φ. Note that this latent variable is introduced to model uncertainty in function space, extending the Conditional Neural Process (Garnelo et al.; 2018a).
Thereafter, a decoder g is used to obtain predictive distributions p(y_t | z, x_t) at target positions x_t. Specifically, we have p(y_t | z, x_t) = N(y_t | g_ψ(z, x_t)) with parameters ψ, the exact form of the likelihood depending on the data modelled. In practice, we might decide to share parameters, e.g. between the two encoder networks. To reduce notational clutter, we suppress dependencies on parameters from now on.
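The encoder–aggregator–decoder pipeline above can be sketched in a few lines. This is a minimal numpy illustration, not the authors' implementation: the network widths, the `mlp` helper and the use of random (untrained) weights are all assumptions for exposition; a real NP would learn these parameters by maximising the ELBO below.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """Tanh MLP with random weights (stand-in for a trained network)."""
    Ws = [rng.normal(0, 0.5, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
    def f(x):
        for W in Ws[:-1]:
            x = np.tanh(x @ W)
        return x @ Ws[-1]
    return f

d_x, d_y, d_r, d_z = 1, 1, 8, 8
encoder = mlp([d_x + d_y, 16, d_r])       # h: (x_i, y_i) -> r_i
latent_net = mlp([d_r, 16, 2 * d_z])      # r -> (mu, log sigma) of q(z | .)
decoder = mlp([d_z + d_x, 16, 2 * d_y])   # g: (z, x*) -> (mean, log std) of y*

def np_predict(x_ctx, y_ctx, x_tgt):
    # 1. Encode each context pair, aggregate with a permutation-invariant mean.
    r_i = encoder(np.concatenate([x_ctx, y_ctx], axis=-1))
    r = r_i.mean(axis=0)
    # 2. Parameterise q(z | context) and draw one sample (function-space uncertainty).
    stats = latent_net(r)
    mu, sigma = stats[:d_z], np.exp(stats[d_z:])
    z = mu + sigma * rng.normal(size=d_z)
    # 3. Decode: predictive mean / std at every target location.
    z_rep = np.tile(z, (len(x_tgt), 1))
    out = decoder(np.concatenate([z_rep, x_tgt], axis=-1))
    return out[:, :d_y], np.exp(out[:, d_y:])

x_c = rng.uniform(-2, 2, (5, 1)); y_c = np.sin(x_c)
mean, std = np_predict(x_c, y_c, np.linspace(-2, 2, 50)[:, None])
```

Repeated calls to `np_predict` with the same context yield different function samples, since a fresh z is drawn each time.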
In order to learn the resulting intractable objective, approximate inference techniques such as variational inference are used, leading to the following evidence lower-bound:

log p(y_T | x_T, x_C, y_C) ≥ E_{q(z | x_{C∪T}, y_{C∪T})} [ Σ_{t ∈ T} log p(y_t | z, x_t) ] − KL( q(z | x_{C∪T}, y_{C∪T}) || q(z | x_C, y_C) ),

where C and T index the context and target sets respectively. This objective is optimised with mini-batch stochastic gradient descent, using a different function f for each element in the batch and sampling z at each iteration.
There are two interesting properties following from these equations: (i) the complexity of a Neural Process is O(n + m) for n context and m target points, allowing its evaluation on large context and target sets; (ii) no gradient steps are required at test time (in contrast to many other popular meta-learning techniques, e.g. Finn et al.; 2017). We will discuss examples where these properties are desirable when turning to specific problems later in the text.
3 Sequential Decision Making With Neural Process Surrogate Models
We now discuss how to apply NPs to various instances of the sequential decision making problem. In all cases, we will make choices under uncertainty to optimise some notion of utility. A popular strategy for such problems involves fitting a surrogate model to observed data at hand, making predictions about the problem and constantly refining it when more information becomes available. Resulting predictions in unobserved areas of the input-space allow for planning and more principled techniques for exploration, making them useful when data efficiency is a particular concern.
3.1 Bayesian Optimisation
We first consider the problem of optimising black-box functions without gradient information. A popular approach is Bayesian Optimisation (BO) (e.g. Shahriari et al.; 2016), where we wish to find the minimiser of a (possibly noisy) function f on some space X without requiring access to its derivatives. The BO approach consists of fitting a probabilistic surrogate model to approximate f on the small set of evaluations observed thus far. Examples of surrogates are Gaussian Processes, tree-structured Parzen (density) estimators, Bayesian Neural Networks or, for the purpose of this paper, NPs. The decision involved in the process is the choice of some x ∈ X at which we next evaluate f. This evaluation is typically assumed to be costly, e.g. when it involves training a machine learning algorithm (Snoek et al.; 2012). Thus, in addition to providing a good function approximation from limited data, we require a suitable model to provide good uncertainty estimates, which help address the inherent exploration/exploitation trade-off during decision making.
In addition, we require an acquisition function α to guide decision making, designed such that its maximiser is chosen as the next point for evaluation. Model uncertainty is typically incorporated into α, as is done in popular choices such as expected improvement (Močkus et al.; 1978) or the UCB algorithm (Srinivas et al.; 2009). Given our surrogate model of choice for this paper, the Thompson sampling (Thompson; 1933, Chapelle and Li; 2011) criterion is particularly convenient (although others could be used). That is, a point x ∈ X is chosen for evaluation with probability equal to its posterior probability of being the minimiser of f; in practice, this amounts to drawing a single function from the surrogate posterior and evaluating f at that sample's minimiser.
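The Thompson-sampling BO loop just described can be sketched as follows. The loop itself follows the text; the `toy_sample` posterior is a deliberately crude stand-in (an assumption for illustration) for drawing one function from an NP surrogate conditioned on the context set.

```python
import numpy as np

rng = np.random.default_rng(1)

def thompson_bo(f, x_cand, sample_posterior, n_iter=20):
    """Thompson-sampling BO: at each step, draw one function sample from the
    surrogate conditioned on the context set, evaluate f at its minimiser,
    and add the result to the context."""
    ctx_x, ctx_y = [], []
    for _ in range(n_iter):
        f_sample = sample_posterior(ctx_x, ctx_y, x_cand)  # one posterior draw
        x_next = x_cand[int(np.argmin(f_sample))]
        ctx_x.append(x_next)
        ctx_y.append(f(x_next))
    best = int(np.argmin(ctx_y))
    return ctx_x[best], ctx_y[best]

# Toy stand-in for a surrogate posterior sample: N(0, 1) prior noise,
# pinned (approximately) to observed context values near evaluated points.
def toy_sample(ctx_x, ctx_y, x_cand):
    s = rng.normal(0, 1, len(x_cand))
    for xc, yc in zip(ctx_x, ctx_y):
        s[np.abs(x_cand - xc) < 0.2] = yc
    return s

x_best, y_best = thompson_bo(lambda x: (x - 0.3) ** 2,
                             np.linspace(-1, 1, 101), toy_sample)
```

With an NP surrogate, `sample_posterior` would be a single decoder pass under one sampled z, so each iteration costs one forward evaluation over the candidate set.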
3.2 Contextual Multi-Armed Bandits
Closely related to Bayesian Optimisation, the decision problem known as a contextual multi-armed bandit is formulated as follows. At each trial t:
1. Some informative context x_t is revealed. This could for instance be features describing a user of an online content provider (e.g. Li et al.; 2010). Crucially, we assume x_t to be independent of past trials.
2. Next, we are to choose one of K arms and receive some reward r_{t,a} given our choice of arm a at the current iteration t. The current context x_t, past actions and rewards are available to guide this choice. As the reward function is unknown, we face the same exploration/exploitation trade-off as in the BO case.
3. The arm-selection strategy is updated given access to the newly acquired (x_t, a, r_{t,a}). Importantly, no reward is provided for any of the arms a' ≠ a.
Neural Processes can be applied relatively straightforwardly to this problem. Decision making proceeds as in Algorithm 1 (Thompson sampling being a reasonable choice), with the main difference being that we evaluate the NP separately for each arm a. Past data is easily incorporated in the context set by providing a one-hot vector indicating which arm has been chosen, along with the observed context x and reward r.
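The per-arm evaluation with one-hot arm encodings can be sketched as below. The selection logic mirrors the text; `toy_sample` (a per-arm running mean plus exploration noise) is an assumed stand-in for one NP posterior draw of the reward.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 3  # number of arms

def one_hot(a, K):
    v = np.zeros(K)
    v[a] = 1.0
    return v

def select_arm(sample_reward, context, history):
    """Thompson-style selection: one sampled reward per arm, conditioned on
    past (context, one-hot arm, reward) triples; play the argmax."""
    draws = [sample_reward(history, context, one_hot(a, K)) for a in range(K)]
    return int(np.argmax(draws))

# Toy stand-in for an NP reward sample.
def toy_sample(history, context, arm_vec):
    rewards = [r for (_, av, r) in history if np.array_equal(av, arm_vec)]
    mean = float(np.mean(rewards)) if rewards else 0.0
    return mean + rng.normal(0, 0.3)

true_means = [0.1, 0.9, 0.4]
history = []
for t in range(200):
    ctx = rng.normal(size=4)              # independent context at each trial
    a = select_arm(toy_sample, ctx, history)
    r = true_means[a] + rng.normal(0, 0.1)
    history.append((ctx, one_hot(a, K), r))

counts = [sum(np.array_equal(h[1], one_hot(a, K)) for h in history)
          for a in range(K)]
```

Over the 200 trials the best arm (index 1) comes to dominate the play counts, while the exploration noise ensures the other arms are still tried occasionally.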
However, in practice it might be significantly more difficult to find or design related bandit problems that are available for pretraining. We will discuss a natural real-world contextual bandit application suitable for NP pretraining in the experimental section.
3.3 Model-Based Reinforcement Learning
Allowing for dependence between subsequently provided contexts (referred to as states in the RL literature), we arrive at reinforcement learning (RL) (Sutton and Barto; 2018). An RL problem is defined by a (possibly stochastic) transition function f (defining the transitions between states given an agent’s actions) and a reward function r. Together, these functions are often referred to as an environment. We obtain the necessary distribution over functions for NP training by changing the properties of the environment, writing p(τ) to denote the distribution over tasks, i.e. each task τ ~ p(τ) corresponds to an environment (f_τ, r_τ). The objective of the RL algorithm for a fixed task τ is to maximise the expected discounted return E[Σ_t γ^t r_t], for rewards r_t obtained by acting on (f_τ, r_τ) with a policy π_θ with parameters θ, i.e. the agent’s decision-making process. Here, γ is a discounting factor and t a time index in the current episode.
In this paper, we will focus our attention on a particular set of techniques referred to as model-based algorithms. Model-based RL methods assume the existence of some approximation to the dynamics of the problem at hand (typically learned online); examples include Dyna-style planning and Monte-Carlo tree search (e.g. Peng and Williams; 1993, Browne et al.; 2012). We apply Neural Processes to this problem by first meta-learning an environment model using some exploratory policy (e.g. temporally-correlated random exploration) on samples from the task distribution. This gives us an environment model capable of quickly adapting to the dynamics of problem instances within the task distribution.
Thereafter, we use the model in conjunction with any RL algorithm to learn a policy for a specific task, by autoregressively sampling rollouts from the NP (i.e. by acting according to the current policy and sampling transitions using the NP’s approximation to the environment). These rollouts are then used to update the policy using the chosen RL algorithm (sometimes referred to as indirect RL). Optionally, we may also update the policy using the real environment rollouts (direct RL). Algorithm 2 shows the proposed approach in more detail.
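The alternation between real data collection and imagined model rollouts can be sketched as a skeleton loop. The structure follows the text; the 1-D toy environment, `model_sample` and the no-op `policy_update` are placeholders (assumptions), standing in for an NP posterior sample and an SVG-style gradient update respectively.

```python
import numpy as np

rng = np.random.default_rng(3)

def model_based_rl(env_step, env_reset, model_sample, policy_update,
                   policy, n_episodes=5, horizon=20, n_imagined=10):
    """Skeleton of the proposed loop: act on the real environment, grow the
    NP context set with observed transitions, then update the policy on
    rollouts sampled autoregressively from the adapted model (indirect RL)."""
    context = []                                    # (s, a, s', r) transitions
    for _ in range(n_episodes):
        s = env_reset()
        for _ in range(horizon):                    # real rollout (data collection)
            a = policy(s)
            s_next, r = env_step(s, a)
            context.append((s, a, s_next, r))
            s = s_next
        for _ in range(n_imagined):                 # imagined rollouts from the model
            s = env_reset()
            traj = []
            for _ in range(horizon):
                a = policy(s)
                s_next, r = model_sample(context, s, a)  # NP posterior sample
                traj.append((s, a, r))
                s = s_next
            policy = policy_update(policy, traj)
    return policy, context

# Toy 1-D environment plus stand-ins for the NP model and the policy update.
def env_reset(): return 0.0
def env_step(s, a): return s + a, -abs(s + a)       # reward peaks at the origin
def model_sample(ctx, s, a): return s + a + rng.normal(0, 0.01), -abs(s + a)
def policy_update(pi, traj): return pi              # placeholder (no-op) update

policy, context = model_based_rl(env_step, env_reset, model_sample,
                                 policy_update, lambda s: -0.5 * s)
```

In the actual method, `policy_update` would be an SVG(1) step on the imagined trajectory, and `model_sample` one draw from the NP conditioned on `context`.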
Note that the linear complexity of an NP is particularly useful in this problem: as we allow for additional episodes on the real environment, the number of transitions that could be added to a context set grows quickly (n · m transitions for n episodes of m steps each). In complex environments, this may quickly become prohibitive for other methods (e.g. Gaussian Process environment models (Deisenroth and Rasmussen; 2011)).
4 Problem Setup
4.1 Adversarial Task-Search for RL Agents
As modern machine learning methods are approaching sufficient maturity to be applied in the real world, understanding failure cases of intelligent systems has become an important topic in our field, leading to the improvement of robustness and understanding of complex algorithms.
One class of approaches towards improving the robustness of such systems uses adversarial attacks. The objective of an attack is to find a perturbation of the input such that predictions of the method being tested change dramatically in an unexpected fashion. Much of the rapidly growing body of work on adversarial examples (e.g. Szegedy et al.; 2013, Goodfellow et al.; 2014, Madry et al.; 2017) has studied the supervised learning case and shown concerning vulnerabilities. Recently, this analysis has been extended to Reinforcement Learning (e.g. Behzadan and Munir; 2017, Huang et al.; 2017, Uesato et al.; 2018).
Inspired by this line of work, we consider the recent study of Ruderman et al. (2018) concerning failures of pretrained RL agents. The authors show that supposedly superhuman agents trained on simple navigation problems in 3D mazes catastrophically fail when challenged with adversarially designed task instances that are trivially solvable by human players. The authors use an evolutionary search technique that modifies previous examples based on the agent’s episode reward. A crucial limitation of this approach is that it produces out-of-distribution examples, weakening the significance of the results.
Figure 2(a) shows an example of a procedurally generated maze considered by the authors. An agent starts with random orientation on a starting position (held fixed for this graphic only) indicated by a green star. A goal object randomly placed in one of the coloured squares is then to be found. The colour intensity indicates an estimate of a trained agent’s expected mean episode reward if the goal was to appear at the position. Black and white ink show occupied and empty spaces respectively. We can observe this to be a difficult optimisation problem, as many of the possible goal locations are close to the minimum.
Thus, we propose to tackle the worst-case search through a Bayesian Optimisation approach on a fixed set of possible candidate maps using an NP surrogate model. More formally, we study the adversarial task search problem on mazes as follows. For a given maze m, agent and goal positions (p_a, p_g) and obtained episode reward R, we consider functions of the form f : (m, p_a, p_g) → R.
We can thus consider the following problems of interest:
(i) The worst-case position search for a fixed map m: min_{p_a, p_g} f(m, p_a, p_g).
(ii) The worst-case search over a set of maps M, including positions: min_{m ∈ M} min_{p_a, p_g} f(m, p_a, p_g).
We consider a fixed set of maps M, such that for each map m, there is only a finite set (of capacity C) of possible agent and goal positions. For a fixed number of iterations per map, the complexity of solving problem (ii) scales with |M| · C, which corresponds to evaluation of the acquisition function on all possible inputs. However, assuming a solution to (i) has already been found, we can reduce the complexity to |M| + k · C for some small k, as follows: we use an additional map model, which for a given map directly predicts the minimum reward over all possible agent and goal positions, explaining the term |M|. Then, given the selected map, we run our available solution to (i) for k iterations to find agent and goal positions, which corresponds to the term k · C. We refer to the model used to solve (i) as the position model.
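The two-stage decomposition can be sketched as follows. The structure (map model proposes a map, position model then searches positions on it) follows the text; the random reward surfaces, the noisy `map_model` and the random-sampling `position_search` are toy stand-ins (assumptions) for the two learned NP models.

```python
import numpy as np

rng = np.random.default_rng(4)

def two_stage_search(maps, positions, reward, map_model, position_search, k=10):
    """Two-stage worst-case search: the map model proposes the map with the
    lowest predicted minimum reward (one prediction per map, cost ~ |M|),
    then positions are searched on that map for k iterations (cost ~ k)."""
    scores = [map_model(m) for m in maps]
    worst_map = maps[int(np.argmin(scores))]
    return worst_map, position_search(worst_map, positions, reward, k)

# Toy instantiation: each map's reward surface is a fixed random vector.
surfaces = {m: rng.uniform(0, 1, 50) for m in range(1000)}
map_model = lambda m: surfaces[m].min() + rng.normal(0, 0.01)  # noisy predicted min

def position_search(m, positions, reward, k):
    tried = rng.choice(positions, size=k, replace=False)       # stand-in for BO
    vals = [reward(m, p) for p in tried]
    return tried[int(np.argmin(vals))], min(vals)

reward = lambda m, p: surfaces[m][p]
m_star, (p_star, r_star) = two_stage_search(list(surfaces), np.arange(50),
                                            reward, map_model, position_search)
```

The key saving is that the full |M| · C acquisition sweep is replaced by one map-level prediction per map plus a short position search on the single proposed map.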
4.2 Recommender Systems
Considering next the contextual multi-armed bandit problem discussed in Section 3.2, we apply NPs to recommender systems. Decisions made in this context are recommendations to a user. As we aim to learn more about a user’s preferences, we can think of certain recommendations as more exploratory when they are dissimilar to previously rated items. Indeed, the problem has previously been modelled as a contextual multi-armed bandit, e.g. by Li et al. (2010), using linear models.
The application of an NP to this problem is natural: we can think of each user u as a function from items to ratings, i.e. f_u : items → ratings, where each user is possibly evaluated on (i.e. has rated) a different subset of items. Thus most available datasets for recommender systems naturally fit the requirements for NP pretraining. Connecting this to the general formulation of a contextual bandit problem, we can think of each arm as a particular item recommendation, and consider the user id and/or any additional information that may be provided as the context x. Rewards are likely to be domain-specific, but may be as simple as the rating given by a user.
The sequential decision process in this case can be explicitly treated by finding a suitable acquisition function for recommender systems. While this choice most likely depends on the goals of a particular business, we provide a proof-of-concept analysis, explicitly maximising coverage over the input space so as to provide as good a function approximation as possible. This is motivated by the root-mean-squared-error metric used in the literature on the experiments we consider. Inspired by work on decision trees, a natural criterion for evaluation is the information gain at a particular candidate item/arm. Writing r_{−a} to denote the rewards for each arm except a in the target set (likewise x_{−a}) and suppressing dependence on latents and the context for clarity, we can thus define the information gain at arm a:

IG(a) = H(r_{−a}) − E_{r_a}[ H(r_{−a} | r_a) ].

Note that this involves using samples of the model’s predictive distribution at arm a to estimate the entropy given an additional item of information. Assuming each predictive marginal is a univariate normal, we arrive at an intuitive criterion to determine the expected optimal next arm for evaluation:

a* = argmin_a E_{r_a}[ Σ_{a' ≠ a} log σ²(r_{a'} | r_a) ],

where we made use of conditional independence, the analytic form of the entropy of a multivariate normal and the determinant of a diagonal matrix. We thus seek to recommend the next item such that the product of variances of all other items in the target set, given the user’s expected response, is minimised. At this point it is important to mention that the successful application of this idea depends on the quality of the uncertainty estimates used. This is a strength of the NP family.
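The criterion above can be sketched as a short Monte-Carlo routine. The sum-of-log-variances score follows the text; the `predict_var` callable (per-arm conditional variances of the remaining arms) is an assumed stand-in for conditioning the NP on a sampled response at arm a.

```python
import numpy as np

rng = np.random.default_rng(5)

def info_gain_arm(predict_var, arms, n_samples=8):
    """Pick the arm whose (expected) observation most shrinks predictive
    uncertainty over all other arms: minimise the expected sum of
    log-variances (log-determinant of a diagonal Gaussian)."""
    scores = []
    for a in arms:
        # Monte-Carlo over the model's predictive distribution at arm a:
        # each call conditions on one sampled response r_a.
        total = 0.0
        for _ in range(n_samples):
            variances = predict_var(a)           # var of every other arm | r_a
            total += np.sum(np.log(variances))
        scores.append(total / n_samples)
    return arms[int(np.argmin(scores))]

# Toy stand-in: observing arm 2 is assumed to shrink the others' variances most.
shrink = {0: 1.0, 1: 0.8, 2: 0.3}
predict_var = lambda a: shrink[a] * np.ones(4) + rng.uniform(0, 0.01, 4)
chosen = info_gain_arm(predict_var, [0, 1, 2])
```

Minimising the sum of log-variances is equivalent to minimising the product of variances, matching the interpretation given above.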
4.3 Model-Based RL
As a classic RL problem, we consider the control task “cartpole” (Barto et al.; 1983), fully defined by the physical parameters of the system. We obtain a distribution over tasks (i.e. state transition and reward functions) by uniformly sampling the pole mass and cart mass for each episode, allowing pretraining of an NP. For the exploration policy required during pretraining, we use a random walk over actions whose parameters are fixed for the entire episode and resampled between episodes. Another option would be to use a pretrained expert to generate meaningful trajectory unrolls, especially for more complicated tasks. This, however, requires a solution to at least one task within the distribution.
As the RL algorithm of choice, we use on-policy SVG(1) (Heess et al.; 2015) without replay. We provide a comparison to the related model-free RL algorithm SVG(0) (Heess et al.; 2015) with Retrace off-policy correction (Munos et al.; 2016) as a competitive baseline. For the experiments considered, we found that it was not necessary to update the policy using real environment trajectories.
5 Results and Analysis
[Figure 1: performance (scaled to [0, 1]) as a function of the number of iterations. Bold lines show the mean performance over 4 unseen agents on a set of held-out maps; we also show 20% of the standard deviation.]
5.1 Adversarial Task-Search
We consider a set of 1000 randomly chosen maps with 1620 agent and goal positions each, given a total agent population of 16 independently trained agents. We divide the maps into an 80% training and 20% holdout set. In order to encourage population diversity (both in terms of behaviour and performance), we trained each agent on the task of interest, explore_goal_locations_large, as well as four auxiliary tasks from DMLab-30 (Beattie et al.; 2016), using a total of four different RL algorithms across the population. 12 agents (3 of each type) are used during pretraining, while we reserve the remaining 4 agents (1 of each type) for evaluation. As baselines we consider methods that are learned online: Gaussian Processes with a linear and Matern 3/2 product kernel (Bonilla et al.; 2008), Bayes by Backprop (BBB) (Blundell et al.; 2015), AlphaDivergence (AlphaDiv) (Hernández-Lobato et al.; 2016), Deep Kernel Learning (DKL) (Wilson et al.; 2016) and random search. In order to ensure a correct implementation, we use the thoroughly tested code provided by the authors of Riquelme et al. (2018). In order to account for pretraining, we reuse embeddings (see eq. 2) from the NP.
Addressing the question of performance of the position model first, we show results in Figure 1(a) indicating strong performance of our method. Indeed, we find agent and goal positions close to the minimum after evaluating only approx. 5% of the possible search space. Most iterations are spent on determining the global minimum among a relatively large number of points of similar magnitude. In practice, if a point close enough to the minimum is sufficient to determine the existence of an adversarial map, search can be terminated much earlier.
In order to explain the sources of this improvement, we show an analysis of NP uncertainty in function space in Figure 2(b) for varying context sizes. The graphic should be understood as the equivalent of Figure 3 in (Garnelo et al.; 2018b) for the adversarial task search problem. More specifically, we plot functions of the form (8) drawn from a neural process by sampling using varying context sizes.
As we would expect, uncertainty in function space decreases significantly as additional context points are introduced. Furthermore, we observe an interesting change in predictions once context points two and three are introduced (blue and orange lines). The mean prediction of the model increases noticeably, indicating that the agent being evaluated performs better than the mean agent encountered during pretraining.
Illustrating the advantage of pretraining, we show episode reward predictions on maps given small context sets in Figure 2(a). Note that the model has learned to assign higher scores to points closer to the starting location while taking obstacles into account, without any explicitly defined distance metric being provided. Thus, predictions incorporate this valuable prior information and are fairly close to the ground truth after a single observation. Indeed, subsequently added points appear mainly to adjust the scale of the predicted reward and the relative ordering of low-reward points.
Finally, we test our model on the full search over holdout maps, using the proposed two-stage approach to reduce search complexity as outlined above. From Figure 1(b), we continue to observe superior performance on this significantly more difficult problem.
5.2 Recommender Systems
We apply NPs to the MovieLens 100k & 20m datasets (Harper and Konstan; 2016). While the specific formats of the datasets vary slightly, in both cases we face the basic problem of recommending movies to a user, given side-information such as movie genre and tags (20m only), or certain user features (100k only) such as occupation, age and sex. Importantly, while discrete ratings warrant treatment as an ordinal regression problem, we leave this for future work to allow our results to be directly comparable to (Chen et al.; 2018).
Discussing first the smaller MovieLens 100k dataset, we closely follow the experimental setup suggested in (Chen et al.; 2018). Importantly, 20% of the users are explicitly withheld from the training dataset to test for few-shot adaptation. This is non-standard compared to the mainstream literature, which typically uses a fraction of ratings for known users as a test set. It is particularly interesting for NPs, recalling their application at test time without gradient steps (see the discussion in Section 2). Provided this works to a satisfactory degree, this property may be particularly desirable for fast on-device recommendation on mobile devices. Finally, similar to the argument made in (Chen et al.; 2018), NPs can be trained in a federated learning setting, which may be an important advantage of our method in case data privacy is a concern.
| Model | 20% of user data | 50% | 80% |
| --- | --- | --- | --- |
| SVD++ (Koren; 2008) | 1.0517 | 1.0217 | 1.0124 |
| Baseline Neural Network | 0.9831 | 0.9679 | 0.9507 |
| MAML (Finn et al.; 2017) | 0.9593 | 0.9441 | 0.9295 |
| NP (info. gain) | 0.9370 | 0.8751 | 0.8060 |

| Model | RMSE |
| --- | --- |
| BPMF (Salakhutdinov and Mnih; 2008) | 0.8123 |
| SVDFeature (Chen et al.; 2012) | 0.7852 |
| LLORMA (Lee et al.; 2013) | 0.7843 |
| ALS-WR (Zhou et al.; 2008) | 0.7746 |
| I-Autorec (Sedhain et al.; 2015) | 0.7742 |
| U-CFN (Strub et al.; 2016) | 0.7856 |
| I-CFN (Strub et al.; 2016) | 0.7663 |

Table 1: RMSE on MovieLens 100k (left) and MovieLens 20m (right).
Results for random context sets of 20%/50%/80% of test users’ ratings (as suggested by the authors) are shown in Table 1 (left). While these results are encouraging, the treatment as a decision making process using our acquisition function (denoted info. gain) leads to much stronger improvements.
For completeness, we also provide results on the much larger MovieLens 20m dataset, for which more competitive baselines are available. Unfortunately, we are unable to show results using our acquisition function, due to a lack of baselines and thus only evaluate our method with a random context set. Nevertheless, Table 1 (right) shows comparable results to state-of-the-art recommendation systems, despite this limitation. We also discuss several suggestions for improvements in the conclusion.
5.3 Model-Based RL
Results of the cartpole experiment are shown in Figure 3(a). We observe strong results, showing that a model-based RL algorithm with an NP model can successfully learn the task in about 10-15 episodes. We show an example video for a particular run at https://goo.gl/9yKav3. Testing our method on the full support of the task distribution, we show the mean episode reward in Figure 3(c) (comparing to a random policy in blue). We observe that the same method generalises to all considered tasks. As expected, the reward decreases slightly for particularly heavy carts. We also provide a comparison of NP rollouts to real environment rollouts in Figure 3(b).
6 Related Work
There has been a recent surge of interest in Meta-Learning or Learning to Learn, resulting in a large array of methods (e.g. Koch et al.; 2015, Andrychowicz et al.; 2016, Wang et al.; 2016, Reed et al.; 2017), many of which may be applied to the problems we study (as we merely assume the existence of a general method for regression). However, predictive uncertainty may not be available for the majority of these methods.
Several recent publications, however, focus on probabilistic ideas or re-interpretations of popular methods (e.g. Bauer et al.; 2017, Rusu et al.; 2018, Bachman et al.; 2018), and could thus be suitable for the problems we study. An example is Probabilistic MAML (Finn et al.; 2018), which forms an extension of the popular model-agnostic meta-learning (MAML) algorithm (Finn et al.; 2017) that can be learned with variational inference. Other recent works cast meta-learning as hierarchical Bayesian inference (e.g. Edwards and Storkey; 2016, Hewitt et al.; 2018, Grant et al.; 2018, Ravi and Beatson; 2019).
Gaussian Processes (GPs) are popular candidates due to closed-form Bayesian inference and have been used for several of the problems we study (e.g. Rasmussen; 2003, Krause and Ong; 2011, Deisenroth and Rasmussen; 2011). While they provide excellent uncertainty estimates, the scale of modern datasets can make their application difficult, often requiring approximations (e.g. Titsias; 2009). Furthermore, their performance strongly depends on the choice of a suitable kernel (and thus the prior over functions), which may in practice require careful design or compositional kernel search (e.g. Duvenaud et al.; 2013).
Deep Kernel Learning (Wilson et al.; 2016) provides an alternative, addressing scalability concerns while retaining the structural advantages of deep learning architectures. The network weights are learned by treating them as part of the kernel hyperparameters.
Moreover, much of the recent work on Bayesian Neural Networks (e.g. Blundell et al.; 2015, Gal and Ghahramani; 2016, Hernández-Lobato et al.; 2016, Louizos and Welling; 2017) serves as a reasonable alternative, also benefiting from the flexibility and power of modern deep learning architectures.
Finally, the approach in (Chen et al.; 2017) tackles similar problems, applying meta-learning to black-box optimisation. It forgoes uncertainty estimation and is instead trained to directly suggest the next point for evaluation.
7 Discussion
In this paper, we have demonstrated the use of Neural Processes to learn data-driven priors for a diverse set of sequential decision problems, showing competitive results. At this point we would like to remind the reader that no aspect of the models used in the experiments has been tailored towards the problems we study. Indeed, we choose among the same hyperparameters considered by Kim et al. (2019).
Our experiments on adversarial task search indicate that such a system may for instance be used within an agent evaluation pipeline to test for exploits. Moving towards the more complex case of adversarial task discovery, one faces the problem of the dimensionality of the input space. A simple approach would be to train a generative latent variable model on the map layouts, performing gradient steps on the NP predictions with respect to the latent variables. However, building generative models with such hard constraints on the input space is itself a difficult problem. A second avenue of future work could utilise the presented method to train more robust agents, using the NP to suggest problems the agent is currently unable to solve (cf. curriculum learning).
The presented results for recommender systems are particularly encouraging, noting that many of the standard bells and whistles used for such systems are orthogonal to NPs and could thus be easily incorporated. Results may be further improved by also considering the transposed model, i.e. the item-specific function mapping from users to ratings. Ideally, both movie-specific and user-specific functions could be combined in a single network.
Our RL experiments showed significant improvements in terms of data efficiency when a large set of related tasks is available. In future work, it would be interesting to consider more complex problems, which may require a more sophisticated policy during pretraining.
- Andrychowicz et al. (2016) M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
- Bachman et al. (2018) P. Bachman, R. Islam, A. Sordoni, and Z. Ahmed. Vfunc: a deep generative model for functions. arXiv preprint arXiv:1807.04106, 2018.
- Barto et al. (1983) A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics, SMC-13(5):834–846, 1983.
- Bauer et al. (2017) M. Bauer, M. Rojas-Carulla, J. B. Świątkowski, B. Schölkopf, and R. E. Turner. Discriminative k-shot learning using probabilistic models. arXiv preprint arXiv:1706.00326, 2017.
- Beattie et al. (2016) C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, J. Schrittwieser, K. Anderson, S. York, M. Cant, A. Cain, A. Bolton, S. Gaffney, H. King, D. Hassabis, S. Legg, and S. Petersen. Deepmind lab. arXiv preprint arXiv:1612.03801, 2016. URL https://arxiv.org/abs/1612.03801.
- Behzadan and Munir (2017) V. Behzadan and A. Munir. Vulnerability of deep reinforcement learning to policy induction attacks. In International Conference on Machine Learning and Data Mining in Pattern Recognition, pages 262–275. Springer, 2017.
- Blundell et al. (2015) C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
- Bonilla et al. (2008) E. V. Bonilla, K. M. A. Chai, and C. K. Williams. Multi-task gaussian process prediction. In Advances in Neural Information Processing Systems, 2008.
- Browne et al. (2012) C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1–43, 2012.
- Cesa-Bianchi and Lugosi (2006) N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
- Chapelle and Li (2011) O. Chapelle and L. Li. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, pages 2249–2257, 2011.
- Chen et al. (2018) F. Chen, Z. Dong, Z. Li, and X. He. Federated meta-learning for recommendation. arXiv preprint arXiv:1802.07876, 2018.
- Chen et al. (2012) T. Chen, W. Zhang, Q. Lu, K. Chen, Z. Zheng, and Y. Yu. Svdfeature: a toolkit for feature-based collaborative filtering. Journal of Machine Learning Research, 13(Dec):3619–3622, 2012.
- Chen et al. (2017) Y. Chen, M. W. Hoffman, S. G. Colmenarejo, M. Denil, T. P. Lillicrap, M. Botvinick, and N. de Freitas. Learning to learn without gradient descent by gradient descent. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 748–756. JMLR.org, 2017.
- Deisenroth and Rasmussen (2011) M. Deisenroth and C. E. Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472, 2011.
- Duvenaud et al. (2013) D. Duvenaud, J. R. Lloyd, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani. Structure discovery in nonparametric regression through compositional kernel search. arXiv preprint arXiv:1302.4922, 2013.
- Edwards and Storkey (2016) H. Edwards and A. Storkey. Towards a neural statistician. arXiv preprint arXiv:1606.02185, 2016.
- Espeholt et al. (2018) L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv:1802.01561, 2018. URL https://arxiv.org/abs/1802.01561.
- Finn et al. (2017) C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
- Finn et al. (2018) C. Finn, K. Xu, and S. Levine. Probabilistic model agnostic meta-learning. arXiv preprint arXiv:1806.02817, 2018.
- Gal and Ghahramani (2016) Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.
- Garnelo et al. (2018a) M. Garnelo, D. Rosenbaum, C. Maddison, T. Ramalho, D. Saxton, M. Shanahan, Y. W. Teh, D. Rezende, and S. A. Eslami. Conditional neural processes. In ICML, 2018a.
- Garnelo et al. (2018b) M. Garnelo, J. Schwarz, D. Rosenbaum, F. Viola, D. J. Rezende, S. Eslami, and Y. W. Teh. Neural processes. In ICML Workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018b.
- Goodfellow et al. (2014) I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- Grant et al. (2018) E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths. Recasting gradient-based meta-learning as hierarchical bayes. arXiv preprint arXiv:1801.08930, 2018.
- Harper and Konstan (2016) F. M. Harper and J. A. Konstan. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis), 5(4):19, 2016.
- Heess et al. (2015) N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pages 2944–2952, 2015.
- Hernández-Lobato et al. (2016) J. M. Hernández-Lobato, Y. Li, M. Rowland, D. Hernández-Lobato, T. Bui, and R. Turner. Black-box α-divergence minimization. In International Conference on Machine Learning, 2016.
- Hessel et al. (2018) M. Hessel, H. Soyer, L. Espeholt, W. Czarnecki, S. Schmitt, and H. v. Hasselt. Multi-task deep reinforcement learning with popart. arXiv:1809.04474, 2018. URL https://arxiv.org/abs/1809.04474.
- Hewitt et al. (2018) L. B. Hewitt, M. I. Nye, A. Gane, T. Jaakkola, and J. B. Tenenbaum. The variational homoencoder: Learning to learn high capacity generative models from few examples. arXiv preprint arXiv:1807.08919, 2018.
- Huang et al. (2017) S. Huang, N. Papernot, I. Goodfellow, Y. Duan, and P. Abbeel. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.
- Kapturowski et al. (2019) S. Kapturowski, G. Ostrovski, W. Dabney, J. Quan, and R. Munos. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=r1lyTjAqYX.
- Kim et al. (2019) H. Kim, A. Mnih, J. Schwarz, M. Garnelo, A. Eslami, D. Rosenbaum, O. Vinyals, and Y. W. Teh. Attentive neural processes. In International Conference on Learning Representations, 2019.
- Koch et al. (2015) G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
- Koren (2008) Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 426–434. ACM, 2008.
- Krause and Ong (2011) A. Krause and C. S. Ong. Contextual gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447–2455, 2011.
- Le et al. (2018) T. A. Le, H. Kim, M. Garnelo, D. Rosenbaum, J. Schwarz, and Y. W. Teh. Empirical evaluation of neural process objectives. In NeurIPS workshop on Bayesian Deep Learning, 2018.
- Lee et al. (2013) J. Lee, S. Kim, G. Lebanon, and Y. Singer. Local low-rank matrix approximation. In International Conference on Machine Learning, pages 82–90, 2013.
- Li et al. (2010) L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010.
- Louizos and Welling (2017) C. Louizos and M. Welling. Multiplicative normalizing flows for variational bayesian neural networks. arXiv preprint arXiv:1703.01961, 2017.
- Madry et al. (2017) A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
- Močkus et al. (1978) J. Močkus, V. Tiesis, and A. Žilinskas. The application of bayesian methods for seeking the extremum. Towards Global Optimization, volume 2, 1978.
- Munos et al. (2016) R. Munos, T. Stepleton, A. Harutyunyan, and M. G. Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 1046–1054, 2016.
- Nagabandi et al. (2018) A. Nagabandi, C. Finn, and S. Levine. Deep online learning via meta-learning: Continual adaptation for model-based rl. arXiv preprint arXiv:1812.07671, 2018.
- Peng and Williams (1993) J. Peng and R. J. Williams. Efficient learning and planning within the dyna framework. Adaptive Behavior, 1(4):437–454, 1993.
- Rasmussen (2003) C. E. Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2003.
- Ravi and Beatson (2019) S. Ravi and A. Beatson. Amortized bayesian meta-learning. In International Conference on Learning Representations, 2019.
- Reed et al. (2017) S. Reed, Y. Chen, T. Paine, A. v. d. Oord, S. Eslami, D. Rezende, O. Vinyals, and N. de Freitas. Few-shot autoregressive density estimation: Towards learning to learn distributions. arXiv preprint arXiv:1710.10304, 2017.
- Riquelme et al. (2018) C. Riquelme, G. Tucker, and J. Snoek. Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=SyYe6k-CW.
- Ruderman et al. (2018) A. Ruderman, R. Everett, B. Sikder, H. Soyer, J. Uesato, A. Kumar, C. Beattie, and P. Kohli. Uncovering surprising behaviors in reinforcement learning via worst-case analysis. 2018. URL https://openreview.net/forum?id=SkgZNnR5tX.
- Rusu et al. (2018) A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960, 2018.
- Salakhutdinov and Mnih (2008) R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using markov chain monte carlo. In Proceedings of the 25th International Conference on Machine Learning, pages 880–887. ACM, 2008.
- Schonlau et al. (1998) M. Schonlau, W. J. Welch, and D. R. Jones. Global versus local search in constrained optimization of computer models. Lecture Notes-Monograph Series, pages 11–25, 1998.
- Schulman et al. (2015) J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
- Sedhain et al. (2015) S. Sedhain, A. K. Menon, S. Sanner, and L. Xie. Autorec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web, pages 111–112. ACM, 2015.
- Shahriari et al. (2016) B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
- Silver et al. (2016) D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
- Snoek et al. (2012) J. Snoek, H. Larochelle, and R. P. Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
- Srinivas et al. (2009) N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
- Strub et al. (2016) F. Strub, R. Gaudel, and J. Mary. Hybrid recommender system based on autoencoders. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pages 11–16. ACM, 2016.
- Sutton and Barto (2018) R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
- Szegedy et al. (2013) C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- Thompson (1933) W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
- Titsias (2009) M. Titsias. Variational learning of inducing variables in sparse gaussian processes. In Artificial Intelligence and Statistics, pages 567–574, 2009.
- Uesato et al. (2018) J. Uesato, B. O’Donoghue, A. v. d. Oord, and P. Kohli. Adversarial risk and the dangers of evaluating against weak attacks. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.
- Wang et al. (2016) J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
- Wayne et al. (2018) G. Wayne, C. Hung, D. Amos, M. Mirza, A. Ahuja, A. Grabska-Barwinska, J. W. Rae, P. Mirowski, J. Z. Leibo, A. Santoro, M. Gemici, M. Reynolds, T. Harley, J. Abramson, S. Mohamed, D. J. Rezende, D. Saxton, A. Cain, C. Hillier, D. Silver, K. Kavukcuoglu, M. Botvinick, D. Hassabis, and T. P. Lillicrap. Unsupervised predictive memory in a goal-directed agent. CoRR, abs/1803.10760, 2018. URL http://arxiv.org/abs/1803.10760.
- Wilson et al. (2016) A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378, 2016.
- Zhou et al. (2008) Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the netflix prize. In International conference on algorithmic applications in management, pages 337–348. Springer, 2008.
Appendix A Acknowledgements
We would like to thank Yutian Chen, Avraham Ruderman, Nicolas Heess, Arthur Guez and Raia Hadsell for insightful discussions and feedback on an earlier version of this manuscript.
Appendix B Adversarial Task Search
We consider four types of agents: IMPALA [Espeholt et al., 2018], PopArt-IMPALA [Hessel et al., 2018], MERLIN [Wayne et al., 2018] and R2D2 [Kapturowski et al., 2019]. Each agent is trained on the explore_goal_locations_large level included in the DMLab-30 task suite [Espeholt et al., 2018], in addition to four auxiliary levels chosen at random. The hyperparameters for the NP models are specified in Table 2. For the baselines, we take the default parameters from Riquelme et al. [2018]; more specific parameters are given in Table 3. DKL and GP use the same kernel. For random search in the full case, we make 2 random position choices for each map.
| Model type | Context size | (Latent) Encoder | Attention |
| --- | --- | --- | --- |
| Position Model | 200 | 128 | Squared Exponential (scale = 0.5) |
| Map Model | 100 | 64 | No attention |

| Model type | Pos iterations | Decoder | Batch size | Fixed or maximum variance |
| --- | --- | --- | --- | --- |
| Position Model | 1620 | | 32 | Maximum 30 |
| Full Model | 5 | | 32 | Fixed 0.1 |
| Method | Encoder | Learning rate | Training frequency |
| --- | --- | --- | --- |

| Method | Decoder variance | Initial variance | Prior variance |
| --- | --- | --- | --- |
Appendix C Recommender Systems
We now discuss details of the experiment on recommender systems. In addition to the common choices for Neural Processes, we embed user ids (20m only) and movie ids (both datasets) in a learnable matrix; the sizes of these embedding vectors are given in Table 4. Note that for each provided rating, we are also given a timestamp, which we normalise by the empirical mean and standard deviation. As these statistics are unknown at test time, we use estimates from the training set as the best approximation. Choices for architecture and hyperparameters are provided in Table 4. Note that for both datasets, only training and test set sizes are specified in Chen et al. (2018) and Strub et al. (2016). We introduce an additional validation set and train on the union of train and validation sets before running a final evaluation. In both cases, we are given a minimum of 20 ratings per user.
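The timestamp normalisation described above amounts to the following. The arrays are illustrative; the point is that the mean and standard deviation come from the training set only and are reused unchanged at test time.

```python
import numpy as np

# Illustrative timestamps (Unix epoch seconds); statistics are computed on
# the training set only and reused at test time, since test-time statistics
# are unknown when the model is fit.
train_ts = np.array([1.0e9, 1.1e9, 1.2e9, 1.3e9])  # training timestamps
test_ts = np.array([1.25e9, 1.4e9])                # unseen test timestamps

mu, sigma = train_ts.mean(), train_ts.std()
train_norm = (train_ts - mu) / sigma
test_norm = (test_ts - mu) / sigma  # same training statistics applied

# Training timestamps are standardised; test timestamps are merely shifted
# and scaled by the same (approximate) statistics.
assert abs(train_norm.mean()) < 1e-9
assert abs(train_norm.std() - 1.0) < 1e-9
```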
| Dataset | ? | Decoder | Learning rate | Batch size |
| --- | --- | --- | --- | --- |

? indicates whether contexts were included in the target set during training. Note that the architecture for the encoder and latent encoder (if a latent variable is used) is identical except for a final linear transform (in parentheses). Missing entries for the latent encoder indicate that no latent variable was used. Likewise, missing entries for Attention indicate that no attention was used.
This dataset consists of 100k ratings from 943 users on 1682 movies. For our model, we use the following user features: age, sex and occupation. In addition, we provide information about the movie genres as a k-hot vector. As discussed in the main text, the dataset split is non-standard: as opposed to withholding a fraction of ratings for known users, we reserve 70% of users (including all their ratings) for training, 10% as validation users and all remaining users for the test set. These splits are chosen at random.
In order to allow for reproducibility in future work, we report the test set user ids used in our experiments here: 5, 7, 10, 12, 15, 16, 29, 41, 43, 47, 49, 55, 61, 63, 64, 67, 73, 74, 77, 78, 79, 85, 87, 93, 120, 123, 130, 139, 148, 150, 158, 163, 176, 184, 185, 189, 191, 194, 195, 210, 212, 215, 217, 223, 226, 227, 228, 232, 234, 243, 245, 249, 253, 257, 258, 259, 264, 266, 267, 269, 270, 271, 277, 284, 291, 293, 298, 303, 305, 319, 321, 324, 329, 330, 339, 341, 344, 346, 350, 354, 363, 365, 369, 372, 381, 386, 387, 389, 391, 400, 403, 410, 412, 414, 423, 427, 434, 435, 439, 441, 443, 457, 461, 467, 469, 481, 490, 495, 498, 507, 508, 511, 516, 517, 524, 530, 544, 561, 563, 580, 591, 597, 607, 624, 629, 633, 635, 638, 641, 644, 662, 668, 678, 680, 692, 703, 711, 723, 730, 731, 740, 743, 751, 758, 764, 768, 770, 783, 785, 788, 789, 790, 793, 794, 798, 800, 801, 802, 804, 810, 812, 813, 814, 821, 827, 834, 836, 841, 843, 850, 851, 853, 860, 861, 868, 873, 887, 889, 893, 896, 901, 902, 903, 905, 906, 916, 921, 922, 923.
For the results shown using the information gain acquisition function, we used random draws from the conditional prior to estimate the expected information gain.
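A Monte Carlo estimate of this kind can be sketched as follows. This is an assumed BALD-style decomposition, not the paper's exact estimator: given predictive means for one candidate item under functions sampled from the conditional prior, the gain is the entropy of the marginal Gaussian predictive minus the expected entropy of the per-sample predictive (here a fixed, illustrative noise variance).

```python
import numpy as np

def gaussian_entropy(var):
    """Differential entropy of a univariate Gaussian with variance `var`."""
    return 0.5 * np.log(2 * np.pi * np.e * var)

def expected_information_gain(sample_means, noise_var=0.1):
    """Monte Carlo information-gain estimate for one candidate item.

    `sample_means` holds the predictive mean for that item under each
    function drawn from the conditional prior. The marginal predictive
    variance is the spread of those means plus observation noise.
    """
    marginal_var = sample_means.var() + noise_var
    return gaussian_entropy(marginal_var) - gaussian_entropy(noise_var)

rng = np.random.default_rng(0)
certain_item = rng.normal(3.5, 0.05, size=256)   # sampled functions agree
uncertain_item = rng.normal(3.5, 1.0, size=256)  # sampled functions disagree

# Items the sampled functions disagree on carry more information.
assert expected_information_gain(uncertain_item) > expected_information_gain(certain_item)
```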
The 20m version of the MovieLens dataset consists of 20,000,263 ratings from 138,493 users on 27,278 movies. Distinct from the 100k version, we are not given any user information. In addition to movie genres, we also use tags provided by users for all movies considered, to which we apply dimensionality reduction to obtain feature vectors of dimension 50; details can be found in Strub et al. (2016). Note that for this dataset, we found CNPs [Garnelo et al., 2018a] to work slightly better.
Appendix D Model-Based RL
All the results are reported with 10 random seeds for the baseline method and the NP-based model-based RL method. We report the best hyperparameters in Table 5.
| Method | Training steps/episode | Actor network | Rollout length | Batch size | Learning rate |
| --- | --- | --- | --- | --- | --- |

| Method | Entropy bonus | Critic network | Min sigma | Max sigma | Activation |
| --- | --- | --- | --- | --- | --- |
| Model free | 0.01 | 3 1 | 0.01 | No | ELU |
| Model based | 0.001 | 2 1 | 0.01 | 0.6 | ReLU |