1 Introduction
Reinforcement learning (RL) is typically concerned with finding an optimal policy
that maximises expected return for a given Markov decision process (MDP) with an unknown reward and transition function. If these were known, the optimal policy could in theory be computed without interacting with the environment. By contrast, learning in an
unknown environment typically requires trading off exploration (learning about the environment) and exploitation (taking promising actions). Balancing this tradeoff is key to maximising expected return during learning. A Bayes-optimal policy, which does so optimally, conditions actions not only on the environment state but also on the agent’s own uncertainty about the current MDP.
In principle, a Bayes-optimal policy can be computed using the framework of Bayes-adaptive Markov decision processes (BAMDPs) (Martin, 1967; Duff and Barto, 2002). The agent maintains a belief, i.e., a posterior distribution, over possible environments. Augmenting the state space of the underlying MDP with this posterior distribution yields a BAMDP, a special case of a belief MDP (Kaelbling et al., 1998). A Bayes-optimal agent maximises expected return in the BAMDP by systematically seeking out the data needed to quickly reduce uncertainty, but only insofar as doing so helps maximise expected return. Its performance is bounded from above by that of the optimal policy for the given MDP, which does not need to take exploratory actions but requires prior knowledge about the MDP to compute.
Unfortunately, planning in a BAMDP, i.e., computing a Bayes-optimal policy that conditions on the augmented state, is intractable for all but the smallest tasks. A common shortcut is to rely instead on posterior sampling (Thompson, 1933; Strens, 2000; Osband et al., 2013). Here, the agent periodically samples a single hypothesis MDP (e.g., at the beginning of an episode) from its posterior, and the policy that is optimal for the sampled MDP is followed until the next sample is drawn. Planning is far more tractable since it is done on a regular MDP, not a BAMDP. However, posterior sampling’s exploration can be highly inefficient and far from Bayes-optimal.
Consider the example of a gridworld in Figure 1, where the agent must navigate to an unknown goal located in the grey area (Figure 1(a)). To maintain a posterior, the agent can uniformly assign nonzero probability to cells where the goal could be, and zero to all other cells. A Bayes-optimal strategy systematically searches the set of goal positions that the posterior considers possible, until the goal is found (Figure 1(b)). Posterior sampling by contrast samples a possible goal position, takes the shortest route there, and then resamples a different goal position from the updated posterior (Figure 1(c)). Doing so is much less efficient since the agent’s uncertainty is not reduced optimally (e.g., states are revisited).
As this example illustrates, Bayes-optimal policies can explore much more efficiently than posterior sampling. Hence, a key challenge is to find ways to learn approximately Bayes-optimal policies while retaining the tractability of posterior sampling. In addition, many complex tasks pose another key challenge: the inference involved in maintaining a posterior belief, needed even for posterior sampling, may itself be intractable.
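To make the gridworld example concrete, the following is a minimal sketch of the belief over goal cells and of a posterior-sampling search. Everything in it is an illustrative assumption rather than the paper's implementation: the function names, the rows-first shortest path, and in particular the `"nearest"` strategy, which is a simple greedy heuristic used as a stand-in for systematic search and is not claimed to be Bayes-optimal. The start cell is assumed not to be a candidate goal.

```python
import numpy as np

def update_belief(belief, cell, found_goal):
    """Bayes update after observing whether `cell` contains the goal."""
    belief = belief.copy()
    if found_goal:
        belief[:] = 0.0
        belief[cell] = 1.0
    else:
        belief[cell] = 0.0          # rule out the visited cell
        belief /= belief.sum()      # renormalise over remaining candidates
    return belief

def shortest_path(start, target):
    """Cells entered on one Manhattan shortest path (rows first, then columns)."""
    (r, c), (tr, tc) = start, target
    path = []
    while r != tr:
        r += 1 if tr > r else -1
        path.append((r, c))
    while c != tc:
        c += 1 if tc > c else -1
        path.append((r, c))
    return path

def search(goal, candidates, start, size, rng, strategy="sample"):
    """Return (steps taken, final belief). Assumes goal is in candidates, start is not."""
    belief = np.zeros((size, size))
    for cell in candidates:
        belief[cell] = 1.0 / len(candidates)   # uniform prior over the candidate area
    pos, steps = start, 0
    while True:
        possible = [tuple(int(i) for i in idx) for idx in np.argwhere(belief > 0)]
        if strategy == "sample":
            # Posterior sampling: draw one hypothesis goal from the current belief.
            probs = [belief[c] for c in possible]
            target = possible[rng.choice(len(possible), p=probs)]
        else:
            # Hypothetical heuristic: head for the nearest still-possible goal cell.
            target = min(possible, key=lambda c: abs(c[0] - pos[0]) + abs(c[1] - pos[1]))
        for cell in shortest_path(pos, target):
            pos, steps = cell, steps + 1
            if belief[cell] > 0:               # only candidate cells are informative
                belief = update_belief(belief, cell, cell == goal)
            if cell == goal:
                return steps, belief
```

Running `search` repeatedly with `strategy="sample"` and comparing the average step count against `strategy="nearest"` gives a rough feel for how much revisiting the sampled-goal strategy incurs on a given layout.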
In this paper, we combine ideas from Bayesian reinforcement learning, approximate variational inference, and meta-learning to tackle these challenges, and equip an agent with the ability to strategically explore unseen (but related) environments for a given distribution, in order to maximise its expected return.
More specifically, we propose variational Bayes-Adaptive Deep RL (variBAD), a way to meta-learn to perform approximate inference on a new task (we use the terms environment, task, and MDP interchangeably), and incorporate task uncertainty directly during action selection. We represent a single MDP $M$ using a learned, low-dimensional stochastic latent variable $m$. Given a set of tasks sampled from a distribution, we jointly meta-train: (1) a variational autoencoder that can infer the posterior distribution over $m$ in a new task while interacting with the environment, and (2) a policy that conditions on this posterior distribution over the MDP embeddings, and thus learns how to trade off exploration and exploitation when selecting actions under task uncertainty. Figure 1(e) shows the performance of our method versus the hard-coded optimal (i.e., given privileged goal information), Bayes-optimal, and posterior sampling exploration strategies. VariBAD’s performance closely matches that of Bayes-optimal action selection, matching optimal performance from the third rollout.
Previous approaches to BAMDPs are only tractable in environments with small action and state spaces. VariBAD offers a tractable and dramatically more flexible approach for learning Bayes-adaptive policies tailored to the task distribution seen during training, with the only assumption that such a task distribution is available for meta-training. We evaluate our approach on the gridworld shown above and on MuJoCo domains that are widely used in meta-RL, and show that variBAD exhibits superior exploratory behaviour at test time compared to existing meta-learning methods, achieving higher returns during learning. As such, variBAD opens a path to tractable approximate Bayes-optimal exploration for deep reinforcement learning.
2 Background
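We define a Markov decision process (MDP) as a tuple $M = (\mathcal{S}, \mathcal{A}, R, T, T_0, \gamma, H)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $R(r_{t+1} \mid s_t, a_t, s_{t+1})$ is a reward function, $T(s_{t+1} \mid s_t, a_t)$ is a transition function, $T_0(s_0)$ is an initial state distribution, $\gamma$ is a discount factor, and $H$ is the horizon. In the standard RL setting, the goal is to learn a policy $\pi$ that maximises the expected return $\mathcal{J}(\pi) = \mathbb{E}_{T_0, T, \pi}\left[\sum_{t=0}^{H-1} \gamma^t R(r_{t+1} \mid s_t, a_t, s_{t+1})\right]$. In this paper, we consider a multi-task meta-learning setting, which we introduce next.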
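As a small illustration of the expected-return objective, the sketch below computes the discounted return of one rollout and a Monte Carlo estimate over several rollouts. The function names and the list-of-rewards input format are assumptions made for this example only.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Discounted sum of rewards for one rollout: sum over t of gamma^t * r_{t+1}."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(discounts @ rewards)

def estimate_expected_return(rollouts, gamma):
    """Monte Carlo estimate of the expected return from sampled reward sequences."""
    return float(np.mean([discounted_return(r, gamma) for r in rollouts]))
```

For instance, `discounted_return([1, 1, 1], 0.5)` evaluates to `1 + 0.5 + 0.25 = 1.75`.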
2.1 Meta-Learning
We adopt the standard meta-learning setting in which we consider a distribution $p(M)$ over MDPs, where an MDP $M_i \sim p(M)$ is defined by a tuple $M_i = (\mathcal{S}, \mathcal{A}, R_i, T_i, T_{i,0}, \gamma, H)$. Across tasks, the reward function $R_i$ and transition function $T_i$ can vary, and typically some structure is shared across tasks. The index $i$ represents either a task description (e.g., a goal position; a natural language instruction; the leg length of a humanoid) or a task ID. Sampling an MDP from $p(M)$ is typically done by sampling a reward and transition function from a distribution $p(R, T)$.
The distribution over MDPs is unknown to the agent, but it is possible to sample single MDPs for meta-training. At each meta-training iteration, a batch of $N$ tasks is drawn, $\mathbf{M} = \{M_i\}_{i=1}^{N}$. For each task $M_i \in \mathbf{M}$, we are given a limited number of environment interactions for learning (how to learn), i.e., to maximise performance within that initially unknown MDP (the task description or ID is unknown). How those environment interactions are used depends on the meta-learning method (see Section 4 for an overview). At meta-test time, the agent is evaluated based on the average return it achieves during learning, across a set of tasks drawn from $p(M)$. Doing this well requires at least two things: (1) incorporating prior knowledge obtained in related tasks, and (2) reasoning about (task) uncertainty when selecting actions to trade off exploration and exploitation. In the following, we combine ideas from meta-learning and Bayesian RL to tackle these challenges.
2.2 Bayesian Reinforcement Learning
When the MDP (i.e., transition and/or reward function) is unknown, optimal decision making has to trade off exploration and exploitation when selecting actions. In principle, this can be done by taking a Bayesian approach to reinforcement learning (Bellman, 1956; Duff and Barto, 2002; Ghavamzadeh et al., 2015). In the Bayesian formulation of RL, we assume that the transition and reward functions are distributed according to a prior $b_0 = p(R, T)$. Since the agent does not have access to the true reward and transition function, it can maintain a belief $b_t(R, T)$ at every timestep, which is the posterior over the MDP given the agent’s experience: given a trajectory of states, actions, and rewards, $\tau_{:t} = \{s_0, a_0, r_1, s_1, a_1, \ldots, s_t\}$, the prior can be updated to form a posterior belief $b_t(R, T) = p(R, T \mid \tau_{:t})$. This is typically formulated by maintaining a distribution over the model parameters.
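To allow the agent to incorporate the task uncertainty into its decision-making, this belief can be augmented to the state space, $s_t^+ = (s_t, b_t) \in \mathcal{S}^+ = \mathcal{S} \times \mathcal{B}$, where $\mathcal{B}$ is the belief space. States in $\mathcal{S}^+$ are often called hyper-states, and they transition according to
$$T^+(s_{t+1}^+ \mid s_t^+, a_t, r_t) = \mathbb{E}_{b_t}\left[T(s_{t+1} \mid s_t, a_t)\right] \, \delta\left(b_{t+1} = p(R, T \mid \tau_{:t+1})\right), \tag{1}$$
i.e., the new environment state $s_{t+1}$ is the expected new state w.r.t. the current posterior distribution of the transition function, and the belief is updated deterministically according to Bayes’ rule. The reward function on hyper-states is defined as the expected reward under the current posterior (after the state transition) over reward functions,
$$R^+(s_t^+, a_t, s_{t+1}^+) = \mathbb{E}_{b_{t+1}}\left[R(s_t, a_t, s_{t+1})\right]. \tag{2}$$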
We can now formulate the Bayes-Adaptive Markov decision process (BAMDP, Duff and Barto (2002)), which consists of the tuple $M^+ = (\mathcal{S}^+, \mathcal{A}, R^+, T^+, T_0^+, \gamma, H^+)$. This is a special case of a belief MDP, i.e., the MDP formed by taking the posterior beliefs maintained by an agent in a partially observable MDP (POMDP) and reinterpreting them as Markov states (Cassandra et al., 1994). It is a special case because the belief is over the transition and reward functions, which are constant for a given task. By contrast, in an arbitrary belief MDP, the belief is over a hidden state that can change at each timestep. The agent’s objective is now to maximise the expected return in this BAMDP,
$$\mathcal{J}^+(\pi) = \mathbb{E}_{b_0, T_0^+, T^+, \pi}\left[\sum_{t=0}^{H^+-1} \gamma^t R^+(r_{t+1} \mid s_t^+, a_t, s_{t+1}^+)\right], \tag{3}$$
i.e., to maximise the expected return in an initially unknown environment (i.e., reward and transition function), while learning, within the horizon $H^+$. This objective is maximised by the Bayes-optimal policy, which automatically trades off exploration and exploitation: it takes exploratory actions to reduce its task uncertainty only insofar as it helps to maximise the expected return within the horizon.
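The hyper-state transition and reward can be made concrete for the simplest case of a belief over a finite set of candidate MDPs. The sketch below is an illustration under stated assumptions, not part of the method: tabular candidate models `T[k, s, a, s']` and mean rewards `R[k, s, a, s']`, Gaussian reward noise with a known standard deviation `reward_std`, and the true MDP being among the candidates (so the normaliser is non-zero).

```python
import numpy as np

def hyperstate_transition_probs(belief, T, s, a):
    """Expected next-state distribution under the current belief over candidate MDPs."""
    return np.einsum("k,kn->n", belief, T[:, s, a, :])

def belief_update(belief, T, R, s, a, r, s_next, reward_std=0.1):
    """Bayes rule: posterior over candidates after observing (s, a, r, s_next)."""
    lik_transition = T[:, s, a, s_next]
    lik_reward = np.exp(-0.5 * ((r - R[:, s, a, s_next]) / reward_std) ** 2)
    posterior = belief * lik_transition * lik_reward
    return posterior / posterior.sum()

def hyperstate_reward(belief_next, R, s, a, s_next):
    """Expected reward under the posterior held after the state transition."""
    return float(belief_next @ R[:, s, a, s_next])
```

Because the belief update is a deterministic function of the observed transition, the only stochasticity in the hyper-state comes from the environment step, which is what the delta term in the transition expresses.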
The BAMDP framework is powerful because it provides a principled way of formulating Bayes-optimal behaviour. However, solving the BAMDP is hopelessly intractable for most interesting problems. The main challenges are as follows.

We typically do not have access to the prior distribution but can only sample from it.

We typically do not know the parameterisation of the true reward and/or transition model. Instead, we can choose to approximate them, for example using deep neural networks.

The belief update (computing the posterior $p(R, T \mid \tau_{:t})$) is often intractable.

Even given the posterior over the reward and transition model, planning in belief space is typically intractable.
In the following, we propose a method that uses meta-learning to do Bayesian reinforcement learning, which amounts to meta-learning a prior over tasks and performing inference over reward and transition functions using deep learning. Crucially, our Bayes-adaptive policy is learned end-to-end with the inference framework, which means that no planning is necessary at test time. It can be applied to the typical meta-learning setting and makes minimal assumptions (no task ID or description is required), resulting in a highly flexible and scalable approach to Bayes-adaptive Deep RL.
3 Bayes-Adaptive Deep RL via Meta-Learning
In this section, we present variBAD, and describe how we tackle the challenges outlined above. We start by describing how to represent reward and transition functions, and (posterior) distributions over these. We then consider how to meta-learn to perform approximate variational inference in a given task, and finally put all the pieces together to form our training objective.
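In the typical meta-learning setting, the reward and transition functions that are unique to each MDP are unknown, but also share some structure across the MDPs $M_i$ in $p(M)$. We know that there exists a true $i$ which represents either a task description or task ID, but we do not have access to this information. We therefore represent this value using a learned stochastic latent variable $m_i$. For a given MDP $M_i$ we can then write
$$R_i(r_{t+1} \mid s_t, a_t, s_{t+1}) \approx R(r_{t+1} \mid s_t, a_t, s_{t+1}; m_i), \tag{4}$$
$$T_i(s_{t+1} \mid s_t, a_t) \approx T(s_{t+1} \mid s_t, a_t; m_i), \tag{5}$$
where $R$ and $T$ are shared across tasks. Since we do not have access to the true task description or ID, we need to infer $m_i$ given the agent’s experience up to time step $t$ collected in $M_i$,
$$\tau_{:t}^{(i)} = (s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{t-1}, a_{t-1}, r_t, s_t), \tag{6}$$
i.e., we want to infer the posterior distribution $p(m_i \mid \tau_{:t}^{(i)})$ over $m_i$ given $\tau_{:t}^{(i)}$ (from now on, we drop the sub- and superscript $i$ for ease of notation).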
Recall that our goal is to learn a distribution over the MDPs, and given a posteriori knowledge of the environment compute the optimal action. Given the above reformulation, it is now sufficient to reason about the embedding $m$, instead of the transition and reward dynamics. This is particularly useful when deploying deep learning strategies, where the reward and transition function can consist of millions of parameters, but the embedding $m$ can be a small vector.
3.1 Approximate Inference
Computing the exact posterior is typically not possible: we do not have access to the MDP (and hence the transition and reward function), and marginalising over tasks is computationally infeasible. Consequently, we need to learn a model of the environment $p_\theta(\tau_{:H^+} \mid a_{:H^+-1})$, parameterised by $\theta$, together with an amortised inference network $q_\phi(m \mid \tau_{:t})$, parameterised by $\phi$, which allows fast inference at runtime at each timestep $t$. The action-selection policy is not part of the MDP, so an environmental model can only give rise to a distribution of trajectories when conditioned on actions, which we typically draw from our current policy, $a \sim \pi$. At any given time step $t$, our model learning objective is thus to maximise
$$\mathbb{E}_{\rho(M, \tau_{:H^+})}\left[\log p_\theta(\tau_{:H^+} \mid a_{:H^+-1})\right], \tag{7}$$
where $\rho(M, \tau_{:H^+})$ is the trajectory distribution induced by our policy and we slightly abuse notation by denoting by $\tau$ the state-reward trajectories, excluding the actions. In the following, we drop the conditioning on $a_{:H^+-1}$ to simplify notation.
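Instead of optimising (7), which is intractable, we can optimise a tractable lower bound, defined with a learned approximate posterior $q_\phi(m \mid \tau_{:t})$, which can be estimated by Monte Carlo sampling (for the full derivation see Appendix A):
$$\mathbb{E}_{\rho(M, \tau_{:H^+})}\left[\log p_\theta(\tau_{:H^+})\right] \geq \mathbb{E}_{\rho}\left[\mathbb{E}_{q_\phi(m \mid \tau_{:t})}\left[\log p_\theta(\tau_{:H^+} \mid m)\right] - \mathrm{KL}\left(q_\phi(m \mid \tau_{:t}) \,\|\, p_\theta(m)\right)\right] = \mathrm{ELBO}_t. \tag{8}$$
The term $\mathbb{E}_{q}\left[\log p(\tau_{:H^+} \mid m)\right]$ is often referred to as the reconstruction loss, and $p(\tau_{:H^+} \mid m)$ as the decoder. The term $\mathrm{KL}\left(q(m \mid \tau_{:t}) \,\|\, p_\theta(m)\right)$ is the KL-divergence between our variational posterior $q_\phi$ and the prior over the embeddings $p_\theta(m)$. For sufficiently expressive decoders, we are free to choose $p_\theta(m)$. We set the prior to our previous posterior, $q_\phi(m \mid \tau_{:t-1})$, with initial prior $q_\phi(m) = \mathcal{N}(0, I)$.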
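As can be seen in Equation 8 and Figure 2, when the agent is at timestep $t$, we encode the past trajectory $\tau_{:t}$ to get the current posterior $q(m \mid \tau_{:t})$, since this is all the information available to perform inference about the current task. We then decode the entire trajectory $\tau_{:H^+}$ including the future, i.e., model $\mathbb{E}_{q_\phi(m \mid \tau_{:t})}\left[p(\tau_{:H^+} \mid m)\right]$. This differs from the conventional VAE setup and is possible since we have access to this information during training. Decoding not only the past but also the future is important because this way, variBAD learns to perform inference about unseen states given the past. The reconstruction term $\log p(\tau_{:H^+} \mid m)$ factorises as
$$\log p(\tau_{:H^+} \mid m, a_{:H^+-1}) = \log p(s_0 \mid m) + \sum_{i=0}^{H^+-1}\left[\log p(s_{i+1} \mid s_i, a_i, m) + \log p(r_{i+1} \mid s_i, a_i, s_{i+1}, m)\right]. \tag{9}$$
Here, $p(s_0 \mid m)$ is the initial state distribution $T_0'$, $p(s_{i+1} \mid s_i, a_i, m)$ the transition function $T'$, and $p(r_{i+1} \mid s_i, a_i, s_{i+1}, m)$ the reward function $R'$. From now on, we include $T_0'$ in $T'$ for ease of notation.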
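When both the posterior at step $t$ and the previous posterior used as its prior are diagonal Gaussians, the KL term of the lower bound has a closed form. The sketch below assumes such a parameterisation by mean and log-variance; the function names and the array layout (one row per timestep) are illustrative choices, and treating row 0 as the posterior after the first observation, with $\mathcal{N}(0, I)$ as its prior, is an assumption of this example.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) ), summed over latent dims."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    terms = logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    return 0.5 * np.sum(terms, axis=-1)

def kl_to_previous_posterior(mus, logvars):
    """KL of each posterior to the previous one; the first row uses a standard normal prior.

    mus, logvars: float arrays of shape (num_timesteps, latent_dim).
    """
    mus, logvars = np.asarray(mus, dtype=float), np.asarray(logvars, dtype=float)
    zeros = np.zeros((1, mus.shape[1]))
    prev_mus = np.concatenate([zeros, mus[:-1]])          # prior mean for each step
    prev_logvars = np.concatenate([zeros, logvars[:-1]])  # prior log-variance for each step
    return gaussian_kl(mus, logvars, prev_mus, prev_logvars)
```

Using the previous posterior as the prior penalises only the change in belief at each step, so a posterior that has already concentrated incurs no further KL cost unless new data shifts it.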
3.2 Training Objective
We can now formulate a training objective for learning the approximate posterior distribution over task embeddings, the policy, and the generalised reward and transition functions $R'$ and $T'$. We use deep neural networks to represent the individual components. These are:

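The encoder $q_\phi(m \mid \tau_{:t})$, parameterised by $\phi$;

An approximate transition function $T' = p_\theta^T(s_{i+1} \mid s_i, a_i; m)$ and an approximate reward function $R' = p_\theta^R(r_{i+1} \mid s_i, a_i, s_{i+1}; m)$, which are jointly parameterised by $\theta$; and

A policy $\pi_\psi(a_t \mid s_t, q_\phi(m \mid \tau_{:t}))$ parameterised by $\psi$ and dependent on $\phi$.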
The policy is conditioned on both the environment state and the posterior over $m$, $\pi_\psi(a_t \mid s_t, q_\phi(m \mid \tau_{:t}))$. This is similar to the formulation of BAMDPs introduced in Section 2.2, with the difference that we learn a unifying distribution over MDP embeddings, instead of the transition/reward function directly. This makes learning easier since there are fewer parameters to perform inference over, and we can use data from all tasks to learn the shared reward and transition function. The posterior can be represented by the distribution’s parameters (e.g., mean and standard deviation if $q$ is Gaussian).
Our overall objective is to maximise
$$\mathcal{L}(\phi, \theta, \psi) = \mathbb{E}_{p(M)}\left[\mathcal{J}(\psi, \phi) + \lambda \sum_{t=0}^{H^+} \mathrm{ELBO}_t(\phi, \theta)\right]. \tag{10}$$
Expectations are approximated by Monte Carlo samples, and the ELBO can be optimised using the reparameterisation trick (Kingma and Welling, 2014). For $t = 0$, we use the prior $q_\phi(m) = \mathcal{N}(0, I)$. Past trajectories can be encoded using, e.g., a recurrent network as in Duan et al. (2016); Wang et al. (2016) (which we did in our experiments) or using an encoder that computes an encoding per $(s, a, r)$ tuple and aggregates them in some way (Zaheer et al., 2017; Garnelo et al., 2018; Rakelly et al., 2019). The network architecture is shown in Figure 2.
By training the VAE with different context lengths $t$, variBAD can learn how to perform inference online (while the agent is interacting with an environment), and decrease its uncertainty over time given more data. In practice, we may subsample a fixed number of ELBO terms (for random time steps $t$) for computational efficiency if $H^+$ is large.
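Two of the ingredients mentioned above, the reparameterisation trick and the subsampling of ELBO terms, can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, a diagonal Gaussian posterior parameterised by mean and log-variance is assumed, and NumPy provides no gradients, so an actual implementation would express the same computation in an automatic-differentiation framework.

```python
import numpy as np

def sample_latent(mu, logvar, rng):
    """Reparameterised sample m = mu + sigma * eps with eps ~ N(0, I).

    The randomness is isolated in eps, so m is a deterministic,
    differentiable function of the posterior parameters (mu, logvar).
    """
    mu, logvar = np.asarray(mu, dtype=float), np.asarray(logvar, dtype=float)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def subsample_elbo_timesteps(horizon, num_terms, rng):
    """Pick distinct random context lengths t in {0, ..., horizon} for the ELBO sum."""
    num_terms = min(num_terms, horizon + 1)
    return np.sort(rng.choice(horizon + 1, size=num_terms, replace=False))
```

Scaling the subsampled sum by `(horizon + 1) / num_terms` would keep it an unbiased estimate of the full sum; whether to rescale or absorb the factor into the loss weight is a design choice.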
The objective in Equation (10) is optimised end-to-end, and both the VAE and the RL agent are trained simultaneously. The parameter $\lambda$ weights the supervised model learning objective against the RL loss. This is necessary since the parameters $\phi$ are shared between the model and the policy. However, we found that not backpropagating the RL loss through the encoder is typically sufficient in practice, which speeds up training considerably. In practice, we therefore optimise the policy and the VAE using different optimisers with different learning rates. The RL agent and the VAE do not necessarily have to be trained with the same data batches. In our implementation, we typically use on-policy algorithms, so the policy is only trained with the most recent data. The VAE, however, is independent of this and can be trained with any of the previous rollouts. To this end, we maintain a separate buffer of trajectories used to compute the ELBO.
4 Related Work
Meta Reinforcement Learning. A prominent model-free approach to meta-learning is to utilise the dynamics of recurrent neural networks for fast adaptation for RL (Wang et al., 2016; Duan et al., 2016), building on work by Hochreiter et al. (2001) for supervised learning problems. At every time step, the network gets an auxiliary input indicating the action and reward of the preceding step. This allows learning within a task to happen online, entirely in the dynamics of the recurrent network. If we were to remove the decoder in Figure 2 and the VAE objective in Equation (7), variBAD would reduce to this setting, i.e., the main differences are that we use a stochastic latent variable (an inductive bias for representing uncertainty) together with a decoder to reconstruct previous and future transitions and rewards (which acts as an auxiliary loss (Jaderberg et al., 2017), and helps to explicitly encode the task in latent space and deduce information about unseen states). Ortega et al. (2019) provide an in-depth discussion of meta-learning sequential strategies and how memory-based meta-learning can be recast within a Bayesian framework.
Another popular approach to meta RL is to learn an initialisation of the network parameters, such that at test time, only a few gradient steps are necessary to achieve good performance (Finn et al., 2017; Nichol and Schulman, 2018). These methods do not directly account for the fact that the initial policy needs to explore, a problem addressed by Stadie et al. (2018) (E-MAML) and Rothfuss et al. (2019) (ProMP). Other methods that perform gradient adaptation at test time are, among others, Houthooft et al. (2018), who meta-learn a loss function conditioned on the agent’s experience that is used at test time to learn a policy (from scratch); and Sung et al. (2017), who learn a meta-critic that can criticise any actor for any task, and is used at test time to train a policy. Compared to variBAD, these methods usually separate exploration (before gradient adaptation) and exploitation (after gradient adaptation) at test time by design, making them less sample efficient.
Skill / Task Embeddings. Learning (variational) task or skill embeddings for meta / transfer reinforcement learning is used in a variety of approaches. Hausman et al. (2018) learn an embedding space of skills using approximate variational inference (with a different lower bound than variBAD). At test time the policy is fixed, while a new embedder is learned that interpolates between already learned skills. Arnekvist et al. (2019) learn a stochastic embedding of optimal $Q$-functions for different skills, such that the policy can be conditioned on (samples of) this embedding. When learning a new task, adaptation is done in latent space. Co-Reyes et al. (2018) learn a latent space of low-level skills that can be controlled by a higher-level policy, framed within the setting of hierarchical reinforcement learning. This embedding is learned using a VAE to encode state trajectories and decode states and actions. Zintgraf et al. (2019) learn a deterministic task embedding trained similarly to MAML (Finn et al., 2017). Similar to variBAD, Zhang et al. (2018) use learned dynamics and reward modules to learn a latent representation which the policy conditions on and show that transferring the (fixed) encoder to new environments helps learning. Perez et al. (2018) learn dynamic models with auxiliary latent variables similar to variBAD, and use them for model-predictive control. Lan et al. (2019) learn a task embedding with an optimisation procedure similar to MAML, where the encoder is updated at test time, and the policy is fixed. Sæmundsson et al. (2018) explicitly learn an embedding of the environment model, which is subsequently used for model predictive control (and not, like in variBAD, for exploration). In the field of imitation learning, some approaches embed expert demonstrations to represent the task; e.g., Wang et al. (2017) use variational methods and Duan et al. (2017) learn deterministic embeddings.
VariBAD differs from the above methods mainly in what the embedding represents (i.e., task uncertainty over the MDP) and how it is used: the policy conditions on the posterior distribution over environments, allowing the policy to reason about task uncertainty and trade off exploration and exploitation online. Our meta-training objective (8) explicitly optimises for Bayes-optimal behaviour. Unlike some of the above methods, variBAD does not use the model at test time, but model-based planning using variBAD is a natural extension for future work. Recent work of Humplik et al. (2019) also exploits the idea of conditioning the policy on a latent distribution over task embeddings. This embedding is trained using privileged information during training such as a task ID, or an expert policy. VariBAD on the other hand can be applied even when such information is not available.
Bayesian Reinforcement Learning. Bayesian methods for reinforcement learning can be used to quantify uncertainty and support action selection for trading off exploration and exploitation, and provide a way to incorporate prior knowledge into the algorithms (see Ghavamzadeh et al. (2015) for an extensive review). A Bayes-optimal policy is one that optimally trades off exploration and exploitation, and thus maximises expected return during learning. While such a policy can in principle be computed using the BAMDP framework, it is hopelessly intractable for all but the smallest tasks. Existing methods are therefore restricted to small and discrete state / action spaces (Asmuth and Littman, 2011; Guez et al., 2012, 2013), or a discrete set of tasks (Brunskill, 2012; Poupart et al., 2006). VariBAD opens a path to tractable approximate Bayes-optimal exploration for deep reinforcement learning by leveraging ideas from meta-learning and approximate variational inference. However, due to the use of deep neural networks, it lacks the formal guarantees enjoyed by some of the methods mentioned above.
Posterior sampling (Strens, 2000; Osband et al., 2013), which extends Thompson sampling (Thompson, 1933) from bandits to MDPs, estimates a posterior distribution over MDPs (i.e., model and reward functions), in the same spirit as variBAD. This posterior is used to periodically sample a single hypothesis MDP (e.g., at the beginning of an episode), and the policy that is optimal for the sampled MDP is followed subsequently. This approach is less efficient than Bayes-optimal behaviour and therefore typically has lower expected return during learning.
A related approach for inter-task transfer of abstract knowledge is to pose policy search with priors as Markov Chain Monte Carlo inference (Wingate et al., 2011). Similarly, Guez et al. (2013) propose a Monte Carlo Tree Search based method for Bayesian planning to get a tractable, sample-based method for obtaining approximate Bayes-optimal behaviour. Osband et al. (2018) note that non-Bayesian treatment for decision making can be arbitrarily suboptimal and propose a simple randomised prior based approach for structured exploration. Some recent deep RL methods use stochastic latent variables for structured exploration (Gupta et al., 2018; Rakelly et al., 2019), which gives rise to behaviour similar to posterior sampling. Compared to variBAD, these use a different (inference) framework that does not directly encode the MDP (reward and transition function). Other ways to use the posterior to guide exploration are, e.g., certain reward bonuses (Kolter and Ng, 2009; Sorg et al., 2012) and approaches based on the idea of optimism in the face of uncertainty (Kearns and Singh, 2002; Brafman and Tennenholtz, 2002). Non-Bayesian methods for exploration are often used in practice, such as other exploration bonuses (e.g., via state-visitation counts) or by ensuring exploration via uninformed sampling of actions (e.g., $\epsilon$-greedy action selection). In general, such methods are prone to wasteful exploration that does not help maximise expected reward.
Variational Inference and Meta-Learning. A main difference of variBAD to many existing Bayesian RL methods is that we meta-learn the inference procedure, i.e., both the (meaning of the) prior and how to update the posterior, instead of assuming that we have access to the prior or well-behaved distributions for which we can update the posterior analytically. Apart from (RL) methods mentioned above, related work in this direction can be found, among others, in Garnelo et al. (2018); Gordon et al. (2019); Choi et al. (2019). By comparison, variBAD has a different inference procedure tailored to the setting of learning Bayes-optimal policies for a given distribution over MDPs.
POMDPs. Several deep learning approaches to model-free reinforcement learning (Igl et al., 2019) and model learning for planning (Tschiatschek et al., 2018) in partially observable Markov decision processes have recently been proposed and utilise approximate variational inference methods. VariBAD by contrast focuses on BAMDPs (Martin, 1967; Duff and Barto, 2002; Ghavamzadeh et al., 2015), a special case of POMDPs where the model / reward parameters constitute the hidden state and the agent must maintain a belief over them. While in general the hidden state in a POMDP can change at each timestep, in a BAMDP the underlying task, and therefore the hidden state, is fixed over time. We exploit this property by learning an embedding that is fixed over time, unlike approaches like Igl et al. (2019) which use filtering to track the changing hidden state. While we utilise the power of deep approximate variational inference, other approaches for BAMDPs often use more accurate but less scalable methods, e.g., Lee et al. (2019) discretise the latent distribution and use Bayesian filtering for the posterior update.
5 Experiments
In this section we first investigate the properties of variBAD on a didactic gridworld domain. We show that variBAD performs structured and online exploration as it attempts to identify the task at hand. Then we consider more complex meta-learning settings by employing variBAD on two MuJoCo continuous control tasks commonly used in the meta-RL literature. We show that variBAD can learn to adapt to the task during the first rollout, unlike most existing meta-learning techniques. Details and hyperparameters can be found in the appendix.
5.1 Gridworld



Figure 3: (a) Hand-picked but representative example test rollout. The blue background indicates the posterior probability of receiving a reward at that cell. (b) Probability of receiving a reward for each cell, as predicted by the decoder, over the course of interacting with the environment. The black line indicates the average, and the green line is the goal state. (c) Visualisation of the latent space; each line is one latent dimension, the black line is the average.
We start with a didactic gridworld environment, to gain insight into variBAD’s properties and demonstrate the learned policy’s ability to perform structured online exploration. The task is to go to a goal (selected uniformly at random) in a gridworld. Crucially, the goal is unobserved by the agent, inducing task uncertainty and necessitating exploration. The goal can be anywhere except around the starting cell, which is at the bottom left. The horizon within this environment is , i.e., we train an agent to maximise performance within steps. Actions are: up, right, down, left, stay (executed deterministically), and after steps the agent is reset (which looks like an episode break, hence we say that this task has episodes, although strictly speaking it is only one episode with horizon ). The agent gets a sparse reward signal: on non-goal cells, and on the goal cell. The best strategy is to explore until the goal is found, and stay at the goal or return to it when reset to the initial position. We use a latent dimensionality of 5.
Figure 3 illustrates how variBAD behaves at test time with deterministic actions (i.e., all exploration is done by the policy). In Figure 3(a) we see how the agent interacts with the environment, with the blue background visualising the posterior belief by using the learned reward function. VariBAD learns the correct prior and adjusts its belief correctly over time. It predicts no reward for cells it has visited, and explores the remaining cells until it finds the goal.
A nice property of variBAD is that we can gain insight into the agent’s belief about the environment by analysing what the decoder predicts, and how the latent space changes while the agent interacts with the environment. Figure 3(b) shows the reward predictions: each line represents a grid cell and its value the probability of receiving a reward at that cell. As the agent gathers more data, more and more cells are excluded (their predicted reward probability drops to zero), until eventually the agent finds the goal. In Figure 3(c) we visualise the 5-dimensional latent space. We see that once the agent finds the goal, the posterior concentrates: the variance drops close to zero, and the mean settles on a value.
As we showed in Figure 1(e), the behaviour of variBAD closely matches that of the Bayes-optimal policy. Recall that the Bayes-optimal policy is the one which optimally trades off exploration and exploitation in an unknown environment, and outperforms posterior sampling. Our results on this gridworld indicate that variBAD is an effective way to approximate Bayes-optimal control. We also tried a similar approach to Duan et al. (2016); Wang et al. (2016), by using a similar architecture as variBAD but a deterministic embedding (of size ) and no decoder. We observed that this performs worse overall compared to variBAD (see Appendix B for a qualitative and quantitative comparison).
5.2 MuJoCo Continuous Control Meta-Learning Tasks


We show that variBAD is capable of scaling to more complex meta-learning settings by employing it on MuJoCo (Todorov et al., 2012) locomotion tasks commonly used in the meta-RL literature (environments taken from https://github.com/katerakelly/oyster). We consider the HalfCheetahDir environment where the agent has to run either forwards or backwards (i.e., there are only two tasks), and the HalfCheetahVel environment where the agent has to run at different velocities (i.e., there are infinitely many tasks). Both environments have a horizon , and we aim to maximise performance within this horizon, i.e., within a single rollout.
Figure 3(a) shows the performance at test time compared to existing methods. While we show performance for multiple rollouts for the sake of completeness, anything beyond the first rollout is not directly relevant to our goal, which is to maximise performance on a new task, while learning, within a single episode. Only variBAD and RL² are able to adapt to the task at hand within a single episode. RL² underperforms variBAD on the HalfCheetahDir environment, and its learning is slower and less stable (see learning curves in Appendix C). Even though the first rollout includes exploratory steps, the resulting performance matches that of the optimal oracle policy (which gets the true task description as input) up to a small margin. All the other methods, PEARL (Rakelly et al., 2019), E-MAML (Stadie et al., 2018) and ProMP (Rothfuss et al., 2019), are not designed to maximise reward during a single rollout, and perform poorly in this case. They all require substantially more environment interactions in each new task to achieve good performance. PEARL, which is akin to posterior sampling, only starts performing well from the third episode. (Note that PEARL slightly outperforms our oracle, which could be because our oracle is based on PPO, whereas PEARL is based on SAC, which often performs better on MuJoCo benchmarks.) E-MAML and ProMP typically use around  rollouts for the gradient update, which is far less sample efficient than variBAD. In Figure 3(a), we show the performance with smaller batch sizes ( rollouts), collected with the initial policy, and by performing a gradient update also on the learned initialisation.
To get a sense of where these differences might stem from, consider Figure 3(b), which shows example behaviour of the policies during the first three rollouts in the HalfCheetahDir environment when the task is “go left”. Both variBAD and RL² adapt to the task online, whereas PEARL acts according to the current sample, which in the first two rollouts can mean walking in the wrong direction. While variBAD outperforms PEARL at meta-test time, PEARL is more sample efficient during meta-training (see learning curves in Appendix C), since it is an off-policy method. Extending variBAD to off-policy methods is an interesting but orthogonal direction for future work. Overall, our empirical results confirm that variBAD can scale up to current benchmarks and maximise expected reward within a single episode.
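The difference between acting on a per-episode sample and adapting online can be illustrated with a deliberately simplified two-task toy problem. This is a sketch under strong assumptions, not a model of PEARL or variBAD: the task is a direction in {-1, +1}, the per-step reward is `action * true_dir` and is noiseless, and both function names are hypothetical.

```python
import random


def posterior_sampling_returns(true_dir, horizon, n_episodes, rng):
    """Commit to one sampled task hypothesis per episode; update between episodes."""
    belief_right = 0.5  # P(task is "go right"), uniform prior over the two tasks
    returns = []
    for _ in range(n_episodes):
        guess = 1 if rng.random() < belief_right else -1
        rewards = [guess * true_dir for _ in range(horizon)]
        returns.append(sum(rewards))
        # Rewards are noiseless here, so one episode fully identifies the task.
        observed_dir = guess if rewards[0] > 0 else -guess
        belief_right = 1.0 if observed_dir == 1 else 0.0
    return returns


def online_adaptation_returns(true_dir, horizon, n_episodes, rng):
    """Update the belief after every step and act on it immediately."""
    belief_right = 0.5
    returns = []
    for _ in range(n_episodes):
        total = 0
        for _ in range(horizon):
            if belief_right == 0.5:
                action = 1 if rng.random() < 0.5 else -1  # no information yet
            else:
                action = 1 if belief_right > 0.5 else -1
            reward = action * true_dir
            total += reward
            observed_dir = action if reward > 0 else -action
            belief_right = 1.0 if observed_dir == 1 else 0.0
        returns.append(total)
    return returns


rng = random.Random(0)
print(posterior_sampling_returns(-1, horizon=10, n_episodes=3, rng=rng))
print(online_adaptation_returns(-1, horizon=10, n_episodes=3, rng=rng))
```

In this toy setting an unlucky initial sample costs the per-episode sampler the entire first episode, whereas the online agent loses at most one step. The real environments are noisier and PEARL can need more than one rollout, so the sketch only conveys the direction of the effect.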
6 Conclusion
In this paper we presented variBAD, a novel deep reinforcement learning method to approximate Bayes-optimal behaviour. We used the meta-learning framework to utilise knowledge obtained in related tasks, and perform approximate inference over a learned, low-dimensional latent representation of the MDP. In a didactic gridworld environment, we showed that our agent closely matches the behaviour of the Bayes-optimal policy, which is the policy that optimally trades off exploration and exploitation. We further showed that variBAD can be scaled up to challenging MuJoCo tasks, and that it outperforms existing methods in terms of achieved reward during a single episode. In summary, we believe variBAD opens a path to tractable approximate Bayes-optimal exploration for deep reinforcement learning.
Acknowledgments
We thank Anuj Mahajan who contributed to early work on this topic. We also thank Joost van Amersfoort, Andrei Rusu and Dushyant Rao for useful discussions and feedback. Luisa Zintgraf is supported by the Microsoft Research PhD Scholarship Program. Maximilian Igl is supported by the UK EPSRC CDT in Autonomous Intelligent Machines and Systems. Sebastian Schulze is supported by Dyson. This work was supported by a generous equipment grant and a donated DGX-1 from NVIDIA. This project has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713).
References
 Vpe: variational policy embedding for transfer reinforcement learning. In International Conference on Robotics and Automation (ICRA), Cited by: §4.

 Learning is planning: near Bayes-optimal reinforcement learning via Monte-Carlo tree search. In Conference on Uncertainty in Artificial Intelligence (UAI), Cited by: §4.
 A problem in the sequential design of experiments. Sankhyā: The Indian Journal of Statistics (1933-1960) 16 (3/4), pp. 221–229. Cited by: §2.2.

 R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3, pp. 213–231. Cited by: §4.
 Bayes-optimal reinforcement learning for discrete uncertainty domains. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pp. 3:1385–1386. Cited by: §4.
 Acting optimally in partially observable stochastic domains. In Twelfth National Conference on Artificial Intelligence, Note: AAAI Classic Paper Award, 2013 Cited by: §2.2.
 Meta-amortized variational inference and learning. In International Conference on Learning Representations (ICLR), Cited by: §4.

 Self-consistent trajectory autoencoder: hierarchical reinforcement learning with trajectory embeddings. In International Conference on Machine Learning (ICML), Cited by: §4.
 One-shot imitation learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1087–1098. Cited by: §4.
 RL²: fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779. Cited by: §B.3, §3.2, §4, §5.1.
 Optimal learning: computational procedures for Bayes-adaptive Markov decision processes. Ph.D. Thesis, University of Massachusetts at Amherst. Cited by: §1, §2.2, §2.2, §4.
 Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML), Cited by: §4, §4.
 Neural processes. In ICML 2018 Workshop on Theoretical Foundations and Applications of Deep Generative Models, Cited by: §3.2, §4.
 Bayesian reinforcement learning: a survey. Foundations and Trends® in Machine Learning 8 (5-6), pp. 359–483. Cited by: §2.2, §4, §4.
 Meta-learning probabilistic inference for prediction. In International Conference on Learning Representations (ICLR), Cited by: §4.
 Efficient Bayes-adaptive reinforcement learning using sample-based search. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1025–1033. Cited by: §4.
 Scalable and efficient Bayes-adaptive reinforcement learning based on Monte-Carlo tree search. Journal of Artificial Intelligence Research 48, pp. 841–883. Cited by: §4, §4.
 Meta-reinforcement learning of structured exploration strategies. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §4.
 Learning an embedding space for transferable robot skills. In International Conference on Learning Representations (ICLR), Cited by: §4.
 Learning to learn using gradient descent. In International Conference on Artificial Neural Networks (ICANN), pp. 87–94. Cited by: §4.
 Evolved policy gradients. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5400–5409. Cited by: §4.
 Meta reinforcement learning as task inference. arXiv preprint arXiv:1905.06424. Cited by: §4.
 Deep variational reinforcement learning for POMDPs. In International Conference on Machine Learning (ICML), Cited by: §4.
 Reinforcement learning with unsupervised auxiliary tasks. In International Conference on Learning Representations (ICLR), Cited by: §4.
 Planning and acting in partially observable stochastic domains. Artificial Intelligence 101 (1-2), pp. 99–134. Cited by: §1.
 Near-optimal reinforcement learning in polynomial time. Machine Learning 49 (2-3), pp. 209–232. Cited by: §4.
 Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), Cited by: §3.2.
 Near-Bayesian exploration in polynomial time. In International Conference on Machine Learning (ICML), pp. 513–520. Cited by: §4.
 Meta reinforcement learning with task embedding and shared policy. In International Joint Conference on Artificial Intelligence (IJCAI), Cited by: §4.
 Bayesian policy optimization for model uncertainty. In International Conference on Learning Representations (ICLR), Cited by: §4.
 Bayesian decision problems and Markov chains. Wiley. Cited by: §1, §4.
 Reptile: a scalable meta-learning algorithm. arXiv preprint arXiv:1803.02999. Cited by: §4.
 Meta-learning of sequential strategies. arXiv preprint arXiv:1905.03030. Cited by: §4.
 Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 8626–8638. Cited by: §4.
 (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3003–3011. Cited by: §1, §4.

 Efficient transfer learning and online adaptation with latent variable models for continuous control. In Continual Learning Workshop, NeurIPS 2018, Cited by: §4.
 An analytic solution to discrete Bayesian reinforcement learning. In International Conference on Machine Learning (ICML), pp. 697–704. Cited by: §4.
 Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International Conference on Machine Learning (ICML), Cited by: §C.1, §3.2, §4, §5.2.
 ProMP: proximal meta-policy search. In International Conference on Learning Representations (ICLR), Cited by: §C.1, §4, §5.2.
 Meta reinforcement learning with latent variable Gaussian processes. In Conference on Uncertainty in Artificial Intelligence (UAI), Cited by: §4.
 Variance-based rewards for approximate Bayesian reinforcement learning. In Conference on Uncertainty in Artificial Intelligence (UAI 2010), Cited by: §4.
 Some considerations on learning to explore via meta-reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §C.1, §4, §5.2.
 A Bayesian framework for reinforcement learning. In International Conference on Machine Learning (ICML), Vol. 2000, pp. 943–950. Cited by: §1, §4.
 Learning to learn: meta-critic networks for sample efficient learning. arXiv preprint arXiv:1706.09529. Cited by: §4.
 On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 (3/4), pp. 285–294. Cited by: §1, §4.
 MuJoCo: a physics engine for model-based control. In IROS, pp. 5026–5033. ISBN 9781467317375. Cited by: §5.2.
 Variational inference for data-efficient model learning in POMDPs. arXiv preprint arXiv:1805.09281. Cited by: §4.
 Learning to reinforcement learn. In Annual Meeting of the Cognitive Science Community (CogSci), Cited by: §B.3, §3.2, §4, §5.1.
 Robust imitation of diverse behaviors. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5320–5329. Cited by: §4.
 Bayesian policy search with policy priors. In International Joint Conference on Artificial Intelligence (IJCAI), Cited by: §4.
 Deep sets. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3391–3401. Cited by: §3.2.
 Decoupling dynamics and reward for transfer learning. In ICLR workshop track, Cited by: §4.
 Fast context adaptation via meta-learning. In International Conference on Machine Learning (ICML), Cited by: §4.
Appendix A Full ELBO derivation
Equation (8) can be derived as follows.
(11)  
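For orientation, the standard route to an evidence lower bound of this kind goes through Jensen's inequality. The following is a sketch in assumed notation ($\tau$ for the full trajectory, $\tau_{:t}$ for the part observed up to time $t$, $m$ for the latent task variable, $q$ for the approximate posterior and $p(m)$ for the prior). It may not match Equation (8) term by term.

```latex
\begin{align}
\log p(\tau)
  &= \log \int p(\tau, m)\, dm \\
  &= \log \mathbb{E}_{q(m \mid \tau_{:t})}\!\left[ \frac{p(\tau \mid m)\, p(m)}{q(m \mid \tau_{:t})} \right] \\
  &\geq \mathbb{E}_{q(m \mid \tau_{:t})}\!\left[ \log p(\tau \mid m) + \log p(m) - \log q(m \mid \tau_{:t}) \right] \\
  &= \mathbb{E}_{q(m \mid \tau_{:t})}\!\left[ \log p(\tau \mid m) \right]
     - \mathrm{KL}\!\left( q(m \mid \tau_{:t}) \,\|\, p(m) \right)
\end{align}
```

The inequality is Jensen's inequality applied to the concave logarithm. The first term of the last line is a reconstruction term and the second a regulariser towards the prior.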
Appendix B Experiments: Gridworld
B.1 Additional Remarks
Figure 2(c) visualises how the latent space changes as the agent interacts with the environment. As we can see, the value of the latent dimensions starts around mean and variance , which is the prior we chose for the beginning of an episode. Given that the variance increases for a little bit before the agent finds the goal, this prior might not be optimal. A natural extension of variBAD is therefore to also learn the prior to match the task at hand.
B.2 Hyperparameters
We used the PyTorch framework for our experiments. Hyperparameters can be found below.
RL Algorithm  A2C
Number of policy steps (A2C)  10
Epsilon (A2C)  1e-5
Gamma (A2C)  0.99
Max grad norm (A2C)  0.5
Value loss coefficient (A2C)  0.5
Entropy coefficient (A2C)  0.01
GAE parameter tau (A2C)  0.95
Number of parallel processes (A2C)  16
ELBO loss coefficient  1.0
Policy LR  0.001
Policy VAE  0.001
Task embedding size  5
Policy architecture  2 hidden layers, 32 nodes each, Tanh activations
Encoder architecture  FC layer with 40 nodes, GRU with hidden size 64, output layer with 10 outputs, ReLU activations
Reward decoder architecture  2 hidden layers, 32 nodes each, 25 output heads, ReLU activations
Decoder loss function  Binary cross entropy
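The table lists a discount factor (gamma) of 0.99 and a GAE parameter (tau) of 0.95. As a reminder of how these two values interact, here is a minimal sketch of generalised advantage estimation over one rollout segment. The function name and signature are hypothetical and this is not taken from the authors' code.

```python
def gae_advantages(rewards, values, bootstrap_value, gamma=0.99, tau=0.95):
    """Generalised advantage estimates for one rollout segment.

    rewards[t] and values[t] belong to step t; bootstrap_value is the value
    estimate of the state reached after the last step. Episode termination
    inside the segment is not handled in this sketch.
    """
    advantages = [0.0] * len(rewards)
    next_value = bootstrap_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * tau * running  # exponentially weighted sum
        advantages[t] = running
        next_value = values[t]
    return advantages


# With zero value estimates, a single reward at the last of three steps is
# propagated backwards with weight gamma * tau per step.
print(gae_advantages([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], 0.0))
```

With tau = 1 the estimate reduces to the discounted return minus the value baseline, and with tau = 0 to the one-step TD residual. The value 0.95 sits between the two.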
B.3 Comparison to RL²
Figure 5 shows the behaviour of variBAD in comparison to a recurrent policy, an architecture which resembles the approach presented by Duan et al. (2016) and Wang et al. (2016). As we can see, the recurrent policy revisits states it has already seen before, indicating that it does task inference less efficiently. We believe that the stochastic latent embedding helps the policy express task uncertainty better, and that the auxiliary loss of decoding the embedding helps learning.
Figure 6 shows the learning curves for this environment (for the full horizon ) for variBAD and RL², in comparison to the hard-coded optimal policy (which has access to the goal position), the Bayes-optimal policy, and the posterior sampling policy (evaluated on a horizon of , i.e., the first three “episodes” shown in Figure 1). As we can see, variBAD closely approximates Bayes-optimal behaviour, and outperforms RL².


Appendix C MuJoCo
C.1 Learning Curves
Figure 7 shows the learning curves for the MuJoCo environments for all approaches. The oracle policy was trained using PPO. Our approach to hyperparameter tuning was to tune the hyperparameters for PPO using the oracle policy, and then use these for both variBAD and RL², only further tuning those aspects particular to those methods (mostly VAE parameters and latent dimension). PEARL (Rakelly et al., 2019) was trained using the reference implementation provided by the authors. The environments we used are also taken from this implementation. E-MAML (Stadie et al., 2018) and ProMP (Rothfuss et al., 2019) were trained using the reference implementation provided by Rothfuss et al. (2019).
C.2 Additional Remarks
Note that even though we maximise performance for a horizon of for both RL² and variBAD, we sometimes keep the posterior distribution (or the hidden state of the RNN) obtained during one rollout when resetting the environment. This ensures that the agent can learn how to act for more than one episode.
We observe that RL² is unstable when it comes to maintaining its performance over multiple rollouts. We hypothesize this is due to a specific property of the HalfCheetah environments: the state can give information about the task once the agent has already adapted. For example, in the HalfCheetahDir environment, the x-position is part of the state observed by the agent. At the beginning of the episode, when starting close to the origin, the agent has to infer in which direction to run. However, once it moves further and further away from the origin, it is sufficient to rely on the actual environment state to infer which task the agent is currently in. This could lead to the hidden state of the recurrent part of the network being ignored (and taking values that the agent cannot interpret), such that when the agent is reset to the origin, the inference procedure has to be redone. VariBAD does not have this problem, since we train the latent embedding to represent the task, and only the task. Therefore, the agent does not have to do the inference procedure again when reset to the starting position, but can rely on the latent task description that is given by the approximate posterior.
C.3 Hyperparameters
We used the PyTorch framework for our experiments. Hyperparameters can be found below. We trained only a reward decoder, since the state transitions stay the same across environments.
RL Algorithm  PPO
Number of policy steps  200
Epochs (PPO)  2
Minibatches  4
Max grad norm (PPO)  0.5
Clip parameter (PPO)  0.2
Value loss coefficient (PPO)  0.5
Entropy coefficient (PPO)  0.01
Number of parallel processes (PPO)  16
Notes  We use a Huber loss in the RL loss
ELBO loss coefficient  1.0
Policy LR  0.0007
Policy VAE  0.001
Task embedding size  5
Number of frames used for training  5e+7
Policy architecture  2 hidden layers, 128 nodes each, Tanh activations
Encoder architecture  FC layer with 208 nodes, GRU with hidden size 64, output layer with 5 outputs, ReLU activations
Reward decoder architecture  2 hidden layers, 32 nodes each, ReLU activations
Reward decoder loss function  Mean squared error
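Two entries in the table, the PPO clip parameter (0.2) and the Huber loss, are easiest to read next to their definitions. The sketch below gives per-sample versions of both. The function names are hypothetical, the Huber threshold `delta=1.0` is an assumption, and the table does not say which term of the RL loss the Huber loss is applied to.

```python
def ppo_clipped_objective(ratio, advantage, clip_param=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio is the new-to-old action probability ratio r, advantage is A.
    """
    clipped_ratio = max(1.0 - clip_param, min(ratio, 1.0 + clip_param))
    return min(ratio * advantage, clipped_ratio * advantage)


def huber_loss(prediction, target, delta=1.0):
    """Quadratic for errors up to delta, linear beyond (delta is assumed)."""
    error = abs(prediction - target)
    if error <= delta:
        return 0.5 * error ** 2
    return delta * (error - 0.5 * delta)


# A large ratio earns no extra credit when the advantage is positive,
# but is penalised in full when the advantage is negative.
print(ppo_clipped_objective(1.5, 2.0), ppo_clipped_objective(1.5, -2.0))
print(huber_loss(0.5, 0.0), huber_loss(3.0, 0.0))
```

The clipping removes the incentive to move the policy ratio outside [0.8, 1.2] in the direction that would increase the objective, and the Huber loss limits the influence of large errors compared to a squared loss.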