Recent advances in reinforcement learning algorithms combined with deep neural networks have led to rapid progress in several difficult domains (e.g. Mnih et al., 2015; Silver et al., 2017; Heess et al., 2017; OpenAI et al., 2018). Remarkably, the reinforcement learning algorithms which solved these tasks succeeded without access to prior knowledge about the structure of the tasks they were solving. Whilst the ability to learn with minimal prior knowledge is desirable, it can lead to computationally intensive training.
This inefficiency of learning should be contrasted with human behavior. When trying to master a new skill, our learning progress relies heavily on prior knowledge that we have collected while solving previous instances of similar problems. The hope is that artificial agents can similarly develop the ability to quickly learn if they have been previously trained in sufficiently rich multi-task settings in which the ability to learn is essential for success.
To study the emergence of efficient learning grounded in prior knowledge about a task distribution, several recent papers have turned to a well-established “meta” perspective on reinforcement learning (Duan et al., 2016; Wang et al., 2016; Mishra et al., 2018; Ritter et al., 2018). An optimal reinforcement learning algorithm is an agent acting in an unknown Markov decision process (MDP) which minimizes cumulative regret, i.e. the difference between the observed rewards and the best rewards it could have obtained had it known the environment. The problem of finding an optimal algorithm can be reformulated as maximizing future discounted rewards in a partially observable Markov decision process (POMDP) whose dynamics are the same as those of the unknown MDP but whose unobserved state also contains the parameters (e.g. reward function or transition probabilities) of the MDP (see e.g. Duff 2002; Poupart et al. 2006; Brunskill 2012). We will refer to this POMDP as a meta-RL POMDP because solving it for a given initial distribution over the parameters of the unknown MDP yields a reinforcement learning algorithm which is (on average) optimal when applied to MDPs drawn from this distribution.
In general, the optimal policy in a POMDP depends on the full history of actions, observations, and rewards. This dependence on the agent’s experience can be captured by a sufficient statistic called the belief state (Kaelbling et al., 1998). In the case of the meta-RL POMDP, the relevant part of the belief state is the posterior distribution over the unknown MDP (which we’ll refer to as a task), given the agent’s experience. Reasoning about this belief state is at the heart of Bayesian reinforcement learning (Strens, 2000), and many algorithms with optimal regret guarantees such as Thompson sampling (Agrawal & Goyal, 2013) effectively separate the algorithm for estimating the belief state from that for acting based on this estimate. In this work we aim to exploit a similar separation of concerns between task inference and acting in situations in which analytic solutions are intractable, and we thus have to learn the underlying components.
We develop a two-stream architecture for meta-reinforcement learning that augments the agent with a separate belief network whose role is to estimate the belief state. We show that we can learn recurrent agents off-policy, and, in particular, that we can effectively train the recurrent belief network off-policy via supervised learning using a variety of predictive losses.
In a meta-learning setup in which the task distribution is under the designer’s control, the task specifications are privileged information which is available at training time. We demonstrate that training the belief network with such privileged information is particularly effective, and enables the agent to solve several meta-learning problems more efficiently than an agent without additional supervision. Note that privileged information is not required during test time. We further present similar findings for other task-related auxiliary losses which do not directly model the belief state such as: inferring actions that an optimal agent for the unknown task would take, and inferring a task label which was learned in a pre-training phase using training tasks.
As all of these objectives rely on privileged information about the unknown task provided by the teacher during training, we will consider a set-up in which we only train the agent on a finite set of training tasks from the task distribution that we are interested in, and we evaluate it on a separate holdout set of tasks from the same or a similar distribution. All of our environments reflect this evaluation protocol.
In the next sections we formalize the connection between meta reinforcement learning and task inference, and then we apply these ideas to several environments of varying difficulty, including a complex continuous control environment in which a robot has to learn an efficient search strategy which requires it to remember information from more than 100 steps ago.
Our main contribution is to demonstrate that leveraging privileged information about the unknown task to directly learn the belief state is a simple way to boost the performance of meta reinforcement learning agents, and that such privileged information can be effectively incorporated into off-policy reinforcement learning algorithms.
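Our method relies on basic results for MDPs and POMDPs. An MDP is a tuple $(\mathcal{X}, \mathcal{A}, P, p_0, R)$, where $\mathcal{X}$ is the state space, $\mathcal{A}$ is the action space, $P(x' \mid x, a)$ is the transition probability between states $x, x' \in \mathcal{X}$ due to an action $a \in \mathcal{A}$, $p_0(x_0)$ is the distribution of initial states, and $R(r \mid x, a, x')$ is the probability of obtaining reward $r$ after transitioning to a state $x'$ from $x$ due to an action $a$. A POMDP is a tuple $(\mathcal{X}, \mathcal{A}, P, p_0, R, \Omega, O)$ which generalizes MDPs by including an observation space $\Omega$, and the probability $O(o \mid x', a)$ of observing $o \in \Omega$ after transitioning to a state $x'$ due to an action $a$.
We denote sequences of states as $x_{0:t} = (x_0, \dots, x_t)$, and similarly for observations, actions, and rewards. In POMDPs, we further define the observed trajectory as $\tau_{0:t} = (o_{0:t}, a_{0:t-1}, r_{0:t-1})$. Given a policy $\pi(a_t \mid \tau_{0:t})$, the joint distribution between the states and the trajectory factorizes as
$$p_\pi(x_{0:t}, \tau_{0:t}) = p_0(x_0)\, O(o_0 \mid x_0) \prod_{i=0}^{t-1} \pi(a_i \mid \tau_{0:i})\, P(x_{i+1} \mid x_i, a_i)\, R(r_i \mid x_i, a_i, x_{i+1})\, O(o_{i+1} \mid x_{i+1}, a_i). \tag{1}$$
The solution to a POMDP is a policy which maximizes discounted returns (Kaelbling et al., 1998), i.e.
$$\pi^* = \arg\max_\pi \; \mathbb{E}_{\tau \sim p_\pi}\Big[\sum_{t \ge 0} \gamma^t r_t\Big], \tag{2}$$
where $\gamma \in [0, 1)$ is the discount factor.
Note that conditioning the policy on past rewards is a subtle, yet important, generalization of what is typically assumed in POMDPs (Izadi & Precup, 2005).
The optimal policy’s dependence on the trajectory can be summarized using the so-called belief state $b_t$, which is a distribution over the state space $\mathcal{X}$, and which satisfies
$$b_t(x) = p(x_t = x \mid \tau_{0:t}). \tag{3}$$
The belief state is a sufficient statistic for the optimal action $a_t$, and so $\pi^*(a_t \mid \tau_{0:t}) = \pi^*(a_t \mid b_t)$ (Kaelbling et al., 1998). It is important to note, however, that in many tasks it is not necessary to have access to the full belief state in order to act optimally.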
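To make the belief update concrete, the following is a minimal sketch of one Bayes-filter step for a small discrete POMDP. The dictionary layout, the name `belief_update`, and the two-state example are illustrative assumptions and are not taken from our implementation. For brevity the sketch ignores the reward likelihood, which would enter as one more multiplicative factor in the correction step.

```python
from typing import Dict, Hashable, Tuple

State = Hashable
Action = Hashable
Obs = Hashable


def belief_update(
    belief: Dict[State, float],
    action: Action,
    observation: Obs,
    transition: Dict[Tuple[State, Action], Dict[State, float]],
    observation_model: Dict[Tuple[State, Action], Dict[Obs, float]],
) -> Dict[State, float]:
    """Return b_{t+1} given b_t, the action a_t and the new observation o_{t+1}."""
    next_states = {x_next for row in transition.values() for x_next in row}
    unnormalized = {}
    for x_next in next_states:
        # Prediction step: push the old belief through the transition model.
        predicted = sum(
            prob * transition.get((x, action), {}).get(x_next, 0.0)
            for x, prob in belief.items()
        )
        # Correction step: weight by the likelihood of the new observation.
        likelihood = observation_model.get((x_next, action), {}).get(observation, 0.0)
        unnormalized[x_next] = predicted * likelihood
    normalizer = sum(unnormalized.values())
    if normalizer == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {x: p / normalizer for x, p in unnormalized.items()}


# Hypothetical two-state example: the hidden state never changes and a
# "listen" action reports the correct state with probability 0.85.
transition = {
    ("left", "listen"): {"left": 1.0},
    ("right", "listen"): {"right": 1.0},
}
observation_model = {
    ("left", "listen"): {"hear_left": 0.85, "hear_right": 0.15},
    ("right", "listen"): {"hear_left": 0.15, "hear_right": 0.85},
}
b0 = {"left": 0.5, "right": 0.5}
b1 = belief_update(b0, "listen", "hear_left", transition, observation_model)
b2 = belief_update(b1, "listen", "hear_left", transition, observation_model)
```

Repeated consistent observations concentrate the belief on one state, which is the sense in which the belief summarizes the whole trajectory.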
3 Meta reinforcement learning and task inference
Meta reinforcement learning aims to train agents to quickly adapt to novel tasks. To model this “few shot” nature of learning, most meta-RL papers consider a set-up in which the agent is given $K$ episodes to explore and adapt to a fixed task. The performance of a meta learner is measured either as the cumulative rewards during these $K$ episodes (Duan et al., 2016), or as the cumulative rewards in new episodes after these $K$ “adaptation” episodes (Finn et al., 2017). We will be interested in the former measure. Furthermore, we will not explicitly assume the above $K$-shot formulation. Instead, we will only have a single episode, and the $K$-shot structure will be present implicitly in the dynamics of the environment.
Specifically, we define meta reinforcement learning as the problem of finding a memory-based agent which, on average, maximizes future discounted rewards in MDPs from some set $\mathcal{M} = \{(\mathcal{X}, \mathcal{A}, P^w, p_0^w, R^w)\}_{w \in \mathcal{W}}$ without knowing which of these MDPs it is solving. We refer to $\mathcal{W}$ as the task space, $\mathcal{X}$ and $\mathcal{A}$ are the state and action spaces shared by all the MDPs, and $P^w$, $R^w$, and $p_0^w$ are the task-specific transition matrices, reward distributions, and initial state distributions. We will be interested in the average performance of the agent with respect to some prior distribution over tasks $p(w)$.
The solution to the meta learning problem can be formulated as a solution to a POMDP which shares actions with the above MDPs, has states $(x, w) \in \mathcal{X} \times \mathcal{W}$, transition matrix $P(x', w' \mid x, w, a) = \delta(w' - w)\, P^w(x' \mid x, a)$, reward distribution $R^w(r \mid x, a, x')$, initial state distribution $p(w)\, p_0^w(x_0)$, and deterministic observations $o_t = x_t$. Therefore, the optimal agent solves:
$$\max_\pi \; \mathbb{E}_{w \sim p(w)}\, \mathbb{E}_{\pi, w}\Big[\sum_{t \ge 0} \gamma^t r_t\Big], \tag{4}$$
where the inner expectation is over trajectories generated by $\pi$ in the MDP with task label $w$.
This agent should be interpreted as the optimal memory-based policy for the tasks in $\mathcal{M}$ which is constrained to not have access to the task label $w$.
The belief state of this POMDP is
$$b_t(x, w) = \delta(x - x_t)\, p(w \mid \tau_{0:t}), \tag{5}$$
where $p(w \mid \tau_{0:t})$ is the posterior over tasks given what the agent has observed so far. We will overload notation and refer to the posterior alone as the belief state $b_t$ since it is the only interesting part. The fact that the belief state is a sufficient statistic of the past has a natural interpretation in the meta learning setup: when acting at time $t$, the optimal meta-learner only makes decisions based on the current observation $x_t$, and the current belief $b_t$ about which task it is solving. Without relying on the general POMDP result, this can be seen by rewriting Eq. 4 as:
where , and
The posterior $b_t$ is independent of the policy $\pi$, and the pair $(x_t, b_t)$ is Markovian with transition law:
Details and proof are provided in Appendix A. Note that the belief $b_t$ is a deterministic function of the past. The Markovian nature of $(x_t, b_t)$ implies that in order to solve Eq. 7 we can restrict ourselves to policies which only depend on these variables, i.e. $\pi(a_t \mid x_t, b_t)$.
The above discussion can be extended to the more general case in which the tasks in $\mathcal{M}$ are POMDPs rather than MDPs, and this will be the case in some of our environments. If the tasks are POMDPs, then, in addition to $x_t$ and $b_t$, $\pi$ might need to have access to additional information about the past (i.e. the belief state of the task-specific POMDP). For this reason we will typically model $\pi$ as a recurrent network.
The posterior $b_t$ is independent of the policy which generated $\tau_{0:t}$ because the posterior does not explicitly depend on $\pi$. We will rely on this observation in the next section as it will allow us to learn an approximation to $b_t$ from off-policy data. If the tasks in $\mathcal{M}$ are MDPs, then $b_t$ is invariant to permutations of the order of transitions in the trajectory $\tau_{0:t}$. However, since some of our task distributions consist of POMDPs, we will not utilize this property.
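The two properties above can be illustrated with a minimal sketch of the task posterior for a finite task set. The task names, the `REWARD_PROB` table, and the function names are hypothetical and chosen only for illustration. The policy term is the same for every task and cancels in the normalization, so the code never needs it. With MDP tasks, each transition contributes one multiplicative factor, so the order of the transitions does not matter.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Task = Hashable
Transition = Tuple[int, int, int, int]  # (x, a, r, x_next)


def task_posterior(
    prior: Dict[Task, float],
    likelihood: Callable[[Task, Transition], float],
    transitions: List[Transition],
) -> Dict[Task, float]:
    """Posterior over tasks given observed transitions (no policy term needed)."""
    weights = dict(prior)
    for step in transitions:
        for task in weights:
            # Each transition multiplies in P^w(x'|x,a) * R^w(r|x,a,x').
            weights[task] *= likelihood(task, step)
    normalizer = sum(weights.values())
    return {task: w / normalizer for task, w in weights.items()}


# Hypothetical example: one state, two actions, Bernoulli rewards whose
# success probabilities depend on the hidden task.
REWARD_PROB = {"A": (0.9, 0.1), "B": (0.1, 0.9)}


def likelihood(task: Task, step: Transition) -> float:
    _, action, reward, _ = step
    p = REWARD_PROB[task][action]
    return p if reward == 1 else 1.0 - p


prior = {"A": 0.5, "B": 0.5}
trajectory = [(0, 0, 1, 0), (0, 1, 0, 0), (0, 0, 0, 0)]
posterior = task_posterior(prior, likelihood, trajectory)
posterior_reversed = task_posterior(prior, likelihood, trajectory[::-1])
```

In this toy example the posterior probability of task "A" is 0.9 regardless of how the three actions were chosen or in which order the transitions are processed.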
4 Off-policy meta-RL with a learned belief state
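In practice, in order to solve the problem in Eq. 4, we will instead solve
$$\max_\pi \; \frac{1}{|\mathcal{W}_{\mathrm{train}}|} \sum_{w \in \mathcal{W}_{\mathrm{train}}} \mathbb{E}_{\pi, w}\Big[\sum_{t \ge 0} \gamma^t r_t\Big],$$
where $\mathcal{W}_{\mathrm{train}}$ is a training set of tasks and the expectation is over trajectories generated by $\pi$ in task $w$ (but we will evaluate on a holdout set of validation tasks).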
Based on the above theory, we hypothesize that we can speed up learning in this problem by giving the policy access to a representation of the belief state $b_t$. Unfortunately, the belief state is usually not available. Even in situations where the system dynamics and reward distributions are known, the exact posterior is often computationally intractable.
We propose to learn an approximate representation of the belief state by training a neural network which processes the agent’s trajectory and, at every time step $t$, predicts one of the following task-related auxiliary targets $h_t$:
The unknown task description, $h_t = w$. Note that the task description is typically structured, e.g. a pair of spatial coordinates of a target location, which allows for generalization across tasks.
An expert action. We assume that during training we have access to expert policies $\pi^w_e$ for each training task $w$. We define the auxiliary target as $h_t = a^*_t \sim \pi^w_e(\cdot \mid x_t)$.
Pre-trained task embedding $h_t = \phi(i)$, where $i$ is the ID of the current training task. The function $\phi$ can be arbitrary, however we learn it in a pre-training phase (see Appendix B for details).
Given a neural network which predicts this auxiliary target, we can share its representations (e.g. the last layer activations) with the policy (see Section 5). If we can predict the task description, then this representation is guaranteed to represent the belief state. In the case of the other auxiliary targets, we still expect the representation to be an accurate approximation of the belief state because the auxiliary targets are closely related to the task.
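In detail, let $p(h_t \mid \tau_{0:t})$ be the posterior of the auxiliary target $h_t$. We constrain ourselves to auxiliary targets such that $p(h_t \mid \tau_{0:t})$ can be expressed as a policy-independent function of the task posterior $b_t$. This condition is satisfied trivially when $h_t$ is the task description or a pre-trained task embedding. It also holds for expert actions because $p(a^*_t \mid \tau_{0:t}) = \int \pi^w_e(a^*_t \mid x_t)\, p(w \mid \tau_{0:t})\, dw$, where $\pi^w_e$ denotes the expert policy for task $w$.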
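We consider learning a parameterized approximation $q_\theta(h_t \mid \tau_{0:t})$ to $p(h_t \mid \tau_{0:t})$ by minimizing
$$\mathbb{E}_{w \sim p(w)}\, \mathbb{E}_{\mu, w}\Big[\sum_t \mathrm{KL}\big(p(h_t \mid \tau_{0:t}) \,\|\, q_\theta(h_t \mid \tau_{0:t})\big)\Big], \tag{11}$$
where $\mu$ is an arbitrary behavioral policy which is not conditioned on the task label $w$, and the inner expectation is over trajectories generated by $\mu$ in task $w$. Averaging over off-policy trajectories is justified by the previous observation that the posterior $b_t$, and hence also $p(h_t \mid \tau_{0:t})$, is independent of the policy which generated the data as long as that policy was not conditioned on $w$. In cases where $h_t$ is a vector valued variable we approximate the posterior using a factorized distribution which typically overestimates uncertainty.
In order to minimize the objective in Eq. 11, we assume that the target $h_t$ is privileged information, i.e. information available at training time but not during evaluation. Optimizing this posterior can then be implemented in an algorithmically convenient way: we can optimize Eq. 11 with respect to $\theta$ using off-policy data and supervised learning (cf. amortized inference (Gershman & Goodman, 2014; Paige & Wood, 2016)):
$$\max_\theta \; \mathbb{E}_{w \sim p(w)}\, \mathbb{E}_{\mu, w}\Big[\sum_t \log q_\theta(h_t \mid \tau_{0:t})\Big]. \tag{12}$$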
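The remark that a factorized approximation typically overestimates uncertainty can be checked on a toy example. The sketch below is illustrative only: the joint distribution and the helper names are assumptions. It relies on the standard fact that, within the family of fully factorized distributions, the likelihood-optimal approximation is the product of the marginals, whose entropy is never smaller than that of the joint.

```python
import math
from collections import defaultdict
from typing import Dict, Tuple

Outcome = Tuple[int, ...]


def entropy(dist: Dict[Outcome, float]) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)


def product_of_marginals(joint: Dict[Outcome, float]) -> Dict[Outcome, float]:
    """Best fully factorized approximation under the log-likelihood objective."""
    dims = len(next(iter(joint)))
    marginals = [defaultdict(float) for _ in range(dims)]
    for outcome, p in joint.items():
        for d, value in enumerate(outcome):
            marginals[d][value] += p
    factorized: Dict[Outcome, float] = {(): 1.0}
    for marginal in marginals:
        factorized = {
            outcome + (value,): p * q
            for outcome, p in factorized.items()
            for value, q in marginal.items()
        }
    return factorized


# Hypothetical posterior over a two-dimensional target whose coordinates
# are perfectly correlated.
joint = {(0, 0): 0.5, (1, 1): 0.5}
factorized = product_of_marginals(joint)
joint_entropy = entropy(joint)            # log 2
factorized_entropy = entropy(factorized)  # log 4
```

Here the factorized approximation spreads mass over all four outcomes and therefore reports twice the entropy of the true posterior.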
We consider two different ways of sharing a representation of the learned belief state with the policy and value networks, both of which are augmentations of our baseline agent. The architectures that we use are outlined below and in Fig. 1 (see also Appendices E and F):
Baseline LSTM agent: An architecture similar to the one used in the RL$^2$ algorithm (Duan et al., 2016). It is an actor-critic architecture which does not utilize the learned belief state. Actor and critic are separate networks that each process observations with an MLP encoder followed by an LSTM network (Hochreiter & Schmidhuber, 1997). The output features are then linearly mapped to output the parameters of the policy distribution, or a scalar value function. In case of a Q-value function, we also concatenate the action with the output of the MLP before passing the result to the LSTM.
Belief network agent. The approximate belief state is modeled with a separate recurrent belief network which outputs a parameterization of a distribution over the auxiliary variable $h_t$, and is trained to solve the supervised learning problem in Eq. 12. On every step, we augment the inputs fed into the actor and critic with the belief network’s top layer’s features from the previous step. The belief network is trained exclusively using the supervised objective in Eq. 12, and we do not propagate gradients from the agent into the belief network. Note that both policy and value are deterministic functions of the belief state, i.e. we do not rely on samples from the belief distribution.
Auxiliary head agent. A traditional architecture in which an additional MLP is attached to the outputs of the actor’s and critic’s LSTMs, and which outputs a parameterization of a distribution over the auxiliary variable $h_t$. The agent is trained by optimizing the reinforcement learning losses to which we add the auxiliary likelihood loss in Eq. 12 weighted by some hyperparameter.
One practical advantage of the belief network agent over the auxiliary head agent is that the former does not require us to tune the balance between the reinforcement learning and supervised learning objectives, and interference between gradients of competing losses is not an issue.
6 Regularization via information bottleneck
The belief, optimal policy, and value function at time $t$ are all deterministic functions of the trajectory prefix $\tau_{0:t}$. However, overfitting could significantly impair performance, and so we regularize some of our architectures using an information bottleneck (IB) similar to Alemi et al. (2017) and Chalk et al. (2016): we add a stochastic layer on top of the LSTM in the value function, policy, and belief network (see Fig. 1), and we regularize the noise in this layer by adding KL terms to each of these networks’ losses. For instance, for the belief network this corresponds to maximizing the following objective:
We discuss IB regularization in more detail in Appendix C.
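As a rough illustration of the KL term used by such a bottleneck, the sketch below computes the closed-form KL divergence between a diagonal Gaussian and a standard normal prior, and adds it to a loss. The choice of a standard normal prior, the weight `beta`, and all function names are assumptions made for this example. The objective in the text is a maximization, whereas the sketch writes the equivalent loss to be minimized.

```python
import math
import random
from typing import List, Sequence


def kl_to_standard_normal(mean: Sequence[float], std: Sequence[float]) -> float:
    """KL( N(mean, diag(std^2)) || N(0, I) ) in closed form."""
    return 0.5 * sum(
        m * m + s * s - 1.0 - 2.0 * math.log(s) for m, s in zip(mean, std)
    )


def sample_bottleneck(
    mean: Sequence[float], std: Sequence[float], rng: random.Random
) -> List[float]:
    """Reparameterized sample z = mean + std * eps from the stochastic layer."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


def regularized_loss(
    task_loss: float, mean: Sequence[float], std: Sequence[float], beta: float
) -> float:
    """Task loss (e.g. a negative log-likelihood) plus the weighted KL penalty."""
    return task_loss + beta * kl_to_standard_normal(mean, std)


rng = random.Random(0)
z = sample_bottleneck([0.5, -0.5], [1.0, 2.0], rng)
loss = regularized_loss(2.0, [1.0], [1.0], beta=0.1)
```

The penalty vanishes when the layer outputs exactly the prior, and grows as the code becomes more informative about the input.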
For environments with a continuous action space we train the agents with a distributed version of the off-policy SVG(0) algorithm (Heess et al., 2015) which utilizes the Retrace operator (Munos et al., 2016) for learning the action-value function and includes an entropy regularization term in the policy update (Williams & Peng, 1991; Riedmiller et al., 2018). The supervised learning loss from Eq. 12 is added to the actor and critic losses on every iteration. We use the same distributed setup and strategy for initializing recurrent networks during off-policy learning as in Liu et al. (2019).
When the action space is discrete (bandit experiments), our agents are trained with the on-policy PPO algorithm (Schulman et al., 2017). The only change to the original algorithm is that the loss which is being optimized on every iteration also includes the supervised loss from Eq. 12.
8 Related work
Meta reinforcement learning in which the goal is to learn an algorithm for quickly solving tasks drawn from some prespecified distribution has recently received a considerable amount of attention. The various approaches can be roughly divided into two classes based on how much inductive bias is incorporated into the meta learner. Meta learners based on policy gradients typically utilize the MAML framework (Finn et al., 2017; Gupta et al., 2018). Our work is more related to the other class which implements the meta learner as a recurrent neural network without relying on any prior knowledge about learning algorithms (Duan et al., 2016; Wang et al., 2016; Mishra et al., 2018; Ritter et al., 2018). These approaches are closely related to learning to navigate in partially observable domains (Mirowski et al., 2016).
In a concurrent work, Rakelly et al. (2019) consider learning the meta learner off-policy by separating control and task inference, and their algorithm also relies on privileged information about the training task ID. Such privileged information is closely related to having access to expert policies, or to pre-trained task labels as we do. Meta reinforcement learning guided by expert policies has also been studied in a different framework in Mendonca et al. (2019).
Unsupervised learning of the belief state as a means of facilitating learning in POMDPs has been proposed in (Guo et al., 2019; Moreno et al., 2018). More generally, our method is related to learning with auxiliary losses which has been shown to improve the performance of memory-based architectures both in reinforcement learning (e.g. Wayne et al. (2018); Jaderberg et al. (2016)), and in sequence modelling (Trinh et al., 2018). In a tabular setting, relying on the belief state for meta reinforcement learning of a small number of tasks has been proposed in (Brunskill, 2012).
Several works have considered meta-learning in a hierarchical Bayesian setting where “learning” about a task distribution is realized as inferring latent causes that are shared across tasks (or data sets) in a hierarchical probabilistic model (e.g. Fei-Fei et al. (2003); Garnelo et al. (2018b, a)). This idea has recently led to alternative interpretations of “neural” approaches such as MAML (Grant et al., 2018; Yoon et al., 2018). Our aspirations are different in that we learn both the representation and the inference algorithm with the help of privileged information.
Our work is also related to reinforcement learning approaches which take advantage of the knowledge that the reward function or the environment have an additional structure related to multiple tasks (e.g. Teh et al. (2017); Wilson et al. (2007)). The problem of lifelong learning in which the goal is not just to solve multiple tasks but also to generalize to yet unseen tasks has been studied theoretically (e.g. Baxter (2000); Brunskill & Li (2014)), and recently also empirically with state-of-the-art reinforcement learning agents (e.g. Zhang et al. (2018); Nichol et al. (2018)).
Our experiments focus on demonstrating that: A. We can train recurrent policies efficiently with an off-policy algorithm; B. Supervising our agents with privileged information about the task speeds up training across a wide range of environments including standard meta-RL testbeds (Sec. 9.1-9.5); C. Our approach scales to a complex continuous control environment requiring long-term memory (Sec. 9.6); D. Information bottleneck regularization is often an effective way of speeding up learning (see also ablation study in Fig. H.6).
We present results for three additional environments in Appendix G.
9.1 Multi-armed bandit
We first study canonical multi-armed bandit problems which are commonly used as benchmarks for meta-RL (Duan et al., 2016; Wang et al., 2016; Mishra et al., 2018). On every step, the agent pulls one of $k$ arms, and obtains a random reward drawn from a Bernoulli distribution with success probability $p_i$, where $i$ is the arm number. The goal of the agent is to maximize the total reward collected during $T$ pulls without knowing what the arm probabilities are.
Mapping this to the POMDP formulation in Eq. 4, the task description $w = (p_1, \dots, p_k)$ is a vector of arm probabilities, the action space is discrete corresponding to the $k$ arms, and the agent’s input is the action taken and reward obtained on the previous step. We choose the task distribution $p(w)$ to be the uniform distribution on the $k$-dimensional unit hypercube.
For this task we can calculate the belief state exactly. In fact, many well known algorithms for solving this problem such as Thompson sampling, or the Bayes optimal agent based on Gittins indices (Gittins, 1979) rely on an exact computation of the belief state. We will report the performance of these algorithms together with our results.
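For reference, the exact belief state for Bernoulli arms with a uniform prior is a product of Beta distributions, and Thompson sampling acts by sampling from it. The following is a minimal sketch under those assumptions. The function name, the example arm probabilities, and the seed are hypothetical, and this is not the evaluation code used for the reported baselines.

```python
import random
from typing import List, Sequence, Tuple


def thompson_sampling(
    arm_probs: Sequence[float], horizon: int, rng: random.Random
) -> Tuple[int, List[int], List[int]]:
    """Run Thompson sampling on a Bernoulli bandit; return reward and posterior."""
    num_arms = len(arm_probs)
    # Beta(1, 1) is the uniform prior on each arm probability.
    alpha = [1] * num_arms
    beta = [1] * num_arms
    total_reward = 0
    for _ in range(horizon):
        # Sample one plausible task from the belief and act greedily on it.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(num_arms)]
        arm = max(range(num_arms), key=lambda i: samples[i])
        reward = 1 if rng.random() < arm_probs[arm] else 0
        # Exact Bayesian update of the pulled arm's Beta posterior.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total_reward += reward
    return total_reward, alpha, beta


rng = random.Random(0)
total_reward, alpha, beta = thompson_sampling([0.2, 0.8, 0.5], horizon=100, rng=rng)
```

The pair `(alpha, beta)` is the whole belief state here: it is updated without any reference to how the arm was chosen, which mirrors the separation between task inference and acting discussed earlier.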
First we study the effect of supervision and architectures described in the previous section on learning bandit algorithms with horizon , and either or arms. Note that our baseline agent is very similar to the ones considered in Duan et al. (2016); Wang et al. (2016) except that we use PPO for training. Our results are summarized in Fig. 2. We train the agents on 100 training tasks and only report validation performance on the full task distribution $p(w)$. We compare the baseline LSTM agent to agents which are supervised with the arm probabilities (i.e. task descriptions) and with expert actions, where the expert action is the index of the arm with the largest arm probability. We use the belief network architecture for both auxiliary targets, and we also compare to the head architecture with the task description prediction objective. Distributions over the auxiliary targets are parameterized as a beta distribution in the case of the arm probabilities, and as a categorical distribution in the case of expert actions. We make one change to the architectures in Fig. 1, and exclude the belief features from the inputs of the value network, which we found unnecessary. We additionally include two agents based on the belief network architecture which remove the LSTM from the actor, and hence the policy must rely on the learned belief estimate. If the learned belief estimate is accurate, then it should contain all the information about the past that the agent needs (except the current time). However, since the belief is learned, the features are non-stationary which could potentially hinder training.
All combinations of auxiliary targets and architectures sped up learning. The agents which use an MLP instead of an LSTM for the actor performed the best.
To visualize the generalization capabilities of our agents, we also trained them on training sets of tasks of smaller sizes. We summarize our results for the baseline agent, and the best performing agent in Fig. 3. We did not find any of our architectures and/or auxiliary targets to be significantly better at generalizing than the rest.
9.2 Navigation to two targets with Bernoulli distributed sparse rewards
This is a novel environment which can be seen as a variant of the two-armed bandit problem in which the abstract task remains the same but the arm “pulls” need to be physically executed by a simulated “rolling ball” robot. The arm pulls are equivalent to navigating to one of two targets on a two-dimensional platform (see Fig. 4).
The robot starts at the center of the platform equidistant to the two targets. After reaching one of the targets, the agent receives a random reward drawn from a Bernoulli distribution with success probability $p_i$, where $i$ is the target’s id, and then is teleported back to the center of the platform. The agent can be teleported at most 1000 times during one episode. We use the probabilities $(p_1, p_2)$ as the task description $w$. Note that the agent needs to visit the targets multiple times and remember the outcome of the visits in order to accurately estimate the reward probability, and ensure that it commits to the more rewarding target.
Unlike in the multi-armed bandit experiments, we assume that the success probabilities are correlated so that .
The environment is implemented using the MuJoCo simulator (Todorov et al., 2012). The robot possesses a 3 dimensional continuous action space and moves by applying torque in order to rotate around the z-axis, or to accelerate in the forward direction. It can also jump by actuating an invisible slide joint, although the tasks that we consider do not require jumping. The observations consist of the robot’s position and orientation on the platform and several proprioceptive features such as joint positions and velocities which are necessary for movement control.
On average, an optimal agent requires 18 time steps to reach one of the targets (but a naïve agent may require arbitrarily long). Because of this additional complexity of controlling the robot in order to “pull” an arm, the agent needs to assign credit over a much longer time horizon than in the standard two-armed bandit problem, which makes this setup considerably more difficult.
We train our agents on 100 tasks where each , and we only report performance on tasks not seen during training. We consider three different validation sets of 1000 tasks in which is sampled from the following distributions: , and . The last two are different from the training task distribution, and so they test our agents’ robustness to domain shift.
We evaluate four agents: belief architecture regularized with information bottleneck trained to predict either the task description or the ID of the more rewarding target, and the baseline LSTM agent with and without IB regularization. Our results are summarized in Fig. 5. For comparison we also report the performance that a Thompson sampling agent would obtain on the actual bandit analog of this task, i.e. we treat the robot reaching one of the targets as one step in the bandit analog. While the performance of the baseline LSTM agent is unreliable and varies across runs, the Belief network agent, as well as the baseline agent regularized with information bottleneck, consistently perform on par with Thompson sampling even when evaluated on out-of-distribution tasks.
9.3 Locomotion with unknown target speed
This is our implementation of an environment studied in Finn et al. (2017) which consists of a simulated cheetah (Tassa et al., 2018) that is supposed to run at an unknown speed sampled uniformly from the interval . On every step, the agent receives a reward
where is the cheetah’s current speed, and is the target speed which is also the task description. One episode is 10 seconds long, and consists of steps.
Our results are summarized in Fig. 6. From the perspective of meta-learning, this is the simplest environment which we study because the agent only needs to know the speed from the previous step and the corresponding reward in order to identify the task. This is likely why the belief network provides only a small improvement over the baseline LSTM agent.
9.4 Navigation to targets on a semi-circle with deterministic sparse rewards
Next we consider 2d navigation tasks in which the rolling ball robot has to discover one of many possible targets on a semi-circle of radius 3 meters which is centered at the robot’s initial position. Every time the robot reaches the target location, it receives a +1 reward, and then is teleported back to the initial position. Each episode is 50 seconds long (each step is 0.05 seconds).
The task description is an angle parameterizing locations on the semi-circle, and distributions over this task description are modelled as Gaussians.
We compare the same agents as in the previous environments with 100 training, and 1000 validation tasks. The extra supervision in the belief network agent again facilitates learning as demonstrated by the learning curves in Fig. 7.
Fig. 8 shows the dependence of the generalization gap on the number of training tasks. About 20 training tasks already lead to reasonable generalization. We show the full learning curves in Fig. H.3 and Fig. H.4.
Fig. 11A visualizes a typical trajectory of an agent which successfully solved this environment. First, it takes a longer route because it has to search for the target, but once it has found it, it is able to quickly return to it from its initial position.
9.5 Navigation to targets in a square with Bernoulli distributed dense rewards
In this novel environment, the rolling ball robot is guided towards a goal location in a 6x6 square with stochastic rewards whose probability of being is inversely proportional to the distance to the target. Each episode is 60 seconds long, and the agent produces an action every 0.05 seconds. The goal location changes in each episode. Every second time step the agent receives a random Bernoulli distributed reward with success probability where is the current distance between the agent and the target.
The task description is the location of the target represented as a pair of coordinates. The distribution over the task description is a diagonal Gaussian.
The broad task distribution makes this environment quite difficult. In order to ease training, we induce a task curriculum by augmenting the agent’s observation with a cue about the task. Specifically, for each episode, there is a chance that the agent will observe the task, i.e. . On all other episodes, the agent does not observe the task, i.e. . Despite this curriculum, we will evaluate the agent only across episodes in which (a fully hidden evaluation regime). For comparison, we will also report performance on episodes in which the agent sees the task description (a fully visible evaluation regime).
We compare the baseline LSTM agent with and without IB regularization to a belief network agent with IB regularization which predicts the task description. All agents are trained on a training set of 100 tasks, and evaluated on a validation set of 1000 tasks.
Our results are summarized in Fig. 9. The extra supervision in the belief network agent improves the performance on episodes with no information about the task, i.e. . IB regularization in the baseline agent also helps to speed up learning.
Fig. 10 shows the generalization gap of our agents evaluated in the fully hidden regime for various numbers of training tasks. About 100 tasks are sufficient for reliable generalization. We show the full learning curves for each training set size in Fig. H.1 and Fig. H.2.
Fig. 11B visualizes a typical trajectory of an agent which successfully solved this environment. The agent searches for the target, and then it wanders around it in the region where the reward probability is always .
9.6 Path seeking robot
So far we considered scenarios in which the agent had to infer the task by integrating noisy rewards, sensing a dense reward, or by discovering a sparse reward. Here we study a different scenario where the feedback is a sequence of sparse rewards which gradually reveal information about the task. This environment is much more difficult than standard meta-RL environments, including the ones in the previous sections, because the agent has to remember salient events which occurred more than 100 steps ago.
The task is for the rolling ball robot to complete a sequence of movements between tiles arranged in a 3x3 grid on a two-dimensional platform (see Fig. 12). A task consists of visiting tiles in a prescribed sequence. As long as the agent visits tiles in the right order, the tiles light up, but if it touches a tile out of sequence, the lights get reset, and the agent needs to start the sequence from the beginning. If the sequences are of length 4, then the goal is to get to a state where 4 tiles are activated. The activation pattern of the tiles is part of the observation. To improve exploration during learning, the agent is gradually guided towards the correct sequence by a +1 reward that is given the first time it completes a valid subsequence. Therefore, it has to estimate its belief about the true sequence by remembering both the longest successful subsequence and failed attempts to extend this subsequence. When the sequence is completed for the first time, the rewards and lights are reset, and it can keep collecting rewards until it runs out of time.
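The light and reward rules above can be pictured as a small state machine. The following is a minimal sketch under stated assumptions, not the implementation used in our experiments: the class name `TileSequenceTask` and its method `touch` are hypothetical, the robot's physics is ignored, a touch event is assumed to fire only when the robot enters a tile, and a tile touched out of sequence is assumed to restart the sequence if it happens to be the first tile.

```python
from typing import List, Sequence, Tuple


class TileSequenceTask:
    """Toy model of the light and reward logic (no physics)."""

    def __init__(self, sequence: Sequence[int]):
        self.sequence = list(sequence)
        self.progress = 0       # number of tiles currently lit
        self.best_rewarded = 0  # longest prefix that has already been rewarded

    def touch(self, tile: int) -> Tuple[float, List[int]]:
        """Process the robot entering `tile`; return (reward, lit tiles)."""
        reward = 0.0
        if tile == self.sequence[self.progress]:
            self.progress += 1
            if self.progress > self.best_rewarded:
                # +1 only the first time a given subsequence is completed
                self.best_rewarded = self.progress
                reward = 1.0
            if self.progress == len(self.sequence):
                # full sequence completed: lights and rewards are reset
                self.progress = 0
                self.best_rewarded = 0
        else:
            # out of sequence: lights reset (assumption: the touched tile
            # may immediately restart the sequence if it is the first tile)
            self.progress = 1 if tile == self.sequence[0] else 0
        return reward, self.sequence[: self.progress]


task = TileSequenceTask([7, 4, 5, 2])
events = [task.touch(t) for t in [7, 4, 1, 7, 4, 5, 2, 7]]
```

In this toy run the mistake on tile 1 resets the lights, and re-entering tiles 7 and 4 yields no reward because that subsequence was already rewarded. This is why the agent must remember both its longest successful subsequence and its failed extensions.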
We note that the reward function is history dependent, and so the environment is partially observed even when the task is known.
We restrict ourselves to contiguous sequences of neighboring tiles of length at most 4. The task description is a sequence of four digits, where a placeholder digit is used to pad sequences shorter than four. In total, there are 704 tasks. Distributions over the task description are parameterized as four independent categorical distributions assigning probabilities to each digit.
We again consider augmenting the observation with a cue which we obtain by sampling a random mask over the task sequence. The random mask is sampled by first uniformly sampling the number of digits which will be hidden from the agent, and then uniformly sampling from the set of all binary masks with that number of zeros.
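The two-stage mask sampling can be sketched as follows. This is an illustration only: the function name `sample_cue` is hypothetical, the number of hidden digits is assumed to range uniformly from zero to the full sequence length, and hidden digits are represented by `None`, which need not match how the cue is encoded in the agent's observation.

```python
import random
from typing import List, Optional, Sequence


def sample_cue(task: Sequence[int], rng: random.Random) -> List[Optional[int]]:
    """Return the task description with a random subset of digits hidden."""
    n = len(task)
    # Stage 1: number of hidden digits (assumed uniform over 0..n inclusive).
    num_hidden = rng.randint(0, n)
    # Stage 2: uniform choice among all masks with exactly `num_hidden` zeros,
    # implemented as a uniformly random subset of positions of that size.
    hidden = set(rng.sample(range(n), num_hidden))
    return [None if i in hidden else digit for i, digit in enumerate(task)]


rng = random.Random(0)
cues = [sample_cue([7, 4, 5, 2], rng) for _ in range(200)]
```

Sampling the count first, rather than masking each digit independently, makes fully visible and fully hidden cues as likely as any intermediate level of visibility.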
We use 90% of the 704 sequences as training tasks and the rest for validation (concretely we use 90% of sequences of length 1, 90% of sequences of length 2, etc.).
In addition to evaluating our agents in the fully visible, and hidden regimes as described in Sec. 9.5, we will also report performance in a partially visible regime in which the cues are sampled according to the above prescription.
Fig. 13 compares the performance of belief network agents supervised with either the task description, expert actions, or a pre-trained task embedding to the baseline LSTM agent. All presented agents are regularized with an information bottleneck which significantly improves their performance (see Fig. H.6 where we present an ablation study of the role of information bottleneck regularization in the best performing agent).
All of the agents which have additional supervision outperform the baseline LSTM agent on training tasks, and the one which is trained to directly model the belief state by predicting the task description is clearly the best. However, in comparison to the previous environments, it is much more difficult to generalize to validation tasks (previous environments required fewer than 100 tasks for generalization). In particular, the agents which predict expert actions or pre-trained task embeddings are especially prone to overfitting. In fact, the agent which predicts pre-trained task embeddings barely outperforms the baseline agent on validation tasks (despite being much better on training tasks).
The agent which predicts the task description overfits less badly to the training tasks. To further analyze its generalization capabilities we study the training set size dependence of its validation performance in Fig. 14. We show the full learning curves in Fig. H.7. These results suggest that many more tasks would be required to generalize in this complex environment (which, of course, is impossible since the number of tasks is limited).
We also compared the belief network architecture to the auxiliary head architecture in the case of predicting the task description, and we found the former to be better (see Fig. H.5).
To gain an understanding of the agent’s internal representation, Fig. 15 visualizes the belief state of the best agent during one episode. We see that the likelihood that the agent assigns to the true task rapidly increases once it discovers a new digit in the task description. Furthermore, the belief about the value of a particular digit reflects the contiguous structure of the tasks: for example, if the agent knows that the first digit is 7 and the second digit is 4, then the belief about the third digit is non-zero only for tiles neighboring 4 which are not 7 (see third row and third column in Fig. 15).
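Because the belief is parameterized as four independent categorical distributions, the likelihood assigned to the true task is a product of four per-digit probabilities. The sketch below illustrates this computation. The function name `task_log_likelihood`, the assumption of ten symbols per position, and the toy belief are hypothetical choices made for this example.

```python
import math
from typing import Sequence


def task_log_likelihood(
    digit_probs: Sequence[Sequence[float]], task: Sequence[int]
) -> float:
    """Log-probability of `task` under independent per-digit categoricals."""
    return sum(math.log(probs[d]) for probs, d in zip(digit_probs, task))


# Toy belief: the first two digits are known to be 7 and 4, while the
# remaining two are still uniform over ten symbols (an assumption).
uniform = [0.1] * 10
known_7 = [1.0 if d == 7 else 0.0 for d in range(10)]
known_4 = [1.0 if d == 4 else 0.0 for d in range(10)]
belief = [known_7, known_4, uniform, uniform]
log_p = task_log_likelihood(belief, [7, 4, 5, 2])
```

Each newly discovered digit replaces one uniform factor by a factor close to one, which is consistent with the step-like increases in the likelihood of the true task.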
Motivated by the well-known connections between Bayesian inference and efficient reinforcement learning, we have framed meta reinforcement learning, which attempts to learn reinforcement learning algorithms tailored to specific task distributions, as the problem of an agent trying to infer which task it is solving.
We argued that the meta learning problem can be viewed as the cooperation between two different objectives. The first is teaching the agent how to infer which task it is solving using supervised learning and privileged information about the true task. The second is teaching it how to efficiently utilize this inferred estimate via standard reinforcement learning.
We implemented this perspective using several auxiliary losses and different architectures for combining the extra supervision with both on-policy and off-policy reinforcement learning. Our experiments show that the resulting agents are better at learning reinforcement learning algorithms than LSTM baselines. One of the reasons is that the extra supervision helps with shaping the agent’s memory which is an essential component of any learning algorithm–for example, many efficient reinforcement learning algorithms keep track of how many times the agent visited each state (e.g. Strehl & Littman (2005)), a quantity which depends on the agent’s full history.
A possible downside to our approach is that we rely on the availability of ground truth task descriptions for the training tasks. However, in many problems it is the teacher who designs the tasks, in which case it is natural that task descriptions are available.
- Agrawal & Goyal (2013) Agrawal, S. and Goyal, N. Further optimal regret bounds for thompson sampling. In Artificial Intelligence and Statistics, pp. 99–107, 2013.
- Alemi et al. (2017) Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. Deep variational information bottleneck. International Conference on Learning Representations, 2017.
- Baxter (2000) Baxter, J. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.
- Brunskill (2012) Brunskill, E. Bayes-optimal reinforcement learning for discrete uncertainty domains. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems-Volume 3, pp. 1385–1386. International Foundation for Autonomous Agents and Multiagent Systems, 2012.
- Brunskill & Li (2014) Brunskill, E. and Li, L. Pac-inspired option discovery in lifelong reinforcement learning. In International Conference on Machine Learning, pp. 316–324, 2014.
- Chalk et al. (2016) Chalk, M., Marre, O., and Tkacik, G. Relevant sparse codes with variational information bottleneck. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 1957–1965. Curran Associates, Inc., 2016.
- Duan et al. (2016) Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. Rl: Fast reinforcement learning via slow reinforcement learning, 2016.
- Duff (2002) Duff, M. O. Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes. PhD thesis, University of Massachusetts at Amherst, 2002.
- Fei-Fei et al. (2003) Fei-Fei, L. et al. A bayesian approach to unsupervised one-shot learning of object categories. In Proceedings Ninth IEEE International Conference on Computer Vision, pp. 1134–1141. IEEE, 2003.
- Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1126–1135, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
- Garnelo et al. (2018a) Garnelo, M., Rosenbaum, D., Maddison, C., Ramalho, T., Saxton, D., Shanahan, M., Teh, Y. W., Rezende, D., and Eslami, S. A. Conditional neural processes. In International Conference on Machine Learning, pp. 1690–1699, 2018a.
- Garnelo et al. (2018b) Garnelo, M., Schwarz, J., Rosenbaum, D., Viola, F., Rezende, D. J., Eslami, S., and Teh, Y. W. Neural processes. arXiv preprint arXiv:1807.01622, 2018b.
- Gershman & Goodman (2014) Gershman, S. and Goodman, N. Amortized inference in probabilistic reasoning. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 36, 2014.
- Gittins (1979) Gittins, J. C. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society. Series B (Methodological), pp. 148–177, 1979.
- Grant et al. (2018) Grant, E., Finn, C., Levine, S., Darrell, T., and Griffiths, T. Recasting gradient-based meta-learning as hierarchical bayes. In International Conference on Learning Representations, 2018.
- Guo et al. (2019) Guo, Z. D., Azar, M. G., Piot, B., Pires, B. A., and Munos, R. Neural predictive belief representations, 2019. URL https://openreview.net/forum?id=ryfz73C9KQ.
- Gupta et al. (2018) Gupta, A., Mendonca, R., Liu, Y., Abbeel, P., and Levine, S. Meta-reinforcement learning of structured exploration strategies. arXiv preprint arXiv:1802.07245, 2018.
- Heess et al. (2015) Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. Learning continuous control policies by stochastic value gradients. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pp. 2944–2952. Curran Associates, Inc., 2015.
- Heess et al. (2017) Heess, N., Sriram, S., Lemmon, J., Merel, J., Wayne, G., Tassa, Y., Erez, T., Wang, Z., Eslami, A., Riedmiller, M., et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
- Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Izadi & Precup (2005) Izadi, M. T. and Precup, D. Using rewards for belief state updates in partially observable markov decision processes. In European Conference on Machine Learning, pp. 593–600. Springer, 2005.
- Jaderberg et al. (2016) Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
- Kaelbling et al. (1998) Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998.
- Lei Ba et al. (2016) Lei Ba, J., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Liu et al. (2019) Liu, S., Lever, G., Heess, N., Merel, J., Tunyasuvunakool, S., and Graepel, T. Emergent coordination through competition. In International Conference on Learning Representations, 2019.
- Mendonca et al. (2019) Mendonca, R., Gupta, A., Kralev, R., Abbeel, P., Levine, S., and Finn, C. Guided meta-policy search. arXiv preprint arXiv:1904.00956, 2019.
- Mirowski et al. (2016) Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A. J., Banino, A., Denil, M., Goroshin, R., Sifre, L., Kavukcuoglu, K., et al. Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673, 2016.
- Mishra et al. (2018) Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. A simple neural attentive meta-learner. 2018.
- Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Moreno et al. (2018) Moreno, P., Humplik, J., Papamakarios, G., Buesing, L., Heess, N., and Weber, T. Neural belief states for partially observed domains. In NeurIPS 2018 workshop on Reinforcement Learning under Partial Observability, 2018.
- Munos et al. (2016) Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems 29, pp. 1054–1062. 2016.
- Nichol et al. (2018) Nichol, A., Pfau, V., Hesse, C., Klimov, O., and Schulman, J. Gotta learn fast: A new benchmark for generalization in rl. arXiv preprint arXiv:1804.03720, 2018.
- OpenAI et al. (2018) OpenAI, Andrychowicz, M., Baker, B., Chociej, M., Jozefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., Schneider, J., Sidor, S., Tobin, J., Welinder, P., Weng, L., and Zaremba, W. Learning dexterous in-hand manipulation, 2018.
- Paige & Wood (2016) Paige, B. and Wood, F. Inference networks for sequential monte carlo in graphical models. In International Conference on Machine Learning, pp. 3040–3049, 2016.
- Poupart et al. (2006) Poupart, P., Vlassis, N., Hoey, J., and Regan, K. An analytic solution to discrete bayesian reinforcement learning. In Proceedings of the 23rd international conference on Machine learning, pp. 697–704. ACM, 2006.
- Rakelly et al. (2019) Rakelly, K., Zhou, A., Quillen, D., Finn, C., and Levine, S. Efficient off-policy meta-reinforcement learning via probabilistic context variables. arXiv preprint arXiv:1903.08254, 2019.
- Riedmiller et al. (2018) Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., van de Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. Learning by playing solving sparse reward tasks from scratch. In Proceedings of the 35th International Conference on Machine Learning, pp. 4344–4353, 2018.
- Ritter et al. (2018) Ritter, S., Wang, J., Kurth-Nelson, Z., Jayakumar, S., Blundell, C., Pascanu, R., and Botvinick, M. Been there, done that: Meta-learning with episodic recall. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4354–4363, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
- Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Silver et al. (2017) Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
- Strehl & Littman (2005) Strehl, A. L. and Littman, M. L. A theoretical analysis of model-based interval estimation. In Proceedings of the 22nd international conference on Machine learning, pp. 856–863. ACM, 2005.
- Strens (2000) Strens, M. J. A. A bayesian framework for reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML ’00, pp. 943–950, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. ISBN 1-55860-707-2.
- Tassa et al. (2018) Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D. d. L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018.
- Teh et al. (2017) Teh, Y., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., and Pascanu, R. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4496–4506, 2017.
- Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
- Trinh et al. (2018) Trinh, T. H., Dai, A. M., Luong, M.-T., and Le, Q. V. Learning longer-term dependencies in rnns with auxiliary losses. 2018.
- Wang et al. (2016) Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
- Wayne et al. (2018) Wayne, G., Hung, C.-C., Amos, D., Mirza, M., Ahuja, A., Grabska-Barwinska, A., Rae, J., Mirowski, P., Leibo, J. Z., Santoro, A., et al. Unsupervised predictive memory in a goal-directed agent. arXiv preprint arXiv:1803.10760, 2018.
- Williams & Peng (1991) Williams, R. J. and Peng, J. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.
- Wilson et al. (2007) Wilson, A., Fern, A., Ray, S., and Tadepalli, P. Multi-task reinforcement learning: a hierarchical bayesian approach. In Proceedings of the 24th international conference on Machine learning, pp. 1015–1022. ACM, 2007.
- Yoon et al. (2018) Yoon, J., Kim, T., Dia, O., Kim, S., Bengio, Y., and Ahn, S. Bayesian model-agnostic meta-learning. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 7343–7353. Curran Associates, Inc., 2018.
- Zhang et al. (2018) Zhang, C., Vinyals, O., Munos, R., and Bengio, S. A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893, 2018.
Appendix A Belief-MDP derivation
The following derivation shows that the belief only depends on the current and previous observations, previous action, previous reward and previous belief. Furthermore, if the policy is independent of the task, then the posterior is independent of the policy:
where the last line follows from the assumption that the policy does not explicitly depend on the task.
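For orientation, a recursive belief update of this kind typically takes the following form. The notation is an assumption made for this sketch and may differ from the symbols used in the main text: $w$ denotes the task description, $o_t$, $a_t$ and $r_t$ the observation, action and reward at step $t$, $h_t = (o_{1:t}, a_{1:t-1}, r_{1:t-1})$ the history, $\pi$ the policy, and $b_t(w) = p(w \mid h_t)$ the belief. Proportionality is with respect to $w$.

```latex
\begin{align}
b_t(w) &= p(w \mid h_{t-1}, a_{t-1}, r_{t-1}, o_t) \\
&\propto p(o_t, r_{t-1} \mid h_{t-1}, a_{t-1}, w)\, p(a_{t-1} \mid h_{t-1}, w)\, p(w \mid h_{t-1}) \\
&= p(o_t, r_{t-1} \mid o_{t-1}, a_{t-1}, w)\, \pi(a_{t-1} \mid h_{t-1}, w)\, b_{t-1}(w) \\
&\propto p(o_t, r_{t-1} \mid o_{t-1}, a_{t-1}, w)\, b_{t-1}(w).
\end{align}
```

The third line uses the Markov property of the MDP with parameters $w$. The final step holds when the policy term is constant in $w$, so that it cancels on normalization.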
Appendix B Learning task embeddings
In the Path seeking robot environment, we use a pre-trained task embedding as the auxiliary target for the belief network. We learned this task embedding by jointly training a multitask policy on all training tasks while providing a one-hot encoding of the task ID as an input. Crucially, the one-hot encoding is separately embedded via a 2-layer MLP followed by a stochastic IB layer. The columns of the output stochastic layer corresponding to a particular task ID then form a task embedding that may be more structured than the task ID itself.
Appendix C Regularization via information bottleneck
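The deep variational information bottleneck (Alemi et al., 2017) regularizes neural networks by introducing a stochastic encoding of the input, together with an additional regularization term in the objective whose goal is to minimize the mutual information between the input and this encoding.
In the supervised setting, where we want to learn a mapping from inputs to targets by minimizing an expected prediction loss, IB regularization works by introducing a latent embedding and parameterizing the mapping through a stochastic encoder which produces this embedding from the input. The regularized objective is then to minimize the prediction loss plus a weighted mutual information term between the input and the embedding.
Although this mutual information is intractable in general, it can be upper bounded and estimated from data effectively: the bound is the expected KL divergence between the encoder distribution and a reference distribution over the embedding. This reference distribution is arbitrary and, in practice, is either set fixed or optimized to minimize the upper bound. Below we use a fixed choice.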
While the information bottleneck regularization in Alemi et al. (2017) was derived for supervised learning, we also regularize the policy and critic networks using a stochastic encoder and the above KL regularization, even though they are trained to optimize reinforcement learning losses (see Algorithm 1).
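When the stochastic encoder is a diagonal Gaussian, the KL regularizer has a closed form for a standard normal reference distribution. The sketch below assumes that reference distribution purely for illustration. The function names `gaussian_kl_to_standard_normal` and `regularized_loss` and the coefficient name `beta` are hypothetical.

```python
import math
from typing import Sequence


def gaussian_kl_to_standard_normal(
    mean: Sequence[float], std: Sequence[float]
) -> float:
    """KL( N(mean, diag(std^2)) || N(0, I) ), summed over latent dimensions."""
    return sum(
        0.5 * (s * s + m * m - 1.0) - math.log(s) for m, s in zip(mean, std)
    )


def regularized_loss(
    task_loss: float, mean: Sequence[float], std: Sequence[float], beta: float
) -> float:
    """Task loss plus the weighted KL upper bound on the mutual information."""
    return task_loss + beta * gaussian_kl_to_standard_normal(mean, std)
```

The penalty vanishes only when the encoder output matches the reference distribution, so a larger `beta` pushes the embedding to carry less information about the input.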
Appendix D Results preprocessing
D.1 Bandit experiments
Reported learning curves show the mean episodic return across 100 episodes, evaluated at every iteration and smoothed with a sliding window spanning 10 iterations. Each experiment is repeated 15 times, and error bars are standard deviations of these smoothed curves.
D.2 Continuous control experiments
Reported learning curves show the performance averaged across distributed agents, reported at each iteration and smoothed with a sliding window spanning 50 iterations. Each experiment is repeated 3 times, and error bars are standard deviations of these smoothed curves.
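Both preprocessing recipes in this appendix follow the same pattern: smooth each run, then aggregate across runs. The sketch below is one possible reading rather than the exact procedure used. It assumes a trailing window that is shortened at the start of the curve and the population standard deviation across runs. The function names `smooth` and `mean_and_error` are hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple


def smooth(curve: Sequence[float], window: int) -> List[float]:
    """Trailing moving average; early points average over fewer values."""
    out = []
    for i in range(len(curve)):
        chunk = curve[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def mean_and_error(
    runs: Sequence[Sequence[float]], window: int
) -> Tuple[List[float], List[float]]:
    """Smooth every run, then take the mean and std across runs per iteration."""
    smoothed = [smooth(run, window) for run in runs]
    mean = [statistics.mean(column) for column in zip(*smoothed)]
    error = [statistics.pstdev(column) for column in zip(*smoothed)]
    return mean, error
```

With `window=10` and 15 runs this corresponds to the bandit setting, and with `window=50` and 3 runs to the continuous control setting.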
Appendix E Algorithmic details for SVG(0)
The agents are trained in a distributed way, similar to Riedmiller et al. (2018). Several worker processes independently collect trajectories of the agent’s interactions with the environment, each one unroll length long, and send them to a shared replay buffer whose capacity is a fixed number of trajectories. A learner process (see Algorithm 1) then uniformly samples batches of trajectories from the replay buffer and updates the networks via a gradient descent step on the appropriate SVG(0) losses augmented with the auxiliary prediction loss from Eq. 12, KL regularization terms from Eq. 15, and policy entropy regularization. The learner then shares the updated network parameters with the workers.
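The interaction between workers and the learner through the replay buffer can be pictured with the following single-process sketch. It is an illustration under assumptions, not the distributed implementation: the class name `TrajectoryReplay` is hypothetical, the oldest trajectories are assumed to be evicted first once the buffer is full, and batches are sampled uniformly with replacement.

```python
import random
from collections import deque
from typing import Any, List


class TrajectoryReplay:
    """Fixed-capacity buffer of trajectories with uniform sampling."""

    def __init__(self, capacity: int, seed: int = 0):
        # deque(maxlen=...) drops the oldest item when full (assumed policy)
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def __len__(self) -> int:
        return len(self._buffer)

    def add(self, trajectory: Any) -> None:
        """Called by workers after collecting one unroll."""
        self._buffer.append(trajectory)

    def sample(self, batch_size: int) -> List[Any]:
        """Called by the learner; uniform sampling with replacement."""
        return [self._rng.choice(self._buffer) for _ in range(batch_size)]


replay = TrajectoryReplay(capacity=3)
for step in range(5):
    replay.add([step])
batch = replay.sample(10)
```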
Actor, critic, and belief networks all encode inputs with a 3-layer MLP with ELU activation functions, except for the first layer, where we apply layer normalization (Lei Ba et al., 2016) followed by a TANH activation. In the critic network we augment the encoded inputs with the action (processed with a TANH) to be evaluated. The networks then pass the encoded inputs to an LSTM network. If we include IB regularization, then the LSTM outputs the parameters of a diagonal Gaussian distribution. Actor and critic networks linearly map either samples from the IB encoder, or outputs of the LSTM, to either parameters of the policy distribution or a scalar value function. The belief network maps the LSTM outputs (or IB encoder samples) to parameters of a distribution over the auxiliary target using another 2-layer MLP (200 and 100 units) with ELU activations. The head architecture processes the actor and critic’s LSTM outputs (or IB encoder samples) with a 2-layer MLP. We parameterize the policy and IB encoder with parameters which are mapped to the means and standard deviations of a diagonal Gaussian distribution using a mapping that involves a sigmoid function.
Unless specified otherwise, we use the following hyperparameters in our experiments:
Actor learning rate,
Critic learning rate,
Belief network learning rate,
Target update period:
Actor network encoder: MLP with sizes
Critic network encoder: MLP with sizes
Actor LSTM size:
Critic LSTM size:
Belief LSTM size:
Number of parallel actors:
Belief bottleneck parameters
Belief bottleneck dimension:
Belief bottleneck loss coefficient:
Agent bottleneck parameters (Belief agents)
Actor bottleneck dimension:
Actor bottleneck loss coefficient:
Critic bottleneck dimension:
Critic bottleneck loss coefficient:
Agent bottleneck parameters (LSTM agents)
Actor bottleneck dimension:
Actor bottleneck loss coefficient:
Critic bottleneck dimension:
Critic bottleneck loss coefficient:
Task embedding parameters
Auxiliary loss parameters
Head dimensions: MLP with sizes
Appendix F Algorithmic details for PPO
All architectures consist of an MLP with ELU activation functions, followed by an LSTM with 128 hidden units which uses layer normalization. All networks are optimized with the Adam optimizer with default TensorFlow settings except the learning rate. While the policy loss is clipped according to the PPO objective from Schulman et al. (2017), the value function and belief network are updated using several gradient descent steps without any clipping for each batch of data.
Actor learning rate:
Value learning rate:
Belief network learning rate:
Generalized advantage lambda:
Batch size: episodes
Gradient descent steps per update:
Epsilon in PPO clipped objective:
Belief loss weight in Head architecture:
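For reference, the clipped surrogate objective of Schulman et al. (2017) that is applied to the policy loss can be written in a few lines. This is a minimal sketch of the standard formula on plain Python lists, not our training code. The function name `ppo_clipped_loss` and its argument names are hypothetical, and `epsilon` stands for the clipping hyperparameter listed above.

```python
import math
from typing import Sequence


def ppo_clipped_loss(
    new_log_probs: Sequence[float],
    old_log_probs: Sequence[float],
    advantages: Sequence[float],
    epsilon: float,
) -> float:
    """Negative clipped surrogate objective, averaged over samples."""
    total = 0.0
    for lp_new, lp_old, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # pessimistic choice between the unclipped and clipped terms
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)
```

Taking the minimum removes the incentive to move the probability ratio outside the interval from `1 - epsilon` to `1 + epsilon` when doing so would increase the objective. The value function and belief network losses have no such term.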
Appendix G Additional environments
G.1 Navigation to targets in a square with deterministic sparse rewards
This is the same environment as Navigation to targets on a semi-circle with deterministic sparse rewards, except that the targets are arbitrary locations in a 6x6 square, as in Navigation to targets in a square with Bernoulli distributed dense rewards.
We again consider using cues as in Sec. 9.5 to ease training. Our results are summarized in Fig. G.1. The training set size dependence of the generalization gap is shown in Fig. G.2, and the full learning curves for various training set sizes in Fig. G.3 and Fig. G.4.
Fig. G.5 visualizes a typical trajectory in this environment. Again, the initial search period is long (which is why this task is rather difficult), but once the target is discovered, the agent can easily return to it.
G.2 Navigation with deterministic sparse rewards and a quadruped robot
This is the same environment as Navigation to targets on a semi-circle with deterministic sparse rewards, except that we use a more complex 12 DoF quadruped robot controlled by 8 actuators instead of the rolling ball robot.
Our results showing the advantage of the belief network architecture are summarized in Fig. G.6. We also show the learning curves for smaller training set sizes in Fig. G.7 and Fig. G.8. We have not tuned our algorithm specifically for this task, which is likely why it is not very stable with respect to the number of training tasks.
G.3 Goal navigation with three targets and normally distributed rewards
The rolling ball robot is tasked with navigating to one of three targets without knowing which one. Upon reaching each target, the agent receives a random reward drawn from a Gaussian distribution and is then teleported back to the initial position. An episode ends once the robot has been teleported 1000 times.
The expected rewards in each target are constrained to satisfy
We define the task distribution over the task descriptions via the following sampling procedure. We randomly sample two distinct indices , and means, . Then we set . Distributions over this task description are parameterized as Gaussian distributions with diagonal covariance matrix.
We compare the baseline LSTM agent with and without IB regularization to a belief network agent with IB regularization which predicts the task description. All agents are trained on a training set of 100 tasks, and evaluated on a validation set of 1000 tasks. Our results are summarized in Fig. G.9. Only the agent with extra supervision was able to solve this environment. This is likely because the rewards are very noisy, and often negative, which makes the strategy of avoiding the targets a reasonable local optimum.
Appendix H Additional results for environments in the main text
In this section we present additional learning curves for experiments in the main text.