Meta reinforcement learning as task inference

05/15/2019, by Jan Humplik, et al.

Humans achieve efficient learning by relying on prior knowledge about the structure of naturally occurring tasks. There has been considerable interest in designing reinforcement learning algorithms with similar properties. This includes several proposals to learn the learning algorithm itself, an idea also referred to as meta learning. One formal interpretation of this idea is in terms of a partially observable multi-task reinforcement learning problem in which information about the task is hidden from the agent. Although agents that solve partially observable environments can be trained from rewards alone, shaping an agent's memory with additional supervision has been shown to boost learning efficiency. It is thus natural to ask what kind of supervision, if any, facilitates meta-learning. Here we explore several choices and develop an architecture that separates learning of the belief about the unknown task from learning of the policy, and that can be used effectively with privileged information about the task during training. We show that this approach can be very effective at solving standard meta-RL environments, as well as a complex continuous control environment in which a simulated robot has to execute various movement sequences.


1 Introduction

Recent advances in reinforcement learning algorithms combined with deep neural networks have led to rapid progress in several difficult domains (e.g. Mnih et al. (2015); Silver et al. (2017); Heess et al. (2017); OpenAI et al. (2018)). Remarkably, the reinforcement learning algorithms which solved these tasks succeeded without access to prior knowledge about the structure of the tasks they were solving. Whilst the ability to learn with minimal prior knowledge is desirable, it can lead to computationally intensive training.

This inefficiency of learning should be contrasted with human behavior. When trying to master a new skill, our learning progress relies heavily on prior knowledge that we have collected while solving previous instances of similar problems. The hope is that artificial agents can similarly develop the ability to quickly learn if they have been previously trained in sufficiently rich multi-task settings in which the ability to learn is essential for success.

To study the emergence of efficient learning grounded in prior knowledge about a task distribution, several recent papers have turned to a well-established “meta” perspective on reinforcement learning (Duan et al., 2016; Wang et al., 2016; Mishra et al., 2018; Ritter et al., 2018). An optimal reinforcement learning algorithm is an agent acting in an unknown Markov decision process (MDP) which minimizes cumulative regret, i.e. the difference between the observed rewards and the best rewards it could have obtained if it had known the environment. The problem of finding an optimal algorithm can be reformulated as maximizing future discounted rewards in a partially observable Markov decision process (POMDP) whose dynamics are the same as those of the unknown MDP but whose unobserved state also contains the parameters (e.g. reward function or transition probabilities) of the MDP (see e.g. Duff 2002; Poupart et al. 2006; Brunskill 2012). We will refer to this POMDP as a meta-RL POMDP because solving it for a given initial distribution over the parameters of the unknown MDP yields a reinforcement learning algorithm which is (on average) optimal when applied to MDPs drawn from this distribution.

In general, the optimal policy in a POMDP depends on the full history of actions, observations, and rewards. This dependence on the agent’s experience can be captured by a sufficient statistic called the belief state (Kaelbling et al., 1998). In the case of the meta-RL POMDP, the relevant part of the belief state is the posterior distribution over the unknown MDP (which we will refer to as a task), given the agent’s experience. Reasoning about this belief state is at the heart of Bayesian reinforcement learning (Strens, 2000), and many algorithms with optimal regret guarantees, such as Thompson sampling (Agrawal & Goyal, 2013), effectively separate the algorithm for estimating the belief state from that for acting based on this estimate. In this work we aim to exploit a similar separation of concerns between task inference and acting in situations in which analytic solutions are intractable, and we thus have to learn the underlying components.

We develop a two-stream architecture for meta reinforcement learning that augments the agent with a separate belief network whose role is to estimate the belief state. We show that we can train recurrent agents off-policy and, in particular, that we can effectively train the recurrent belief network off-policy via supervised learning using a variety of predictive losses.

In a meta-learning setup in which the task distribution is under the designer’s control, the task specifications are privileged information which is available at training time. We demonstrate that training the belief network with such privileged information is particularly effective, and enables the agent to solve several meta-learning problems more efficiently than an agent without additional supervision. Note that privileged information is not required at test time. We further present similar findings for other task-related auxiliary losses which do not directly model the belief state, such as inferring actions that an optimal agent for the unknown task would take, and inferring a task label which was learned in a pre-training phase using the training tasks.

As all of these objectives rely on privileged information about the unknown task provided by the teacher during training, we will consider a set-up in which we only train the agent on a finite set of training tasks from the task distribution that we are interested in, and we evaluate it on a separate holdout set of tasks from the same or similar distribution. All of our environments reflect this evaluation protocol.

In the next sections we formalize the connection between meta reinforcement learning and task inference, and then apply these ideas to several environments of varying difficulty, including a complex continuous control environment in which a robot has to learn an efficient search strategy that requires it to remember information from more than 100 steps ago.

Our main contribution is to demonstrate that leveraging privileged information about the unknown task to directly learn the belief state is a simple way to boost the performance of meta reinforcement learning agents, and that such privileged information can be effectively incorporated into off-policy reinforcement learning algorithms.

2 Preliminaries

Our method relies on basic results for MDPs and POMDPs. An MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, P_0, R)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P(s_{t+1} \mid s_t, a_t)$ is the transition probability between states due to an action $a_t$, $P_0(s_0)$ is the distribution of initial states, and $R(r_t \mid s_{t+1}, s_t, a_t)$ is the probability of obtaining reward $r_t$ after transitioning to a state $s_{t+1}$ from $s_t$ due to an action $a_t$. A POMDP is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{O}, P, P_0, R, U)$ which generalizes MDPs by including an observation space $\mathcal{O}$, and the probability $U(o_{t+1} \mid s_{t+1}, a_t)$ of observing $o_{t+1}$ after transitioning to a state $s_{t+1}$ due to an action $a_t$.

We denote sequences of states as $s_{0:t} = (s_0, \ldots, s_t)$, and similarly for observations, actions, and rewards. In POMDPs, we further define the observed trajectory as $\tau_t = (o_{0:t}, a_{0:t-1}, r_{0:t-1})$. Given a policy $\pi(a_t \mid \tau_t)$, the joint distribution of the states and the trajectory factorizes as

$$p(s_{0:t}, \tau_t) = P_0(s_0)\, U(o_0 \mid s_0) \prod_{t'=1}^{t} \pi(a_{t'-1} \mid \tau_{t'-1})\, P(s_{t'} \mid s_{t'-1}, a_{t'-1})\, R(r_{t'-1} \mid s_{t'}, s_{t'-1}, a_{t'-1})\, U(o_{t'} \mid s_{t'}, a_{t'-1}). \qquad (1)$$

The solution to a POMDP is a policy which maximizes discounted returns (Kaelbling et al., 1998), i.e.

$$\pi^* = \arg\max_{\pi}\ \mathbb{E}_{p(\tau \mid \pi)}\Big[\sum_{t \ge 0} \gamma^t r_t\Big]. \qquad (2)$$

Note that conditioning the policy on past rewards is a subtle, yet important, generalization of what is typically assumed in POMDPs (Izadi & Precup, 2005).

The optimal policy’s dependence on the trajectory can be summarized using the so-called belief state $b_t$, which is a distribution over the state space $\mathcal{S}$, and which satisfies

$$b_t(s) = p(s_t = s \mid \tau_t). \qquad (3)$$

The belief state is a sufficient statistic of the trajectory for the optimal policy, and so $\pi^*(a_t \mid \tau_t) = \pi^*(a_t \mid b_t)$ (Kaelbling et al., 1998). It is important to note, however, that in many tasks it is not necessary to have access to the full belief state in order to act optimally.

3 Meta reinforcement learning and task inference

Meta reinforcement learning aims to train agents to quickly adapt to novel tasks. To model this “few shot” nature of learning, most meta-RL papers consider a set-up in which the agent is given $K$ episodes to explore and adapt to a fixed task. The performance of a meta learner is measured either as the cumulative rewards during these episodes (Duan et al., 2016), or as the cumulative rewards in new episodes after these “adaptation” episodes (Finn et al., 2017). We will be interested in the former measure. Furthermore, we will not explicitly assume the above $K$-shot formulation. Instead, we will only have a single episode, and the $K$-shot structure will be present implicitly in the dynamics of the environment.

Specifically, we define meta reinforcement learning as the problem of finding a memory-based agent which, on average, maximizes future discounted rewards in MDPs from some set $\{(\mathcal{S}, \mathcal{A}, P_\mu, R_\mu, P_{0,\mu})\}_{\mu \in \mathcal{M}}$ without knowing which of these MDPs it is solving. We refer to $\mathcal{M}$ as the task space, $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces shared by all the MDPs, and $P_\mu$, $R_\mu$, and $P_{0,\mu}$ are the task-specific transition matrices, reward distributions, and initial state distributions. We will be interested in the average performance of the agent with respect to some prior distribution over tasks $p(\mu)$.

The solution to the meta learning problem can be formulated as a solution to a POMDP which shares actions with the above MDPs, has states $(s, \mu) \in \mathcal{S} \times \mathcal{M}$, transition matrix $P\big((s', \mu') \mid (s, \mu), a\big) = P_\mu(s' \mid s, a)\, \mathbb{1}[\mu' = \mu]$, reward distribution $R\big(r \mid (s', \mu), (s, \mu), a\big) = R_\mu(r \mid s', s, a)$, initial state distribution $p(\mu)\, P_{0,\mu}(s_0)$, and deterministic observations $o\big((s, \mu)\big) = s$. Therefore, the optimal agent solves:

$$\pi^* = \arg\max_{\pi}\ \mathbb{E}_{\mu \sim p(\mu)}\, \mathbb{E}_{p(\tau \mid \mu, \pi)}\Big[\sum_{t \ge 0} \gamma^t r_t\Big], \qquad (4)$$

where

$$p(\tau_t \mid \mu, \pi) = P_{0,\mu}(s_0) \prod_{t'=1}^{t} \pi(a_{t'-1} \mid \tau_{t'-1})\, P_\mu(s_{t'} \mid s_{t'-1}, a_{t'-1})\, R_\mu(r_{t'-1} \mid s_{t'}, s_{t'-1}, a_{t'-1}). \qquad (5)$$

This agent should be interpreted as the optimal memory-based policy for the tasks in $\mathcal{M}$ which is constrained not to have access to the task label $\mu$.

The belief state of this POMDP is

$$b_t(s, \mu) = \mathbb{1}[s = s_t]\, p(\mu \mid \tau_t), \qquad (6)$$

where $p(\mu \mid \tau_t)$ is the posterior over tasks given what the agent has observed so far. We will overload notation and refer to the posterior alone as the belief state since it is the only interesting part. The fact that the belief state is a sufficient statistic of the past has a natural interpretation in the meta learning setup: when acting at time $t$, the optimal meta-learner only makes decisions based on the current observation $s_t$ and the current belief about which task it is solving. Without relying on the general POMDP result, this can be seen by rewriting Eq. 4 as:

$$\pi^* = \arg\max_{\pi}\ \mathbb{E}_{p(\tau \mid \pi)}\Big[\sum_{t \ge 0} \gamma^t r_t\Big], \qquad (7)$$

where $p(\tau \mid \pi)$ is the marginal trajectory distribution obtained by averaging Eq. 5 over the prior $p(\mu)$, and the marginal one-step dynamics are

$$p(s_{t+1}, r_t \mid \tau_t, a_t) = \sum_{\mu} p(\mu \mid \tau_t)\, P_\mu(s_{t+1} \mid s_t, a_t)\, R_\mu(r_t \mid s_{t+1}, s_t, a_t). \qquad (8)$$

The posterior $b_t = p(\mu \mid \tau_t)$ is independent of the policy $\pi$, and the pair $(s_t, b_t)$ is Markovian with transition law:

(9)

Details and proof are provided in Appendix A. Note that the belief is a deterministic function of the past. The Markovian nature of $(s_t, b_t)$ implies that in order to solve Eq. 7 we can restrict ourselves to policies which only depend on these variables, i.e. $\pi(a_t \mid s_t, b_t)$.
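As a concrete illustration of this recursion, the following minimal sketch maintains the task posterior for a finite set of candidate MDPs with known transition and reward models; the function names, the tabular representation, and the toy tasks are illustrative assumptions rather than the setup used in our experiments.

import numpy as np

def update_belief(belief, s, a, r, s_next, transition_probs, reward_probs):
    # One step of the recursive Bayesian task-posterior update:
    # b_{t+1}(mu) is proportional to b_t(mu) * P_mu(s_next | s, a) * R_mu(r | s_next, s, a).
    # The policy term is shared by all tasks and cancels after normalization.
    likelihood = np.array([
        transition_probs[mu][s, a, s_next] * reward_probs[mu](r, s_next, s, a)
        for mu in range(len(belief))
    ])
    posterior = belief * likelihood
    return posterior / posterior.sum()

# Toy example: two deterministic 2-state tasks that differ only in their reward model.
P = [np.zeros((2, 2, 2)) for _ in range(2)]
for P_mu in P:
    P_mu[0, 0, 1] = P_mu[0, 1, 0] = P_mu[1, 0, 0] = P_mu[1, 1, 1] = 1.0
R = [lambda r, s_next, s, a: 0.9 if r == 1 else 0.1,  # task 0: rewards are likely
     lambda r, s_next, s, a: 0.1 if r == 1 else 0.9]  # task 1: rewards are rare

b = np.array([0.5, 0.5])  # uniform prior over the two tasks
b = update_belief(b, s=0, a=0, r=1, s_next=1, transition_probs=P, reward_probs=R)
print(b)  # [0.9, 0.1]: observing a reward shifts the belief towards task 0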

The above discussion can be extended to the more general case in which the tasks in $\mathcal{M}$ are POMDPs rather than MDPs, and this will be the case in some of our environments. If the tasks are POMDPs then, in addition to $s_t$ and $b_t$, the policy might need to have access to additional information about the past (i.e. the belief state of the task-specific POMDP). For this reason we will typically model the policy as a recurrent network.

The posterior $p(\mu \mid \tau_t)$ is independent of the policy which generated $\tau_t$ because the policy does not explicitly depend on $\mu$. We will rely on this observation in the next section as it will allow us to learn an approximation to $p(\mu \mid \tau_t)$ from off-policy data. If the tasks in $\mathcal{M}$ are MDPs, then the posterior is invariant to permutations of the order of transitions in the trajectory $\tau_t$. However, since some of our task distributions consist of POMDPs, we will not utilize this property.

4 Off-policy meta-RL with a learned belief state

In practice, in order to solve the problem in Eq. 4, we will instead solve

$$\pi^* = \arg\max_{\pi}\ \frac{1}{|\mathcal{M}_{\text{train}}|} \sum_{\mu \in \mathcal{M}_{\text{train}}} \mathbb{E}_{p(\tau \mid \mu, \pi)}\Big[\sum_{t \ge 0} \gamma^t r_t\Big], \qquad (10)$$

where $\mathcal{M}_{\text{train}}$ is a training set of tasks (but we will evaluate on a holdout set of validation tasks).

Based on the above theory, we hypothesize that we can speed up learning in this problem by giving the policy access to a representation of the belief state $p(\mu \mid \tau_t)$. Unfortunately, the belief state is usually not available. Even in situations where the system dynamics and reward distributions are known, the exact posterior is often computationally intractable.

We propose to learn an approximate representation of the belief state by training a neural network which processes the agent’s trajectory and, at every time step $t$, predicts one of the following task-related auxiliary targets $z_t$:

  • The unknown task description, $z_t = \mu$. Note that the task description is typically structured, e.g. a pair of spatial coordinates of a target location, which allows for generalization across tasks.

  • An expert action. We assume that during training we have access to an expert policy $\pi^{\text{expert}}_\mu$ for each training task. We define the auxiliary target as the action the expert would take in the current state, $z_t \sim \pi^{\text{expert}}_\mu(\cdot \mid s_t)$.

  • Pre-trained task embedding $z_t = h(i)$, where $i$ is the ID of the current training task. The function $h$ can be arbitrary; however, we learn it in a pre-training phase (see Appendix B for details).

Given a neural network which predicts this auxiliary target, we can share its representations (e.g. the last layer activations) with the policy (see Section 5). If we can predict the task description, then this representation is guaranteed to represent the belief state. In the case of the other auxiliary targets, we still expect the representation to be an accurate approximation of the belief state because the auxiliary targets are closely related to the task.

In detail, let $p(z_t \mid \tau_t)$ be the posterior of the auxiliary target $z_t$. We constrain ourselves to auxiliary targets such that $p(z_t \mid \tau_t)$ can be expressed as a policy-independent function of the task posterior $p(\mu \mid \tau_t)$. This condition is satisfied trivially when $z_t$ is the task description or a pre-trained task embedding. It also holds for expert actions because $p(z_t \mid \tau_t) = \sum_{\mu} p(\mu \mid \tau_t)\, \pi^{\text{expert}}_\mu(z_t \mid s_t)$.

We consider learning a parameterized approximation $q_\phi(z_t \mid \tau_t)$ to $p(z_t \mid \tau_t)$ by minimizing

$$\mathbb{E}_{\tau \sim \pi_b}\Big[\sum_t \mathrm{KL}\big(p(z_t \mid \tau_t)\,\|\,q_\phi(z_t \mid \tau_t)\big)\Big], \qquad (11)$$

where $\pi_b$ is an arbitrary behavioral policy which is not conditioned on the task label $\mu$. Averaging over off-policy trajectories is justified by the previous observation that the posterior $p(\mu \mid \tau_t)$, and hence also $p(z_t \mid \tau_t)$, is independent of the policy which generated the data as long as it was not conditioned on $\mu$. In cases where $z_t$ is a vector-valued variable, we approximate the posterior using a factorized distribution which typically overestimates uncertainty.

In order to minimize the objective in Eq. 11, we assume that the target $z_t$ is privileged information, i.e. information available at training time but not during evaluation. Optimizing this posterior can then be implemented in an algorithmically convenient way: we can optimize Eq. 11 with respect to $\phi$ using off-policy data and supervised learning (c.f. amortized inference (Gershman & Goodman, 2014; Paige & Wood, 2016)):

$$\min_{\phi}\ \mathbb{E}_{\tau \sim \pi_b}\Big[-\sum_t \log q_\phi(z_t \mid \tau_t)\Big]. \qquad (12)$$
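A minimal sketch of how this supervised objective can be implemented with off-policy data, here in PyTorch with the task description as the target and a diagonal Gaussian output; the class name, layer sizes, and input encoding are illustrative assumptions rather than the exact architecture described in Section 5.

import torch
import torch.nn as nn

class BeliefNetwork(nn.Module):
    # Recurrent approximation q_phi(z_t | tau_t) of the auxiliary-target posterior.
    def __init__(self, input_dim, task_dim, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden), nn.ELU())
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2 * task_dim)  # mean and log-std of a diagonal Gaussian

    def forward(self, traj):  # traj: (batch, time, input_dim) of observations, actions, rewards
        features, _ = self.lstm(self.encoder(traj))
        mean, log_std = self.head(features).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.exp()), features

def belief_loss(belief_net, traj, task_description):
    # Eq. 12: negative log-likelihood of the privileged target at every time step.
    dist, _ = belief_net(traj)
    target = task_description.unsqueeze(1).expand(-1, traj.shape[1], -1)  # same target for all t
    return -dist.log_prob(target).sum(-1).mean()

# The trajectories can come from any behavioural policy that ignores the task label,
# e.g. uniformly sampled replay data, because the posterior is policy independent.
net = BeliefNetwork(input_dim=10, task_dim=2)
traj = torch.randn(4, 50, 10)       # a batch of 4 off-policy trajectories of length 50
task = torch.randn(4, 2)            # privileged task descriptions, one per trajectory
belief_loss(net, traj, task).backward()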

5 Architectures

Figure 1: Agent architectures. A. Baseline LSTM agent. B. Belief network agent. C. Auxiliary head agent. IB represents optional stochastic layers with information bottleneck regularization.

We consider two different ways of sharing a representation of the learned belief state with the policy and value networks, both of which are augmentations of our baseline agent. The architectures that we use are outlined below and in Fig. 1 (see also Appendices E and F):

  • Baseline LSTM agent: An architecture similar to the one used in the RL$^2$ algorithm (Duan et al., 2016). It is an actor-critic architecture which does not utilize the learned belief state. Actor and critic are separate networks that each process observations with an MLP encoder followed by an LSTM network (Hochreiter & Schmidhuber, 1997). The output features are then linearly mapped to the parameters of the policy distribution, or to a scalar value function. In the case of a Q-value function, we also concatenate the action with the output of the MLP before passing the result to the LSTM.

  • Belief network agent. The approximate belief state is modeled with a separate recurrent belief network which outputs a parameterization of a distribution over the auxiliary variable $z_t$, and is trained to solve the supervised learning problem in Eq. 12. On every step, we augment the inputs fed into the actor and critic with the belief network’s top-layer features from the previous step. (The belief network is trained exclusively using the supervised objective in Eq. 12, and we do not propagate gradients from the agent into the belief network.) Note that both policy and value are deterministic functions of the belief state, i.e. we do not rely on samples from the belief distribution.

  • Auxiliary head agent. A traditional architecture in which an additional MLP is attached to the outputs of the actor’s and critic’s LSTMs, and which outputs a parameterization of a distribution over the auxiliary variable $z_t$. The agent is trained by optimizing the reinforcement learning losses to which we add the auxiliary likelihood loss in Eq. 12 weighted by some hyperparameter.

One practical advantage of the belief network agent over the auxiliary head agent is that the former does not require us to tune the balance between the reinforcement learning and supervised learning objectives, and interference between gradients of competing losses is not an issue.
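A sketch of the corresponding forward pass, reusing the BeliefNetwork from the previous sketch; the helper function and the zero-padding of the first step are our own illustrative choices. Detaching the belief features is what keeps the two objectives from interfering.

import torch

def belief_agent_step(actor, belief_net, obs_seq):
    # Two-stream forward pass: the belief network is a separate recurrent stream whose
    # top-layer features from the previous step are appended to the actor's inputs.
    _, belief_features = belief_net(obs_seq)                 # (batch, time, hidden)
    prev_features = torch.cat(
        [torch.zeros_like(belief_features[:, :1]), belief_features[:, :-1]], dim=1)
    # detach(): no policy or value gradients flow into the belief network,
    # which is trained only on the supervised objective in Eq. 12.
    actor_inputs = torch.cat([obs_seq, prev_features.detach()], dim=-1)
    return actor(actor_inputs)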

6 Regularization via information bottleneck

The belief, optimal policy, and value function at time $t$ are all deterministic functions of the trajectory prefix $\tau_t$. However, overfitting could significantly impair performance, and so we regularize some of our architectures using an information bottleneck (IB) similar to Alemi et al. (2017); Chalk et al. (2016): we add a stochastic layer on top of the LSTM in the value function, policy, and belief network (see Fig. 1), and we regularize the noise in this layer by adding KL terms to each of these networks’ losses. For instance, for the belief network this corresponds to maximizing the following objective:

$$\mathbb{E}_{\tau \sim \pi_b}\Big[\sum_t \Big( \mathbb{E}_{e_\phi(y_t \mid \tau_t)}\big[\log q_\phi(z_t \mid y_t)\big] - \beta\, \mathrm{KL}\big(e_\phi(y_t \mid \tau_t)\,\|\,\mathcal{N}(0, I)\big) \Big)\Big], \qquad (13)$$

where $e_\phi(y_t \mid \tau_t)$ denotes the stochastic encoder on top of the belief LSTM and $\beta$ controls the strength of the bottleneck.

We discuss IB regularization in more detail in Appendix C.
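A minimal sketch of such a stochastic layer; the module name, the softplus transform for the scale, and the unit-Gaussian prior are illustrative assumptions (Appendix C discusses the choice of prior).

import torch
import torch.nn as nn

class IBLayer(nn.Module):
    # Stochastic layer placed on top of an LSTM: maps features to a diagonal Gaussian,
    # samples with the reparameterization trick, and returns the KL term that is added
    # (scaled by the bottleneck coefficient) to the network's loss.
    def __init__(self, in_dim, latent_dim):
        super().__init__()
        self.params = nn.Linear(in_dim, 2 * latent_dim)

    def forward(self, features):
        mean, scale_param = self.params(features).chunk(2, dim=-1)
        std = nn.functional.softplus(scale_param)
        encoder = torch.distributions.Normal(mean, std)
        prior = torch.distributions.Normal(torch.zeros_like(mean), torch.ones_like(std))
        sample = encoder.rsample()                                   # reparameterized sample
        kl = torch.distributions.kl_divergence(encoder, prior).sum(-1).mean()
        return sample, kl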

7 Algorithms

For environments with a continuous action space we train the agents with a distributed version of the off-policy SVG(0) algorithm (Heess et al., 2015) which utilizes the Retrace operator (Munos et al., 2016) for learning the action-value function and includes an entropy regularization term in the policy update (Williams & Peng, 1991; Riedmiller et al., 2018). The supervised learning loss from Eq. 12 is added to the actor and critic losses on every iteration. We use the same distributed setup and strategy for initializing recurrent networks during off-policy learning as in Liu et al. (2019).

When the action space is discrete (bandit experiments), our agents are trained with the on-policy PPO algorithm (Schulman et al., 2017). The only change to the original algorithm is that the loss which is being optimized on every iteration also includes the supervised loss from Eq. 12.

Hyperparameters and additional details about the algorithms are provided in Appendix E and F.

8 Related work

Meta reinforcement learning, in which the goal is to learn an algorithm for quickly solving tasks drawn from some prespecified distribution, has recently received a considerable amount of attention. The various approaches can be roughly divided into two classes based on how much inductive bias is incorporated into the meta learner. Meta learners based on policy gradients typically utilize the MAML framework (Finn et al., 2017; Gupta et al., 2018). Our work is more related to the other class, which implements the meta learner as a recurrent neural network without relying on any prior knowledge about learning algorithms (Duan et al., 2016; Wang et al., 2016; Mishra et al., 2018; Ritter et al., 2018). These approaches are closely related to learning to navigate in partially observable domains (Mirowski et al., 2016).

In concurrent work, Rakelly et al. (2019) consider learning the meta learner off-policy by separating control and task inference, and their algorithm also relies on privileged information about the training task ID. Such privileged information is closely related to having access to expert policies, or to pre-trained task labels as we do. Meta reinforcement learning guided by expert policies has also been studied in a different framework in Mendonca et al. (2019).

Unsupervised learning of the belief state as a means of facilitating learning in POMDPs has been proposed in Guo et al. (2019); Moreno et al. (2018). More generally, our method is related to learning with auxiliary losses, which has been shown to improve the performance of memory-based architectures both in reinforcement learning (e.g. Wayne et al. (2018); Jaderberg et al. (2016)) and in sequence modelling (Trinh et al., 2018). In a tabular setting, the usefulness of relying on the belief state for meta reinforcement learning of a small number of tasks was demonstrated in Brunskill (2012).

Several works have considered meta-learning in a hierarchical Bayesian setting where “learning” about a task distribution is realized as inferring latent causes that are shared across tasks (or data sets) in a hierarchical probabilistic model (e.g. Fei-Fei et al. (2003); Garnelo et al. (2018b, a)). This idea has recently led to alternative interpretations of “neural” approaches such as MAML (Grant et al., 2018; Yoon et al., 2018). Our aspirations are different in that we learn both the representation and the inference algorithm with the help of privileged information.

Our work is also related to reinforcement learning approaches which take advantage of the knowledge that the reward function or the environment have an additional structure related to multiple tasks (e.g. Teh et al. (2017); Wilson et al. (2007)). The problem of lifelong learning in which the goal is not just to solve multiple tasks but also to generalize to yet unseen tasks has been studied theoretically (e.g. Baxter (2000); Brunskill & Li (2014)), and recently also empirically with state-of-the-art reinforcement learning agents (e.g. Zhang et al. (2018); Nichol et al. (2018)).

9 Experiments

Our experiments focus on demonstrating that: A. We can train recurrent policies efficiently with an off-policy algorithm; B. Supervising our agents with privileged information about the task speeds up training across a wide range of environments including standard meta-RL testbeds (Sec. 9.1-9.5); C. Our approach scales to a complex continuous control environment requiring long-term memory (Sec. 9.6); D. Information bottleneck regularization is often an effective way to speed up learning (see also the ablation study in Fig. H.6).

We present results for three additional environments in Appendix G.

9.1 Multi-armed bandit

Description

We first study canonical multi-armed bandit problems which are commonly used as benchmarks for meta-RL (Duan et al., 2016; Wang et al., 2016; Mishra et al., 2018). On every step, the agent pulls one of $k$ arms and obtains a random reward drawn from a Bernoulli distribution with success probability $\mu_i$, where $i$ is the arm number. The goal of the agent is to maximize the total reward collected during a fixed budget of pulls without knowing what the arm probabilities are.

Mapping this to the POMDP formulation in Eq. 4, the task description $\mu$ is the vector of arm probabilities, the action space is discrete corresponding to the $k$ arms, and the agent’s input is the action taken and the reward obtained on the previous step. We choose the task distribution $p(\mu)$ to be the uniform distribution on the $k$-dimensional unit hypercube.

For this task we can calculate the belief state exactly. In fact, many well known algorithms for solving this problem such as Thompson sampling, or the Bayes optimal agent based on Gittins indices (Gittins, 1979) rely on an exact computation of the belief state. We will report the performance of these algorithms together with our results.
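For reference, a minimal sketch of Thompson sampling on this problem using the exact belief state, which for the uniform prior on the hypercube is an independent Beta posterior per arm; the arm count and horizon below are placeholders rather than the values used in our experiments.

import numpy as np

rng = np.random.default_rng(0)

def thompson_bandit(arm_probs, horizon):
    # Exact belief state: independent Beta(alpha_i, beta_i) posteriors over the arm probabilities.
    k = len(arm_probs)
    alpha, beta = np.ones(k), np.ones(k)        # Beta(1, 1) = uniform prior for each arm
    total_reward = 0.0
    for _ in range(horizon):
        arm = np.argmax(rng.beta(alpha, beta))  # sample a task from the posterior, act greedily
        reward = float(rng.random() < arm_probs[arm])
        alpha[arm] += reward                    # exact Bayesian belief update
        beta[arm] += 1.0 - reward
        total_reward += reward
    return total_reward

true_probs = rng.uniform(size=5)                # one task drawn from the uniform prior
print(thompson_bandit(true_probs, horizon=100))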

Results

Figure 2: Summary of the effect of auxiliary losses on validation performance during learning of multi-armed bandit algorithms. For comparison we also show the performance of Thompson sampling, as well as that of an agent based on Gittins indices.

First we study the effect of the supervision and architectures described in the previous sections on learning bandit algorithms with a fixed horizon and two different numbers of arms. Note that our baseline agent is very similar to the ones considered in Duan et al. (2016); Wang et al. (2016) except that we use PPO for training. Our results are summarized in Fig. 2. We train the agents on 100 training tasks and only report validation performance on the full task distribution $p(\mu)$. We compare the baseline LSTM agent to agents which are supervised with the arm probabilities (i.e. task descriptions) and with expert actions, where the expert action is the index of the arm with the largest arm probability. We use the belief network architecture for both auxiliary targets, and we also compare to the head architecture with the task description prediction objective. Distributions over the auxiliary targets are parameterized as a beta distribution in the case of the arm probabilities, and as a categorical distribution in the case of expert actions. We make one change to the architectures in Fig. 1, and exclude the belief features from the inputs of the value network, which we found unnecessary. We additionally include two agents based on the belief network architecture which remove the LSTM from the actor, so that the policy must rely on the learned belief estimate. If the learned belief estimate is accurate, then it should contain all the information about the past that the agent needs (except the current time). However, since the belief is learned, the features are non-stationary, which could potentially hinder training.

Figure 3: Training vs. validation performance of the baseline LSTM agent and the best performing agent on multi-armed bandit problems. Agents are evaluated after 500 iterations of training. The number in parentheses indicates the number of arms.

All combinations of auxiliary targets and architectures sped up learning. The agents which use an MLP instead of an LSTM for the actor performed the best.

To visualize the generalization capabilities of our agents, we also trained them on smaller training sets of tasks. We summarize our results for the baseline agent and the best performing agent in Fig. 3. We did not find any of our architectures or auxiliary targets to be significantly better at generalizing than the rest.

9.2 Navigation to two targets with Bernoulli distributed sparse rewards

Description

Figure 4: Navigation to two targets with Bernoulli sparse rewards. The rolling ball robot (large red sphere) has to discover and navigate to a target (small red and green spheres) which, on average, yields more rewards.

This is a novel environment which can be seen as a variant of the two-armed bandit problem in which the abstract task remains the same but the arm “pulls” need to be physically executed by a simulated “rolling ball” robot. The arm pulls are equivalent to navigating to one of two targets on a two-dimensional platform (see Fig. 4).

The robot starts at the center of the platform, equidistant from the two targets. After reaching one of the targets, the agent receives a random reward drawn from a Bernoulli distribution with success probability $\mu_i$, where $i$ is the target’s ID, and is then teleported back to the center of the platform. The agent can be teleported at most 1000 times during one episode. We use the probabilities as the task description $\mu$. Note that the agent needs to visit the targets multiple times and remember the outcomes of these visits in order to accurately estimate the reward probabilities and ensure that it commits to the more rewarding target.

Unlike in the multi-armed bandit experiments, we assume that the success probabilities of the two targets are correlated.

The environment is implemented using the MuJoCo simulator (Todorov et al., 2012). The robot has a 3-dimensional continuous action space and moves by applying torque in order to rotate around the z-axis, or to accelerate in the forward direction. It can also jump by actuating an invisible slide joint, although the tasks that we consider do not require jumping. The observations consist of the robot’s position and orientation on the platform and several proprioceptive features, such as joint positions and velocities, which are necessary for movement control.

On average, an optimal agent requires 18 time steps to reach one of the targets (but a naïve agent may take arbitrarily long). Because of the additional complexity of controlling the robot in order to “pull” an arm, the agent needs to assign credit over a much longer time horizon than in the standard two-armed bandit problem, which makes this setup considerably more difficult.

Results

Figure 5: Performance on various validation task distributions of the baseline LSTM agent with and without information bottleneck, and belief network agents with information bottleneck, in the Navigation to two targets with Bernoulli sparse rewards environment. TS represents the Thompson sampling strategy run on the analogous discrete setup.

We train our agents on 100 tasks, and we only report performance on tasks not seen during training. We consider three different validation sets of 1000 tasks in which the success probabilities are sampled from three different distributions. The last two differ from the training task distribution, and so they test our agents’ robustness to domain shift.

We evaluate four agents: belief architecture regularized with information bottleneck trained to predict either the task description or the ID of the more rewarding target, and the baseline LSTM agent with and without IB regularization. Our results are summarized in Fig. 5. For comparison we also report the performance that a Thompson sampling agent would obtain on the actual bandit analog of this task, i.e. we treat the robot reaching one of the targets as one step in the bandit analog. While the performance of the baseline LSTM agent is unreliable and varies across runs, the Belief network agent, as well as the baseline agent regularized with information bottleneck, consistently perform on par with Thompson sampling even when evaluated on out-of-distribution tasks.

In Appendix G.3, we present further results for a similar environment with three targets and normally distributed rewards.

9.3 Locomotion with unknown target speed

Description

This is our implementation of an environment studied in Finn et al. (2017) which consists of a simulated cheetah (Tassa et al., 2018) that is supposed to run at an unknown target speed sampled uniformly from a fixed interval. On every step, the agent receives a reward which penalizes the deviation of the cheetah’s current speed from the target speed, which is also the task description. One episode is 10 seconds long.

Results

Figure 6: Learning curves for various agents in the Locomotion with unknown speed environment. Solid curves correspond to training performance (100 tasks), while dashed curves to validation performance.

Our results are summarized in Fig. 6. From the perspective of meta-learning, this is the simplest environment which we study because the agent only needs to know the speed from the previous step and the corresponding reward in order to identify the task. This is likely why the belief network provides only a small improvement over the baseline LSTM agent.

9.4 Navigation to targets on a semi-circle with deterministic sparse rewards

Description

Next we consider 2D navigation tasks in which the rolling ball robot has to discover one of many possible targets on a semi-circle of radius 3 meters which is centered at the robot’s initial position. Every time the robot reaches the target location, it receives a +1 reward, and is then teleported back to the initial position. Each episode is 50 seconds long (each step is 0.05 seconds).

Unlike previous papers which studied this environment (Gupta et al., 2018; Rakelly et al., 2019), we train all agents using sparse rewards only. This makes training considerably more difficult.

The task description is an angle parameterizing locations on the semi-circle, and distributions over this task description are modelled as Gaussians.

Results

Figure 7: Learning curves for various agents in the Navigation to targets on a semi-circle with deterministic sparse rewards environment. Solid curves correspond to training performance (100 tasks), while dashed curves to validation performance.

We compare the same agents as in the previous environments with 100 training, and 1000 validation tasks. The extra supervision in the belief network agent again facilitates learning as demonstrated by the learning curves in Fig. 7.

Figure 8: Dependence of the generalization gap on the number of training tasks in the Navigation to targets on a semi-circle with deterministic sparse rewards environment. We show the performance after a fixed number of learner updates.

Fig. 8 shows the dependence of the generalization gap on the number of training tasks. About 20 training tasks already lead to reasonable generalization. We show the full learning curves in Fig. H.3 and Fig. H.4.

Fig. 11A visualizes a typical trajectory of an agent which successfully solved this environment. At first, the agent takes a longer route because it has to search for the target, but once it has found it, it is able to quickly return to it from its initial position.

In Appendix G.1 we present results for a similar environment but with a broader distribution of possible targets, and in Appendix G.2, we study the same navigation tasks with a simulated quadruped instead of the rolling ball robot.

9.5 Navigation to targets in a square with Bernoulli distributed dense rewards

Description

In this novel environment, the rolling ball robot is guided towards a goal location in a 6x6 square by stochastic rewards whose probability of being 1 is inversely proportional to the distance to the target. Each episode is 60 seconds long, and the agent produces an action every 0.05 seconds. The goal location changes in each episode. Every second time step the agent receives a random Bernoulli distributed reward whose success probability decreases with the current distance between the agent and the target.

The task description is the location of the target represented as a pair of coordinates. The distribution over the task description is a diagonal Gaussian.

The broad task distribution makes this environment quite difficult. In order to ease training, we induce a task curriculum by augmenting the agent’s observation with a cue about the task. Specifically, for each episode, there is a fixed chance that the agent will observe the task description as part of this cue; on all other episodes, the cue is uninformative and the agent does not observe the task. Despite this curriculum, we evaluate the agent only on episodes in which the task is hidden (the fully hidden evaluation regime). For comparison, we also report performance on episodes in which the agent sees the task description (the fully visible evaluation regime).
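A sketch of how such a per-episode cue can be generated; the visibility probability and the all-zero encoding of hidden episodes are illustrative assumptions.

import numpy as np

def sample_task_cue(task_description, p_visible, rng=np.random.default_rng()):
    # With probability p_visible the episode is fully visible and the cue equals the
    # task description; otherwise the cue carries no information about the task.
    task_description = np.asarray(task_description, dtype=float)
    if rng.random() < p_visible:
        return task_description
    return np.zeros_like(task_description)

print(sample_task_cue([2.0, -1.5], p_visible=0.5))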

Results

Figure 9: Learning curves for various agents in the Navigation to targets in a square with Bernoulli dense rewards task. Solid curves correspond to training performance (100 tasks), while dashed curves to validation performance.

We compare the baseline LSTM agent with and without IB regularization to a belief network agent with IB regularization which predicts the task description. All agents are trained on a training set of 100 tasks, and evaluated on a validation set of 1000 tasks.

Our results are summarized in Fig. 9. The extra supervision in the belief network agent improves the performance on episodes with no information about the task. IB regularization in the baseline agent also helps to speed up learning.

Figure 10: Dependence of the generalization gap on the number of training tasks in the Navigation to targets in a square with Bernoulli distributed dense rewards environment. We show performance in the fully hidden evaluation regime after a fixed number of learner updates.

Fig. 10 shows the generalization gap of our agents evaluated in the fully hidden regime for various numbers of training tasks. About 100 tasks are sufficient for reliable generalization. We show the full learning curves for each training set size in Fig. H.1 and Fig. H.2.

Figure 11: Example trajectories in navigation tasks. A. Navigation to targets on a semi-circle with deterministic sparse rewards. We show three trajectory segments separated by the robot being teleported from the target to the initial position. B. Navigation to targets in a square with Bernoulli distributed dense rewards.

Fig. 11B visualizes a typical trajectory of an agent which successfully solved this environment. The agent searches for the target, and then wanders around it in the region where the reward probability is highest.

9.6 Path seeking robot

Description

So far we have considered scenarios in which the agent had to infer the task by integrating noisy rewards, sensing a dense reward, or discovering a sparse reward. Here we study a different scenario where the feedback is a sequence of sparse rewards which gradually reveal information about the task. This environment is much more difficult than standard meta-RL environments, including the ones in the previous sections, because the agent has to remember salient events which occurred more than 100 steps ago.

Figure 12: A. Path seeking robot environment. B. The logic of the tasks. We show a simplified example of transitions on a 2x2 grid. In the actual environment, the grid is 3x3, and the robot must learn to move from tile to tile by actuating the rolling ball shown in A.

The task is for the rolling ball robot to complete a sequence of movements between tiles arranged in a 3x3 grid on a two-dimensional platform (see Fig. 12). A task consists of visiting tiles in a prescribed sequence. As long as the agent visits tiles in the right order, the tiles light up, but if it touches a tile out of sequence, the lights are reset and the agent needs to start the sequence from the beginning. If the sequence is of length 4, then the goal is to reach a state in which 4 tiles are activated. The activation pattern of the tiles is part of the observation. To improve exploration during learning, the agent is gradually guided towards the correct sequence by a +1 reward that is given the first time it completes a valid subsequence. Therefore, it has to estimate its belief about the true sequence by remembering both the longest successful subsequence and the failed attempts to extend this subsequence. When the sequence is completed for the first time, the rewards and lights are reset, and the agent can keep collecting rewards until it runs out of time.

We note that the reward function is history dependent, and so the environment is partially observed even when the task is known.

We restrict ourselves to contiguous sequences of neighboring tiles of length at most 4. The task description is a sequence of four numbers identifying the tiles, where a placeholder value is used if the sequence is shorter than 4. In total, there are 704 tasks. Distributions over the task description are parameterized as four independent categorical distributions assigning probabilities to each digit.

We again consider augmenting the observation with a cue, which we obtain by sampling a random mask over the task sequence. The random mask is sampled by first uniformly sampling the number of digits which will be hidden from the agent, and then uniformly sampling from the set of all binary masks with that number of zeros.
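A sketch of this mask-sampling procedure; we assume the number of hidden digits is drawn uniformly from 0 to 4 and that hidden positions are zeroed out in the cue, both of which are illustrative choices.

import numpy as np

def sample_masked_cue(task_digits, rng=np.random.default_rng()):
    task_digits = np.asarray(task_digits)
    n = len(task_digits)
    n_hidden = rng.integers(0, n + 1)            # how many of the digits to hide
    mask = np.ones(n, dtype=int)
    hidden_positions = rng.choice(n, size=n_hidden, replace=False)
    mask[hidden_positions] = 0                   # uniform over all masks with n_hidden zeros
    return task_digits * mask, mask

print(sample_masked_cue([7, 4, 1, 0]))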

Results

We use 90% of the 704 sequences as training tasks and the rest for validation (concretely we use 90% of sequences of length 1, 90% of sequences of length 2, etc.).

In addition to evaluating our agents in the fully visible and fully hidden regimes as described in Sec. 9.5, we will also report performance in a partially visible regime in which the cues are sampled according to the above prescription.

Figure 13: Learning curves for various agents in the Path seeking robot environment. Solid curves correspond to training performance (90% of all tasks), while dashed curves to validation performance (the remaining tasks).
Figure 14: Dependence of the generalization gap on the number of training tasks in the Path seeking robot environment. We show the performance in the fully hidden evaluation regime after a fixed number of learner updates.
Figure 15: Top. A visualization of the belief state of an agent solving the Path seeking robot environment. Each row corresponds to a time in the episode just before discovering a new digit in the unobserved task (the last one is after discovering all digits). Each column corresponds to one of the four digits in the task description. Each platform represents the belief about this digit at that particular time and is visualized as a Hinton diagram: the areas of the squares on the platform are proportional to the belief that they correspond to the digit at that position in the task description. Bottom. Visualization of the log-likelihood that the belief state assigns to the true task during an episode. Events when the agent discovers a new digit in the task are marked with dashed lines.

Fig. 13 compares the performance of belief network agents supervised with either the task description, expert actions, or a pre-trained task embedding to the baseline LSTM agent. All presented agents are regularized with an information bottleneck which significantly improves their performance (see Fig. H.6 where we present an ablation study of the role of information bottleneck regularization in the best performing agent).

All of the agents which have additional supervision outperform the baseline LSTM agent on training tasks, and the one which is trained to directly model the belief state by predicting the task description is clearly the best. However, in comparison to the previous environments, it is much more difficult to generalize to validation tasks (previous environments required less than 100 tasks for generalization). In particular, the agents which predict expert actions or pre-trained task embeddings are especially prone to overfitting. In fact, the agent which predicts pre-trained task embeddings barely outperforms the baseline agent on validation tasks (despite being much better on training tasks).

The agent which predicts the task description overfits the training tasks less severely. To further analyze its generalization capabilities, we study how its validation performance depends on the training set size in Fig. 14. We show the full learning curves in Fig. H.7. These results suggest that many more tasks would be required to generalize in this complex environment (which, of course, is impossible since the number of tasks is limited).

We also compared the belief network architecture to the auxiliary head architecture in the case of predicting the task description, and we found the former to be better (see Fig. H.5).

To gain an understanding of the agent’s internal representation Fig. 15 visualizes the belief state of the best agent during one episode. We see that the likelihood that the agent assigns to the true task rapidly increases once it discovers a new digit in the task description. Furthermore, the belief about the value of a particular digit reflects the contiguous structure of the tasks: for example, if the agent knows that the first digit is 7 and the second digit is 4, then the belief about the third digit is non-zero only for tiles neighboring 4 which are not 7 (see third row and third column in Fig. 15).

10 Discussion

Motivated by the well-known connections between Bayesian inference and efficient reinforcement learning, we have applied the perspective of an agent trying to infer which task it is solving to the problem of meta reinforcement learning which attempts to learn reinforcement learning algorithms tailored to specific task distributions.

We argued that the meta learning problem can be viewed as the cooperation between two different objectives. The first is teaching the agent how to infer which task it is solving using supervised learning and privileged information about the true task. The second is teaching it how to efficiently utilize this inferred estimate via standard reinforcement learning.

We implemented this perspective using several auxiliary losses and different architectures for combining the extra supervision with both on-policy and off-policy reinforcement learning. Our experiments show that the resulting agents are better at learning reinforcement learning algorithms than LSTM baselines. One of the reasons is that the extra supervision helps with shaping the agent’s memory, which is an essential component of any learning algorithm: for example, many efficient reinforcement learning algorithms keep track of how many times the agent visited each state (e.g. Strehl & Littman (2005)), a quantity which depends on the agent’s full history.

A possible downside to our approach is that we rely on the availability of ground truth task descriptions for the training tasks. However, in many problems it is the teacher who designs the tasks, in which case it is natural that task descriptions are available.

References

Appendix A Belief-MDP derivation

The following derivation shows that the belief only depends on the current and previous observations, the previous action, the previous reward, and the previous belief. Furthermore, if the policy is independent of the task $\mu$, then the posterior is independent of the policy:

$$p(\mu \mid \tau_{t+1}) \propto p(\mu, s_{t+1}, r_t, a_t, \tau_t) = p(\tau_t)\, p(\mu \mid \tau_t)\, \pi(a_t \mid \tau_t)\, P_\mu(s_{t+1} \mid s_t, a_t)\, R_\mu(r_t \mid s_{t+1}, s_t, a_t) \propto p(\mu \mid \tau_t)\, P_\mu(s_{t+1} \mid s_t, a_t)\, R_\mu(r_t \mid s_{t+1}, s_t, a_t), \qquad (14)$$

where the last step follows from the assumption that the policy does not explicitly depend on $\mu$.

Appendix B Learning task embeddings

In the Path seeking robot environment, we use a pre-trained task embedding as the auxiliary target for the belief network. We learned this task embedding by jointly training a multitask policy on all training tasks while providing a one-hot encoding of the task ID as an input. Crucially, the one-hot encoding is separately embedded via a 2-layer MLP followed by a stochastic IB layer. The output of the stochastic layer for a particular task ID then forms a task embedding that may be more structured than the task ID itself.
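A sketch of this pre-training component; the layer sizes, the softplus scale transform, and the use of the Gaussian mean as the final embedding are illustrative assumptions (the embedder is trained jointly with the multitask policy, which is omitted here).

import torch
import torch.nn as nn

class TaskEmbedder(nn.Module):
    # One-hot task ID -> 2-layer MLP -> stochastic IB layer (diagonal Gaussian).
    def __init__(self, num_tasks, embed_dim, hidden=64):
        super().__init__()
        self.num_tasks = num_tasks
        self.mlp = nn.Sequential(nn.Linear(num_tasks, hidden), nn.ELU(),
                                 nn.Linear(hidden, hidden), nn.ELU())
        self.ib = nn.Linear(hidden, 2 * embed_dim)

    def forward(self, task_id):
        one_hot = nn.functional.one_hot(task_id, num_classes=self.num_tasks).float()
        mean, scale = self.ib(self.mlp(one_hot)).chunk(2, dim=-1)
        return mean, nn.functional.softplus(scale)

# After joint training with the multitask policy, the embedding used as the belief
# network's auxiliary target for a task is, e.g., the mean returned for that task ID.
embedder = TaskEmbedder(num_tasks=704, embed_dim=8)
z_mean, z_scale = embedder(torch.tensor([3]))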

Appendix C Regularization via information bottleneck

The deep variational information bottleneck (Alemi et al., 2017) regularizes neural networks by introducing a stochastic encoding $Z$ of the input $X$, as well as an additional regularization term in the objective whose goal is to minimize the mutual information $I(X; Z)$.

In the supervised setting where we want to learn a mapping $p(y \mid x)$ to minimize the expected negative log-likelihood $\mathbb{E}_{p(x, y)}\big[-\log p(y \mid x)\big]$, IB regularization works by introducing a latent embedding $Z$, and parameterizing $p(y \mid x)$ as

$$p(y \mid x) = \mathbb{E}_{p(z \mid x)}\big[q(y \mid z)\big],$$

where $p(z \mid x)$ is a stochastic encoder. The regularized objective is then to minimize the loss

$$\mathbb{E}_{p(x, y)}\big[-\log p(y \mid x)\big] + \beta\, I(X; Z).$$

Although $I(X; Z)$ is intractable in general, it can be upper bounded and estimated from data effectively:

$$I(X; Z) \le \mathbb{E}_{p(x)}\big[\mathrm{KL}\big(p(z \mid x)\,\|\,r(z)\big)\big]. \qquad (15)$$

Here, $r(z)$ is an arbitrary distribution which, in practice, is either kept fixed or optimized to minimize the upper bound. Below we set $r(z)$ to a standard Gaussian.
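For completeness, the standard derivation of this bound (following Alemi et al., 2017):

$$I(X; Z) = \mathbb{E}_{p(x,z)}\!\left[\log \frac{p(z \mid x)}{p(z)}\right] = \mathbb{E}_{p(x)}\!\left[\mathrm{KL}\big(p(z \mid x)\,\|\,r(z)\big)\right] - \mathrm{KL}\big(p(z)\,\|\,r(z)\big) \le \mathbb{E}_{p(x)}\!\left[\mathrm{KL}\big(p(z \mid x)\,\|\,r(z)\big)\right],$$

since the KL divergence between the marginal $p(z)$ and $r(z)$ is non-negative, with equality when $r(z) = p(z)$.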

While the information bottleneck regularization in (Alemi et al., 2017) was derived for supervised learning, we also regularize the policy and critic networks using a stochastic encoder and the above KL regularization even though they are trained to optimize reinforcement learning losses (see Algorithm 1).

Appendix D Results preprocessing

D.1 Bandit experiments

Reported learning curves are the mean episodic return across 100 episodes evaluated at every iteration, smoothed with a sliding window spanning 10 iterations. Each experiment is repeated 15 times, and error bars are standard deviations of the smoothed curves.

D.2 Continuous control experiments

Reported learning curves represent the performance averaged across the distributed agents; the performance is reported at each iteration and smoothed with a sliding window spanning 50 iterations. Each experiment is repeated 3 times, and error bars are standard deviations of the smoothed curves.

Appendix E Algorithmic details for SVG(0)

Distributed SVG(0)

  Inputs: initial online parameters for the policy, Q-function, and belief networks;
          target parameters initialized to the online parameters;
          replay buffer; batch size; unroll length; target update period
  for iteration i = 1, 2, ... do
     Sample a batch of trajectories of the given unroll length uniformly from the replay buffer
     for each time step of the sampled trajectories do
        Compute the belief network features and the auxiliary prediction loss (Eq. 12)
        Compute Retrace targets and the Q-function loss
        Compute the SVG(0) policy loss with entropy regularization
        Add the information bottleneck KL terms (Eq. 15) to the policy, Q-function, and belief losses
     end for
     Update the online parameters with a gradient descent step on the accumulated losses
     if i is a multiple of the target update period then
        Copy the online parameters to the target parameters
     end if
  end for
Algorithm 1 Belief net SVG(0) with IB (learner)

The agents are trained in a distributed way, similar to Riedmiller et al. (2018). Several worker processes independently collect trajectories (of length equal to the unroll length) of the agent’s interactions with the environment, and send them to a shared replay buffer. A learner process (see Algorithm 1) then uniformly samples batches of trajectories from the replay buffer and updates the networks via a gradient descent step on the appropriate SVG(0) losses augmented with the auxiliary prediction loss from Eq. 12, the KL regularization terms from Eq. 15, and policy entropy regularization. The learner then shares the updated network parameters with the workers.

Network architectures

Actor, critic, and belief networks all encode inputs with a 3-layer MLP with ELU activation functions, except for the first layer where we apply layer normalization (Lei Ba et al., 2016) followed by a tanh activation. In the critic network we augment the encoded inputs with the action (processed with a tanh) to be evaluated. The networks then pass the encoded inputs to an LSTM network. If we include IB regularization, then the LSTM outputs the parameters of a diagonal Gaussian distribution. Actor and critic networks linearly map either samples from the IB encoder or outputs of the LSTM to the parameters of the policy distribution or a scalar value function, respectively. The belief network maps the LSTM outputs (or IB encoder samples) to the parameters of a distribution over the auxiliary target using another 2-layer MLP (200 and 100 units) with ELU activations. The head architecture processes the actor’s and critic’s LSTM outputs (or IB encoder samples) with a 2-layer MLP. We parameterize the policy and the IB encoder with parameters which are mapped to the means and standard deviations of a diagonal Gaussian distribution; the standard deviations are obtained by passing the corresponding parameters through a sigmoid function.
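A sketch of this torso in PyTorch; the hidden sizes are placeholders, since the exact values are listed under the hyperparameters below.

import torch
import torch.nn as nn

class RecurrentTorso(nn.Module):
    # 3-layer MLP encoder (layer norm + tanh on the first layer, ELU elsewhere) followed
    # by an LSTM. For a critic, the tanh-squashed action is appended to the MLP output
    # before the LSTM. The returned features feed a linear policy/value head, or an
    # IB layer first when bottleneck regularization is used.
    def __init__(self, obs_dim, act_dim=0, hidden=200, lstm_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.LayerNorm(hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU())
        self.lstm = nn.LSTM(hidden + act_dim, lstm_dim, batch_first=True)

    def forward(self, obs, action=None):
        x = self.mlp(obs)                                     # (batch, time, hidden)
        if action is not None:
            x = torch.cat([x, torch.tanh(action)], dim=-1)    # critic networks only
        features, _ = self.lstm(x)
        return features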

Hyperparameters

Unless specified otherwise, we use the following hyperparameters in our experiments:

Default hyperparameters
Actor learning rate:
Critic learning rate:
Belief network learning rate:
Target update period:
Actor network encoder: MLP with sizes
Critic network encoder: MLP with sizes
Actor LSTM size:
Critic LSTM size:
Belief LSTM size:
Batch size:
Unroll length:
Entropy bonus:
Discount factor:
Number of parallel actors:
Belief bottleneck parameters
Belief bottleneck dimension:
Belief bottleneck loss coefficient:

Agent bottleneck parameters (Belief agents)
Actor bottleneck dimension:
Actor bottleneck loss coefficient:
Critic bottleneck dimension:
Critic bottleneck loss coefficient:

Agent bottleneck parameters (LSTM agents)
Actor bottleneck dimension:
Actor bottleneck loss coefficient:
Critic bottleneck dimension:
Critic bottleneck loss coefficient:

Task embedding parameters
Embedding size:
Auxiliary loss parameters
Head dimensions: MLP with sizes

Appendix F Algorithmic details for PPO

All architectures consist of an MLP with ELU activation functions, followed by an LSTM with 128 hidden units which uses layer normalization. All networks are optimized with the Adam optimizer with default TensorFlow settings except for the learning rate. While the policy loss is clipped according to the PPO objective from Schulman et al. (2017), the value function and belief network are updated using several gradient descent steps without any clipping for each batch of data.

Hyperparameters
Actor learning rate:
Value learning rate:
Belief network learning rate:
Discount:
Entropy:
Generalized advantage lambda:
Batch size: episodes
Gradient descent steps per update:
Epsilon in PPO clipped objective:
Belief loss weight in Head architecture:

Appendix G Additional environments

G.1 Navigation to targets in a square with deterministic sparse rewards

Description

This is the same environment as Navigation to targets on a semi-circle with deterministic sparse rewards, except that this time the targets are arbitrary locations in a 6x6 square, as in Navigation to targets in a square with Bernoulli distributed dense rewards.

Results

Figure G.1: Learning curves for various agents in the Navigation to targets in a square with deterministic sparse rewards environment. Solid curves correspond to training performance (100 tasks), while dashed curves to validation performance.
Figure G.2: Dependence of the generalization gap on the number of training tasks in the Navigation to targets in a square with deterministic sparse rewards environment. We show the performance after a fixed number of learner steps.
Figure G.3: Influence of training set size on the validation performance of the baseline LSTM architecture with and without IB regularization in the Navigation to targets in a square with deterministic sparse rewards environment.
Figure G.4: Influence of training set size on the validation performance of the belief architecture predicting the task description with and without IB regularization in the Navigation to targets in a square with deterministic sparse rewards environment.
Figure G.5: Example trajectory for the go-to-target task.

We again consider using cues as in Sec. 9.5 to ease training. Our results are summarized in Fig. G.1. The training set size dependence of the generalization gap is shown in Fig. G.2, and the full learning curves for various training set sizes in Fig. G.3 and Fig. G.4.

Fig. G.5 visualizes a typical trajectory in this environment. Again, the initial search period is long (which is why this task is rather difficult), but once the target is discovered, the agent can easily return to it.

G.2 Navigation with deterministic sparse rewards and a quadruped robot

Description

This is the same environment as Navigation to targets on a semi-circle with deterministic sparse rewards, except that this time we use a more complex 12 DoF quadruped robot controlled by 8 actuators instead of the rolling ball robot.

Results

Figure G.6: Learning curves for various agents in the Navigation to targets on a semi-circle with deterministic sparse rewards and a quadruped robot environment. Solid curves correspond to training performance (100 tasks), while dashed curves to validation performance.
Figure G.7: Influence of training set size on the validation performance of the baseline LSTM architecture with and without IB regularization in the Navigation to targets on a semi-circle with deterministic sparse rewards and a quadruped robot environment.
Figure G.8: Influence of training set size on the validation performance of the belief architecture predicting the task description with and without IB regularization in the Navigation to targets on a semi-circle with deterministic sparse rewards and a quadruped robot environment.

Our results showing the advantage of the belief network architecture are summarized in Fig. G.6. We also show the learning curves for smaller training set sizes in Fig. G.7 and Fig. G.8. We have not tuned our algorithm specifically for this task which is likely why it is not very stable with respect to the number of training tasks.

G.3 Goal navigation with three targets and normally distributed rewards

Description

The rolling ball robot is tasked with navigating to one of three targets without knowing which one. Upon reaching a target, the agent receives a random reward drawn from a Gaussian distribution and is then teleported back to the initial position. An episode ends once the robot has been teleported 1000 times.

The expected rewards at the three targets are constrained so that one target is more rewarding than the others. We define the task distribution over task descriptions via the following sampling procedure: we randomly sample two distinct target indices and corresponding means, which together determine the vector of expected rewards used as the task description. Distributions over this task description are parameterized as Gaussian distributions with a diagonal covariance matrix.

Results

Figure G.9: Learning curves for various agents trained to solve the goal navigation with three targets and normally distributed rewards task.

We compare the baseline LSTM agent with and without IB regularization to a belief network agent with IB regularization which predicts the task description. All agents are trained on a training set of 100 tasks, and evaluated on a validation set of 1000 tasks. Our results are summarized in Fig. G.9. Only the agent with extra supervision was able to solve this environment. This is likely because the rewards are very noisy, and often negative, which makes the strategy of avoiding the targets a reasonable local optimum.

Appendix H Additional results for environments in the main text

In this section we present additional learning curves for experiments in the main text.

Figure H.1: Influence of training set size on the validation learning curves of the baseline LSTM architecture with and without IB regularization in the Navigation to targets in a square with Bernoulli distributed dense rewards environment.
Figure H.2: Influence of training set size on the validation learning curves of the belief architecture predicting the task description with and without IB regularization in the Navigation to targets in a square with Bernoulli distributed dense rewards environment.
Figure H.3: Influence of training set size on the validation learning curves of the baseline LSTM architecture with and without IB regularization in the Navigation to targets on a semi-circle with deterministic sparse rewards environment.
Figure H.4: Influence of training set size on the validation learning curves of the belief architecture predicting the task description with and without IB regularization in the Navigation to targets on a semi-circle with deterministic sparse rewards environment.
Figure H.5: Comparison of the belief network agent, the auxiliary head agent, and their combination when supervised with the task description in the Path seeking robot environment.
Figure H.6: Effects of information bottleneck regularization on agents trained in the Path seeking robot environment. We also compare to a belief network agent which has IB regularization in the belief network but not in the actor and critic networks.
Figure H.7: Dependency of the validation learning curves on the training set size for the baseline agent (upper curve), and the belief agent (bottom curve) with and without information bottleneck regularization.