
Learning Context-aware Task Reasoning for Efficient Meta-reinforcement Learning

by   Haozhe Wang, et al.

Despite recent success of deep network-based Reinforcement Learning (RL), it remains elusive to achieve human-level efficiency in learning novel tasks. While previous efforts attempt to address this challenge using meta-learning strategies, they typically suffer from sampling inefficiency with on-policy RL algorithms or meta-overfitting with off-policy learning. In this work, we propose a novel meta-RL strategy to address those limitations. In particular, we decompose the meta-RL problem into three sub-tasks, task-exploration, task-inference and task-fulfillment, instantiated with two deep network agents and a task encoder. During meta-training, our method learns a task-conditioned actor network for task-fulfillment, an explorer network with a self-supervised reward shaping that encourages task-informative experiences in task-exploration, and a context-aware graph-based task encoder for task inference. We validate our approach with extensive experiments on several public benchmarks and the results show that our algorithm effectively performs exploration for task inference, improves sample efficiency during both training and testing, and mitigates the meta-overfitting problem.


1. Introduction

Modern reinforcement learning has achieved great successes in solving certain complex tasks by utilizing deep neural networks, which can even be trained from scratch alphazero; poker; alphastar. Such successes, however, require a large amount of training experience for new tasks. In contrast, human learners are able to exploit past experiences when facing a novel problem and quickly learn skills for a related task lake2017building. Achieving such fast adaptation is a critical step towards building a general AI agent capable of solving multiple tasks in real-world environments.

A principled way to tackle the problem of efficient adaptation is the meta-learning framework finn2017model, which aims to capture shared knowledge across tasks and hence enables an agent to learn a similar task using only a few experiences. In the reinforcement learning setting, however, the learning agent also needs to explore in each novel task, so it is particularly challenging to design an efficient meta-RL algorithm. A majority of prior works adopt on-policy RL algorithms rl-squared; finn2017model; gupta2018meta; rothfuss2018promp, which are data-inefficient during meta-training rakelly2019efficient. To remedy this, rakelly2019efficient propose an alternative strategy, PEARL, that relies on off-policy RL methods to achieve sample efficiency in meta-training. By introducing a latent variable to represent a task, their method decomposes the problem into online task inference and task-conditioned policy learning that uses experiences from a replay buffer (i.e., off-policy learning).

Nevertheless, such an off-policy strategy has several limitations during the meta-test stage, particularly in the few-shot setting. First, it ignores the role of exploration in task inference (cf. meta-episodes in rl-squared), which is critical to efficient adaptation, as exploration is responsible for collecting informative experiences within the few episodes available to infer tasks. In addition, the agent has to explore in an online fashion for task inference during meta-test, which typically yields a different sample distribution from the replay buffer that provides adaptation data at the meta-train stage. PEARL partially alleviates this problem of train-test mismatch vinyals2016matching by adopting a replay buffer of recently collected data. Such an in-between strategy rakelly2019efficient, however, still leads to severe “meta-overfitting”, particularly for online task inference (cf. Sec. 4.1 and 5.3). Furthermore, the adaptation data acquired by exploration are simply aggregated by weighted average pooling for task inference. As the experience samples are not i.i.d., such aggregation fails to capture their dependency relations, which are informative for task inference.

In this work, we propose a context-aware task reasoning strategy for meta-reinforcement learning to address the aforementioned limitations. Adopting a latent representation for tasks, we formulate the meta-learning as a POMDP, which learns an approximate posterior distribution of the latent task variable jointly with a task-dependent action policy that maximizes the expected return for each task. Our main focus is to develop an adaptive task inference strategy that is able to effectively map from few-shot experiences of a task to its representation without suffering from the meta-overfitting.

To achieve this, we design a novel task inference network that consists of an exploration policy network and a structured task encoder shared by all tasks. Learning an exploration policy allows us to explore a task environment more efficiently and to introduce regularization to bridge the gap between meta-train and meta-test, leading to better generalization. The structured task encoder is built on a context-aware graph network, which is input permutation-invariant and size-agnostic in order to cope with variable number of exploration episodes. Our graph network encodes task-related dependency in experience data samples, capable of capturing complex task statistics from exploration to achieve more data-efficient task inference.

To train our meta-learner, we develop a variational EM formulation that alternately optimizes the exploration policy network that collects experiences for a task, the context-aware graph network that performs task inference based on the collected task data, and an action policy network that aims to complete tasks towards maximum rewards given the inferred task information. Our meta-learning objective, formulated under the POMDP framework, is composed of two state-action value functions for the two respective policies. The state-action value for the exploration policy is guided by a reward shaping strategy that encourages the policy to collect task-informative experiences, and is learned with Soft Actor Critic (SAC) haarnoja2018soft for randomized behavior to narrow the gap between online rollouts and offline buffers. In essence, our method decomposes the meta-RL problem into three sub-tasks, task-exploration, task-inference and task-fulfillment, in which we learn two separate policies for different purposes and a task encoder for task inference.

We evaluate our meta-reinforcement learning framework on four benchmarks with diverse task distributions, in which our approach outperforms prior works by improving training efficiency (up to 400x) and testing efficiency with better asymptotic performance (up to 300%) while effectively mitigating meta-overfitting by a large margin (up to 75%). Our contributions can be summarized as follows:

  • We propose a sample-efficient meta-RL algorithm that achieves state-of-the-art performance on multiple benchmarks with diverse task distributions.

  • We present a dual-agent design with a reward shaping strategy that explicitly optimizes for the ability to explore tasks and mitigates meta-overfitting.

  • We develop a context-aware graph network for task inference that models the dependency relations between experience data in order to efficiently infer the task posterior.

2. Related work

2.1. Meta-reinforcement Learning

Prior meta reinforcement learning methods can be categorized into the following three lines of work.

The first line of work adopts a learning-to-learn strategy finn2017model; stadie2018some; gupta2018meta; rothfuss2018promp; xu2018learning. In particular, MAML finn2017model meta-learns a model initialization from which to adapt the parameters with policy gradient methods. To tackle issues in computing second-order derivatives for MAML, ProMP rothfuss2018promp further proposes a low-variance curvature estimator to perform proximal policy optimization that bounds the policy change. Recently, MAESN gupta2018meta improves MAML with more temporally coherent exploration by injecting structured noise via a latent variable conditioned policy, and enables fast learning of exploration strategies.

The second category of approaches uses recurrent or recursive models to directly meta-learn a function that takes experiences as input and generates a policy for an agent mishra2017simple; reiforcelearn; rl-squared. Among them, RL² rl-squared trains a recurrent agent with on-policy meta-episodes that comprise a sequence of exploration and task-fulfillment episodes, aiming to maximize the expected return of the task-fulfillment episode. This method essentially learns to extract task information from the first few rounds of exploration, encoded in the latent states of its RNN, and to complete the task in the final episode based on the known task information (latent states).

The first two groups of work often learn a single policy to perform exploration for policy adaptation and task-fulfillment, which overloads the agent with two distinct objectives. By contrast, we learn two separate policies, one focusing on exploration for policy adaptation and the other for task-fulfillment.

In the third line of work, rakelly2019efficient and taskinf propose to first infer the task from task experiences and then adapt the agent according to the task knowledge. Framing meta-RL as a POMDP problem with probabilistic inference, taskinf formulate task inference as the belief state in a POMDP, and learn the inference procedure in a supervised manner with privileged information as ground truth. PEARL rakelly2019efficient learns to infer tasks in an unsupervised manner by introducing an extra reconstruction term and incorporates off-policy learning to improve sample efficiency. However, both approaches ignore the role of exploration in task inference, and PEARL suffers from train-test mismatch in the data distribution for task inference.

By contrast, we propose to further disentangle meta-RL into task-exploration, task-inference and task-fulfillment, and introduce an exploration policy within a variational EM formulation. Our method enhances the inference procedure via active task exploration and effective task inference, achieves data-efficiency both in meta-train and meta-test, and mitigates meta-overfitting.

2.2. Relational Modeling on Sets

The problem of inference on a set of samples has been widely explored in literature battaglia2018relational; gilmer2017neural; li2015gated; bruna2013spectral; kipf2016semi; hamilton2017inductive, and here we focus on those approaches that learn a permutation-invariant and input size-agnostic model.

DeepSets zaheer2017deep proposes general design principles for permutation-invariant functions on sets. As an instantiation, Sets2Sets vinyals2015order encodes a set using an attention mechanism, which retrieves vectors in a manner immune to the shuffling of memory and accepts inputs of variable size. However, these methods typically do not consider relations among input elements, which are critical for modeling short experiences for task inference.

Self-attention modules internet; vain; santoro2017simple; wang2018non are commonly used to model pairwise relationships between entities of a set. Among them, Nonlocal Neural Networks wang2018non are capable of capturing global context using the scaled dot product attention vaswani2017attention. Along this direction, graph networks battaglia2018relational; gilmer2017neural provide a flexible framework that can model arbitrary relationships among entities and is invariant to graph isomorphism. Self-attention on sets, a special case of graphs, can naturally be assimilated into the graph network framework zhang2019latentgnn; battaglia2018relational; velivckovic2017graph. Recently, LatentGNN zhang2019latentgnn develops a novel undirected graph model that encodes non-local relationships through a shared latent space and admits efficient message passing due to its sparse connections. In this work, the design of our graph-based inference network is inspired by LatentGNN and DeepSets, aiming to efficiently model the dependency relationships in experience data.

3. Variational Meta-Reinforcement Learning

We aim to address the fast adaptation problem in reinforcement learning, in which the learned agent explores a new task (up to a budget) and adapts its policy to the new task. We adopt the meta-learning framework to enable fast learning, incorporating a prior of past experiences in structurally similar tasks.

Formally, we assume a task distribution from which multiple tasks are sampled i.i.d. for meta-training and testing. Any task can naturally be defined by its MDP in RL, which consists of a state space, an action space, a distribution of the initial states, a transition distribution, a discount on rewards (which we will omit for clarity), and a reward function. Here the task distribution is typically defined by the distribution of the reward function and the transition function, which can vary due to some underlying parameters. We introduce a latent variable to represent the cause of task variations, and formulate the meta-RL problem as a POMDP in which this latent variable serves as the unobserved part of the state space.

We first introduce two policies, including an action policy that aims to maximize expected reward under a given task and an exploration policy for generating task-informative samples by interacting with the environment. We then adopt the off-policy Actor-Critic framework dpg, and learn two Q-functions for the action and exploration policy respectively. To achieve this, we define the following learning objective, which maximizes the expected log likelihood of returns under two policies (we omit the expectation over tasks for clarity):


where denotes the distribution visited by the two behavior policies for , denotes the environment, and is the reshaped reward for (Sec. 4.1). Typically, is approximated by the experience replay buffers of , denoted as respectively mnih2013playing; ddpg; haarnoja2018soft. We will slightly abuse the notation to denote the joint distribution of for brevity.

Here the likelihood term denotes the likelihood of the return given a state-action pair, evaluated at a target value estimated using the optimal Bellman equation. Note that if this likelihood is assumed to be Gaussian, centered at the predicted state-action value with a constant variance, then maximizing the likelihood is equivalent to the Temporal Difference learning objective tesauro1995temporal; konda2000actor; mnih2013playing:


where the constant variance gives rise to the constant terms.
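The equivalence can be made concrete with a short worked derivation. This is a sketch under assumed notation that does not appear in the text above: $Q(s,a)$ stands for the predicted state-action value, $\hat{Q}$ for the Bellman target, and $\sigma$ for the constant standard deviation of a Gaussian likelihood (the Gaussian form is suggested by the remark in Sec. 4.1.2 about the L2-norm induced by a Gaussian).

```latex
\begin{aligned}
\log p(\hat{Q} \mid s, a)
  &= \log \mathcal{N}\!\left(\hat{Q};\; Q(s,a),\; \sigma^{2}\right) \\
  &= -\frac{1}{2\sigma^{2}} \left(\hat{Q} - Q(s,a)\right)^{2}
     \;-\; \frac{1}{2} \log\!\left(2\pi\sigma^{2}\right)
\end{aligned}
```

Because $\sigma$ is constant, the second term does not depend on the networks, and maximizing the expected log-likelihood is the same as minimizing the expected squared TD error up to a positive scale factor.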

We utilize the variational EM learning strategy bishop2006pattern to maximize the objective. Given the sampled experiences, we introduce a variational distribution to approximate the intractable posterior over the latent task variable, and we alternate between optimizing this variational distribution and optimizing the other parameterized functions. We refer to the variational distribution as the task encoder in our model.

For the E-step, we maximize a variational lower bound derived from (1) to find the optimal variational distribution. In order to decouple the action and exploration policies, we also introduce an auxiliary experience distribution and minimize the following free energy:


where in the first term the TD Error derives from , and the constant term comes from the second term in (1) which is irrelevant to . For the auxiliary distribution, we adopt the exploration policy’s sample distribution, as the explorer aims to sample task-informative experiences for task inference. Hence the E-step objective has the following form:


For the M-step, given the variational distribution, we minimize the following free-energy objective derived from (1):


Based on this variational EM algorithm, our approach learns dual agents for task-exploration and task-fulfillment respectively. The meta-training process interleaves the data collection process with the alternating EM optimization process, as shown in Alg. 1.

Figure 1. Context-aware task reasoning for RL adaptation. We separate the task into task-exploration, task-inference and task-fulfillment. The explorer interacts with the environment to collect experiences for the task encoder to update the belief of the task. After several rounds of this iterative process, the task encoder takes all the collected experiences and gives the final task hypothesis to adapt the actor.
Input: meta-train tasks, training steps, E-step, M-step, meta-batch size, task sample size for buffer update, learning rate.
Output: the trained networks.
Initialize the network parameters, a buffer for the explorer, and a buffer for the actor.
for each task do
      collect episodes into the buffers with the respective policies.
end for
while not converged do
      randomly sample tasks from the meta-train tasks to form a set.
      for each task in the set do
            add episodes into the buffers with the policies.
      end for
      for each step in training steps do
            randomly sample tasks from the meta-train tasks to form a set.
            if the step is an E-step then
                  get the inputs (E-step).
                  compute the loss.
                  update the model.
            else
                  get the inputs (M-step).
                  compute the loss.
                  update the model.
            end if
      end for
end while
Algorithm 1. The meta-training algorithm.
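
The alternation in Alg. 1 can be sketched in Python as below. This is a minimal illustration, not the authors' implementation: the schedule (a block of `e_steps` E-step updates followed by `m_steps` M-step updates, repeated) is an assumption, because the switching condition is not recoverable from the text, and `update_encoder` and `update_policies` are hypothetical callbacks standing in for the E-step and M-step updates.

```python
from typing import Callable, List


def run_em_schedule(
    training_steps: int,
    e_steps: int,
    m_steps: int,
    update_encoder: Callable[[int], None],
    update_policies: Callable[[int], None],
) -> List[str]:
    """Alternate E-step and M-step updates and return the phase used at each step.

    Assumed schedule: each cycle runs `e_steps` E-steps, then `m_steps` M-steps.
    """
    if e_steps <= 0 or m_steps <= 0:
        raise ValueError("e_steps and m_steps must be positive")
    cycle = e_steps + m_steps
    phases = []
    for step in range(training_steps):
        if step % cycle < e_steps:
            update_encoder(step)   # E-step: fit the task encoder, other networks fixed
            phases.append("E")
        else:
            update_policies(step)  # M-step: fit the other networks, task encoder fixed
            phases.append("M")
    return phases


# Usage with no-op callbacks, just to show the resulting schedule.
schedule = run_em_schedule(6, e_steps=1, m_steps=2,
                           update_encoder=lambda step: None,
                           update_policies=lambda step: None)
print("".join(schedule))  # EMMEMM
```

Keeping the two updates behind separate callbacks mirrors the text's separation between optimizing the task encoder and optimizing the remaining parameterized functions.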

In contrast to taskinf; rakelly2019efficient, our variational EM formulation derives from a unified learning objective and enables fast learning of dual agents with different roles. The explorer learns task-exploration, which targets efficient exploration to (actively) collect sufficient task information for task inference, while the actor learns task-fulfillment, which accomplishes the task with high rewards. Both the explorer and the actor are task-conditioned, so that they are able to perform temporally-coherent exploration for different goals, given the current task hypothesis gupta2018meta.

With such a formulation, we need to answer the following questions to achieve efficient and effective task inference to solve the meta-RL problem:

  • How to guide the explorer towards active exploration for sufficient task information, which is crucial in task inference with few experiences?

  • How to achieve effective task inference that captures the relationships between experiences and reduces task uncertainty?

  • Since we incorporate off-policy learning algorithms, how to mitigate the train-test mismatch issue due to the use of a replay buffer during training vs. online rollouts during testing?

We will address the above three questions in Sec. 4, which introduces our task inference strategy in detail.


At meta-test time, we apply the same data collection procedure as in the meta-training stage. For each task, the explorer first samples a hypothesis from the posterior (initialized as the prior), and then explores optimally according to the hypothesis. The experiences collected during the exploration are used by the task encoder to update the posterior over the latent task variable. This process iterates until the explorer uses up the maximum number of rollouts. All the experiences are fed to the task encoder to produce a final posterior representing the belief over MDPs. We then sample from the posterior to generate a task-conditioned action policy, which interacts with the environment in an attempt to fulfill the task. This process is shown in Fig. 1.
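
The meta-test procedure amounts to a posterior-sampling loop, which the toy sketch below illustrates. It is not the method itself: the task is reduced to a single unknown scalar, `explore` is a hypothetical stand-in for an explorer rollout that returns noisy observations of that scalar, `encode` replaces the learned task encoder with a closed-form Gaussian update, and the belief is assumed to start at a standard normal prior. All names and the noise model are assumptions made for the example.

```python
import random
from typing import List, Tuple

Posterior = Tuple[float, float]  # (mean, variance) of a Gaussian belief over the task


def explore(hypothesis: float, true_task: float, rng: random.Random,
            steps: int = 5, noise_std: float = 1.0) -> List[float]:
    """Toy explorer rollout: returns noisy observations of the true task value.

    A real explorer would condition its actions on `hypothesis`; here the
    argument is unused and only marks where the conditioning would happen.
    """
    return [true_task + rng.gauss(0.0, noise_std) for _ in range(steps)]


def encode(experiences: List[float], prior: Posterior,
           noise_var: float = 1.0) -> Posterior:
    """Toy task encoder: conjugate Gaussian update from all experiences so far."""
    prior_mean, prior_var = prior
    precision = 1.0 / prior_var + len(experiences) / noise_var
    mean = (prior_mean / prior_var + sum(experiences) / noise_var) / precision
    return mean, 1.0 / precision


def adapt(true_task: float, max_rollouts: int, seed: int = 0) -> Posterior:
    rng = random.Random(seed)
    prior: Posterior = (0.0, 1.0)
    posterior = prior                               # the belief starts at the prior
    experiences: List[float] = []
    for _ in range(max_rollouts):
        mean, var = posterior
        hypothesis = rng.gauss(mean, var ** 0.5)    # sample a task hypothesis
        experiences += explore(hypothesis, true_task, rng)
        posterior = encode(experiences, prior)      # re-infer from everything collected
    return posterior


final_mean, final_var = adapt(true_task=2.0, max_rollouts=3)
# The actor would now be conditioned on a sample drawn from this final posterior.
print(final_mean, final_var)
```

In this toy setting the posterior variance shrinks with every rollout, which is the behavior the exploration budget is meant to buy: more task-informative experience, less task uncertainty.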

4. Task Inference Strategy

We now introduce our task reasoning strategy for meta-RL with limited data. Our goal is to use the limited exploration to generate experiences with sufficient information for task inference and to effectively infer the task posterior from the given task experiences. In Sec. 4.1, we explain how we learn an explorer that pursues task-informative experiences with randomized behavior and mitigates train-test mismatch. In Sec. 4.2, we elaborate on the network design of the task encoder, which enables relational modeling between task experiences.

4.1. Learning the Exploration Policy

Our exploration policy aims to enrich the task-related information of experiences within limited episodes. To this end, below we introduce two reward shaping methods: first, we increase the coverage of task experiences to obtain task-informative experiences; second, we improve the quality of each sample to have as much information gain as possible.

4.1.1. Increasing Coverage

The first idea for improving exploration is to increase its coverage of task experiences by adding stochasticity to the agent’s behavior, which is motivated by empirical observations of the performance obtained with the replay buffer (see the off-policy curve in Fig. 9). Off-policy data have more coverage of experiences, as the samples are uniformly drawn from multiple different trajectories, compared to samples in online rollouts, which have smaller variations.

To this end, we encourage the randomized behavior of the explorer by leveraging SAC haarnoja2018soft. SAC derives from the entropy-regularized RL objective levine2018reinforcement, which essentially adds entropy to the reward (and value) functions. Note that we optimize both the actor and the explorer with SAC, but for different purposes. SAC favors exploration in policy learning for the actor, which will run deterministically at meta-test time. In contrast, here SAC is used to guide the explorer towards randomized task-exploration, and the explorer remains a stochastic policy during deployment.

It is worth noting that training an exploration policy with randomized behavior also helps mitigate train-test mismatch, which is caused by the different data distributions for policy adaptation (buffer data vs. online rollouts). Our task-exploration explicitly learns the ability for randomized exploration that (empirically) brings the online rollout data distribution closer to the off-policy buffer data.

4.1.2. Improving Quality

To improve the quality of exploration, we also design a reward shaping that guides the explorer towards highly informative experiences. However, it is non-trivial to quantify the informativeness in a sample. In our case, the explorer collects experiences given a task hypothesis, so we are more interested in evaluating the mutual information as the reward for the explorer,


where we differentiate the new task hypothesis against previously hypothesized task , denotes the repository of collected experiences and is the sample for assessment. Intuitively, we expect the credit to reflect the information gain from the sample given the current hypothesis of task .

However, directly computing the mutual information is impractical, as it involves evaluating the new posterior after incorporating each new sample . Instead, we adopt the following proxy for the mutual information:


This proxy implies that we believe a sample brings larger information gain if the new hypothesis is less likely to be the same as the prior belief . To compute , we apply the Bayes Rule:


As the joint distribution of is constant given , the posterior of is proportional to the inverse of the likelihood. Similar to Sec. 3, we use the state-action value function to compute the likelihood, but we instead assume a Laplace distribution, as we empirically find it more stable to use the L1-norm than the L2-norm induced by a Gaussian. As a result, we have the following score function that gives the shaped reward:


where the hyperparameter is the reward scale, and the greedy policy is learned to compute  dpg; ddpg; haarnoja2018soft.
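
To make the link between the Laplace assumption and the L1-norm concrete, the sketch below computes a shaped reward as a scaled absolute TD error. This is an illustration under assumptions, since the score function itself is not reproduced here: `q_pred` stands for the critic's value of the sample under the current task hypothesis, `q_target` for its Bellman target, and `reward_scale` for the reward-scale hyperparameter. The exact form used by the method may differ.

```python
import math


def laplace_neg_log_likelihood(target: float, prediction: float, b: float = 1.0) -> float:
    """Negative log-density of a Laplace distribution centred at `prediction`."""
    return abs(target - prediction) / b + math.log(2.0 * b)


def shaped_reward(q_pred: float, q_target: float, reward_scale: float = 1.0) -> float:
    """Assumed shaped reward: larger when the sample is poorly explained by the
    current task hypothesis, i.e. when its likelihood is low."""
    return reward_scale * abs(q_target - q_pred)


# A sample that the current hypothesis predicts well earns little reward...
print(shaped_reward(q_pred=1.0, q_target=1.1))
# ...while a surprising sample earns more.
print(shaped_reward(q_pred=1.0, q_target=3.0))
```

Up to the additive constant `log(2b)` and the factor `1/b`, this reward equals the Laplace negative log-likelihood, which is why an L1 penalty appears in place of the squared error of the Gaussian case.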

4.2. Context-aware Task Encoder

Our task inference network computes the posterior of the latent task variable given a set of experience data, aiming to extract task information from experiences. To this end, we design a network module with the following properties:

  1. Permutation-invariant, as the output should not vary with the order of the inputs.

  2. Input size-agnostic, as the network encounters inputs of variable size depending on the number of rollouts.

  3. Context-aware, as extracting cues from a single sample should incorporate the context formed by other samples. (For example, in a 2D navigation task where the agent aims to navigate to a goal location, a sample with high rewards may indicate the possible location of the goal, and another sample with low rewards can further eliminate possibilities by showing which locations are not possible.)

Figure 2. The task encoder. The first aggregation constructs a bipartite graph with full connections from the input nodes to the latent nodes. Self-attention operates on the latent nodes, which are assembled into one latent node in the second aggregation.

Specifically, we adopt a latent Graph Neural Network architecture zhang2019latentgnn, which integrates self-attention with learned weighted-sum aggregation layers. Formally, we introduce a set of latent node features, each of which is a c-channel feature vector, as is each of the input node features. Note that the number of input nodes can be arbitrary, while the number of latent nodes is a fixed hyperparameter. We construct a graph whose nodes are the input nodes and the latent nodes, with full connections from all of the input nodes to each of the latent nodes. The graph network module is illustrated in Fig. 2.

The output of the aggregation layer is the latent node features , which are computed as follows:

where the affinity function is learned and encodes the affinity between any input node and the latent node. In practice, we instantiate this function as a dot product followed by normalization.

We then combine the above aggregation layer with the following self-attention layer :

where we use the scaled dot product attention vaswani2017attention.

Following zhang2019latentgnn, we propagate messages through a shared space with full connections between latent nodes. We first pass the input nodes through an aggregation layer with a fixed number of latent nodes, then perform self-attention on the latent nodes (for multiple iterations), and finally pass the latent nodes through another aggregation layer with a single final latent node to obtain the final output. The final output provides the parameters for the task posterior, which is a Gaussian distribution. The network can be viewed as a multi-stage summarization process, in which we group the inputs into several summaries, operate on these summaries to compute the relationships between entities, and produce a final summary of the entities and relationships.
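
The three stages (aggregate, self-attend, aggregate again) can be sketched in plain Python as follows. This is a simplified illustration rather than the network described in the paper: fixed random vectors stand in for learned parameters, the self-attention has no learned projections, and normalizing the dot-product affinity with a softmax over the input nodes is an assumption. The names `latent_queries` and `final_query` are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights: List[float], vectors: List[Vector]) -> Vector:
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]


def aggregate(nodes: List[Vector], queries: List[Vector]) -> List[Vector]:
    """Aggregation layer: each latent node is a weighted sum over all input nodes.

    Affinity is a dot product with a per-latent-node vector, normalised with a
    softmax over the input nodes (the choice of softmax is an assumption).
    """
    return [weighted_sum(softmax([dot(x, q) for x in nodes]), nodes) for q in queries]


def self_attention(nodes: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention without learned projections."""
    scale = math.sqrt(len(nodes[0]))
    return [weighted_sum(softmax([dot(q, k) / scale for k in nodes]), nodes)
            for q in nodes]


def encode(experiences: List[Vector], latent_queries: List[Vector],
           final_query: Vector, rounds: int = 2) -> Vector:
    latents = aggregate(experiences, latent_queries)   # N inputs -> K latent nodes
    for _ in range(rounds):
        latents = self_attention(latents)              # relate the K summaries
    return aggregate(latents, [final_query])[0]        # K latent nodes -> 1 output


rng = random.Random(0)
channels, num_latent = 4, 3
latent_queries = [[rng.gauss(0, 1) for _ in range(channels)] for _ in range(num_latent)]
final_query = [rng.gauss(0, 1) for _ in range(channels)]
experiences = [[rng.gauss(0, 1) for _ in range(channels)] for _ in range(7)]

summary = encode(experiences, latent_queries, final_query)
reordered = encode(experiences[::-1], latent_queries, final_query)
# Reordering the inputs changes the summary only by floating-point rounding.
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(summary, reordered)))
```

Because every stage that touches the inputs is a normalized weighted sum, the output does not depend on input order and accepts any number of experiences. In a full model the output vector would be mapped to the mean and variance of the Gaussian posterior; that mapping is omitted here.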

Figure 3. Training efficiency. The test-task performance versus the number of interactions with the environment during meta-training. The dashed lines represent the asymptotic performance of each method.
Figure 4. Testing efficiency. The x-axis denotes the number of trajectories used as adaptation data.

5. Experiments

In this section, we demonstrate the efficacy of our design and the behavior of our method, termed CASTER (shorthand for Context-Aware task encoder with Self-supervised Task ExploRation for efficient meta-reinforcement learning), through a series of experiments. We first introduce our experimental setup in Sec. 5.1. In Sec. 5.2, we evaluate CASTER against three meta-RL algorithms in terms of sample efficiency. We then compare the behavior of CASTER with PEARL regarding overfitting and exploration in Sec. 5.3. Finally, in Sec. 5.4, we conduct an ablation study on our design of the exploration strategy and the task encoder.

5.1. Experimental Setup

We evaluate our method on four benchmarks proposed in rothfuss2018promp and an environment introduced in rakelly2019efficient for ablation. (Two experiments, “cheetah-forward-backward” and “ant-forward-backward”, are not used because they only include two tasks, namely the goals of going forward and backward, and do not match the meta-learning assumption that there is a task distribution from which we sample the meta-train task set and a held-out meta-test task set. Such benchmarks do not provide convincing evidence for the efficacy of meta-learning algorithms.) The environments are implemented on the OpenAI Gym brockman2016openai with the MuJoCo simulator todorov2012mujoco. All of the experiments involve locomotion tasks, which may vary either in the reward function or the transition function. We briefly describe the environments as follows.

  • Half-cheetah-velocity. Each task requires the agent to reach a different target velocity.

  • Ant-goal. Each task requires the agent to reach a goal location on a 2D plane.

  • Humanoid-direction. Each task requires the agent to keep high velocity without falling off in a specified direction.

  • Walker-random-params. Each task requires the agent to keep high velocity without falling off in different system configurations.

  • Point-robot. Each task requires a point-mass robot to navigate to a different goal location on a 2D plane.

We adopt the following evaluation protocol throughout this section: first, the estimated per-episode performance on each task is averaged over (at least) three trials with different random seeds; second, the test-task performance is evaluated on the held-out meta-test tasks and is an average of the estimated per-episode performance over all tasks with at least three trials. More details about the experiments can be found at our Github repository.
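
As a small worked example of this protocol, the snippet below first averages the per-episode returns of each held-out task over its trials and then averages those per-task estimates over tasks. The task names and return values are made up for illustration and are not results from the paper.

```python
from statistics import mean
from typing import Dict, List


def held_out_performance(returns: Dict[str, List[float]]) -> float:
    """Average over trials within each task, then average over tasks."""
    per_task = {}
    for task, trial_returns in returns.items():
        if len(trial_returns) < 3:
            raise ValueError(f"task {task!r} needs at least three trials")
        per_task[task] = mean(trial_returns)
    return mean(list(per_task.values()))


# Hypothetical per-episode returns for two held-out tasks, three seeds each.
example = {
    "task_a": [10.0, 12.0, 14.0],
    "task_b": [20.0, 20.0, 26.0],
}
print(held_out_performance(example))  # 17.0
```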

5.2. Performance

In this section, we compare with three meta-RL algorithms that are representative of the three lines of work mentioned in Sec. 2.1: 1) an inference-based method, PEARL rakelly2019efficient; 2) an optimization-based method, ProMP rothfuss2018promp; 3) a black-box method, RL² rl-squared.

For those baselines, we reproduce the experimental results via their officially released code, following their proposed meta-training and testing pipelines. We note that prior works rl-squared; rothfuss2018promp are not designed to optimize for sample efficiency but we keep their default hyperparameter settings in order to reproduce their results.

Figure 5. Overfitting in off-policy meta-RL. Each column in the plot corresponds to a different environment. We pick three environments most prone to meta-overfitting, i.e., ‘cheetah-vel’, ‘ant-goal’ and ‘point-robot’ from left to right.

To demonstrate the training efficiency and testing efficiency, we plot the test-task performance as a function of the number of samples. Fig. 3 shows the comparison results on training efficiency. Here the x-axis indicates the number of interactions with the environment used to collect buffer data. At each x-tick, we evaluate the test-task performance with 2 episodes for all methods. For testing efficiency, the x-axis refers to the number of adaptation episodes used by different methods to perform policy adaptation (or task inference).

We can see that CASTER outperforms other methods by a sizable margin: CASTER reaches the same level of performance with far fewer environment interactions (up to 400%) while being able to reach higher performance. PEARL and CASTER incorporate off-policy learning and naturally enjoy an advantage in training sample efficiency. CASTER achieves better performance for two reasons. On the one hand, efficient exploration potentially offers CASTER richer information within limited episodes (two in Fig. 3), and raises the upper bound on the accuracy of task inference. On the other hand, context-aware relational reasoning extracts information effectively from the task experiences, and thus improves the ability of task inference.

Testing efficiency is shown in Fig. 4, where CASTER again stands out against the other methods. Notably, CASTER achieves a large performance boost within the first several episodes, indicating that the learned exploration policy is critical to task inference. By contrast, the other models either fail to improve on their zero-shot performance via exploration (e.g., the flat lines of RL², PEARL in humanoid-dir, and ProMP in walker-rand-params) or have unstable performance that drops after an initial improvement (e.g., PEARL in cheetah-vel and ant-goal).

5.3. Understanding CASTER’s Behavior

5.3.1. Meta-overfitting

As mentioned in Sec. 4.1, meta-overfitting arises from train-test mismatch, an issue particular to meta-RL methods that incorporate off-policy learning. We compare the test-task performance obtained with different adaptation data distributions, i.e., the off-policy buffer data vs. the on-policy exploration data, to measure the performance gap when the data distribution shifts. We pick the three environments that we find most prone to train-test mismatch, i.e., cheetah-vel, ant-goal and point-robot. Note that in these experiments, the number of transitions in the off-policy data equals the number of transitions in two episodes of the on-policy data.
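The size-matched comparison described above can be sketched as follows. This is an illustrative sketch only: the transition placeholder, the episode horizon of 20 steps, the buffer size and both function names are assumptions made for the example, and uniform sampling without replacement is one simple choice of buffer sampling rather than a confirmed detail of the experiments.

```python
import random
from typing import List, Sequence, Tuple

# Placeholder transition: (episode index, step index). A real transition
# would hold (state, action, reward, next state).
Transition = Tuple[int, int]


def on_policy_context(
    episodes: Sequence[Sequence[Transition]], num_episodes: int = 2
) -> List[Transition]:
    """Concatenate the first `num_episodes` rollouts, keeping their order."""
    context: List[Transition] = []
    for episode in episodes[:num_episodes]:
        context.extend(episode)
    return context


def off_policy_context(
    buffer: Sequence[Transition], num_transitions: int, rng: random.Random
) -> List[Transition]:
    """Draw `num_transitions` transitions uniformly, without replacement,
    from a previously collected buffer."""
    return rng.sample(list(buffer), num_transitions)


# Hypothetical sizes: 2 fresh rollouts and a buffer of 50 stored episodes.
horizon = 20
rollouts = [[(e, t) for t in range(horizon)] for e in range(2)]
buffer = [(e, t) for e in range(50) for t in range(horizon)]

on_ctx = on_policy_context(rollouts, num_episodes=2)
off_ctx = off_policy_context(buffer, len(on_ctx), random.Random(0))
```

Both contexts contain the same number of transitions, so any difference in test-task performance reflects how the adaptation data is distributed rather than how much of it there is.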

Fig. 5 shows the results on the three environments. (Footnote: the ant-goal curve reproduced with PEARL’s public repo differs from the result reported in rakelly2019efficient, which we suspect to be the train-task performance with off-policy data. We recommend that the reader check against the publicly available code.) CASTER takes a large leap towards bridging the gap between the train and test sample distributions, reducing up to performance drop as compared to PEARL. This can be credited to the greater stochasticity inherent in the explorer’s episodes (Sec. 5.3.2) and to CASTER’s better task inference. The stochasticity in the online rollouts brings their distribution closer to the off-policy data distribution, since at meta-train time data is randomly sampled from the buffer. Task inference that is aware of relations between samples better extracts the information in the few trajectories, which also narrows the gap.

5.3.2. The Learned Exploration Strategy

We investigate the exploration process by visualizing the histogram of the rewards in the collected trajectories. We take three consecutive trajectories rolled out by the explorer on two benchmarks: 1) the reward-varying environment ant-goal, and 2) the transition-varying environment walker-rand-params. We compare with PEARL since it also resorts to posterior sampling for exploration strens2000bayesian; osband2013more; osband2016deep.

Figure 6. Learned exploration behavior in ant-goal. The reward histogram consists of 3 consecutive trajectories. Left: CASTER and Right: PEARL. Different color denotes different trajectories.

In Fig. 6, it is shown that CASTER’s exploration spans a large reward region of , while PEARL ranges from to . This is consistent with our goal to increase the coverage of task experiences within few episodes, which potentially increases the chances to reach the goal and enables the task encoder to reason with both high rewards and low rewards, i.e., what region of state space might be the goal and might not. By contrast, PEARL tends to exploit the higher extreme values at the risk of following a wrong direction (e.g., in the second plot, PEARL proceeds to explore in a narrow region of relatively lower reward than the preceding episode).
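The notion of reward coverage used in this comparison can be illustrated with a small sketch that bins the per-step rewards of a few consecutive trajectories and reports the fraction of bins visited at least once. The bin range, the number of bins, the function names and the reward values below are all hypothetical; they are not read off Fig. 6.

```python
from typing import List, Sequence


def reward_histogram(
    rewards: Sequence[float], low: float, high: float, num_bins: int
) -> List[int]:
    """Count rewards in `num_bins` equal-width bins over [low, high].
    Rewards outside the range are ignored."""
    counts = [0] * num_bins
    width = (high - low) / num_bins
    for r in rewards:
        if r < low or r > high:
            continue
        # Clamp so that r == high falls into the last bin.
        idx = min(int((r - low) / width), num_bins - 1)
        counts[idx] += 1
    return counts


def reward_coverage(
    trajectories: Sequence[Sequence[float]], low: float, high: float, num_bins: int
) -> float:
    """Fraction of reward bins visited by at least one trajectory."""
    occupied = set()
    for rewards in trajectories:
        hist = reward_histogram(rewards, low, high, num_bins)
        occupied.update(i for i, c in enumerate(hist) if c > 0)
    return len(occupied) / num_bins


# Made-up per-step rewards for three consecutive trajectories each.
wide_explorer = [[-9.5, -7.2], [-4.1, -2.6], [-0.3]]
narrow_explorer = [[-3.9, -3.1], [-3.5, -2.2], [-2.8]]

wide_cov = reward_coverage(wide_explorer, low=-10.0, high=0.0, num_bins=5)
narrow_cov = reward_coverage(narrow_explorer, low=-10.0, high=0.0, num_bins=5)
```

In this toy example the first explorer visits every reward bin while the second stays within a single bin, which mirrors the qualitative contrast between broad and narrow exploration discussed above.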

Figure 7. Learned exploration behavior in walker-rand-params.

In Fig. 7, the two methods are evaluated in the transition-varying environment walker-rand-params, in which higher rewards do not necessarily carry information about the system parameters. In this case, PEARL clings to higher-reward regions as expected, while CASTER favors the lower-reward region. As supported by the results in Figs. 3 and 4, PEARL’s exploration is sub-optimal. CASTER performs better because its exploration is driven by information gain, and the explorer discovers task-discriminative experiences that happen to lie around the lower-reward region.

5.4. Ablations

In this section, we investigate our design choices of the exploration strategy and the task encoder via a set of ablative experiments on the point-robot environment. We also provide the test-task performance with off-policy data (adaptation data from a buffer collected beforehand) to eliminate the effect of insufficient data source for task inference that induces train-test mismatch.

5.4.1. The Task Encoder

Figure 8. Ablation study of the task encoder. We show the test-task performance with on-policy data (left) and off-policy data (right).

PEARL rakelly2019efficient builds its task encoder by stacking a Gaussian product (Gp) aggregator on top of an MLP, while we propose to combine self-attention with a learned weighted-sum aggregator for better relational modeling. We examine the following models: 1) Enc(Gp)-Exp(None), which uses the Gp task encoder; 2) Enc(WS)-Exp(None), which uses a learned weighted-sum aggregator without self-attention for graph message-passing; 3) Enc(GNN)-Exp(None), which uses the proposed GNN encoder; and 4) Enc(GNN)-Exp(RS), the proposed model. Note that for the first three baselines, the explorer is disabled to eliminate the impact of the exploration policy.

In Fig. 8, all models tend to converge to similar asymptotic performance with off-policy data. However, Enc(WS)-Exp(None) learns much more slowly than its counterparts, while Enc(Gp)-Exp(None) appears to be a reasonable design w.r.t. the off-policy performance. We conjecture that it can be hard for Enc(WS)-Exp(None) to learn a weighting strategy like that of Gp, in which the weights are determined by the relative importance of a sample w.r.t. the whole pool of samples (), because Enc(WS)-Exp(None) has no access to other samples when computing the weight of each sample. By contrast, Enc(GNN)-Exp(None) provides more flexibility for the final weighted average pooling by incorporating the interactions between samples. Such relational modeling enables it to extract more task statistics within the few episodes, making it superior to Gp in on-policy performance.
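To illustrate why the Gp weights depend on the whole pool of samples, the sketch below applies the standard product-of-Gaussians rule, in which each sample contributes with a weight equal to its precision divided by the total precision of the pool. It assumes a scalar latent for brevity (a diagonal Gaussian would apply the same rule per dimension); the function name and the numbers are hypothetical and are not taken from the PEARL or CASTER code.

```python
from typing import List, Sequence, Tuple


def gaussian_product(
    means: Sequence[float], variances: Sequence[float]
) -> Tuple[float, float, List[float]]:
    """Product of independent 1-D Gaussian factors.

    Returns the posterior mean, the posterior variance, and the implicit
    per-sample weights (precision of a sample / total precision)."""
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    weights = [p / total for p in precisions]
    mean = sum(w * m for w, m in zip(weights, means))
    return mean, 1.0 / total, weights


# Two equally confident samples: each receives weight 0.5.
mean2, var2, w2 = gaussian_product([0.0, 2.0], [1.0, 1.0])

# Adding a third, more confident sample lowers the weight of the first
# sample from 0.5 to 0.25, although the first sample itself is unchanged.
mean3, var3, w3 = gaussian_product([0.0, 2.0, 4.0], [1.0, 1.0, 0.5])
```

The weight of a sample therefore cannot be computed from that sample alone, which is consistent with the conjecture that an aggregator scoring each sample in isolation may struggle to reproduce this behavior.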

5.4.2. The Exploration Strategy

Figure 9. Ablation study of exploration. We show the test-task performance with on-policy data (left) and off-policy data (right).

We aim to show the efficacy of the proposed reward shaping for task-exploration. The baseline models are: 1) Enc(Gp)-Exp(None) that uses no exploration policy, 2) Enc(Gp)-Exp(Rand) that uses a (uniformly) random explorer, 3) Enc(Gp)-Exp(RS) that uses an explorer guided by the proposed reward shaping. Here the Gp task encoder is used by default.

As shown in Fig. 9, all models perform equally in terms of asymptotic performance with off-policy data, since sufficient coverage of experiences is inherent in the data randomly sampled from buffers. This demonstrates the significance of improving the coverage of task experiences for task inference.

For the on-policy performance, we can see that Enc(Gp)-Exp(Rand) suffers from significant overfitting, as the coverage brought by randomness alone does not suffice to benefit it within a few episodes. Enc(Gp)-Exp(RS) performs much better than Enc(Gp)-Exp(Rand) because high reward is informative guidance in a reward-varying environment. Our approach combines the merits of both broad coverage over task experiences and informative guidance regarding task relevance, and hence achieves the best performance.

6. Conclusion

We divide the problem of meta-RL into three sub-tasks, and take on probabilistic inference via variational EM learning. We thus present CASTER, a novel meta-RL method that learns dual agents with a task encoder for task inference. CASTER performs efficient task-exploration via a curiosity-driven exploration policy, from which the collected experiences are exploited by a context-aware task encoder. The encoder is equipped with the capacity for relational reasoning, with which the action policy adapts to complete the current task. Through extensive experiments, we show the superiority of CASTER over prior methods in sample efficiency, and empirically reveal that the learned exploration strategy efficiently acquires task-informative experiences with randomized behavior, which effectively helps mitigate meta-overfitting.

7. Acknowledgement

This work is supported by Shanghai NSF Grant (No. 18ZR1425100) and NSFC Grant (No. 61703195).