1. Introduction
Modern reinforcement learning has achieved great success in solving certain complex tasks by utilizing deep neural networks, which can even be trained from scratch alphazero; poker; alphastar. Such successes, however, require a large amount of training experience for each new task. In contrast, human learners are able to exploit past experience when facing a novel problem and quickly learn skills for a related task lake2017building. Achieving such fast adaptation is a critical step towards building a general AI agent capable of solving multiple tasks in real-world environments.
A principled way to tackle the problem of efficient adaptation is the meta-learning framework finn2017model, which aims to capture shared knowledge across tasks and hence enables an agent to learn a similar task using only a few experiences. In the reinforcement learning setting, however, as the learning agent also needs to explore in each novel task, it is particularly challenging to design an efficient meta-RL algorithm. A majority of prior works adopt on-policy RL algorithms rlsquared; finn2017model; gupta2018meta; rothfuss2018promp, which are data-inefficient during meta-training rakelly2019efficient. To remedy this, rakelly2019efficient propose an alternative strategy, PEARL, that relies on off-policy RL methods to achieve sample efficiency in meta-training. By introducing a latent variable to represent a task, their method decomposes the problem into online task inference and task-conditioned policy learning that uses experiences from a replay buffer (i.e., off-policy learning).
Nevertheless, such an off-policy strategy has several limitations at the meta-test stage, particularly in the few-shot setting. First, it ignores the role of exploration in task inference (cf. meta-episodes in rlsquared), which is critical for efficient adaptation, as exploration is responsible for collecting informative experiences within the few available episodes to infer the task. In addition, the agent has to explore in an online fashion for task inference during meta-test, which typically yields a different sample distribution from the replay buffer that provides adaptation data at meta-train time. PEARL partially alleviates this train-test mismatch vinyals2016matching by adopting a replay buffer of recently collected data. Such an in-between strategy rakelly2019efficient, however, still leads to severe “meta-overfitting”, particularly for online task inference (cf. Sec. 4.1 and 5.3). Furthermore, the adaptation data acquired by exploration are simply aggregated by a weighted average pooling for task inference. As the experience samples are not i.i.d., such aggregation fails to capture their dependency relations, which are informative for task inference.
In this work, we propose a context-aware task reasoning strategy for meta-reinforcement learning to address the aforementioned limitations. Adopting a latent representation for tasks, we formulate meta-learning as a POMDP, in which we learn an approximate posterior distribution of the latent task variable jointly with a task-dependent action policy that maximizes the expected return for each task. Our main focus is to develop an adaptive task inference strategy that effectively maps few-shot experiences of a task to its representation without suffering from meta-overfitting.
To achieve this, we design a novel task inference network that consists of an exploration policy network and a structured task encoder shared by all tasks. Learning an exploration policy allows us to explore a task environment more efficiently and to introduce regularization that bridges the gap between meta-train and meta-test, leading to better generalization. The structured task encoder is built on a context-aware graph network, which is permutation-invariant and input size-agnostic in order to cope with a variable number of exploration episodes. Our graph network encodes task-related dependencies among experience samples, and is capable of capturing complex task statistics from exploration to achieve more data-efficient task inference.
To train our meta-learner, we develop a variational EM formulation that alternately optimizes the exploration policy network that collects experiences for a task, the context-aware graph network that performs task inference based on the collected task data, and an action policy network that aims to complete tasks towards maximum rewards given the inferred task information. Our meta-learning objective, formulated under the POMDP framework, is composed of two state-action value functions for the two respective policies. The state-action value for the exploration policy is guided by a reward shaping strategy that encourages the policy to collect task-informative experiences, and is learned with Soft Actor-Critic (SAC) haarnoja2018soft for randomized behavior that narrows the gap between online rollouts and offline buffers. In essence, our method decomposes the meta-RL problem into three subtasks, task-exploration, task-inference and task-fulfillment, in which we learn two separate policies for different purposes and a task encoder for task inference.
We evaluate our meta-reinforcement learning framework on four benchmarks with diverse task distributions, on which our approach outperforms prior works by improving training efficiency (up to 400x) and testing efficiency with better asymptotic performance (up to 300%), while mitigating meta-overfitting by a large margin (up to 75%). Our contributions can be summarized as follows:


We propose a sample-efficient meta-RL algorithm that achieves state-of-the-art performance on multiple benchmarks with diverse task distributions.

We present a dual-agent design with a reward shaping strategy that explicitly optimizes the ability of task exploration and mitigates meta-overfitting.

We develop a context-aware graph network for task inference that models the dependency relations between experience samples in order to efficiently infer the task posterior.
2. Related Work
2.1. Meta-reinforcement Learning
Prior meta-reinforcement learning methods can be categorized into the following three lines of work.
The first line of work adopts a learning-to-learn strategy finn2017model; stadie2018some; gupta2018meta; rothfuss2018promp; xu2018learning. In particular, MAML finn2017model meta-learns a model initialization from which to adapt the parameters with policy gradient methods. To tackle issues in computing second-order derivatives for MAML, ProMP rothfuss2018promp further proposes a low-variance curvature estimator to perform proximal policy optimization that bounds the policy change. Recently, MAESN gupta2018meta improves MAML with more temporally coherent exploration by injecting structured noise via a latent-variable-conditioned policy, and enables fast learning of exploration strategies.
The second category of approaches uses recurrent or recursive models to directly meta-learn a function that takes experiences as input and generates a policy for an agent mishra2017simple; reiforcelearn; rlsquared. Among them, rlsquared trains a recurrent agent with on-policy meta-episodes that comprise a sequence of exploration and task-fulfillment episodes, aiming to maximize the expected return of the task-fulfillment episode. Their method essentially learns to extract task information from the first few rounds of exploration, encoded in the latent states of its RNN, and to complete the task in the final episode based on the inferred task information (latent states).
The first two groups of work often learn a single policy to perform both exploration for policy adaptation and task-fulfillment, which overloads the agent with two distinct objectives. By contrast, we learn two separate policies, one focusing on exploration for policy adaptation and the other on task-fulfillment.
In the third line of work, rakelly2019efficient and taskinf propose to first infer the task from task experiences and then adapt the agent according to this task knowledge. Framing meta-RL as a POMDP problem with probabilistic inference, taskinf formulate task inference as belief-state estimation in a POMDP, and learn the inference procedure in a supervised manner with privileged information as ground truth. PEARL rakelly2019efficient learns to infer tasks in an unsupervised manner by introducing an extra reconstruction term, and incorporates off-policy learning to improve sample efficiency. However, both approaches ignore the role of exploration in task inference, and PEARL suffers from a train-test mismatch in the data distribution for task inference.
By contrast, we propose to further disentangle meta-RL into task-exploration, task-inference and task-fulfillment, and introduce an exploration policy within a variational EM formulation. Our method enhances the inference procedure via active task exploration and effective task inference, achieves data efficiency both at meta-train and meta-test time, and mitigates meta-overfitting.
2.2. Relational Modeling on Sets
The problem of inference on a set of samples has been widely explored in the literature battaglia2018relational; gilmer2017neural; li2015gated; bruna2013spectral; kipf2016semi; hamilton2017inductive; here we focus on approaches that learn a permutation-invariant and input size-agnostic model.
DeepSets zaheer2017deep proposes general design principles for permutation-invariant functions on sets. As an instantiation, the set-to-set model of vinyals2015order encodes a set using an attention mechanism, which retrieves vectors in a manner immune to the shuffling of memory and accepts variable-sized inputs. However, these methods typically do not consider relations among input elements, which is critical for modeling short experiences for task inference.
Self-attention modules internet; vain; santoro2017simple; wang2018non are commonly used to model pairwise relationships between entities of a set. Among them, Non-local Neural Networks wang2018non are capable of capturing global context using scaled dot-product attention vaswani2017attention. Along this direction, graph networks battaglia2018relational; gilmer2017neural provide a flexible framework that can model arbitrary relationships among entities and is invariant to graph isomorphism. Self-attention on sets, a special case of graphs, can naturally be assimilated into the graph network framework zhang2019latentgnn; battaglia2018relational; velivckovic2017graph. Recently, LatentGNN zhang2019latentgnn developed a novel undirected graph model that encodes non-local relationships through a shared latent space and admits efficient message passing due to its sparse connections. In this work, the design of our graph-based inference network is inspired by LatentGNN and DeepSets, aiming to efficiently model the dependency relationships in experience data.
3. Variational Meta-Reinforcement Learning
We aim to address the fast adaptation problem in reinforcement learning, which allows a learned agent to explore a new task (up to a budget) and adapt its policy to it. We adopt the meta-learning framework to enable fast learning, incorporating a prior built from past experiences on structurally similar tasks.
Formally, we assume a task distribution from which multiple tasks are i.i.d. sampled for meta-training and testing. Any task can naturally be defined by its MDP $(\mathcal{S}, \mathcal{A}, \rho_0, \mathcal{P}, \gamma, R)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\rho_0$ is the distribution of initial states, $\mathcal{P}$ is the transition distribution, $\gamma$ is the reward discount (which we will omit for clarity), and $R$ is the reward function. The task distribution is typically defined by the distribution of the reward function and the transition function, which can vary due to some underlying parameters. We introduce a latent variable $z$ to represent the cause of task variations, and formulate the meta-RL problem as a POMDP in which $z$ serves as the unobserved part of the state space.
We first introduce two policies: an action policy that aims to maximize the expected reward under a given task, and an exploration policy that generates task-informative samples by interacting with the environment. We then adopt the off-policy actor-critic framework dpg, and learn two Q-functions for the action and exploration policies respectively. To this end, we define the following learning objective, which maximizes the expected log-likelihood of returns under the two policies (we omit the expectation over tasks for clarity):
(1) 
where denotes the distribution visited by the two behavior policies, denotes the environment, and is the reshaped reward for the exploration policy (Sec. 4.1). Typically, this distribution is approximated by the experience replay buffers of the two policies mnih2013playing; ddpg; haarnoja2018soft. We will slightly abuse notation to denote the joint distribution, for brevity. Here the likelihood of a return given a state-action pair is evaluated against a target value estimated using the optimal Bellman equation. Note that by assuming a Gaussian likelihood whose mean is the Q-value and whose variance is a constant, maximizing the likelihood is equivalent to the temporal-difference learning objective tesauro1995temporal; konda2000actor; mnih2013playing:
(2) 
where the constant variance yields the constant terms in the objective.
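To make this equivalence concrete, assume a Gaussian likelihood over returns with mean given by the Q-function and a fixed variance (the symbols below are our own notation, since the original inline math is not reproduced here):

```latex
\log p(R \mid s, a)
  = \log \mathcal{N}\!\left(R;\; Q(s,a),\; \sigma^2\right)
  = -\frac{\bigl(R - Q(s,a)\bigr)^2}{2\sigma^2} \;-\; \tfrac{1}{2}\log\bigl(2\pi\sigma^2\bigr).
```

Maximizing this log-likelihood with respect to $Q$ is therefore the same as minimizing the squared TD error between $Q(s,a)$ and the Bellman target $R$, with the fixed variance contributing only constant factors.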
We utilize a variational EM learning strategy bishop2006pattern to maximize this objective. Denoting the sampled experiences as above, we introduce a variational distribution to approximate the intractable posterior over the task variable, and alternate between optimizing this distribution and the other parameterized functions. We refer to this variational distribution as the task encoder in our model.
In the E-step, we maximize a variational lower bound derived from (1) to find the optimal task encoder. In order to decouple the action and exploration policies, we also introduce an auxiliary experience distribution and minimize the following free energy:
(3) 
where in the first term the TD error derives from the Q-function, and the constant term comes from the second term in (1), which is irrelevant to the task encoder. For the auxiliary distribution, we adopt the exploration policy’s sample distribution, as the exploration policy aims to collect task-informative experiences for task inference. Hence the E-step objective has the following form:
(4) 
In the M-step, given the task encoder, we minimize the following free-energy objective derived from (1):
(5) 
Based on this variational EM algorithm, our approach learns dual agents for task-exploration and task-fulfillment respectively. The meta-training process interleaves data collection with the alternating EM optimization, as shown in Alg. 1.
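The interleaved training procedure can be sketched as follows. This is a minimal illustration of the alternating EM loop, not the paper's actual implementation: every object interface here (`encoder.sample_posterior`, `explorer.rollout`, `actor.update`, ...) is a hypothetical placeholder.

```python
import random

def meta_train(tasks, encoder, explorer, actor, n_iters=2,
               n_explore_eps=2, batch_size=8):
    """Alternating EM meta-training: collect data, then E-step, then M-step."""
    buffers = {t: [] for t in tasks}
    for _ in range(n_iters):
        # Data collection: both policies interact with each training task.
        for task in tasks:
            z = encoder.sample_posterior(buffers[task])   # task hypothesis
            buffers[task] += explorer.rollout(task, z, n_explore_eps)
            buffers[task] += actor.rollout(task, z, 1)
        for task in tasks:
            z = encoder.sample_posterior(buffers[task])
            batch = random.sample(buffers[task],
                                  min(batch_size, len(buffers[task])))
            encoder.update(batch)       # E-step: update q(z | experiences)
            explorer.update(batch, z)   # M-step: shaped-reward SAC update
            actor.update(batch, z)      # M-step: environment-reward SAC update
    return buffers
```

The key design choice mirrored here is that both policies train from the same per-task replay buffers, while only the explorer's update uses the reshaped reward of Sec. 4.1.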
In contrast to taskinf; rakelly2019efficient, our variational EM formulation derives from a unified learning objective and enables fast learning of dual agents with different roles. The explorer learns a task-exploration ability that targets efficient exploration to (actively) collect sufficient task information for task inference, while the actor learns task-fulfillment, accomplishing the task towards high rewards. Both the explorer and the actor are task-conditioned, so that they are able to perform temporally coherent exploration for different goals given the current task hypothesis gupta2018meta.
With such a formulation, we need to answer the following questions to achieve efficient and effective task inference for the meta-RL problem:


How can we guide the explorer towards active exploration that gathers sufficient task information, which is crucial for task inference with few experiences?

How can we achieve effective task inference that captures the relationships between experiences and reduces task uncertainty?

Since we incorporate off-policy learning algorithms, how can we mitigate the train-test mismatch caused by using replay buffers at training time vs. online rollouts at test time?
We will address the above three questions in Sec. 4, which introduces our task inference strategy in detail.
Meta-test
At meta-test time, we apply the same data collection procedure as in the meta-training stage. For each task, the explorer first samples a hypothesis from the posterior (initialized with the prior), and then explores according to that hypothesis. The experiences collected during exploration are used by the task encoder to update the posterior. This process iterates until the explorer uses up the maximum number of rollouts. All the experiences are then fed to the task encoder to produce a final posterior representing the belief over MDPs. We then sample from this posterior to generate a task-conditioned action policy, which interacts with the environment in an attempt to fulfill the task. This process is shown in Fig. 1.
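The meta-test procedure above can be sketched as a short loop. This is an illustrative skeleton only; the interfaces (`encoder.prior`, `posterior.sample`, `explorer.rollout`, ...) are hypothetical placeholders, not the paper's actual code.

```python
def meta_test(task, encoder, explorer, actor, max_rollouts=2):
    """Iterated posterior sampling: explore, update belief, then fulfill."""
    experiences = []
    posterior = encoder.prior()                   # belief starts at the prior
    for _ in range(max_rollouts):
        z = posterior.sample()                    # sample a task hypothesis
        experiences += explorer.rollout(task, z)  # explore under the hypothesis
        posterior = encoder.infer(experiences)    # update the belief over tasks
    z = posterior.sample()                        # final task hypothesis
    return actor.rollout(task, z)                 # attempt to fulfill the task
```

Note that the explorer is always conditioned on the current task hypothesis, so exploration stays temporally coherent even while the belief is still uncertain.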
4. Task Inference Strategy
We now introduce our task reasoning strategy for meta-RL with limited data. Our goal is twofold: to use the limited exploration budget to generate experiences with sufficient information for task inference, and to effectively infer the task posterior from the given task experiences. In Sec. 4.1, we explain how we learn an explorer that pursues task-informative experiences with randomized behavior and mitigates train-test mismatch. In Sec. 4.2, we elaborate on the network design of the task encoder, which enables relational modeling between task experiences.
4.1. Learning the Exploration Policy
Our exploration policy aims to enrich the task-related information contained in experiences within a limited number of episodes. To this end, we introduce two reward shaping methods below: first, we increase the coverage of task experiences to obtain task-informative experiences; second, we improve the quality of each sample so that it carries as much information gain as possible.
4.1.1. Increasing Coverage
Our first idea for improving exploration is to increase its coverage of task experiences by adding stochasticity to the agent’s behavior, motivated by empirical observations of the performance obtained with the replay buffer (see the off-policy curve in Fig. 9). Off-policy data have greater coverage of experiences, as the samples are uniformly drawn from multiple different trajectories, compared to samples from online rollouts with smaller variations.
To this end, we encourage randomized behavior of the explorer by leveraging SAC haarnoja2018soft. SAC derives from the entropy-regularized RL objective levine2018reinforcement, which essentially adds an entropy term to the reward (and value) functions. Note that we optimize both policies with SAC, but for different purposes. For the actor, SAC favors exploration during policy learning, and the actor runs deterministically at meta-test time. In contrast, for the explorer, SAC guides the agent towards randomized task-exploration, and the explorer remains a stochastic policy during deployment.
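Concretely, the maximum-entropy RL objective underlying SAC augments the expected return with a policy-entropy bonus (standard form with temperature $\alpha$; our notation, not the paper's):

```latex
J(\pi) \;=\; \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \Bigl[\, r(s_t, a_t) \;+\; \alpha\, \mathcal{H}\bigl(\pi(\cdot \mid s_t)\bigr) \Bigr].
```

A larger temperature yields more randomized behavior; for the explorer this stochasticity is retained at deployment, which is what broadens the coverage of its rollouts.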
It is worth noting that training an exploration policy with randomized behavior also helps mitigate train-test mismatch, which is caused by the different data distributions used for policy adaptation (buffer data vs. online rollouts). Our task-exploration explicitly learns a capacity for randomized exploration that (empirically) brings the online rollout data distribution closer to the off-policy buffer data.
4.1.2. Improving Quality
To improve the quality of exploration, we also design a reward shaping that guides the explorer towards highly informative experiences. However, it is non-trivial to quantify the informativeness of a sample. In our case, the explorer collects experiences given a task hypothesis, so we are interested in using the mutual information as the reward for the explorer,
(6) 
where we differentiate the new task hypothesis from the previously hypothesized task, with the repository of collected experiences and the sample under assessment. Intuitively, we expect the credit to reflect the information gain from the sample given the current task hypothesis.
However, directly computing the mutual information is impractical, as it involves evaluating the new posterior after incorporating each new sample. Instead, we adopt the following proxy for the mutual information:
(7) 
This proxy implies that we believe a sample brings a larger information gain if the new hypothesis is less likely to coincide with the prior belief. To compute this quantity, we apply Bayes’ rule:
(8) 
As the joint distribution is constant given the collected experiences, the posterior is proportional to the inverse of the likelihood. Similar to Sec. 3, we use the state-action value function to compute the likelihood, but we instead assume a Laplace distribution, as we empirically find the induced L1-norm more stable than the L2-norm induced by a Gaussian. As a result, we have the following score function that gives the shaped reward:
(9)  
(10) 
where the hyperparameter is the reward scale, and the greedy policy is learned to compute the value dpg; ddpg; haarnoja2018soft.
4.2. Context-aware Task Encoder
Our task inference network computes the posterior of the latent task variable given a set of experience data, aiming to extract task information from experiences. To this end, we design a network module with the following properties:


Permutation-invariant: the output should not vary with the order of the inputs.

Input size-agnostic: the network will encounter variable-sized inputs within an arbitrary number of rollouts.

Context-aware: extracting cues from a single sample should incorporate the context formed by the other samples. (For example, in a 2D-navigation task where the agent aims to navigate to a goal location, a sample with high reward may indicate the possible location of the goal, and another sample with low reward can further eliminate possibilities by showing which locations are not the goal.)
Specifically, we adopt a latent graph neural network architecture zhang2019latentgnn, which integrates self-attention with learned weighted-sum aggregation layers. Formally, we introduce a set of latent node features, each a c-channel feature vector like the input node features. Note that the number of input nodes can be arbitrary, while the number of latent nodes is a fixed hyperparameter. We construct a graph with full connections from every input node to each of the latent nodes. The graph network module is illustrated in Fig. 2.
The output of the aggregation layer is the set of latent node features, each computed as an affinity-weighted sum of the input node features, where a learned affinity function encodes the affinity between an input node and a latent node. In practice, we instantiate this function as a dot product followed by normalization.
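One such aggregation layer can be sketched in a few lines of numpy. This is a simplified illustration under our own parameterization (learnable anchor vectors `L` standing in for the layer's parameters), not the paper's exact affinity function:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate(X, L):
    """Weighted-sum aggregation of n input nodes onto d latent nodes.

    X: (n, c) input node features; L: (d, c) latent-node parameters.
    Affinities are dot products, normalized with a softmax over the inputs.
    """
    A = softmax(L @ X.T, axis=-1)  # (d, n) normalized affinities
    return A @ X                   # (d, c) latent node features
```

By construction the layer is permutation-invariant and size-agnostic: shuffling the rows of `X`, or changing their number, only reshuffles or rescales the softmax weights, leaving the aggregation well-defined.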
We then combine the above aggregation layer with a self-attention layer, for which we use the scaled dot-product attention vaswani2017attention.
Following zhang2019latentgnn, we propagate messages through a shared latent space with full connections between latent nodes. We first pass the input nodes through an aggregation layer with multiple latent nodes, then perform self-attention on the latent nodes (for multiple iterations), and finally pass the latent nodes through another aggregation layer with a single final latent node to obtain the output. This final output provides the parameters of the task posterior, which is a Gaussian distribution. The network can be viewed as a multi-stage summarization process: we group the inputs into several summaries, operate on these summaries to compute the relationships between entities, and produce a final summary of the entities and relationships.
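Putting the stages together, the whole encoder can be sketched as follows. This is a minimal, parameter-light numpy illustration under our own simplified affinity and attention functions, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate(X, L):
    # weighted-sum aggregation with softmax-normalized dot-product affinities
    return softmax(L @ X.T, axis=-1) @ X

def self_attention(H):
    # scaled dot-product self-attention among the latent nodes
    return softmax(H @ H.T / np.sqrt(H.shape[-1]), axis=-1) @ H

def task_encoder(X, L1, L2, n_attn_iters=2):
    """n experience features -> d latent summaries -> relational reasoning
    -> one final summary parameterizing a Gaussian posterior over z."""
    H = aggregate(X, L1)              # stage 1: group inputs into d summaries
    for _ in range(n_attn_iters):
        H = self_attention(H)         # stage 2: relations between summaries
    out = aggregate(H, L2)[0]         # stage 3: single final summary node
    mu, log_sigma = np.split(out, 2)  # Gaussian parameters for q(z | data)
    return mu, np.exp(log_sigma)
```

Because every stage is either a softmax-weighted sum over the inputs or an operation on the fixed set of latent nodes, the full pipeline inherits permutation invariance and accepts any number of experience samples.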
5. Experiments
In this section, we demonstrate the efficacy of our design and the behavior of our method, termed CASTER (shorthand for Context-Aware task encoder with Self-supervised Task ExploRation for efficient meta-reinforcement learning), through a series of experiments. We first introduce our experimental setup in Sec. 5.1. In Sec. 5.2, we evaluate CASTER against three meta-RL algorithms in terms of sample efficiency. We then compare the behavior of CASTER with PEARL regarding overfitting and exploration in Sec. 5.3. Finally, in Sec. 5.4, we conduct an ablation study on our design of the exploration strategy and the task encoder.
5.1. Experimental Setup
We evaluate our method on four benchmarks proposed in rothfuss2018promp and an environment introduced in rakelly2019efficient for ablation. (The two experiments “cheetah-forward-backward” and “ant-forward-backward” are not used because they only include two tasks, the goals of going forward and backward, and do not match the meta-learning assumption that there is a task distribution from which we sample the meta-train task set and a held-out meta-test task set; such benchmarks do not provide convincing evidence for the efficacy of meta-learning algorithms.) The benchmarks are implemented on OpenAI Gym brockman2016openai with the MuJoCo simulator todorov2012mujoco. All of the experiments characterize locomotion tasks, which may vary either in the reward function or in the transition function. We briefly describe the environments as follows.


Half-cheetah-velocity. Each task requires the agent to reach a different target velocity.

Ant-goal. Each task requires the agent to reach a goal location on a 2D plane.

Humanoid-direction. Each task requires the agent to maintain a high velocity in a specified direction without falling over.

Walker-random-params. Each task requires the agent to maintain a high velocity without falling over, under different system configurations.

Point-robot. Each task requires a point-mass robot to navigate to a different goal location on a 2D plane.
We adopt the following evaluation protocol throughout this section: first, the estimated per-episode performance on each task is averaged over at least three trials with different random seeds; second, the test-task performance is evaluated on the held-out meta-test tasks and is the average of the estimated per-episode performance over all tasks, again with at least three trials. More details about the experiments can be found in our GitHub repository.
5.2. Performance
In this section, we compare against three meta-RL algorithms, each representative of one of the three lines of work discussed in Sec. 2.1: 1) an inference-based method, PEARL rakelly2019efficient; 2) an optimization-based method, ProMP rothfuss2018promp; and 3) a black-box method, rlsquared.
For these baselines, we reproduce the experimental results via their officially released code, following their proposed meta-training and testing pipelines. We note that prior works rlsquared; rothfuss2018promp are not designed to optimize for sample efficiency, but we keep their default hyperparameter settings in order to reproduce their results.
To demonstrate training and testing efficiency, we plot the test-task performance as a function of the number of samples. Fig. 3 shows the comparison results on training efficiency. Here the x-axis indicates the number of interactions with the environment used to collect buffer data. At each x-tick, we evaluate the test-task performance with 2 episodes for all methods. For testing efficiency, the x-axis refers to the number of adaptation episodes used by different methods to perform policy adaptation (or task inference).
We can see that CASTER outperforms the other methods by a sizable margin: it reaches the same level of performance with far fewer environment interactions (up to 400x fewer) while also reaching higher performance (up to 300%). PEARL and CASTER incorporate off-policy learning and naturally enjoy an advantage in training sample efficiency. CASTER achieves better performance for two reasons. On the one hand, efficient exploration offers CASTER richer information within limited episodes (two in Fig. 3), raising the upper bound on accurate task inference. On the other hand, context-aware relational reasoning extracts information effectively from the task experiences, improving the quality of task inference.
Testing efficiency is shown in Fig. 4, where CASTER again stands out. Notably, CASTER achieves a large performance boost within the first several episodes, indicating that the learned exploration policy is critical for task inference. By contrast, the other models either fail to improve over their zero-shot performance through exploration (e.g., the flat lines of PEARL in humanoid-dir and ProMP in walker-rand-params) or exhibit unstable performance that drops after an initial improvement (e.g., PEARL in cheetah-vel and ant-goal).
5.3. Understanding CASTER’s Behavior
5.3.1. Meta-overfitting
As mentioned in Sec. 4.1, meta-overfitting arises from train-test mismatch, an issue particular to meta-RL methods that incorporate off-policy learning. We compare the test-task performance obtained with different adaptation data distributions, i.e., off-policy buffer data vs. on-policy exploration data, to measure the performance gap when the data distribution shifts. We pick the three environments that we find most prone to train-test mismatch: cheetah-vel, ant-goal and point-robot. Note that in these experiments, the number of transitions in the off-policy data equals the number of transitions in two episodes of the on-policy data.
Fig. 5 shows the results on the three environments. (The ant-goal curve reproduced with PEARL’s public repository shows a discrepancy from the result reported in rakelly2019efficient, which we suspect to be the train-task performance with off-policy data; we recommend the reader check against the publicly available code.) CASTER takes a large step towards bridging the gap between the train and test sample distributions, reducing the performance drop by up to 75% compared to PEARL. This can be credited to the greater stochasticity inherent in the explorer’s episodes (Sec. 5.3.2) and to CASTER’s better task inference. The stochasticity of the online rollouts brings their distribution closer to the off-policy data distribution, since at meta-train time data are randomly sampled from the buffer. Task inference that is aware of the relations between samples better extracts the information in the few trajectories, which further narrows the gap.
5.3.2. The Learned Exploration Strategy
We investigate the exploration process by visualizing histograms of the rewards in the collected trajectories. We take three consecutive trajectories rolled out by the explorer on two benchmarks: 1) the reward-varying environment ant-goal, and 2) the transition-varying environment walker-randparams. We compare with PEARL, since it also resorts to posterior sampling for exploration strens2000bayesian; osband2013more; osband2016deep.
Fig. 6 shows that CASTER’s exploration spans a large reward region, while PEARL covers a narrower range. This is consistent with our goal of increasing the coverage of task experiences within few episodes, which increases the chances of reaching the goal and enables the task encoder to reason with both high and low rewards, i.e., which regions of the state space might be the goal and which might not. By contrast, PEARL tends to exploit the higher extreme values at the risk of following a wrong direction (e.g., in the second plot, PEARL proceeds to explore in a narrow region of relatively lower reward than in the preceding episode).
In Fig. 7, the two methods are evaluated in the transition-varying environment walker-randparams, in which higher rewards do not necessarily carry information about the system parameters. In this case, PEARL clings to higher-reward regions as expected, while CASTER favors lower-reward regions. Supported by the results in Figs. 3 and 4, PEARL’s exploration is suboptimal here. CASTER performs better because its exploration is driven by information gain, and the explorer discovers task-discriminative experiences, which happen to lie in lower-reward regions.
5.4. Ablations
In this section, we investigate our design choices for the exploration strategy and the task encoder via a set of ablative experiments on the point-robot environment. We also report the test-task performance with off-policy data (adaptation data from a buffer collected beforehand) to eliminate the effect of an insufficient data source for task inference, which induces train-test mismatch.
5.4.1. The Task Encoder
PEARL rakelly2019efficient builds its task encoder by stacking a Gaussian product (Gp) aggregator on top of an MLP, while we propose to combine self-attention with a learned weighted-sum aggregator for better relational modeling. We examine the following models: 1) Enc(Gp)-Exp(None), which uses the Gp task encoder; 2) Enc(WS)-Exp(None), which uses a learned weighted-sum aggregator without self-attention for graph message passing; 3) Enc(GNN)-Exp(None), which uses the proposed GNN encoder; and 4) Enc(GNN)-Exp(RS), the proposed model. Note that for the first three baselines, the explorer is disabled to eliminate the impact of the exploration policy.
In Fig. 8, all models tend to converge to a similar asymptotic performance with off-policy data. However, Enc(WS)-Exp(None) learns much more slowly than its counterparts, while Enc(Gp)-Exp(None) appears to be a reasonable design w.r.t. the off-policy performance. We conjecture that it is hard for Enc(WS)-Exp(None) to learn a weighting strategy like Gp’s, in which the weights are determined by the relative importance of a sample w.r.t. the whole pool of samples, because Enc(WS)-Exp(None) has no access to the other samples when computing the weight of each sample. By contrast, Enc(GNN)-Exp(None) provides more flexibility for the final weighted average pooling by incorporating the interactions between samples. Such relational modeling enables it to extract more task statistics within the few episodes, making it superior to Gp in on-policy performance.
5.4.2. The Exploration Strategy
We aim to show the efficacy of the proposed reward shaping for task-exploration. The baseline models are: 1) Enc(Gp)-Exp(None), which uses no exploration policy; 2) Enc(Gp)-Exp(Rand), which uses a (uniformly) random explorer; and 3) Enc(Gp)-Exp(RS), which uses an explorer guided by the proposed reward shaping. The Gp task encoder is used by default.
As shown in Fig. 9, all models perform equally in terms of asymptotic performance with off-policy data, since sufficient coverage of experiences is inherent in data randomly sampled from buffers. This demonstrates the significance of broad coverage of task experiences for task inference.
For on-policy performance, we can see that Enc(Gp)-Exp(Rand) suffers from significant overfitting, as the coverage brought by pure randomness does not suffice within few episodes. Enc(Gp)-Exp(RS) performs much better than Enc(Gp)-Exp(Rand) because high rewards are an informative guide in a reward-varying environment. Our approach combines the merits of broad coverage over task experiences and informative guidance regarding task relevance, and hence achieves the best performance.
6. Conclusion
We divide the problem of meta-RL into three subtasks, and address it via probabilistic inference with variational EM learning. We thus present CASTER, a novel meta-RL method that learns dual agents together with a task encoder for task inference. CASTER performs efficient task-exploration via a curiosity-driven exploration policy, whose collected experiences are exploited by a context-aware task encoder. The encoder is equipped with the capacity for relational reasoning, with which the action policy adapts to complete the current task. Through extensive experiments, we show the superiority of CASTER over prior methods in sample efficiency, and empirically reveal that the learned exploration strategy efficiently acquires task-informative experiences with randomized behavior, which effectively helps mitigate meta-overfitting.
7. Acknowledgements
This work was supported by Shanghai NSF Grant (No. 18ZR1425100) and NSFC Grant (No. 61703195).