1 Introduction
Meta-Reinforcement Learning (meta-RL) is an approach to quickly solving similar tasks while gathering only a few samples for each task. The fundamental idea behind approaches such as the popular Model-Agnostic Meta-Learning (MAML) [finn2017model] is to determine a universal meta-policy by combining the information gained while working on several tasks. This meta-policy is optimized such that, when adapted using a small number of samples gathered for a given task, it performs near-optimally on that specific task. While the dependence of MAML on only a small number of samples for task-specific adaptation is very attractive, it also means that these samples must be highly representative of that task for meaningful adaptation and meta-learning. However, in many real-world problems, rewards are sparse in that they provide only limited feedback. For instance, there might be a reward only if a robot gets close to a designated waypoint, with no reward otherwise. Hence, MAML is likely to fail on both counts of task-specific adaptation and optimization of the meta-policy over tasks in sparse-reward environments. Making progress towards learning a viable meta-policy in such settings is challenging without additional information.
Many real-world tasks are associated with empirically determined policies used in practice. Such policies could be inexpert, but even limited demonstration data gathered from applying them could contain valuable information in the sparse-reward context. While the fact that the policy generating the data could be inexpert suggests that direct imitation might not be optimal, supervised learning over demonstration data could still enhance adaptation and learning. How best should we use demonstration data to enhance meta-RL in the sparse-reward setting?
Our goal is a principled design of a class of meta-RL algorithms that can exploit demonstrations from inexpert policies in the sparse-reward setting. Our general approach follows two-step algorithms like MAML that employ: (i) Task-specific Adaptation: execute the current meta-policy on a task and adapt it based on the samples gathered to obtain a task-specific policy, and (ii) Meta-policy Optimization: execute task-specific policies on an ensemble of tasks to which they are adapted, and use the samples gathered to optimize the meta-policy from which they were adapted. Our key insight is that we can enhance RL-based policy adaptation with behavior cloning of the inexpert policy to guide task-specific adaptation in the right direction when rewards are sparse. Furthermore, execution of such an enhanced adapted policy should yield an informative sample set that indicates how best to obtain the sparse rewards for meta-policy optimization. We aim to capture, both analytically and empirically, this progression of policy improvement from task-specific policy adaptation to the ultimate meta-policy optimization. Thus, as long as the inexpert policy has an advantage, we should be able to exploit it for meta-policy optimization.
Our main contributions are as follows. (i) We derive a policy improvement guarantee for MAML-like two-step meta-RL algorithms. We show that the inclusion of demonstration data can further increase the policy improvement bound as long as the inexpert policy that generated the data has an advantage. (ii) We propose an approach entitled Enhanced Meta-RL using Demonstrations (EMRLD) that combines RL-based policy improvement and behavior cloning from demonstrations for task-specific adaptation. We further observe that directly applying the meta-policy to a new sparse-reward task sometimes does not yield informative samples, and a warm start of the meta-policy using the demonstrations significantly improves the quality of the samples, resulting in a variant that we call EMRLD-WS. (iii) We show on standard MuJoCo and two-wheeled robot environments that our algorithms work exceptionally well, even when provided with just one trajectory of suboptimal demonstration data per task. Additionally, the algorithms work well even when exposed to only a small number of tasks for meta-policy optimization. (iv) Our approach is amenable to a variety of meta-RL problems wherein tasks can be distinguished by rewards (e.g., whether forward or backward motion yields a reward for a task) or by environment dynamics (e.g., the amount of environmental drift that a wheeled robot experiences changes across tasks). To illustrate the versatility of EMRLD, we not only show simulations on different continuous control multi-task environments, but also demonstrate its excellent performance via real-world experiments on a TurtleBot robot [amsters2019turtlebot]. We provide videos of the robot experiments and code at https://github.com/DesikRengarajan/EMRLD.
Related Work: Here, we provide a brief overview of related work; a more thorough discussion is left to the Appendix.
Meta-learning: Basic ideas on the meta-learning framework are discussed in [hochreiter2001learning; thrun2012learning; duan2016rl]. Model-Agnostic Meta-Learning (MAML) [finn2017model] introduced the two-step approach described above, and can be used in both the supervised learning and RL contexts. However, in its native form, the RL variant of MAML can suffer from inefficient gradient estimation, poor exploration, and dependence on a rich reward function. Among others, algorithms like ProMP [rothfuss2018promp] and DiCE [foerster2018dice] address the issue of inefficient gradient estimation. Similarly, E-MAML [al2017continuous; stadie2018some] and MAESN [gupta2018meta] deal with the issue of exploration in meta-RL. PEARL [rakelly2019efficient] takes a different approach to meta-RL, wherein task-specific contexts are learnt during training and inferred from trajectories during testing to solve the task. HTR [packer2021hindsight] relabels the experience replay data of any off-policy algorithm such as PEARL [rakelly2019efficient] to overcome exploration difficulties in sparse-reward goal-reaching environments. Different from this approach, we use demonstration data to aid learning and are not restricted to goal-reaching tasks.
RL with demonstration: Leveraging demonstrations is an attractive approach to aid learning [hester2018deep; vecerik2017leveraging; nair2018overcoming]. Earlier work has incorporated data from both expert and inexpert policies to assist with policy learning in sparse-reward environments [nair2018overcoming; hester2017learning; vecerik2017leveraging; kang2018policy; rengarajan2022reinforcement]. In particular, [hester2017learning] utilizes demonstration data by adding it to the replay buffer for Q-learning, [rajeswaran2017learning] proposes an online fine-tuning algorithm combining policy gradient and behavior cloning, while [rengarajan2022reinforcement] proposes a two-step guidance approach where demonstration data is used to guide the policy in the initial phase of learning.
Meta-RL with demonstration: Meta Imitation Learning [finn2017one] extends MAML [finn2017model] to imitation learning from expert video demonstrations. WTL [zhou2019watch] uses demonstrations to generate an exploration algorithm, and uses the exploration data along with demonstration data to solve the task. GMPS [mendonca2019guided] extends MAML [finn2017model] to leverage expert demonstration data by performing meta-policy optimization via supervised learning. Closest to our approach are GMPS [mendonca2019guided] and Meta Imitation Learning [finn2017one], and we focus on comparisons with versions of these algorithms, along with the original MAML [finn2017model].
Our work differs from prior work on meta-RL with demonstration in several ways. First, existing works assume the availability of data generated by an expert policy, which severely limits their ability to improve beyond the quality of the policy that generated the data. Their performance degrades significantly when they are presented with suboptimal data generated by an inexpert policy of the kind that might be used in practice. Second, these works use demonstration data in a purely supervised manner, without exploiting the RL structure. We use a combination of RL and supervised learning loss functions to aid the policy gradient, which enables our approach to utilize any available reward information. This makes it superior to existing work in sparse-reward environments, as we illustrate in several simulation settings and real-world robot experiments.
2 Preliminaries
A Markov Decision Process (MDP) is typically represented as a tuple $(\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, $P$ is the transition probability function, $\gamma \in (0,1)$ is the discount factor, and $\mu$ is the initial state distribution. The value function and the state-action value function of a policy $\pi$ are defined as $V^{\pi}(s) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \mid s_0 = s]$ and $Q^{\pi}(s,a) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \mid s_0 = s, a_0 = a]$. The advantage function is defined as $A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$. A policy $\pi$ generates a trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots)$, where $s_0 \sim \mu$, $a_t \sim \pi(\cdot \mid s_t)$, and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. Since the randomness of $\tau$ is specified by $\pi$ and $P$, we denote it as $\tau \sim (\pi, P)$. The goal of a reinforcement learning algorithm is to learn a policy that maximizes the expected infinite-horizon discounted reward $J(\pi) = \mathbb{E}_{\tau \sim (\pi, P)}[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)]$. It is easy to see that $J(\pi) = \mathbb{E}_{s \sim \mu}[V^{\pi}(s)]$. The discounted state-action visitation frequency of a policy $\pi$ is defined as $d^{\pi}(s,a) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{P}(s_t = s, a_t = a)$, and the discounted state visitation frequency as the marginal $d^{\pi}(s) = \sum_{a \in \mathcal{A}} d^{\pi}(s,a)$. It is straightforward to see that $d^{\pi}(s,a) = d^{\pi}(s)\, \pi(a \mid s)$.
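To make the return definition concrete, here is a minimal Monte Carlo estimate of $J(\pi)$ from sampled reward sequences (an illustrative sketch; the helper names are ours, not the paper's):

```python
# Monte Carlo estimation of the discounted return J(pi) = E[sum_t gamma^t r_t]
# from sampled reward sequences. Illustrative helper names, not the paper's code.

def discounted_return(rewards, gamma):
    """Discounted sum of a single trajectory's rewards, computed backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def estimate_J(reward_trajectories, gamma):
    """Sample average of discounted returns over trajectories tau ~ (pi, P)."""
    returns = [discounted_return(tr, gamma) for tr in reward_trajectories]
    return sum(returns) / len(returns)

# A sparse-reward trajectory: zero reward everywhere except at the goal step.
sparse = [0.0] * 10 + [1.0]
j_hat = estimate_J([sparse], gamma=0.9)  # equals 0.9 ** 10
```

The sparse trajectory illustrates the core difficulty discussed later: a learning policy that never reaches the goal sees all-zero rewards, and its estimated return (and hence its policy gradient) carries no signal.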
The total variation (TV) distance between two distributions $p$ and $q$ is defined as $D_{\mathrm{TV}}(p, q) = \frac{1}{2} \sum_{x} |p(x) - q(x)|$. The average TV distance between two policies $\pi_1$ and $\pi_2$, averaged w.r.t. a state visitation frequency $d^{\pi}$, is defined as $D^{d^{\pi}}_{\mathrm{TV}}(\pi_1, \pi_2) = \mathbb{E}_{s \sim d^{\pi}}\left[D_{\mathrm{TV}}(\pi_1(\cdot \mid s), \pi_2(\cdot \mid s))\right]$.
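These definitions can be checked numerically for finite state and action spaces; the following is a small illustrative sketch (names are ours, not from the paper):

```python
# TV distance 0.5 * sum_x |p(x) - q(x)| between two discrete distributions,
# and its average over a state visitation frequency d (illustrative sketch).

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def avg_tv(policy1, policy2, d):
    """Average TV distance between two policies, weighted by visitation d.

    policy1/policy2 map state -> action distribution (list); d maps state -> weight.
    """
    return sum(d[s] * tv_distance(policy1[s], policy2[s]) for s in d)
```

For example, two policies that agree on half the visited states and disagree completely on the other half have an average TV distance of 0.5 under a uniform visitation frequency.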
Gradient-based meta-learning: The goal of a meta-learning algorithm is to learn to perform optimally in a new (testing) task using only limited data, by leveraging the experience (data) from similar (training) tasks seen during training. Gradient-based meta-learning algorithms achieve this goal by learning a meta-parameter that yields a good task-specific parameter after only a few gradient steps w.r.t. the task-specific loss function computed using the limited task-specific data.
Meta-learning algorithms consider a set of tasks $\{\mathcal{T}_i\}$ with a distribution $p(\cdot)$ over them. Each task $\mathcal{T}_i$ is also associated with a data set $\mathcal{D}_i$, which is typically divided into training data $\mathcal{D}^{\mathrm{tr}}_i$ used for task-specific adaptation and validation data $\mathcal{D}^{\mathrm{val}}_i$ used for the meta-parameter update. The objective of gradient-based meta-learning is typically formulated as
$$\min_{\theta} \; \mathbb{E}_{\mathcal{T}_i \sim p} \left[ \mathcal{L}_i(\theta_i, \mathcal{D}^{\mathrm{val}}_i) \right], \quad \text{where} \quad \theta_i = \theta - \alpha \nabla_{\theta} \mathcal{L}_i(\theta, \mathcal{D}^{\mathrm{tr}}_i), \tag{1}$$
where $\mathcal{L}_i$ is the loss function corresponding to task $\mathcal{T}_i$ and $\alpha$ is the learning rate. Here, $\theta_i$ is the task-specific parameter obtained by a one-step gradient update starting from the meta-parameter $\theta$, and the goal is to find the best meta-parameter $\theta$ that minimizes the meta-loss function $\mathbb{E}_{\mathcal{T}_i \sim p}[\mathcal{L}_i(\theta_i, \mathcal{D}^{\mathrm{val}}_i)]$.
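The nested structure of (1) — an outer objective evaluated at inner-loop adapted parameters — can be illustrated on a toy problem. The sketch below uses scalar tasks with quadratic losses $\mathcal{L}_i(\theta) = (\theta - c_i)^2$ and a numerical outer gradient; this toy setting is our own illustration, not the paper's setup:

```python
# Toy gradient-based meta-learning: task i has loss L_i(theta) = (theta - c_i)^2.
# Inner loop: one gradient step per task; outer loop: descend the meta-loss.

def task_loss(theta, c):
    return (theta - c) ** 2

def adapt(theta, c, alpha):
    # inner loop: one exact gradient step on the task loss
    return theta - alpha * 2.0 * (theta - c)

def meta_loss(theta, centers, alpha):
    # outer objective of the form (1): post-adaptation loss averaged over tasks
    return sum(task_loss(adapt(theta, c, alpha), c) for c in centers) / len(centers)

centers = [-1.0, 1.0]          # two "tasks"
theta, alpha, lr, eps = 5.0, 0.25, 0.1, 1e-5
for _ in range(200):           # outer loop with a numerical meta-gradient
    g = (meta_loss(theta + eps, centers, alpha)
         - meta_loss(theta - eps, centers, alpha)) / (2 * eps)
    theta -= lr * g
# theta converges toward 0: the point from which one inner step does best on either task
```

Note that the meta-parameter is not optimal for any single task; it is optimal as an initialization for one-step adaptation, which is exactly the sense in which (1) differs from ordinary multi-task training.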
Gradient-based meta-reinforcement learning:
The gradient-based meta-learning framework is applicable both in supervised learning and in reinforcement learning. In RL, each task $\mathcal{T}_i$ corresponds to an MDP with task-specific model $P_i$ and reward function $r_i$. We assume that the state-action spaces are uniform across tasks, ensuring a first level of task similarity. The task-specific data $\mathcal{D}_i$ consists of trajectories for task $\mathcal{T}_i$ generated according to some policy $\pi$. Since the randomness of a trajectory $\tau$ is specified by $\pi$ and $P_i$, we denote it as $\tau \sim (\pi, P_i)$. We consider the function approximation setting where each policy is represented by a function parameterized by $\theta$ and denoted $\pi_{\theta}$. The task-specific loss in meta-RL is defined as $\mathcal{L}_i(\theta) = -\mathbb{E}_{\tau \sim (\pi_{\theta}, P_i)}[\sum_{t=0}^{\infty} \gamma^{t} r_i(s_t, a_t)]$. The gradient $\nabla_{\theta} \mathcal{L}_i(\theta)$ can then be computed using the policy gradient theorem.
Standard meta-RL training proceeds as follows. A task $\mathcal{T}_i$ (usually a batch of tasks) is sampled at each iteration $k$ of the algorithm. Starting with meta-parameter $\theta_k$, the training data $\mathcal{D}^{\mathrm{tr}}_i$ for task adaptation is generated as trajectories $\tau \sim (\pi_{\theta_k}, P_i)$, and the updated parameter $\theta_{k,i}$ for task $\mathcal{T}_i$ is computed by a policy gradient step evaluated on $\mathcal{D}^{\mathrm{tr}}_i$. The validation data $\mathcal{D}^{\mathrm{val}}_i$ is then collected as trajectories $\tau \sim (\pi_{\theta_{k,i}}, P_i)$, and the meta-parameter is updated by a policy gradient step evaluated on $\mathcal{D}^{\mathrm{val}}_i$. In the next section, we introduce a modified approach that leverages demonstration data for both task adaptation and the meta-parameter update.
3 Meta-RL using Demonstration Data
Most gradient-based meta-RL algorithms learn the optimal meta-parameter and the task-specific parameters from scratch using on-policy approaches. These algorithms rely exclusively on the reward feedback obtained from the training and validation trajectories collected through on-policy rollouts of the meta-policy and task-specific policies. However, in RL problems with sparse rewards, a nonzero reward is typically achieved only when the task is completed or nearly completed. In such sparse-reward settings, trajectories generated by a policy that is still learning may not obtain any useful reward feedback, especially in the early phase of learning. In other words, since the reward feedback is zero or near-zero, the policy gradient will be as well, resulting in no meaningful improvement in the policy. Hence, standard meta-RL algorithms such as MAML, which rely crucially on reward feedback, will not be able to make much progress towards learning a valuable task-specific or meta-policy in sparse-reward settings.
Learning the optimal control policy in sparse-reward environments is recognized as a challenging problem even in the standard RL setting, since most state-of-the-art RL algorithms fail to learn any meaningful policies even after a large number of training episodes [rajeswaran2017learning; kang2018policy; rengarajan2022reinforcement]. One widely accepted approach to overcome this challenge is learning from demonstration, wherein demonstration data obtained from an expert [rajeswaran2017learning] or inexpert policy [kang2018policy; rengarajan2022reinforcement] is used to aid online learning. The intuitive idea is that, even though the demonstration data does not contain any reward feedback, it can be used to guide the learning agent to reach nonzero-reward regions of the state-action space. This guidance, usually in the direction of the goal/target, is achieved by inferring a pseudo-reward signal through supervised learning on the demonstration data.
Can we enhance the performance of meta-RL algorithms in sparse-reward environments by using demonstration data from suboptimal experts? Meta-RL in sparse-reward environments is significantly more challenging than standard RL, because the reward feedback serves the dual objectives of adapting the meta-parameter to specific tasks and of updating the meta-parameter itself. We note that demonstration data helps with both objectives. First, using demonstration data to guide task-specific adaptation is important because adaptation is achieved in one or a few gradient steps, and the policy resulting from each adaptation step might not achieve meaningful reward in a sparse-reward setting. Second, using demonstration data for the meta-parameter update is equally important because of the role of the meta-policy as a reward-yielding exploratory policy. Intuitively, the meta-policy should yield trajectories that reach the vicinity of the reward-achieving region of the state-action space; this does not happen in sparse-reward environments. However, using the guidance from demonstration data, the task-specific policy obtained after task adaptation may be able to generate trajectories that reach the reward-achieving region, accelerating the improvement of the meta-policy.
For meta-learning with demonstration, we assume that each task $\mathcal{T}_i$ is associated with demonstration data $\mathcal{D}^{D}_i$ containing a trajectory generated according to a demonstration policy $\pi^{D}_i$ in an environment with model $P_i$. We do not assume that $\pi^{D}_i$ is the optimal policy for task $\mathcal{T}_i$, because in many real-world applications $\mathcal{D}^{D}_i$ could be generated by an inexpert policy. Our key idea is to enhance task adaptation using the demonstration data by introducing an additional gradient term corresponding to a supervised learning guidance loss. We define the supervised learning loss function for task $\mathcal{T}_i$ as $\mathcal{L}^{\mathrm{BC}}_i(\theta) = -\mathbb{E}_{(s,a) \sim \mathcal{D}^{D}_i}[\log \pi_{\theta}(a \mid s)]$. We note that, though this loss function is the same as in behavior cloning (BC), we use it directly in the gradient update instead of performing a simple warm start. This approach is known to achieve superior performance to naive BC warm starting in the standard RL problem under the sparse-reward setting [rajeswaran2017learning; kang2018policy; rengarajan2022reinforcement]. The task adaptation step at iteration $k$, starting with meta-parameter $\theta_k$, is now obtained as
$$\theta_{k,i} = \theta_k - \alpha \left( \beta_1 \nabla_{\theta} \mathcal{L}_i(\theta_k) + \beta_2 \nabla_{\theta} \mathcal{L}^{\mathrm{BC}}_i(\theta_k) \right), \tag{2}$$
where $\beta_1$ and $\beta_2$ are hyperparameters that control the extent to which the RL and demonstration data influence the gradient.
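The adaptation step above can be illustrated on a toy problem. The sketch below uses a two-armed bandit with a softmax policy; the bandit setting, function names, and hyperparameter values are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Toy illustration of a combined RL + behavior-cloning adaptation step:
# the inner gradient mixes a policy-gradient term with a BC term on a
# demonstrated action, weighted by beta1 and beta2.

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rl_loss_grad(theta, rewards):
    # gradient of L(theta) = -E_{a~pi_theta}[r(a)] for a softmax policy
    pi = softmax(theta)
    return -(pi * rewards - pi * np.dot(pi, rewards))

def bc_loss_grad(theta, demo_action):
    # gradient of -log pi_theta(demo_action): pi - one_hot(demo_action)
    pi = softmax(theta)
    one_hot = np.zeros_like(pi)
    one_hot[demo_action] = 1.0
    return pi - one_hot

def combined_adapt(theta, rewards, demo_action, alpha, beta1, beta2):
    g = beta1 * rl_loss_grad(theta, rewards) + beta2 * bc_loss_grad(theta, demo_action)
    return theta - alpha * g

# Sparse-reward regime: all observed rewards are zero, so the RL gradient
# vanishes, but the BC term still pulls the policy toward the demonstration.
theta = np.zeros(2)
for _ in range(200):
    theta = combined_adapt(theta, np.zeros(2), demo_action=1,
                           alpha=0.5, beta1=1.0, beta2=1.0)
```

In this sparse regime the RL gradient is identically zero and all progress comes from the BC term; when informative rewards are available, both terms contribute, which is the behavior the combined update is designed to exploit.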
The next question is: how do we use demonstration data in the meta-parameter update? One approach is to use only the demonstration data with a supervised learning loss function for updating the meta-parameter, as done in [mendonca2019guided]. We conjecture that such a reduction to supervised learning severely limits the learning capability of the algorithm. First, if the demonstration data is obtained from an inexpert policy, this approach will never be able to achieve optimal performance, because the role of the meta-policy as a reward-yielding exploratory policy will be limited by the true performance of the inexpert policy. Second, the task-specific policies obtained according to (2) may be able to reach the reward-yielding region of the state-action space, as mentioned before. Hence, the validation data collected through rollouts of the policies obtained after task adaptation might contain extremely valuable reward feedback, and utilizing this data can have a significant impact on learning the meta-parameter. Thus, in our approach, we update the meta-parameter using the RL loss with policy gradient as follows:
$$\theta_{k+1} = \theta_k - \eta \, \nabla_{\theta} \sum_{i} \mathcal{L}_i(\theta_{k,i}), \tag{3}$$
where $\eta$ is the meta learning rate and the gradient is evaluated on the validation data $\mathcal{D}^{\mathrm{val}}_i$ collected by rolling out the adapted policies $\pi_{\theta_{k,i}}$.
We note that the demonstration data is indeed used in the meta-parameter update, albeit implicitly, since each adapted parameter $\theta_{k,i}$ in (3) depends on the BC gradient from (2). We found empirically that doubly using the demonstration data, either by adding an additional BC gradient term to the meta-update or by mixing the demonstration data into the validation data, results in similar or worse performance than the approach described above.
We now formally present our algorithm, called Enhanced Meta-RL using Demonstrations (EMRLD).
We now present a theoretical justification of why EMRLD should have superior performance in the sparse-reward setting compared to other gradient-based algorithms that do not use demonstration data. First, we introduce some notation. Let $\pi_{\theta_k}$ be the meta-policy used at iteration $k$ of our algorithm, and let $\pi_{k,i} = \pi_{\theta_{k,i}}$ be the policy obtained after task-specific adaptation for task $\mathcal{T}_i$. Recall that $V^{\pi}_i$ is the value of the policy $\pi$ for the MDP corresponding to task $\mathcal{T}_i$. Similarly, we define the state-action value function and advantage function of policy $\pi$ for task $\mathcal{T}_i$ as $Q^{\pi}_i$ and $A^{\pi}_i$, respectively. Also, let $d^{\pi}_i$ be the visitation frequency of policy $\pi$ for task $\mathcal{T}_i$. Now, we can define the value of the meta-policy over the ensemble of all tasks as $J(\theta_k) = \sum_{i} \mathbb{E}_{s \sim \mu_i}\big[V^{\pi_{k,i}}_i(s)\big]$.
If the demonstration data is to be useful, it should provide a reasonable amount of guidance. In particular, we would like the task-specific policy adapted using this data to collect feedback that ensures good meta-policy updates, particularly in the initial stages of meta-training. Since the capability of the demonstration data to guide adaptation depends on the demonstration policy $\pi^{D}_i$ according to which it is generated, we make the following assumption about $\pi^{D}_i$.
Assumption 3.1.
During the initial stages of meta-training, $\mathbb{E}_{s \sim d^{\pi_{k,i}}_i}\big[\mathbb{E}_{a \sim \pi^{D}_i(\cdot \mid s)}[A^{\pi_{k,i}}_i(s,a)]\big] \geq \Delta_i > 0$, for all tasks $\mathcal{T}_i$ and iterations $k$.
Assumption 3.1 implies that during the initial stages of meta-training, the demonstration policy can provide a higher advantage on average than the current policy adapted to that task. This is reasonable, since any sensible demonstration policy is likely to perform much better than an untrained policy in the initial phase of learning. We also note that a similar assumption has been used in the learning-from-demonstration literature [kang2018policy; rengarajan2022reinforcement].
We now present the performance improvement result for EMRLD.
Theorem 3.2.
Let $\pi_{\theta_k}$ be the meta-policy used at iteration $k$ of our algorithm and let $\pi_{k,i}$ be the policy obtained after task adaptation in task $\mathcal{T}_i$. Let Assumption 3.1 hold for $\pi^{D}_i$. Then, for appropriate constants $c_1, c_2 > 0$,
$$J(\theta_{k+1}) - J(\theta_k) \geq \sum_{i} \left[ \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_{k,i}}_i,\, a \sim \pi_{k+1,i}}\!\big[A^{\pi_{k,i}}_i(s,a)\big] - c_1\, D^{d^{\pi_{k,i}}_i}_{\mathrm{TV}}(\pi_{k,i}, \pi_{k+1,i}) \right] + \sum_{i} \left[ \frac{\Delta_i}{1-\gamma} - c_2\, D^{d^{\pi_{k,i}}_i}_{\mathrm{TV}}(\pi^{D}_i, \pi_{k,i}) \right].$$
Theorem 3.2 presents a lower bound for the meta-policy improvement as a sum of two groups of terms. Maximizing the first term of group one subject to a constraint on its second term ensures a higher lower bound, and hence an improvement during meta-parameter training. This is indeed achieved by the TRPO step used in the meta-parameter update, so the first group is the same for any MAML-type algorithm. The advantage of the demonstration data is revealed in the second group of terms. The term $\Delta_i/(1-\gamma)$ adds a positive quantity to the lower bound, and the contribution of this second group can be maximized by minimizing the TV distance between $\pi^{D}_i$ and $\pi_{k,i}$. However, this minimization is hard to perform in practice because estimating it requires sampling data according to $\pi_{k,i}$, which is not feasible at iteration $k$. Hence, in practice, we replace that term by the TV distance evaluated w.r.t. the demonstration data distribution, which can easily be minimized by including the standard maximum likelihood objective in the adaptation step. Thus, EMRLD exploits both the advantage offered by an RL step and that of behavior cloning for meta-policy optimization.
We can further improve the performance of EMRLD by including a behavior cloning warm-start step before performing the update (2). We simplify this warm start to a one-step gradient, $\tilde{\theta}_{k,i} = \theta_k - \alpha \nabla_{\theta} \mathcal{L}^{\mathrm{BC}}_i(\theta_k)$, and then perform the task adaptation as in (2) starting from $\tilde{\theta}_{k,i}$. We call this version of our algorithm EMRLD-WS. Such a warm start is likely to provide more meaningful samples than directly rolling out the meta-policy to obtain samples for task-specific adaptation. In the next section, we will see empirically how our design choices for EMRLD and EMRLD-WS enable them to learn policies that provide higher rewards using only a small amount of (even suboptimal) demonstration data.
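The warm start described above can be sketched in the same toy fashion: one supervised (BC) gradient step on the demonstration data before task adaptation. The softmax-policy setting and all names here are illustrative assumptions, not the paper's code:

```python
import numpy as np

# Toy sketch of a behavior-cloning warm start: one BC gradient step on the
# demonstration data before task adaptation, so that rollouts used for
# adaptation already head toward the reward-bearing region.

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def bc_grad(theta, demo_actions):
    # gradient of the mean negative log-likelihood of the demonstrated actions
    pi = softmax(theta)
    g = np.zeros_like(theta)
    for a in demo_actions:
        one_hot = np.zeros_like(pi)
        one_hot[a] = 1.0
        g += pi - one_hot
    return g / len(demo_actions)

def warm_start(theta, demo_actions, lr_ws):
    """One supervised gradient step toward the demonstration policy."""
    return theta - lr_ws * bc_grad(theta, demo_actions)
```

A single such step already shifts probability mass toward the demonstrated actions, which is what makes the subsequent rollouts more informative than rollouts of the raw meta-policy.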
4 Experimental Evaluation
We evaluate the performance of EMRLD based on whether the meta-policy it generates is a good initial condition for task-specific adaptation in sparse-reward environments over (i) tasks already seen in training and (ii) new unseen tasks. We seek to validate the conjecture that, in the sparse-reward setting, EMRLD should be able to leverage even demonstrations of inexpert policies to attain high test performance over previously unseen tasks. We do so with regard to two classes of tasks, namely,
Tasks that differ in their reward functions: simulation experiments on Point2D Navigation [finn2017model], Two-Wheeled Locomotion [gupta2018meta], and HalfCheetah [wawrzynski2009cat; todorov2012mujoco].
Tasks that differ in their environment dynamics: real-world experiments using a TurtleBot, a two-wheeled differential drive robot [amsters2019turtlebot].
4.1 Experiments on simulated environments
Sparse multi-task environments
We present simulation results for three standard environments, shown in Figure 1 and described below. We train over a small number of tasks that differ in their reward functions, and generate unseen test tasks by randomly modifying the reward function.
Point2D Navigation is a two-dimensional goal-reaching environment. The states are the locations of the agent on a 2D plane, and the actions are appropriate 2D displacements. Training tasks are defined by a fixed set of goal locations on a semicircle of fixed radius. The agent receives zero reward everywhere except within a certain distance of the goal location, making the reward function highly sparse. Within a single task, the objective of the agent is to reach the goal location in the least number of time steps starting from the origin. Test tasks are generated by sampling any point on the semicircle as the goal.
Two-Wheeled Locomotion is a goal-reaching environment with sparse rewards, similar to Point2D Navigation. However, the robot is constrained by its permissible actions (limits on angular and linear velocity) and by the trajectories feasible given its turning radius. Here, our training tasks are a fixed set of goal locations on a semicircle of fixed radius, while test goals are sampled randomly. Further details on the state space and dynamics are provided in the Appendix.
HalfCheetah Forward-Backward consists of two tasks in which the HalfCheetah agent learns to move either in the forward (task 1) or backward (task 2) direction with as high a velocity as possible. The agent gets a reward only after it has moved a certain number of units along the x-axis in the correct direction, making the rewards sparse. Training and testing use the same two tasks.
Optimal data and suboptimal data
We provide a limited amount of demonstration data, in the form of just one trajectory per task, for guidance. Optimal data consists of transitions generated by an expert policy trained using TRPO. Suboptimal data is generated by an inexpert, partially trained TRPO policy with induced action noise and truncated trajectories, as shown in Figure 2.
Baselines: We compare the performance of our algorithms against the following gradient-based meta-RL algorithms: (i) MAML [finn2017model]: the standard MAML algorithm for meta-RL. (ii) Meta-BC: a variant of [finn2017one]; this is a supervised learning/behavior cloning version of MAML, where the maximum likelihood loss is used in both the adaptation and meta-optimization steps. (iii) GMPS [mendonca2019guided]: guided meta-policy search, which uses RL for gradient-based adaptation and demonstration data for supervised meta-parameter optimization. The implementation of our algorithms and baselines is based on a publicly available meta-learning code base [Arnold2020ss] licensed under the MIT License.
(Figure caption, partially recovered) For the training plots, a solid line corresponds to the mean performance over random seeds, and the shaded region corresponds to the standard deviation over them. For the test plots, a solid line corresponds to the mean performance over all testing tasks, and the shaded region corresponds to the standard deviation over them.
Performance with optimal demonstration data:
We illustrate the training and testing performance of the different algorithms trained and tested with optimal data in Figure 3. The top row of Figure 3 shows the average adapted return across training tasks of the meta-policy over training iterations. The bottom row shows the average return of the trained meta-policy adapted across testing tasks over adaptation steps. We see that our algorithms outperform the others by obtaining the highest average return, and are able to quickly adapt to testing tasks with just one adaptation step and one trajectory of demonstration data. Additionally, our algorithms show a nearly monotone improvement in average return, demonstrating stable learning. Meta-BC fails and has unstable training performance because the amount of demonstration data available per task is very small; training over only a small number of tasks further hampers its performance. MAML and GMPS fail to learn due to the reward sparsity of the environment, as the purely RL adaptation step incurs almost zero reward and hence a negligible learning signal. Furthermore, GMPS is hampered in the meta-update step by the availability of only a small amount of demonstration data per task.
Performance with suboptimal demonstration data:
EMRLD uses a combination of RL and imitation, which is valuable when presented with suboptimal demonstrations. For the Point2D Navigation environment, we collect suboptimal data for each task using a partially trained agent with induced action noise, and truncate the trajectories short of the reward region; hence, pure imitation cannot reach the goal. For the Two-Wheeled Locomotion environment, we collect data in a similar fashion for all tasks, but remove state-action pairs at the beginning of each trajectory. Since the first few state-action pairs contain information on how to orient the two-wheeled agent towards the goal, this truncation eliminates the possibility of direct imitation being successful. Similarly, in HalfCheetah we use a partially trained policy and truncate trajectories before they reach the reward-bearing region. Figure 4 illustrates that EMRLD outperforms all the baselines and is able to quickly adapt to unseen tasks, emphasizing the benefit of its RL component. Meta-BC and GMPS fail because they are restricted by the optimality of the data, and the absence of crucial information greatly impacts their performance. MAML again fails due to the sparsity of the reward.
We conclude by presenting, in Figure 5, sets of trajectories generated during testing in the Two-Wheeled Locomotion environment when provided with optimal or suboptimal demonstration data. The variants of EMRLD clearly outperform the others, showing their strength in the sparse-reward setting.
4.2 Real-world Experiments on TurtleBot
We demonstrate the ability of the EMRLD variants to adapt to tasks that differ in their environment dynamics and have sparse reward feedback. We do so via performance evaluation in the real world using a TurtleBot, shown in Figure 6 (left). We first modify the Two-Wheeled Locomotion sparse-reward environment by fixing the goal and changing the dynamics through an induced residual angular velocity that mimics drift in the environment. This environmental drift is what differentiates the tasks: for a given task, the environment causes the robot to drift in some specific unknown direction. We train on a set of tasks with different residual angular velocity values (i.e., different drift environments). We use one trajectory of demonstration data per task, collected using an expert policy trained with TRPO. Note that all training and data collection is done in simulation. The results are shown in Figure 6 (middle), where we see that the EMRLD variants clearly outperform the others.
For testing, we consider an environment where the TurtleBot experiences a fixed but unknown residual angular velocity representing environmental drift. That is, we bias the angular velocity control of the TurtleBot by an amount unknown to the algorithm under test. We first execute the meta-policy on the TurtleBot in the real world to collect trajectories, and also provide one trajectory of simulated demonstration data. We use these samples to adapt the meta-policy, and then execute the adapted task-specific policy on the TurtleBot. The results are shown in Figure 6 (right), where the trajectories start at the origin and the goal is indicated by a star. It is clear that the variants of EMRLD are the best at quickly adapting to the drift in the environment, succeeding with just one step of adaptation.
5 Conclusion
We studied the problem of meta-RL algorithm design for sparse-reward problems in which demonstration data generated by a possibly inexpert policy is also provided. Our key observation was that simple application of an untrained meta-policy in a sparse-reward environment might not provide meaningful samples, and that guidance provided by imitating the inexpert policy can greatly alleviate this effect. We first showed analytically that this insight is accurate and that meta-policy improvement is feasible as long as the inexpert demonstration policy has an advantage. We then developed two meta-RL algorithms enhanced by demonstration data, EMRLD and EMRLD-WS. We showed through extensive simulations, as well as real-world robot experiments, that EMRLD offers a considerable advantage over existing approaches in sparse-reward scenarios.
6 Limitations and Future Work
EMRLD inherits the limitations of gradient-based meta-RL approaches like MAML, namely on-policy training, and the need for data collection and gradient computation during test-time adaptation. A limitation specific to our proposed algorithms is the assumed availability of task-specific demonstration data. However, we reiterate that for a small number of training tasks this assumption is quite practical; furthermore, our framework allows this data to be suboptimal.
A possible future direction is context-based meta-RL (which does not require gradient computation during testing) with demonstration data. Another direction is to explore the use of demonstration data in off-policy meta-RL algorithms.
7 Acknowledgement
This work was supported in part by the National Science Foundation (NSF) grants NSFCAREEREPCN2045783 and NSF ECCS 2038963, and U.S. Army Research Office (ARO) grant W911NF1910367. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsoring agencies.
Ethics Statement and Societal Impacts
Our work considers the theory and instantiation of metaRL algorithms that were trained and tested on simulation and robot environments. No human subjects or human generated data were involved. Thus, we do not perceive ethical concerns with our research approach.
While reinforcement learning shows much promise for application to societally valuable systems, applying it to environments that include human interaction must proceed with caution. This is because guarantees are probabilistic, and ensuring that the risk is kept within acceptable limits is a must to ensure safe deployments.
Checklist

For all authors…

Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Section 1

Did you describe the limitations of your work? Section 4

Did you discuss any potential negative societal impacts of your work?

Have you read the ethics review guidelines and ensured that your paper conforms to them?


If you are including theoretical results…

Did you state the full set of assumptions of all theoretical results? Section 3

Did you include complete proofs of all theoretical results? Appendix


If you ran experiments…

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? Supplementary Material (https://github.com/DesikRengarajan/EMRLD)

Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? Appendix

Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? Section 4

Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? Appendix


If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

If your work uses existing assets, did you cite the creators? Section 4

Did you mention the license of the assets?

Did you include any new assets either in the supplemental material or as a URL? Appendix

Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?


If you used crowdsourcing or conducted research with human subjects…

Did you include the full text of instructions given to participants and screenshots, if applicable?

Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix A Proof of Theorem 3.2
We will use the well known Performance Difference Lemma [kakade2002approximately] in our analysis.
Lemma A.1 (Performance difference lemma, [kakade2002approximately]).
For any two policies $\pi$ and $\pi'$,
$$V^{\pi'}(\rho) - V^{\pi}(\rho) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi'}_{\rho}}\, \mathbb{E}_{a \sim \pi'(\cdot \mid s)} \big[ A^{\pi}(s,a) \big], \qquad (4)$$
where $A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$ is the advantage function of $\pi$, and $d^{\pi'}_{\rho}(s) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^{t}\, \Pr(s_t = s \mid s_0 \sim \rho,\, \pi')$ is the discounted statevisitation distribution of $\pi'$ for initial state distribution $\rho$.
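The performance difference lemma can be verified numerically with exact value computations on a small MDP. The sketch below uses a hypothetical two-state, two-action MDP (all transition probabilities and rewards are illustrative, not taken from the paper), computes both sides of the identity in closed form, and checks that they agree:

```python
import numpy as np

gamma = 0.9
# Hypothetical MDP: P[a, s, s'] transition probabilities, R[s, a] rewards.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])

def mdp_quantities(pi):
    """Exact V and Q of a stochastic policy pi[s, a] via a linear solve."""
    P_pi = np.einsum('sa,ast->st', pi, P)   # policy-averaged transitions
    r_pi = np.einsum('sa,sa->s', pi, R)     # policy-averaged rewards
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    return V, Q

def visitation(pi, s0):
    """Discounted visitation d(s) = (1-gamma) sum_t gamma^t Pr(s_t = s)."""
    P_pi = np.einsum('sa,ast->st', pi, P)
    mu0 = np.zeros(2); mu0[s0] = 1.0
    # d^T = (1 - gamma) * mu0^T (I - gamma P_pi)^{-1}
    return (1 - gamma) * np.linalg.solve((np.eye(2) - gamma * P_pi).T, mu0)

pi  = np.array([[0.5, 0.5], [0.5, 0.5]])   # base policy
pi2 = np.array([[0.9, 0.1], [0.2, 0.8]])   # "adapted" policy
s0 = 0

V1, Q1 = mdp_quantities(pi)
V2, _  = mdp_quantities(pi2)
A1 = Q1 - V1[:, None]                      # advantage of the base policy
d2 = visitation(pi2, s0)

lhs = V2[s0] - V1[s0]
rhs = np.einsum('s,sa,sa->', d2, pi2, A1) / (1 - gamma)
```

The check `lhs == rhs` (up to numerical precision) holds for any pair of policies, which is what the lemma asserts.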
Proof of Theorem 3.2.
Recall the following notation: $\pi_k$ denotes the metapolicy used at iteration $k$ of our algorithm, $\pi_k^i$ denotes the policy obtained after taskspecific adaptation for task $i$, $d_i^{\pi}$ denotes the statevisitation frequency of policy $\pi$ for task $i$, and $V_i(\pi)$ denotes the value of policy $\pi$ for the MDP corresponding to task $i$. The value of the metapolicy is defined as the average over tasks of the values of the corresponding adapted policies.
Appendix B Environments
In this section, we describe all the simulation and realworld environments in detail.
B.1 Simulation Environments
Point 2D Navigation: Point 2D Navigation [finn2017model] is a twodimensional goalreaching environment with states $(x_t, y_t)$, actions $(a^x_t, a^y_t)$, and the following dynamics,
$$x_{t+1} = x_t + a^x_t, \qquad y_{t+1} = y_t + a^y_t,$$
where $x_t$ and $y_t$ are the $x$ and $y$ location of the agent, and $a^x_t$ and $a^y_t$ are the actions taken, which correspond to the displacement in the $x$ and $y$ direction respectively, all at time step $t$. The goals are located on a semicircle, and the episode terminates when the agent reaches the goal or exceeds the episode time limit. The sparse reward function for the agent is defined as follows,
$$r_t = \begin{cases} \text{number of time steps remaining in the episode}, & \operatorname{dist}(\text{agent}, \text{goal}) \le 0.02,\\[2pt] 1 - \operatorname{dist}(\text{agent}, \text{goal}), & 0.02 < \operatorname{dist}(\text{agent}, \text{goal}) \le 0.2,\\[2pt] 0, & \text{otherwise}, \end{cases}$$
where $(x_g, y_g)$ is the location of the goal. The agent is given zero reward everywhere except within a distance of $0.2$ of the goal location, where it receives two kinds of rewards. If the agent is very close to the goal, within a distance of $0.02$, it is rewarded with a positive bonus equal to the number of time steps remaining in the episode. This creates a sink near the goal location that traps the agent inside it, rather than letting it wander in the region to keep collecting misleading positive reward. For distances between $0.02$ and $0.2$, the agent is given a positive reward of $1 - \operatorname{dist}(\text{agent}, \text{goal})$.
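A minimal sketch of this environment, implementing the dynamics and the sparse reward described above. The thresholds 0.02 and 0.2 are from the text; the horizon and the goal used in the example are illustrative choices, not values from the paper:

```python
import numpy as np

HORIZON = 100             # illustrative episode length
INNER, OUTER = 0.02, 0.2  # reward thresholds stated in the text

class Point2DNav:
    """Sketch of the sparse-reward Point 2D Navigation environment."""

    def __init__(self, goal):
        self.goal = np.asarray(goal, dtype=float)

    def reset(self):
        self.pos = np.zeros(2)
        self.t = 0
        return self.pos.copy()

    def step(self, action):
        # Dynamics: the position is displaced directly by the action.
        self.pos = self.pos + np.asarray(action, dtype=float)
        self.t += 1
        d = np.linalg.norm(self.pos - self.goal)
        if d <= INNER:
            # Bonus of the remaining time steps creates a "sink" at the goal.
            reward, done = float(HORIZON - self.t), True
        elif d <= OUTER:
            reward, done = 1.0 - d, False
        else:
            reward, done = 0.0, False
        return self.pos.copy(), reward, done or self.t >= HORIZON, {}
```

For example, with an illustrative goal at $(0.5, 0)$, an agent at distance $0.15$ receives reward $0.85$, while reaching the goal yields the remaining-time bonus and ends the episode.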
TwoWheeled Locomotion: The TwoWheeled Locomotion environment [gupta2018meta] is designed based on the twowheeled differential drive model with states $(x_t, y_t, \theta_t)$, actions $(v_t, \omega_t)$, and the following dynamics,
$$x_{t+1} = x_t + v_t \cos(\theta_t)\, dt, \qquad y_{t+1} = y_t + v_t \sin(\theta_t)\, dt, \qquad \theta_{t+1} = \theta_t + \omega_t\, dt,$$
where $x_t$ and $y_t$ correspond to the $x$ and $y$ coordinates of the agent, $\theta_t$ is its heading, $v_t$ and $\omega_t$ are the actions corresponding to the linear and angular velocity of the agent, all at time $t$, and $dt$ is the time discretization factor. Goals are located on a semicircle, and the episode terminates if the agent reaches the goal, exceeds the episode time limit, or moves out of a square bounding region. The sparse reward function for the agent is defined as follows,
where $x_g$ and $y_g$ are the location of the goal.
Half Cheetah ForwardBackward: The Half Cheetah ForwardBackward environment [finn2017model] is a modified version of the standard MuJoCo [todorov2012mujoco] HalfCheetah environment, where the agent is tasked with moving forward or backward, with the episode terminating if the agent exceeds the episode time limit. The sparse reward is based on the agent's displacement per unit time in the goal direction, penalized by a control cost: here $x_t$ corresponds to the position of the agent and $c_t$ is the control cost, both at time step $t$, $dt$ is the time discretization factor, and $g$ is the goal direction, which is $+1$ for the forward task and $-1$ for the backward task.
TwoWheeled Locomotion with Changing Dynamics: We modify the TwoWheeled Locomotion environment by fixing the goal to a single location and adding a residual angular velocity,
$$\theta_{t+1} = \theta_t + (\omega_t + \omega_{\mathrm{res}})\, dt,$$
where $\omega_{\mathrm{res}}$ is the residual angular velocity, each value of which corresponds to a different task, and which mimics drift in the environment. The sparse reward function is similar to the one described in Section B.1.
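The differential-drive update with an optional residual angular velocity can be sketched as follows; the time step `dt` is an illustrative value, and `omega_res` plays the role of the drift term described above:

```python
import numpy as np

def diff_drive_step(state, action, dt=0.1, omega_res=0.0):
    """One step of the two-wheeled (differential drive) dynamics.

    state  = (x, y, theta): position and heading
    action = (v, omega): linear and angular velocity commands
    omega_res: residual angular velocity modeling environmental drift
    """
    x, y, theta = state
    v, omega = action
    x += v * np.cos(theta) * dt
    y += v * np.sin(theta) * dt
    theta += (omega + omega_res) * dt   # drift enters as a bias on omega
    return np.array([x, y, theta])
```

Setting `omega_res = 0` recovers the base TwoWheeled Locomotion dynamics; a nonzero value curves every trajectory by a fixed, unknown amount, which is what the adapted policy must compensate for.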
B.2 RealWorld TurtleBot Platform and Experiments
We deploy the policy trained on the environment described in Section B.1 on a TurtleBot 3 [amsters2019turtlebot], a realworld opensource differential drive robot. We use ROS as middleware to set up communication between the bot and a custombuilt OpenAI Gym environment, which acts as an interface between the policy being deployed and the bot. The custombuilt environment subscribes to ROS topics (/odom, which communicates the state of the bot) and publishes actions (/cmd_vel). This is done asynchronously through a callbackdriven mechanism. The bot transmits its state information over a wireless network to an Intel NUC, which transmits back the corresponding action according to the policy being deployed. The trajectories executed by the adapted policies are plotted in Figure 7 (note that Figure 7 is the same as Figure 6, replotted here for clarity). During policy execution on the TurtleBot, we set a residual angular velocity that mimics drift, and we note that our algorithms (EMRLD and EMRLDWS) are able to adapt to the drift in the environment and reach the goal. We further note that MAML takes a longer, suboptimal route to reach the reward region, but misses the goal.
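The asynchronous, callback-driven exchange described above can be sketched independently of ROS as a thread-safe state cache. In the real system, a method like `odom_callback` would be registered as the `/odom` subscriber callback, and the latest cached state would be read when publishing an action to `/cmd_vel`; the class and method names here are illustrative, not the paper's implementation:

```python
import threading

class StateCache:
    """Thread-safe cache for the most recent robot state.

    Mimics the callback-driven pattern: a middleware thread pushes state
    updates asynchronously, while the Gym-style environment polls the
    latest state whenever it needs to compute and publish an action.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._state = None

    def odom_callback(self, msg):
        # In the real setup this runs in the subscriber's callback thread.
        with self._lock:
            self._state = msg

    def latest(self):
        # Called by the environment when an action must be computed.
        with self._lock:
            return self._state
```

The lock ensures the environment never reads a partially updated state, which matters because the subscriber callback and the policy loop run in different threads.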
We have provided a link to a realworld demonstration with our code (https://github.com/DesikRengarajan/EMRLD). For EMRLD, we show the execution of the metapolicy used to collect data, and of the adapted policy. It can be clearly seen that the metapolicy collects rewards in the vicinity of the goal region, which are then used for adaptation. The adapted policy then reaches the goal. We further show the execution of the adapted policies for the baseline algorithms on the TurtleBot, and observe that EMRLD and EMRLDWS outperform all the baseline algorithms and reach the goal.
Appendix C Experimental Setup
Computing infrastructure and run time: The experiments are run on computers with an AMD Ryzen Threadripper 3960X 24core processor with a max CPU speed of 3800MHz. Our implementation does not make use of GPUs; instead, it is CPUthread intensive. On average, EMRLD and EMRLDWS take about 3h to run on the smaller environments, and about 5h on HalfCheetah. We train goalconditioned expert policies using TRPO; expert policy training takes about 0.5h. Our code is based on learn2learn (https://github.com/learnables/learn2learn) [Arnold2020ss], a software library built using PyTorch [paszke2019pytorch] for metaRL research.

Neural network and hyperparameters: In our work, the metapolicy and the adapted policies are stochastic Gaussian policies parameterized by neural networks. The input to each policy network is the state vector and the output is a Gaussian mean vector; the standard deviation is kept fixed and is not learnable. During training, an action is sampled from the resulting Gaussian distribution. For the value baseline (used for advantage computation) of the metalearning algorithms, we use a linear baseline function fit to the discounted sum of rewards from each state until the end of the episode. This was first proposed in [duan2016benchmarking] and is used in MAML [finn2017model]. It is preferred because a learnable baseline can add additional gradientcomputation and backpropagation overheads in metalearning.
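A minimal sketch of such a linear value baseline. The feature map below (state, its elementwise square, polynomial functions of normalized time, and a bias) is one common choice in the spirit of [duan2016benchmarking]; the exact features are an assumption, not restated from the paper:

```python
import numpy as np

def features(states, horizon):
    """Hand-crafted features: state, squared state, time polynomials, bias."""
    T = states.shape[0]
    t = (np.arange(T) / horizon).reshape(-1, 1)
    return np.hstack([states, states**2, t, t**2, t**3, np.ones((T, 1))])

def fit_linear_baseline(states, returns, horizon, reg=1e-5):
    """Ridge-regularized least-squares fit of discounted returns."""
    F = features(states, horizon)
    return np.linalg.solve(F.T @ F + reg * np.eye(F.shape[1]), F.T @ returns)

def predict_baseline(w, states, horizon):
    """Baseline values used to form advantage estimates."""
    return features(states, horizon) @ w
```

Because the fit is a single linear solve per batch of trajectories, it adds no backpropagation overhead to the metalearning inner loop, which is the motivation stated above.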
We use TRPO on goalconditioned policies to obtain optimal and suboptimal experts for all the tasks in an environment at once. For each environment, the task context variable, i.e., a vector that contains differentiating information about a task, is appended to the input state vector of the policy network. The rest of the policy mechanism is the same as described above for metapolicies. A learnable value network is used to reduce variance in advantage estimation. Once the expert policy is trained to the desired level, just one trajectory per task is sampled to construct the demonstration data.
All the models used in this work are multilayer perceptrons (MLPs). The policy models for all the metalearning algorithms have two layers of 100 neurons each with Rectified Linear Unit (ReLU) nonlinearities. The data generating policy and value models use two layers of 128 neurons each.
Table 1 lists the set of hyperparameters used for EMRLD, EMRLDWS and the baseline algorithms. In addition to the ones listed in Table 1, the meta batch size is dependent on the training environment, with different values for Point2D Navigation, TwoWheeled Locomotion and HalfCheetah ForwardBackward. In Table 1, Meta LR specified as 'TRPO' means that the learning rate is determined by the stepsize rule coming from TRPO. The meta optimization steps in MetaBC and GMPS use the ADAM [kingma2014adam] optimizer with a learning rate of 0.01. We use 20 CPU threads to parallelize policy rollouts for adaptation. The two weighting coefficients are kept fixed across environments for EMRLD and EMRLDWS. One coefficient is kept at 1 for both optimal and suboptimal data, and across environments. The other takes a lower value of 0.2 across environments for optimal data, as in practice optimal data is expected to be highly informative; hence, we desire the gradient component arising from optimal data to carry more weight during adaptation. For suboptimal data, the agent is required to explore to obtain performance beyond the data, and hence this coefficient is kept at 1. We further show in Section D that our algorithm is robust to the choice of both coefficients.
Hyperparameter | EMRLD | EMRLDWS | MAML | MetaBC | GMPS
Adaptation LR | 0.01 | 0.01 | 0.01 | 0.01 | 0.01
Meta LR | TRPO | TRPO | TRPO | 0.01 (ADAM) | 0.01 (ADAM)
Adapt steps | 1 | 1 | 1 | 1 | 1
Adapt batch size | 20 | 20 | 20 | 20 | 20
GAE | 1 | 1 | 1 | 1 | 1
 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95
CPU thread no. | 20 | 20 | 20 | 20 | 20
 | 0.2/1 | 0.2/1 | N/A | N/A | N/A
 | 1 | 1 | N/A | N/A | N/A
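As a toy illustration of why a weighted demonstration term matters in the sparse-reward adaptation step: for a linear-Gaussian policy with fixed standard deviation, all-zero advantages (no reward signal) make the policy-gradient term vanish identically, while the weighted behavior-cloning term still moves the parameters toward the demonstrator. The linear policy, the weighting `lam`, and all numbers below are illustrative, not the paper's networks or objective:

```python
import numpy as np

# Toy adaptation of a linear-Gaussian policy a ~ N(W s, sigma^2 I).
sigma, lam, lr = 1.0, 0.2, 0.1
rng = np.random.default_rng(0)
W = np.zeros((1, 2))                         # initial policy parameters
S = rng.normal(size=(64, 2))                 # visited states
W_demo = np.array([[0.7, -0.3]])             # (possibly inexpert) demonstrator
A_demo = S @ W_demo.T                        # demonstrator actions

for _ in range(300):
    mu = S @ W.T                             # policy mean actions
    adv = np.zeros((64, 1))                  # sparse reward: zero advantages
    # REINFORCE-style term E[adv * grad log pi]: identically zero here.
    grad_rl = (adv * (A_demo - mu)).T @ S / (sigma**2 * len(S))
    # Behavior-cloning term: gradient ascent on -0.5 * ||A_demo - mu||^2.
    grad_bc = (A_demo - mu).T @ S / len(S)
    # Weighted combination; W moves toward W_demo despite zero reward signal.
    W += lr * (grad_rl + lam * grad_bc)
```

With `lam = 0` the update is identically zero in this sparse-reward regime, which mirrors the ablation findings in Section E.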
Appendix D Sensitivity Analysis
We perform sensitivity analysis for the two weighting parameters of our algorithms EMRLD and EMRLDWS with optimal data on Point2D Navigation. The results are included in Fig. 8. All plots are averaged over runs with three random seeds. To assess the sensitivity of our algorithms to each parameter, we fix one parameter and vary the other over a range of values, and vice versa. All other hyperparameters are kept fixed to the values listed in Table 1. We observe that our algorithms are fairly robust to variations in both parameters over the three random seeds. Since demonstration data is leveraged to extract useful information regarding the environment and the reward structure, our algorithms are slightly more sensitive to variation in the demonstrationrelated parameter.
Appendix E Ablation experiments
We perform ablation experiments for EMRLD by setting each of the two weighting parameters to zero in turn on the Point2D Navigation environment, with both the optimal and the suboptimal demonstration data. We observe from Figure 9 that setting the demonstrationrelated parameter to zero hampers performance to a greater extent, as the agent is unable to extract useful information from the environment due to the sparse reward structure. We also observe that setting the other parameter to zero hampers performance, as the agent is unable to exploit the RL structure of the problem to achieve high rewards.
Appendix F Related Work
MetaLearning:
Reinforcement learning (RL) has become popular as a tool to perform learning from interaction in complex problem domains like autonomous navigation of stratospheric balloons [bellemare2020autonomous] and autonomously solving the game of Go [silver2016mastering]. In largescale complex environments, one requires a large amount of data to learn any meaningful RL policy [botvinick2019reinforcement]. This is in stark contrast to how we as humans behave and learn: by translating our prior knowledge from past exposure to the same or similar tasks into behavioral policies for a new task at hand. The initial work [schmidhuber1996simple] addressed this gap and proposed the paradigm of metalearning. The idea has since been extended to obtain gradientbased algorithms in supervised learning, unsupervised learning, control, and reinforcement learning [schweighofer2003meta, hochreiter2001learning, thrun2012learning, wang2016learning, duan2016rl]. More recently, modelagnostic metalearning (MAML) [finn2017model] introduced a gradientbased twostep approach to metalearning: an inner adaptation step to learn taskspecific policies, and an outer metaoptimization loop that implicitly makes use of the inner policies. MAML can be used in both the supervised learning and RL contexts. Reptile [nichol2018first] introduced efficient firstorder metalearning algorithms. PEARL [rakelly2019efficient] takes a different approach to metaRL, wherein taskspecific contexts are learned during training and inferred from trajectories during testing to solve the task. In its native form, the RL variant of MAML can suffer from issues of inefficient gradient estimation, exploration, and dependence on a rich reward function. Among others, algorithms like ProMP [rothfuss2018promp] and DiCE [foerster2018dice] address the issue of inefficient gradient estimation. Similarly, EMAML [al2017continuous, stadie2018some] and MAESN [gupta2018meta] deal with the issue of exploration in metaRL. Inadequate reward information, or sparse rewards, is a particularly challenging problem setting for RL, and hence for metaRL. Very recently, HTR [packer2021hindsight] proposed to relabel the experience replay data of any offpolicy algorithm to overcome exploration difficulties in sparsereward goalreaching environments. Different from this approach, we leverage the popular learningfromdemonstration idea to aid learning of metapolicies on tasks including and beyond goalreaching ones.

RL with demonstration: 'Learning from demonstrations' (LfD) [schaal1996learning] first proposed the use of demonstrations in RL to speed up learning. Since then, leveraging demonstrations has become an attractive approach to aid learning [hester2018deep, vecerik2017leveraging, nair2018overcoming].
Earlier work has incorporated data from both expert and inexpert policies to assist with policy learning in sparsereward environments [nair2018overcoming, hester2017learning, vecerik2017leveraging, kang2018policy, rengarajan2022reinforcement]. In particular, DQfD [hester2017learning] utilizes demonstration data by adding it to the replay buffer for Qlearning. DDPGfD [vecerik2017leveraging] extends the use of demonstration data to continuous action spaces, and is built upon DDPG [lillicrap2015continuous]. DAPG [rajeswaran2017learning] proposes an online finetuning algorithm that combines policy gradients and behavior cloning. POfD [kang2018policy] proposes an approach that incorporates demonstration data through an appropriate loss function in the RL policy optimization step to implicitly reshape the sparse reward function. LOGO [rengarajan2022reinforcement] proposes a twostep guidance approach where demonstration data is used to guide the RL policy in the initial phase of learning.
MetaRL with demonstration: The use of demonstration data in metaRL is new, and the works in this area are rather few. Meta Imitation Learning [finn2017one] extends MAML [finn2017model] to imitation learning from expert video demonstrations. WTL [zhou2019watch] uses demonstrations to generate an exploration algorithm, and uses the exploration data along with the demonstration data to solve the task. ODA [zhao2021offline] uses demonstration data to perform offline metaRL for industrial insertion, and [arulkumaran2022all] proposes generalized 'upside down RL' algorithms that use demonstration data to perform offline metaRL. GMPS [mendonca2019guided] extends MAML [finn2017model] to leverage expert demonstration data by performing metapolicy optimization via supervised learning. Closest to our approach are GMPS [mendonca2019guided] and Meta Imitation Learning [finn2017one], and we focus on comparisons with versions of these algorithms, along with the original MAML [finn2017model].