Meta-Reinforcement Learning (meta-RL) is an approach towards quickly solving similar tasks while only gathering a few samples for each task. The fundamental idea behind approaches such as the popular Model Agnostic Meta-Learning (MAML) finn2017model is to determine a universal meta-policy by combining the information gained from working on several tasks. This meta-policy is optimized in a manner such that, when adapted using a small number of samples gathered for a given task, it is able to perform near optimally on that specific task. While the dependence of MAML on only a small number of samples for task-specific adaptation is very attractive, it also means that these samples must be highly representative of that task for meaningful adaptation and meta-learning. However, in many real-world problems, rewards are sparse in that they provide only limited feedback. For instance, there might be a reward only if a robot gets close to a designated waypoint, with no reward otherwise. Hence, in sparse reward environments, MAML is likely to fail at both task-specific adaptation and optimization of the meta-policy over tasks. Making progress towards learning a viable meta-policy in such settings is challenging without additional information.
Many real-world tasks are associated with empirically determined policies used in practice. Such policies could be inexpert, but even limited demonstration data gathered from applying these policies could contain valuable information in the sparse reward context. While the fact that the policy generating the data could be inexpert suggests that direct imitation might not be optimal, supervised learning over demonstration data could be used for enhancing adaptation and learning. How best should we use demonstration data to enhance the process of meta-RL in the sparse reward setting?
Our goal is a principled design of a class of meta-RL algorithms that can exploit demonstrations from inexpert policies in the sparse reward setting. Our general approach follows two-step algorithms like MAML that employ: (i) Task-specific Adaptation: execute the current meta-policy on a task and adapt it based on the samples gathered to obtain a task-specific policy, and (ii) Meta-policy Optimization: execute task-specific policies on an ensemble of tasks to which they are adapted, and use the samples gathered to optimize the meta-policy from which they were adapted. Our key insight is that we can enhance RL-based policy adaptation with behavior cloning of the inexpert policy to guide task-specific adaptation in the right direction when rewards are sparse. Furthermore, execution of such an enhanced adapted policy should yield an informative sample set that indicates how best to obtain the sparse rewards for meta-policy optimization. We aim at analytically and empirically capturing this progression of policy improvement, starting from task-specific policy adaptation and ending with meta-policy optimization. Thus, as long as the inexpert policy has an advantage, we must be able to exploit it for meta-policy optimization.
Our main contributions are as follows. (i) We derive a policy improvement guarantee result for MAML-like two-step meta-RL algorithms. We show that the inclusion of demonstration data can further increase the policy improvement bound as long as the inexpert policy that generated the data has an advantage. (ii) We propose an approach entitled Enhanced Meta-RL using Demonstrations (EMRLD) that combines RL-based policy improvement and behavior cloning from demonstrations for task-specific adaptation. We further observe that directly applying the meta-policy to a new sparse-reward task sometimes does not yield informative samples, and a warm-start to the meta-policy using the demonstrations significantly improves the quality of the samples, resulting in a variant that we call EMRLD-WS. (iii) We show on standard MuJoCo and two-wheeled robot environments that our algorithms work exceptionally well, even when only provided with just one trajectory of sub-optimal demonstration data per task. Additionally, the algorithms work well even when exposed only to a small number of tasks for meta-policy optimization. (iv) Our approach is amenable to a variety of meta-RL problems wherein tasks can be distinguished across rewards (e.g., whether forward or backward motion yields a reward for a task) or across environment dynamics (e.g., the amount of environmental drift that a wheeled robot experiences changes across tasks). To illustrate the versatility of EMRLD, we not only show simulations on different continuous control multi-task environments, but also demonstrate its excellent performance via real-world experiments on a TurtleBot robot amsters2019turtlebot. We provide videos of the robot experiments and code at https://github.com/DesikRengarajan/EMRLD.
Related Work: Here, we provide a brief overview of the related works. We leave a more thorough discussion on related works to the Appendix.
Meta learning: Basic ideas on the meta-learning framework are discussed in hochreiter2001learning; thrun2012learning; duan2016rl. Model-agnostic meta-learning (MAML) finn2017model introduced the two-step approach described above, and can be used in the supervised learning and RL contexts. However, in its native form, the RL variant of MAML can suffer from issues of inefficient gradient estimation, exploration, and dependence on a rich reward function. Among others, algorithms like ProMP rothfuss2018promp and DiCE foerster2018dice address the issue of inefficient gradient estimation. Similarly, E-MAML al2017continuous; stadie2018some and MAESN gupta2018meta deal with the issue of exploration in meta-RL. PEARL rakelly2019efficient takes a different approach to meta-RL, wherein task specific contexts are learnt during training, and interpreted from trajectories during testing to solve the task. HTR packer2021hindsight relabels the experience replay data of any off-policy algorithm such as PEARL rakelly2019efficient to overcome exploration difficulties in sparse reward goal reaching environments. Different from this approach, we use demonstration data to aid learning and are not restricted to goal reaching tasks.
RL with demonstration: Leveraging demonstrations is an attractive approach to aid learning hester2018deep; vecerik2017leveraging; nair2018overcoming. Earlier work has incorporated data from both expert and inexpert policies to assist with policy learning in sparse reward environments nair2018overcoming; hester2017learning; vecerik2017leveraging; kang2018policy; rengarajan2022reinforcement. In particular, hester2017learning utilizes demonstration data by adding it to the replay buffer for Q-learning, rajeswaran2017learning proposes an online fine-tuning algorithm by combining policy gradient and behavior cloning, while rengarajan2022reinforcement proposes a two-step guidance approach where demonstration data is used to guide the policy in the initial phase of learning.
Meta-RL with demonstration: Meta Imitation Learning finn2017one extends MAML finn2017model to imitation learning from expert video demonstrations. WTL zhou2019watch uses demonstrations to generate an exploration algorithm, and uses the exploration data along with demonstration data to solve the task. GMPS mendonca2019guided extends MAML finn2017model to leverage expert demonstration data by performing meta-policy optimization via supervised learning. Closest to our approach are GMPS mendonca2019guided and Meta Imitation Learning finn2017one, and we will focus on comparisons with versions of these algorithms, along with the original MAML finn2017model.
Our work differs from prior work on meta-RL with demonstration in several ways. First, existing works assume the availability of data generated by an expert policy, which severely limits their ability to improve beyond the quality of the policy that generated the data. This degrades their performance significantly when they are presented with sub-optimal data generated by an inexpert policy that might be used in practice. Second, these works use demonstration data in a purely supervised learning manner, without exploiting the RL structure. We use a combination of loss functions associated with RL and supervised learning to aid the RL policy gradient, which enables our approach to utilize any reward information available. This makes it superior to existing work in the sparse reward environment, which we illustrate in several simulation settings and real-world robot experiments.
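2 Preliminaries
A Markov Decision Process (MDP) is typically represented as a tuple $M = (\mathcal{S}, \mathcal{A}, R, P, \gamma, \mu)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the transition probability function, $\gamma \in (0, 1)$ is the discount factor, and $\mu$ is the initial state distribution. The value function and the state-action value function of a policy $\pi$ are defined as $V_\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s\right]$ and $Q_\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a\right]$. The advantage function is defined as $A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$. A policy $\pi$ generates a trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots)$ where $s_0 \sim \mu$, $a_t \sim \pi(\cdot \mid s_t)$, and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. Since the randomness of $\tau$ is specified by $P$ and $\pi$, we denote it as $\tau \sim (P, \pi)$. The goal of a reinforcement learning algorithm is to learn a policy that maximizes the expected infinite horizon discounted reward defined as $J(\pi) = \mathbb{E}_{\tau \sim (P, \pi)}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]$. It is easy to see that $J(\pi) = \mathbb{E}_{s \sim \mu}\left[V_\pi(s)\right]$.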
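As an illustration of the objective defined above, the following is a minimal Python sketch of how the expected discounted reward could be estimated from sampled trajectories. The function names `discounted_return` and `estimate_objective` are hypothetical, and the finite reward sequences stand in for truncated roll-outs; this is not the authors' implementation.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for one finite reward sequence."""
    total = 0.0
    # Accumulate backwards so each step is a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def estimate_objective(reward_sequences, gamma):
    """Monte Carlo estimate of J(pi): the average discounted return."""
    returns = [discounted_return(rs, gamma) for rs in reward_sequences]
    return sum(returns) / len(returns)


# Sparse-reward example: the only reward arrives at the final step,
# so the return is gamma^3 times that reward.
sparse_rewards = [0.0, 0.0, 0.0, 1.0]
sparse_return = discounted_return(sparse_rewards, gamma=0.9)
```

In a sparse reward task most sampled reward sequences are all zeros, so this estimate (and any gradient built from it) carries very little information early in training.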
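The discounted state-action visitation frequency of policy $\pi$ is defined as $\nu_\pi(s, a) = (1 - \gamma)\, \mathbb{E}_{\tau \sim (P, \pi)}\left[\sum_{t=0}^{\infty} \gamma^t \mathbb{1}\{s_t = s, a_t = a\}\right]$. The discounted state visitation frequency is defined as the marginal $d_\pi(s) = \sum_{a} \nu_\pi(s, a)$. It is straightforward to see that $\nu_\pi(s, a) = d_\pi(s)\, \pi(a \mid s)$.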
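For a small tabular problem, the visitation frequencies can be computed directly. The sketch below assumes the normalized convention (a factor of $1 - \gamma$ so that the frequencies sum to one) and a finite truncation horizon; the names `state_visitation` and `state_action_visitation` are hypothetical.

```python
def state_visitation(mu, P_pi, gamma, horizon=2000):
    """Approximate d_pi(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s).

    mu   : initial state distribution (list of floats)
    P_pi : state transition matrix under the policy, P_pi[s][s_next]
    """
    n = len(mu)
    dist = list(mu)        # distribution of s_t, starting at t = 0
    d = [0.0] * n
    weight = 1.0 - gamma   # holds (1 - gamma) * gamma^t
    for _ in range(horizon):
        for s in range(n):
            d[s] += weight * dist[s]
        # Push the state distribution one step forward.
        dist = [sum(dist[s] * P_pi[s][s2] for s in range(n)) for s2 in range(n)]
        weight *= gamma
    return d


def state_action_visitation(d, policy):
    """nu_pi(s, a) = d_pi(s) * pi(a | s) for a tabular policy[s][a]."""
    return [[d[s] * p for p in policy[s]] for s in range(len(d))]


# Two states: state 0 moves to the absorbing state 1.
d = state_visitation([1.0, 0.0], [[0.0, 1.0], [0.0, 1.0]], gamma=0.9)
nu = state_action_visitation(d, [[0.5, 0.5], [1.0, 0.0]])
```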
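The total variation (TV) distance between two distributions $p$ and $q$ is defined as $D_{TV}(p, q) = \frac{1}{2} \sum_{x} |p(x) - q(x)|$, and the average TV distance between two policies $\pi_1$ and $\pi_2$, averaged w.r.t. the visitation frequency $d_\pi$, is defined as $D^{\pi}_{TV}(\pi_1, \pi_2) = \mathbb{E}_{s \sim d_\pi}\left[D_{TV}(\pi_1(\cdot \mid s), \pi_2(\cdot \mid s))\right]$.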
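A short Python sketch of these two quantities for discrete distributions and tabular policies follows. The helper names are hypothetical, and the state weights are assumed to be a visitation frequency that sums to one.

```python
def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def average_tv(policy_a, policy_b, state_weights):
    """TV distance between two tabular policies, averaged over states.

    policy_a[s] and policy_b[s] are action distributions in state s;
    state_weights[s] is the weight (visitation frequency) of state s.
    """
    return sum(
        w * tv_distance(policy_a[s], policy_b[s])
        for s, w in enumerate(state_weights)
    )


# The policies agree in state 0 and are opposite in state 1.
pol_a = [[0.5, 0.5], [1.0, 0.0]]
pol_b = [[0.5, 0.5], [0.0, 1.0]]
avg = average_tv(pol_a, pol_b, state_weights=[0.75, 0.25])
```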
Gradient-based meta-learning: The goal of a meta-learning algorithm is to learn to perform optimally in a new (testing) task using only limited data, by leveraging the experience (data) from similar (training) tasks seen during training. Gradient-based meta-learning algorithms achieve this goal by learning a meta-parameter which will yield a good task specific parameter after performing only a few gradient steps w.r.t. the task specific loss function using the limited task specific data.
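Meta-learning algorithms consider a set of tasks $\mathcal{T} = \{\mathcal{T}_i\}$ with a distribution $p$ over $\mathcal{T}$. Each task $\mathcal{T}_i$ is also associated with a data set $D_i$, which is typically divided into training data $D_i^{tr}$ used for task specific adaptation and validation data $D_i^{val}$ used for meta-parameter update. The objective of the gradient-based meta-learning is typically formulated as
$$\min_{\theta} \; \mathbb{E}_{\mathcal{T}_i \sim p}\left[ L_i(\phi_i, D_i^{val}) \right], \quad \text{where} \quad \phi_i = \theta - \alpha \nabla_\theta L_i(\theta, D_i^{tr}), \tag{1}$$
where $L_i$ is the loss function corresponding to task $\mathcal{T}_i$ and $\alpha$ is the learning rate. Here, $\phi_i$ is the task specific parameter obtained by one step gradient update starting from the meta-parameter $\theta$, and the goal is to find the best meta-parameter $\theta$ which will minimize the meta loss function $\mathbb{E}_{\mathcal{T}_i \sim p}\left[ L_i(\phi_i, D_i^{val}) \right]$.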
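The two-level structure of this objective can be seen on a toy problem. The sketch below is an assumption-laden illustration, not the paper's setup: each task has a scalar quadratic loss $0.5(\theta - c_i)^2$ with a hypothetical center $c_i$, the training and validation data sets are abstracted away, and the meta-gradient is computed exactly through the one-step adaptation.

```python
def loss_grad(theta, center):
    """Gradient of the toy task loss L_i(theta) = 0.5 * (theta - center)^2."""
    return theta - center


def adapt(theta, center, alpha):
    """One inner gradient step: phi_i = theta - alpha * grad L_i(theta)."""
    return theta - alpha * loss_grad(theta, center)


def meta_gradient(theta, centers, alpha):
    """Exact gradient of mean_i L_i(phi_i) with respect to theta.

    For the quadratic toy loss, d phi_i / d theta = (1 - alpha), so the
    chain rule gives (1 - alpha) * (phi_i - center_i) per task.
    """
    grads = [
        (1.0 - alpha) * loss_grad(adapt(theta, c, alpha), c) for c in centers
    ]
    return sum(grads) / len(grads)


def train_meta_parameter(centers, alpha=0.1, beta=0.5, steps=200, theta=0.0):
    """Gradient descent on the meta loss with meta learning rate beta."""
    for _ in range(steps):
        theta -= beta * meta_gradient(theta, centers, alpha)
    return theta


centers = [-1.0, 2.0, 5.0]
theta_star = train_meta_parameter(centers)
```

For this symmetric toy loss the learned meta-parameter settles at the mean of the task centers, the point from which a single gradient step helps every task equally.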
Gradient-based meta-reinforcement learning: The gradient-based meta-learning framework is applicable both in supervised learning and reinforcement learning. In RL, each task $\mathcal{T}_i$ corresponds to an MDP $M_i$ with task specific model $P_i$ and reward function $R_i$. We assume that the state and action spaces are the same across the tasks, thus ensuring the first level of task similarity. Task specific data $D_i$ consists of trajectories $\tau$ generated in $M_i$ according to some policy $\pi$. Since the randomness of the trajectory is specified by $P_i$ and $\pi$, we denote it as $\tau \sim (P_i, \pi)$. We consider the function approximation setting where each policy is represented by a function parameterized by $\theta$ and is denoted as $\pi_\theta$. The task specific loss in meta-RL is defined as $L_i(\theta) = -\mathbb{E}_{\tau \sim (P_i, \pi_\theta)}\left[\sum_{t=0}^{\infty} \gamma^t R_i(s_t, a_t)\right]$. The gradient $\nabla_\theta L_i(\theta)$ can then be computed using the policy gradient theorem.
The standard meta-RL training is done as follows. A task $\mathcal{T}_i$ (usually a batch of tasks) is sampled at each iteration $k$ of the algorithm. Now, starting with meta-parameter $\theta_k$, the training data $D_i^{tr}$ for task adaptation is generated as the trajectories $\{\tau_j\}$, where $\tau_j \sim (P_i, \pi_{\theta_k})$, and the updated parameter $\phi_{k,i}$ for task $\mathcal{T}_i$ is computed by policy gradient evaluated on $D_i^{tr}$. The validation data $D_i^{val}$ is then collected as trajectories $\{\tau_j\}$, where $\tau_j \sim (P_i, \pi_{\phi_{k,i}})$, and the meta-parameter $\theta_k$ is updated by policy gradient evaluated on $D_i^{val}$. In the next section, we will introduce a modified approach which will leverage the demonstration data for task adaptation and meta-parameter update.
3 Meta-RL using Demonstration Data
Most gradient-based meta-RL algorithms learn the optimal meta-parameter and the task-specific parameter from scratch using on-policy approaches. These algorithms exclusively rely on the reward feedback obtained from the training and validation data trajectories collected through the on-policy roll-outs of the meta-policy and task-specific policy. However, in RL problems with sparse rewards, a non-zero reward is typically achieved only when the task is completed or near-completed. In such sparse reward settings, trajectories generated according to a policy that is still learning may not achieve any useful reward feedback, especially in the early phase of learning. In other words, since the reward feedback is zero or near-zero, the policy gradient will also be zero or near-zero, resulting in no meaningful improvement in the policy. Hence, standard meta-RL algorithms such as MAML, which rely crucially on reward feedback, will not be able to make much progress towards learning a valuable task-specific or meta-policy in sparse reward settings.
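The vanishing-gradient argument above can be checked on a toy example. The sketch below uses a state-free softmax policy and the standard score-function (REINFORCE) estimator; the three-action setup and the function names are illustrative assumptions rather than anything from the experiments.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def score(logits, action):
    """Gradient of log pi(action) with respect to the softmax logits."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - probs[k] for k in range(len(logits))]


def reinforce_gradient(logits, actions, returns):
    """Score-function estimate of grad J: mean of return * grad log pi."""
    grad = [0.0] * len(logits)
    for a, g in zip(actions, returns):
        s = score(logits, a)
        for k in range(len(logits)):
            grad[k] += g * s[k] / len(actions)
    return grad


logits = [0.0, 0.0, 0.0]
actions = [0, 1, 1, 2]
# No sampled trajectory reached the reward: every gradient entry is zero.
grad_no_reward = reinforce_gradient(logits, actions, [0.0, 0.0, 0.0, 0.0])
# One sample reached the reward: the gradient favors its action.
grad_one_reward = reinforce_gradient(logits, actions, [0.0, 0.0, 0.0, 1.0])
```

When every sampled return is zero the estimate is exactly zero, so neither the adaptation step nor the meta-update receives any direction to move in.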
Learning the optimal control policy in sparse reward environments has been recognized as a challenging problem even in the standard RL setting, since most state-of-the-art RL algorithms fail to learn any meaningful policies even after a large number of training episodes rajeswaran2017learning; kang2018policy; rengarajan2022reinforcement. One widely accepted approach to overcome this challenge is known as learning from demonstration, wherein demonstration data obtained from an expert rajeswaran2017learning or inexpert policy kang2018policy; rengarajan2022reinforcement is used to aid online learning. The intuitive idea is that, even though the demonstration data does not contain any reward feedback, it can be used to guide the learning agent to reach non-zero reward regions of state-action spaces. This guidance, usually in the direction of the goal/target, is achieved by inferring some pseudo reward signal through supervised learning approaches using demonstration data.
Can we enhance the performance of meta-RL algorithms in sparse reward environments by using demonstration data from sub-optimal experts? Meta-RL in sparse reward environments is significantly more challenging than standard RL in the same setting. This is because the reward feedback serves the dual objectives of adapting the meta-parameter to specific tasks and updating the meta-parameter itself. We note that demonstration data helps with both of these objectives. Firstly, the use of demonstration data to guide task-specific adaptation becomes important because adaptation is achieved in one or a few gradient steps, and the policy resulting from each adaptation step might not achieve meaningful reward in a sparse reward setting. Secondly, making use of demonstration data for the meta-parameter update is equally important because of the role of the meta-policy as a reward-yielding exploratory policy. Intuitively, the meta-policy should yield trajectories that reach the vicinity of the reward-achieving region of the state-action space. This does not happen in sparse reward environments. However, using the guidance from demonstration data, the task-specific policy obtained after the task adaptation may be able to generate trajectories that reach the reward-achieving region, thereby accelerating the improvement of the meta-policy.
For meta-learning with demonstration, we assume that each task $\mathcal{T}_i$ is associated with demonstration data $D_i^{dem}$ which contains a trajectory generated according to a demonstration policy $\pi_i^{dem}$ in an environment with model $P_i$. We do not assume that $\pi_i^{dem}$ is the optimal policy for task $\mathcal{T}_i$ because in many real-world applications $D_i^{dem}$ could be generated using an inexpert policy. Our key idea is to enhance task adaptation using the demonstration data by introducing an additional gradient term corresponding to the supervised learning guidance loss. We define the supervised learning loss function for task $\mathcal{T}_i$ as $L_{BC,i}(\theta, D_i^{dem}) = -\sum_{(s, a) \in D_i^{dem}} \log \pi_\theta(a \mid s)$. We note that, though this loss function is the same as in behavior cloning (BC), we use it directly in the gradient update instead of performing a simple warm start. This approach is known to achieve better performance than naive BC warm starting in standard RL problems under the sparse reward setting rajeswaran2017learning; kang2018policy; rengarajan2022reinforcement. The task adaptation step at iteration $k$, starting with meta-parameter $\theta_k$, is now obtained as
$$\phi_{k,i} = \theta_k - \alpha \left( w_{rl} \nabla_\theta L_i(\theta_k, D_i^{tr}) + w_{bc} \nabla_\theta L_{BC,i}(\theta_k, D_i^{dem}) \right), \tag{2}$$
where $w_{rl}$ and $w_{bc}$ are hyperparameters that control the extent to which RL and demonstration data influence the gradient.
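The effect of the combined adaptation step can be sketched on a state-free softmax policy. In the sketch below, the weights `w_rl` and `w_bc`, the step size `alpha`, and the use of a mean (rather than a sum) over demonstrated actions are illustrative assumptions; the RL gradient is set to zero to mimic roll-outs that never reached the sparse reward.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def bc_loss_grad(logits, demo_actions):
    """Gradient of the BC loss -mean log pi(a) over demonstrated actions."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a in demo_actions:
        for k in range(len(logits)):
            grad[k] += (probs[k] - (1.0 if k == a else 0.0)) / len(demo_actions)
    return grad


def adapt(theta, rl_grad, bc_grad, alpha, w_rl, w_bc):
    """One adaptation step on the weighted sum of RL and BC loss gradients."""
    return [
        t - alpha * (w_rl * g_rl + w_bc * g_bc)
        for t, g_rl, g_bc in zip(theta, rl_grad, bc_grad)
    ]


theta = [0.0, 0.0, 0.0]
rl_grad = [0.0, 0.0, 0.0]   # sparse rewards: roll-outs gave no learning signal
demo = [2, 2, 2]            # the demonstrator always picked action 2
phi = adapt(theta, rl_grad, bc_loss_grad(theta, demo), alpha=1.0, w_rl=1.0, w_bc=1.0)
```

Even with a zero RL gradient, the adapted policy shifts probability toward the demonstrated action, so its roll-outs are more likely to land near the rewarding region.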
The next question is: how do we use demonstration data in the meta-parameter update? One approach is to use only the demonstration data with a supervised learning loss function for updating the meta-parameter as done in mendonca2019guided. We conjecture that such a reduction to supervised learning will severely limit the learning capability of the algorithm. Firstly, if demonstration data is obtained from an inexpert policy, this approach will never be able to achieve the optimal performance. This is because the role of the meta-policy as a reward-yielding exploratory policy will be limited by the true performance of the inexpert policy. Secondly, the task-specific policies obtained according to (2) may be able to reach the reward-yielding region of the state-action space, as we mentioned before. Hence, the validation data collected through the roll-out of the policies obtained after task adaptation might contain extremely valuable reward feedback. Utilizing this data could potentially have a significant impact on improving the learning of the meta-parameter. Thus, in our approach, we update the meta-parameter using the RL loss with policy gradient as follows.
$$\theta_{k+1} = \theta_k - \beta \, \nabla_\theta \, \mathbb{E}_{\mathcal{T}_i \sim p}\left[ L_i(\phi_{k,i}, D_i^{val}) \right], \tag{3}$$
where $\beta$ is the meta learning rate and $\phi_{k,i}$ is the task-specific parameter obtained from the adaptation step (2).
We note that the demonstration data is indeed used in the meta-parameter update implicitly, as its impact can be observed in (3). We found empirically that the double use of the demonstration data, either by adding an additional gradient through a BC loss function, or by replacing with results in similar or worse performance than the approach described above.
We now formally present our algorithm called Enhanced Meta-RL using Demonstrations (EMRLD).
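The overall two-step loop can be sketched in a few lines of Python on a toy multi-armed bandit in which each task rewards a single arm. This is only a structural sketch under several simplifying assumptions that are not from the paper: gradients are exact expectations rather than estimates from sampled roll-outs, the meta-update uses a first-order approximation with plain gradient descent in place of a trust-region step, and each demonstration is a single action. All function and parameter names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def rl_loss_grad(logits, goal):
    """Exact gradient of L = -J = -pi(goal) when only the goal arm pays 1."""
    p = softmax(logits)
    return [-p[goal] * ((1.0 if k == goal else 0.0) - p[k]) for k in range(len(p))]


def bc_loss_grad(logits, demo_action):
    """Gradient of the BC loss -log pi(demo_action)."""
    p = softmax(logits)
    return [p[k] - (1.0 if k == demo_action else 0.0) for k in range(len(p))]


def adapt(theta, goal, demo_action, alpha, w_rl, w_bc):
    """Task-specific adaptation using both RL and BC gradients."""
    g_rl = rl_loss_grad(theta, goal)
    g_bc = bc_loss_grad(theta, demo_action)
    return [t - alpha * (w_rl * a + w_bc * b) for t, a, b in zip(theta, g_rl, g_bc)]


def emrld_sketch(goals, demos, n_arms, iterations=50,
                 alpha=1.0, beta=0.5, w_rl=1.0, w_bc=1.0):
    """Alternate adaptation and a first-order meta-update on the RL loss."""
    theta = [0.0] * n_arms
    for _ in range(iterations):
        meta_grad = [0.0] * n_arms
        for goal, demo in zip(goals, demos):
            phi = adapt(theta, goal, demo, alpha, w_rl, w_bc)  # adaptation
            g = rl_loss_grad(phi, goal)  # RL loss gradient at adapted policy
            for k in range(n_arms):
                meta_grad[k] += g[k] / len(goals)
        theta = [t - beta * g for t, g in zip(theta, meta_grad)]  # meta-update
    return theta


theta_meta = emrld_sketch(goals=[0, 1], demos=[0, 1], n_arms=4)
meta_probs = softmax(theta_meta)
adapted_probs = softmax(adapt(theta_meta, 0, 0, 1.0, 1.0, 1.0))
```

In this toy run the meta-policy concentrates on the arms that are rewarding in some training task, and a single adaptation step then moves most of the remaining mass to the arm rewarded in the task at hand.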
We now present a theoretical justification of why EMRLD should have a superior performance in the sparse reward setting as compared to other gradient based algorithms that do not use demonstration data. First, we introduce some notation. Let be the meta-policy used at iteration of our algorithm. Also, let be the policy obtained after task-specific adaptation for task . Recall that is the value of the policy for the MDP corresponding to task . Similarly, we can define the state-action value function and advantage function of policy for task as and , respectively. Also, let be the visitation frequency of policy for task . Now, we can define the value of the meta-policy over the ensemble of all tasks as .
If the demonstration data has to be useful, it should provide a reasonable amount of guidance. In particular, we would like the task-specific policy adapted using this data to collect feedback that would ensure good meta-policy updates, particularly in the initial stages of meta-training. Since the capability of the demonstration data to guide adaptation will depend on the demonstration policy according to which it is generated, we make the following assumption about .
Assumption 3.1. During the initial stages of meta-training, , for all and .
Assumption 3.1 implies that during the initial stages of meta-training, the demonstration policy can provide a higher advantage on average than the current policy adapted to that task. This is a reasonable assumption, since any reasonable demonstration policy is likely to perform much better than an untrained policy in the initial phase of learning. We also note that a similar assumption was used in the learning from demonstration literature kang2018policy; rengarajan2022reinforcement.
We now present the performance improvement result for EMRLD.
Theorem 3.2. Let be the meta-policy used at iteration of our algorithm and let be the policy obtained after task adaptation in task . Let Assumption 3.1 hold for . Then,
Theorem 3.2 presents a lower bound for the meta policy improvement as a sum of two groups of terms. Maximizing the first term in group one with a constraint on its second term will ensure a higher lower bound and hence an improvement in the meta-parameter training. We notice that this is indeed achieved by the TRPO step used in the meta-parameter update. Hence, this first group is the same for any MAML-type of algorithm. The advantage of the demonstration data is revealed in the second group of terms. The term adds a positive quantity to the lower bound, and this contribution from this second group of term can be maximized by minimizing . However, this minimization is hard to perform in practice because estimating requires sampling the data according to , and this is not feasible at iteration . Hence, in practice, we replace that term by . This can be easily achieved by including the standard maximum likelihood objective in the adaptation step. Thus, EMRLD both exploits the advantage offered by an RL step, as well as that of behavior cloning for meta-policy optimization.
We can further improve the performance of EMRLD by including a behavior cloning warm starting step before performing the update (2). We simplify this warm start to a one step gradient as $\hat{\theta}_{k,i} = \theta_k - \alpha \nabla_\theta L_{BC,i}(\theta_k, D_i^{dem})$, where $L_{BC,i}$ is the behavior cloning loss on the demonstration data $D_i^{dem}$ of task $\mathcal{T}_i$, and then do the task adaptation as in (2) starting with $\hat{\theta}_{k,i}$. We call this version of our algorithm EMRLD-WS. Such a warm start is likely to provide more meaningful samples than directly rolling out the meta-policy to obtain samples for task-specific adaptation. In the next section, we will see empirically how our design choices for EMRLD and EMRLD-WS enable them to learn policies that provide higher rewards using only a small amount of (even sub-optimal) demonstration data.
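The warm start can be illustrated with a one-dimensional Gaussian policy whose only parameter is its mean. The sketch below is a hedged illustration: the fixed standard deviation, the step size, and the demonstrated actions are made-up values, and the function names are hypothetical.

```python
def bc_grad_gaussian(mean, demo_actions, sigma=1.0):
    """Gradient, with respect to the mean, of the average negative
    log-likelihood of the demonstrated actions under N(mean, sigma^2)."""
    avg_action = sum(demo_actions) / len(demo_actions)
    return (mean - avg_action) / sigma ** 2


def warm_start(mean, demo_actions, step_size, sigma=1.0):
    """One behavior cloning gradient step applied before task adaptation."""
    return mean - step_size * bc_grad_gaussian(mean, demo_actions, sigma)


meta_mean = 0.0                   # policy mean given by the meta-parameter
demo_actions = [0.8, 1.0, 1.2]    # actions taken by the demonstrator
ws_mean = warm_start(meta_mean, demo_actions, step_size=0.5)
# Roll-outs used for adaptation would now be sampled around ws_mean,
# which lies closer to the demonstrated behavior than meta_mean does.
```

Because the samples for adaptation are drawn from the warm-started policy, they are more likely to resemble the demonstration and therefore more likely to encounter the sparse reward.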
4 Experimental Evaluation
We evaluate the performance of EMRLD based on whether the meta-policy it generates is a good initial condition for task-specific adaptation in sparse-reward environments over (i) Tasks already seen in training and (ii) New unseen tasks. We seek to validate the conjecture that in the sparse-reward setting, EMRLD should be able to leverage even demonstrations of inexpert policies to attain high test performance over previously unseen tasks. We do so with regard to two classes of tasks, namely,
Tasks that differ in their reward functions: Simulation experiments on Point2D Navigation finn2017model, TwoWheeled Locomotion gupta2018meta, and HalfCheetah wawrzynski2009cat; todorov2012mujoco.
Tasks that differ in the environment dynamics: Real-world experiments using a TurtleBot, which is a two-wheeled differential drive robot amsters2019turtlebot.
4.1 Experiments on simulated environments
Sparse multi-task environments
We present simulation results for three standard environments shown in Figure 1 and described below. We train over a small number of tasks that differ in their reward functions. We generate unseen tasks for testing by randomly modifying the reward function.
Point2D Navigation is a two-dimensional goal-reaching environment. The states are the location of the agent on a 2D plane. The actions are appropriate 2D displacements . Training tasks are defined by a fixed set of goal locations on a semi-circle of radius . The agent is given a zero reward everywhere except when it is within a certain distance of the goal location, making the reward function highly sparse. Within a single task, the objective of the agent is to reach the goal location in the least number of time steps starting from the origin. Test tasks are generated by sampling any point on the semi-circle as the goal.
TwoWheeled Locomotion is a goal-reaching environment with sparse rewards, similar to Point2D Navigation. However, the robot is constrained by the permissible actions (limits on angular and linear velocity) and by the trajectories that are feasible given its turning radius. Here, our training tasks are a fixed set of goal locations on a semi-circle of radius while test goals are sampled randomly. Further details on state-space and dynamics are provided in the Appendix.
HalfCheetah Forward-Backward consists of two tasks in which the HalfCheetah agent learns to move either in the forward (task 1) or backward (task 2) direction with as high a velocity as possible. The agent gets a reward only after it has moved a certain number of units along the x-axis in the correct direction, making the rewards sparse. Training and testing use the same two tasks.
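The goal-reaching environments above share the same kind of sparse reward, which can be sketched as follows. The radius, threshold, reward magnitude, number of goals, and the choice of the upper semi-circle are all placeholder assumptions, since the actual values are not given here, and the function names are hypothetical.

```python
import math


def semicircle_goals(n_goals, radius):
    """Evenly spaced goal locations on the upper semi-circle (n_goals >= 2)."""
    return [
        (radius * math.cos(math.pi * k / (n_goals - 1)),
         radius * math.sin(math.pi * k / (n_goals - 1)))
        for k in range(n_goals)
    ]


def sparse_reward(position, goal, threshold, bonus=1.0):
    """Zero everywhere except within `threshold` of the goal."""
    dist = math.hypot(position[0] - goal[0], position[1] - goal[1])
    return bonus if dist <= threshold else 0.0


goals = semicircle_goals(n_goals=5, radius=2.0)   # placeholder values
reward_at_origin = sparse_reward((0.0, 0.0), goals[0], threshold=0.2)
reward_near_goal = sparse_reward((1.9, 0.0), goals[0], threshold=0.2)
```

An agent starting at the origin receives no feedback at all until one of its trajectories happens to enter the small region around the goal.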
Optimal data and sub-optimal data
We provide a limited amount of demonstration data in the form of just one trajectory per task for guidance. Optimal data consists of transitions generated by an expert policy trained using TRPO. Sub-optimal data is generated by an inexpert, partially trained TRPO policy with induced action noise and truncated trajectories as shown in Figure 2.
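One simple way to emulate such sub-optimal demonstrations is sketched below. It is a simplification and an assumption on our part: noise is added to the recorded actions after the fact (rather than during the roll-out, where it would also change the visited states), and the trajectory is cut with plain slicing. The function name, noise level, and slice bounds are hypothetical.

```python
import random


def make_suboptimal(trajectory, noise_std, start=0, stop=None, seed=0):
    """Return a truncated copy of a trajectory with Gaussian action noise.

    trajectory : list of (state, action) pairs, where action is a list of floats
    start/stop : slice bounds; stop < len(trajectory) cuts the end of the
                 trajectory, start > 0 removes its first steps
    """
    rng = random.Random(seed)
    clipped = trajectory[start:stop]
    return [
        (state, [a + rng.gauss(0.0, noise_std) for a in action])
        for state, action in clipped
    ]


# A straight-line expert trajectory of ten steps (placeholder data).
expert = [((float(t), 0.0), [1.0, 0.0]) for t in range(10)]
cut_short = make_suboptimal(expert, noise_std=0.1, stop=6)    # ends early
no_start = make_suboptimal(expert, noise_std=0.1, start=3)    # misses the start
```

Cutting the end corresponds to demonstrations that stop short of the reward region, while dropping the first steps removes the initial part of the demonstrated behavior.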
Baselines: We compare the performance of our algorithm against the following gradient-based meta-reinforcement learning algorithms: (i) MAML finn2017model: the standard MAML algorithm for meta-RL. (ii) Meta-BC: a variant of finn2017one; this is a supervised learning/behavior cloning version of MAML, where the maximum likelihood loss is used in the adaptation as well as the meta-optimization steps. (iii) GMPS: Guided meta policy search mendonca2019guided, which uses RL for gradient-based adaptation, and uses demonstration data for supervised meta-parameter optimization. The implementation of our algorithms and baselines is based on a publicly available meta-learning code base Arnold2020-ss licensed under the MIT License.
For the training plots, a solid line corresponds to the mean performance over random seeds and the shaded region corresponds to the standard deviation over them. For the test plots, a solid line corresponds to the mean performance over all testing tasks, and the shaded region corresponds to the standard deviation over them.
Performance with optimal demonstration data:
We illustrate the training and testing performance of the different algorithms trained and tested with optimal data in Figure 3. The top row of Figure 3 shows the average adapted return across training tasks of the meta-policy during training iterations. The bottom row of Figure 3 shows the average return of the trained meta-policy adapted across testing tasks over adaptation steps. We see that our algorithms outperform the others by obtaining the highest average return, and are able to quickly adapt to testing tasks with just one adaptation step and one trajectory of demonstration data. Additionally, our algorithms show a nearly monotone improvement in average return, indicating stable learning. Meta-BC fails and has unstable training performance as the amount of demonstration data available per task is very small. Training over only a small number of tasks further hampers the performance of Meta-BC. MAML and GMPS fail to learn due to the sparsity of the environment, as the purely RL adaptation step incurs almost zero reward and hence a negligible learning signal. Furthermore, GMPS is hampered in the meta update step by the availability of only a small amount of demonstration data per task.
Performance with sub-optimal demonstration data:
EMRLD uses a combination of RL and imitation, which is valuable when presented with sub-optimal demonstrations. For the Point2D Navigation environment, we collect sub-optimal data for each task using a partially trained agent with induced action noise, and truncate the trajectories short of the reward region. Hence, pure imitation cannot reach the goal. For the TwoWheeled Locomotion environment, we collect data in a similar fashion for all tasks, but remove state-action pairs at the beginning of each trajectory. Since the first few state-action pairs contain information on how to orient the two-wheeled agent towards the goal, this truncation eliminates the possibility of direct imitation being successful. Similarly, in HalfCheetah we use a partially trained policy and truncate trajectories before they reach the reward bearing region. Figure 4 illustrates that EMRLD outperforms all the baselines and is quickly able to adapt to unseen tasks, emphasizing the benefit of its RL component. Meta-BC and GMPS fail because they are restricted by the optimality of the data, and the absence of crucial information greatly impacts their performance. MAML again fails due to the sparsity of the reward.
We conclude by presenting, in Figure 5, sets of trajectories generated during testing in the TwoWheeled Locomotion environment when provided with optimal or sub-optimal demonstration data. The variants of EMRLD clearly outperform the others, showing their strength in the sparse reward setting.
4.2 Real-world Experiments on TurtleBot
We demonstrate the ability of EMRLD variants to adapt to sparse-reward tasks that differ in environment dynamics. We do so via performance evaluation in the real world using a TurtleBot shown in Figure 6 (left). We first modify the TwoWheeled Locomotion sparse reward environment by fixing the goal, and changing the dynamics by inducing a residual angular velocity which mimics drifts in the environment. This environmental drift is what differentiates each task. In other words, for a given task, the environment would cause the robot to drift in some specific unknown direction. We train on a set of tasks with different angular velocity values (i.e., different driving environments). We use one trajectory of demonstration data per task collected using an expert policy trained using TRPO. Note that all the training and data collection is done in simulation. The results are shown in Figure 6 (middle), where we see that the EMRLD variants clearly outperform the others.
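The effect of a residual angular velocity can be sketched with simple unicycle kinematics for a differential drive robot. The Euler integration, time step, velocities, and drift value below are illustrative assumptions and not the simulator used for the experiments; the function names are hypothetical.

```python
import math


def step(x, y, heading, v, omega, drift=0.0, dt=0.1):
    """One Euler step of unicycle kinematics with a residual angular velocity.

    v and omega are the commanded linear and angular velocities; drift is
    the task-specific angular velocity added by the environment.
    """
    x_new = x + v * math.cos(heading) * dt
    y_new = y + v * math.sin(heading) * dt
    heading_new = heading + (omega + drift) * dt
    return x_new, y_new, heading_new


def rollout(v, omega, drift, steps=50, dt=0.1):
    """Apply a constant command from the origin and return the final pose."""
    x, y, heading = 0.0, 0.0, 0.0
    for _ in range(steps):
        x, y, heading = step(x, y, heading, v, omega, drift, dt)
    return x, y, heading


straight = rollout(v=1.0, omega=0.0, drift=0.0)   # no drift: stays on the x-axis
drifted = rollout(v=1.0, omega=0.0, drift=0.5)    # same command, curved path
```

The same command sequence ends in a different place under each drift value, which is why a policy must adapt to the dynamics of the task it is deployed in.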
For testing, we consider the environment where the TurtleBot experiences a fixed but unknown residual angular velocity representing environmental drift. Thus, we bias the angular velocity control of the TurtleBot by some amount unknown to the algorithm under test. We first execute the meta-policy on the TurtleBot in the real world to collect trajectories. We also provide one trajectory of simulated demonstration data. We use these samples to adapt the meta-policy, and execute the adapted task-specific policy on the TurtleBot. The results are shown in Figure 6 (right), where the origin is at and the goal is indicated by a star. It is clear that the variants of EMRLD are the best at quickly adapting to the drift in the environment and are successful with just one step of adaptation.
5 Conclusion
We studied the problem of meta-RL algorithm design for sparse-reward problems, in which demonstration data generated by a possibly inexpert policy is also provided. Our key observation was that simple application of an untrained meta-policy in a sparse-reward environment might not provide meaningful samples, and guidance provided by imitating the inexpert policy can greatly mitigate this effect. We first showed analytically that this insight is accurate and that meta-policy improvement is feasible as long as the inexpert demonstration policy has an advantage. We then developed two meta-RL algorithms, EMRLD and EMRLD-WS, that are enhanced by using demonstration data. We showed through extensive simulations, as well as real-world robot experiments, that EMRLD offers a considerable advantage over existing approaches in sparse reward scenarios.
6 Limitations and Future Work
EMRLD inherits the limitations of gradient-based meta-RL approaches like MAML, namely on-policy training, and data collection and gradient computation during test adaptation. A limitation specific to our proposed algorithms is the assumption on availability of task specific demonstration data. However, we reiterate that for a small number of training tasks, this assumption is quite practical. Further, our framework allows for this data to be sub-optimal.
A possible future direction is to explore context-based meta-RL (which doesn't require gradient computation during testing) with demonstration data. Another direction for future work is to explore the usage of demonstration data in off-policy meta-RL algorithms.
This work was supported in part by the National Science Foundation (NSF) grants NSF-CAREER-EPCN-2045783 and NSF ECCS 2038963, and U.S. Army Research Office (ARO) grant W911NF-19-1-0367. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsoring agencies.
Ethics Statement and Societal Impacts
Our work considers the theory and instantiation of meta-RL algorithms that were trained and tested on simulation and robot environments. No human subjects or human generated data were involved. Thus, we do not perceive ethical concerns with our research approach.
While reinforcement learning shows much promise for application to societally valuable systems, applying it to environments that include human interaction must proceed with caution. This is because guarantees are probabilistic, and ensuring that the risk is kept within acceptable limits is a must to ensure safe deployments.
For all authors…
Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Section 1
Did you describe the limitations of your work? Section 4
Did you discuss any potential negative societal impacts of your work?
Have you read the ethics review guidelines and ensured that your paper conforms to them?
If you are including theoretical results…
Did you state the full set of assumptions of all theoretical results? Section 3
Did you include complete proofs of all theoretical results? Appendix
If you ran experiments…
Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? Supplementary Material (https://github.com/DesikRengarajan/EMRLD)
Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? Appendix
Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? Section 4
Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? Appendix
If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…
If your work uses existing assets, did you cite the creators? Section 4
Did you mention the license of the assets?
Did you include any new assets either in the supplemental material or as a URL? Appendix
Did you discuss whether and how consent was obtained from people whose data you’re using/curating?
Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?
If you used crowdsourcing or conducted research with human subjects…
Did you include the full text of instructions given to participants and screenshots, if applicable?
Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?
Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?
Appendix A Proof of Theorem 3.2
We will use the well-known Performance Difference Lemma [kakade2002approximately] in our analysis.
Lemma A.1 (Performance difference lemma, [kakade2002approximately]).
For any two policies and ,
where , for .
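For reference, a commonly used statement of this lemma in the discounted infinite-horizon setting is sketched below. The notation is an assumption made for illustration and may differ from the notation used elsewhere in this paper: $V^{\pi}(\rho)$ is the value of policy $\pi$ under the initial-state distribution $\rho$, $A^{\pi'}$ is the advantage function of $\pi'$, $\gamma \in [0,1)$ is the discount factor, and $d^{\pi}_{\rho}$ is the normalized discounted state-visitation distribution of $\pi$.

```latex
\begin{align}
V^{\pi}(\rho) - V^{\pi'}(\rho)
  &= \frac{1}{1-\gamma}\,
     \mathbb{E}_{s \sim d^{\pi}_{\rho}}\,
     \mathbb{E}_{a \sim \pi(\cdot \mid s)}
     \left[ A^{\pi'}(s,a) \right], \\
d^{\pi}_{\rho}(s)
  &= (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t}\,
     \Pr\left(s_t = s \mid s_0 \sim \rho,\ \pi\right).
\end{align}
```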
Proof of Theorem 3.2.
Recall the following notations: is the meta-policy used at iteration of our algorithm, is the policy obtained after task-specific adaptation for task , is the state-visitation frequency of policy for task , and is the value of the policy for the MDP corresponding to task . The value of the meta-policy is defined as .
We can obtain a performance difference lemma for the meta-policies as follows.
Appendix B Environments
In this section, we describe all the simulation and real-world environments in detail.
B.1 Simulation Environments
Point 2D Navigation: Point 2D Navigation [finn2017model] is a two-dimensional goal-reaching environment with , , and the following dynamics,
where and are the and location of the agent, and are the actions taken, which correspond to the displacement in the and direction respectively, all taken at time step . The goals are located on a semi-circle of radius , and the episode terminates when the agent reaches the goal or spends more than time steps in the environment. The sparse reward function for the agent is defined as follows,
where and are the location of the goal. The agent is given zero reward everywhere except when it is within a certain distance of the goal location. Within the distance , the agent is given two kinds of rewards. If the agent is very close to the goal, say within a distance , then it is rewarded with a positive bonus of Number_of_times_steps_remaining_in_episode. This creates a sink near the goal location that traps the agent inside it, rather than letting it wander in the region and keep collecting misleading positive reward. For distances between 0.02 and 0.2, the agent is given a positive reward of
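The three reward regimes described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the environment's actual implementation: the thresholds 0.02 and 0.2 are the distances quoted in the text, while the function names, the `bonus_scale` parameter, and the `1 - distance` form of the reward in the outer band are hypothetical choices made only for this example.

```python
import math

# Assumed thresholds, taken from the distances quoted in the text.
INNER_RADIUS = 0.02  # "very close to the goal"
OUTER_RADIUS = 0.2   # boundary of the region with non-zero reward


def step_dynamics(pos, action):
    """Actions are displacements: next position = position + action."""
    return (pos[0] + action[0], pos[1] + action[1])


def sparse_reward(pos, goal, steps_remaining, bonus_scale=1.0):
    """Return (reward, done) for a hypothetical three-regime sparse reward.

    bonus_scale and the (1 - distance) term are illustrative assumptions,
    not values taken from the paper.
    """
    dist = math.hypot(pos[0] - goal[0], pos[1] - goal[1])
    if dist > OUTER_RADIUS:
        # No feedback at all far from the goal.
        return 0.0, False
    if dist <= INNER_RADIUS:
        # Terminal bonus that scales with the time left in the episode,
        # which acts as a sink at the goal.
        return bonus_scale * steps_remaining, True
    # Assumed positive reward inside the outer band.
    return 1.0 - dist, False
```

With these assumed values the per-step reward in the outer band is below 1, so a bonus of one unit per remaining step makes terminating at the goal at least as rewarding as lingering in the band for the rest of the episode.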
TwoWheeled Locomotion: The TwoWheeled Locomotion environment [gupta2018meta] is based on the two-wheeled differential drive model with , , and the following dynamics,
with , where correspond to the and coordinate of the agent, and are the actions corresponding to the linear and angular velocity of the agent, all at time , and is the time discretization factor. Goals are located on a semi-circle of radius , and the episode terminates if the agent reaches the goal, spends more than time steps in the environment, or moves out of the region, which is a square box of side . The sparse reward function for the agent is defined as follows,
where and are the location of the goal.
Half Cheetah Forward-Backward: The Half Cheetah Forward-Backward environment [finn2017model] is a modified version of the standard MuJoCo [todorov2012mujoco] HalfCheetah environment with and , where the agent is tasked with moving forward or backward, with the episode terminating if the agent spends more than time steps in the environment. The sparse reward function is as follows,
where corresponds to the position of the agent, is the control cost, all at time step , is the time discretization factor, and is the goal direction, which is for the forward task and for the backward task.
TwoWheeled Locomotion - Changing Dynamics: We modify the TwoWheeled locomotion environment by fixing the goal to , and adding a residual angular velocity,
with , where is the residual angular velocity; each value corresponds to a different task and mimics drift in the environment. The sparse reward function is similar to the one described in section B.1.
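To make the effect of the residual angular velocity concrete, the sketch below applies one Euler step of a standard differential-drive (unicycle) model with a drift term added to the commanded angular velocity. The discretization, the function name, and the argument names are assumptions for illustration and may differ from the environment's exact equations.

```python
import math


def two_wheeled_step(state, action, dt, residual_omega=0.0):
    """One Euler step of a differential-drive (unicycle) model.

    state  = (x, y, theta): planar position and heading
    action = (v, omega):    commanded linear and angular velocity
    residual_omega:         drift added to the commanded angular velocity
                            (0.0 recovers the nominal dynamics)
    """
    x, y, theta = state
    v, omega = action
    x_next = x + v * math.cos(theta) * dt
    y_next = y + v * math.sin(theta) * dt
    theta_next = theta + (omega + residual_omega) * dt
    return (x_next, y_next, theta_next)
```

Under this sketch, a policy that commands zero angular velocity still turns whenever `residual_omega` is non-zero, which is the kind of task-dependent drift the adapted policy has to compensate for.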
B.2 Real-World TurtleBot Platform and Experiments
We deploy the policy trained on the environment described in section B.1 on a TurtleBot 3 [amsters2019turtlebot], a real-world open-source differential drive robot. We use ROS as middleware to set up communication between the bot and a custom-built OpenAI Gym environment. The OpenAI Gym environment acts as an interface between the policy being deployed and the bot. The custom-built environment subscribes to ROS topics (/odom for , ), which are used to communicate the state of the bot, and publishes actions (/cmd_vel for , ). This is done asynchronously through a callback-driven mechanism. The bot transmits its state information over a wireless network to an Intel NUC, which transmits back the corresponding action according to the policy being deployed.
The trajectories executed by the adapted policies are plotted in figure 7 (note that figure 7 is the same as figure 6, re-plotted here for clarity). During policy execution on the TurtleBot, we set the residual angular velocity that mimics drift to . We note that our algorithms (EMRLD and EMRLD-WS) are able to adapt to the drift in the environment and reach the goal. We further note that MAML takes a longer, sub-optimal route to reach the reward region, but misses the goal.
We have provided a link to a real-world demonstration with our code (https://github.com/DesikRengarajan/EMRLD). For EMRLD, we show the execution of the meta policy used to collect data, and of the adapted policy. It can be clearly seen that the meta policy collects rewards in the vicinity of the goal region, which are then used for adaptation. The adapted policy then reaches the goal. We further show the execution of the adapted policies for the baseline algorithms on the TurtleBot, and we observe that EMRLD and EMRLD-WS outperform all the baseline algorithms and reach the goal.
Appendix C Experimental Setup
Computing infrastructure and run time: The experiments are run on computers with an AMD Ryzen Threadripper 3960X 24-Core Processor with a max CPU speed of 3800MHz. Our implementation does not make use of GPUs. Instead, the implementation is CPU thread intensive. On average, EMRLD and EMRLD-WS take 3h to run on smaller environments, and 5h on HalfCheetah. We train goal-conditioned expert policies using TRPO. Expert policy training takes 0.5h to run. Our code is based on learn2learn (https://github.com/learnables/learn2learn) [Arnold2020-ss], a software library built using PyTorch [paszke2019pytorch] for meta-RL research.
Neural Network and Hyperparameters: In our work, the meta policy and the adapted policies are neural networks: the input is the state vector and the output is a Gaussian mean vector. The standard deviation is kept fixed, and is not learnable. During training, an action is sampled from the resulting Gaussian distribution.
For the value baseline (used for advantage computation) of meta-learning algorithms, we use a linear baseline function of the form , where and is the discounted sum of rewards starting from state till the end of an episode. This was first proposed in [duan2016benchmarking] and is used in MAML [finn2017model]. This is preferred as a learnable baseline can add additional gradient computation and backpropagation overheads in meta-learning.
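The regression targets for such a linear baseline are the discounted returns-to-go of each episode, which can be computed in a single backward pass. The sketch below also shows one possible feature map built from the state and polynomial functions of the time step. The function names, the `horizon_scale` parameter, and the exact feature layout are assumptions for illustration, not a description of the implementation used in this work.

```python
def discounted_returns(rewards, gamma):
    """Return-to-go G_t = sum_{k >= t} gamma^(k - t) * r_k for one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last step backwards so each G_t reuses G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def time_features(state, t, horizon_scale=100.0):
    """Hypothetical feature map: state, squared state, time polynomials, bias."""
    tau = t / horizon_scale
    return list(state) + [s * s for s in state] + [tau, tau ** 2, tau ** 3, 1.0]
```

A linear baseline would then be fit by least squares from the features of each visited state to the corresponding return-to-go, which requires no backpropagation.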
We use TRPO on goal-conditioned policies to obtain optimal and sub-optimal experts for all the tasks in an environment at once. For each environment, the task context variable, i.e., a vector that contains differentiating information on a task, is appended to the input state vector of the policy network. The rest of the policy mechanism is the same as described above for meta-policies. A learnable value network is used to cut variance in advantage estimation. Once the expert policy is trained to the desired level, just one trajectory per task is sampled to construct demonstration data.
All the models used in this work are multi-layer perceptrons (MLPs). The policy models for all the meta-learning algorithms have two layers of 100 neurons each with Rectified Linear Unit (ReLU) non-linearities. The data generating policy and value models use two layers of 128 neurons each.
Table 1 lists the set of hyperparameters used for EMRLD, EMRLD-WS and the baseline algorithms. In addition to the ones listed in Table 1, the meta batch size depends on the training environment: it is for Point2D Navigation, for TwoWheeled Locomotion and for HalfCheetah Forward-Backward. In Table 1, Meta LR specified as ‘TRPO’ means that the learning rate is determined by the step-size rule from TRPO. The meta-optimization steps in Meta-BC and GMPS use the ADAM [kingma2014adam] optimizer with a learning rate of . We use CPU cores to parallelize policy rollouts for adaptation. The hyperparameters and are kept fixed across environments for EMRLD and EMRLD-WS. The parameter is kept at for both optimal and sub-optimal data, and across environments. The parameter takes a lower value of across environments for optimal data, as in practice optimal data is expected to be highly informative. Hence, we want the gradient component arising from optimal data to carry more weight during adaptation. For sub-optimal data, the agent is required to explore to obtain performance beyond that of the data, and hence, is kept at . We further show in section D that our algorithm is robust to the choice of and .
| Adapt batch size | 20 | 20 | 20 | 20 | 20 |
| CPU thread No. | 20 | 20 | 20 | 20 | 20 |
Appendix D Sensitivity Analysis
We perform sensitivity analysis for parameters and on our algorithms EMRLD and EMRLD-WS for optimal data on Point2D Navigation. The results are shown in Fig. 8. All the plots are averaged over three random seed runs. To assess the sensitivity of our algorithms to , we fix and vary to take values from . Similarly, to assess how sensitive our algorithms’ performance is to , we fix and vary to take values from . All the other hyperparameters are kept fixed to the values listed in Table 1. We observe that our algorithms are fairly robust to variations in and for three random seeds. Since demonstration data is leveraged to extract useful information regarding the environment and the reward structure, our algorithms are slightly more sensitive to variation than variation.
Appendix E Ablation experiments
We perform ablation experiments for EMRLD by setting and on the Point2D Navigation environment with the optimal and the sub-optimal demonstration data. We observe from figure 9 that setting hampers the performance to a greater extent, as the agent is unable to extract useful information from the environment due to the sparse reward structure. We also observe that setting hampers the performance, as the agent is unable to exploit the RL structure of the problem to achieve high rewards.
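One way to read these ablations is as zeroing one of two weights in a combined adaptation step. The sketch below assumes that task-specific adaptation takes a single gradient-ascent step on a weighted sum of an RL gradient and a behavior-cloning gradient. The names `w_rl`, `w_bc`, `step_size`, and `adapt` are hypothetical and chosen only for this illustration, and the gradients are passed in as plain lists rather than computed from data.

```python
def adapt(params, grad_rl, grad_bc, w_rl, w_bc, step_size):
    """One gradient-ascent adaptation step on a weighted objective.

    grad_rl: gradient of the RL objective estimated from environment samples
    grad_bc: gradient of the behavior-cloning log-likelihood on demonstrations
    w_rl, w_bc: assumed scalar weights; setting one to zero gives an ablation
    """
    return [
        p + step_size * (w_rl * g_r + w_bc * g_b)
        for p, g_r, g_b in zip(params, grad_rl, grad_bc)
    ]
```

In this reading, `w_bc = 0` reduces adaptation to a pure RL update, which receives little signal when rewards are sparse, and `w_rl = 0` reduces it to pure imitation of the demonstration policy.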
Appendix F Related Work
Reinforcement learning (RL) has become popular as a tool to perform learning from interaction in complex problem domains like autonomous navigation of stratospheric balloons [bellemare2020autonomous] and autonomously solving a game of Go [silver2016mastering]. In large scale complex environments, one requires a large amount of data to learn any meaningful RL policy [botvinick2019reinforcement]. This is in stark contrast to how we as humans behave and learn - by translating our prior knowledge of past exposure to same/similar tasks into behavioural policies for a new task at hand. The initial work [schmidhuber1996simple] took to addressing the above mentioned gap and proposed the paradigm of meta-learning. The idea has been extended to obtain gradient based algorithms in supervised learning, unsupervised learning, control, and reinforcement learning [schweighofer2003meta, hochreiter2001learning, thrun2012learning, wang2016learning, duan2016rl]. More recently, model-agnostic meta-learning (MAML) [finn2017model] introduced a gradient based two-step approach to meta-learning: an inner adaptation step to learn specific task policies, and an outer meta-optimization loop that implicitly makes use of the inner policies. MAML can be used both in the supervised learning and RL contexts. Reptile [nichol2018first] introduced efficient first order meta-learning algorithms. PEARL [rakelly2019efficient] takes a different approach to meta-RL, wherein task specific contexts are learned during training, and interpreted from trajectories during testing to solve the task. In its native form, the RL variant of MAML can suffer from issues of inefficient gradient estimation, exploration, and dependence on a rich reward function. Among others, algorithms like ProMP [rothfuss2018promp] and DiCE [foerster2018dice] address the issue of inefficient gradient estimation. Similarly, E-MAML [al2017continuous, stadie2018some] and MAESN [gupta2018meta] deal with the issue of exploration in meta-RL. Inadequate reward information or sparse rewards is a particularly challenging problem setting for RL, and hence, for meta-RL. Very recently, HTR [packer2021hindsight] proposed to relabel the experience replay data of any off-policy algorithm to overcome exploration difficulties in sparse reward goal reaching environments. Different from this approach, we leverage the popular learning from demonstration idea to aid learning of meta-policies on tasks including and beyond goal reaching ones.
RL with demonstration: ‘Learning from demonstrations’ (LfD) [schaal1996learning] first proposed the use of demonstrations in RL to speed up learning. Since then, leveraging demonstrations has become an attractive approach to aid learning [hester2018deep, vecerik2017leveraging, nair2018overcoming]. Earlier work has incorporated data from both expert and inexpert policies to assist with policy learning in sparse reward environments [nair2018overcoming, hester2017learning, vecerik2017leveraging, kang2018policy, rengarajan2022reinforcement]. In particular, DQfD [hester2017learning] utilizes demonstration data by adding it to the replay buffer for Q-learning. DDPGfD [vecerik2017leveraging] extends the use of demonstration data to continuous action spaces, and is built upon DDPG [lillicrap2015continuous]. DAPG [rajeswaran2017learning] proposes an online fine-tuning algorithm by combining policy gradient and behavior cloning. POfD [kang2018policy] proposes incorporating demonstration data into the RL policy optimization step through an appropriate loss function, which implicitly reshapes the sparse reward function. LOGO [rengarajan2022reinforcement] proposes a two-step guidance approach where demonstration data is used to guide the RL policy in the initial phase of learning.
Meta-RL with demonstration: Use of demonstration data in meta-RL is new, and the works in this area are rather few. Meta Imitation Learning [finn2017one] extends MAML [finn2017model] to imitation learning from expert video demonstrations. WTL [zhou2019watch] uses demonstrations to generate an exploration algorithm, and uses the exploration data along with demonstration data to solve the task. ODA [zhao2021offline] uses demonstration data to perform offline meta-RL for industrial insertion, and [arulkumaran2022all] proposes generalized ‘upside down RL’ algorithms that use demonstration data to perform offline meta-RL. GMPS [mendonca2019guided] extends MAML [finn2017model] to leverage expert demonstration data by performing meta-policy optimization via supervised learning. Closest to our approach are GMPS [mendonca2019guided] and Meta Imitation Learning [finn2017one], and we will focus on comparisons with versions of these algorithms, along with the original MAML [finn2017model].