Goal-conditioned Imitation Learning

06/13/2019 ∙ by Yiming Ding et al. ∙ UC Berkeley ∙ Intel

Designing rewards for Reinforcement Learning (RL) is challenging because the reward needs to convey the desired task, be efficient to optimize, and be easy to compute. The latter is particularly problematic when applying RL to robotics, where detecting whether the desired configuration has been reached might require considerable supervision and instrumentation. Furthermore, we are often interested in being able to reach a wide range of configurations, so setting up a different reward every time might be impractical. Methods like Hindsight Experience Replay (HER) have recently shown promise in learning policies able to reach many goals, without the need of a reward. Unfortunately, without tricks like resetting to points along the trajectory, HER might take a very long time to discover how to reach certain areas of the state-space. In this work we investigate different approaches to incorporating demonstrations to drastically speed up the convergence to a policy able to reach any goal, also surpassing the performance of an agent trained with other Imitation Learning algorithms. Furthermore, our method can be used when only trajectories without expert actions are available, which allows it to leverage kinesthetic or third-person demonstrations. The code is available at https://sites.google.com/view/goalconditioned-il/ .


1 Introduction

Reinforcement Learning (RL) has shown impressive results in a plethora of simulated tasks, ranging from attaining super-human performance in video-games Mnih et al. [2015], Vinyals et al. [2019] and board-games Silver et al. [2017], to learning complex locomotion behaviors Heess et al. [2017], Florensa et al. [2017a]. Nevertheless, these successes are only faintly echoed in real-world robotics Riedmiller et al. [2018], Zhu et al. [2018a]. This is due to the difficulty of setting up the same learning environment that is enjoyed in simulation. One of the critical assumptions that is hard to satisfy in the real world is access to a reward function. Self-supervised methods have the power to overcome this limitation.

A very versatile and reusable form of self-supervision for robotics is to learn how to reach any previously observed state upon demand. This problem can be formulated as training a goal-conditioned policy Kaelbling [1993], Schaul et al. [2015] that seeks to obtain the indicator reward of having the observation exactly match the goal. Such a reward does not require any additional instrumentation of the environment beyond the sensors the robot already has. But in practice, this reward is never observed because in continuous spaces like the ones in robotics, the exact same observation is never observed twice. Luckily, if we are using an off-policy RL algorithm Lillicrap et al. [2015], Haarnoja et al. [2018], we can "relabel" a collected trajectory by replacing its goal by a state actually visited during that trajectory, therefore observing the indicator reward as often as we wish. This method was introduced as Hindsight Experience Replay Andrychowicz et al. [2017] or HER, although it used special resets, and the reward was in fact an $\epsilon$-ball around the goal, which only makes sense in lower-dimensional state-spaces. More recently the method was shown to work directly from vision with a special reward Nair et al. [2018a], and even only with the indicator reward of exactly matching observation and goal Florensa et al. [2018a].
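To make the relabeling idea concrete, the sketch below applies the "future" relabeling strategy to one stored trajectory; the data layout, the relabeling probability, and the exact-match test are illustrative assumptions, not the implementation used in the paper.

```python
import numpy as np

def her_relabel(trajectory, p_future=0.8, rng=np.random.default_rng()):
    """Hindsight relabeling sketch (assumed data layout).

    `trajectory` is a list of (state, action, next_state, goal) tuples.
    With probability `p_future`, each transition's goal is replaced by a state
    actually reached later in the same trajectory ("future" strategy), so the
    indicator reward 1[next_state == goal] can actually be observed.
    """
    relabeled = []
    T = len(trajectory)
    for t, (s, a, s_next, g) in enumerate(trajectory):
        if rng.random() < p_future:
            t_future = rng.integers(t, T)        # pick a later time step
            g = trajectory[t_future][2]          # its achieved state becomes the goal
        reward = float(np.allclose(s_next, g))   # indicator reward on matching the goal
        relabeled.append((s, a, s_next, g, reward))
    return relabeled
```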

In theory these approaches could learn how to reach any goal, but the breadth-first nature of the algorithm means that some areas of the space take a long time to be learned Florensa et al. [2018b]. This is especially challenging when there are bottlenecks between different areas of the state-space, and random motion might not traverse them easily Florensa et al. [2017b]. Some practical examples of this are pick-and-place, or navigating narrow corridors between rooms, as illustrated in Fig. 2 depicting the diverse set of environments we work with. In both cases a specific state needs to be reached (grasp the object, or enter the corridor) before a whole new area of the space is discovered (placing the object, or visiting the next room). This problem could be addressed by engineering a reward that guides the agent towards the bottlenecks, but this defeats the purpose of trying to learn without direct reward supervision. In this work we study how to leverage a few demonstrations that traverse those bottlenecks to boost the learning of goal-reaching policies.

Learning from Demonstrations, or Imitation Learning (IL), is a well-studied field in robotics Kalakrishnan et al. [2009], Ross et al. [2011], Bojarski et al. [2016]. In many cases it is easier to obtain a few demonstrations from an expert than to provide a good reward that describes the task. Most of the previous work on IL is centered around trajectory following, or doing a single task. Furthermore, it is limited by the performance of the demonstrations, or relies on engineered rewards to improve upon them. In this work we study how IL methods can be extended to the goal-conditioned setting, and show that, combined with techniques like HER, they can outperform the demonstrator without the need of any additional reward. We also investigate how the different methods degrade when the trajectories of the expert become less optimal, or less abundant. We also observe that these methods can be run in a completely reset-free fashion, hence overcoming another limitation of RL in the real world. Finally, the method we develop is able to leverage demonstrations that do not include the expert actions. This is very convenient in practical robotics, where demonstrations might have been given by a motion planner, by kinesthetic demonstrations (moving the agent externally, and not by actually actuating it), or even by another agent. To our knowledge, this is the first framework that can boost goal-conditioned policy learning with only state demonstrations.

2 Preliminaries

We define a discrete-time finite-horizon discounted Markov decision process (MDP) by a tuple $M = (\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma, H)$, where $\mathcal{S}$ is a state set, $\mathcal{A}$ is an action set, $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_+$ is a transition probability distribution, $\gamma \in [0, 1]$ is a discount factor, and $H$ is the horizon. Our objective is to find a stochastic policy $\pi_\theta$ that maximizes the expected discounted reward within the MDP, $\eta(\pi_\theta) = \mathbb{E}_\tau[\sum_{t=0}^{H} \gamma^t r(s_t, a_t)]$. We denote by $\tau = (s_0, a_0, \ldots, a_{H-1}, s_H)$ the entire state-action trajectory, where $s_0 \sim \rho_0(s_0)$, $a_t \sim \pi_\theta(a_t \mid s_t)$, and $s_{t+1} \sim \mathcal{P}(s_{t+1} \mid s_t, a_t)$. In the goal-conditioned setting that we use here, the policy and the reward are also conditioned on a "goal" $g \in \mathcal{G}$. The reward is $r(s_t, a_t, s_{t+1}, g) = \mathbb{1}[s_{t+1} = g]$, and hence the return is $\gamma^{T_g}$, where $T_g$ is the number of time-steps to the goal. Given that the transition probability is not affected by the goal, $g$ can be "relabeled" in hindsight, so a transition $(s_t, a_t, s_{t+1}, g)$ can be treated as $(s_t, a_t, s_{t+1}, g' = s_{t'})$ for any state $s_{t'}$ visited later in the same trajectory. Finally, we also assume access to trajectories $\{(s^e_0, a^e_0, \ldots, s^e_H, g^e)\}$ that were collected by an expert attempting to reach a goal $g^e$ sampled uniformly among the feasible goals. Those trajectories must be approximately geodesic, meaning that the actions are taken such that the goal is reached as fast as possible.
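As a small worked example of the return defined above, the following sketch computes $\gamma^{T_g}$ for a recorded trajectory. The floating-point tolerance used to test whether the observation matches the goal is our own assumption; the definition above uses an exact match.

```python
import numpy as np

def goal_conditioned_return(states, goal, gamma=0.99, tol=1e-6):
    """Return gamma ** T_g, where T_g is the first time step at which the
    observation matches the goal; 0 if the goal is never reached."""
    for t, s in enumerate(states):
        if np.linalg.norm(np.asarray(s) - np.asarray(goal)) < tol:
            return gamma ** t
    return 0.0
```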

3 Related Work

Imitation Learning can be seen as an alternative to reward crafting to train desired behaviors. There are many ways to leverage demonstrations, from Behavioral Cloning Pomerleau [1989], which directly maximizes the likelihood of the expert actions under the training agent policy, to Inverse Reinforcement Learning, which extracts a reward function from those demonstrations and then trains a policy to maximize it Ziebart et al. [2008], Finn et al. [2016], Fu et al. [2018]. Another formulation close to the latter, introduced by Ho and Ermon [2016], is Generative Adversarial Imitation Learning (GAIL), explained in detail in the next section. Originally, the algorithms used to optimize the policy were on-policy methods like Trust Region Policy Optimization Schulman et al. [2015], but recently there has been a wave of works leveraging the efficiency of off-policy algorithms without loss in stability Blondé and Kalousis [2019], Sasaki et al. [2019], Schroecker et al. [2019], Kostrikov et al. [2019]. This is a key capability that we exploit later on.

Unfortunately, most work in the field cannot outperform the expert unless another reward is available during training Vecerik et al. [2017], Gao et al. [2018], Sun et al. [2018], which might defeat the purpose of using demonstrations in the first place. Furthermore, most tasks tackled with these methods consist of tracking expert state trajectories Zhu et al. [2018b], Peng et al. [2018], but cannot adapt to unseen situations.

In this work we are interested in goal-conditioned tasks, where the objective is to be able to reach any state upon demand. This kind of multi-task learning is pervasive in robotics, but challenging if no reward-shaping is applied. Relabeling methods like Hindsight Experience Replay Andrychowicz et al. [2017] unlock learning even in the sparse-reward case Florensa et al. [2018a]. Nevertheless, the inherent breadth-first nature of the algorithm might still make learning complex policies very inefficient. To overcome the exploration issue we investigate the effect of leveraging a few demonstrations. The closest prior work is by Nair et al. [2018b], where a Behavioral Cloning loss is used with a Q-filter. We found that a simple annealing of the Behavioral Cloning loss Rajeswaran et al. [2018] works better. Furthermore, we also introduce a new relabeling technique for the expert trajectories that is particularly useful when only few demonstrations are available. We also experiment with goal-conditioned GAIL, leveraging its recently shown compatibility with off-policy algorithms.

4 Demonstrations in Goal-conditioned tasks

1: Input: Demonstrations $\mathcal{D}$, replay buffer $\mathcal{R}$, policy $\pi_\theta$, discount $\gamma$, hindsight probability $p$
2: while not done do
3:     # Sample rollout
4:     Sample goal $g$   ▷ Goals are sampled from observed states
5:     Use $\pi_\theta(\cdot \mid \cdot, g)$ to sample a trajectory and add it to $\mathcal{R}$
6:     # Sample from buffers
7:     $(s_t, a_t, s_{t+1}, g) \sim \mathcal{R}$,   $(s^e_t, a^e_t, s^e_{t+1}, g^e) \sim \mathcal{D}$
8:     # Relabel agent
9:     if HER then
10:         for each $(s_t, a_t, s_{t+1}, g)$, with probability $p$ do
11:             $g \leftarrow s_{t'}$,   $t' \sim$ Unif$(t, H)$   ▷ Use future HER strategy
12:         end for
13:     end if
14:     if EXPERT RELABEL then
15:         $g^e \leftarrow s^e_{t'}$,   $t' \sim$ Unif$(t, H)$
16:     end if
17:     $r_t = \mathbb{1}[s_{t+1} = g]$
18:     if $\delta_{GAIL} > 0$ then
19:         Update discriminator $D_\psi$ on expert tuples from $\mathcal{D}$ and agent tuples from $\mathcal{R}$
20:         $r_t \leftarrow r_t + \delta_{GAIL}\, D_\psi(a_t, s_t, g)$   ▷ Add annealed GAIL reward
21:     end if
22:     # Fit $Q_\phi$
23:     $y_t = r_t + \gamma\, Q_{\phi'}(s_{t+1}, \pi_{\theta'}(s_{t+1}, g), g)$   ▷ Use target networks for stability
24:     $\phi \leftarrow \arg\min_\phi \sum_t \big(Q_\phi(s_t, a_t, g) - y_t\big)^2$
25:     # Update Policy
26:     $\theta \leftarrow \theta + \alpha\big(\nabla_\theta J_{DDPG} - \beta\, \nabla_\theta \mathcal{L}_{BC}\big)$   ▷ Combine weighted gradients
27:     Anneal $\beta$ and $\delta_{GAIL}$   ▷ Ensures outperforming the expert
28: end while
Algorithm 1 Goal-conditioned Imitation Learning

In this section we describe the different algorithms we compare to running only Hindsight Experience Replay Andrychowicz et al. [2017]. First we revisit adding a Behavioral Cloning loss to the policy update as in Nair et al. [2018b], then we propose a novel expert relabeling technique, and finally we formulate for the first time a goal-conditioned GAIL algorithm, and propose a method to train it with state-only demonstrations.

4.1 Goal-conditioned Behavioral Cloning

The most direct way to leverage demonstrations is to construct a data-set of all state-action-goal tuples $\{(s_t, a_t, g)\}$, and run a supervised regression algorithm. In the goal-conditioned case and assuming a deterministic policy $\pi_\theta(s, g)$, the loss is:

$\mathcal{L}_{BC}(\theta) = \sum_{(s_t, a_t, g) \in \mathcal{D}} \| \pi_\theta(s_t, g) - a_t \|_2^2$

This loss and its gradient are computed without any additional environment samples from the trained policy $\pi_\theta$. This makes it particularly convenient to combine a gradient descent step based on this loss with other policy updates. In particular we can use a standard off-policy Reinforcement Learning algorithm like DDPG Lillicrap et al. [2015], where we fit the critic $Q_\phi(a, s, g)$, and then estimate the gradient of the expected return as:

$\nabla_\theta J = \mathbb{E}_{(s, g) \sim \mathcal{R}} \big[ \nabla_a Q_\phi(a, s, g)\big|_{a = \pi_\theta(s, g)}\, \nabla_\theta \pi_\theta(s, g) \big]$

In our goal-conditioned case, the $Q_\phi$ fitting can also benefit from "relabeling" as done in HER Andrychowicz et al. [2017]. The improvement guarantees with respect to the task reward are lost when we combine the BC and the deterministic policy gradient updates, but this can be side-stepped by either applying a Q-filter to the BC loss as proposed in Nair et al. [2018b], or by annealing it as we do in our experiments, which allows the agent to eventually outperform the expert.
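The combined policy update can be sketched in PyTorch as follows. Network sizes, the batch layout, and the handling of $\beta$ are our own assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, goal_dim = 6, 2, 2  # illustrative dimensions

policy = nn.Sequential(nn.Linear(obs_dim + goal_dim, 256), nn.ReLU(),
                       nn.Linear(256, act_dim), nn.Tanh())
q_fn = nn.Sequential(nn.Linear(obs_dim + goal_dim + act_dim, 256), nn.ReLU(),
                     nn.Linear(256, 1))
policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def policy_update(agent_batch, expert_batch, beta):
    """agent_batch / expert_batch: dicts of tensors with keys 's', 'a', 'g'."""
    # Deterministic policy gradient term: maximize Q(s, pi(s, g), g) on agent data.
    sg = torch.cat([agent_batch['s'], agent_batch['g']], dim=-1)
    a_pi = policy(sg)
    dpg_loss = -q_fn(torch.cat([agent_batch['s'], agent_batch['g'], a_pi], dim=-1)).mean()

    # Goal-conditioned Behavioral Cloning term on expert (s, a, g) tuples.
    sg_e = torch.cat([expert_batch['s'], expert_batch['g']], dim=-1)
    bc_loss = ((policy(sg_e) - expert_batch['a']) ** 2).sum(dim=-1).mean()

    loss = dpg_loss + beta * bc_loss   # beta is annealed over training
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()
```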

4.2 Relabeling the expert

(a) Performance on reaching states visited in demonstrations. The state is colored in green if the policy reaches it when attempting so, and red otherwise.
(b) Performance on reaching any possible state. Each cell is colored green if the policy can reach the center of it when attempting so, and red otherwise.
Figure 1: Policy performance on reaching different goals in the four rooms, when training on 20 demonstrations with standard Behavioral Cloning (top row) or with our expert relabeling (bottom).

The expert trajectories have been collected by asking the expert to reach a specific goal $g^e$. But they are also valid trajectories to reach any other state visited within the demonstration! This is the key motivating insight for a new type of relabeling: if we have the transitions $(s^e_t, a^e_t, s^e_{t+1}, g^e)$ in a demonstration, we can also consider the transition $(s^e_t, a^e_t, s^e_{t+1}, g' = s^e_{t'})$ for any $t' > t$ as coming from the expert! Indeed, that demonstration also went through the state $s^e_{t'}$, so if that had been the goal, the expert would also have generated this transition. This can be understood as a type of data augmentation leveraging the assumption that the tasks we work on are quasi-static. It will be particularly effective in the low-data regime, where not many demonstrations are available. The effect of expert relabeling can be visualized in the four rooms environment, as it is a 2D task where states and goals can be plotted. In Fig. 1 we compare the final performance of two agents, one trained with pure Behavioral Cloning, and the other one also using expert relabeling.
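A sketch of this augmentation is given below; the data layout and the number of relabeled goals per transition are arbitrary choices for illustration.

```python
import numpy as np

def relabel_expert(demo, n_extra=4, rng=np.random.default_rng()):
    """Expert-relabeling sketch: treat states reached later in a demonstration
    as alternative goals for the earlier transitions.

    `demo` is a list of (state, action, next_state) tuples from one expert
    trajectory; returns additional (state, action, next_state, goal) tuples.
    """
    augmented = []
    T = len(demo)
    for t, (s, a, s_next) in enumerate(demo):
        for _ in range(n_extra):
            t_goal = rng.integers(t, T)   # any time step from here onward
            goal = demo[t_goal][2]        # its achieved state becomes the goal
            augmented.append((s, a, s_next, goal))
    return augmented
```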

4.3 Goal-conditioned GAIL with Hindsight

The compounding error in Behavioral Cloning might make the policy deviate arbitrarily from the demonstrations, and it requires too many demonstrations when the state dimension increases. The first problem is less severe in our goal-conditioned case because we do in fact want to visit and be able to purposefully reach all states, even the ones that the expert did not visit. But the second drawback becomes pressing when attempting to scale this method to practical robotics tasks where the observations might be high-dimensional sensory input like images. Both problems can be mitigated by using other Imitation Learning algorithms that can leverage additional rollouts collected by the learning agent in a self-supervised manner, like GAIL Ho and Ermon [2016]. In this section we extend the formulation of GAIL to tackle goal-conditioned tasks, and then we detail how it can be combined with HER Andrychowicz et al. [2017], which allows the agent to outperform the demonstrator and generalize to reaching all goals. We call the final algorithm goal-GAIL. First of all, the discriminator needs to also be conditioned on the goal $g$, and be trained by minimizing

$\mathcal{L}_{GAIL}(D_\psi, \mathcal{D}, \mathcal{R}) = -\mathbb{E}_{(s_t, a_t, g) \sim \mathcal{D}}\big[\log D_\psi(a_t, s_t, g)\big] - \mathbb{E}_{(s_t, a_t, g) \sim \mathcal{R}}\big[\log\big(1 - D_\psi(a_t, s_t, g)\big)\big]$

Once the discriminator is fitted, we can run our favorite RL algorithm on the reward $D_\psi(a_t, s_t, g)$. In our case we used the off-policy algorithm DDPG Lillicrap et al. [2015] to allow for the relabeling techniques outlined above. In the goal-conditioned case we also supplement it with the indicator reward $\mathbb{1}[s_{t+1} = g]$. This combination is slightly tricky because now the fitted $Q_\phi$ does not have the same clear interpretation it has when only one of the two rewards is used Florensa et al. [2018a]. Nevertheless, both rewards push the policy towards the goals, so they should not be too conflicting. Furthermore, to avoid any drop in final performance, the weight of the reward coming from GAIL ($\delta_{GAIL}$) can be annealed.
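Below is a minimal PyTorch sketch of a goal-conditioned discriminator and the annealed reward it provides. The two 256-unit hidden layers loosely follow the architecture reported in the appendix, but the batch keys, learning rate, and other details are our own assumptions.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, goal_dim = 6, 2, 2  # illustrative dimensions

disc = nn.Sequential(nn.Linear(obs_dim + act_dim + goal_dim, 256), nn.ReLU(),
                     nn.Linear(256, 256), nn.ReLU(),
                     nn.Linear(256, 1))   # logit of D_psi(a, s, g)
disc_opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def discriminator_step(expert_batch, agent_batch):
    """Each batch is a dict with tensors 's', 'a', 'g'; expert tuples may come
    from relabeled demonstrations."""
    x_e = torch.cat([expert_batch['s'], expert_batch['a'], expert_batch['g']], dim=-1)
    x_a = torch.cat([agent_batch['s'], agent_batch['a'], agent_batch['g']], dim=-1)
    loss = bce(disc(x_e), torch.ones(x_e.shape[0], 1)) + \
           bce(disc(x_a), torch.zeros(x_a.shape[0], 1))
    disc_opt.zero_grad(); loss.backward(); disc_opt.step()

def gail_reward(batch, delta_gail):
    """Annealed discriminator reward added on top of the indicator reward."""
    x = torch.cat([batch['s'], batch['a'], batch['g']], dim=-1)
    with torch.no_grad():
        return delta_gail * torch.sigmoid(disc(x)).squeeze(-1)
```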

All the variants we study are detailed in Algorithm 1. In particular, keeping only the Behavioral Cloning loss falls back to pure Behavioral Cloning, $\beta = 0$ removes the BC component, $p = 0$ does not relabel agent trajectories, $\delta_{GAIL} = 0$ removes the discriminator output from the reward, and EXPERT RELABEL indicates whether the expert relabeling explained above should be performed. In the next section we test these variants in the diverse environments depicted in Fig. 2.
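For concreteness, these variants can be summarized as a small configuration object; the field names and default values below are hypothetical, chosen only to mirror the hyperparameters named in Algorithm 1.

```python
from dataclasses import dataclass

@dataclass
class GCILConfig:
    """Hypothetical configuration mirroring the knobs of Algorithm 1
    (field names are ours, not the paper's)."""
    beta_bc: float = 0.1         # weight of the Behavioral Cloning loss (annealed)
    delta_gail: float = 0.1      # weight of the discriminator reward (annealed)
    p_her: float = 0.8           # probability of hindsight-relabeling agent transitions
    expert_relabel: bool = True  # relabel expert transitions with later expert states

# Variants discussed above (illustrative settings):
bc_plus_her = GCILConfig(delta_gail=0.0)                     # annealed BC combined with HER
goal_gail   = GCILConfig(beta_bc=0.0)                        # discriminator reward + HER
her_only    = GCILConfig(beta_bc=0.0, delta_gail=0.0,
                         expert_relabel=False)               # plain HER, no demonstrations
pure_bc     = GCILConfig(delta_gail=0.0, p_her=0.0)          # only demonstrations drive learning
```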

4.4 Use of state-only Demonstrations

Both Behavioral Cloning and GAIL use state-action pairs from the expert. This limits the use of these methods, combined or not with HER, to setups where the exact same agent was actuated to reach different goals. Nevertheless, much more data could be cheaply available if the actions were not required. For example, non-expert humans might not be able to operate a robot, but might be able to move the robot along the desired trajectory. This is called a kinesthetic demonstration. Another type of state-only demonstration is the one used in third-person imitation Stadie et al. [2017], where the expert performs the task with an embodiment different from that of the agent that needs to learn it. This has mostly been applied to the trajectory-following case; in our case every demonstration might have a different objective.

Furthermore, we would like to propose a method that not only leverages state-only demonstrations, but can also outperform the quality and coverage of the demonstrations given, or at least generalize to similar goals. The main insight is that we can replace the action $a_t$ in the GAIL formulation by the next state $s_{t+1}$, and in most environments this should be as informative as having access to the action directly. Intuitively, given a desired goal $g$, it should be possible to determine if a transition $s_t \to s_{t+1}$ is taking the agent in the right direction. The loss function to train a discriminator able to tell apart the current agent and the demonstrations (which always transition towards the goal) is simply:

$\mathcal{L}^{s}_{GAIL}(D_\psi, \mathcal{D}, \mathcal{R}) = -\mathbb{E}_{(s_t, s_{t+1}, g) \sim \mathcal{D}}\big[\log D_\psi(s_{t+1}, s_t, g)\big] - \mathbb{E}_{(s_t, s_{t+1}, g) \sim \mathcal{R}}\big[\log\big(1 - D_\psi(s_{t+1}, s_t, g)\big)\big]$
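A sketch of this state-only discriminator loss, under the same illustrative assumptions as the goal-conditioned discriminator above (the only change is that the action input is replaced by the next state):

```python
import torch
import torch.nn as nn

obs_dim, goal_dim = 6, 2  # illustrative dimensions

disc_s = nn.Sequential(nn.Linear(2 * obs_dim + goal_dim, 256), nn.ReLU(),
                       nn.Linear(256, 256), nn.ReLU(),
                       nn.Linear(256, 1))   # logit of D_psi(s_next, s, g)
bce = nn.BCEWithLogitsLoss()

def state_only_disc_loss(expert_batch, agent_batch):
    """Batches are dicts with tensors 's', 's_next', 'g'; the expert
    transitions contain no actions."""
    x_e = torch.cat([expert_batch['s'], expert_batch['s_next'], expert_batch['g']], dim=-1)
    x_a = torch.cat([agent_batch['s'], agent_batch['s_next'], agent_batch['g']], dim=-1)
    return bce(disc_s(x_e), torch.ones(x_e.shape[0], 1)) + \
           bce(disc_s(x_a), torch.zeros(x_a.shape[0], 1))
```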

5 Experiments

We are interested in answering the following questions:

  1. Can the use of demonstrations accelerate the learning of goal-conditioned tasks without reward?

  2. Is the Expert Relabeling an efficient way of doing data-augmentation on the demonstrations?

  3. Can state-only demonstrations be leveraged equally well as full trajectories?

  4. Compared to Behavioral Cloning methods, is GAIL more robust to noise in the expert actions?

We evaluate these questions in two different simulated robotic goal-conditioned tasks, detailed in the next subsection along with the performance metric used throughout the experiments section. All the results use 20 demonstrations. All curves are computed over 5 random seeds, and the shaded area is one standard deviation.

5.1 Tasks

Experiments are conducted in two continuous environments in MuJoCo Todorov et al. [2012]. The performance metric we use in all our experiments is the percentage of goals in the feasible goal space the agent is able to reach. We call this metric coverage. To estimate this percentage we sample feasible goals uniformly and execute a rollout of the current policy. It is considered a success if the agent gets within $\epsilon$ of the desired goal. Note that during training we do not assume access to the feasible goal distribution, nor do we use any $\epsilon$-ball around the goal to give rewards. These are two very commonly used assumptions in works using HER Andrychowicz et al. [2017], Nair et al. [2018b], and we do not assume them.
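A sketch of how this coverage metric can be computed; the helper callables and the threshold value below are our own placeholders.

```python
import numpy as np

def coverage(policy_rollout, sample_feasible_goal, n_goals=100, eps=0.05):
    """Fraction of uniformly sampled feasible goals reached within `eps`.

    `policy_rollout(goal)` should return the final state of one rollout of the
    current policy attempting `goal`; `sample_feasible_goal()` samples a goal
    uniformly from the feasible goal space (used only for evaluation, never
    for training).
    """
    successes = 0
    for _ in range(n_goals):
        goal = sample_feasible_goal()
        final_state = policy_rollout(goal)
        successes += np.linalg.norm(np.asarray(final_state) - np.asarray(goal)) < eps
    return successes / n_goals
```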

(a) Continuous Four rooms
(b) Fetch Pick & Place
Figure 2: Environments where we test the use of demonstrations

Four rooms environment: This is a continuous twist on a well-studied problem in the Reinforcement Learning literature. A point mass is placed in an environment with four rooms connected through small openings, as depicted in Fig. 2(a). The action space of the agent is continuous and specifies the desired change in state, and the goal space exactly corresponds to the state space.

Pick and Place: This task is the same as the one described by Nair et al. [2018b], where a Fetch robot needs to pick up a block and place it at a desired point in space. The control is four-dimensional, corresponding to a change in position of the end-effector as well as a change in gripper opening. The goal space is three-dimensional and is restricted to the position of the block.

5.2 Goal-conditioned Imitation Learning

In goal-conditioned tasks, HER Andrychowicz et al. [2017] should eventually converge to a policy able to reach any desired goal. Nevertheless, this might take a long time, especially in environments with bottlenecks that need to be traversed before accessing a whole new area of the goal space. In this section we show how the methods introduced in the previous section can leverage a few demonstrations to improve the convergence speed of HER. This was already studied for the case of Behavioral Cloning by Nair et al. [2018b]; in this work we show that we also get a benefit when using GAIL as the Imitation Learning algorithm, which brings considerable advantages over Behavioral Cloning, as shown in the following sections.

(a) Continuous Four rooms
(b) Fetch Pick & Place
Figure 3: Performance of Goal-conditioned GAIL compared to only GAIL and HER

In both environments, we observe that running GAIL with relabeling (GAIL+HER) considerably outperforms running either of them in isolation. HER alone converges very slowly, although, as expected, it ends up reaching the same final performance if run long enough. On the other hand, GAIL by itself learns fast at the beginning, but its final performance is capped. This is because, despite collecting more samples in the environment, those samples come with no reward of any kind indicating the task to perform (reaching the given goals). Therefore, once it has extracted all the information it can from the demonstrations, it cannot keep learning and generalize to goals further from the demonstrations. This is no longer an issue when combined with HER, as our results show.

5.3 Expert relabeling

Here we show that the Expert Relabeling technique introduced in Section 4.2 is beneficial when using demonstrations in the goal-conditioned imitation learning framework. As shown in Fig. 4, our expert relabeling technique brings considerable performance boosts for both Behavioral Cloning methods and goal-GAIL in both environments.

We also perform a further analysis of the benefit of the expert relabeling in the four-rooms environment because it is easy to visualize in 2D the goals the agent can reach. We see in Fig. 1 that without the expert relabeling, the agent fails to learn how to reach many intermediate states visited in the middle of a demonstration.

The performance of running pure Behavioral Cloning is plotted as a horizontal dotted line given that it does not require any additional environment steps. We observe that combining BC with HER always produces faster learning than running just HER, and it reaches higher final performance than running pure BC with only 20 demonstrations.

(a) Continuous Four rooms
(b) Fetch Pick & Place
Figure 4: Effect of our Expert Relabeling technique on different Goal-Conditioned Imitation Learning algorithms.

5.4 Using state-only demonstrations

Figure 5: Output of the Discriminator when the goal is the white point in the lower left, and the start is always at the top right.

Behavioral Cloning and standard GAIL rely on state-action tuples coming from the expert. Nevertheless, there are many cases in robotics where we have access to demonstrations of a task, but without the actions. In this section we want to emphasize that all the results obtained with our goal-GAIL method and reported in Fig. 3 and Fig. 4 do not require any access to the actions the expert took. Surprisingly, in the four rooms environment, despite the more restricted information goal-GAIL has access to, it outperforms BC combined with HER. This might be due to the superior imitation learning performance of GAIL, and also to the fact that these tasks might be solvable by only matching the state-distribution of the expert. We also run the experiment of training GAIL with a discriminator conditioned only on the current state, and not the action (as also done in other non-goal-conditioned works Fu et al. [2018]), and we observe that the discriminator learns a very well-shaped reward that clearly encourages the agent to go towards the goal, as pictured in Fig. 5. See the Appendix for more details.

5.5 Robustness to sub-optimal expert

In the above sections we assumed access to a perfectly optimal expert. Nevertheless, in practical applications experts might behave more erratically, not always taking the best action to go towards the given goal. In this section we study how the different methods perform when a sub-optimal expert is used. To do so we collect trajectories attempting goals by modifying our optimal expert $\pi^*$ in three ways: first we condition it on a perturbed goal $g + \epsilon_g$, where $\epsilon_g \sim \mathcal{N}(0, \sigma_g^2)$, so the expert does not go exactly where it is asked to. Second, we add noise $\epsilon_a$ to the optimal actions, and third we make it $\epsilon$-greedy. All together, the sub-optimal expert is then $a = \mathbb{1}[u > \epsilon]\big(\pi^*(s, g + \epsilon_g) + \epsilon_a\big) + \mathbb{1}[u \le \epsilon]\, a_{rand}$, where $u \sim$ Unif$(0, 1)$, $\epsilon_a \sim \mathcal{N}(0, \sigma_a^2)$, and $a_{rand}$ is a uniformly sampled random action in the action space.
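The sub-optimal expert can be sketched as a wrapper around the optimal one; all noise magnitudes and action bounds below are illustrative placeholders, not the values used in the experiments.

```python
import numpy as np

def suboptimal_expert(optimal_expert, state, goal, action_dim=2, sigma_g=0.1,
                      sigma_a=0.1, eps_greedy=0.1, rng=np.random.default_rng()):
    """Perturbed expert: goal noise, action noise and epsilon-greedy random
    actions, wrapping a given `optimal_expert(state, goal)` callable."""
    if rng.random() < eps_greedy:
        return rng.uniform(-1.0, 1.0, size=action_dim)      # uniformly random action
    noisy_goal = goal + rng.normal(0.0, sigma_g, size=np.shape(goal))
    action = np.asarray(optimal_expert(state, noisy_goal))  # expert aims at a perturbed goal
    return action + rng.normal(0.0, sigma_a, size=action.shape)
```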

In Fig. 6 we observe that approaches that directly try to copy the expert's actions, like Behavioral Cloning, greatly suffer under a sub-optimal expert, to the point that they barely provide any improvement over plain Hindsight Experience Replay. On the other hand, methods based on training a discriminator between expert and current agent behavior are able to leverage much noisier experts. A possible explanation of this phenomenon is that a discriminator approach can give a positive signal as long as the transition is "in the right direction", without trying to exactly enforce a single action. Under this lens, having some noise in the expert might actually improve the performance of these adversarial approaches, as has been observed in the generative models literature Goodfellow et al.

(a) Continuous Four rooms
(b) Fetch Pick & Place
Figure 6: Learning with sub-optimal demonstrations

6 Conclusions and Future Work

Hindsight relabeling can be used to learn useful behaviors without any reward supervision for goal-conditioned tasks, but it is inefficient when the state-space is large or includes exploration bottlenecks. In this work we show how only a few demonstrations can be leveraged to improve the convergence speed of these methods. We introduce a novel algorithm, goal-GAIL, that converges faster than HER and reaches a better final performance than a naive goal-conditioned GAIL. We also study the effect of expert relabeling as a type of data augmentation on the provided demonstrations, and demonstrate that it improves the performance of both goal-GAIL and goal-conditioned Behavioral Cloning. We emphasize that our goal-GAIL method only needs state demonstrations, without using expert actions like other Behavioral Cloning methods. Finally, we show that goal-GAIL is robust to sub-optimalities in the expert behavior.

All the above factors make our goal-GAIL algorithm well suited for real-world robotics, which is an exciting direction for future work. Along the same lines, we also want to test the performance of these methods in vision-based tasks. Our preliminary experiments show that Behavioral Cloning fails completely in the low-data regime in which we operate (fewer than 20 demonstrations).

References

Appendix A Hyperparameters and Architectures

In the two environments, i.e. the Four Rooms environment and Fetch Pick & Place, the task horizons are set to 300 and 100 respectively. The discount factors are . In all experiments, the Q function, policy and discriminator are parameterized by fully connected neural networks with two hidden layers of size 256. DDPG is used for policy optimization, and the hindsight probability is set to . The initial value of the behavior cloning loss weight $\beta$ is set to and is annealed by a factor of 0.9 every 250 rollouts collected. The initial value of the discriminator reward weight $\delta_{GAIL}$ is set to . We found empirically that there is no need to anneal $\delta_{GAIL}$.
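The annealing schedule described above can be sketched as follows, assuming a multiplicative decay; the initial value of $\beta$ used below is an arbitrary placeholder.

```python
def annealed_beta(n_rollouts, beta_init=0.1, decay=0.9, every=250):
    """Behavioral Cloning loss weight after `n_rollouts` collected rollouts,
    assuming the weight is multiplied by `decay` every `every` rollouts."""
    return beta_init * decay ** (n_rollouts // every)
```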

For experiments with sub-optimal expert in section 5.5, is set to and , and is set to and respectively for Four Rooms environment and Fetch Pick & Place.

Appendix B Effect of Different Input of Discriminator

We trained the discriminator in three settings:

  • current state and goal: $D_\psi(s_t, g)$

  • current state, next state and goal: $D_\psi(s_t, s_{t+1}, g)$

  • current state, action and goal: $D_\psi(s_t, a_t, g)$

We compare the three different setups in Fig. 7 and Fig. 8.
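The three setups differ only in how the discriminator input is assembled, for example as below (the batch keys are assumed names):

```python
import torch

def discriminator_input(batch, mode):
    """Build the discriminator input for the three setups compared in Figs. 7-8."""
    if mode == "state_goal":            # current state and goal
        parts = [batch['s'], batch['g']]
    elif mode == "state_next_goal":     # current state, next state and goal
        parts = [batch['s'], batch['s_next'], batch['g']]
    elif mode == "state_action_goal":   # current state, action and goal
        parts = [batch['s'], batch['a'], batch['g']]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return torch.cat(parts, dim=-1)
```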

(a) 12 demos
(b) 6 demos
Figure 7: Study of different discriminator inputs for goal-GAIL in Continuous Four Rooms
(a) 12 demos
(b) 6 demos
Figure 8: Study of different discriminator inputs for goal-GAIL in Fetch Pick & Place