SQIL: Imitation Learning via Regularized Behavioral Cloning

05/27/2019 · Siddharth Reddy, et al. · UC Berkeley

Learning to imitate expert behavior from action demonstrations is a difficult problem in robotic control, especially in environments with high-dimensional, continuous observations and unknown dynamics. Simple approaches based on behavioral cloning (BC) suffer from state distribution shift, while more complex methods that generalize to out-of-distribution states can be difficult to use, since they typically involve adversarial optimization. We propose an alternative that combines the simplicity of BC with the robustness of adversarial imitation learning. The key insight is that under the maximum entropy model of expert behavior, BC corresponds to fitting a soft Q function that maximizes the likelihood of observed actions. This perspective suggests a way to regularize BC so that it generalizes to out-of-distribution states: combine the standard maximum-likelihood objective with a penalty on the soft Bellman error of the soft Q function. We show that this penalty term gives the agent an incentive to take actions that lead it back to demonstrated states when it encounters new states. Experiments show that our method outperforms BC and GAIL on a variety of image-based and low-dimensional environments in Box2D, Atari, and MuJoCo.

1 Introduction

This paper considers the problem of training an agent to imitate an expert policy, given expert action demonstrations and access to the environment. The agent does not get to observe a reward signal or query the expert, and does not know the state transition dynamics.

Standard approaches to this problem based on behavioral cloning (BC) seek to imitate the expert’s actions, but do not reason about the consequences of actions pomerleau1991efficient. As a result, they suffer from state distribution shift, and fail to generalize to states that are very different from those seen in the demonstrations ross2010efficient; ross2011reduction. Approaches based on inverse reinforcement learning (IRL) deal with this issue by fitting a reward function that represents preferences over trajectories rather than individual actions ng2000algorithms; ziebart2008maximum, and using the learned reward function to train the imitation agent through RL wulfmeier2015maximum; finn2016guided; fu2017learning. This is the core idea behind generative adversarial imitation learning (GAIL), which implicitly combines IRL and RL using generative adversarial networks (GANs) ho2016generative; goodfellow2014generative. GAIL is the state of the art, but tends to require additional reward augmentation and feature engineering when applied to environments with high-dimensional image observations li2017infogail, and inherits the difficulty of training GANs kurach2018gan.

The main idea in this paper is that instead of resorting to adversarial imitation learning to overcome state distribution shift, we can modify BC so that it generalizes to out-of-distribution states while remaining simple and easy to implement. We propose combining the standard maximum-likelihood objective of BC, which encourages the agent to imitate the expert in demonstrated states, with a regularization term that gives the agent an incentive to take actions that lead it back to demonstrated states when it encounters new states.

Our intuition for the regularizer is that it should incorporate information about the dynamics of the environment into the objective, so that the agent can learn to get back to demonstrated states. We derive such a regularizer from the maximum entropy model of expert behavior. In this model, the logits of the imitation policy can be interpreted as soft Q values, and BC corresponds to maximum-likelihood estimation of the soft Q function given the demonstrations. The problem with BC is that the learned soft Q function may output arbitrary values in states that are out-of-distribution with respect to the demonstrations. To overcome this issue, we regularize the soft Q function by imposing a penalty on the squared soft Bellman error, which is approximated using transitions from rollouts of the imitation policy periodically sampled during training. Since we do not have access to a reward signal, we set all rewards to zero in the samples, which encourages the agent to get back to the demonstrated states. We refer to this algorithm as regularized behavioral cloning (RBC).

To better understand the effect of the proposed regularizer, we show that RBC is similar to an off-policy reinforcement learning (RL) algorithm that rewards the agent for reaching demonstrated states and matching demonstrated actions in those states. The off-policy RL algorithm is a variant of soft Q-learning that initializes the agent’s experience replay buffer with demonstrations, sets rewards to a positive constant in the stored demonstration experiences, and sets rewards to zero in all additional experiences. We call this method soft Q imitation learning (SQIL). The connection between RBC and SQIL enables us to interpret the effect of the regularizer on the agent’s incentives. It also makes our method easier to implement, since SQIL only requires a few small changes to existing off-policy RL code.

The main contribution of this paper is SQIL: a simple and general imitation learning algorithm that is effective in MDPs with high-dimensional, continuous observations and unknown dynamics. We run experiments in four image-based environments – Car Racing, Pong, Breakout, and Space Invaders – and three low-dimensional environments – Humanoid, HalfCheetah, and Lunar Lander – from OpenAI Gym 1606.01540 , Arcade Learning Environment bellemare2013arcade , and MuJoCo todorov2012mujoco , to compare SQIL to two prior methods: BC and GAIL. We find that SQIL outperforms both prior methods, especially on the image-based tasks. Our experiments illustrate two key benefits of SQIL: (1) that it can overcome the state distribution shift problem of BC without adversarial training or learning a reward function, which makes it easier to use with images, and (2) that it is simple to implement using existing off-policy value-based RL algorithms.

2 Preliminaries

This work builds on the maximum causal entropy (MaxCausalEnt) model of expert behavior ziebart2010modelingint; levine2018reinforcement. In an infinite-horizon Markov Decision Process (MDP) with a continuous state space $\mathcal{S}$ and discrete action space $\mathcal{A}$ (assuming a discrete action space simplifies our analysis; SQIL can be applied to continuous control tasks using existing sampling methods haarnoja2017reinforcement; haarnoja2018soft, as illustrated in Section 5.3), the demonstrator is assumed to follow a policy $\pi$ that maximizes reward $R(s, a)$. The policy forms a Boltzmann distribution over actions,

$\pi(a \mid s) = \frac{\exp(Q(s, a))}{\sum_{a' \in \mathcal{A}} \exp(Q(s, a'))} = \exp\left(Q(s, a) - V(s)\right)$  (1)

where $Q$ is the soft Q function and $V$ is the soft value function,

$V(s) = \log \sum_{a' \in \mathcal{A}} \exp\left(Q(s, a')\right)$  (2)

If $s$ is an absorbing state, we assume $V(s) = \frac{r_a}{1 - \gamma}$, where $\gamma$ is the discount factor and $r_a$ is a constant hyperparameter that represents the reward for remaining in an absorbing state for one timestep (Section 4 discusses how the value of $r_a$ is chosen). The soft Q values are a deterministic function of the rewards and dynamics, given by the soft Bellman equation,

$Q(s, a) = R(s, a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\left[V(s')\right]$  (3)
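To make Equations 1 and 2 concrete, the following is a minimal NumPy sketch (ours, not code from the paper) that computes the soft value and the Boltzmann policy from a table of soft Q values; the array shapes and function names are illustrative assumptions.

```python
import numpy as np

def soft_value(q_values):
    """Soft value V(s) = log sum_a' exp(Q(s, a')) (Equation 2), computed with the
    max-subtraction trick for numerical stability.
    q_values: array of shape [num_states, num_actions]."""
    q_max = q_values.max(axis=1, keepdims=True)
    return q_max.squeeze(1) + np.log(np.exp(q_values - q_max).sum(axis=1))

def boltzmann_policy(q_values):
    """Action distribution pi(a | s) = exp(Q(s, a) - V(s)) (Equation 1)."""
    return np.exp(q_values - soft_value(q_values)[:, None])

# Two states, three actions: the policy prefers high-Q actions but stays stochastic.
q = np.array([[1.0, 0.0, -1.0],
              [0.5, 0.5, 0.5]])
print(boltzmann_policy(q))  # each row sums to 1; the second row is uniform
```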

3 Imitation Learning in the Maximum Entropy Model

We aim for an imitation learning algorithm that generalizes to new states, without resorting to complex adversarial optimization procedures. We build on BC, which is a simple approach that seeks to imitate the expert’s actions using supervised learning. BC does not reason about the consequences of actions, so when the agent makes small mistakes and enters states that are slightly different from those in the demonstrations, the distribution mismatch between the states in the demonstrations and those actually encountered by the agent leads to compounding errors ross2011reduction. Our solution is to add a regularization term to BC that enables it to overcome state distribution shift while remaining simple and easy to implement.

We derive the regularization term from the maximum entropy model of expert behavior, resulting in a method that infers the expert’s soft Q function by maximizing the likelihood of observed actions and minimizing the squared soft Bellman error. We show that this regularized BC method is similar to an off-policy RL algorithm that rewards the agent for reaching demonstrated states and matching demonstrated actions in those states. This connection enables us to interpret the effect of the regularizer on the agent’s incentives, and makes it possible to implement our method by applying a few simple modifications to any off-policy value-based RL algorithm.

3.1 Behavioral Cloning

Under the generative model in Section 2, BC corresponds to maximum-likelihood estimation of the soft Q function given the demonstrations. Let $\tau = (s_1, a_1, s_2, a_2, \ldots, s_T)$ denote a rollout, where $s_T$ is an absorbing state, and let $\mathcal{D}_{\text{demo}}$ denote the set of demonstration rollouts. We define BC as fitting a parameterized soft Q function $Q_\theta$ to minimize the loss,

$\ell_{\text{BC}}(\theta) \triangleq -\sum_{\tau \in \mathcal{D}_{\text{demo}}} \sum_{(s, a) \in \tau} \log \pi_\theta(a \mid s) = -\sum_{\tau \in \mathcal{D}_{\text{demo}}} \sum_{(s, a) \in \tau} \left(Q_\theta(s, a) - V_\theta(s)\right)$  (4)

where the equality follows from Equation 1, and $V_\theta$ denotes the soft value function given by $Q_\theta$ and Equation 2. The experiments in Section 5 use a convolutional neural network or multi-layer perceptron to model $Q_\theta$, where $\theta$ are the weights of the neural network.
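For illustration, here is a short PyTorch sketch of the BC loss in Equation 4 for a discrete action space. It is a hedged example rather than the authors' implementation: q_net is an assumed network mapping a batch of states to per-action soft Q values, and averaging over the batch (rather than summing over rollouts) is an implementation choice.

```python
import torch

def bc_loss(q_net, states, actions):
    """Behavioral cloning as maximum-likelihood estimation of the soft Q function:
    -log pi_theta(a | s) = -(Q_theta(s, a) - V_theta(s)), i.e. cross-entropy with
    the soft Q values used as logits (Equations 1, 2, and 4)."""
    q = q_net(states)                              # [batch, num_actions]
    v = torch.logsumexp(q, dim=1)                  # soft value V_theta(s), Equation 2
    q_taken = q.gather(1, actions.view(-1, 1)).squeeze(1)
    return -(q_taken - v).mean()

# Equivalent: torch.nn.functional.cross_entropy(q_net(states), actions)
```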

3.2 Regularized Behavioral Cloning

The issue with BC is that when the agent encounters states that are out-of-distribution with respect to $\mathcal{D}_{\text{demo}}$, $Q_\theta$ may output arbitrary values. One solution is to add a regularization term to the BC objective that encourages $Q_\theta$ to output reasonable default values when it encounters such states. For example, a penalty on the L2-norm of the soft Q values would encourage the soft Q values in out-of-distribution states to be zero instead of arbitrary, leading to a uniform random policy in those states. Instead of inducing a uniform policy, we would like to endow the agent with the ability to get back to demonstrated states when it encounters new states. To do so, we need to incorporate information about the dynamics of the environment into the regularization term.

The model in Section 2 suggests a natural choice of regularizer that achieves this goal: a penalty on the squared soft Bellman error, i.e., the squared difference between the LHS and RHS of Equation 3. Since we do not know the expert’s reward function $R$, we set all rewards to zero. Additionally, since the state space cannot be enumerated and the dynamics are unknown, we approximate the penalty by evaluating it on transitions observed in the demonstrations $\mathcal{D}_{\text{demo}}$, as well as on additional rollouts $\mathcal{D}_{\text{samp}}$ periodically sampled during training using the imitation policy. Sampling ensures that the penalty covers the state distribution actually encountered by the agent, instead of only the demonstrations.

Formally, we define the regularized BC objective as follows.

$\ell_{\text{RBC}}(\theta) \triangleq \ell_{\text{BC}}(\theta) + \lambda\, \delta^2(\mathcal{D}_{\text{demo}} \cup \mathcal{D}_{\text{samp}}, 0)$  (5)

where $\lambda$ is a constant hyperparameter, and $\delta^2$ denotes the sum of squared soft Bellman errors,

$\delta^2(\mathcal{D}, r) \triangleq \sum_{\tau \in \mathcal{D}} \sum_{(s, a, s') \in \tau} \left(Q_\theta(s, a) - \left(r + \gamma V_\theta(s')\right)\right)^2$  (6)

The BC objective encourages $Q_\theta$ to output high values for demonstrated actions at demonstrated states, and the penalty term propagates those high values to nearby states. In other words, $Q_\theta$ outputs high values for actions that lead to states from which the demonstrated states are reachable, so when the agent finds itself far from the demonstrated states, it takes actions that lead it back to the demonstrated states.
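The sketch below (ours, under the same assumptions as the BC sketch above) evaluates the penalty in Equation 6 on a batch of transitions with the reward replaced by a constant; the RBC loss in Equation 5 adds this penalty, with reward zero, to the BC loss.

```python
import torch

def squared_soft_bellman_error(q_net, states, actions, next_states, dones,
                               reward=0.0, gamma=0.99):
    """Sum of squared soft Bellman errors (Equation 6) over a batch of transitions,
    with every reward replaced by the constant `reward` (zero in the RBC penalty).
    `dones` is a float 0/1 flag for absorbing next states; this sketch simply
    zeroes their bootstrap term."""
    q_taken = q_net(states).gather(1, actions.view(-1, 1)).squeeze(1)
    next_v = torch.logsumexp(q_net(next_states), dim=1)   # V_theta(s'), Equation 2
    target = reward + gamma * (1.0 - dones) * next_v
    return ((q_taken - target) ** 2).sum()

# RBC (Equation 5): BC loss plus the zero-reward penalty on demonstration transitions
# and on transitions from rollouts of the current imitation policy, e.g.
#   loss = bc_loss(q_net, demo_states, demo_actions) \
#          + lam * (squared_soft_bellman_error(q_net, *demo_transitions, reward=0.0)
#                   + squared_soft_bellman_error(q_net, *samp_transitions, reward=0.0))
```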

3.3 Connection to Off-Policy Reinforcement Learning

The squared soft Bellman error term in Equation 5 strongly resembles a soft Q-learning objective haarnoja2017reinforcement , hinting at an alternative interpretation of RBC. This section shows that the RBC objective is similar to that of a soft Q-learning algorithm that gives the agent a constant positive reward for reaching a demonstrated state and matching the demonstrated action in that state, and zero reward otherwise.

Consider the following modification to the RBC objective.

(7)

The additional terms encourage the learned soft Q values to be higher and make the imitation policy less stochastic, which can improve performance and reduce the number of hyperparameters that need to be tuned (see the ablation experiments in Section 5.4). More importantly, they lead to the following result (derived in Section A.1 of the appendix).

$\nabla_\theta\left(\text{Equation 7}\right) \propto \nabla_\theta\left[\delta^2(\mathcal{D}_{\text{demo}}, r) + \lambda_{\text{samp}}\, \delta^2(\mathcal{D}_{\text{samp}}, 0)\right]$  (8)

where $r$ and $\lambda_{\text{samp}}$ are constant hyperparameters. Equation 8 is the gradient of a soft Q-learning algorithm that gives the agent a constant reward of $r$ for taking the demonstrated action in a demonstrated state, assigns a reward of zero to all non-demonstration experiences, and uses $\lambda_{\text{samp}}$ to balance the number of demonstration experiences and non-demonstration experiences sampled for each step of stochastic gradient descent. We call this algorithm soft Q imitation learning (SQIL).

SQIL vs. RBC. The main benefit of using SQIL to optimize Equation 7 instead of using RBC to optimize Equation 5 is that SQIL is trivial to implement, since it only requires a few small changes to existing deep Q-learning code (see Section 4). Extending SQIL to MDPs with a continuous action space is also easy, since we can simply replace soft Q-learning with the soft actor-critic algorithm haarnoja2018soft (see Section 5.3). Given the difficulty of implementing deep RL algorithms correctly henderson2018deep , this flexibility makes SQIL more practical to use, since it can be built on top of existing code bases. Furthermore, the ablation study in Section 5.4 suggests that SQIL actually performs better than RBC.

4 Soft Q Imitation Learning

1:  Require $\mathcal{D}_{\text{demo}}$, $\lambda_{\text{samp}}$, $r$
2:  Initialize $Q_\theta$, $\mathcal{D}_{\text{samp}} \leftarrow \emptyset$
3:  for each gradient step do
4:     $\theta \leftarrow \theta - \eta \nabla_\theta\left[\delta^2(\mathcal{D}_{\text{demo}}, r) + \lambda_{\text{samp}}\, \delta^2(\mathcal{D}_{\text{samp}}, 0)\right]$ {See Equation 6}
5:     if it is time to sample a new rollout then
6:        Sample rollout $\tau$ with imitation policy $\pi_\theta$ {See Equation 1}
7:        $\mathcal{D}_{\text{samp}} \leftarrow \mathcal{D}_{\text{samp}} \cup \tau$
8:     end if
9:  end for
Algorithm 1 Soft Q Imitation Learning (SQIL)

SQIL is summarized in Algorithm 1. It performs soft Q-learning with three small, but important, modifications: (1) it initially fills the agent’s experience replay buffer with demonstrations, where the rewards are set to some positive constant (e.g., $r = +1$), (2) as the agent interacts with the world and accumulates new experiences, it adds them to the replay buffer, and sets the rewards for these additional experiences to zero, and (3) it balances the number of demonstration experiences and new experiences in each sample from the replay buffer. Section A.3 in the appendix contains additional implementation details.
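For concreteness, here is a hedged sketch (ours, not the authors' released code) of what one SQIL gradient step looks like when grafted onto generic deep Q-learning code; demo_buffer, agent_buffer, and q_net are assumed objects, and squared_soft_bellman_error is the helper sketched in Section 3.2.

```python
import torch

def sqil_update(q_net, optimizer, demo_buffer, agent_buffer,
                batch_size=32, lambda_samp=1.0, r=1.0, gamma=0.99):
    """One SQIL gradient step: soft Q-learning with constant reward r on
    demonstration transitions and reward 0 on the agent's own transitions,
    sampled in equal numbers from the two buffers (Algorithm 1, line 4)."""
    demo = demo_buffer.sample(batch_size)    # (states, actions, next_states, dones)
    samp = agent_buffer.sample(batch_size)   # same format, collected by the imitation policy
    loss = (squared_soft_bellman_error(q_net, *demo, reward=r, gamma=gamma)
            + lambda_samp * squared_soft_bellman_error(q_net, *samp, reward=0.0, gamma=gamma))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Between updates, rollouts of the current Boltzmann policy (Equation 1) are appended to agent_buffer with reward zero, matching lines 5-7 of Algorithm 1; a practical implementation would also typically stop gradients through the bootstrap term or use a target network, which the sketch above omits for brevity.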

Crucially, since the agent can learn from off-policy data, the agent does not necessarily have to visit the demonstrated states in order to experience positive rewards. Instead, the agent replays the demonstrations that were initially added to its experience replay buffer. Thus, SQIL can be used in stochastic environments with continuous states, where the demonstration states may never actually be encountered by the agent.

Termination condition. As the imitation policy learns to behave more like the expert, a growing number of expert-like transitions get added to the replay buffer with an assigned reward of zero. This causes the effective reward for mimicking the expert to decay over time. Balancing the number of demonstration experiences and new experiences sampled from the replay buffer ensures that this effective reward does not decay to zero. In practice, we find that this reward decay does not degrade performance if SQIL is halted once the squared soft Bellman error objective converges to a minimum (e.g., see Figure 8 in the appendix).

Rewards at absorbing states. The value of the hyperparameter $r_a$, which controls the reward at absorbing states (see Section 2), affects the agent’s incentives. Setting $r_a$ to a constant much larger than $r$ would encourage the agent to terminate the episode quickly, which may be appropriate for certain tasks. Setting $r_a$ much lower than $r$ would encourage the agent to avoid terminating the episode. See discrimac for further discussion. We use the same fixed value of $r_a$ in all our experiments, which include some environments where terminating the episode is always undesirable (e.g., walking without falling down) and other environments where success requires terminating the episode (e.g., landing at a target), suggesting that SQIL is not sensitive to the choice of $r_a$.

5 Experimental Evaluation

Our experiments aim to compare SQIL to existing imitation learning methods on a variety of tasks with high-dimensional, continuous observations, such as images, and unknown dynamics. To that end, we benchmark SQIL against BC and GAIL on four image-based games – Car Racing, Pong, Breakout, and Space Invaders – and three low-dimensional tasks – Humanoid, HalfCheetah, and Lunar Lander. We also investigate which components of SQIL contribute most to its performance via an ablation study on the Lunar Lander game. Section A.3 in the appendix contains additional experimental details.

For all the image-based tasks, we implement a version of GAIL that uses deep Q-learning (GAIL-DQL) instead of TRPO as in the original GAIL paper ho2016generative , since Q-learning performs better than TRPO in these environments, and because this allows for a head-to-head comparison of SQIL and GAIL: both algorithms use the same underlying RL algorithm, but provide the agent with different rewards – SQIL provides constant rewards, while GAIL provides learned rewards. We use the standard GAIL-TRPO method as a baseline for all the low-dimensional tasks, since TRPO performs better than Q-learning in these environments.

The original GAIL method implicitly encodes prior knowledge – namely, that terminating an episode is either always desirable or always undesirable. As pointed out in discrimac , this makes comparisons to alternative methods unfair. We implement the unbiased version of GAIL proposed by discrimac , and use this in all of the experiments. Comparisons to the biased version with implicit termination knowledge are included in Section A.2 in the appendix.

5.1 Testing Generalization in Image-Based Car Racing

The goal of this experiment is to study not only how well each method can mimic the expert demonstrations, but also how well they can acquire policies that generalize to new states that are not seen in the demonstrations. To do so, we train the imitation agents in an environment with a different initial state distribution than that of the expert demonstrations, allowing us to systematically control the mismatch between the distribution of states in the demonstrations and the states actually encountered by the agent. We run experiments on the Car Racing game from the Box2D environments in OpenAI Gym (screenshot in Figure 1). To create the shifted initial state distribution, the car is rotated 90 degrees so that it begins perpendicular to the track, instead of parallel to the track as in the demonstrations. This intervention presents a significant generalization challenge to the imitation learner, since the expert demonstrations do not contain any examples of states where the car is perpendicular to the road, or even significantly off the road axis. The agent must learn to make a tight turn to get back on the road, then stabilize its orientation so that it is parallel to the road, and only then proceed forward to mimic the expert demonstrations.

Figure 1: Image-based Car Racing. Average reward on 100 episodes after training, under Domain Shift and No Shift conditions, for Random, BC (P’91), GAIL-DQL, SQIL (Ours), and Expert. Standard error on three random seeds.

The results in Figure 1 show that SQIL and BC perform equally well when there is no variation in the initial state. The task is easy enough that even BC achieves a high reward. SQIL performs much better than BC when starting from the shifted initial state distribution, showing that SQIL is capable of generalizing to a new initial state distribution, while BC is not. SQIL learns to make a tight turn that takes the car through the grass and back onto the road, then stabilizes the car’s orientation so that it is parallel to the track, and then proceeds forward like the expert does in the demonstrations. BC tends to drive straight ahead into the grass instead of turning back onto the road.

SQIL outperforms GAIL in both conditions. Since SQIL and GAIL both use deep Q-learning for RL in this experiment, the gap between them may be attributed to the difference in the reward functions they use to train the agent. SQIL benefits from providing a constant reward that does not require fitting a discriminator, while GAIL struggles to train a discriminator to provide learned rewards directly from images.

5.2 Image-Based Experiments on Atari

Figure 2: Image-based Atari. Smoothed with a rolling window of 100 episodes. Standard error on three random seeds. X-axis represents amount of interaction with the environment (not expert demonstrations).

The results in Figure 2 show that SQIL outperforms BC on Pong, Breakout, and Space Invaders – additional evidence that BC suffers from compounding errors, while SQIL does not. SQIL also outperforms GAIL on all three games, illustrating the difficulty of using GAIL to train an image-based discriminator, as in Section 5.1.

5.3 Instantiating SQIL for Continuous Control in Low-Dimensional MuJoCo

Figure 3: Low-dimensional MuJoCo. SQIL: best performance on 10 consecutive training episodes. BC, GAIL: results from OpenAI Baselines [4].

The experiments in the previous sections evaluate SQIL in MDPs with a discrete action space. This section illustrates how SQIL can be adapted to continuous actions. We instantiate SQIL using soft actor-critic (SAC) – an off-policy RL algorithm that can solve continuous control tasks haarnoja2018soft . In particular, SAC is modified in the following ways: (1) the agent’s experience replay buffer is initially filled with expert demonstrations, where rewards are set to a positive constant, (2) when taking gradient steps to fit the agent’s soft Q function, a balanced number of demonstration experiences and new experiences are sampled from the replay buffer, and (3) the agent observes rewards of zero during its interactions with the environment, instead of an extrinsic reward signal that specifies the desired task. This instantiation of SQIL is compared to GAIL on the Humanoid (17 DoF) and HalfCheetah (6 DoF) tasks from MuJoCo.
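As a rough sketch of modification (3) in the SAC setting (our illustration; policy and q_target are assumed interfaces rather than a specific library's API), the critic target is the usual SAC soft Bellman target with the environment reward replaced by a constant:

```python
import torch

def sqil_sac_critic_target(reward_const, next_states, policy, q_target,
                           alpha=0.2, gamma=0.99):
    """Critic target for SQIL on top of SAC: identical to the standard SAC target,
    except the reward is a constant (r > 0 for demonstration transitions, 0 for the
    agent's own transitions) rather than an extrinsic task reward."""
    with torch.no_grad():
        next_actions, next_log_probs = policy.sample(next_states)
        next_q = q_target(next_states, next_actions)
        soft_next_value = next_q - alpha * next_log_probs   # soft state value estimate
    return reward_const + gamma * soft_next_value
```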

The results in Figure 3 show that SQIL outperforms BC and performs comparably to GAIL on both tasks, demonstrating that SQIL can be successfully deployed on problems with continuous actions, and that SQIL can perform well even with a small number of demonstrations. This experiment also illustrates how SQIL can be run on top of SAC or any other off-policy value-based RL algorithm.

5.4 Ablation Study on Low-Dimensional Lunar Lander

We hypothesize that SQIL works well because it combines information about the expert’s policy from demonstrations with information about the environment dynamics from rollouts of the imitation policy periodically sampled during training. We also expect RBC to perform comparably to SQIL, since their objectives are similar. To test these hypotheses, we conduct an ablation study using the Lunar Lander game from the Box2D environments in OpenAI Gym (screenshot in Figure 4). As in Section 5.1, we control the mismatch between the distribution of states in the demonstrations and the states encountered by the agent by manipulating the initial state distribution. To create the shifted initial state distribution, the agent is placed in a starting position never visited in the demonstrations.

Figure 4: Low-dimensional Lunar Lander. Best success rate on 100 consecutive episodes during training, under Domain Shift and No Shift conditions, for Random, BC (P’91), GAIL-TRPO, SQIL (Ours), the RBC ablation, and Expert. Standard error on five random seeds. Performance bolded if within one standard error of the expert.

In the first variant of SQIL, $\lambda_{\text{samp}}$ is set to zero, to prevent SQIL from using additional samples drawn from the environment (see line 4 of Algorithm 1). This comparison tests if SQIL really needs to interact with the environment, or if it can rely solely on the demonstrations. In the second condition, $\gamma$ is set to zero to prevent SQIL from accessing information about state transitions (see Equation 6 and line 4 of Algorithm 1). This comparison tests if SQIL is actually extracting information about the dynamics from the samples, or if it can perform just as well with a naïve regularizer (setting $\gamma$ to zero effectively imposes a penalty on the L2-norm of the soft Q values instead of the squared soft Bellman error). In the third condition, a uniform random policy is used to sample additional rollouts, instead of the imitation policy (see line 6 of Algorithm 1). This comparison tests how important it is that the samples cover the states encountered by the agent during training. In the fourth condition, we use RBC to optimize Equation 5 instead of using SQIL to optimize Equation 7. This comparison tests the effect of the additional terms in Equation 7 vs. Equation 5.

The results in Figure 4 show that all methods perform well when there is no variation in the initial state. When the initial state is varied, SQIL performs significantly better than BC, GAIL, and the ablated variants of SQIL. This confirms our hypothesis that SQIL needs to sample from the environment using the imitation policy, and relies on information about the dynamics encoded in the samples.

Surprisingly, SQIL outperforms RBC by a large margin, suggesting that the additional terms in Equation 7 do in fact improve performance by encouraging the learned soft Q values to be higher and making the imitation policy less stochastic (discussed in Section 3.3). We expect that with additional tuning of the temperature hyperparameter in RBC, we could achieve the same effect as the additional terms, and RBC would perform the same as SQIL.

6 Related Work

Various approaches have been developed to address state distribution shift in BC, without relying on IRL or adversarial optimization. Hand-engineering a domain-specific loss function and carefully designing the demonstration collection process have enabled researchers to train effective imitation policies for self-driving cars bojarski2016end, autonomous drones giusti2016machine, and robotic manipulators zhang2017deep; rahmatizadeh2017vision; rahmatizadeh2016learning. DAgger-based methods query the expert for on-policy action labels ross2011reduction; laskey2016shiv. These approaches either require domain knowledge or the ability to query the expert, while SQIL requires neither.

Piot et al. (piot2014boosted) propose an imitation learning algorithm that optimizes a classification objective subject to a constraint on the Bellman error, akin to regularized BC. We build on this work by showing that regularized BC is similar to off-policy RL with constant rewards, and draw on this connection to implement our method on top of existing deep RL algorithms.

Concurrently with SQIL, another imitation learning algorithm that uses off-policy RL with constant rewards instead of a learned reward function was developed sasaki2018sample . We see our paper as contributing additional evidence to support this core idea, rather than proposing a competing method. First, SQIL is derived as an extension of BC, while the prior method is derived from an alternative formulation of the IRL objective, showing that two different theoretical approaches independently lead to using off-policy RL with constant rewards as an alternative to adversarial training – a sign that this idea may be a promising direction for future work. Second, SQIL is shown to outperform BC and GAIL in domains that were not evaluated in sasaki2018sample – in particular, tasks with image observations and significant shift in the state distribution between the demonstrations and the training environment. This suggests that the results of the low-dimensional MuJoCo experiments in sasaki2018sample , which show the prior method outperforms BC and GAIL, may extend to more complex tasks.

SQIL resembles the Deep Q-learning from Demonstrations (DQfD) hester2017deep and Normalized Actor-Critic (NAC) algorithms gao2018reinforcement , in that all three algorithms fill the agent’s experience replay buffer with demonstrations and include an imitation loss in the agent’s objective. The key difference between SQIL and these prior methods is that DQfD and NAC are RL algorithms that assume access to a reward signal, while SQIL is an imitation learning algorithm that does not require an extrinsic reward signal from the environment. Instead, SQIL automatically constructs a reward signal from the demonstrations.

The SQIL objective is similar to that of the inverse soft Q-learning (ISQL) algorithm reddy2018you . Their details and motivations are, however, significantly different. ISQL is an internal dynamics estimation algorithm, while SQIL is for imitation learning. ISQL also assumes that the demonstrations include observations of the expert’s reward signal, while SQIL does not.

7 Conclusions and Future Work

We contribute the SQIL algorithm: a general method for learning to imitate an expert given action demonstrations and access to the environment. Simulation experiments on tasks with high-dimensional, continuous observations and unknown dynamics show that our method outperforms both BC and GAIL, while being simple to implement on top of existing off-policy RL code.

SQIL might be used to recover not just the expert’s policy, but also their reward function; for example, by using a parameterized reward function to model rewards in the soft Bellman error terms, instead of using constant rewards. This could provide a simpler alternative to existing adversarial IRL algorithms fu2017learning .

8 Acknowledgements

Thanks to the reviewers and lab mates who provided us with substantial feedback on earlier versions of this paper; in particular, Ashvin Nair, Gregory Kahn, and an anonymous user on OpenReview. Thanks to Ridley Scott and Philip K. Dick for the 1982 film, Blade Runner. One of the core ideas behind SQIL – initially filling the agent’s experience replay buffer with demonstrations where rewards are set to a positive constant – was inspired by Deckard’s conversation with Dr. Eldon Tyrell about Rachael’s memory implants.

Tyrell: “If we gift [replicants] with a past, we create a cushion or a pillow for their emotions, and consequently, we can control them better.” Deckard: “Memories. You’re talking about memories.”

This work was supported in part by Berkeley DeepDrive, GPU donations from NVIDIA, NSF IIS-1700696, and AFOSR FA9550-17-1-0308.

References

Appendix A Appendix

A.1 Derivation of SQIL Gradient

Splitting up the sum of squared soft Bellman error terms for $\mathcal{D}_{\text{demo}}$ and $\mathcal{D}_{\text{samp}}$,

(9)

Setting $\gamma = 1$ turns the inner sum in the first term into a telescoping sum.

(10)

Since $s_T$ is assumed to be absorbing (see Section 3.1), $V_\theta(s_T)$ is a constant (see Section 2). Thus,

(11)

where $r$ and $\lambda_{\text{samp}}$ are constant hyperparameters.

A.2 Comparing the Biased and Unbiased Variants of GAIL

As discussed in Section 5, to correct the original GAIL method’s biased handling of rewards at absorbing states, we implement the suggested changes to GAIL in Section 4.2 of [17]: adding a transition to an absorbing state and a self-loop at the absorbing state to the end of each rollout sampled from the environment, and adding a binary feature to the observations indicating whether or not a state is absorbing. We refer to the original, biased GAIL method as GAIL-DQL-B and GAIL-TRPO-B, and the unbiased version as GAIL-DQL-U and GAIL-TRPO-U.

Car Racing. The results in Figure 5 show that both the biased (GAIL-DQL-B) and unbiased (GAIL-DQL-U) versions of GAIL perform equally poorly. The problem of training an image-based discriminator for this task may be difficult enough that even with an unfair bias toward avoiding crashes that terminate the episode, GAIL-DQL-B does not perform better than GAIL-DQL-U.

Atari. The results in Figure 6 show that SQIL outperforms both variants of GAIL on Pong and the unbiased version of GAIL (GAIL-DQL-U) on Breakout and Space Invaders, but performs comparably to the biased version of GAIL (GAIL-DQL-B) on Space Invaders and worse than it on Breakout. This may be due to the fact that in Breakout and Space Invaders, the agent has multiple lives – five in Breakout, and three in Space Invaders – and receives a signal that the episode has ended after losing each life. Thus, the agent experiences many more episode terminations than in Pong, exacerbating the bias in the way the original GAIL method handles rewards at absorbing states. Our implementation of GAIL-DQL-B in this experiment provides a learned reward of $-\log(1 - D(s, a))$, where $D$ is the discriminator (see Section A.3 in the appendix for details). The learned reward is always positive, while the implicit reward at an absorbing state is zero. Thus, the agent is inadvertently encouraged to avoid terminating the episode. For Breakout and Space Invaders, this just happens to be the right incentive, since the objective is to stay alive as long as possible. GAIL-DQL-B outperforms SQIL in Breakout and performs comparably to SQIL in Space Invaders because GAIL-DQL-B is accidentally biased in the right way.

Lunar Lander. The results in Figure 7 show that when the initial state is varied, SQIL outperforms the unbiased variant of GAIL (GAIL-TRPO-U), but underperforms against the biased version of GAIL (GAIL-TRPO-B). The latter result is likely due to the fact that the implementation of GAIL-TRPO-B we used in this experiment provides a learned reward of $\log D(s, a)$, where $D$ is the discriminator (see Section A.3 in the appendix for details). The learned reward is always negative, while the implicit reward at an absorbing state is zero. Thus, the agent is inadvertently encouraged to terminate the episode quickly. For the Lunar Lander game, this just happens to be the right incentive, since the objective is to land on the ground and thereby terminate the episode. As in the Atari experiments, GAIL-TRPO-B performs better than SQIL in this experiment because GAIL-TRPO-B is accidentally biased in the right way.

Figure 5: Image-based Car Racing. Average reward on 100 episodes after training, under Domain Shift and No Shift conditions, for Random, BC (P’91), GAIL-DQL-B, GAIL-DQL-U, SQIL (Ours), and Expert. Standard error on three random seeds.
Figure 6: Image-based Atari. Smoothed with a rolling window of 100 episodes. Standard error on three random seeds. X-axis represents amount of interaction with the environment (not expert demonstrations).
Figure 7: Low-dimensional Lunar Lander. Best success rate on 100 consecutive episodes during training, under Domain Shift and No Shift conditions, for Random, BC (P’91), GAIL-TRPO-B (HE’16), GAIL-TRPO-U, SQIL (Ours), the RBC ablation, and Expert. Standard error on five random seeds. Performance bolded if within one standard error of the expert.

A.3 Implementation Details

To ensure fair comparisons, the same network architectures were used to evaluate SQIL, GAIL, and BC. For Lunar Lander, we used a network architecture with two fully-connected layers containing 128 hidden units each to represent the Q network in SQIL, the policy and discriminator networks in GAIL, and the policy network in BC. For Car Racing, we used four convolutional layers (following [10]) and two fully-connected layers containing 256 hidden units each. For Humanoid and HalfCheetah, we used two fully-connected layers containing 256 hidden units each. For Atari, we used the convolutional neural network described in [22] to represent the Q network in SQIL, as well as the Q network and discriminator network in GAIL.

To ensure fair comparisons, the same demonstration data were used to train SQIL, GAIL, and BC. For Lunar Lander, we collected 100 demonstration rollouts. For Car Racing, Pong, Breakout, and Space Invaders, we collected 20 demonstration rollouts. Expert demonstrations were generated from scratch for Lunar Lander using DQN [22], and collected from open-source pre-trained policies for Car Racing [10] as well as Humanoid and HalfCheetah [4]. The Humanoid demonstrations were generated by a stochastic expert policy, while the HalfCheetah demonstrations were generated by a deterministic expert policy; both experts were trained using TRPO (https://drive.google.com/drive/folders/1h3H4AY_ZBx08hz-Ct0Nxxus-V1melu1U). We used two open-source implementations of GAIL: [6] for Lunar Lander, and [4] for MuJoCo. We adapted the OpenAI Baselines implementation of GAIL to use soft Q-learning for Car Racing and Atari. Expert demonstrations were generated from scratch for Atari using DQN.

For Lunar Lander, we set , , and . For Car Racing, we set , , and . For Humanoid, we set and . For HalfCheetah, we set and . For Atari, we set , , and .

SQIL was not pre-trained in any of the experiments. GAIL was pre-trained using BC for HalfCheetah, but was not pre-trained in any other experiments.

In standard implementations of soft Q-learning and SAC, the agent’s experience replay buffer typically has a fixed size, and once the buffer is full, old experiences are deleted to make room for new experiences. In SQIL, we never delete demonstration experiences from the replay buffer, but otherwise follow the standard implementation.

We use Adam [16] to take the gradient step in line 4 of Algorithm 1.

The BC and GAIL performance metrics in Section 5.3 are taken from [4] (https://github.com/openai/baselines/blob/master/baselines/gail/result/gail-result.md).

The GAIL and SQIL policies in Section 5.3 are set to be deterministic during the evaluation rollouts used to measure performance.

Figure 8: Standard error over two random seeds. No smoothing across training steps.