1 Introduction
This paper considers the problem of training an agent to imitate an expert policy, given expert action demonstrations and access to the environment. The agent does not get to observe a reward signal or query the expert, and does not know the state transition dynamics.
Standard approaches to this problem based on behavioral cloning (BC) seek to imitate the expert’s actions, but do not reason about the consequences of actions pomerleau1991efficient. As a result, they suffer from state distribution shift, and fail to generalize to states that are very different from those seen in the demonstrations ross2010efficient ; ross2011reduction. Approaches based on inverse reinforcement learning (IRL) deal with this issue by fitting a reward function that represents preferences over trajectories rather than individual actions ng2000algorithms ; ziebart2008maximum, and using the learned reward function to train the imitation agent through RL wulfmeier2015maximum ; finn2016guided ; fu2017learning. This is the core idea behind generative adversarial imitation learning (GAIL), which implicitly combines IRL and RL using generative adversarial networks (GANs) ho2016generative ; goodfellow2014generative. GAIL is the state of the art, but tends to require additional reward augmentation and feature engineering when applied to environments with high-dimensional image observations li2017infogail, and inherits the difficulty of training GANs kurach2018gan.

The main idea in this paper is that instead of resorting to adversarial imitation learning to overcome state distribution shift, we can modify BC so that it generalizes to out-of-distribution states while remaining simple and easy to implement. We propose combining the standard maximum-likelihood objective of BC, which encourages the agent to imitate the expert in demonstrated states, with a regularization term that gives the agent an incentive to take actions that lead it back to demonstrated states when it encounters new states.
Our intuition for the regularizer is that it should incorporate information about the dynamics of the environment into the objective, so that the agent can learn to get back to demonstrated states. We derive such a regularizer from the maximum entropy model of expert behavior. In this model, the logits of the imitation policy can be interpreted as soft Q values, and BC corresponds to maximum-likelihood estimation of the soft Q function given the demonstrations. The problem with BC is that the learned soft Q function may output arbitrary values in states that are out-of-distribution with respect to the demonstrations. To overcome this issue, we regularize the soft Q function by imposing a penalty on the squared soft Bellman error, which is approximated using transitions from rollouts of the imitation policy periodically sampled during training. Since we do not have access to a reward signal, we set all rewards to zero in the samples, which encourages the agent to get back to the demonstrated states. We refer to this algorithm as regularized behavioral cloning (RBC).

To better understand the effect of the proposed regularizer, we show that RBC is similar to an off-policy reinforcement learning (RL) algorithm that rewards the agent for reaching demonstrated states and matching demonstrated actions in those states. The off-policy RL algorithm is a variant of soft Q-learning that initializes the agent’s experience replay buffer with demonstrations, sets rewards to a positive constant in the stored demonstration experiences, and sets rewards to zero in all additional experiences. We call this method soft Q imitation learning (SQIL). The connection between RBC and SQIL enables us to interpret the effect of the regularizer on the agent’s incentives. It also makes our method easier to implement, since SQIL only requires a few small changes to existing off-policy RL code.
The main contribution of this paper is SQIL: a simple and general imitation learning algorithm that is effective in MDPs with high-dimensional, continuous observations and unknown dynamics. We run experiments in four image-based environments – Car Racing, Pong, Breakout, and Space Invaders – and three low-dimensional environments – Humanoid, HalfCheetah, and Lunar Lander – from OpenAI Gym 1606.01540, Arcade Learning Environment bellemare2013arcade, and MuJoCo todorov2012mujoco, to compare SQIL to two prior methods: BC and GAIL. We find that SQIL outperforms both prior methods, especially on the image-based tasks. Our experiments illustrate two key benefits of SQIL: (1) that it can overcome the state distribution shift problem of BC without adversarial training or learning a reward function, which makes it easier to use with images, and (2) that it is simple to implement using existing off-policy value-based RL algorithms.
2 Preliminaries
This work builds on the maximum causal entropy (MaxCausalEnt) model of expert behavior ziebart2010modelingint ; levine2018reinforcement. In an infinite-horizon Markov Decision Process (MDP) with a continuous state space $\mathcal{S}$ and discrete action space $\mathcal{A}$, the demonstrator is assumed to follow a policy $\pi$ that maximizes reward $R(s, a)$. The policy $\pi$ forms a Boltzmann distribution over actions,
$$\pi(a \mid s) \triangleq \exp\left(Q(s, a) - V(s)\right), \tag{1}$$
where $Q$ is the soft Q function, and $V$ is the soft value function,
$$V(s) \triangleq \log \sum_{a \in \mathcal{A}} \exp\left(Q(s, a)\right). \tag{2}$$
If $s'$ is an absorbing state, we assume that $V(s')$ is a constant determined by the discount factor $\gamma$ and by a constant hyperparameter that represents the reward for remaining in an absorbing state for one timestep. The soft Q values are a deterministic function of the rewards and dynamics, given by the soft Bellman equation,
$$Q(s, a) \triangleq R(s, a) + \gamma\, \mathbb{E}_{s'}\left[V(s')\right]. \tag{3}$$
Footnote 1: Assuming a discrete action space simplifies our analysis. SQIL can be applied to continuous control tasks using existing sampling methods haarnoja2017reinforcement ; haarnoja2018soft, as illustrated in Section 5.3.
Footnote 2: Section 4 discusses how the value of the absorbing-state reward hyperparameter is chosen.
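The quantities in this section can be sketched numerically for a discrete action space. The snippet below is only an illustration: it assumes the standard maximum-entropy definitions (soft value as a log-sum-exp of the soft Q values, policy as the exponentiated difference between the two), and the function names, the example Q values, and the `absorbing_value` argument are hypothetical rather than taken from the paper.

```python
import math

def soft_value(q_values):
    """Soft value V(s) = log sum_a exp(Q(s, a)), using the max trick for stability."""
    m = max(q_values)
    return m + math.log(sum(math.exp(q - m) for q in q_values))

def boltzmann_policy(q_values):
    """Action probabilities pi(a | s) = exp(Q(s, a) - V(s))."""
    v = soft_value(q_values)
    return [math.exp(q - v) for q in q_values]

def soft_bellman_target(reward, gamma, next_q_values,
                        next_is_absorbing=False, absorbing_value=0.0):
    """Single-sample estimate of R(s, a) + gamma * V(s').

    absorbing_value is an assumed placeholder for the constant value
    assigned to absorbing states.
    """
    next_v = absorbing_value if next_is_absorbing else soft_value(next_q_values)
    return reward + gamma * next_v

q = [1.0, 2.0, 0.5]  # hypothetical soft Q values for three actions
pi = boltzmann_policy(q)
target = soft_bellman_target(reward=0.0, gamma=0.99, next_q_values=q)
```

Because the policy is a softmax over the soft Q values, adding a constant to every Q value in a state changes the soft value but leaves the action probabilities unchanged.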
3 Imitation Learning in the Maximum Entropy Model
We aim for an imitation learning algorithm that generalizes to new states, without resorting to complex adversarial optimization procedures. We build on BC, which is a simple approach that seeks to imitate the expert’s actions using supervised learning. BC does not reason about the consequences of actions, so when the agent makes small mistakes and enters states that are slightly different from those in the demonstrations, the distribution mismatch between the states in the demonstrations and those actually encountered by the agent leads to compounding errors ross2011reduction. Our solution is to add a regularization term to BC that enables it to overcome state distribution shift while remaining simple and easy to implement.

We derive the regularization term from the maximum entropy model of expert behavior, resulting in a method that infers the expert’s soft Q function by maximizing the likelihood of observed actions and minimizing the squared soft Bellman error. We show that this regularized BC method is similar to an off-policy RL algorithm that rewards the agent for reaching demonstrated states and matching demonstrated actions in those states. This connection enables us to interpret the effect of the regularizer on the agent’s incentives, and makes it possible to implement our method by applying a few simple modifications to any off-policy value-based RL algorithm.
3.1 Behavioral Cloning
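Under the generative model in Section 2, BC corresponds to maximum-likelihood estimation of the soft Q function given the demonstrations. Let $\tau \triangleq (s_0, a_0, s_1, \ldots, s_T)$ denote a rollout, where $s_T$ is an absorbing state, and let $\mathcal{D}_{\text{demo}}$ denote the set of demonstration rollouts. We define BC as fitting a parameterized soft Q function $Q_\theta$ to minimize the loss,
$$\ell_{\text{BC}}(\theta) \triangleq \sum_{\tau \in \mathcal{D}_{\text{demo}}} \sum_{t=0}^{T-1} -\log \pi_\theta(a_t \mid s_t) = \sum_{\tau \in \mathcal{D}_{\text{demo}}} \sum_{t=0}^{T-1} -\left(Q_\theta(s_t, a_t) - V_\theta(s_t)\right), \tag{4}$$
where the equality follows from Equation 1, and $V_\theta$ denotes the soft value function given by $Q_\theta$ and Equation 2. The experiments in Section 5 use a convolutional neural network or multi-layer perceptron to model $Q_\theta$, where $\theta$ are the weights of the neural network.

3.2 Regularized Behavioral Cloning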
The issue with BC is that when the agent encounters states that are out-of-distribution with respect to the demonstrations, the learned soft Q function may output arbitrary values. One solution is to add a regularization term to the BC objective that encourages the soft Q function to output reasonable default values when it encounters such states. For example, a penalty on the L2-norm of the soft Q values would encourage the soft Q values in out-of-distribution states to be zero instead of arbitrary, leading to a uniform random policy in those states. Instead of inducing a uniform policy, we would like to endow the agent with the ability to get back to demonstrated states when it encounters new states. To do so, we need to incorporate information about the dynamics of the environment into the regularization term.
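As a small illustration of the two points above, the sketch below evaluates the BC loss as a negative log-likelihood on a tabular soft Q function and checks that all-zero soft Q values produce a uniform policy. The table, the state names, and the helper names are hypothetical; the experiments in the paper use a neural network, not a table.

```python
import math

def log_softmax(q_values):
    """Log action probabilities Q(s, a) - V(s), with V(s) the log-sum-exp of Q."""
    m = max(q_values)
    v = m + math.log(sum(math.exp(q - m) for q in q_values))
    return [q - v for q in q_values]

def bc_loss(q_table, demos):
    """Negative log-likelihood of the demonstrated actions."""
    return sum(-log_softmax(q_table[s])[a] for s, a in demos)

# Hypothetical tabular soft Q function: state name -> Q value per action.
q_table = {"s0": [2.0, 0.0], "s1": [0.0, 3.0], "unseen": [0.0, 0.0]}
demos = [("s0", 0), ("s1", 1)]  # (state, demonstrated action index)

loss = bc_loss(q_table, demos)

# A state never touched by the BC loss; if an L2 penalty drives its
# Q values to zero, the induced policy is uniform over actions.
uniform = [math.exp(lp) for lp in log_softmax(q_table["unseen"])]

# Raising the Q value of a demonstrated action lowers the BC loss.
better_table = dict(q_table, s0=[4.0, 0.0])
better_loss = bc_loss(better_table, demos)
```

Nothing in this loss refers to the "unseen" state, which is why BC alone leaves the soft Q values there unconstrained.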
The model in Section 2 suggests a natural choice of regularizer that achieves this goal: a penalty on the squared soft Bellman error, i.e., the squared difference between the left-hand and right-hand sides of Equation 3. Since we do not know the expert’s reward function, we set all rewards to zero. Additionally, since the state space cannot be enumerated and the dynamics are unknown, we approximate the penalty by evaluating it on transitions observed in the demonstrations as well as additional rollouts periodically sampled during training using the imitation policy. Sampling ensures that the penalty covers the state distribution actually encountered by the agent, instead of only the demonstrations.
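Formally, we define the regularized BC objective as follows.
$$\ell_{\text{RBC}}(\theta) \triangleq \ell_{\text{BC}}(\theta) + \lambda\, \delta^2\left(\mathcal{D}_{\text{demo}} \cup \mathcal{D}_{\text{samp}},\, 0\right), \tag{5}$$
where $\lambda$ is a constant hyperparameter, $\mathcal{D}_{\text{samp}}$ denotes the transitions from the sampled rollouts, and $\delta^2$ denotes the sum of squared soft Bellman errors,
$$\delta^2(\mathcal{D}, r) \triangleq \sum_{(s, a, s') \in \mathcal{D}} \left(Q_\theta(s, a) - \left(r + \gamma V_\theta(s')\right)\right)^2. \tag{6}$$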
The BC objective encourages the learned soft Q function to output high values for demonstrated actions at demonstrated states, and the penalty term propagates those high values to nearby states. In other words, the soft Q function outputs high values for actions that lead to states from which the demonstrated states are reachable, so when the agent finds itself far from the demonstrated states, it takes actions that lead it back to the demonstrated states.
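The regularized objective described in this section can be sketched on a tabular soft Q function as the BC loss plus a weighted sum of squared soft Bellman errors, with every reward set to zero. This is a hedged illustration, not the authors' implementation: the constants `GAMMA` and `LAMBDA`, the absorbing state name `"end"`, its value of zero, and all transitions are assumptions made for the example.

```python
import math

GAMMA = 0.9            # discount factor (illustrative value)
LAMBDA = 1.0           # regularization weight (illustrative value)
ABSORBING = {"end"}    # hypothetical absorbing state
ABSORBING_VALUE = 0.0  # assumed soft value of the absorbing state

def soft_value(qs):
    m = max(qs)
    return m + math.log(sum(math.exp(q - m) for q in qs))

def bc_loss(q, demo_transitions):
    """Sum of -(Q(s, a) - V(s)) over demonstrated (s, a, s') transitions."""
    return sum(-(q[s][a] - soft_value(q[s])) for s, a, _ in demo_transitions)

def squared_soft_bellman_error(q, transitions, reward=0.0):
    """Sum of (Q(s, a) - (reward + gamma * V(s')))^2 over the transitions."""
    total = 0.0
    for s, a, s_next in transitions:
        v_next = ABSORBING_VALUE if s_next in ABSORBING else soft_value(q[s_next])
        total += (q[s][a] - (reward + GAMMA * v_next)) ** 2
    return total

def rbc_loss(q, demo_transitions, sampled_transitions):
    """BC loss plus the penalty on demonstrations and sampled rollouts."""
    penalty = squared_soft_bellman_error(q, demo_transitions + sampled_transitions)
    return bc_loss(q, demo_transitions) + LAMBDA * penalty

q = {"a": [1.0, 0.0], "b": [0.5, 0.0], "c": [0.0, 0.0]}
demos = [("a", 0, "b"), ("b", 0, "end")]   # expert transitions
sampled = [("c", 1, "a")]                  # transition from the agent's own rollout
total = rbc_loss(q, demos, sampled)
```

The sampled transition from state "c" is what ties the soft Q values of an undemonstrated state to the value of the demonstrated state "a" that it leads to.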
3.3 Connection to Off-Policy Reinforcement Learning
The squared soft Bellman error term in Equation 5 strongly resembles a soft Q-learning objective haarnoja2017reinforcement, hinting at an alternative interpretation of RBC. This section shows that the RBC objective is similar to that of a soft Q-learning algorithm that gives the agent a constant positive reward for reaching a demonstrated state and matching the demonstrated action in that state, and zero reward otherwise.
Consider the following modification to the RBC objective.
(7) 
The additional terms encourage the learned soft Q values to be higher and make the imitation policy less stochastic, which can improve performance and reduce the number of hyperparameters that need to be tuned (see the ablation experiments in Section 5.4). More importantly, they lead to the following result (derived in Section A.1 of the appendix).
(8) 
where the coefficients in Equation 8 are constant hyperparameters. Equation 8 is the gradient of a soft Q-learning algorithm that gives the agent a constant positive reward for taking the demonstrated action in a demonstrated state, assigns a reward of zero to all non-demonstration experiences, and balances the number of demonstration experiences and non-demonstration experiences sampled for each step of stochastic gradient descent. We call this algorithm soft Q imitation learning (SQIL).

SQIL vs. RBC. The main benefit of using SQIL to optimize Equation 7 instead of using RBC to optimize Equation 5 is that SQIL is trivial to implement, since it only requires a few small changes to existing deep Q-learning code (see Section 4). Extending SQIL to MDPs with a continuous action space is also easy, since we can simply replace soft Q-learning with the soft actor-critic algorithm haarnoja2018soft (see Section 5.3). Given the difficulty of implementing deep RL algorithms correctly henderson2018deep, this flexibility makes SQIL more practical to use, since it can be built on top of existing code bases. Furthermore, the ablation study in Section 5.4 suggests that SQIL actually performs better than RBC.
4 Soft Q Imitation Learning
SQIL is summarized in Algorithm 1. It performs soft Q-learning with three small, but important, modifications: (1) it initially fills the agent’s experience replay buffer with demonstrations, where the rewards are set to some positive constant, (2) as the agent interacts with the world and accumulates new experiences, it adds them to the replay buffer, and sets the rewards for these additional experiences to zero, and (3) it balances the number of demonstration experiences and new experiences in each sample from the replay buffer. Section A.3 in the appendix contains additional implementation details.
Crucially, since the agent can learn from off-policy data, the agent does not necessarily have to visit the demonstrated states in order to experience positive rewards. Instead, the agent replays the demonstrations that were initially added to its experience replay buffer. Thus, SQIL can be used in stochastic environments with continuous states, where the demonstration states may never actually be encountered by the agent.
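The three modifications can be sketched as a small replay buffer. This is a minimal illustration under stated assumptions rather than the authors' code: the class name `SQILReplayBuffer`, the transition layout, the demonstration reward of 1.0, and sampling with replacement are all choices made for the example.

```python
import random

class SQILReplayBuffer:
    """Replay buffer holding demonstrations and new experiences separately."""

    def __init__(self, demo_transitions, demo_reward=1.0):
        # (1) Demonstrations are stored up front with a constant positive
        # reward; demo_reward=1.0 is an illustrative choice.
        self.demo = [(s, a, demo_reward, s2, done)
                     for s, a, s2, done in demo_transitions]
        self.new = []

    def add(self, s, a, s2, done):
        # (2) New experiences always get a reward of zero; no environment
        # reward is ever stored.
        self.new.append((s, a, 0.0, s2, done))

    def sample(self, batch_size, rng=random):
        # (3) Half of each batch comes from the demonstrations and half from
        # the new experiences (sampled with replacement).
        half = batch_size // 2
        batch = rng.choices(self.demo, k=half)
        if self.new:
            batch += rng.choices(self.new, k=half)
        return batch

buf = SQILReplayBuffer([(0, 1, 1, False), (1, 0, 2, True)])
buf.add(5, 0, 6, False)
batch = buf.sample(8, random.Random(0))
```

Keeping the two pools separate makes the balancing step independent of how many new experiences have accumulated.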
Termination condition. As the imitation policy learns to behave more like the expert, a growing number of expert-like transitions get added to the replay buffer with an assigned reward of zero. This causes the effective reward for mimicking the expert to decay over time. Balancing the number of demonstration experiences and new experiences sampled from the replay buffer ensures that this effective reward remains bounded below by a positive constant instead of decaying to zero. In practice, we find that this reward decay does not degrade performance if SQIL is halted once the squared soft Bellman error objective converges to a minimum (e.g., see Figure 8 in the appendix).
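The reward decay can be illustrated with a simplified model that is not part of the paper's analysis. Suppose a fraction of the new experiences is indistinguishable from expert behavior but carries reward zero; the effective reward is then the demonstration reward times the share of expert-like samples that come from the demonstrations. The function name, its arguments, and the demonstration reward of 1.0 are hypothetical.

```python
def effective_reward(n_demo, n_new, expert_like_frac,
                     demo_reward=1.0, balanced=True):
    """Average reward attached to an expert-like transition in a batch.

    balanced=True  -> half the batch is drawn from each pool.
    balanced=False -> the batch is drawn uniformly from the merged pools.
    """
    if balanced:
        demo_mass = 0.5
        new_mass = 0.5 * expert_like_frac
    else:
        total = n_demo + n_new
        demo_mass = n_demo / total
        new_mass = expert_like_frac * n_new / total
    return demo_reward * demo_mass / (demo_mass + new_mass)

# Worst case: every new transition looks like the expert's.
with_balancing = effective_reward(100, 10**6, 1.0, balanced=True)
without_balancing = effective_reward(100, 10**6, 1.0, balanced=False)
```

In this model, balanced sampling keeps the effective reward at no less than half the demonstration reward, while uniform sampling lets it shrink as the buffer grows.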
Rewards at absorbing states. The hyperparameter that controls the reward at absorbing states (see Section 2) affects the agent’s incentives. Setting it to a sufficiently large constant would encourage the agent to terminate the episode quickly, which may be appropriate for certain tasks. Setting it to a sufficiently low value would encourage the agent to avoid terminating the episode. See discrimac for further discussion. We use the same value in all our experiments, which include some environments where terminating the episode is always undesirable (e.g., walking without falling down) and other environments where success requires terminating the episode (e.g., landing at a target), suggesting that SQIL is not sensitive to the choice of this hyperparameter.
5 Experimental Evaluation
Our experiments aim to compare SQIL to existing imitation learning methods on a variety of tasks with high-dimensional, continuous observations, such as images, and unknown dynamics. To that end, we benchmark SQIL against BC and GAIL on four image-based games – Car Racing, Pong, Breakout, and Space Invaders – and three low-dimensional tasks – Humanoid, HalfCheetah, and Lunar Lander. We also investigate which components of SQIL contribute most to its performance via an ablation study on the Lunar Lander game. Section A.3 in the appendix contains additional experimental details.
For all the image-based tasks, we implement a version of GAIL that uses deep Q-learning (GAIL-DQL) instead of TRPO as in the original GAIL paper ho2016generative, since Q-learning performs better than TRPO in these environments, and because this allows for a head-to-head comparison of SQIL and GAIL: both algorithms use the same underlying RL algorithm, but provide the agent with different rewards – SQIL provides constant rewards, while GAIL provides learned rewards. We use the standard GAIL-TRPO method as a baseline for all the low-dimensional tasks, since TRPO performs better than Q-learning in these environments.
The original GAIL method implicitly encodes prior knowledge – namely, that terminating an episode is either always desirable or always undesirable. As pointed out in discrimac, this makes comparisons to alternative methods unfair. We implement the unbiased version of GAIL proposed by discrimac, and use this in all of the experiments. Comparisons to the biased version with implicit termination knowledge are included in Section A.2 in the appendix.
5.1 Testing Generalization in Image-Based Car Racing
The goal of this experiment is to study not only how well each method can mimic the expert demonstrations, but also how well they can acquire policies that generalize to new states that are not seen in the demonstrations. To do so, we train the imitation agents in an environment with a different initial state distribution than that of the expert demonstrations, allowing us to systematically control the mismatch between the distribution of states in the demonstrations and the states actually encountered by the agent. We run experiments on the Car Racing game from the Box2D environments in OpenAI Gym (screenshot in Figure 1). To create the shifted initial state distribution, the car is rotated 90 degrees so that it begins perpendicular to the track, instead of parallel to the track as in the demonstrations. This intervention presents a significant generalization challenge to the imitation learner, since the expert demonstrations do not contain any examples of states where the car is perpendicular to the road, or even significantly off the road axis. The agent must learn to make a tight turn to get back on the road, then stabilize its orientation so that it is parallel to the road, and only then proceed forward to mimic the expert demonstrations.
Figure 1: Image-based Car Racing. Average reward on 100 episodes after training. Standard error on three random seeds. Columns: Domain Shift, No Shift. Rows: Random, BC (P’91), GAIL-DQL, SQIL (Ours), Expert.
The results in Figure 1 show that SQIL and BC perform equally well when there is no variation in the initial state. The task is easy enough that even BC achieves a high reward. SQIL performs much better than BC when starting from the shifted initial state distribution, showing that SQIL is capable of generalizing to a new initial state distribution, while BC is not. SQIL learns to make a tight turn that takes the car through the grass and back onto the road, then stabilizes the car’s orientation so that it is parallel to the track, and then proceeds forward like the expert does in the demonstrations. BC tends to drive straight ahead into the grass instead of turning back onto the road.
SQIL outperforms GAIL in both conditions. Since SQIL and GAIL both use deep Q-learning for RL in this experiment, the gap between them may be attributed to the difference in the reward functions they use to train the agent. SQIL benefits from providing a constant reward that does not require fitting a discriminator, while GAIL struggles to train a discriminator to provide learned rewards directly from images.
5.2 Image-Based Experiments on Atari
The results in Figure 2 show that SQIL outperforms BC on Pong, Breakout, and Space Invaders – additional evidence that BC suffers from compounding errors, while SQIL does not. SQIL also outperforms GAIL on all three games, illustrating the difficulty of using GAIL to train an image-based discriminator, as in Section 5.1.
5.3 Instantiating SQIL for Continuous Control in Low-Dimensional MuJoCo
The experiments in the previous sections evaluate SQIL in MDPs with a discrete action space. This section illustrates how SQIL can be adapted to continuous actions. We instantiate SQIL using soft actor-critic (SAC) – an off-policy RL algorithm that can solve continuous control tasks haarnoja2018soft. In particular, SAC is modified in the following ways: (1) the agent’s experience replay buffer is initially filled with expert demonstrations, where rewards are set to a positive constant, (2) when taking gradient steps to fit the agent’s soft Q function, a balanced number of demonstration experiences and new experiences are sampled from the replay buffer, and (3) the agent observes rewards of zero during its interactions with the environment, instead of an extrinsic reward signal that specifies the desired task. This instantiation of SQIL is compared to GAIL on the Humanoid (17 DoF) and HalfCheetah (6 DoF) tasks from MuJoCo.
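Modification (3) can be sketched as a thin wrapper that hides the task reward from the learner. The wrapper and the toy environment below are hypothetical, and the four-value `step` return is an assumed interface chosen for illustration, not a claim about any particular simulator's API.

```python
class ZeroRewardEnv:
    """Wraps an environment so the agent only ever observes a reward of zero."""

    def __init__(self, env):
        self.env = env

    def reset(self):
        return self.env.reset()

    def step(self, action):
        # Discard the extrinsic task reward; the imitation agent must not see it.
        obs, _task_reward, done, info = self.env.step(action)
        return obs, 0.0, done, info


class ToyEnv:
    """Hypothetical stand-in for a continuous-control simulator."""

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        # Observation, task reward, done flag, info dict.
        return float(self.t), 10.0 * action, self.t >= 3, {}


env = ZeroRewardEnv(ToyEnv())
first_obs = env.reset()
obs, reward, done, info = env.step(1.0)
```

With the reward removed at the environment boundary, an existing off-policy learner can be reused without changing its update rule.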
The results in Figure 3 show that SQIL outperforms BC and performs comparably to GAIL on both tasks, demonstrating that SQIL can be successfully deployed on problems with continuous actions, and that SQIL can perform well even with a small number of demonstrations. This experiment also illustrates how SQIL can be run on top of SAC or any other off-policy value-based RL algorithm.
5.4 Ablation Study on Low-Dimensional Lunar Lander
We hypothesize that SQIL works well because it combines information about the expert’s policy from demonstrations with information about the environment dynamics from rollouts of the imitation policy periodically sampled during training. We also expect RBC to perform comparably to SQIL, since their objectives are similar. To test these hypotheses, we conduct an ablation study using the Lunar Lander game from the Box2D environments in OpenAI Gym (screenshot in Figure 4). As in Section 5.1, we control the mismatch between the distribution of states in the demonstrations and the states encountered by the agent by manipulating the initial state distribution. To create the shifted initial state distribution, the agent is placed in a starting position never visited in the demonstrations.
Figure 4: Lunar Lander results. Columns: Domain Shift, No Shift. Rows: Random, BC (P’91), GAIL-TRPO, SQIL (Ours), Ablation (including RBC), Expert.
In the first variant of SQIL, the weight on the sampled experiences is set to zero, to prevent SQIL from using additional samples drawn from the environment (see line 4 of Algorithm 1). This comparison tests if SQIL really needs to interact with the environment, or if it can rely solely on the demonstrations. In the second condition, the discount factor is set to zero to prevent SQIL from accessing information about state transitions (see Equation 6 and line 4 of Algorithm 1). This comparison tests if SQIL is actually extracting information about the dynamics from the samples, or if it can perform just as well with a naïve regularizer (setting the discount factor to zero effectively imposes a penalty on the L2-norm of the soft Q values instead of the squared soft Bellman error). In the third condition, a uniform random policy is used to sample additional rollouts, instead of the imitation policy (see line 6 of Algorithm 1). This comparison tests how important it is that the samples cover the states encountered by the agent during training. In the fourth condition, we use RBC to optimize Equation 5 instead of using SQIL to optimize Equation 7. This comparison tests the effect of the additional terms in Equation 7 vs. Equation 5.
The results in Figure 4 show that all methods perform well when there is no variation in the initial state. When the initial state is varied, SQIL performs significantly better than BC, GAIL, and the ablated variants of SQIL. This confirms our hypothesis that SQIL needs to sample from the environment using the imitation policy, and relies on information about the dynamics encoded in the samples.
Surprisingly, SQIL outperforms RBC by a large margin, suggesting that the additional terms in Equation 7 do in fact improve performance by encouraging the learned soft Q values to be higher and making the imitation policy less stochastic (discussed in Section 3.3). We expect that with additional tuning of the temperature hyperparameter in RBC, we could achieve the same effect as the additional terms, and RBC would perform the same as SQIL.
6 Related Work
Various approaches have been developed to address state distribution shift in BC, without relying on IRL or adversarial optimization. Hand-engineering a domain-specific loss function and carefully designing the demonstration collection process have enabled researchers to train effective imitation policies for self-driving cars bojarski2016end, autonomous drones giusti2016machine, and robotic manipulators zhang2017deep ; rahmatizadeh2017vision ; rahmatizadeh2016learning. DAgger-based methods query the expert for on-policy action labels ross2011reduction ; laskey2016shiv. These approaches either require domain knowledge or the ability to query the expert, while SQIL requires neither.

piot2014boosted propose an imitation learning algorithm that optimizes a classification objective subject to a constraint on the Bellman error, akin to regularized BC. We build on this work by showing that regularized BC is similar to off-policy RL with constant rewards, and draw on this connection to implement our method on top of existing deep RL algorithms.
Concurrently with SQIL, another imitation learning algorithm that uses off-policy RL with constant rewards instead of a learned reward function was developed sasaki2018sample. We see our paper as contributing additional evidence to support this core idea, rather than proposing a competing method. First, SQIL is derived as an extension of BC, while the prior method is derived from an alternative formulation of the IRL objective, showing that two different theoretical approaches independently lead to using off-policy RL with constant rewards as an alternative to adversarial training – a sign that this idea may be a promising direction for future work. Second, SQIL is shown to outperform BC and GAIL in domains that were not evaluated in sasaki2018sample – in particular, tasks with image observations and significant shift in the state distribution between the demonstrations and the training environment. This suggests that the results of the low-dimensional MuJoCo experiments in sasaki2018sample, which show the prior method outperforms BC and GAIL, may extend to more complex tasks.
SQIL resembles the Deep Q-learning from Demonstrations (DQfD) hester2017deep and Normalized Actor-Critic (NAC) algorithms gao2018reinforcement, in that all three algorithms fill the agent’s experience replay buffer with demonstrations and include an imitation loss in the agent’s objective. The key difference between SQIL and these prior methods is that DQfD and NAC are RL algorithms that assume access to a reward signal, while SQIL is an imitation learning algorithm that does not require an extrinsic reward signal from the environment. Instead, SQIL automatically constructs a reward signal from the demonstrations.
The SQIL objective is similar to that of the inverse soft Q-learning (ISQL) algorithm reddy2018you. Their details and motivations are, however, significantly different. ISQL is an internal dynamics estimation algorithm, while SQIL is for imitation learning. ISQL also assumes that the demonstrations include observations of the expert’s reward signal, while SQIL does not.
7 Conclusions and Future Work
We contribute the SQIL algorithm: a general method for learning to imitate an expert given action demonstrations and access to the environment. Simulation experiments on tasks with high-dimensional, continuous observations and unknown dynamics show that our method outperforms both BC and GAIL, while being simple to implement on top of existing off-policy RL code.
SQIL might be used to recover not just the expert’s policy, but also their reward function; for example, by using a parameterized reward function to model rewards in the soft Bellman error terms, instead of using constant rewards. This could provide a simpler alternative to existing adversarial IRL algorithms fu2017learning .
8 Acknowledgements
Thanks to the reviewers and lab mates who provided us with substantial feedback on earlier versions of this paper; in particular, Ashvin Nair, Gregory Kahn, and an anonymous user on OpenReview. Thanks to Ridley Scott and Philip K. Dick for the 1982 film, Blade Runner. One of the core ideas behind SQIL – initially filling the agent’s experience replay buffer with demonstrations where rewards are set to a positive constant – was inspired by Deckard’s conversation with Dr. Eldon Tyrell about Rachael’s memory implants.
Tyrell: If we gift [replicants] with a past, we create a cushion or a pillow for their emotions, and consequently, we can control them better.
Deckard: Memories. You’re talking about memories.
This work was supported in part by Berkeley DeepDrive, GPU donations from NVIDIA, NSF IIS-1700696, and AFOSR FA9550-17-1-0308.
References

[1]
Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling.
The arcade learning environment: An evaluation platform for general
agents.
Journal of Artificial Intelligence Research
, 47:253–279, 2013.  [2] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for selfdriving cars. arXiv preprint arXiv:1604.07316, 2016.
 [3] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016.
 [4] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. Openai baselines. https://github.com/openai/baselines, 2017.

[5]
Chelsea Finn, Sergey Levine, and Pieter Abbeel.
Guided cost learning: Deep inverse optimal control via policy
optimization.
In
International Conference on Machine Learning
, pages 49–58, 2016.  [6] Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.
 [7] Yang Gao, Ji Lin, Fisher Yu, Sergey Levine, Trevor Darrell, et al. Reinforcement learning from imperfect demonstrations. arXiv preprint arXiv:1802.05313, 2018.
 [8] Alessandro Giusti, Jérôme Guzzi, Dan C Ciresan, FangLin He, Juan P Rodríguez, Flavio Fontana, Matthias Faessler, Christian Forster, Jürgen Schmidhuber, Gianni Di Caro, et al. A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters, 1(2):661–667, 2016.
 [9] Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [10] David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. arXiv preprint arXiv:1809.01999, 2018.
 [11] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energybased policies. arXiv preprint arXiv:1702.08165, 2017.
 [12] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actorcritic: Offpolicy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
 [13] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In ThirtySecond AAAI Conference on Artificial Intelligence, 2018.
 [14] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Gabriel DulacArnold, et al. Deep qlearning from demonstrations. arXiv preprint arXiv:1704.03732, 2017.
 [15] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.
 [16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [17] Ilya Kostrikov, Kumar Krishna Agrawal, Debidatta Dwibedi, Sergey Levine, and Jonathan Tompson. Discriminatoractorcritic: Addressing sample inefficiency and reward bias in adversarial imitation learning. In International Conference on Learning Representations, 2019.
 [18] Karol Kurach, Mario Lucic, Xiaohua Zhai, Marcin Michalski, and Sylvain Gelly. The gan landscape: Losses, architectures, regularization, and normalization. arXiv preprint arXiv:1807.04720, 2018.

[19]
Michael Laskey, Sam Staszak, Wesley YuShu Hsieh, Jeffrey Mahler, Florian T
Pokorny, Anca D Dragan, and Ken Goldberg.
Shiv: Reducing supervisor burden in dagger using support vectors for efficient learning from demonstrations in high dimensional state spaces.
In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 462–469. IEEE, 2016.  [20] Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
 [21] Yunzhu Li, Jiaming Song, and Stefano Ermon. InfoGAIL: Interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems, pages 3812–3822, 2017.
 [22] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 [23] Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In ICML, pages 663–670, 2000.
 [24] Bilal Piot, Matthieu Geist, and Olivier Pietquin. Boosted and reward-regularized classification for apprenticeship learning. In Proceedings of the 2014 international conference on Autonomous agents and multiagent systems, pages 1249–1256. International Foundation for Autonomous Agents and Multiagent Systems, 2014.
 [25] Dean A Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.
 [26] Rouhollah Rahmatizadeh, Pooya Abolghasemi, Aman Behal, and Ladislau Bölöni. Learning real manipulation tasks from virtual demonstrations using LSTM. arXiv preprint, 2016.
 [27] Rouhollah Rahmatizadeh, Pooya Abolghasemi, Ladislau Bölöni, and Sergey Levine. Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration. arXiv preprint arXiv:1707.02920, 2017.
 [28] Siddharth Reddy, Anca D Dragan, and Sergey Levine. Where do you think you’re going?: Inferring beliefs about dynamics from behavior. arXiv preprint arXiv:1805.08010, 2018.
 [29] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 661–668, 2010.
 [30] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635, 2011.
 [31] Fumihiro Sasaki, Tetsuya Yohira, and Atsuo Kawaguchi. Sample efficient imitation learning for continuous control. In International Conference on Learning Representations, 2019.
 [32] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.
 [33] Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. Maximum entropy deep inverse reinforcement learning. arXiv preprint arXiv:1507.04888, 2015.
 [34] Tianhao Zhang, Zoe McCarthy, Owen Jow, Dennis Lee, Ken Goldberg, and Pieter Abbeel. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. arXiv preprint arXiv:1710.04615, 2017.
 [35] Brian D Ziebart, J Andrew Bagnell, and Anind K Dey. Modeling interaction via the principle of maximum causal entropy. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pages 1255–1262. Omnipress, 2010.
 [36] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
Appendix A Appendix
A.1 Derivation of SQIL Gradient
A.2 Comparing the Biased and Unbiased Variants of GAIL
As discussed in Section 5, to correct the original GAIL method’s biased handling of rewards at absorbing states, we implement the changes to GAIL suggested in Section 4.2 of [17]: we append a transition to an absorbing state, followed by a self-loop at the absorbing state, to the end of each rollout sampled from the environment, and we add a binary feature to the observations indicating whether or not a state is absorbing. We refer to the original, biased GAIL method as GAIL-DQL-B and GAIL-TRPO-B, and to the unbiased version as GAIL-DQL-U and GAIL-TRPO-U.
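The rollout modification described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the all-zeros absorbing observation, the placeholder `noop_action`, and the function names are assumptions introduced here, and the sketch assumes the rollout ended in a terminal state.

```python
from typing import List, Tuple

# A transition is (observation, action, next_observation).
Transition = Tuple[List[float], int, List[float]]


def add_absorbing_flag(obs: List[float], absorbing: bool) -> List[float]:
    """Append the binary 'is absorbing' feature to an observation."""
    return list(obs) + [1.0 if absorbing else 0.0]


def pad_rollout_with_absorbing_state(
    rollout: List[Transition], obs_dim: int, noop_action: int = 0
) -> List[Transition]:
    """Flag every observation, then append terminal -> absorbing and a self-loop."""
    # Assumption: the absorbing state is encoded as zeros plus the flag set to 1.
    absorbing_obs = add_absorbing_flag([0.0] * obs_dim, True)
    padded = [
        (add_absorbing_flag(o, False), a, add_absorbing_flag(o_next, False))
        for o, a, o_next in rollout
    ]
    if not padded:
        return padded
    last_obs = padded[-1][2]
    # Transition from the final environment state into the absorbing state.
    padded.append((last_obs, noop_action, absorbing_obs))
    # Self-loop at the absorbing state.
    padded.append((absorbing_obs, noop_action, absorbing_obs))
    return padded
```

With the two extra transitions present in the data, the reward at the absorbing state is something the discriminator assigns rather than an implicit zero.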
Car Racing. The results in Figure 5 show that the biased (GAIL-DQL-B) and unbiased (GAIL-DQL-U) versions of GAIL perform equally poorly. Training an image-based discriminator for this task may be difficult enough that, even with an unfair bias toward avoiding crashes that terminate the episode, GAIL-DQL-B does not perform better than GAIL-DQL-U.
Atari. The results in Figure 6 show that SQIL outperforms both variants of GAIL on Pong and outperforms the unbiased version of GAIL (GAIL-DQL-U) on Breakout and Space Invaders, but performs comparably to the biased version of GAIL (GAIL-DQL-B) on Space Invaders and worse than it on Breakout. This may be because in Breakout and Space Invaders the agent has multiple lives (five in Breakout, three in Space Invaders) and receives a termination signal after losing each life. Thus, the agent experiences many more episode terminations than in Pong, exacerbating the bias in the way the original GAIL method handles rewards at absorbing states. Our implementation of GAIL-DQL-B in this experiment provides a learned reward $r(s, a)$ computed from the discriminator $D$ (see Section A.3 in the appendix for details). The learned reward is always positive, while the implicit reward at an absorbing state is zero. Thus, the agent is inadvertently encouraged to avoid terminating the episode. For Breakout and Space Invaders, this just happens to be the right incentive, since the objective is to stay alive as long as possible. GAIL-DQL-B outperforms SQIL in Breakout and performs comparably to SQIL in Space Invaders because GAIL-DQL-B is accidentally biased in the right way.
Lunar Lander. The results in Figure 7 show that when the initial state is varied, SQIL outperforms the unbiased variant of GAIL (GAIL-TRPO-U), but underperforms the biased version of GAIL (GAIL-TRPO-B). The latter result is likely because the implementation of GAIL-TRPO-B we used in this experiment provides a learned reward $r(s, a)$ computed from the discriminator $D$ (see Section A.3 in the appendix for details). The learned reward is always negative, while the implicit reward at an absorbing state is zero. Thus, the agent is inadvertently encouraged to terminate the episode quickly. For the Lunar Lander game, this just happens to be the right incentive, since the objective is to land on the ground and thereby terminate the episode. As in the Atari experiments, GAIL-TRPO-B performs better than SQIL in this experiment because GAIL-TRPO-B is accidentally biased in the right way.
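The sign argument in the Atari and Lunar Lander paragraphs can be checked with a small calculation. The sketch below is illustrative only: the constant per-step reward of magnitude 1, the discount factor of 0.99, and the episode lengths are assumptions chosen for the example, not values from the experiments.

```python
def discounted_return(per_step_reward, num_steps, gamma=0.99):
    """Return of an episode that ends in an absorbing state after num_steps."""
    # Steps 0..num_steps-1 earn per_step_reward; every later step is spent
    # in the absorbing state, whose implicit reward is zero.
    return sum(per_step_reward * gamma ** t for t in range(num_steps))


for reward in (1.0, -1.0):
    short = discounted_return(reward, num_steps=10)
    long = discounted_return(reward, num_steps=100)
    preferred = "longer" if long > short else "shorter"
    print(f"per-step reward {reward:+.1f}: {preferred} episodes score higher")
```

A strictly positive learned reward makes longer episodes score higher, while a strictly negative one makes shorter episodes score higher, regardless of how well the agent imitates the expert.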
[Table: results under Domain Shift and No Shift for Random, BC (P’91), GAIL-DQL-B, GAIL-DQL-U, SQIL (Ours), and Expert.]
[Table: results under Domain Shift and No Shift for Random, BC (P’91), GAIL-TRPO-B (HE’16), GAIL-TRPO-U, SQIL (Ours), the RBC ablation, and Expert.]
A.3 Implementation Details
To ensure fair comparisons, the same network architectures were used to evaluate SQIL, GAIL, and BC. For Lunar Lander, we used a network architecture with two fully-connected layers containing 128 hidden units each to represent the Q network in SQIL, the policy and discriminator networks in GAIL, and the policy network in BC. For Car Racing, we used four convolutional layers (following [10]) and two fully-connected layers containing 256 hidden units each. For Humanoid and HalfCheetah, we used two fully-connected layers containing 256 hidden units each. For Atari, we used the convolutional neural network described in [22] to represent the Q network in SQIL, as well as the Q network and discriminator network in GAIL.
To ensure fair comparisons, the same demonstration data were used to train SQIL, GAIL, and BC. For Lunar Lander, we collected 100 demonstration rollouts. For Car Racing, Pong, Breakout, and Space Invaders, we collected 20 demonstration rollouts. Expert demonstrations were generated from scratch for Lunar Lander and Atari using DQN [22], and collected from open-source pretrained policies for Car Racing [10] as well as Humanoid and HalfCheetah [4]. The Humanoid demonstrations were generated by a stochastic expert policy, while the HalfCheetah demonstrations were generated by a deterministic expert policy; both experts were trained using TRPO (footnote 3: https://drive.google.com/drive/folders/1h3H4AY_ZBx08hzCt0NxxusV1melu1U). We used two open-source implementations of GAIL: [6] for Lunar Lander, and [4] for MuJoCo. We adapted the OpenAI Baselines implementation of GAIL to use soft Q-learning for Car Racing and Atari.
For Lunar Lander, we set , , and . For Car Racing, we set , , and . For Humanoid, we set and . For HalfCheetah, we set and . For Atari, we set , , and .
SQIL was not pretrained in any of the experiments. GAIL was pretrained using BC for HalfCheetah, but was not pretrained in any other experiments.
In standard implementations of soft Q-learning and SAC, the agent’s experience replay buffer typically has a fixed size, and once the buffer is full, old experiences are deleted to make room for new experiences. In SQIL, we never delete demonstration experiences from the replay buffer, but otherwise follow the standard implementation.
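One way to realize this buffer behavior is to keep demonstrations in a separate, unbounded list next to a fixed-size queue for the agent's own experience. The sketch below is an assumption about structure, not the authors' code: the class name, the two sampling methods, and sampling with replacement are all hypothetical choices made for illustration.

```python
import random
from collections import deque


class DemoPreservingReplayBuffer:
    """Fixed-size buffer for agent experience; demonstrations are never evicted."""

    def __init__(self, capacity, demonstrations):
        # Demonstration transitions live in their own list and are kept forever.
        self.demos = list(demonstrations)
        # Agent transitions use a bounded deque: once full, the oldest is dropped.
        self.agent = deque(maxlen=capacity)

    def add(self, transition):
        """Store a new agent transition, evicting the oldest one if full."""
        self.agent.append(transition)

    def sample_demos(self, batch_size):
        """Sample demonstration transitions with replacement."""
        return random.choices(self.demos, k=batch_size) if self.demos else []

    def sample_agent(self, batch_size):
        """Sample agent transitions with replacement."""
        return random.choices(list(self.agent), k=batch_size) if self.agent else []

    def __len__(self):
        return len(self.demos) + len(self.agent)
```

Keeping the two stores separate means the eviction policy for agent experience cannot touch the demonstrations, whatever the buffer capacity is.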
The BC and GAIL performance metrics in Section 5.3 are taken from [4] (footnote 4: https://github.com/openai/baselines/blob/master/baselines/gail/result/gailresult.md).
The GAIL and SQIL policies in Section 5.3 are set to be deterministic during the evaluation rollouts used to measure performance.