Support-weighted Adversarial Imitation Learning

02/20/2020 · Ruohan Wang et al., Imperial College London

Adversarial Imitation Learning (AIL) is a broad family of imitation learning methods designed to mimic expert behaviors from demonstrations. While AIL has shown state-of-the-art performance on imitation learning with only a small number of demonstrations, it faces several practical challenges, such as potential training instability and implicit reward bias. To address these challenges, we propose Support-weighted Adversarial Imitation Learning (SAIL), a general framework that extends a given AIL algorithm with information derived from support estimation of the expert policy. SAIL improves the quality of the reinforcement signals by weighing the adversarial reward with a confidence score from support estimation of the expert policy. We also show that SAIL is always at least as efficient as the underlying AIL algorithm that SAIL uses for learning the adversarial reward. Empirically, we show that the proposed method achieves better performance and training stability than baseline methods on a wide range of benchmark control tasks.


1 Introduction

Adversarial Imitation Learning (AIL) algorithms are a powerful class of methods that learn to imitate expert behaviors from demonstrations. AIL alternates between learning a reward function via adversarial training and performing reinforcement learning (RL) with the learned reward function. AIL has been shown to be effective with only a small number of expert trajectories, with no further access to other RL signals. AIL is also more robust against the distributional drift that affects behavioral cloning (ross2011reduction), a classical imitation learning method that requires a large amount of training data to generalize well. Generative Adversarial Imitation Learning (GAIL) (ho2016generative), an early and influential AIL method, shows the connection between inverse reinforcement learning and Generative Adversarial Networks (GANs) (goodfellow2014generative). This observation motivates casting imitation learning as a distribution matching task between the expert and the RL agent policies. Recent works have sought to improve various aspects of AIL, such as robustness to changes in environment dynamics (fu2017learning) and sample efficiency of environment interactions (nagabandi2018neural). However, AIL still faces several practical challenges associated with adversarial training, including potential training instability (salimans2016improved; brock2018large) and implicit reward bias (kostrikov2018discriminator).

wang2019red demonstrated that imitation learning is also feasible by constructing a fixed reward function via support estimation of the expert policy. Since support estimation only requires expert demonstrations, the method sidesteps the training instability associated with adversarial training. However, we show in Section 4.1 that the reward learned via support estimation deteriorates and leads to poor performance when the expert data is sparse.

Support estimation and adversarial reward represent two different yet complementary RL signals for imitation learning, both learnable from expert demonstrations. We unify both signals into Support-weighted Adversarial Imitation Learning (SAIL), a general framework that weighs the adversarial reward with a confidence score derived from support estimation of the expert policy. Consequently, SAIL improves the quality of the learned reward to address potential training instability and reward bias. We highlight that SAIL may be efficiently applied on top of many existing AIL algorithms, such as GAIL and Discriminator-Actor-Critic (kostrikov2018discriminator). In addition, it can be shown that SAIL is at least as efficient as the underlying AIL method that it leverages for learning the adversarial reward. In experiments on a series of benchmark control tasks, we demonstrate that SAIL achieves better performance and training stability, and mitigates the implicit reward bias.

Our main contributions include:


  • We propose a general framework for adversarial imitation learning, combining both adversarial rewards and support estimation of the expert policy.

  • The proposed method is easy to implement, and may be applied to many existing AIL algorithms.

  • We show that SAIL improves performance and training stability, and mitigates reward bias better than the baseline methods.

The rest of the paper is organized as follows: we review the relevant background and literature in Section 2. In Section 3, we detail the proposed method and the theoretical analysis on its sample efficiency with respect to the expert data. We present the experiment results in Section 4 and conclude in Section 5.

2 Background

We recall the definition of Markov Decision Process and introduce formal notations used in this work. We also review the related literature on imitation learning.

2.1 Task Setting

We consider an infinite-horizon discounted Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P, r, p_0, \gamma)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of actions, $P(s' \mid s, a)$ the transition probability, $r(s, a)$ the reward function, $p_0$ the distribution over initial states, and $\gamma \in (0, 1)$ the discount factor. Let $\pi$ be a stochastic policy with expected discounted reward $\mathbb{E}_\pi[r(s,a)] \triangleq \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)]$, where $s_0 \sim p_0$, $a_t \sim \pi(\cdot \mid s_t)$, and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ for $t \geq 0$. We denote by $\pi_E$ the expert policy.

2.2 Imitation Learning

Behavioral Cloning (BC) learns a policy directly from expert trajectories via supervised learning. BC is simple to implement, and effective when expert data is abundant. However, BC is prone to distributional drift: the state distribution of expert demonstrations deviates from that of the agent policy, due to the accumulation of small mistakes during policy execution. Distributional drift may lead to catastrophic errors (ross2011reduction). While several methods address the issue (ross2010efficient; sun2017deeply), they often assume further access to the expert during training.

Inverse Reinforcement Learning (IRL) first estimates a reward from expert demonstrations, followed by RL using the estimated reward (ng2000algorithms; abbeel2004apprenticeship). Building upon a maximum entropy formulation of IRL (ziebart2008maximum), finn2016connection and fu2017learning explore adversarial IRL and its connection to Generative Adversarial Imitation Learning (ho2016generative).

2.2.1 Adversarial Imitation Learning

Generative Adversarial Imitation Learning (GAIL) (ho2016generative) is an early and influential work on AIL. It casts imitation learning as distribution matching between the expert and the RL agent. Specifically, the authors show the connection between IRL and GANs, and formulate the following minimax game:

$\min_{\pi} \max_{D}\; \mathbb{E}_{\pi}[\log D(s,a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] - \lambda H(\pi),$    (1)

where the expectations $\mathbb{E}_{\pi}$ and $\mathbb{E}_{\pi_E}$ denote the joint distributions over state-actions of the RL agent and the expert, respectively, $D$ is a discriminator, and $H(\pi)$ is an entropy regularizer. GAIL is able to achieve expert performance with a small number of expert trajectories on various benchmark tasks. However, GAIL is relatively sample inefficient with respect to environment interaction, and inherits issues associated with adversarial training, such as vanishing gradients, training instability and overfitting to expert demonstrations (arjovsky2017towards; brock2018large).

GAIL has inspired many follow-up works aimed at improving the efficiency and stability of AIL methods. For instance, Generative Moment Matching Imitation Learning (kim2018imitation) replaces the adversarial reward with a non-parametric maximum mean discrepancy estimator to sidestep adversarial learning. baram2017end improve sample efficiency with a model-based RL algorithm. In addition, kostrikov2018discriminator and sasaki2018sample demonstrate significant gains in sample efficiency with off-policy RL algorithms. Furthermore, Generative Predecessor Models for Imitation Learning (schroecker2019generative) imitates the expert policy using generative models to reason about alternative histories of demonstrated states.

The proposed method extends the broad family of AIL algorithms with additional information. In particular, we improve the quality of the learned reward by weighing the adversarial reward with a score derived from support estimation of the expert policy. The proposed method is therefore complementary and orthogonal to many of the aforementioned techniques for improving algorithmic efficiency and stability.

2.2.2 Imitation Learning via Support Estimation

Alternative to AIL, wang2019red demonstrate the feasibility of imitation learning with a fixed RL reward obtained by estimating the support of the expert policy from expert demonstrations. Connecting kernel-based support estimation (de2014universally) to Random Network Distillation (burda2018exploration), the authors propose Random Expert Distillation (RED) to learn a reward function based on support estimation. Specifically, RED learns the reward parameter $\hat{\theta}$ by minimizing

$\min_{\hat{\theta}}\; \mathbb{E}_{(s,a) \sim \pi_E} \|f_{\hat{\theta}}(s,a) - f_{\theta}(s,a)\|^2,$    (2)

where $f_{\theta}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}^K$ projects state-action pairs from the expert demonstrations to an embedding of size $K$, with randomly initialized $\theta$. The reward is then defined as

$r_{red}(s,a) = \exp\left(-\sigma \|f_{\hat{\theta}}(s,a) - f_{\theta}(s,a)\|^2\right),$    (3)

where $\sigma$ is a hyperparameter. As optimizing Eq. 2 only requires expert data, RED sidesteps adversarial learning, and casts imitation learning as a standard RL task using the learned reward. While RED works well given sufficient expert data, we show in the experiments that its performance suffers in the more challenging setting of sparse expert data.
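As a concrete illustration of Eqs. 2 and 3, the following is a minimal sketch of a RED-style support-based reward, written in PyTorch. The class and parameter names (REDReward, sigma, emb_dim) are ours, and the network sizes are placeholders rather than the configuration used by RED or in our experiments.

```python
import torch
import torch.nn as nn

def make_embedding(in_dim, emb_dim=128):
    # Simple MLP mapping state-action pairs to a K-dimensional embedding.
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))

class REDReward:
    """Support-based reward: r_red(s,a) = exp(-sigma * ||f_pred(s,a) - f_target(s,a)||^2)."""

    def __init__(self, in_dim, emb_dim=128, sigma=1.0, lr=1e-4):
        self.target = make_embedding(in_dim, emb_dim)     # randomly initialized, kept fixed
        self.predictor = make_embedding(in_dim, emb_dim)  # trained on expert data only
        for p in self.target.parameters():
            p.requires_grad_(False)
        self.sigma = sigma
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=lr)

    def fit(self, expert_sa, epochs=100):
        # Minimize ||f_pred - f_target||^2 on expert state-action pairs (Eq. 2).
        for _ in range(epochs):
            loss = (self.predictor(expert_sa) - self.target(expert_sa)).pow(2).mean()
            self.opt.zero_grad()
            loss.backward()
            self.opt.step()

    @torch.no_grad()
    def reward(self, sa):
        # Eq. 3: close to 1 on the estimated expert support, near 0 elsewhere.
        err = (self.predictor(sa) - self.target(sa)).pow(2).sum(dim=-1)
        return torch.exp(-self.sigma * err)
```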

3 Method

Formally, we consider the task of learning a reward function $\hat{r}(s,a)$ from a finite set of trajectories $\mathcal{T} = \{\tau_i\}_{i=1}^{N}$ sampled from the expert policy $\pi_E$ within an MDP. Each trajectory is a sequence of state-action tuples of the form $\tau = \{(s_0, a_0), (s_1, a_1), \dots\}$. Assuming that the expert trajectories are consistent with some latent reward function, we aim to learn a policy that achieves good performance with respect to the latent reward by applying RL on the learned reward function $\hat{r}(s,a)$.

In this section, we first discuss the advantages and shortcomings of AIL to motivate our method. We then introduce Support-weighted Adversarial Imitation Learning (SAIL), and present a theoretical analysis that compares SAIL with the underlying AIL method that SAIL uses for adversarial reward learning. In particular, we consider GAIL for adversarial reward learning.

3.1 Adversarial Imitation Learning

A clear advantage of AIL resides in its low sample complexity with respect to expert data. For instance, GAIL requires as little as 200 state-action tuples from the expert to achieve imitation. The reason is that the adversarial reward may be interpreted as an effective exploration mechanism for the RL agent. To see this, consider the learned reward function under the optimality assumption. With the optimal discriminator for Eq. 1, $D^*(s,a) = \frac{\pi(s,a)}{\pi(s,a) + \pi_E(s,a)}$, a common reward for GAIL is

$r_{gail}(s,a) = -\log D^*(s,a) = \log\left(1 + \frac{\pi_E(s,a)}{\pi(s,a)}\right).$    (4)

Eq. 4 shows that the adversarial reward only depends on the ratio $\pi_E(s,a)/\pi(s,a)$. Intuitively, $r_{gail}$ incentivizes the RL agent towards under-visited state-actions, where $\pi(s,a) < \pi_E(s,a)$, and away from over-visited state-actions, where $\pi(s,a) > \pi_E(s,a)$. When $\pi$ and $\pi_E$ match exactly, $r_{gail}$ converges to an indicator function for the support of $\pi_E$, since $D^*(s,a) = \frac{1}{2}$ on the support (goodfellow2014generative). In practice, the adversarial reward is unlikely to converge, as $\pi_E$ is estimated from a finite set of expert demonstrations. Instead, the adversarial reward continuously drives the agent to explore by evolving the reward landscape.
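For concreteness, the snippet below sketches one common way to compute the adversarial reward in Eq. 4 from discriminator logits, under the convention that the discriminator is trained to output the probability that a state-action pair comes from the agent rather than the expert. The function names are illustrative and not tied to a specific codebase.

```python
import torch
import torch.nn.functional as F

def gail_reward(disc_logits):
    """Adversarial reward r_gail = -log D(s, a), where D = sigmoid(logits) is the
    probability the discriminator assigns to "agent" (vs. expert)."""
    return F.softplus(-disc_logits)  # -log(sigmoid(x)) = softplus(-x), numerically stable

def discriminator_loss(agent_logits, expert_logits):
    """Binary cross-entropy for the minimax game in Eq. 1, with agent samples labeled 1
    and expert samples labeled 0 (so the optimum is D* = pi / (pi + pi_E))."""
    agent_term = F.binary_cross_entropy_with_logits(agent_logits, torch.ones_like(agent_logits))
    expert_term = F.binary_cross_entropy_with_logits(expert_logits, torch.zeros_like(expert_logits))
    return agent_term + expert_term
```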

In practice, AIL faces several challenges, such as the potential training instability associated with adversarial training. wang2019red demonstrated empirically that the adversarial reward could be unreliable in regions of the state-action space where the expert data is sparse, causing the agent to diverge from the intended behavior. When the agent policy is substantially different from the expert, the discriminator could differentiate them with high confidence. As a result, the agent receives tiny and uninformative rewards, which significantly slows down training, a scenario similar to the vanishing gradient problem in GAN training (arjovsky2017towards).

On the other hand, kostrikov2018discriminator demonstrated that the adversarial reward encodes an implicit survival bias: the non-negative reward may lead to sub-optimal behaviors in goal-oriented tasks, where the agent learns to move around the goal to accumulate rewards instead of completing the task. While the authors address the issue by introducing absorbing states, the solution requires additional RL signals from the environment, such as access to the time limit of an environment for detecting early termination of training episodes. In Section 4.2, we demonstrate empirically that our proposed method mitigates the issue, and is able to imitate the expert more robustly.

3.2 Support-weighted Adversarial Imitation Learning

We propose a novel reward function that unifies the adversarial reward with the score derived from support estimation of the expert policy:

$r_{sail}(s,a) = r_{red}(s,a)\, r_{gail}(s,a).$    (5)

SAIL leverages the exploration mechanism offered by the adversarial reward $r_{gail}$, and weighs it with $r_{red}$, a score derived from support estimation. Intuitively, $r_{red}$ may be interpreted as a confidence estimate on the reliability of the adversarial reward, based on the availability of training data. This is particularly useful in our task context, where only a limited number of expert demonstrations is available. As support estimation only requires expert demonstrations, our method requires no further assumptions than the underlying AIL method used.

We use a bounded adversarial reward with the same range as $r_{red}$ instead of the typical unbounded $-\log D(s,a)$. The modification allows $r_{gail}$ and $r_{red}$ to have the same range and thus contribute equally to the reward function. For all experiments, we include the comparison between the two rewards, and show that the bounded one generally produces more robust policies. In the rest of the paper, we denote SAIL with the bounded reward as SAIL-b, and SAIL with the log reward as SAIL. Similarly, we denote GAIL using the bounded reward as GAIL-b.
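The sketch below shows one way to wire Eq. 5 together, using $1 - D(s,a)$ as an example of a bounded adversarial reward in $[0, 1]$; this particular bounded form is an illustrative choice consistent with the range requirement above, not necessarily the exact definition used in our experiments.

```python
import torch
import torch.nn.functional as F

def sail_reward(disc_logits, support_score, bounded=True):
    """Eq. 5: weigh the adversarial reward with the support-estimation score.

    disc_logits:   1-D tensor of discriminator logits for the agent's (s, a) pairs,
                   with D = sigmoid(logits) read as the probability of "agent".
    support_score: 1-D tensor with r_red(s, a) in [0, 1] from support estimation.
    bounded:       if True, use 1 - D as an example of a bounded adversarial reward
                   (an illustrative choice); otherwise use the usual -log D.
    """
    d = torch.sigmoid(disc_logits)
    r_gail = (1.0 - d) if bounded else F.softplus(-disc_logits)  # softplus(-x) = -log sigmoid(x)
    return support_score * r_gail
```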

To improve training stability, SAIL constrains the RL agent to the estimated support of the expert policy, where $r_{gail}$ provides a more reliable RL signal (wang2019red). As $r_{red}(s,a)$ tends to be very small (ideally zero) for state-actions outside the support of $\pi_E$, $r_{sail}$ discourages the agent from exploring those state-actions by masking away the rewards. This is a desirable property, as the quality of the RL signals beyond the support of the expert policy cannot be guaranteed. We demonstrate the improved training stability on the Mujoco benchmark tasks in Section 4.1.

SAIL also mitigates the survival bias in goal-oriented tasks by encouraging the agent to stop at the goal and complete the task. In particular, $r_{red}$ shapes the adversarial reward by favoring stopping at the goal over all other actions, as stopping at the goal is on the support of the expert policy while the other actions are not. We demonstrate empirically in Section 4.2 that SAIL learns to assign significantly higher reward towards completing the task and corrects for the bias.

We provide the pseudocode implementation of SAIL in Algorithm 1. The algorithm computes $r_{red}$ by estimating the support of the expert policy, followed by iterative updates of the policy and the discriminator. We apply the Trust Region Policy Optimization (TRPO) algorithm (schulman2015trust) with the reward $r_{sail}$ for policy updates.

  Input: Expert trajectories $\mathcal{T}_E$, embedding models $f_{\theta}, f_{\hat{\theta}}$, initial policy $\pi_0$, initial discriminator parameters $w_0$, learning rate $\alpha$.
  $r_{red} \leftarrow$ RED($\mathcal{T}_E$)
  for $i = 0, 1, 2, \dots$
    sample a trajectory $\tau_i \sim \pi_i$
    $w_{i+1} \leftarrow w_i + \alpha\, \hat{\mathbb{E}}_{\tau_i}[\nabla_w \log D_w(s,a)] + \alpha\, \hat{\mathbb{E}}_{\mathcal{T}_E}[\nabla_w \log(1 - D_w(s,a))]$
    $r_{sail}(s,a) \leftarrow r_{red}(s,a)\, r_{gail}(s,a)$
    $\pi_{i+1} \leftarrow$ TRPO($\pi_i$, $r_{sail}$)

  def RED($\mathcal{T}_E$)
    Sample randomly initialized $\theta$, $\hat{\theta}$
    Minimize Eq. 2 over $\hat{\theta}$ on $\mathcal{T}_E$
    return $r_{red}$ as defined in Eq. 3
Algorithm 1 Support-weighted Adversarial Imitation Learning
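The following Python sketch mirrors Algorithm 1 as a training loop. The callables collect_trajectory, update_discriminator and trpo_update are placeholders for standard components (rollout collection, the discriminator step of Eq. 1, and a TRPO policy update), and red is assumed to expose fit and reward as in the earlier RED sketch. Only the reward wiring of Eq. 5 is specific to SAIL.

```python
import torch

def train_sail(env, policy, discriminator, red, expert_sa,
               collect_trajectory, update_discriminator, trpo_update,
               iterations=1000):
    """Schematic version of Algorithm 1; the three callables are assumed helpers."""
    red.fit(expert_sa)  # support estimation uses expert data only and is done once
    for _ in range(iterations):
        states, actions = collect_trajectory(env, policy)           # rollout of the current policy
        agent_sa = torch.cat([states, actions], dim=-1)
        update_discriminator(discriminator, agent_sa, expert_sa)    # adversarial step (Eq. 1)
        with torch.no_grad():
            d = torch.sigmoid(discriminator(agent_sa)).squeeze(-1)  # D(s, a)
            rewards = red.reward(agent_sa) * (1.0 - d)              # Eq. 5, bounded variant
        trpo_update(policy, states, actions, rewards)               # policy improvement step
    return policy
```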

3.3 Comparing SAIL with GAIL

In this section, we show that SAIL is at least as efficient as GAIL in its sample complexity for expert data, and provides comparable RL signals on the expert policy's support. We note that our analysis could be similarly applied to other AIL methods, suggesting the broad applicability of our approach.

We begin with the asymptotic setting, where the number of expert trajectories tends to infinity. In this case, the rewards learned by GAIL, RED and SAIL all ultimately recover the expert policy's support at convergence (see ho2016generative for GAIL and wang2019red for RED; SAIL follows from their combination). At convergence, both SAIL and GAIL also recover the expert policy, as the expert and agent policy distributions match exactly. It is therefore critical to characterize the rates of convergence of the two methods, namely their relative sample complexity with respect to the number of expert demonstrations.

If the expert policy has infinite support, $r_{red}$ would converge to a constant function with value 1 in the asymptotic setting. We consequently recover GAIL and maintain all the theoretical properties of the algorithm. On the other hand, when only a finite number of demonstrations is available, $r_{red}$ estimates a finite support and helps the RL agent avoid state-actions not on the estimated support of the expert policy.

Formally, let $n$ denote the number of expert state-action pairs and $\mathrm{supp}(\pi_E)$ the support of the expert policy. Prototypical learning bounds for an estimator of the support provide high-probability bounds of the form $r(s,a) \leq c(\delta)\, n^{-\alpha}$ for any $(s,a) \notin \mathrm{supp}(\pi_E)$ and any confidence $\delta$, with $c(\delta)$ a constant not depending on the number of samples $n$ (i.e., expert state-actions). Here, $\alpha$ represents the learning rate, namely how fast the estimator converges to the support. By choosing the reward in Eq. 5, we are leveraging the faster of the learning rates of $r_{red}$ and $r_{gail}$ with respect to support estimation. At the time being, no results are available to characterize the sample complexity of GAIL (loosely speaking, the $c$ and $\alpha$ introduced above). Therefore, we proceed by focusing on a relative comparison with SAIL. In particular, we show the following (see the appendix for a proof).

Proposition 1.

Assume that, for any $\delta \in (0, 1]$ and any $(s,a) \notin \mathrm{supp}(\pi_E)$, the rewards for RED and GAIL have the following learning rates in estimating the support, jointly with probability at least $1 - \delta$:

$r_{red}(s,a) \leq c_{red}(\delta)\, n^{-\alpha_{red}}, \qquad r_{gail}(s,a) \leq c_{gail}(\delta)\, n^{-\alpha_{gail}}.$    (6)

Then, for any $\delta \in (0, 1]$ and any $(s,a) \notin \mathrm{supp}(\pi_E)$, the following holds

$r_{sail}(s,a) \leq \min\left\{ B_{gail}\, c_{red}(\delta)\, n^{-\alpha_{red}},\; B_{red}\, c_{gail}(\delta)\, n^{-\alpha_{gail}} \right\}$    (7)

with probability at least $1 - \delta$, where $B_{red}$ and $B_{gail}$ are the upper bounds for $r_{red}$ and $r_{gail}$, respectively.

Eq. 7 shows that SAIL is at least as fast as the faster of RED and GAIL with respect to support estimation, implying that SAIL is at least as efficient as GAIL in the sample complexity for expert data. Moreover, Eq. 7 indicates how fast the proposed method can correctly identify state-actions not belonging to the estimated support of the expert and assign low rewards to them.

Proposition 2.

For any $\delta \in (0, 1]$ and any $(s,a) \in \mathrm{supp}(\pi_E)$, we assume that

$1 - r_{red}(s,a) \leq c_{red}(\delta)\, n^{-\alpha_{red}}$    (8)

holds with probability at least $1 - \delta$. Then the following event holds with probability at least $1 - \delta$:

$\left| r_{gail}(s,a) - r_{sail}(s,a) \right| \leq B_{gail}\, c_{red}(\delta)\, n^{-\alpha_{red}}.$    (9)

Eq. 9 shows that, on the expert policy's support, $r_{sail}$ is close to $r_{gail}$ up to a precision that improves with the number of expert state-actions. SAIL thus provides RL signals comparable to GAIL on the expert policy's support.

It is also worth noting that the analysis could explain why the additive combination $r_{red} + r_{gail}$ is a less viable approach for merging the two RL signals. The analogous bound to Eq. 7 would be the sum of the errors from the two methods, implying the slower of the two learning rates, while the analogue of Eq. 9 would improve only up to a constant, since on the support the gap between the combined reward and $r_{gail}$ would be $r_{red}$ itself rather than a term proportional to $1 - r_{red}$. Our preliminary experiments also indicated that the additive reward performed noticeably worse than Eq. 5.

Lastly, we comment on whether the assumptions in Eqs. 6 and 8 are satisfied in practice. Following the kernel-based version of RED (wang2019red), we can borrow previous results from the set learning literature, which guarantee the required learning rates for RED (de2014universally; rudi2017regularized). These rates have been shown to be optimal: no estimator of the support can achieve faster rates unless additional assumptions are imposed. Learning rates for distribution matching with GANs are still an active area of research, and conclusive results characterizing the convergence rates of these estimators are not available. We refer to singh2018nonparametric for an in-depth analysis of the topic.
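As a quick numeric illustration of Eq. 7 (not part of the analysis), the snippet below uses assumed constants and rates to show that the product reward is controlled by the faster-decaying of the two bounds outside the expert support.

```python
# Assumed constants and rates for illustration only: r_red decays at the faster rate
# alpha_red = 0.5, r_gail at the slower rate alpha_gail = 0.25, and B_red = B_gail = 1.
c_red, a_red = 2.0, 0.5
c_gail, a_gail = 2.0, 0.25
for n in (10, 100, 1000, 10000):
    red_bound = c_red * n ** -a_red
    gail_bound = c_gail * n ** -a_gail
    sail_bound = min(1.0 * red_bound, 1.0 * gail_bound)  # Eq. 7 with B_red = B_gail = 1
    print(f"n={n:>6}  RED<={red_bound:.4f}  GAIL<={gail_bound:.4f}  SAIL<={sail_bound:.4f}")
# The SAIL bound follows the RED bound, i.e. the faster of the two assumed rates.
```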

4 Experiments

To demonstrate improved performance and training stability, we evaluate SAIL against baseline methods on six Mujoco control tasks. In addition, we use Lunar Lander, another common benchmark task that allows easy access to human demonstrations, to show that the proposed method is more robust and better mitigates reward bias. We omit evaluation against methods using off-policy RL algorithms, as they are not the focus of this work; as discussed previously, the proposed method can also be applied to such algorithms.

4.1 Mujoco Tasks

Mujoco control tasks have been commonly used as the standard benchmark for AIL. We evaluate SAIL against BC, GAIL and RED on Hopper, Reacher, HalfCheetah, Walker2d, Ant and Humanoid. We adopt the same experimental setup presented in ho2016generative by sub-sampling the expert trajectories every 20 samples (see the sketch below). Consistent with the observation from kostrikov2018discriminator, our preliminary experiments show that sub-sampling presents a more challenging setting, as BC is competitive with AIL when full trajectories are used. In our experiments, we also adopt the minimum number of expert trajectories specified in ho2016generative for each task. More details on the experimental setup are available in the appendix.
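A minimal sketch of the sub-sampling protocol (keeping every 20th state-action pair of each expert trajectory); the array names and dimensions are illustrative.

```python
import numpy as np

def subsample_trajectory(states, actions, every=20):
    """Keep every `every`-th state-action pair of one expert trajectory."""
    idx = np.arange(0, len(states), every)
    return states[idx], actions[idx]

# Example: a 1000-step trajectory is reduced to 50 state-action pairs.
states = np.random.randn(1000, 11)   # e.g. Hopper-v2 observations (11-dim)
actions = np.random.randn(1000, 3)   # e.g. Hopper-v2 actions (3-dim)
s_sub, a_sub = subsample_trajectory(states, actions)
assert len(s_sub) == len(a_sub) == 50
```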

We run each algorithm using 5 different random seeds on all Mujoco tasks. Table 1 shows the performance of the evaluated algorithms. We choose the best policy obtained from the 5 random seeds for each algorithm, and report the mean performance and standard deviation of the chosen policy over 50 evaluation runs. The policies are rolled out deterministically.


        Hopper         Reacher      Cheetah         Walker          Ant             Humanoid
BC      312.3 ± 34.5   -8.8 ± 3.3   1892.0 ± 206.9  248.2 ± 117.8   1752.0 ± 434.8  539.4 ± 185.7
RED     1056.5 ± 0.5   -9.1 ± 4.1   -0.2 ± 0.7      2372.8 ± 8.8    1005.5 ± 8.6    6012.0 ± 434.9
GAIL    3826.5 ± 3.2   -9.1 ± 4.4   4604.7 ± 77.6   5295.4 ± 44.1   1013.3 ± 16.0   8781.2 ± 3112.6
GAIL-b  3810.5 ± 8.1   -8.3 ± 2.5   4510.0 ± 68.0   5388.1 ± 161.2  3413.1 ± 744.7  10132.5 ± 1859.3
SAIL    3824.7 ± 6.6   -7.5 ± 2.7   4747.5 ± 43.4   5293.0 ± 590.9  3330.4 ± 729.4  9292.8 ± 3190.0
SAIL-b  3811.6 ± 3.8   -7.4 ± 2.5   4632.2 ± 59.1   5438.6 ± 18.4   4176.3 ± 203.1  10589.6 ± 52.2
Table 1: Episodic reward on the Mujoco tasks evaluated over 50 runs. SAIL-b achieves overall the best performance, with significantly lower standard deviation, indicating the robustness of the learned policies.

The results show that SAIL-b is comparable to GAIL on Hopper, and outperforms the other methods on all other tasks. We note that RED significantly underperforms in the sub-sampling setting (wang2019red used full trajectories in their original experiments). Across all tasks, SAIL-b generally achieves lower standard deviation compared to the other algorithms, in particular on Humanoid, indicating the robustness of the learned policies.

We stress that standard deviation is a critical metric for the robustness of the learned policies and has practical implications. For instance, the large standard deviations on Humanoid are caused by occasional crashes, which are potentially dangerous in real-world applications regardless of the generally good performance. To illustrate this, Fig. 2(a) shows the histogram of all 50 evaluations on Humanoid for RED, GAIL-b and SAIL-b. It is clear that SAIL-b imitates the expert consistently. Though GAIL-b appears to be only slightly worse in average performance, the degradation is caused by occasional and highly undesirable crashes, suggesting incomplete imitation of the expert. RED learns a sub-optimal gait, but exhibits no crashes. The results suggest that the proposed method improves the quality of the RL signals.

Figure 1: Training progress on (a) Hopper, (b) Reacher, (c) HalfCheetah, (d) Walker2d, (e) Ant and (f) Humanoid for RED, GAIL, GAIL-b, SAIL, and SAIL-b. Consistent with our theoretical analysis, SAIL-b (blue) is more stable and sample efficient on Reacher, Ant and Humanoid, and comparable to the other algorithms on the remaining tasks.

For ablation, we compare the unbounded reward (i.e., SAIL) against the bounded one (i.e., SAIL-b). We observe that the bounded variant generally produces policies with smaller standard deviations and better performance, especially on Ant and Humanoid. We attribute the improvement to the fact that SAIL-b receives equal contributions to the RL signal from both support estimation and the adversarial reward, as $r_{red}$ and $r_{gail}$ have the same range. We also note that GAIL fails to imitate the expert on Ant, while GAIL-b performs significantly better. The results suggest that constraining the range of the adversarial reward could improve performance.

4.1.1 Training Stability and Sample Efficiency

To assess the algorithms' respective sensitivity to random seeds, we plot the policy performance against the number of training iterations for each algorithm in Fig. 1. Each iteration consists of 1000 environment steps. The figure reports the mean and standard deviation of each algorithm across the 5 random seeds.

Fig. 1 shows that SAIL-b is more sample efficient and stable on the Reacher, Ant and Humanoid tasks, and is comparable to the other algorithms on the remaining tasks. Consistent with our analysis in Section 3.3, SAIL-b appears at least as efficient as GAIL even when the support estimation (i.e., the performance of RED) suffers from insufficient expert data on Hopper, HalfCheetah and Walker2d. On Reacher, Ant and Humanoid, SAIL-b benefits from the support estimation and achieves better performance and training stability. In particular, we note that without support estimation, GAIL fails to imitate the expert on Ant (Fig. 1(e)). Similar failures were also observed in kostrikov2018discriminator. GAIL is also more sensitive to initial conditions: it converged to sub-optimal policies in 2 out of 5 seeds on Humanoid. Lastly, while RED improves noticeably faster during early training on Humanoid, it eventually converged to a sub-optimal behavior.

4.2 Lunar Lander

We demonstrate that the proposed method mitigates the survival bias in Lunar Lander (Fig. 3(b)) from OpenAI Gym (gym), while the baseline methods imitate the expert inconsistently. In this task, the agent is required to control a spacecraft to land safely between the flags. We specifically choose a human expert to provide 10 demonstrations as an alternative source of training data, as the task allows easy access to human demonstrations. This is in contrast with the Mujoco tasks, where human demonstration is difficult and the expert policies are learned via RL.

Figure 2: (a) Performance histogram of 50 evaluation runs on Humanoid for RED, GAIL, and SAIL-b. SAIL-b imitates the expert consistently. GAIL has undesirable failure cases, with rewards of less than 1000 (bottom left corner). RED is consistent though sub-optimal. (b) (Lunar Lander) Average reward at the goal states assigned by different algorithms. SAIL-b assigns significantly higher reward to "no op", enabling the agent to learn the appropriate landing behavior. The other algorithms fail to imitate the expert consistently.

We observe that even without the environment reward, Lunar Lander provides a natural RL signal by terminating episodes early after a crash, thus encouraging the agent to avoid crashing. Consequently, both SAIL and GAIL are able to successfully imitate the expert and land the spacecraft appropriately. SAIL performs slightly better than GAIL in average reward, and achieves noticeably lower standard deviation. The average performances and standard deviations evaluated over 50 runs are presented in Fig. 3(a).

          Default            No-terminal
BC        100.38 ± 130.91    100.38 ± 130.91
RED       13.75 ± 53.43      -39.33 ± 24.39
GAIL      258.30 ± 28.98     169.73 ± 80.84
GAIL-b    250.53 ± 67.07     -69.33 ± 79.76
SAIL      257.02 ± 20.66     237.96 ± 49.70
SAIL-b    262.97 ± 18.11     256.83 ± 20.99
Expert    253.58 ± 31.27     253.58 ± 31.27
Figure 3: (a) Average environment reward and standard deviation on Lunar Lander, evaluated over 50 runs, for the default and no-terminal environments. (b) The task of Lunar Lander requires landing the spacecraft between the flags without crashing.

To construct a more challenging task, we disable early termination in the environment, thus removing external RL signals. In this no-terminal environment, a training episode only ends after the time limit. We present each algorithm's performance in the no-terminal setting in Fig. 3(a). SAIL outperforms GAIL. Specifically, we observe that GAIL learns to land for some initial conditions, while exhibiting survival bias in other scenarios by hovering at the goal. In contrast, SAIL is still able to recover the expert policy (illustrative video at https://vimeo.com/361835881).

We show in Fig. 2(b) the average learned reward for GAIL, SAIL-b and RED at goal states, to visualize how support estimation shapes the learned reward. The goal states are selected from the expert trajectories and satisfy two conditions: (1) they touch the ground (the state vector has indicator variables for ground contact), and (2) they have "no op" as the corresponding action; a selection sketch follows below. As the reward functions are dynamic, we snapshot the learned rewards when the algorithms obtain their best policies, respectively. It is clear that SAIL-b assigns a significantly higher average reward to "no op" at goal states compared with the other algorithms, thus facilitating the agent's learning. Though GAIL and RED still favor "no op" over the other actions, the differences in reward are much smaller, leading to less consistent landing behaviors.

We further observe that all evaluated AIL methods oscillate between a partially hovering behavior and the landing behavior during policy learning. The observation suggests that our method only partially addresses the survival bias, a limitation we will tackle in future work. This is likely caused by SAIL's non-negative reward, despite the beneficial shaping effect from support estimation.

To demonstrate the compatibility of the proposed method with other improvements to AIL, we show that it is compatible with the absorbing state technique proposed in kostrikov2018discriminator when the time limit of an environment is known. Additional experimental results and discussion are available in the appendix.

5 Conclusion

In this paper, we propose Support-weighted Adversarial Imitation Learning (SAIL), which combines support estimation of the expert policy with adversarial imitation learning. The proposed approach improves the quality of the RL signals by weighing the adversarial reward with the score derived from support estimation, leading to better training stability and performance. Our approach is also orthogonal and complementary to many existing AIL methods. More broadly, our results show that expert demonstrations contain rich sources of information for imitation learning. Effectively merging different sources of information from the expert demonstrations produces more efficient and stable algorithms, and appears to be a promising direction for future research.

References

Appendix A Proof for Proposition 1 and 2

Observe that for any $(s,a)$,

$r_{sail}(s,a) = r_{red}(s,a)\, r_{gail}(s,a) \leq \min\left\{ B_{gail}\, r_{red}(s,a),\; B_{red}\, r_{gail}(s,a) \right\}.$    (10)

By the assumption on the learning rates in Eq. 6, the two following bounds hold with probability at least $1 - \delta$, for any $\delta \in (0, 1]$ and any $(s,a) \notin \mathrm{supp}(\pi_E)$:

$r_{red}(s,a) \leq c_{red}(\delta)\, n^{-\alpha_{red}}, \qquad r_{gail}(s,a) \leq c_{gail}(\delta)\, n^{-\alpha_{gail}}.$    (11)

Plugging the above upper bounds into Eq. 10 yields the desired result in Eq. 7.

By the assumption in Eq. 8, the following event holds with probability at least $1 - \delta$ for any $(s,a) \in \mathrm{supp}(\pi_E)$:

$1 - r_{red}(s,a) \leq c_{red}(\delta)\, n^{-\alpha_{red}}.$    (12)

Plugging this inequality into the definition of $r_{sail}$, we obtain

$\left| r_{gail}(s,a) - r_{sail}(s,a) \right| = r_{gail}(s,a)\left(1 - r_{red}(s,a)\right)$    (13)
$\leq B_{gail}\left(1 - r_{red}(s,a)\right)$    (14)
$\leq B_{gail}\, c_{red}(\delta)\, n^{-\alpha_{red}},$    (15)

which is the desired result in Eq. 9.

Appendix B Experiment Details

The experiments are based on OpenAI's baselines (https://github.com/openai/baselines) and the original implementation of RED (https://github.com/RuohanW/RED). We adapted the code from RED for our experiments, and used the accompanying dataset of expert trajectories. 4 Nvidia GTX 1070 GPUs were used in the experiments.

Table 2 shows the environment information, the number of environment steps, and the number of expert trajectories used for each task. Each full trajectory consists of 1000 state-action pairs; the trajectories are sub-sampled during the experiments.

Task            State Space   Action Space   Trajectories   Env Steps   Expert Performance
Hopper-v2       11            3              4              –           3777.8 ± 3.8
Reacher-v2      11            2              4              –           -3.7 ± 1.4
HalfCheetah-v2  17            6              4              –           4159.8 ± 93.1
Walker2d-v2     17            6              4              –           5505.8 ± 81.4
Ant-v2          111           8              4              –           4821.0 ± 107.4
Humanoid-v2     376           17             80             –           10413.1 ± 47.0

Table 2: Environment information, number of expert trajectories and environment steps used for each task

B.1 Network Architecture

The default policy network from OpenAI's baselines is used for all tasks: two fully-connected layers of 100 units each, with tanh nonlinearities. The discriminator and value function networks use the same architecture.
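For illustration, a minimal PyTorch sketch of this architecture is given below; the referenced implementation builds on OpenAI's baselines (TensorFlow), so the snippet is a translation of the description rather than the exact code used.

```python
import torch.nn as nn

def mlp_trunk(in_dim, out_dim, hidden=100):
    """Two fully-connected layers of 100 tanh units, followed by a linear output head."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, out_dim),
    )

# Example heads for Hopper-v2 (11-dim observations, 3-dim actions):
policy_mean = mlp_trunk(11, 3)        # Gaussian policy mean
value_fn = mlp_trunk(11, 1)           # state-value estimate
discriminator = mlp_trunk(11 + 3, 1)  # logit over state-action pairs
```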

RED and SAIL use RND (burda2018exploration) for support estimation. We use the default networks from RED. We set the hyperparameter $\sigma$ following the heuristic in wang2019red that state-action pairs from the expert trajectories should mostly receive rewards close to 1.

B.2 Hyperparameters

For fair comparison, all algorithms share the same hyperparameters for each task. We present them in Table 3, including the discriminator learning rate, the discount factor $\gamma$, the number of policy steps per iteration, and whether the policy has a fixed variance. All other hyperparameters are set to their default values from OpenAI's baselines.

Task         $\gamma$   Discriminator LR   Policy Steps   Fixed Variance
Hopper       0.99       0.0003             3              False
Reacher      0.99       0.0003             3              False
HalfCheetah  0.99       0.0003             3              False
Walker2d     0.99       0.0003             3              False
Ant          0.99       0.0001             3              False
Humanoid     0.99       0.0001             5              False

Table 3: Hyperparameters used for each task

Appendix C Additional Results on Lunar Lander

In the default environment, Lunar Lander contains several terminal states, including crashing, flying out of view, and landing at the goal. In the no-terminal environment, all terminal states are disabled, so that the agent must rely solely on the expert demonstrations to infer that stopping is the correct behavior upon landing.
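One way to construct such a no-terminal variant is with a small gym wrapper that masks the environment's own termination signal, so that episodes end only when a surrounding time limit expires. This is a hypothetical sketch using the older 4-tuple gym step API, not necessarily how the environment was modified in our experiments.

```python
import gym

class NoTerminalWrapper(gym.Wrapper):
    """Drops the environment's own termination signal (crash, landing, out of view)."""

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return obs, reward, False, info  # never terminate; the outer TimeLimit decides

# Usage: strip the default time limit, insert the wrapper, then re-apply a time limit.
env = gym.wrappers.TimeLimit(
    NoTerminalWrapper(gym.make("LunarLander-v2").unwrapped),
    max_episode_steps=1000,
)
```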

To compare our method with the technique of introducing a virtual absorbing state (AS) (kostrikov2018discriminator), we also construct a goal-terminal environment where the only terminal state is a successful landing at the goal, because the AS technique cannot be directly applied in the no-terminal environment. We also combine SAIL with the AS technique to demonstrate that the proposed method is complementary to existing improvements to AIL. We present the results in Table 4.

             Default           Goal-terminal     No-terminal
GAIL         258.30 ± 28.98    -7.16 ± 31.64     -69.33 ± 79.76
GAIL-b       250.53 ± 67.07    4.16 ± 107.37     169.73 ± 80.84
SAIL         257.02 ± 20.66    261.07 ± 35.66    237.96 ± 49.70
SAIL-b       262.97 ± 18.11    252.07 ± 67.22    256.83 ± 20.99
GAIL + AS    271.46 ± 11.90    110.22 ± 119.25   –
GAIL-b + AS  269.97 ± 16.48    186.02 ± 98.27    –
SAIL + AS    274.89 ± 12.82    254.58 ± 25.40    –
SAIL-b + AS  270.33 ± 15.86    258.30 ± 20.75    –
Table 4: Average environment reward and standard deviation on Lunar Lander, evaluated over 50 runs for the default, goal-terminal and no-terminal environment.