State-only Imitation with Transition Dynamics Mismatch

02/27/2020 ∙ by Tanmay Gangwani, et al. ∙ University of Illinois at Urbana-Champaign

Imitation Learning (IL) is a popular paradigm for training agents to achieve complicated goals by leveraging expert behavior, rather than dealing with the hardships of designing a correct reward function. With the environment modeled as a Markov Decision Process (MDP), most of the existing IL algorithms are contingent on the availability of expert demonstrations in the same MDP as the one in which a new imitator policy is to be learned. This is uncharacteristic of many real-life scenarios where discrepancies between the expert and the imitator MDPs are common, especially in the transition dynamics function. Furthermore, obtaining expert actions may be costly or infeasible, making the recent trend towards state-only IL (where expert demonstrations constitute only states or observations) ever so promising. Building on recent adversarial imitation approaches that are motivated by the idea of divergence minimization, we present a new state-only IL algorithm in this paper. It divides the overall optimization objective into two subproblems by introducing an indirection step and solves the subproblems iteratively. We show that our algorithm is particularly effective when there is a transition dynamics mismatch between the expert and imitator MDPs, while the baseline IL methods suffer from performance degradation. To analyze this, we construct several interesting MDPs by modifying the configuration parameters for the MuJoCo locomotion tasks from OpenAI Gym.


1 Introduction

In the Reinforcement Learning (RL) framework, the objective is to train policies that maximize a certain reward criterion. Deep-RL, which combines RL with recent advances in deep learning, has produced algorithms demonstrating remarkable success in areas such as games (Mnih et al., 2015; Silver et al., 2016), continuous control (Lillicrap et al., 2015), and robotics (Levine et al., 2016), to name a few. However, the application of these algorithms beyond controlled simulation environments has been fairly modest; one of the reasons is that manual specification of a good reward function is a hard problem. Imitation Learning (IL) algorithms (Pomerleau, 1991; Ng et al., 2000; Ziebart et al., 2008; Ho & Ermon, 2016) address this issue by replacing reward functions with expert demonstrations, which are easier to collect in most scenarios.

The conventional setting used in most of the IL literature is the availability of state-action trajectories from the expert, $\tau_E = \{s_0, a_0, s_1, a_1, \ldots\}$, collected in an environment modeled as a Markov decision process (MDP) with transition dynamics $\mathcal{T}_{exp}(s'|s,a)$. These dynamics govern the distribution over the next state, given the current state and action. The IL objective is to leverage $\tau_E$ to train an imitator policy in the same MDP as the expert. This is a severe requirement that impedes the wider applicability of IL algorithms. In many practical scenarios, the transition dynamics of the environment in which the imitator policy is learned (henceforth denoted by $\mathcal{T}_{imt}$) is different from the dynamics of the environment used to collect expert behavior, $\mathcal{T}_{exp}$. Consider self-driving cars as an example, where the goal is to learn autonomous navigation on a vehicle with slightly different gear-transmission characteristics than the vehicle used to obtain human driving demonstrations. We therefore strive for an IL method that can train agents under a transition dynamics mismatch, $\mathcal{T}_{exp} \neq \mathcal{T}_{imt}$. We assume that other MDP attributes are the same for the expert and imitator environments.

Beyond the dynamics equivalence, another assumption commonly used in the IL literature is the availability of expert actions (along with the states). A few recent works (Torabi et al., 2018a, b; Sun et al., 2019) have proposed “state-only” IL algorithms, where expert demonstrations do not include the actions. This opens up the possibility of employing IL in situations such as kinesthetic teaching in robotics and learning from weak-supervision sources such as videos. Moreover, if $\mathcal{T}_{exp}$ and $\mathcal{T}_{imt}$ differ, then the expert actions, even if available, are not particularly useful for imitation, since applying an expert action from a given state leads to different next-state distributions for the expert and the imitator. Hence, our algorithm uses state-only expert demonstrations.

We build on previous IL literature inspired by GAN-based adversarial learning: GAIL (Ho & Ermon, 2016) and AIRL (Fu et al., 2017). In both these methods, the objective is to minimize the distance between the state-action visitation distributions induced by the policy and the expert, $\rho_\pi$ and $\rho_E$, under a suitable metric $D(\rho_\pi, \rho_E)$, such as the Jensen-Shannon divergence. We classify GAIL and AIRL as direct imitation methods, as they directly reduce $D(\rho_\pi, \rho_E)$. Different from these, we propose an indirect imitation approach which introduces another distribution $\rho_{\pi'}$ as an intermediate or indirection step. In slightly more detail, starting with the Max-Entropy Inverse-RL objective (Ziebart et al., 2008), we derive a lower bound which transforms the overall IL problem into two sub-parts that are solved iteratively: the first is to train a policy to imitate a distribution represented by a trajectory buffer, and the second is to move the buffer distribution closer to the expert's ($\rho_E$) over the course of training. The first part, policy imitation by reducing $D(\rho_\pi, \rho_{\pi'})$, is done with AIRL, while the second part, reducing the distance between $\rho_{\pi'}$ and $\rho_E$, is achieved using a Wasserstein critic (Arjovsky et al., 2017). We abbreviate our approach as I2L, for indirect imitation learning.

We test the efficacy of our algorithm on continuous-control locomotion tasks from MuJoCo. Figure 1(a) depicts one example of the dynamics mismatch which we evaluate in our experiments. For the Ant agent, an expert walking policy is trained under the default dynamics provided in OpenAI Gym, $\mathcal{T}_{exp}$. The dynamics under which the imitator policy is learned, $\mathcal{T}_{imt}$, are curated by modifying the gravity parameter to half its default value. Figure 1(b) plots the average episodic returns of the expert policy in the original and modified environments, showing that direct policy transfer is ineffective. For Figure 1(c), we assume access only to state-only expert demonstrations from $\mathcal{T}_{exp}$, and do IL with the GAIL algorithm. GAIL performs well if the imitator policy is learned in the same environment as the expert ($\mathcal{T}_{imt} = \mathcal{T}_{exp}$), but does not succeed under mismatched transition dynamics ($\mathcal{T}_{imt} \neq \mathcal{T}_{exp}$). In our experiments section, we consider other sources of dynamics mismatch as well, such as agent density and joint friction. We show that I2L trains much better policies than baseline IL algorithms in these tasks, leading to successful transfer of expert skills to an imitator in an environment dissimilar to the expert's.

We start by reviewing the relevant background on Max-Entropy IRL, GAIL and AIRL, since these methods form an integral part of our overall algorithm.

Figure 1: (a) A different amount of gravitational pull is one example of transition dynamics mismatch between the expert and the imitator MDPs. (b) An expert policy trained in $\mathcal{T}_{exp}$ transfers poorly to an environment with dissimilar dynamics (gravity scaled to half). (c) IL performance with GAIL degrades when $\mathcal{T}_{imt} \neq \mathcal{T}_{exp}$, compared to the conventional IL setting of imitating in the same environment as the expert.

2 Background

An RL environment modeled as an MDP is characterized by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, r, \gamma)$, where $\mathcal{S}$ is the state-space and $\mathcal{A}$ is the action-space. Given an action $a_t$, the next state is governed by the transition dynamics $s_{t+1} \sim \mathcal{T}(\cdot \mid s_t, a_t)$, and the reward is computed as $r_t = r(s_t, a_t)$. The RL objective is to maximize the expected discounted sum of rewards, $\eta(\pi) = \mathbb{E}_{\pi}\big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\big]$, where $\gamma$ is the discount factor and $s_0 \sim \rho_0(s)$ is the initial state distribution. We define the unnormalized $\gamma$-discounted state-visitation distribution for a policy $\pi$ by $\rho_\pi(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi)$, where $P(s_t = s \mid \pi)$ is the probability of being in state $s$ at time $t$, when following policy $\pi$ from a starting state $s_0 \sim \rho_0$. The expected policy return can then be written as $\eta(\pi) = \mathbb{E}_{\rho_\pi(s,a)}[r(s,a)]$, where $\rho_\pi(s, a) = \rho_\pi(s)\,\pi(a \mid s)$ is the state-action visitation distribution (also referred to as the occupancy measure). For any policy $\pi$, there is a one-to-one correspondence between $\pi$ and its occupancy measure (Puterman, 1994).
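To make the visitation quantities concrete, here is a small NumPy sketch (ours, not from the paper) that estimates the unnormalized $\gamma$-discounted state-visitation weights from sampled rollouts by accumulating $\gamma^t$ for each visited state; the discrete, hashable states are purely illustrative.

import numpy as np
from collections import defaultdict

def discounted_state_visitation(trajectories, gamma=0.99):
    """Estimate the (unnormalized) gamma-discounted state-visitation weights
    from sampled rollouts: rho(s) ~ sum_t gamma^t * 1[s_t == s].
    `trajectories` is a list of state sequences (hashable toy states here;
    real continuous states would be binned or featurized)."""
    rho = defaultdict(float)
    for states in trajectories:
        for t, s in enumerate(states):
            rho[s] += gamma ** t
    # Average over trajectories so the estimate does not grow with more data.
    n = max(len(trajectories), 1)
    return {s: w / n for s, w in rho.items()}

# Toy usage with two short rollouts over discrete states.
rollouts = [["s0", "s1", "s2"], ["s0", "s2", "s2"]]
print(discounted_state_visitation(rollouts, gamma=0.9))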

2.1 Maximum Entropy IRL

Designing reward functions that adequately capture task intentions is a laborious and error-prone procedure. An alternative is to train agents to solve a particular task by leveraging demonstrations of that task by experts. Inverse Reinforcement Learning (IRL) algorithms (Ng et al., 2000; Russell, 1998) aim to infer the reward function from expert demonstrations, and then use it for RL or planning. The IRL problem, however, has an inherent ambiguity, since many expert policies could explain a set of provided demonstrations. To resolve this, Ziebart (2010) proposed the Maximum Causal Entropy (MaxEnt) IRL framework, where the objective is to learn a reward function such that the resulting policy matches the provided expert demonstrations in the expected feature counts $\phi(s,a)$, while being as random as possible:

$\max_{\pi} \; H^{\gamma}(\pi) \quad \text{s.t.} \quad \mathbb{E}_{\rho_\pi(s,a)}\big[\phi(s,a)\big] = \hat{\phi}_E$

where $H^{\gamma}(\pi)$ is the $\gamma$-discounted causal entropy, and $\hat{\phi}_E$ denotes the empirical feature counts of the expert. This constrained optimization problem is solved by minimizing the Lagrangian dual, resulting in the maximum entropy policy $\pi(a|s) = \exp\big(Q^{soft}_{\theta}(s,a) - V^{soft}_{\theta}(s)\big)$, where $\theta$ is the Lagrangian multiplier on the feature-matching constraint, and $Q^{soft}_{\theta}, V^{soft}_{\theta}$ are the soft value functions such that the following equations hold (please see Theorem 6.8 in Ziebart (2010)):

$Q^{soft}_{\theta}(s_t, a_t) = \theta^{\top}\phi(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}\sim\mathcal{T}}\big[V^{soft}_{\theta}(s_{t+1})\big], \qquad V^{soft}_{\theta}(s_t) = \log \sum_{a} \exp\big(Q^{soft}_{\theta}(s_t, a)\big)$

Inspired by the energy-based formulation of the maximum entropy policy described above, $\pi(a|s) \propto \exp\big(Q^{soft}(s,a)\big)$, recent methods (Finn et al., 2016; Haarnoja et al., 2017; Fu et al., 2017) have proposed to model complex, multi-modal action distributions using energy-based policies, $\pi_\theta(a|s) \propto \exp\big(f_\theta(s,a)\big)$, where $f_\theta$ is represented by a universal function approximator, such as a deep neural network. We can then interpret the IRL problem as a maximum likelihood estimation problem:

$\max_\theta \; \mathbb{E}_{\tau \sim \pi_E}\big[\log p_\theta(\tau)\big], \quad \text{where} \quad p_\theta(\tau) \propto \rho_0(s_0) \prod_{t} \mathcal{T}(s_{t+1} \mid s_t, a_t)\, \exp\big(\gamma^t f_\theta(s_t, a_t)\big) \qquad (1)$

2.2 Adversarial IRL

An important implication of casting IRL as maximum likelihood estimation is that it connects IRL to adversarial training. We now briefly discuss AIRL (Fu et al., 2017), since it forms a component of our proposed algorithm. AIRL builds on GAIL (Ho & Ermon, 2016), a well-known adversarial imitation learning algorithm. GAIL frames IL as an occupancy-measure matching (or divergence minimization) problem. Let $\rho_\pi$ and $\rho_E$ represent the state-action visitation distributions of the policy and the expert, respectively. Minimizing the Jensen-Shannon divergence $D_{JS}(\rho_\pi, \rho_E)$ recovers a policy with a trajectory distribution similar to the expert's. GAIL iteratively trains a policy ($\pi_\phi$) and a discriminator ($D_\theta$) to optimize a min-max objective similar to GANs (Goodfellow et al., 2014):

$\min_{\pi_\phi} \max_{D_\theta} \;\; \mathbb{E}_{(s,a)\sim\rho_E}\big[\log D_\theta(s,a)\big] + \mathbb{E}_{(s,a)\sim\rho_{\pi_\phi}}\big[\log\big(1 - D_\theta(s,a)\big)\big] \qquad (2)$

GAIL attempts to learn a policy that behaves similarly to the expert demonstrations, but it bypasses the process of recovering the expert reward function. Finn et al. (2016) showed that imposing a special structure on the discriminator makes the adversarial GAN training equivalent to optimizing the MLE objective (Equation 1). Furthermore, it is shown that, if trained to optimality, the expert reward (up to a constant) can be recovered from the discriminator. They operate in a trajectory-centric formulation, which can be inefficient for high-dimensional state- and action-spaces. Fu et al. (2017) present AIRL, which remedies this by proposing an analogous structure for the discriminator, but operating on a single state-action pair:

$D_\theta(s, a) = \frac{\exp\big(f_\theta(s,a)\big)}{\exp\big(f_\theta(s,a)\big) + \pi(a|s)} \qquad (3)$

Similar to GAIL, the discriminator is trained to maximize the objective in Equation 2; $f_\theta$ is learned, whereas the value of $\pi(a|s)$ is “filled in”. The policy is optimized jointly using any RL algorithm with $\log D_\theta(s,a) - \log\big(1 - D_\theta(s,a)\big)$ as rewards. When trained to optimality, $f^{*}(s,a) = \log \pi_E(a|s) = A^{soft}_{E}(s,a)$; hence $f_\theta$ recovers the soft advantage of the expert policy (up to a constant).
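For illustration, a minimal PyTorch sketch of the discriminator structure in Equation 3 and the induced reward follows. It is our own sketch (network sizes and names such as f_net are assumptions), not the authors' implementation, and it relies only on the identity $\log D - \log(1 - D) = f_\theta(s,a) - \log\pi(a|s)$.

import torch
import torch.nn as nn

class AIRLDiscriminator(nn.Module):
    """D(s,a) = exp(f(s,a)) / (exp(f(s,a)) + pi(a|s)), as in Equation 3."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.f_net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def logits(self, obs, act, log_pi):
        # log D - log(1 - D) = f(s,a) - log pi(a|s); using this quantity as the
        # classifier logit is numerically stable.
        f = self.f_net(torch.cat([obs, act], dim=-1)).squeeze(-1)
        return f - log_pi

    def reward(self, obs, act, log_pi):
        # Reward handed to the RL step: log D - log(1 - D) = f(s,a) - log pi(a|s).
        with torch.no_grad():
            return self.logits(obs, act, log_pi)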

2.3 State-only Imitation

State-only IL algorithms extend the scope of applicability of IL by relieving the need for expert actions in the demonstrations. The original GAIL approach can be modified to work in the absence of actions. Specifically, Equation 2 can be altered to use a state-dependent discriminator $D_\theta(s)$ and state-visitation (instead of state-action-visitation) distributions $\rho_\pi(s)$ and $\rho_E(s)$. The AIRL algorithm, however, requires expert actions due to the special structure enforced on the discriminator (Equation 3), making it incompatible with state-only IL. This is because, even though $f_\theta$ could potentially be made a function of only the state $s$, actions are still needed for the “filled in” $\pi(a|s)$ component. Inspired by GAIL, Torabi et al. (2018b) proposed GAIfO for state-only IL. The motivation is to train the imitator to perform actions that have similar effects in the environment, rather than mimicking the exact expert actions. Algorithmically, GAIL is modified to make the discriminator a function of state transitions, $D_\theta(s, s')$, and to use state-transition distributions $\rho_\pi(s, s')$ and $\rho_E(s, s')$.
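The state-only variants change only the discriminator's input. The short sketch below (our illustration; dimensions are hypothetical) contrasts the two choices: GAIL-S scores a single state, while GAIfO scores a state transition $(s, s')$.

import torch
import torch.nn as nn

def mlp(in_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                         nn.Linear(hidden, 1))

obs_dim = 11                          # hypothetical observation size
disc_gail_s = mlp(obs_dim)            # GAIL-S: D(s), states only
disc_gaifo  = mlp(2 * obs_dim)        # GAIfO: D(s, s'), transitions concatenated

s, s_next = torch.randn(32, obs_dim), torch.randn(32, obs_dim)
logits_s  = disc_gail_s(s)                                # scores states
logits_ss = disc_gaifo(torch.cat([s, s_next], dim=-1))    # scores transitions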

3 Indirect Imitation Learning (I2L)

We now detail our I2L algorithm, which alters the standard IL routine (used by GAIL, AIRL) by introducing an intermediate or indirection step, through a new distribution represented by a trajectory buffer. For this section, we ignore the properties of the transition dynamics for the expert and the imitator MDPs ($\mathcal{T}_{exp}$, $\mathcal{T}_{imt}$); they can be the same or different, and I2L has no specific dependence on this. $\tau$ denotes a trajectory, which is a sequence of state-action pairs, $\tau = \{s_0, a_0, s_1, a_1, \ldots\}$. We begin with the expert's (unknown) trajectory distribution, although our final algorithm works with state-only expert demonstrations.

Let the trajectory distribution induced by the expert be $q_E(\tau)$, and its state-action visitation distribution be $\rho_E(s,a)$. Using the parameterization from Equation 1, the likelihood objective to maximize for reward learning in MaxEnt-IRL can be written as (ignoring constants w.r.t. $\theta$):

$\mathcal{L}(\theta) \;=\; \mathbb{E}_{(s,a)\sim\rho_E}\big[f_\theta(s,a)\big] \;-\; \log Z_\theta \qquad (4)$

where $Z_\theta$ is the partition function of the trajectory distribution $p_\theta(\tau)$.

As alluded to in Sections 2.2 and 2.3, if expert actions were available, one could optimize this objective for $\theta$ by solving an equivalent adversarial min-max problem, as done in AIRL. To handle state-only IL, we instead derive a lower bound to this objective and optimize that. Let there be a surrogate policy $\pi'$ with a state-action visitation distribution $\rho_{\pi'}(s,a)$. The following proposition provides a lower bound to the likelihood objective in Equation 4.

Under mild assumptions of Lipschitz continuity of the function $f_\theta$, we have that for two different state-action distributions $\rho_E$ and $\rho_{\pi'}$,

$\mathbb{E}_{(s,a)\sim\rho_E}\big[f_\theta(s,a)\big] \;\ge\; \mathbb{E}_{(s,a)\sim\rho_{\pi'}}\big[f_\theta(s,a)\big] \;-\; K \cdot W_1(\rho_E, \rho_{\pi'})$

where $K$ is the Lipschitz constant, and $W_1$ is the 1-Wasserstein (or Earth Mover's) distance between the state-action distributions.

Proof.

Let $x = (s, a)$ denote the concatenation of state and action. Under the Lipschitz continuity assumption for $f_\theta$, for any two inputs $x_1$ and $x_2$, we have

$\big|f_\theta(x_1) - f_\theta(x_2)\big| \;\le\; K\, \lVert x_1 - x_2 \rVert$

Let $\mu(x_1, x_2)$ be any joint distribution (coupling) over the random variables representing the two inputs, such that the marginals are $\rho_E(x_1)$ and $\rho_{\pi'}(x_2)$. Taking expectations w.r.t. $\mu$ on both sides, we get

$\Big|\mathbb{E}_{\rho_E}\big[f_\theta(x)\big] - \mathbb{E}_{\rho_{\pi'}}\big[f_\theta(x)\big]\Big| \;\le\; K\, \mathbb{E}_{(x_1, x_2)\sim\mu}\big[\lVert x_1 - x_2 \rVert\big]$

Since the above inequality holds for any coupling $\mu$, it also holds for the infimum over couplings, which gives us the 1-Wasserstein distance:

$\Big|\mathbb{E}_{\rho_E}\big[f_\theta(x)\big] - \mathbb{E}_{\rho_{\pi'}}\big[f_\theta(x)\big]\Big| \;\le\; K\, W_1(\rho_E, \rho_{\pi'})$

Rearranging terms completes the proof.
We can therefore lower bound the likelihood objective (Equation 4) as:

$\mathcal{L}(\theta) \;\ge\; \mathbb{E}_{(s,a)\sim\rho_{\pi'}}\big[f_\theta(s,a)\big] - \log Z_\theta - K\, W_1(\rho_E, \rho_{\pi'}) \;=\; \mathbb{E}_{\tau \sim q_{\pi'}}\big[\log p_\theta(\tau)\big] - K\, W_1(\rho_E, \rho_{\pi'})$

where $q_{\pi'}$ is the trajectory distribution induced by the surrogate policy $\pi'$ (and constants w.r.t. $\theta$ are again ignored). Since the original optimization (Equation 1) is infeasible under the AIRL framework in the absence of expert actions, we instead maximize the lower bound, which amounts to solving the surrogate problem:

$\max_\theta \; \max_{\pi'} \;\; \mathbb{E}_{\tau \sim q_{\pi'}}\big[\log p_\theta(\tau)\big] \;-\; K\, W_1\big(\rho_E(s,a),\, \rho_{\pi'}(s,a)\big) \qquad (5)$

This objective can be intuitively understood as follows. Optimizing w.r.t. $\theta$ recovers the reward (or soft advantage) function of the surrogate policy $\pi'$, in the same spirit as MaxEnt-IRL. Optimizing w.r.t. $\pi'$ brings the state-action distribution of $\pi'$ closer (in the 1-Wasserstein metric) to the expert's, along with a bias term that increases the log-likelihood of trajectories from $\pi'$ under the current reward model $f_\theta$. We now detail the practical implementation of these optimizations.
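For reference, the alternating scheme suggested by Equation 5 (and implemented by Algorithm 1) can be summarized as the following two sub-problems; this is our compact restatement of the text above, with the buffer $\mathcal{B}$ standing in for $\pi'$:

(i) Reward and imitation step (AIRL, with $\mathcal{B}$ in the role of the expert): $\max_{\theta} \; \mathbb{E}_{\tau \sim \mathcal{B}}\big[\log p_\theta(\tau)\big]$, while the learner policy $\pi$ is trained on rewards $\log D_\theta - \log(1 - D_\theta)$.

(ii) Buffer (surrogate) update, moving $\pi'$ toward the expert: $\min_{\mathcal{B}} \; W_1\big(\rho_E(s),\, \rho_{\pi'}(s)\big)$.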

Surrogate policy. We do not use a separate explicit parameterization for $\pi'$. Instead, $\pi'$ is implicitly represented by a buffer $\mathcal{B}$ with a fixed capacity of $n$ trajectories ($n = 5$ in all our experiments). In this way, $\pi'$ can be viewed as a mixture of deterministic policies, each representing a delta distribution in trajectory space. $\mathcal{B}$ is akin to experience replay (Lin, 1992), in that it is filled with trajectories generated from the agent's interaction with the environment during the learning process. The crucial difference is that inclusion in $\mathcal{B}$ is governed by a priority-based protocol (explained below). Optimization w.r.t. $\theta$ can now be done using adversarial training (AIRL), since the surrogate policy actions are available in $\mathcal{B}$. Following Equation 3, the objective for the discriminator is:

$\max_\theta \;\; \mathbb{E}_{(s,a)\sim\mathcal{B}}\big[\log D_\theta(s,a)\big] + \mathbb{E}_{(s,a)\sim\rho_{\pi}}\big[\log\big(1 - D_\theta(s,a)\big)\big], \quad D_\theta(s,a) = \frac{\exp\big(f_\theta(s,a)\big)}{\exp\big(f_\theta(s,a)\big) + \pi(a|s)} \qquad (6)$

where $\pi$ is the learner (imitator) policy that is trained with $\log D_\theta(s,a) - \log\big(1 - D_\theta(s,a)\big)$ as rewards.
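A sketch of one discriminator update for Equation 6 is given below. It is our illustration (the AIRLDiscriminator interface is the sketch from Section 2.2, and the batches are assumed to be (obs, act, log_pi) tuples, with log_pi evaluated under the current learner policy), with buffer samples taking the place usually occupied by expert samples.

import torch
import torch.nn.functional as F

def airl_disc_update(disc, opt, buf_batch, pol_batch):
    """One gradient step on Equation 6: buffer tuples are labeled 1 (the
    'expert' side), learner-policy tuples are labeled 0."""
    logit_buf = disc.logits(*buf_batch)   # (obs, act, log_pi) from buffer B
    logit_pol = disc.logits(*pol_batch)   # (obs, act, log_pi) from learner pi
    loss = F.binary_cross_entropy_with_logits(
        logit_buf, torch.ones_like(logit_buf)) + \
           F.binary_cross_entropy_with_logits(
        logit_pol, torch.zeros_like(logit_pol))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()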

Optimizing $\pi'$. Since $\pi'$ is characterized by the state-action tuples in the buffer $\mathcal{B}$, updating $\pi'$ amounts to refreshing the trajectories in $\mathcal{B}$. For the sake of simplicity, when updating $\pi'$ in Equation 5, we only consider the Wasserstein distance term and ignore the other bias term. Note that $\rho_E$ and $\rho_{\pi'}$ denote the state-action visitation distributions of the expert and the surrogate, respectively. Since we have state-only demonstrations from the expert (no expert actions), we minimize the Wasserstein distance between state visitations, rather than state-action visitations. Following the approach in WGANs (Arjovsky et al., 2017), we estimate $W_1$ using the Kantorovich-Rubinstein duality and train a critic network $g_w$ with a Lipschitz continuity constraint:

$W_1\big(\rho_E(s),\, \rho_{\pi'}(s)\big) \;=\; \sup_{\lVert g \rVert_L \le 1} \; \mathbb{E}_{s\sim\rho_E}\big[g(s)\big] - \mathbb{E}_{s\sim\rho_{\pi'}}\big[g(s)\big] \qquad (7)$

The empirical estimate of the first expectation term is computed with the states in the provided expert demonstrations; for the second term, the states in $\mathcal{B}$ are used.

Figure 2: Environments for training an imitator policy are obtained by changing the default Gym configuration settings, one at a time.

Table 1: Average episodic returns when $\mathcal{T}_{imt} = \mathcal{T}_{exp}$.
Environment     GAIL-S   I2L    Expert (Traj. Return)
Walker2d        3711     4107   6200
Hopper          2130     2751   3700
Ant             3217     3320   4800
Half-Cheetah    5974     5240   7500
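A minimal WGAN-style critic update for Equation 7 might look as follows. This is our sketch, using weight clipping for the Lipschitz constraint as in Arjovsky et al. (2017); the observation dimension and the clip range are illustrative assumptions, while the optimizer and learning rate follow Appendix 7.2.

import torch
import torch.nn as nn

obs_dim = 11  # hypothetical observation size
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                       nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

def critic_step(expert_states, buffer_states, clip=0.01):
    """Maximize E_expert[g(s)] - E_buffer[g(s)]; clip weights as a crude
    Lipschitz constraint (clip range is an assumed hyper-parameter)."""
    loss = critic(buffer_states).mean() - critic(expert_states).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-clip, clip)
    return -loss.item()   # current estimate of the Wasserstein objective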

1:  Networks: Policy ($\pi_\phi$), Discriminator ($D_\theta$), Wasserstein critic ($g_w$)
2:  $\mathcal{B} \leftarrow$ empty buffer
3:  $\tau_E \leftarrow$ state-only expert demonstration
4:  for each iteration do
5:      Run $\pi_\phi$ in the environment and collect a few trajectories $\{\tau_i\}$
6:      Update the Wasserstein critic $g_w$ using $\tau_E$ and $\mathcal{B}$   /* Equation 7 */
7:      Obtain a trajectory score for each $\tau_i$ using $g_w$
8:      Add each $\tau_i$ to $\mathcal{B}$ with the priority-based protocol, using the score as priority
9:      Update the AIRL discriminator $D_\theta$ using $\{\tau_i\}$ and $\mathcal{B}$   /* Equation 6 */
10:     Update the policy $\pi_\phi$ with PPO, using $\log D_\theta - \log(1 - D_\theta)$ as rewards
11: end for
Algorithm 1: Indirect Imitation Learning (I2L)

With the trained critic $g_w$, we obtain a score for each trajectory $\tau$ generated by the agent. The score is calculated as $\frac{1}{|\tau|}\sum_{s \in \tau} g_w(s)$, where $|\tau|$ is the length of the trajectory. Our buffer $\mathcal{B}$ is a priority-queue structure over a fixed number of trajectories, the priority value being the score of the trajectory. This way, over the course of training, $\mathcal{B}$ is only updated with trajectories that have higher scores, and by construction of the score function, these trajectories are closer to the expert's in terms of the Wasserstein metric. Further details on the update algorithm for the buffer and its alignment with the Wasserstein distance minimization objective are provided in Appendix 7.3.
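Concretely, the score is just the critic averaged over a trajectory's states; a small sketch (reusing the critic network from the previous snippet, with the states given as a tensor):

import torch

def trajectory_score(critic, states):
    """Score of a trajectory: (1/T) * sum_t g_w(s_t). Higher scores mean the
    trajectory's states are rated closer to the expert's by the current critic."""
    with torch.no_grad():
        return critic(states).mean().item()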

Algorithm. The major steps of the training procedure are outlined in Algorithm 1. The policy parameters ($\phi$) are updated with the clipped-ratio version of PPO (Schulman et al., 2017). State-value function baselines and GAE (Schulman et al., 2015) are used for reducing the variance of the estimated policy gradients. The priority buffer $\mathcal{B}$ uses the heap-queue algorithm (Appendix 7.3). The Lipschitz constant $K$ in Equation 5 is unknown and task-dependent. If $f_\theta$ is fairly smooth, $K$ is a small constant that can be treated as a hyper-parameter and absorbed into the learning rate. Please see Appendix 7.2 for details on the hyper-parameters.
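Putting the pieces together, one iteration of Algorithm 1 can be sketched as below. This is a pseudocode-level sketch with hypothetical interfaces: collect_trajectories, make_batch, ppo_update and the buffer methods stand in for the PPO/GAE machinery and the heap-based buffer of Appendix 7.3.

def i2l_iteration(env, policy, disc, disc_opt, critic, buffer, expert_states,
                  collect_trajectories, critic_step, trajectory_score,
                  airl_disc_update, make_batch, ppo_update):
    """One iteration of Algorithm 1. All components are passed in explicitly,
    and their interfaces are hypothetical, mirroring the sketches above."""
    trajs = collect_trajectories(env, policy)                        # Line 5
    critic_step(expert_states, buffer.states())                      # Line 6 (Equation 7)
    buffer.refresh_scores(lambda t: trajectory_score(critic, t.states))
    for tau in trajs:                                                # Lines 7-8
        buffer.maybe_add(tau, trajectory_score(critic, tau.states))  # priority = score
    airl_disc_update(disc, disc_opt,                                 # Line 9 (Equation 6)
                     make_batch(buffer.trajectories()), make_batch(trajs))
    rewards = [disc.reward(t.obs, t.act, t.log_pi) for t in trajs]
    return ppo_update(policy, trajs, rewards)                        # Line 10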

4 Related work

There is an extensive amount of literature on IL with state-action expert demonstrations, and also on integrating IL and RL to bootstrap learning (Billard et al., 2008; Argall et al., 2009). Our work is most closely related to state-only IL and adversarial Inverse-RL methods discussed in Section 2. Here, we mention other related prior literature. BCO (Torabi et al., 2018a) is a state-only IL approach that learns an inverse dynamics model by running a random exploration policy. The inverse model is then applied to infer actions from the state-only demonstrations, which in turn are used for imitation via Behavioral Cloning, making the approach vulnerable to the well-known issue of compounding errors (Ross et al., 2011). Kimura et al. (2018) learn an internal model on state-only demonstrations; the imitator policy is then trained with RL using rewards derived from the model. Imitation under a domain shift has been considered in Stadie et al. (2017); Liu et al. (2018). These methods incorporate raw images as observations and are designed to handle differences in context (such as viewpoints, visual appearance, object positions, surroundings) between the expert and the imitator environments. Gupta et al. (2017) propose learning invariant feature mappings to transfer skills from an expert to an imitator with a different morphology. However, the reward function for such a transfer is contingent on the assumption of time-alignment in episodic tasks. In our Algorithm 1, the adversarial training between the policy and buffer trajectories (AIRL, Line 9) bears some resemblance to the adversarial self-imitation approaches in (Guo et al., 2018; Gangwani et al., 2018). Those self-imitation methods are applicable for RL from sparse rewards, while our focus is IL from expert behavior, under transition dynamics mismatch.

5 Experiments

In this section, we compare the performance of I2L to the baseline methods for state-only IL from Section 2.3, namely GAIL with a state-dependent discriminator, denoted GAIL-S, and GAIfO (Torabi et al., 2018b). We evaluate by modifying the continuous-control locomotion tasks from MuJoCo to introduce various types of transition dynamics mismatch between the expert and the imitator MDPs ($\mathcal{T}_{exp} \neq \mathcal{T}_{imt}$). Other aspects of the MDP ($\mathcal{S}$, $\mathcal{A}$, $r$, $\gamma$) are assumed to be the same (since state-only IL does not depend on expert actions, $\mathcal{A}$ could also be made different between the MDPs without requiring any modifications to the algorithm). We therefore use dynamics and MDP interchangeably in this section. While the expert demonstrations are collected under the default configurations provided in OpenAI Gym, we construct the environments for the imitator by changing some parameters independently: a) the gravity in $\mathcal{T}_{imt}$ is half the gravity in $\mathcal{T}_{exp}$, b) the density of the bot in $\mathcal{T}_{imt}$ is higher than the density in $\mathcal{T}_{exp}$, and c) the friction coefficient on all the joints of the bot in $\mathcal{T}_{imt}$ is higher than the coefficient in $\mathcal{T}_{exp}$. Figure 2 provides a visual. For all our experiments and tasks, we assume a single state-only expert demonstration of length 1000. We do not assume any access to the expert MDP beyond this.

Figure 3: Training progress for I2L and GAIL-S when the imitator and expert MDPs differ in the configuration of the gravity parameter. The gravity in $\mathcal{T}_{imt}$ is half the gravity in $\mathcal{T}_{exp}$.
Figure 4: Training progress for I2L and GAIL-S when the imitator and expert MDPs differ in the configuration of the density parameter. The density of the bot in $\mathcal{T}_{imt}$ is higher than the density in $\mathcal{T}_{exp}$.

Performance when $\mathcal{T}_{imt} = \mathcal{T}_{exp}$. Table 1 shows the average episodic returns for a policy trained for 5M timesteps using GAIL-S and I2L in the standard IL setting. The policy learning curves are included in Appendix 7.1. All our experiments average 8 independent runs with random seeds. Both algorithms work fairly well in this scenario, though I2L achieves higher scores in 3 out of 4 tasks. These numbers serve as a benchmark when we evaluate performance under transition dynamics mismatch. The table also contains the expert demonstration score for each task.

Performance when $\mathcal{T}_{imt} \neq \mathcal{T}_{exp}$. Figures 3, 4 and 5 plot the training progress (mean and standard deviation) with GAIL-S and I2L under mismatched transition dynamics with the low-gravity, high-density and high-friction settings, respectively, as described above. We observe that I2L achieves faster learning and higher final scores than GAIL-S in most of the situations. GAIL-S degrades severely in some cases. For instance, for Half-Cheetah under high density, GAIL-S drops to 923 (compared to 5974 with no dynamics change, Table 1), while I2L attains a score of 3975 (compared to 5240 with no dynamics change). Similarly, with Hopper under high friction, the GAIL-S score reduces to 810 (from 2130 with no dynamics change), while the I2L score is 2084 (2751 with no dynamics change). The plots also indicate the final average performance achieved using the original GAIL (marked as GAIL-SA) and AIRL algorithms. Both of these methods require extra supervision in the form of expert actions. Even so, they generally perform worse than I2L, which can be attributed to the fact that expert actions generated in $\mathcal{T}_{exp}$ are not very useful when the dynamics shift to $\mathcal{T}_{imt}$.

Figure 5: Training progress for I2L and GAIL-S when the imitator and expert MDPs differ in the configuration of the friction parameter. The friction coefficient on all the joints of the bot in $\mathcal{T}_{imt}$ is higher than the coefficient in $\mathcal{T}_{exp}$.

Figure 6: Ablation on the capacity of buffer $\mathcal{B}$, using the low-gravity Half-Cheetah environment.

Table 2: Comparing the performance of I2L with GAIfO (Torabi et al., 2018b), a state-only IL baseline.

              No dynamics mismatch                      Low gravity
        HalfCheetah  Walker2d  Hopper  Ant      HalfCheetah  Walker2d  Hopper  Ant
GAIfO   5082         3122      2121    3452     1518         2995      1683    594
I2L     5240         4107      2751    3320     4155         3547      2566    1617

              High density                              High friction
        HalfCheetah  Walker2d  Hopper  Ant      HalfCheetah  Walker2d  Hopper  Ant
GAIfO   -234         378       440     3667     2883         3858      876     380
I2L     3975         1988      1999    3319     5554         3825      2084    1145

Comparison with GAIfO baseline. GAIfO (Torabi et al., 2018b) is a recent state-only IL method which we discuss in Section 2.3. Table 2 contrasts the performance of I2L with GAIfO for imitation tasks both with and without transition dynamics mismatch. We find GAIfO to be in the same ballpark as GAIL-S. It can learn good imitation policies if the dynamics are the same between the expert and the imitator, but loses performance with mismatched dynamics. Learning curves for GAIfO are included in Appendix 7.7. Furthermore, in Appendix 7.6, we compare to BCO (Torabi et al., 2018a).

Ablation on buffer capacity. Algorithm 1 uses a priority-queue buffer $\mathcal{B}$ with a fixed number of trajectories to represent the surrogate state-action visitation $\rho_{\pi'}$. All our experiments up to this point fixed the buffer capacity to 5 trajectories. To gauge the sensitivity of our approach to the capacity $n$, we ablate on it and report the results in Figure 6. The experiment is done with the low-gravity Half-Cheetah environment. We observe that the performance of I2L is fairly robust to $n$. Surprisingly, even a capacity of 1 trajectory works well, and having a larger buffer also does not hurt performance much. The GAIL-S baseline on the same task is included for comparison.

Empirical measurements of the lower-bound and Wasserstein approximations. Section 3 introduces a lower bound on the expected value of a function under the expert’s state-action visitation. In Appendix 7.4, we analyze the quality of the lower bound by plotting the approximation-gap for the different distributions obtained during training. We observe that the gap generally reduces. Finally, in Appendix 7.5, we plot the empirical estimate of the Wasserstein distance between the state-visitations of the buffer distribution and the expert, and note that this value also typically decreases over the training iterations.

6 Conclusion

In this paper, we presented I2L, an indirect imitation-learning approach that utilizes state-only expert demonstrations collected in the expert MDP to train an imitator policy in an MDP with a dissimilar transition dynamics function. We derived a lower bound to the MaxEnt-IRL objective that transforms it into two subproblems. We then provided a practical algorithm that trains a policy to imitate the distribution characterized by a trajectory buffer using AIRL, whilst reducing the Wasserstein distance between the state-visitations of the buffer and the expert over the course of training. Our experiments on a variety of MuJoCo-based MDPs indicate that I2L is an effective mechanism for successful skill transfer from the expert to the imitator, especially under mismatched transition dynamics.

References

7 Appendix

7.1 Performance when $\mathcal{T}_{imt} = \mathcal{T}_{exp}$

Figure 7: Training progress for I2L and GAIL-S when the imitator and expert MDPs are the same.

7.2 Hyper-parameters

Hyper-parameter Value
Wasserstein critic network 3 layers, 64 hidden, tanh
Discriminator network 3 layers, 64 hidden, tanh
Policy network 3 layers, 64 hidden, tanh
Wasserstein critic optimizer, lr, gradient-steps RMS-Prop, 5e-5, 20
Discriminator optimizer, lr, gradient-steps Adam, 3e-4, 5
Policy algorithm, lr PPO (clipped ratio), 1e-4
Number of state-only expert demonstrations 1 (1000 states)
Buffer capacity 5 trajectories
$\gamma$, $\lambda$ (GAE) 0.99, 0.95

7.3 Further details on buffer and the update mechanism

The buffer $\mathcal{B}$ is a priority-queue structure with a fixed capacity of $n$ trajectories ($n = 5$ in our experiments). Each trajectory is a set of $(s, a)$ tuples. Denote the trajectories by $\{\tau_1, \ldots, \tau_n\}$, and let $S_i$ be the collection of states in trajectory $\tau_i$. The buffer characterizes the surrogate policy $\pi'$ defined in Section 3. The state-visitation distribution of $\pi'$ can then be written as:

$\rho_{\pi'}(s) \;=\; \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|S_i|} \sum_{\hat{s} \in S_i} \delta(s - \hat{s}) \qquad (8)$

where $\delta$ denotes the delta measure. Following Equation 7, our objective for optimizing $\pi'$ is:

$\min_{\rho_{\pi'}} \; \sup_{\lVert g \rVert_L \le 1} \; \mathbb{E}_{s\sim\rho_E}\big[g(s)\big] - \mathbb{E}_{s\sim\rho_{\pi'}}\big[g(s)\big]$

This min-max objective is optimized using an iterative algorithm. The Wasserstein critic update is done with standard gradient descent using state samples from the expert demonstrations and the buffer $\mathcal{B}$. The update for $\rho_{\pi'}$ is more challenging, since $\rho_{\pi'}$ is only available as an empirical measure (Equation 8). For the current critic iterate $g_w$, the objective for $\rho_{\pi'}$ then becomes:

$\max_{\mathcal{B}} \; \mathbb{E}_{s\sim\rho_{\pi'}}\big[g_w(s)\big] \;=\; \max_{\mathcal{B}} \; \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|S_i|} \sum_{s \in S_i} g_w(s) \qquad (9)$

Section 3 defines the quantity $\frac{1}{|S_i|}\sum_{s \in S_i} g_w(s)$ as the score of the trajectory $\tau_i$. Therefore, the objective in Equation 9 is to update the buffer such that the average score of the trajectories in it increases.

Priority-queue with priority = score. The buffer $\mathcal{B}$ is implemented as a priority-queue (PQ) based on a Min-Heap. Let the current PQ be $\{\tau_1, \ldots, \tau_n\}$, sorted by score such that $\text{score}(\tau_1) \le \ldots \le \text{score}(\tau_n)$. Let $\{\tau_{new}\}$ be the new trajectories rolled out in the environment using the current learner policy (Line 5 in Algorithm 1). For each of these, the PQ is updated using the standard protocol:

1:  Update the scores of $\{\tau_1, \ldots, \tau_n\}$ in the PQ using the latest critic $g_w$
2:  for each $\tau_{new}$ do
3:      Calculate $\text{score}(\tau_{new})$ using the latest critic $g_w$
4:      if $\text{score}(\tau_{new}) > \text{score}(\tau_1)$ then
5:          $\tau_1 \leftarrow \tau_{new}$   // replace the lowest-scoring buffer trajectory with the new trajectory
6:          heapify($\mathcal{B}$)   // PQ-library call to maintain the heap invariant: $\text{score}(\tau_1) \le \ldots \le \text{score}(\tau_n)$
7:      end if
8:  end for

It follows from the PQ protocol that the average score of the trajectories in the buffer increases (or remains the same) after the update, compared to the average score before. This aligns the update with the objective in Equation 9.
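A small heap-based buffer following this protocol is sketched below (ours, using Python's heapq; the trajectory objects and the score function are assumed to come from the main algorithm). heapq keeps the lowest-priority entry at index 0, which is exactly the trajectory the protocol is willing to evict.

import heapq
import itertools
import torch

class PriorityBuffer:
    """Fixed-capacity min-heap of (score, id, trajectory); the lowest-scoring
    trajectory is evicted when a better one arrives."""
    def __init__(self, capacity=5):
        self.capacity = capacity
        self.heap = []                    # entries: [score, uid, trajectory]
        self._uid = itertools.count()     # tie-breaker so heapq never compares trajectories

    def refresh_scores(self, score_fn):
        # Re-score the stored trajectories with the latest critic, then re-heapify.
        for entry in self.heap:
            entry[0] = score_fn(entry[2])
        heapq.heapify(self.heap)

    def maybe_add(self, trajectory, score):
        entry = [score, next(self._uid), trajectory]
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, entry)
        elif score > self.heap[0][0]:     # better than the worst trajectory in the buffer
            heapq.heapreplace(self.heap, entry)

    def trajectories(self):
        return [entry[2] for entry in self.heap]

    def states(self):
        # All stored states, concatenated (assumes each trajectory object
        # exposes a `.states` tensor; hypothetical interface).
        return torch.cat([entry[2].states for entry in self.heap], dim=0)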

7.4 Empirical convergence of lower bound

In the main text, we derived the following lower bound, which connects the expected value of a function under the expert's state-action visitation to the expected value under another surrogate distribution and the 1-Wasserstein distance between the distributions:

$\mathbb{E}_{(s,a)\sim\rho_E}\big[f_\theta(s,a)\big] \;\ge\; \mathbb{E}_{(s,a)\sim\rho_{\pi'}}\big[f_\theta(s,a)\big] \;-\; K \cdot W_1(\rho_E, \rho_{\pi'})$

In this section, we provide empirical measurements of the gap between the original objective (LHS) and the lower bound (RHS). This gap depends on the specific choice of the surrogate distribution $\rho_{\pi'}$. In our algorithm, $\rho_{\pi'}$ is characterized by the trajectories in the priority-queue buffer $\mathcal{B}$, and is updated during the course of training based on the protocol detailed in Appendix 7.3. Figure 8 plots the estimated value of the lower bound for these different $\rho_{\pi'}$, and shows that the gap generally reduces over time. To estimate the LHS and RHS, we need the following:

  • $\rho_{\pi'}$: We take snapshots of the buffer $\mathcal{B}$ at periodic intervals of training to obtain the different surrogate distributions.

  • $\rho_E$: This is the expert's state-action distribution. We train separate oracle experts in the imitator's (learner's) environment, and use state-action tuples from these expert policies. Note that these oracle experts are NOT used in I2L (Algorithm 1), and are only for the purpose of measurement.

  • $W_1(\rho_E, \rho_{\pi'})$: A separate Wasserstein critic is trained using tuples from the oracle experts described above and the trajectories in buffer $\mathcal{B}$. This critic is NOT used in I2L, since we do not have access to oracle experts, and is only for the purpose of measurement.

  • $f_\theta$: We select the AIRL discriminator parameters from a particular training iteration. The same parameters are then used to calculate the LHS, and the RHS for the different distributions.

  • $K$: The Lipschitz constant is unknown and hard to estimate for a complex, non-linear $f_\theta$. We plot the lower bound for a few values of $K$.

Figure 8 shows the gap between the original objective and the lower bound for all our experimental settings: $\mathcal{T}_{imt} = \mathcal{T}_{exp}$ (top row), and $\mathcal{T}_{imt} \neq \mathcal{T}_{exp}$ (next 3 rows). We observe that the gap generally reduces as $\rho_{\pi'}$ is updated over the iterations of I2L (Algorithm 1). A tighter lower bound in turn leads to improved gradients for updating the AIRL discriminator $f_\theta$, ultimately resulting in more effective policy gradients for the imitator.

Figure 8: Gap between the original objective and the lower bound for all our experimental settings: $\mathcal{T}_{imt} = \mathcal{T}_{exp}$ (top row), and $\mathcal{T}_{imt} \neq \mathcal{T}_{exp}$ (next 3 rows).

7.5 Empirical Wasserstein Distances

In each iteration of I2L, we update the Wasserstein critic $g_w$ using the states from the state-only expert demonstration and the states in the buffer $\mathcal{B}$ (Line 6, Algorithm 1). The objective is to obtain the 1-Wasserstein distance:

$W_1\big(\rho_E(s),\, \rho_{\pi'}(s)\big) \;=\; \sup_{\lVert g \rVert_L \le 1} \; \mathbb{E}_{s\sim\rho_E}\big[g(s)\big] - \mathbb{E}_{s\sim\rho_{\pi'}}\big[g(s)\big]$

In Figure 9, we plot the empirical estimate of this distance over the course of training. To get the estimate at any time, the current critic parameters and buffer trajectories are used to calculate $\mathbb{E}_{s\sim\rho_E}[g_w(s)] - \mathbb{E}_{s\sim\rho_{\pi'}}[g_w(s)]$. We show the values for all our experimental settings: $\mathcal{T}_{imt} = \mathcal{T}_{exp}$ (top row), and $\mathcal{T}_{imt} \neq \mathcal{T}_{exp}$ (next 3 rows). It can be seen that the estimate generally decreases over time in all situations. This is because our objective for optimizing $\pi'$ (or updating the buffer $\mathcal{B}$) is to minimize this Wasserstein estimate; please see Appendix 7.3 for more details. The fact that the buffer state-distribution gets closer to the expert's, together with the availability of actions in the buffer which induce those states in the imitator MDP ($\mathcal{T}_{imt}$), enables us to successfully use AIRL for imitation under mismatched transition dynamics.
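The plotted quantity is simply the critic objective evaluated on the two sets of states; a minimal sketch (assuming the critic network and state tensors from the main algorithm):

import torch

def empirical_w1(critic, expert_states, buffer_states):
    """Estimate W1(rho_E, rho_pi') as E_expert[g_w(s)] - E_buffer[g_w(s)],
    using the current (approximately 1-Lipschitz) critic g_w."""
    with torch.no_grad():
        return (critic(expert_states).mean()
                - critic(buffer_states).mean()).item()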

Figure 9: Estimate of the Wasserstein distance between $\rho_E(s)$ and $\rho_{\pi'}(s)$, for all our experimental settings: $\mathcal{T}_{imt} = \mathcal{T}_{exp}$ (top row), and $\mathcal{T}_{imt} \neq \mathcal{T}_{exp}$ (next 3 rows).

7.6 Comparison with BCO

Figure 10 compares I2L with BCO (Torabi et al., 2018a) when the expert and imitator dynamics are the same (top row) and under mismatched transition dynamics with the low-gravity, high-density and high-friction settings (next 3 rows). BCO proceeds by first learning an inverse dynamics model in the imitator's environment, which predicts actions from state transitions. This model is learned via supervised learning on trajectories generated by an exploratory policy. The inverse model is then used to infer actions from the state transitions in the state-only expert demonstrations. The imitator policy is trained with Behavioral Cloning (BC) using these inferred actions. We implement the BCO($\alpha$) version from the paper, since it is shown to be better than vanilla BCO. We observe that, barring two situations (Ant with no dynamics mismatch, and Ant with high density), BCO($\alpha$) is unsuccessful in learning high-return policies. This is potentially due to the difficulties in learning a robust inverse dynamics model, and the compounding-error problem inherent to BC. Similar performance for BCO($\alpha$) is also reported by Torabi et al. (2018b).

Figure 10: Comparison between I2L and BCO.

7.7 Comparison with GAIfO

Figure 11: Comparison between I2L and two baselines derived from GAIL. The final performance of an agent trained with PPO using real rewards in $\mathcal{T}_{imt}$ is also shown.

7.8 Comparison with GAIL-SA and AIRL

Figure 12: Comparison between I2L and baselines that use expert actions: GAIL-SA and AIRL.