1 Introduction
Humans and animals can learn complex behaviors via imitation. Inspired by these learning mechanisms, Imitation Learning (IL) has long been a popular method for training autonomous agents from human-provided demonstrations. However, human and animal imitation differs markedly from commonly used approaches in machine learning. Firstly, humans and animals tend to imitate the goal of the task rather than the particular motions of the demonstrator (baker2007goal). Secondly, humans and animals can easily handle imitation scenarios where there is a shift in embodiment and dynamics between themselves and a demonstrator. The first feature of human IL can be represented within the framework of Inverse Reinforcement Learning (IRL) (ng2000algorithms; abbeel2004apprenticeship; ziebart2008maximum), which at a high level casts the problem of imitation as one of matching outcomes rather than actions. Recent work in adversarial imitation learning (ho2016generative; finn2016guided) has accomplished this by using a discriminator to judge whether a given behavior is from an expert or an imitator, and then training a policy using the discriminator's expert likelihood as a reward. While successful in multiple problem domains, this approach makes it difficult to accommodate the second feature of human learning: imitation across shifts in embodiment and dynamics. This is because, in the presence of such shifts, the discriminator may simply use the embodiment or dynamics to infer whether it is evaluating expert behavior, and as a consequence fail to provide a meaningful reward signal.

In this paper we are concerned with the problem of learning adaptive policies that can be transferred to environments with varying dynamics, by imitating a small number of expert demonstrations collected from a single source domain. This problem is important in robotic learning because it is better aligned with real-world constraints: 1) reward functions are hard to obtain, 2) learned policies from one domain are hard to deploy to different domains due to varying source-to-target domain statistics, and 3) the target domain dynamics often change while executing the learned policy. As such, this work assumes ground truth rewards are not available, and furthermore we assume that expert demonstrations come from only a single domain (i.e. an instance of an environment where dynamics cannot be exactly replicated by the policy at training time). To the best of our knowledge, this is the first work to tackle this challenging problem formulation.
Our proposed method solves the above problem by building upon the GAIL (ho2016generative; finn2016guided) framework. Firstly, we condition the policy on a learned dynamics embedding (“context variable” in the policy search literature (deisenroth2013survey)). We propose two embedding approaches on which the policy is conditioned, namely, a direct supervised learning approach and a variational autoencoder (VAE) (kingma2013auto) based unsupervised approach. Secondly, to prevent the discriminator from inferring whether it is evaluating the expert behavior or imitator behavior purely through the dynamics, we propose using a Gradient Reversal Layer (GRL) to learn a dynamics-invariant discriminator. We demonstrate the effectiveness of the proposed algorithm on benchmark MuJoCo simulated control tasks. The main contributions of our work are: 1) we present a general and novel problem formulation that is better aligned with real-world scenarios than those in recent literature; 2) we devise a conceptually simple architecture that is capable of learning an adaptive policy from a small number of expert demonstrations (order of 10s) collected from only one source environment; 3) we design an adversarial loss for addressing the covariate shift issue in discriminator learning.

2 Related Work
Historically, two main avenues have been heavily studied for imitation learning: 1) Behavioral Cloning (BC) and 2) Inverse Reinforcement Learning (IRL). Though conceptually simple, BC suffers from compounding errors caused by covariate shift, and consequently often requires a large quantity of demonstrations (pomerleau1989alvinn), or access to the expert policy (ross2011reduction), in order to recover a stable policy. Recent advancements in imitation learning (ho2016generative; finn2016guided) have adopted an adversarial formulation that interleaves 1) discriminating the generated policy against the expert demonstrations and 2) a policy improvement step where the policy aims to fool the learned discriminator.
Dynamics randomization (tobin2017domain; sadeghi2016cad2rl; mandlekar2017adversarially; tan2018sim; pinto2017robust; peng2018sim; chebotar2018closing; rajeswaran2016epopt) has been one of the prevailing vehicles for addressing varying simulation-to-real-world domain statistics. This family of methods typically involves perturbing the environment dynamics (oftentimes adversarially) in simulation in order to learn an adaptive policy that is robust enough to bridge the “Reality Gap”. While dynamics randomization has been explored in an RL setting, it has a critical limitation in the imitation learning context: large domain shifts might result in directional differences in dynamics, and therefore the demonstrated actions might no longer be admissible for solving the task in the target domain. Our method (Figure 1) also involves training in a variety of environments with different dynamics. However, we propose conditioning the policy on an explicitly learned dynamics embedding to enable adaptive policies based on online system ID.
yu2017preparing adopted a similar approach towards building adaptive policies. They learn an online system identification model and condition the policy on the predicted model parameters in an RL setting. In comparison to their work, we do not assume access to the ground truth reward signals or the ground truth physics parameters at evaluation time, which makes this work’s problem formulation a harder learning problem, but with greater potential for real-world applications. We will compare our method with yu2017preparing in the experimental section.
Third person imitation learning (stadie2017third) also employs a GRL (ganin2014unsupervised) under a GAIL-like formulation with the goal of learning expert behaviors in a new domain. In comparison, our method also enables learning adaptive policies by employing an online dynamics identification component, so that the policies can be transferred to a class of domains, as opposed to one domain. In addition, policies learned using our proposed method can handle online dynamics perturbations.
Meta learning (finn2017model) has also been applied to address varying source-to-target domain dynamics (duan2017one; nagabandi2018learning). The idea behind meta learning in the context of robotic learning is to learn a meta policy that is “initialized” for a variety of tasks in simulation, and then fine-tune the policy in the real-world setting given a specific goal. After the meta-learning phase, the agent requires significantly fewer environment interactions to obtain a policy that solves the task. In comparison to meta learning based approaches, fine-tuning on the test environment is not required in our method, with the caveat that this is true only within the target domain where the dynamics posterior is effective.
2.1 Background
In this section, we will briefly review GAIL (ho2016generative). Inspired by GANs, the GAIL objective for policy learning on an MDP (see A.2 for a formal definition) is defined as:
$$\min_{\theta} \max_{\omega} \; \mathbb{E}_{(s,a) \sim \pi_E}\left[\log D_\omega(s, a)\right] + \mathbb{E}_{(s,a) \sim \pi_\theta}\left[\log\left(1 - D_\omega(s, a)\right)\right] \tag{1}$$
where $\pi_E$ denotes the expert policy that generated the demonstrations; $\pi_\theta$ is the policy to imitate the expert; and $D_\omega$ is a discriminator that learns to distinguish between $\pi_E$ and $\pi_\theta$ using generated state-action pairs. In comparison to GAN optimization, the GAIL objective is rarely differentiable since differentiation through the environment step is often intractable. Optimization is instead achieved via RL-based policy gradient algorithms, e.g., PPO (schulman2017proximal), or off-policy methods, e.g., TD3 (ilyaDAC). Without an explicit reward function, GAIL relies on reward signals provided by the learned discriminator, where a common reward formulation is $r_\omega(s, a) = -\log\left(1 - D_\omega(s, a)\right)$.
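The discriminator update and the discriminator-based reward described above can be illustrated with a minimal NumPy sketch. It assumes the discriminator outputs a logit whose sigmoid is the probability that a state-action pair comes from the expert, and it assumes the reward form $-\log(1 - D(s, a))$; the function names and the example logits are hypothetical and not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    """Map discriminator logits to expert probabilities D(s, a)."""
    return 1.0 / (1.0 + np.exp(-x))

def discriminator_loss(expert_logits, policy_logits, eps=1e-8):
    """Negated GAIL discriminator objective, to be minimized.

    The discriminator is pushed to output D -> 1 on expert pairs
    and D -> 0 on pairs generated by the imitating policy.
    """
    d_expert = sigmoid(expert_logits)
    d_policy = sigmoid(policy_logits)
    return -(np.mean(np.log(d_expert + eps))
             + np.mean(np.log(1.0 - d_policy + eps)))

def gail_reward(policy_logits, eps=1e-8):
    """Assumed reward form r(s, a) = -log(1 - D(s, a))."""
    return -np.log(1.0 - sigmoid(policy_logits) + eps)

# Hypothetical logits for a small batch of state-action pairs.
expert_logits = np.array([2.0, 1.5, 3.0])
policy_logits = np.array([-2.0, 0.0, 1.0])

loss = discriminator_loss(expert_logits, policy_logits)
rewards = gail_reward(policy_logits)
```

With this reward form, pairs that the discriminator considers more expert-like receive a larger reward, which is what the policy gradient step then maximizes.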
3 ADaptive Adversarial Imitation Learning (ADAIL)
3.1 Problem Definition
Suppose we are given a class $\mathcal{E}$ of environments with different dynamics but similar goals, a domain generator $g(c)$ which takes in a code $c$ and generates an environment $e_c \in \mathcal{E}$, and a set of expert demonstrations $\{\tau_{exp}\}$ collected from one source environment $e_{exp} \in \mathcal{E}$. In adaptive imitation learning, one attempts to learn an adaptive policy $\pi_\theta$ that can generalize across environments within $\mathcal{E}$. We assume that the ground truth dynamics parameters $c$, which are used to generate the simulated environments, are given (or manually sampled) during the training phase.
3.2 Algorithm Overview
We allow the agent to interact with a class of similar simulated environments with varying dynamics parameters, which we call “adaptive training”. To capture high-level goals from a small set of demonstrations, we adopt an approach similar to GAIL. To provide consistent feedback signals during training across environments with different dynamics, the discriminator should be dynamics-invariant. We enable this by learning a dynamics-invariant feature layer for the discriminator by 1) adding another head to the discriminator to predict the dynamics parameters, and 2) inserting a GRL in between this head and the dynamics-invariant feature layer. The new discriminator design is illustrated in Figure 7 and is discussed in more detail in Section 3.4. In addition, to enable adaptive policies, we introduce a dynamics posterior that takes a rollout trajectory and outputs an embedding, on which the policy is conditioned. Intuitively, explicit dynamics learning endows the agent with the ability to identify the system and act differently under changes in dynamics. Note that a policy can learn to infer dynamics implicitly, without the need for an external dynamics embedding. However, we find experimentally that policies conditioned explicitly on the environment parameters outperform those that are not. The overall architecture is illustrated in Figure 1. We call the algorithm Adaptive Adversarial Imitation Learning (ADAIL), with the following objective (note that for brevity, we for now omit the GRL term discussed in Section 3.4):
$$\min_{\theta} \max_{\omega, \phi} \; \mathbb{E}_{(s,a) \sim \pi_E}\left[\log D_\omega(s, a)\right] + \mathbb{E}_{(s,a) \sim \pi_\theta(\cdot \mid c)}\left[\log\left(1 - D_\omega(s, a)\right)\right] + \mathbb{E}_{\tau \sim \pi_\theta(\cdot \mid c)}\left[\log Q_\phi(c \mid \tau)\right] \tag{2}$$
where $c$ is a learned latent dynamics representation that is associated with the rollout environment in each gradient step; $\tau$ is a rollout trajectory using $\pi_\theta(\cdot \mid c)$ in the corresponding environment; and $Q_\phi(c \mid \tau)$ is a “dynamics posterior” for inferring the dynamics during test time. The last term in the objective, $\mathbb{E}_{\tau \sim \pi_\theta(\cdot \mid c)}\left[\log Q_\phi(c \mid \tau)\right]$, is a general form of the expected log likelihood of $c$ given $\tau$. One can employ various supervised and unsupervised methods towards optimizing this term. We will explore a few methods in the following subsections.
The algorithm is outlined in Algorithm 1.
3.3 Adaptive Training
Adaptive training is achieved through 1) allowing the agent to interact with a class of similar simulated environments within class $\mathcal{E}$, and 2) learning a dynamics posterior $Q_\phi$ for predicting the dynamics based on rollouts. The environment class $\mathcal{E}$ is defined as a set of parameterized environments with $N$ degrees of freedom, where $N$ is the total number of latent dynamics parameters that we can change. We assume that we have access to an environment generator $g(c)$ that takes in a sample of the dynamics parameters $c$ and generates an environment. Each time an on-policy rollout is initiated, we resample the dynamics parameters from a predefined prior distribution $p(c)$.
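The resampling step can be sketched as follows. This is only an illustration under stated assumptions: the prior is taken to be uniform over a box, the bounds are illustrative values for two dynamics parameters, and `make_env` is a hypothetical stand-in for the environment generator (a real implementation would construct a simulator instance instead of a dictionary).

```python
import numpy as np

def sample_dynamics(rng, low, high):
    """Draw one dynamics code c from a uniform prior over [low, high]."""
    return rng.uniform(low, high)

def make_env(c):
    """Hypothetical stand-in for the environment generator g(c)."""
    return {"dynamics": c}

rng = np.random.default_rng(0)
# Illustrative bounds for N = 2 dynamics parameters.
low = np.array([-3.0, 0.0])
high = np.array([3.0, 2.0])

envs = []
for _ in range(5):
    # A fresh dynamics code is drawn before every on-policy rollout.
    c = sample_dynamics(rng, low, high)
    envs.append(make_env(c))
```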
3.4 Learning a DynamicsInvariant Discriminator
GAIL learns from the expert demonstrations by matching an implicit state-action occupancy measure. However, this formulation might be problematic in our training setting, where on-policy rollouts are collected from environments with varying dynamics. In non-source environments, the discriminator can no longer provide canonical feedback signals. This motivates us to learn a dynamics-invariant feature space, in which the behavior-oriented features are preserved but dynamics-identifiable features are removed. We approach this problem by assuming that the behavior-oriented characteristics and dynamics-identifiable characteristics are loosely coupled, so that we can learn a dynamics-invariant representation for the discriminator. In particular, we employ a technique called a Gradient Reversal Layer (GRL) (ganin2014unsupervised), which is widely used in image domain adaptation (bousmalis2016domain). The dynamics-invariant feature layer is shared with the original discriminator classification head, as illustrated in Figure 7.
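To make the mechanism concrete, here is a minimal hand-written sketch of a gradient reversal layer, using NumPy and manual backpropagation rather than an autodiff framework. The linear feature layer, the linear dynamics head, the squared-error loss, and all sizes are assumptions chosen for brevity; they are not the architecture used in the paper.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x

    def backward(self, grad_output):
        return -self.lam * grad_output

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # toy shared feature layer: features = W x
v = rng.normal(size=3)        # toy dynamics head: c_hat = v . features
x = rng.normal(size=4)        # a single (s, a) input, flattened
c_true = 0.5                  # hypothetical dynamics label

grl = GradientReversal(lam=1.0)
features = W @ x
c_hat = v @ grl.forward(features)

# Squared-error loss of the dynamics head and its gradients.
d_loss_d_chat = 2.0 * (c_hat - c_true)
grad_v = d_loss_d_chat * features          # the head learns to predict the dynamics
grad_features_head = d_loss_d_chat * v     # gradient arriving at the GRL output
grad_features = grl.backward(grad_features_head)  # sign flipped by the GRL
grad_W = np.outer(grad_features, x)        # feature layer is pushed to hide the dynamics
```

Because the sign is flipped only below the dynamics head, a gradient step on `v` improves the dynamics prediction while the same step on `W` makes the shared features less informative about the dynamics.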
3.5 Direct Supervised Dynamics Latent Variable Learning
Perhaps one of the best latent representations of the dynamics is the ground truth physics parameterization (gravity, friction, limb length, etc.). In this section we explore supervised learning for inferring dynamics. A neural network is employed to represent the dynamics posterior, which is learned via supervised learning by regressing to the ground truth physics parameters given a replay buffer of policy rollouts. We update the regression network using a Huber loss to match environment dynamics labels. Details about the Huber loss can be found in appendix A.3. During training, we condition the learned policy on the ground truth physics parameters. During evaluation, on the other hand, the policy is conditioned on the predicted physics parameters from the posterior.

We use (state, action, next state) as the posterior’s input, i.e., $Q_\phi(c \mid s, a, s')$, and a 3-layer fully-connected neural network to output the $N$-dimensional environment parameters. Note that one can use a recurrent neural network and a longer rollout history for modeling complex dynamic structures; however, we found that this was not necessary for the chosen evaluation environments.
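The difference between training-time and evaluation-time conditioning can be sketched as below. The network sizes, the tanh activations, the random weights, and the helper names are hypothetical; the sketch only shows a posterior that maps one (state, action, next state) transition to an N-dimensional prediction, and the switch between ground truth and predicted parameters.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random weights for a fully-connected network with the given layer sizes."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """Forward pass: tanh hidden layers followed by a linear output layer."""
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b

def posterior(params, s, a, s_next):
    """Predict the dynamics parameters from a single (s, a, s') transition."""
    return mlp(params, np.concatenate([s, a, s_next]))

def policy_context(c_true, c_pred, training):
    """Ground truth parameters during training, posterior prediction at evaluation."""
    return c_true if training else c_pred

rng = np.random.default_rng(0)
state_dim, action_dim, n_params = 4, 1, 2   # hypothetical sizes
params = init_mlp(rng, [2 * state_dim + action_dim, 32, 32, n_params])

s = rng.normal(size=state_dim)
a = rng.normal(size=action_dim)
s_next = rng.normal(size=state_dim)
c_true = np.array([0.0, 0.5])               # hypothetical ground truth label
c_pred = posterior(params, s, a, s_next)

train_context = policy_context(c_true, c_pred, training=True)
eval_context = policy_context(c_true, c_pred, training=False)
```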
3.6 VAEbased Unsupervised Dynamics Latent Variable Learning
In many cases, the number of varying latent parameters of the environment is high, one might not know the set of latent parameters that will vary in a real-world laboratory setting, or the latent parameters are strongly correlated (e.g., gravity and friction) in terms of their effect on environment dynamics. In these cases, predicting the exact latent parameterization is hard. Moreover, the policy is mainly concerned with the end effect of the latent parameters on the dynamics. This motivates us to use an unsupervised tool to extract a latent dynamics embedding. In this section, we explore a VAE-based unsupervised approach similar to the conditional VAE (sohn2015learning), with an additional contrastive regularization loss, for learning the dynamics without ground truth labels.
With the goal of capturing the underlying dynamics, we avoid directly reconstructing the (state, action, next state) tuple, $(s, a, s')$. Otherwise, the VAE would likely capture the latent structure of the state space. Instead, the decoder is modified to take in the state-action pair, $(s, a)$, and a latent code, $c$, and to output the next state, $s'$. The decoder thus becomes a forward dynamics predictive model. The unsupervised dynamics latent variable learning method is illustrated in Figure 2.
The evidence lower bound (ELBO) used is:
$$\mathrm{ELBO} = \mathbb{E}_{Q_\phi(c \mid s, a, s')}\left[\log P_\psi(s' \mid s, a, c)\right] - \mathrm{KL}\left(Q_\phi(c \mid s, a, s') \,\|\, P(c)\right) \tag{3}$$
where $Q_\phi(c \mid s, a, s')$ is the dynamics posterior (encoder); $P_\psi(s' \mid s, a, c)$ is a forward dynamics predictive model (decoder); and $P(c)$ is a Gaussian prior over the latent code $c$. Similar to davis2007information and hsu2015neural, to avoid the encoder learning an identity mapping, we add a contrastive regularization term to the loss.
This term compares one pair of transition tuples sampled from the same rollout trajectory with a second pair sampled from different rollout trajectories, together with a constant. We use this regularization to introduce additional supervision in order to improve the robustness of the latent posterior.
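The overall objective for the dynamics learner is
$$\min_{\phi, \psi} \; -\mathrm{ELBO} + \lambda \, \mathcal{L}_{\text{contrastive}} \tag{4}$$
where $\mathcal{L}_{\text{contrastive}}$ is the contrastive regularization term and $\lambda$ is a scalar that controls its relative strength. The learned posterior (encoder) infers the latent dynamics, which is used for conditioning the policy. The modified algorithm can be found in the appendix (Algorithm 2).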
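As an illustration of the ELBO part of this objective, the following sketch computes a negative ELBO for a single transition. It assumes a diagonal Gaussian encoder, a standard normal prior, and a unit-variance Gaussian decoder (so the reconstruction term reduces to a squared error up to an additive constant). The linear decoder, the fixed encoder outputs, and all dimensions are hypothetical, and the contrastive regularizer is deliberately left out.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def sample_latent(rng, mu, log_var):
    """Reparameterized sample c = mu + sigma * eps with eps ~ N(0, I)."""
    return mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)

def decoder(W, s, a, c):
    """Hypothetical linear forward-dynamics model predicting s' from (s, a, c)."""
    return W @ np.concatenate([s, a, c])

def negative_elbo(s_next, s_next_pred, mu, log_var):
    """Next-state reconstruction error plus KL to the prior."""
    recon = 0.5 * np.sum((s_next - s_next_pred) ** 2)
    return recon + kl_to_standard_normal(mu, log_var)

rng = np.random.default_rng(0)
s, a, s_next = rng.normal(size=3), rng.normal(size=1), rng.normal(size=3)

# Stand-ins for the encoder outputs given (s, a, s').
mu = np.array([0.3, -0.2])
log_var = np.array([-1.0, 0.5])

W = rng.normal(scale=0.1, size=(3, 3 + 1 + 2))
c = sample_latent(rng, mu, log_var)
loss = negative_elbo(s_next, decoder(W, s, a, c), mu, log_var)
```

Note that the decoder never sees the next state directly; it can only reduce the reconstruction error through the state-action pair and the latent code, which is what encourages the code to carry dynamics information.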
4 Experiments
4.1 Environments
To evaluate the proposed algorithm we consider 4 simulated environments: CartPole, Hopper, HalfCheetah and Ant. The chosen dynamics parameters are specified in Table 1, and an example of one such parameter (the HalfCheetah gravity x-component) is shown in Figure 3. During training the parameters are sampled uniformly from the chosen range. Source domain parameters are also given in Table 1. For each source domain, we collect 16 expert demonstrations.
Gym CartPole-V0: We vary the force magnitude over a continuous range in our training setting. Note that the force magnitude can take negative values, which flips the force direction.
Three MuJoCo environments (Hopper, HalfCheetah, and Ant): With these three environments, we vary two dynamics parameters: the gravity x-component and friction.
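Table 1: Dynamics parameter ranges. Parameter 1 is the force magnitude for CartPole-V0 and the gravity x-component for the MuJoCo environments; Parameter 2 is friction.
Environment | Parameter 1 | Parameter 2 | Source
CartPole-V0 | [-1, 1] | |
Hopper | [-1.0, 1.0] | [1.5, 2.5] |
HalfCheetah | [-3.0, 3.0] | [0.0, 2.0] | gravity x = 0.0, friction = 0.5
Ant | [-5.0, 5.0] | [0.0, 4.0] |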
4.2 ADAIL on Simulated Control Tasks
Is the dynamics posterior component effective under large dynamics shifts?
We first demonstrate the effectiveness of the dynamics posterior under large dynamics shifts on a toy Gym environment, CartPole, by varying the 1D force magnitude. As the direction of the force changes, blindly mimicking the demonstrations collected from the source domain would not work on target domains in which the force direction is flipped. This is evident when comparing ADAIL to GAIL with dynamics randomization. As shown in Figure 3(a), GAIL with dynamics randomization failed to generalize to domains with the flipped force direction, whereas ADAIL achieves the same performance there as in domains that share the source force direction. We also include a comparison with ADAIL-rand, where the policy is conditioned on uniformly random values of the dynamics parameters, which completely breaks the performance across the domains.
How does the GRL help improve the robustness of performance across domains?
To demonstrate the effectiveness of GRL in the adversarial imitation learning formulation, we do a comparative study with and without GRL on GAIL with dynamics randomization in the Hopper environment. The results are shown in Figure 3(b).

How does the overall algorithm work in comparison with baseline methods?
We compare the performance of ADAIL with a few baseline methods, including 1) the PPO expert which was used to collect demonstrations; 2) the UP-true algorithm of yu2017preparing, which is essentially a PPO policy conditioned on ground truth physics parameters; and 3) GAIL with dynamics randomization, which is unmodified GAIL training on a variety of environments with varying dynamics. The results of this experiment are shown in Figure 5.
HalfCheetah. The experiments show that: 1) as expected, the PPO expert (Plot 4(a)) has limited adaptability to unseen dynamics. 2) UP-true (Plot 4(b)) achieves similar performance across test environments. Note that since UP-true has access to the ground truth reward signals and the policy is conditioned on ground truth dynamics parameters, Plot 4(b) shows an approximate expected upper bound for our proposed method, since we do not assume access to reward signals during policy training, or to ground truth physics parameters at policy evaluation time. 3) GAIL with dynamics randomization (Plot 4(c)) can generalize to some extent, but failed to achieve the demonstrated performance in the source environment (gravity x = 0.0, friction = 0.5). 4) Plots 8(f) and 8(g) show evaluation of the proposed method ADAIL with the policy conditioned on ground truth physics parameters and predicted physics parameters, respectively; ADAIL matches the expert performance in the source environment (gravity x = 0.0, friction = 0.5) and generalizes to unseen dynamics. In particular, when the environment dynamics favor the task, the adaptive agent was able to obtain even higher performance (around friction = 1.2, gravity = 2).
Ant and Hopper. We again show favorable performance on both Ant and Hopper in Figure 5.
How does the algorithm generalize to unseen environments?
To understand how ADAIL generalizes to environments not sampled at training time, we run a suite of studies in which the agent is only allowed to interact with a limited set of environments. Figure 6 shows the performance of ADAIL in different settings, where a region of environment parameters including the expert source environment is “blacked out”. This case is particularly challenging since the policy is not allowed to access the domain from which the expert demonstrations were collected, and so our dynamics-invariant discriminator is essential. For additional held-out experiments see Figure 10.
The experiments show that 1) without training on the source environment, ADAIL with the ground truth parameters tends to have performance drops in the blackout region but is largely able to generalize (Figure 5(a)); 2) the posterior’s RMSE rises in the blackout region (Figure 5(c)); 3) consequently, ADAIL with the predicted dynamics parameters suffers from the posterior error in the blackout region (Figure 5(b)).
How does the unsupervised version of the algorithm perform?
VAE-ADAIL on HalfCheetah. To understand the characteristics of the dynamics latent embedding learned by the unsupervised method, and its impact on the overall algorithm, we apply VAE-ADAIL to the HalfCheetah environment as a proof of concept, varying a single continuous dynamics parameter, friction. The performance is shown in Figure 8.
5 Conclusion
In this work we proposed the ADaptive Adversarial Imitation Learning (ADAIL) algorithm for learning adaptive control policies from a limited number of expert demonstrations. We demonstrated the effectiveness of ADAIL on two challenging MuJoCo test suites and compared against recent state-of-the-art methods. We showed that ADAIL extends the generalization capacities of policies to unseen environments, and we proposed a variant of our algorithm, VAE-ADAIL, that does not require environment dynamics labels at training time. We will release the code to aid in reproduction upon publication.
References
Appendix A Appendix
A.1 Discriminator with Gradient Reversal Layer (GRL)
A.2 Markov Decision Process
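An infinite-horizon, discounted Markov decision process (MDP) is defined as a tuple $(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, transition probability distribution $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$, reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, initial state distribution $\rho_0$, and the discount factor $\gamma \in (0, 1)$. Let $\tau = (s_0, a_0, s_1, a_1, \ldots)$ be a trajectory of states and actions, and $R(\tau) = \sum_{t} \gamma^t r(s_t, a_t)$ the total discounted reward for the trajectory. The goal of RL algorithms is to find a policy $\pi$ to maximize the expected discounted cumulative reward, $\mathbb{E}_{\tau \sim \pi}\left[R(\tau)\right]$, where $s_0 \sim \rho_0$, $a_t \sim \pi(a_t \mid s_t)$, and $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$. In the imitation learning setting, the reward function is not given; instead, a set of expert demonstrations $\{\tau_{exp}\}$ is provided, where each $\tau_{exp}$ is sampled by rolling out an expert policy $\pi_E$ in the MDP.

A.3 Huber Loss For Dynamics Embedding Loss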
We use the following loss function when training the dynamics embedding posterior:
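$$L_\delta\left(c, Q_\phi(\tau)\right) = \begin{cases} \frac{1}{2}\left(c - Q_\phi(\tau)\right)^2 & \text{for } \left|c - Q_\phi(\tau)\right| \le \delta \\ \delta \left|c - Q_\phi(\tau)\right| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases} \tag{5}$$
where $\delta$ controls the transition point between the L2 and L1 penalties in the Huber loss.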
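For reference, a minimal NumPy version of the loss is sketched below, assuming the standard Huber definition (quadratic within a threshold `delta` of the target, linear beyond it). The function and argument names are illustrative.

```python
import numpy as np

def huber_loss(c_true, c_pred, delta=1.0):
    """Element-wise Huber loss between dynamics labels and posterior predictions."""
    err = np.abs(c_true - c_pred)
    quadratic = 0.5 * err ** 2                  # L2 branch, used when err <= delta
    linear = delta * err - 0.5 * delta ** 2     # L1 branch, used when err > delta
    return np.where(err <= delta, quadratic, linear)

small_error = huber_loss(0.0, 0.5, delta=1.0)   # falls in the quadratic branch
large_error = huber_loss(0.0, 3.0, delta=1.0)   # falls in the linear branch
```

The two branches agree at an error of exactly `delta`, so the loss is continuous, and the linear branch limits the influence of badly mispredicted dynamics labels.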
Lemma 1. Minimizing the above Huber loss is equivalent to maximizing the log likelihood, $\log P(c \mid \tau)$, assuming $P(c \mid \tau)$ is distributed as a Gaussian distribution when $\left|c - Q_\phi(\tau)\right| \le \delta$, and as a Laplace distribution otherwise. See appendix A.4 for the proof.

A.4 Lemma 1 Proof
Proof. For $\left|c - Q_\phi(\tau)\right| \le \delta$,
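$$\begin{aligned}
\log P(c \mid \tau) &= \log\left[\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{\left(c - Q_\phi(\tau)\right)^2}{2\sigma^2}\right)\right] && (6) \\
&= -\log\left(\sigma\sqrt{2\pi}\right) - \frac{\left(c - Q_\phi(\tau)\right)^2}{2\sigma^2} && (7) \\
&= -\frac{1}{2\sigma^2}\left(c - Q_\phi(\tau)\right)^2 + C_1 && (8) \\
&\propto -\frac{1}{2}\left(c - Q_\phi(\tau)\right)^2 + C_2 && (9) \\
&= -L_\delta\left(c, Q_\phi(\tau)\right) + C_2 && (10)
\end{aligned}$$
Here $\sigma$ is a fixed standard deviation, and $C_1$ and $C_2$ are constants that do not depend on $\phi$, so maximizing the log likelihood is equivalent to minimizing the quadratic branch of the Huber loss.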
Likewise, we can prove the Laplace case for $\left|c - Q_\phi(\tau)\right| > \delta$.
A.5 VAE-ADAIL Algorithm
A.6 VAE-ADAIL Experiment on HalfCheetah
A.7 HalfCheetah ADAIL Performance Comparison
A.8 Held-out Environment Experiment
A.9 Hyperparameters
A.9.1 ADAIL
We use fully connected neural networks with 2 hidden layers for all three components of the system. The network hyperparameters for each of the test environments with 2D dynamics parameters are shown in Table 2. For all the baseline methods, we use the same set of hyperparameters.

Table 2: ADAIL network architectures and learning rates (lr).
Environment | Policy architecture | Policy lr | Discriminator architecture | Discriminator lr | Posterior architecture | Posterior lr
CartPole-V0 | (s,a) 64 64 (a) | 0.0005586 | (s,a) 32 32 1 | 0.000167881 | (s,a,s’)76140(1,c) | 0.00532
Hopper | (s,a) 64 64 (a) | 0.000098646 | (s,a) 32 32 1 | 0.0000261 | (s,a,s’)241236(2,c) | 0.00625
HalfCheetah | (s,a) 64 64 (a) | 0.00005586 | (s,a) 32 32 1 | 0.0000167881 | (s,a,s’)150150(2,c) | 0.003
Ant | (s,a) 64 64 (a) | 0.000047 | (s,a) 32 32 1 | 0.000037 | (s,a,s’)72177(2,c) | 0.002353
A.9.2 VAE-ADAIL
In Table 3 we show the network architectures and learning rates for VAE-ADAIL.

Table 3: VAE-ADAIL network architectures and learning rates.
Component | Architecture | Learning rate
Encoder (Posterior) | (s,a,s’) 200 200 (c) | 0.000094
Decoder | (s,a,c) 200 200 (s’) | 0.000094
Policy | (s,a) 64 64 (a) | 0.00005596
Discriminator | (s,a) 32 32 (1,c) | 0.000046077