ADAIL: Adaptive Adversarial Imitation Learning

08/23/2020
by   Yiren Lu, et al.
Google

We present the ADaptive Adversarial Imitation Learning (ADAIL) algorithm for learning adaptive policies that can be transferred between environments of varying dynamics, by imitating a small number of demonstrations collected from a single source domain. This is an important problem in robotic learning because in real-world scenarios 1) reward functions are hard to obtain, 2) learned policies from one domain are difficult to deploy in another due to varying source-to-target domain statistics, and 3) collecting expert demonstrations in multiple environments where the dynamics are known and controlled is often infeasible. We address these constraints by building upon recent advances in adversarial imitation learning: we condition our policy on a learned dynamics embedding, and we employ a domain-adversarial loss to learn a dynamics-invariant discriminator. The effectiveness of our method is demonstrated on simulated control tasks with varying environment dynamics, and the learned adaptive agent outperforms several recent baselines.


1 Introduction

Humans and animals can learn complex behaviors via imitation. Inspired by these learning mechanisms, Imitation Learning (IL) has long been a popular method for training autonomous agents from human-provided demonstrations. However, human and animal imitation differs markedly from commonly used approaches in machine learning. Firstly, humans and animals tend to imitate the goal of the task rather than the particular motions of the demonstrator (baker2007goal). Secondly, humans and animals can easily handle imitation scenarios where there is a shift in embodiment and dynamics between themselves and the demonstrator. The first feature of human IL can be represented within the framework of Inverse Reinforcement Learning (IRL) (ng2000algorithms; abbeel2004apprenticeship; ziebart2008maximum), which at a high level casts the problem of imitation as one of matching outcomes rather than actions. Recent work in adversarial imitation learning (ho2016generative; finn2016guided) has accomplished this by using a discriminator to judge whether a given behavior comes from an expert or from the imitator, and then training a policy using the discriminator's expert likelihood as a reward. While successful in multiple problem domains, this approach makes it difficult to accommodate the second feature of human learning: imitation across shifts in embodiment and dynamics. In the presence of such shifts, the discriminator may simply use the embodiment or dynamics to infer whether it is evaluating expert behavior, and as a consequence fail to provide a meaningful reward signal.

In this paper we are concerned with the problem of learning adaptive policies that can be transferred to environments with varying dynamics, by imitating a small number of expert demonstrations collected from a single source domain. This problem is important in robotic learning because it is better aligned with real-world constraints: 1) reward functions are hard to obtain, 2) learned policies from one domain are hard to deploy to different domains due to varying source-to-target domain statistics, and 3) the target domain dynamics often change while the learned policy is being executed. As such, this work assumes ground truth rewards are not available, and furthermore we assume that expert demonstrations come from only a single domain (i.e., an instance of an environment whose dynamics cannot be exactly replicated by the policy at training time). To the best of our knowledge, this is the first work to tackle this challenging problem formulation.

Our proposed method solves the above problem by building on the GAIL (ho2016generative; finn2016guided) framework. Firstly, we condition the policy on a learned dynamics embedding (a "context variable" in the policy search literature (deisenroth2013survey)). We propose two embedding approaches on which the policy is conditioned: a direct supervised learning approach and a variational autoencoder (VAE) (kingma2013auto) based unsupervised approach. Secondly, to prevent the discriminator from inferring whether it is evaluating expert or imitator behavior purely through the dynamics, we propose using a Gradient Reversal Layer (GRL) to learn a dynamics-invariant discriminator. We demonstrate the effectiveness of the proposed algorithm on benchmark MuJoCo simulated control tasks. The main contributions of our work are: 1) a general and novel problem formulation that is better aligned with real-world scenarios than recent literature; 2) a conceptually simple architecture that is capable of learning an adaptive policy from a small number of expert demonstrations (on the order of tens) collected from only one source environment; and 3) an adversarial loss that addresses the covariate shift issue in discriminator learning.

2 Related Work

Historically, two main avenues have been heavily studied for imitation learning: 1) Behavioral Cloning (BC) and 2) Inverse Reinforcement Learning (IRL). Though conceptually simple, BC suffers from compounding errors caused by covariate shift and consequently often requires a large quantity of demonstrations (pomerleau1989alvinn), or access to the expert policy (ross2011reduction), in order to recover a stable policy. Recent advancements in imitation learning (ho2016generative; finn2016guided) have adopted an adversarial formulation that interleaves 1) discriminating the generated policy's behavior from the expert demonstrations with 2) a policy improvement step in which the policy aims to fool the learned discriminator.

Dynamics randomization (tobin2017domain; sadeghi2016cad2rl; mandlekar2017adversarially; tan2018sim; pinto2017robust; peng2018sim; chebotar2018closing; rajeswaran2016epopt) has been one of the prevailing vehicles for addressing varying simulation-to-real-world domain statistics. This family of methods typically perturbs the environment dynamics in simulation (often adversarially) in order to learn an adaptive policy that is robust enough to bridge the "Reality Gap". While dynamics randomization has been explored in the RL setting, it has a critical limitation in the imitation learning context: large domain shifts might result in directional differences in dynamics, so the demonstrated actions might no longer be admissible for solving the task in the target domain. Our method (Figure 1) also involves training in a variety of environments with different dynamics. However, we propose conditioning the policy on an explicitly learned dynamics embedding, enabling adaptive policies based on online system identification.

yu2017preparing adopted a similar approach towards building adaptive policies. They learn an online system identification model and condition the policy on the predicted model parameters in an RL setting. In comparison to their work, we do not assume access to the ground truth reward signals or the ground truth physics parameters at evaluation time, which makes this work’s problem formulation a harder learning problem, but with greater potential for real-world applications. We will compare our method with yu2017preparing in the experimental section.

Third person imitation learning (stadie2017third) also employs a GRL (ganin2014unsupervised) under a GAIL-like formulation with the goal of learning expert behaviors in a new domain. In comparison, our method also enables learning adaptive policies by employing an online dynamics identification component, so that the policies can be transferred to a class of domains, as opposed to one domain. In addition, learned policies using our proposed method can handle online dynamics perturbations.

Meta learning (finn2017model) has also been applied to address varying source to target domain dynamics (duan2017one; nagabandi2018learning). The idea behind meta learning in the context of robotic learning is to learn a meta policy that is “initialized” for a variety of tasks in simulation, and then fine-tune the policy in the real-world setting given a specific goal. After the meta-learning phase, the agent requires significantly fewer environment interactions to obtain a policy that solves the task. In comparison to meta learning based approaches, fine-tuning on the test environment is not required in our method, with the caveat being that this is true only within the target domain where the dynamics posterior is effective.

2.1 Background

In this section, we briefly review GAIL (ho2016generative). Inspired by GANs, the GAIL objective for policy learning on an MDP (see Appendix A.2 for a formal definition) is defined as:

(1)   $\min_{\pi_\theta} \max_{D} \;\; \mathbb{E}_{\pi_E}\!\left[\log D(s,a)\right] + \mathbb{E}_{\pi_\theta}\!\left[\log\bigl(1 - D(s,a)\bigr)\right]$

where $\pi_E$ denotes the expert policy that generated the demonstrations, $\pi_\theta$ is the policy that imitates the expert, and $D$ is a discriminator that learns to distinguish between state-action pairs generated by $\pi_E$ and $\pi_\theta$. In contrast to GAN optimization, the GAIL objective cannot be optimized by direct differentiation, since differentiating through the environment step is generally intractable. Optimization is instead achieved via RL-based policy gradient algorithms, e.g., PPO (schulman2017proximal), or off-policy methods, e.g., TD3 (ilyaDAC). Without an explicit reward function, GAIL relies on reward signals provided by the learned discriminator; a common reward formulation is $r(s,a) = -\log\bigl(1 - D(s,a)\bigr)$.
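To make the background concrete, below is a minimal PyTorch sketch (not the authors' implementation) of a GAIL-style discriminator and the surrogate reward $-\log(1 - D(s,a))$; the network sizes and the `gail_reward` helper are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Binary classifier D(s, a): probability that a state-action pair came from the expert."""
    def __init__(self, obs_dim, act_dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return torch.sigmoid(self.net(torch.cat([obs, act], dim=-1)))

def gail_reward(disc, obs, act, eps=1e-8):
    """Surrogate reward r(s, a) = -log(1 - D(s, a)) used in place of the true reward."""
    with torch.no_grad():
        d = disc(obs, act)
    return -torch.log(1.0 - d + eps)
```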

3 ADaptive Adversarial Imitation Learning (ADAIL)

3.1 Problem Definition

Suppose we are given a class of environments $\mathcal{E}$ with different dynamics but similar goals, a domain generator $G(c)$ that takes in a code $c$ and generates an environment $e_c \in \mathcal{E}$, and a set of expert demonstrations $\{\tau_E\}$ collected from one source environment $e_0 \in \mathcal{E}$. In adaptive imitation learning, one attempts to learn an adaptive policy $\pi_\theta(a \mid s, c)$ that generalizes across environments within $\mathcal{E}$. We assume that the ground truth dynamics parameters $c$, which are used to generate the simulated environments, are given (or manually sampled) during the training phase.

3.2 Algorithm Overview

We allow the agent to interact with a class of similar simulated environments with varying dynamics parameters, which we call "adaptive training". To capture high-level goals from a small set of demonstrations, we adopt an approach similar to GAIL. To provide consistent feedback signals during training across environments with different dynamics, the discriminator should be dynamics-invariant. We enable this desirable feature by learning a dynamics-invariant feature layer for the discriminator: 1) we add another head to the discriminator to predict the dynamics parameters, and 2) we insert a GRL between this prediction head and the dynamics-invariant feature layer. The new discriminator design is illustrated in Figure 7 and is discussed in more detail in Section 3.4. In addition, to enable adaptive policies, we introduce a dynamics posterior that takes a roll-out trajectory and outputs an embedding, on which the policy is conditioned. Intuitively, explicit dynamics learning endows the agent with the ability to identify the system and act differently against changes in dynamics. Note that a policy can learn to infer dynamics implicitly, without the need for an external dynamics embedding. However, we find experimentally that policies conditioned explicitly on the environment parameters outperform those that are not. The overall architecture is illustrated in Figure 1. We call the algorithm Adaptive Adversarial Imitation Learning (ADAIL), with the following objective (for brevity, we omit the GRL term discussed in Section 3.4):

(2)   $\min_{\pi_\theta} \max_{D,\,Q} \;\; \mathbb{E}_{\pi_E}\!\left[\log D(s,a)\right] + \mathbb{E}_{c \sim p(c),\,(s,a) \sim \pi_\theta(\cdot \mid \cdot, c)}\!\left[\log\bigl(1 - D(s,a)\bigr)\right] + \mathbb{E}_{c \sim p(c),\,\tau \sim \pi_\theta(\cdot \mid \cdot, c)}\!\left[\log Q(c \mid \tau)\right]$
1: Inputs:
2: An environment class $\mathcal{E}$.
3: Initial parameters of policy $\theta$, discriminator $\omega$, and posterior $\phi$.
4: A set of expert demonstrations $\{\tau_E\}$ on one of the environments $e_0 \in \mathcal{E}$. An environment generator $G(c)$ that takes a code $c$ and generates an environment $e_c$. A prior distribution $p(c)$.
5: for i = 1, 2, ... do
6:      Sample $c_i \sim p(c)$ and generate an environment $e_{c_i} = G(c_i)$
7:      Sample trajectories $\tau_i \sim \pi_\theta(\cdot \mid \cdot, c_i)$ in $e_{c_i}$ and $\tau_E^i \sim \{\tau_E\}$
8:      Update the discriminator parameters $\omega$ with the gradients: $\hat{\mathbb{E}}_{\tau_E^i}\!\left[\nabla_\omega \log D_\omega(s,a)\right] + \hat{\mathbb{E}}_{\tau_i}\!\left[\nabla_\omega \log\bigl(1 - D_\omega(s,a)\bigr)\right]$
9:      Update the discriminator parameters $\omega$ again with the dynamics-prediction loss on the auxiliary head, $\hat{\mathbb{E}}_{\tau_i}\!\left[\ell\bigl(\hat{c}_\omega(s,a), c_i\bigr)\right]$, such that the gradients are reversed when back-propagating through the dynamics-invariant layer
10:      Update the posterior parameters $\phi$ with gradients $\hat{\mathbb{E}}_{\tau_i}\!\left[\nabla_\phi \log Q_\phi(c_i \mid \tau_i)\right]$
11:      Update policy $\pi_\theta(a \mid s, c_i)$ using a policy optimization method (PPO) with rewards $r(s,a) = -\log\bigl(1 - D_\omega(s,a)\bigr)$
12: Output: Learned policy $\pi_\theta$, and posterior $Q_\phi$.
Algorithm 1 ADAIL

where $c$ is a learned latent dynamics representation associated with the rollout environment at each gradient step; $\tau$ is a roll-out trajectory obtained using $\pi_\theta$ in the corresponding environment; and $Q_\phi(c \mid \tau)$ is a "dynamics posterior" used to infer the dynamics at test time. The last term in the objective, $\mathbb{E}\!\left[\log Q_\phi(c \mid \tau)\right]$, is a general form of the expected log likelihood of $c$ given $\tau$. One can employ various supervised and unsupervised methods towards optimizing this term; we explore a few of them in the following subsections.

The algorithm is outlined in Algorithm 1.
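For readers who prefer code, one iteration of Algorithm 1 can be sketched as follows. This is a schematic under stated assumptions: `make_env`, `collect_rollout`, and `ppo_update` are hypothetical helpers standing in for the environment generator, rollout collection, and the PPO step, and the `disc` and `posterior` objects are assumed to expose the update routines described above.

```python
import numpy as np

def adail_iteration(policy, disc, posterior, expert_batch,
                    prior_low, prior_high, make_env, collect_rollout, ppo_update):
    """One training iteration of Algorithm 1 (schematic; helper callables are assumed)."""
    # 1) Sample dynamics parameters c from the uniform prior and build an environment.
    c = np.random.uniform(np.asarray(prior_low), np.asarray(prior_high))
    env = make_env(c)

    # 2) Roll out the current policy, conditioned on the true dynamics code c.
    #    `rollout` is assumed to be a list of (s, a, s_next) transitions.
    rollout = collect_rollout(env, policy, c)

    # 3) Discriminator step: expert pairs vs. on-policy pairs (Eq. 1),
    #    followed by the GRL step on the dynamics-prediction head.
    disc.update(expert_batch, rollout)
    disc.update_grl(rollout, c)

    # 4) Posterior step: fit the dynamics code c from (s, a, s') transitions.
    posterior.update(rollout, c)

    # 5) Policy step: PPO on the discriminator-derived reward -log(1 - D(s, a)).
    rewards = [-np.log(1.0 - disc.prob(s, a) + 1e-8) for (s, a, _) in rollout]
    ppo_update(policy, rollout, rewards, c)
```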

Figure 1: The ADAIL architecture. "Environment" is sampled from a population of environments with varying dynamics; "Demonstrations" are collected from one environment within the environment distribution; "Posterior" is the dynamics predictor $Q_\phi(c \mid \tau)$; the latent code $c$ represents the ground truth or learned dynamics parameters; and the policy input is extended to include the latent dynamics embedding $c$.

3.3 Adaptive Training

Adaptive training is achieved through 1) allowing the agent to interact with a class of similar simulated environments $\mathcal{E}$, and 2) learning a dynamics posterior that predicts the dynamics from rollouts. The environment class $\mathcal{E}$ is defined as a set of parameterized environments with $N$ degrees of freedom, where $N$ is the total number of latent dynamics parameters that we can change. We assume access to an environment generator $G(c)$ that takes in a sample of the dynamics parameters $c$ and generates an environment. Each time an on-policy rollout is initiated, we re-sample the dynamics parameters from a predefined prior distribution $p(c)$.
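As a sketch of the kind of generator $G(c)$ assumed above, the snippet below samples a dynamics code from a uniform prior and perturbs a MuJoCo model accordingly. The attribute paths follow the mujoco_py-backed Gym environments and are illustrative, not the authors' code.

```python
import numpy as np
import gym

def sample_dynamics(rng, low=(-3.0, 0.0), high=(3.0, 2.0)):
    """Sample a dynamics code c = (gravity_x, friction) from a uniform prior p(c)."""
    return rng.uniform(np.asarray(low), np.asarray(high))

def make_halfcheetah(c):
    """Generate a HalfCheetah variant with the sampled dynamics parameters.
    The model attributes below illustrate the kind of hooks such a generator can use."""
    env = gym.make("HalfCheetah-v2")
    gravity_x, friction = c
    model = env.unwrapped.model
    model.opt.gravity[0] = gravity_x          # x-component of gravity
    model.geom_friction[:, 0] = friction      # sliding friction for all geoms
    return env
```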

3.4 Learning a Dynamics-Invariant Discriminator

GAIL learns from the expert demonstrations by matching an implicit state-action occupancy measure. However, this formulation might be problematic in our training setting, where on-policy rollouts are collected from environments with varying dynamics: in non-source environments, the discriminator can no longer provide canonical feedback signals. This motivates us to learn a dynamics-invariant feature space in which behavior-oriented features are preserved but dynamics-identifying features are removed. We approach this problem by assuming that the behavior-oriented characteristics and the dynamics-identifying characteristics are loosely coupled, so that we can learn a dynamics-invariant representation for the discriminator. In particular, we employ a Gradient Reversal Layer (GRL) (ganin2014unsupervised), a technique widely used in image domain adaptation (bousmalis2016domain). The dynamics-invariant feature layer is shared with the original discriminator classification head, as illustrated in Figure 7.
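A GRL amounts to a few lines in an autograd framework. Below is a common PyTorch sketch of the reversal layer and a two-headed discriminator matching the structure of Figure 7; layer sizes and the lambda scaling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies gradients by -lambda on the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class TwoHeadDiscriminator(nn.Module):
    """Shared feature trunk with (i) an expert/imitator head and (ii) a dynamics-prediction
    head behind a GRL, so the shared features are pushed to be uninformative about dynamics."""
    def __init__(self, in_dim, code_dim, hidden=32, lam=1.0):
        super().__init__()
        self.lam = lam
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, hidden), nn.Tanh())
        self.clf_head = nn.Linear(hidden, 1)         # expert vs. imitator logit
        self.dyn_head = nn.Linear(hidden, code_dim)  # predicts the dynamics code c

    def forward(self, sa):
        h = self.trunk(sa)
        logit = self.clf_head(h)
        c_hat = self.dyn_head(GradReverse.apply(h, self.lam))
        return logit, c_hat
```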

3.5 Direct Supervised Dynamics Latent Variable Learning

Perhaps one of the best latent representations of the dynamics is the ground truth physics parameterization (gravity, friction, limb length, etc.). In this section we explore supervised learning for inferring the dynamics. A neural network is employed to represent the dynamics posterior, and it is learned by regressing to the ground truth physics parameters given a replay buffer of policy rollouts. We update the regression network with a Huber loss to match the environment dynamics labels; details about the Huber loss can be found in Appendix A.3. During training, we condition the learned policy on the ground truth physics parameters. During evaluation, on the other hand, the policy is conditioned on the physics parameters predicted by the posterior.

We use (state, action, next state) tuples as the posterior's input, i.e., $(s_t, a_t, s_{t+1})$, and a 3-layer fully-connected neural network that outputs the $N$-dimensional environment parameters. Note that one could use a recurrent neural network and a longer rollout history to model more complex dynamics structures; however, we found this unnecessary for the chosen evaluation environments.
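A minimal sketch of this supervised posterior is shown below: a 3-layer fully connected network regressing the dynamics code from a single $(s, a, s')$ transition with a Huber loss. The hidden sizes and optimizer wiring are illustrative; Table 2 lists the per-environment architectures actually used.

```python
import torch
import torch.nn as nn

class DynamicsPosterior(nn.Module):
    """Regresses the N-dimensional dynamics code c from a single (s, a, s') transition."""
    def __init__(self, obs_dim, act_dim, code_dim, hidden=150):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, code_dim),
        )

    def forward(self, s, a, s_next):
        return self.net(torch.cat([s, a, s_next], dim=-1))

def posterior_update(posterior, optimizer, batch, delta=1.0):
    """One supervised update against the ground-truth code c using a Huber loss (cf. Eq. 5)."""
    s, a, s_next, c_true = batch
    c_pred = posterior(s, a, s_next)
    loss = nn.functional.huber_loss(c_pred, c_true, delta=delta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```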

3.6 VAE-based Unsupervised Dynamics Latent Variable Learning

In many cases the number of varying latent parameters of the environment is large, one might not know which latent parameters will vary in a real-world laboratory setting, or the latent parameters are strongly correlated (e.g., gravity and friction) in terms of their effect on the environment dynamics. In such cases, predicting the exact latent parameterization is hard, and the policy is mainly concerned with the net effect of the latent parameters. This motivates us to use an unsupervised tool to extract a latent dynamics embedding. In this section, we explore a VAE-based unsupervised approach, similar to the conditional VAE (sohn2015learning) but with an additional contrastive regularization loss, for learning the dynamics without ground truth labels.

With the goal of capturing the underlying dynamics, we avoid directly reconstructing the (state, action, next state) tuple $(s_t, a_t, s_{t+1})$; otherwise, the VAE would likely capture the latent structure of the state space. Instead, the decoder is modified to take in the state-action pair $(s_t, a_t)$ and a latent code $c$, and to output the next state $s_{t+1}$. The decoder thus becomes a forward dynamics predictive model. The unsupervised dynamics latent variable learning method is illustrated in Figure 2.

Figure 2: VAE-based unsupervised dynamics learning.

The evidence lower bound (ELBO) used is:

(3)   $\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q_\phi(c \mid s_t, a_t, s_{t+1})}\!\left[\log p_\theta(s_{t+1} \mid s_t, a_t, c)\right] - D_{\mathrm{KL}}\!\left(q_\phi(c \mid s_t, a_t, s_{t+1}) \,\|\, p(c)\right)$

where $q_\phi(c \mid s_t, a_t, s_{t+1})$ is the dynamics posterior (encoder), $p_\theta(s_{t+1} \mid s_t, a_t, c)$ is a forward dynamics predictive model (decoder), and $p(c)$ is a Gaussian prior over the latent code $c$. Similar to davis2007information and hsu2015neural, to avoid the encoder learning an identity mapping on $s_{t+1}$, we add a contrastive regularization of the form

$\mathcal{L}_{\mathrm{reg}} = \max\!\left(0,\; \left\| c_i - c_i' \right\|_2^2 - \left\| c_i - c_j \right\|_2^2 + m \right),$

where $c_i$ and $c_i'$ are latent codes inferred from samples of the same roll-out trajectory, $c_i$ and $c_j$ are inferred from samples of different roll-out trajectories, and $m$ is a constant margin. We use this regularization to introduce additional supervision in order to improve the robustness of the latent posterior.

The overall objective for the dynamics learner is

(4)   $\mathcal{L}_{Q} = -\mathcal{L}_{\mathrm{ELBO}} + \alpha\, \mathcal{L}_{\mathrm{reg}}$

where $\alpha$ is a scalar that controls the relative strength of the regularization term. The learned posterior (encoder) infers the latent dynamics, which is used to condition the policy. The modified algorithm can be found in the appendix (Algorithm 2).
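Below is a minimal PyTorch sketch of the encoder/decoder of Figure 2 together with a loss of the form of Eqs. (3)-(4). The specific triplet form of the regularizer, the batch construction, and the layer sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicsVAE(nn.Module):
    """Encoder q(c | s, a, s') and forward-model decoder p(s' | s, a, c), as in Figure 2."""
    def __init__(self, obs_dim, act_dim, code_dim, hidden=200):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(2 * obs_dim + act_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, code_dim)
        self.logvar = nn.Linear(hidden, code_dim)
        self.dec = nn.Sequential(nn.Linear(obs_dim + act_dim + code_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, obs_dim))

    def forward(self, s, a, s_next):
        h = self.enc(torch.cat([s, a, s_next], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        c = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        s_next_hat = self.dec(torch.cat([s, a, c], dim=-1))
        return s_next_hat, mu, logvar

def dynamics_vae_loss(model, anchor, positive, alpha=0.1, margin=1.0):
    """Negative ELBO (Eq. 3) plus a triplet-style contrastive regularizer (Eq. 4, assumed form).
    `anchor` and `positive` are (s, a, s') batches whose i-th rows come from the same rollout;
    negatives are formed by pairing each anchor with the positive code of another rollout."""
    s, a, s_next = anchor
    s_next_hat, mu, logvar = model(s, a, s_next)
    recon = F.mse_loss(s_next_hat, s_next)                          # -E[log p(s'|s,a,c)] up to a constant
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())   # KL(q(c|.) || N(0, I))

    with torch.no_grad():
        _, mu_pos, _ = model(*positive)                             # codes from the same rollouts
    mu_neg = torch.roll(mu_pos, shifts=1, dims=0)                   # codes from different rollouts
    pos_d = (mu - mu_pos).pow(2).sum(dim=-1)
    neg_d = (mu - mu_neg).pow(2).sum(dim=-1)
    reg = F.relu(pos_d - neg_d + margin).mean()                     # pull same-rollout codes together

    return recon + kl + alpha * reg
```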

Figure 3: Varying the x-component of gravity in the HalfCheetah environment. The red arrows show the gravity directions.

4 Experiments

4.1 Environments

To evaluate the proposed algorithm we consider 4 simulated environments: CartPole, Hopper, HalfCheetah and Ant. The chosen dynamics parameters are specified in Table 1, and an example of one such parameter (the HalfCheetah gravity component $g_x$) is shown in Figure 3. During training the parameters are sampled uniformly from the chosen range. Source domain parameters are also given in Table 1. For each source domain, we collect 16 expert demonstrations.

Gym CartPole-V0: We vary the force magnitude over a continuous range (Table 1) in our training setting. Note that the force magnitude can take negative values, which flips the force direction.

3 MuJoCo Environments: Hopper, HalfCheetah, and Ant: With these three environments, we vary two dynamics parameters: the gravity x-component and friction.

Environment     Parameter 1 ($F$ or $g_x$)   Parameter 2 ($f$)   Source
CartPole-V0     $F \in [-1, 1]$
Hopper          $g_x \in [-1.0, 1.0]$        $f \in [1.5, 2.5]$
HalfCheetah     $g_x \in [-3.0, 3.0]$        $f \in [0.0, 2.0]$  $(g_x, f) = (0.0, 0.5)$
Ant             $g_x \in [-5.0, 5.0]$        $f \in [0.0, 4.0]$
Table 1: Environments. $F$ = force magnitude; $g_x$ = gravity x-component; $f$ = friction. For each environment, we collect 16 expert demonstrations from the source domain.

4.2 ADAIL on Simulated Control Tasks

Is the dynamics posterior component effective under large dynamics shifts?

We first demonstrate the effectiveness of the dynamics posterior under large dynamics shifts on a toy Gym environment, CartPole, by varying the 1D force magnitude. As the direction of the force changes, blindly mimicking the demonstrations collected from the source domain does not work on target domains where the force direction is flipped relative to the source. This is evident when comparing ADAIL to GAIL with dynamics randomization: as shown in Figure 4(a), GAIL with dynamics randomization fails to generalize to the flipped-force domains, whereas ADAIL is able to match the expert's performance. We also include a comparison with ADAIL-rand, where the policy is conditioned on uniformly random values of the dynamics parameters, which completely breaks performance across the domains.

How does the GRL help improve the robustness of performance across domains?

To demonstrate the effectiveness of the GRL in the adversarial imitation learning formulation, we run a comparative study of GAIL with dynamics randomization, with and without the GRL, in the Hopper environment. The results are shown in Figure 4(b).

Figure 4: (a): ADAIL on CartPole-V0. Blue: PPO expert; green: GAIL with dynamics randomization; red: ADAIL with latent parameters from the dynamics posterior; light blue: ADAIL with uniformly random latent parameters. (b): GAIL with dynamics randomization without (left) or with (right) a GRL on Hopper.

How does the overall algorithm work in comparison with baseline methods?

We compare the performance of ADAIL with a few baseline methods, including 1) the PPO expert that was used to collect demonstrations; 2) the UP-true algorithm of yu2017preparing, which is essentially a PPO policy conditioned on the ground truth physics parameters; and 3) GAIL with dynamics randomization, i.e., unmodified GAIL trained on a variety of environments with varying dynamics. The results of this experiment are shown in Figure 5.

HalfCheetah. The experiments show that: 1) as expected, the PPO expert (Figure 5a) has limited adaptability to unseen dynamics; 2) UP-true (Figure 5b) achieves similar performance across test environments. Note that since UP-true has access to the ground truth reward signals and its policy is conditioned on the ground truth dynamics parameters, Figure 5b shows an approximate upper bound for our proposed method, as we assume access neither to reward signals during policy training nor to ground truth physics parameters at policy evaluation time; 3) GAIL with dynamics randomization (Figure 5c) generalizes to some extent, but fails to reach the demonstrated performance in the source environment (gravity x = 0.0, friction = 0.5); 4) Figures 9(f) and 9(g) show evaluations of ADAIL with the policy conditioned on ground truth and on predicted physics parameters, respectively; ADAIL matches the expert performance in the source environment (gravity x = 0.0, friction = 0.5) and generalizes to unseen dynamics. In particular, when the environment dynamics favor the task, the adaptive agent obtains even higher performance (around friction = 1.2, gravity = 2).

[Figure 5 panels (a)-(l): heatmaps for the PPO Expert, UP-true, GAIL-rand, and ADAIL on each of HalfCheetah, Ant, and Hopper.]
Figure 5: Comparing ADAIL with baselines on MuJoCo tasks. Each plot is a heatmap that shows the performance of an algorithm in environments with different dynamics. Each cell shows the cumulative reward averaged over 10 episodes for a particular 2D range of dynamics parameters. Note that to aid visualization, we render the plots for Ant in log scale.

Ant and Hopper. We again show favorable performance on both Ant and Hopper in Figure 5.

How does the algorithm generalize to unseen environments?

To understand how ADAIL generalizes to environments not sampled at training time, we run a suite of studies in which the agent is only allowed to interact with a limited set of environments. Figure 6 shows the performance of ADAIL in different settings, where a region of environment parameters including the expert source environment is "blacked out". This case is particularly challenging since the policy is not allowed to access the domain from which the expert demonstrations were collected, and so our dynamics-invariant discriminator is essential. For additional held-out experiments see Figure 10.

The experiments show that: 1) without training on the source environment, ADAIL with the ground truth parameters tends to suffer a performance drop in the blackout region but largely generalizes (Figure 6a); 2) the posterior's RMSE rises in the blackout region (Figure 6c); 3) consequently, ADAIL with the predicted dynamics parameters suffers from the posterior error in the blackout region (Figure 6b).

(a) ADAIL-true (5x5)
(b) ADAIL-pred (5x5)
(c) Posterior RMSE (5x5)
Figure 6: Generalization of our policy to held out parameters on the HalfCheetah environment. The red rectangles in plots show the blackout regions not seen during policy training.

How does the unsupervised version of the algorithm perform?

VAE-ADAIL on HalfCheetah. With the goal of understanding the characteristics of the dynamics latent embedding learned by the unsupervised method and its impact on the overall algorithm, as a proof of concept we apply VAE-ADAIL to the HalfCheetah environment, varying a single continuous dynamics parameter, friction. The performance is shown in Figure 8.

5 Conclusion

In this work we proposed the ADaptive Adversarial Imitation Learning (ADAIL) algorithm for learning adaptive control policies from a limited number of expert demonstrations. We demonstrated the effectiveness of ADAIL on a set of challenging MuJoCo control tasks and compared it against recent state-of-the-art baselines. We showed that ADAIL extends the generalization capacity of policies to unseen environments, and we proposed a variant of our algorithm, VAE-ADAIL, that does not require environment dynamics labels at training time. We will release the code to aid in reproduction upon publication.

References

Appendix A Appendix

A.1 Discriminator with Gradient Reversal Layer (GRL)

Figure 7: Discriminator with Gradient Reversal Layer (GRL). The red layer is the GRL, which reverses the gradients during backprop. The yellow layer is the dynamics-invariant layer that is shared with the classification task.

A.2 Markov Decision Process

An infinite-horizon, discounted Markov decision process (MDP) is defined as a tuple $(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, transition probability distribution $P(s_{t+1} \mid s_t, a_t)$, reward function $r(s, a)$, initial state distribution $\rho_0$, and discount factor $\gamma \in (0, 1)$. Let $\tau = (s_0, a_0, s_1, a_1, \ldots)$ be a trajectory of states and actions, and $R(\tau) = \sum_{t} \gamma^t r(s_t, a_t)$ the total discounted reward for the trajectory. The goal of RL algorithms is to find a policy $\pi(a \mid s)$ that maximizes the expected discounted cumulative reward $\mathbb{E}_{\tau \sim \pi}[R(\tau)]$, where $s_0 \sim \rho_0$, $a_t \sim \pi(\cdot \mid s_t)$, and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. In the imitation learning setting, the reward function is not given; instead, a set of expert demonstrations $\{\tau_E\}$ is provided, where each $\tau_E$ is sampled by rolling out an expert policy $\pi_E$ in the MDP.
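As a small concreteness check of the definition above, the total discounted reward of a finite trajectory can be computed as follows (a trivial helper, not part of the paper):

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_t gamma^t * r_t for a finite-length trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: discounted_return([1.0, 1.0, 1.0], gamma=0.9) == 1 + 0.9 + 0.81 == 2.71
```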

A.3 Huber Loss for the Dynamics Embedding

We use the following loss function when training the dynamics embedding posterior:

(5)   $\mathcal{L}_\delta\bigl(c, Q_\phi(\tau)\bigr) = \begin{cases} \frac{1}{2}\bigl(c - Q_\phi(\tau)\bigr)^2 & \text{if } \bigl|c - Q_\phi(\tau)\bigr| \le \delta \\ \delta\bigl(\bigl|c - Q_\phi(\tau)\bigr| - \frac{1}{2}\delta\bigr) & \text{otherwise} \end{cases}$

where $\delta$ controls the transition point between the L2 and L1 penalties in the Huber loss.
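For concreteness, a direct NumPy transcription of Eq. (5); the sum over the components of the dynamics code is an assumed reduction.

```python
import numpy as np

def huber_loss(c_pred, c_true, delta=1.0):
    """Piecewise Huber penalty: quadratic within delta of the target, linear outside (Eq. 5)."""
    err = np.abs(c_pred - c_true)
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return np.where(err <= delta, quadratic, linear).sum()
```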

Lemma 1. Minimizing the above Huber loss is equivalent to maximizing the log likelihood $\log p(c \mid \tau)$, assuming $c$ is distributed as a Gaussian around $Q_\phi(\tau)$ when $\bigl|c - Q_\phi(\tau)\bigr| \le \delta$, and as a Laplace distribution otherwise. See Appendix A.4 for the proof.

A.4 Proof of Lemma 1

Proof. For $\bigl|c - Q_\phi(\tau)\bigr| \le \delta$, assume $c \sim \mathcal{N}\bigl(Q_\phi(\tau), \sigma^2\bigr)$. Then

$\log p(c \mid \tau) = -\frac{1}{2\sigma^2}\bigl(c - Q_\phi(\tau)\bigr)^2 + \text{const},$

so maximizing the log likelihood with respect to $\phi$ is equivalent to minimizing the squared error $\frac{1}{2}\bigl(c - Q_\phi(\tau)\bigr)^2$, which is the quadratic branch of the Huber loss up to a positive scale factor.

Likewise, for $\bigl|c - Q_\phi(\tau)\bigr| > \delta$, assuming $c$ follows a Laplace distribution centered at $Q_\phi(\tau)$ with scale $b$,

$\log p(c \mid \tau) = -\frac{1}{b}\bigl|c - Q_\phi(\tau)\bigr| + \text{const},$

so maximizing the log likelihood is equivalent to minimizing the absolute error, which matches the linear branch of the Huber loss up to a positive scale and an additive constant.

A.5 VAE-ADAIL Algorithm

1: Inputs:
2: An environment class $\mathcal{E}$.
3: Initial parameters of policy $\theta$, discriminator $\omega$, and dynamics posterior (encoder) $\phi$.
4: A set of expert demonstrations $\{\tau_E\}$ on one of the environments $e_0 \in \mathcal{E}$.
5: for i = 1, 2, .. do
6:     Sample an environment $e_i \sim \mathcal{E}$.
7:     Sample trajectories $\tau_i \sim \pi_\theta(\cdot \mid \cdot, c)$ in $e_i$, with $c$ inferred by the posterior $q_\phi$, and $\tau_E^i \sim \{\tau_E\}$
8:     Update the discriminator parameters $\omega$ with the gradients $\hat{\mathbb{E}}_{\tau_E^i}\!\left[\nabla_\omega \log D_\omega(s,a)\right] + \hat{\mathbb{E}}_{\tau_i}\!\left[\nabla_\omega \log\bigl(1 - D_\omega(s,a)\bigr)\right]$
9:     Update the posterior parameters $\phi$ with the objective described in Eqs. (3) & (4)
10:     Update policy $\pi_\theta(a \mid s, c)$ using a policy optimization method (TRPO/PPO) with rewards $r(s,a) = -\log\bigl(1 - D_\omega(s,a)\bigr)$
11: Output: Learned policy $\pi_\theta$, and posterior $q_\phi$.
Algorithm 2 VAE-ADAIL

A.6 VAE-ADAIL Experiment on HalfCheetah

Figure 8: VAE-ADAIL performance on HalfCheetah

A.7 HalfCheetah ADAIL Performance Comparison

[Figure 9 panels: (a) PPO Expert, (b) GAIL, (c) GAIL-rand, (d) State-only GAIL-rand, (e) UP-true, (f) ADAIL-true, (g) ADAIL-pred, (h) Posterior RMSE.]
Figure 9: Comparing ADAIL with a few baselines on HalfCheetah. Each plot is a heatmap that shows the performance of an algorithm in environments with different dynamics. Each cell shows the cumulative reward averaged over 10 episodes for a particular 2D range of dynamics parameters.

A.8 Held-out Environment Experiment

(a) ADAIL-true (1x1)
(b) ADAIL-pred (1x1)
(c) Posterior RMSE (1x1)
(d) ADAIL-true (3x3)
(e) ADAIL-pred (3x3)
(f) Posterior RMSE (3x3)
(g) ADAIL-true (5x5)
(h) ADAIL-pred (5x5)
(i) Posterior RMSE (5x5)
Figure 10: Generalization of our policy to held-out environments. The red rectangles in the plots in the first column show the blackout regions not seen during policy training.

A.9 Hyperparameters

A.9.1 ADAIL

We use fully connected neural networks with 2 hidden layers for all three components of the system. The network hyperparameters for each of the test environments are shown in Table 2. For all the baseline methods, we use the same set of hyperparameters.

Environment   Policy architecture      Policy LR      Discriminator architecture   Discriminator LR   Posterior architecture          Posterior LR
CartPole-V0   (s,a) - 64 - 64 - (a)    0.0005586      (s,a) - 32 - 32 - 1          0.000167881        (s,a,s') - 76 - 140 - (1,c)     0.00532
Hopper        (s,a) - 64 - 64 - (a)    0.000098646    (s,a) - 32 - 32 - 1          0.0000261          (s,a,s') - 241 - 236 - (2,c)    0.00625
HalfCheetah   (s,a) - 64 - 64 - (a)    0.00005586     (s,a) - 32 - 32 - 1          0.0000167881       (s,a,s') - 150 - 150 - (2,c)    0.003
Ant           (s,a) - 64 - 64 - (a)    0.000047       (s,a) - 32 - 32 - 1          0.000037           (s,a,s') - 72 - 177 - (2,c)     0.002353
Table 2: ADAIL network architectures and learning rates on the test environments

A.9.2 VAE-ADAIL

In Table 3 we show the network architectures and learning rates for VAE-ADAIL.

Component             Architecture                     Learning rate
Encoder (Posterior)   (s,a,s') - 200 - 200 - (c)       0.000094
Decoder               (s,a,c) - 200 - 200 - (s')       0.000094
Policy                (s,a) - 64 - 64 - (a)            0.00005596
Discriminator         (s,a) - 32 - 32 - (1,c)          0.000046077
Table 3: VAE-ADAIL network architectures and learning rates