Sample-Efficient Imitation Learning via Generative Adversarial Nets

09/06/2018 ∙ by Lionel Blondé, et al. ∙ University of Geneva

Recent works in imitation learning articulate their formulation around the GAIL architecture, relying on the adversarial training procedure introduced in GANs. Albeit successful at generating behaviours similar to those demonstrated to the agent, GAIL suffers from a high sample complexity in the number of interactions it has to carry out in the environment in order to achieve satisfactory performance. In this work, we dramatically shrink the number of interactions with the environment by leveraging an off-policy actor-critic architecture. Additionally, employing deterministic policy gradients allows us to treat the learned reward as a differentiable node in the computational graph, while preserving the model-free nature of our approach. Our experiments span a variety of continuous control tasks.


1 Introduction

Reinforcement learning (RL) is a powerful and extensive framework enabling a learner to tackle complex continuous control tasks (Sutton2017-ow). Leveraging strong function approximators such as multi-layer neural networks, deep reinforcement learning alleviates the customary preliminary workload of hand-crafting relevant features for the learning agent to work on. While being freed from this engineering burden opens up the framework to an even broader range of complex control and planning tasks, RL remains hindered by its reliance on meaningful reward design, referred to as reward shaping. Albeit intuitively appealing, shaping often requires an intimidating amount of engineering via trial and error to yield natural-looking behaviours and makes the system prone to premature convergence to local minima (Ng1999-lv).

Imitation learning breaks free from the preliminary reward function hand-crafting step as it does not need access to a reinforcement signal. Instead, imitation learning learns to perform a task directly from expert demonstrations. The emerging policies mimic the behaviour displayed by the expert in those demonstrations. Learning from demonstrations (LfD) has enabled significant advances in robotics (Billard2008-jb) and autonomous driving (Pomerleau1989-nh; Pomerleau1990-lm). Such models were fit from the expert demonstrations alone in a supervised fashion, without gathering new data in simulation. Albeit efficient when data is abundant, they tend to be frail as the agent strays from the expert trajectories. The ensuing compounding of errors causes a covariate shift (Ross2010-eb; Ross2011-dn). This approach, referred to as behavioral cloning, is therefore poorly adapted for imitation. Those limitations stem from the sequential nature of the problem.

The caveats of behavioral cloning have recently been successfully addressed by Ho and Ermon (Ho2016-bv), who introduced a model-free imitation learning method called Generative Adversarial Imitation Learning (GAIL). Leveraging Generative Adversarial Networks (GANs) (Goodfellow2014-yk), GAIL alleviates the limitations of the supervised approach by a) learning a reward function that explains the behaviour shown in the demonstrations and b) following an RL procedure in an inner loop, consisting in performing rollouts in a simulated environment with the learned reward function as reinforcement signal. Several works have built on GAIL to overcome the weaknesses it inherits from GANs, with a particular emphasis on avoiding mode collapse (Goodfellow2017-pv), which results in policies that fail to display the diversity of demonstrated behaviours or skills. We do not tackle the mode collapse problem in this paper. Works that do attempt to overcome this limitation (Li2017-rk; Hausman2017-hb; Kuefler2017-zu) fall short of addressing the sample inefficiency of GAIL and therefore still necessitate a considerable amount of interaction with the simulated environment. In this paper, we tackle this sample inefficiency limitation. Note that ‘sample efficient’ here means that we focus on limiting the number of agent-environment interactions, as opposed to reducing the number of expert demonstrations needed by the agent. Albeit important, limiting the number of needed demonstrations is not in the direct scope of this work.

The failure of previous works to address this excessive sample complexity stems from the on-policy nature of the RL procedure they employ. Specifically, in likelihood ratio policy gradient methods, every interaction in a given rollout is used to compute the Monte Carlo estimate of the state value by summing the rewards accumulated during the current trajectory. The experienced transitions (the atomic units of interaction in RL) are then disregarded. Holding on to past trajectories to carry out more than a single optimization step might appear viable but often results in destructively large policy updates (Schulman2017-ou). Gradients based on those estimates therefore suffer from high variance, which can be reduced by intensifying the sampling, hence the poor sample efficiency.

In this work, we deal with this sample inefficiency in the number of simulator queries by leveraging a policy gradient method with function approximation (Sutton1999-ii), referred to as an actor-critic method. By designing an off-policy learning procedure relying on retained past experiences, we considerably shrink the number of interactions necessary to learn good imitation policies. We build on Deep Deterministic Policy Gradients (Lillicrap2015-xa), a state-of-the-art off-policy actor-critic method based on deterministic policy gradients. This also allows us to exploit further information involving the learned reward function, such as its gradient. Previous methods either ignore it by treating the reward signal as a scalar in a model-free fashion or build a model of the environment to exploit it. Our method achieves the best of both worlds as it can perform a backward pass from the discriminator to the generator (policy) while remaining model-free.

2 Related Work

Imitation learning aims to learn how to perform tasks solely from expert demonstrations. Two approaches are typically adopted to tackle the imitation learning problem: a) behavioral cloning (BC) (Pomerleau1989-nh; Pomerleau1990-lm), which learns a policy via regression on the state-action pairs from the expert trajectories, and b) apprenticeship learning (AL) (Abbeel2004-rb), which posits the existence of some unknown reward function under which the expert policy is optimal and learns a policy by i) recovering the reward that the expert is assumed to maximise (an approach called inverse reinforcement learning (IRL)) and ii) running an RL procedure with this recovered signal. As a supervised approach, BC is limited to the available demonstrations to learn a regression model, whose predictions worsen drastically as the agent strays from the demonstrated trajectories. It then becomes increasingly difficult for the model to recover as the errors compound (Ross2010-eb; Ross2011-dn; Bagnell2015-ni). Only the presence of correcting behaviour in the demonstration dataset can allow BC to produce robust policies. AL alleviates this weakness by entangling the learning of the reward function and the learning of the mimicking policy, leveraging the return of the latter to adjust the parameters of the former. Models are trained on traces of interaction with the environment rather than on a fixed state pool, leading to greater generalization to states absent from the demonstrations. Albeit preventing errors from compounding, IRL comes with a high computational cost, as both modelling the reward function and solving the ensuing RL problem (per learning iteration) can be resource intensive (Syed2008-zo; Syed2008-su; Ho2016-xn; Levine2011-hi).

In an attempt to overcome the shortcomings of IRL, Ho and Ermon (Ho2016-bv) managed to bypass the need for learning the reward function the expert is assumed to have optimised when the demonstrations were collected. The proposed approach to AL, Generative Adversarial Imitation Learning (GAIL), relies on an essential step: learning a surrogate function measuring the similarity between the learned policy and the expert policy, using Generative Adversarial Networks (Goodfellow2014-yk). The learned similarity metric is then employed as a reward proxy to carry out the RL step, preserved from the AL learning scheme. Recently, connections have been drawn between GANs, RL (Pfau2016-ft) and IRL (Finn2016-uj). In this work, we extend GAIL to further exploit the connections between those frameworks and overcome a limitation that was left unaddressed: the burdensome sample inefficiency of the method.

Generative adversarial networks, under their original formulation, involve a generator and a discriminator, each represented by a neural network, making the associated computational graph fully differentiable. In particular, the gradient of the discriminator with respect to the output of the generator is of primary importance, as it indicates in which direction the generator should change in order to have better chances of fooling the discriminator at the next iteration. In GAIL, the generator’s role is carried out by a stochastic policy, causing the computational graph to no longer be differentiable end-to-end. Following a model-based approach, (Baram2017-es) is able to recover the gradient of the discriminator (learned reward function) with respect to actions (via reparametrization tricks) and with respect to states (via a forward model), making the computational graph fully differentiable. The deterministic policy gradient theorem (Silver2014-dk), leveraged by our method, enables us to directly involve the gradient of the discriminator with respect to the actions to guide our mimicking agent. However, since we adopt a model-free approach, states remain stochastic nodes in the computational graph and therefore block (backward) gradient flows.

3 Background

Setting

We address the problem of an agent learning to act in an environment in order to reproduce the behaviour of an expert demonstrator. No direct supervision is provided to the agent — she is never directly told what the optimal action is — nor does she receive a reinforcement signal from the environment upon interaction. Instead, the agent is provided with a pool of expert trajectories and must use them to guide her learning process.

Preliminaries

We model this sequential interactive problem over discrete timesteps as a Markov decision process (MDP) $\mathcal{M}$, formalised as a tuple $(\mathcal{S}, \mathcal{A}, p, \rho_0, \gamma, r)$. $\mathcal{S}$ and $\mathcal{A}$ respectively denote the state and action spaces. The dynamics are defined by a transition distribution with conditional density $p(s_{t+1} \mid s_t, a_t)$, along with $\rho_0(s_0)$, the density of the distribution from which the initial state is sampled. Finally, $\gamma \in [0, 1)$ denotes the discount factor and $r(s_t, a_t)$ the reward function. We consider only the fully-observable case, in which the current state $s_t$ can be described with the current observation $o_t$, alleviating the need to involve the entire history of observations. Although our results are presented following the previous infinite-horizon MDP, the MDPs involved in our experiments are episodic, with $\gamma = 0$ at episode termination. In the theory, whenever we omit the discount factor, we implicitly assume the existence of an absorbing state along any trajectory generated by the agent.

We formalise the sequential decision making process of the agent by defining a parameterised policy $\pi_\theta$, modelled via a neural network with parameter $\theta$. $\pi_\theta(a_t \mid s_t)$ designates the conditional probability density concentrated at action $a_t$ when the agent is in state $s_t$. In line with our setting, the agent interacts with $\mathcal{M} \setminus \{r\}$, an MDP comprising every element of $\mathcal{M}$ except its reward function $r$. Since our approach involves learning a surrogate reward function $r_\varphi$, we define $\mathcal{M}_\varphi$, denoting the MDP resulting from the augmentation of $\mathcal{M} \setminus \{r\}$ with the learned reward. We can therefore equivalently assume that the agent interacts with $\mathcal{M}_\varphi$. Trajectories are traces of interaction between an agent and an MDP. Specifically, we model trajectories as sequences of transitions $(s_t, a_t, r_t, s_{t+1})$, the atomic units of interaction. Demonstrations are provided to the agent through a set of expert trajectories $\mathcal{D} = \{\tau_i\}_i$, generated by an expert policy $\pi_e$ in $\mathcal{M}$. Note that we adopt the RL definition of transition, as opposed to depicting a transition via a state-action pair as customarily assumed in IL. This preliminary modelling choice allows us to formally manipulate entities from both worlds that our method employs.

Finally, we introduce concepts and notations that will be instrumental in the remainder of this work. The return is the total discounted reward from timestep $t$ onwards: $R_t = \sum_{k=t}^{\infty} \gamma^{k-t} r(s_k, a_k)$, with $\gamma \in [0, 1)$. The state-action value, or Q-value, is the expected return after picking action $a_t$ in state $s_t$, and thereafter following policy $\pi$: $Q^\pi(s_t, a_t) = \mathbb{E}^{\pi}_{\mathcal{M}}[R_t \mid s_t, a_t]$, where $\mathbb{E}^{\pi}_{\mathcal{M}}[\cdot]$ denotes the expectation taken along trajectories generated by $\pi$ in $\mathcal{M}$ (respectively $\mathbb{E}^{\pi}_{\mathcal{M}_\varphi}[\cdot]$ for $\pi$ in $\mathcal{M}_\varphi$) and looking onwards from state $s_t$ and action $a_t$. We want our agent to find a policy that maximises the expected return from the start state, which constitutes our performance objective $J(\pi) = \mathbb{E}^{\pi}_{\mathcal{M}}[R_0]$, i.e. $\pi^* = \operatorname{arg\,max}_\pi J(\pi)$. To ease further notation, we also introduce the discounted state visitation distribution of a policy $\pi$, denoted by $\rho^\pi$ and defined by $\rho^\pi(s) = \sum_{t=0}^{\infty} \gamma^t \, p(s_t = s)$, where $p(s_t = s)$ is the probability of arriving at state $s$ at time step $t$ when sampling the initial state from $\rho_0$ and thereafter following policy $\pi$. In our experiments, we omit the discount factor for state visitation, in line with common practice.
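To make the preceding definitions concrete, the following minimal Python sketch (illustrative only, not part of the paper) computes a discounted return and a Monte Carlo estimate of the Q-value from sampled reward sequences; the discount value 0.995 simply echoes the one reported in the hyperparameter tables.

import numpy as np

def discounted_return(rewards, gamma=0.995):
    # R_t for t = 0: sum_k gamma^k * r_k over a single trajectory
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def mc_q_estimate(reward_trajectories, gamma=0.995):
    # Average the returns of several rollouts that all start with the same
    # (state, action) pair to estimate Q^pi(s, a) by Monte Carlo
    return np.mean([discounted_return(rs, gamma) for rs in reward_trajectories])

# Example: three rollouts starting from the same (s, a)
print(mc_q_estimate([[1.0, 0.5, 0.0], [1.0, 1.0, 1.0], [0.0, 0.0, 2.0]]))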

4 Algorithm

Adversarial Inverse Reinforcement Learning

GAIL (Ho2016-bv) strays from previous apprenticeship learning approaches as it does not explicitly attempt to learn the reward function the demonstrator is assumed to optimise. Rather, the IRL step of GAIL learns a similarity metric between agent and expert, which then serves as a synthetic reinforcement signal to guide the policy learned in the RL step. Specifically, GAIL solves the IRL sub-problem by leveraging a new architecture inspired from GANs (Goodfellow2014-yk). In the proposed framework, the agent mimics the behaviour of an expert by adopting a policy $\pi_\theta$ that matches the expert policy $\pi_e$.

Leveraging a GAN to learn a surrogate reward, GAIL introduces an extra neural network $D_\varphi$ to play the role of discriminator, while the role of generator is carried out by the policy $\pi_\theta$. $D_\varphi$ tries to accurately assert whether a given state-action pair originates from trajectories of $\pi_e$ or $\pi_\theta$, while $\pi_\theta$ attempts to fool $D_\varphi$ into believing her state-action pairs come from $\pi_e$. The situation can be described as a minimax problem $\min_\theta \max_\varphi V(\theta, \varphi)$, where the value of the two-player adversarial game is:

$$V(\theta, \varphi) = \mathbb{E}_{\pi_e}\big[\log D_\varphi(s, a)\big] + \mathbb{E}_{\pi_\theta}\big[\log\big(1 - D_\varphi(s, a)\big)\big] - \lambda H(\pi_\theta) \qquad (1)$$

$H(\pi_\theta)$ is the causal entropy of $\pi_\theta$ (Bloem2014-bj), and echoes the exploration-inducing entropy regularization from the RL literature. The optimization is however hindered by the stochasticity of $\pi_\theta$, causing $V$ to be non-differentiable with respect to $\theta$. The solution proposed in (Ho2016-bv) consists in alternating between a gradient step (Adam, (Kingma2014-op)) on $\varphi$ to increase $V$ with respect to $D_\varphi$, and a policy optimization step (TRPO, (Schulman2015-jt)) on $\theta$ to decrease $V$ with respect to $\pi_\theta$. In other words, while $D_\varphi$ is trained as a binary classifier to predict whether a given state-action pair is real (from $\pi_e$) or generated (from $\pi_\theta$), the policy is trained by being rewarded for successfully confusing $D_\varphi$ into believing a generated sample comes from $\pi_e$. The reward is defined as the negative of the generator loss. As for the latter, the former can be stated in two variants, which we go over and discuss in supplementary material. Finally, adversarial IRL yields $r_\varphi(s, a)$ as synthetic reward.
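As an illustration of this adversarial reward learning step, the sketch below trains a binary classifier on state-action pairs and reads off a synthetic reward as the negative generator loss. It is a hedged sketch rather than the authors' implementation: the network width, learning rate and input dimensions are assumptions, the discriminator output is interpreted as the probability that a pair comes from the expert, and the reward corresponds to the saturating variant discussed in Appendix B.

import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, width = 11, 3, 64   # e.g. Hopper-like sizes (assumed)
D = nn.Sequential(nn.Linear(obs_dim + act_dim, width), nn.LeakyReLU(),
                  nn.Linear(width, width), nn.LeakyReLU(),
                  nn.Linear(width, 1))
opt_D = torch.optim.Adam(D.parameters(), lr=3e-4)

def discriminator_step(expert_sa, agent_sa):
    # Binary cross-entropy on logits: expert pairs labelled 1, agent pairs 0
    logits_e, logits_a = D(expert_sa), D(agent_sa)
    loss = (F.binary_cross_entropy_with_logits(logits_e, torch.ones_like(logits_e))
            + F.binary_cross_entropy_with_logits(logits_a, torch.zeros_like(logits_a)))
    opt_D.zero_grad()
    loss.backward()
    opt_D.step()

def synthetic_reward(sa):
    # Saturating variant: r = -log(1 - D(s, a)), with D = sigmoid(logits)
    return -torch.log(1.0 - torch.sigmoid(D(sa)) + 1e-8)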

Another approach to overcome the non-differentiability of $V$ with respect to $\theta$ is explored in (Baram2017-es), which plugs a forward model of the MDP into the model-free GAIL setting to gain full differentiability of the stochastic computational graph. Our model however only seeks differentiability of the discriminator with respect to actions, in order to extract more information about how to adjust the policy parameters to fool $D_\varphi$ than TRPO does, since the latter treats the learned reward as a scalar. We achieve this by leveraging deterministic policy gradients (DPG) (Silver2014-dk). Specifically, we employ an off-policy deterministic actor-critic architecture built on DPG (DDPG, (Lillicrap2015-xa)), which also allows us to achieve greater sample efficiency.

Deterministic Policy Gradients

Actor-Critic (AC) methods interleave policy evaluation with policy iteration. Policy evaluation estimates the state-action value function with a function approximator called the critic, $Q_\psi$, usually via either Monte Carlo (MC) estimation or Temporal Difference (TD) learning. Policy iteration updates the policy by greedily optimising it against the estimated critic $Q_\psi$. Recent work showed that using off-policy TD learning for the critic, by means of experience replay, yields significant gains in sample efficiency (Silver2014-dk). Additionally, a recent ablation study (Hessel2017-ns) for the specific case of DQN (Mnih2013-rb; Mnih2015-iy) hints that approaches such as $n$-step returns might be instrumental in improving the sample efficiency of off-policy actor-critic methods even further. We build on DDPG to a) bring the sample efficiency of off-policy actor-critic methods to GAIL and b) leverage gradient information hitherto disregarded.

As the name states, DDPG (Lillicrap2015-xa) employs deterministic policies: at a given state $s_t$, the agent acts according to its deterministic policy $\mu_\theta$ and selects the action $a_t = \mu_\theta(s_t)$. Alternatively, we can obtain a deterministic policy from any stochastic policy $\pi$ by systematically picking the average action for a given state: $\mu(s_t) = \mathbb{E}_{a \sim \pi(\cdot \mid s_t)}[a]$. Deterministic policies have zero variance in their predictions for a given state, translating to no exploratory behaviour. The exploration problem is therefore treated independently from how the policy is modelled, by defining a stochastic policy from the learned deterministic policy $\mu_\theta$. In this work, we construct it via the combination of two fundamentally different techniques: a) applying an adaptive perturbation to the learned weights (exploration by noise injection in parameter space (Plappert2017-rl; Fortunato2017-af)) and b) adding temporally-correlated noise sampled from an Ornstein-Uhlenbeck process (Lillicrap2015-xa), well-suited for control tasks involving inertia (e.g. simulated robotics and locomotion tasks). We denote the obtained exploratory policy by $\tilde{\mu}_{\tilde{\theta}}$, where $\tilde{\theta}$ results from applying a) to $\theta$.
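The temporally-correlated part of the exploration noise can be sketched as a discretised Ornstein-Uhlenbeck process, as below; the value 0.2 for sigma echoes the one reported in the hyperparameter table, while theta and dt are illustrative assumptions.

import numpy as np

class OUNoise:
    # Discretised Ornstein-Uhlenbeck process producing temporally-correlated noise
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(size, mu, dtype=np.float64)

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, I)
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape))
        self.x = self.x + dx
        return self.x

noise = OUNoise(size=3)  # one dimension per action coordinate
# Usage: a_t = mu_theta(s_t) + noise.sample(), clipped to the action bounds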

Our model, which we call Sam (Sample-efficient Adversarial Mimic), is a new imitation learning technique that combines deterministic policy gradients with an adversarial reward learning procedure and requires significantly fewer interactions with the environment to mimic expert behaviours. Sam is composed of three interconnected learning modules: the reward module (parameter $\varphi$), the policy module (parameter $\theta$), and the critic module (parameter $\psi$) (Figure 1). As an off-policy method, Sam cycles through the following steps: i) the Sam agent uses $\tilde{\mu}_{\tilde{\theta}}$ to interact with $\mathcal{M}_\varphi$, ii) stores the experienced transitions in a replay buffer $\mathcal{R}$, iii) samples a mini-batch of transitions from $\mathcal{R}$ following the off-policy distribution $\rho^\beta$, and iv) updates her parameters ($\varphi$, $\theta$ and $\psi$) by performing a training step over the mini-batch. A more detailed description of the training procedure is laid out in the algorithm pseudo-code (Algorithm 1).
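A minimal replay buffer supporting steps ii) and iii) of this cycle could look as follows; this is a sketch, where the capacity of 100K and minibatch size of 32 are the values reported for Sam in the hyperparameter table, and the field names are ours.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):           # 100K, as in the Sam table
        self.storage = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        # Step ii): keep the experienced transition around for later reuse
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Step iii): decorrelated minibatch drawn uniformly from past experience
        batch = random.sample(self.storage, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones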

Figure 1: Inter-module relationships in different neural architectures (the scope of this figure was inspired from (Pfau2016-ft)). Modules with distinct loss functions are depicted with empty circles, while filled circles designate environmental entities. Solid and dotted arrows respectively represent (forward) flow of information and (backward) flow of gradient. Left: Generative Adversarial Imitation Learning (Ho2016-bv). Middle: Actor-Critic architecture (Sutton1999-ii). Right: Sam (this work). Note that in Sam, the critic takes in information from the reward module, while in the vanilla AC architecture, the critic receives the reward from the environment. The gradient flow from the critic to the reward module must however be sealed. Indeed, such a gradient flow would allow the policy to adjust its parameters to induce values of the reward which yield low TD residuals, hence preventing both the critic and reward modules from being learned as intended.

The reward and policy modules are both involved in a GAN-style adversarial training procedure, while the policy and critic modules are trained as an actor-critic architecture. As recently recalled in (Pfau2016-ft), GANs and actor-critic architectures can both be framed as bilevel optimization problems, each involving two competing components, which we just listed for both architectures. Interestingly, the policy module plays a role in both problems, tying the two bilevel optimization problems together. In one problem, the policy module is trained against the reward module, while in the other, the policy module is trained against the critic module. The reward and critic modules can therefore be seen as serving analogous roles in their respective bilevel optimization problems: forging and maintaining a signal which enables the reward-seeking policy to adopt the desired behaviour. The exhibited analogy translates into both modules having analogous contributions in the gradient estimate of the performance objective, employed to update the policy parameters $\theta$:

$$\nabla_\theta J(\mu_\theta) \approx \mathbb{E}_{s_t \sim \rho^\beta}\Big[\nabla_\theta \mu_\theta(s_t)\, \nabla_a \big[(1 - \alpha)\, Q_\psi(s_t, a) + \alpha\, r_\varphi(s_t, a)\big]\big|_{a = \mu_\theta(s_t)}\Big] \qquad (2)$$

where $\alpha$ is a hyperparameter representing the relative importance we assign to each contribution. As a reminder, $s_t \sim \rho^\beta$ signifies that transitions are sampled off-policy from the replay buffer $\mathcal{R}$. This gradient estimate stems from the policy gradient theorem proved by (Silver2014-dk), augmented, by the previously exposed semantic analogy, with a term involving another estimate of how well the agent is behaving: the learned surrogate reward $r_\varphi$. Each estimate ($r_\varphi$ and $Q_\psi$) is trained via a different policy evaluation method, each presenting its specific advantages. The first is updated by adversarial training, providing an accurate estimate of the immediate similarity with expert trajectories. The second however is trained via TD learning, enabling longer propagation of rewards along trajectories and effectively tackling the credit assignment problem.
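A possible realisation of this interpolated policy update, under the assumption that the gradient takes the form sketched in Equation (2), is given below; `policy`, `critic`, `reward_net`, `opt_policy` and `alpha` are placeholder names, and alpha = 0 corresponds to the setting used in the reported experiments.

import torch

def policy_update(policy, critic, reward_net, opt_policy, states, alpha=0.0):
    # One deterministic policy gradient step in the spirit of Eq. (2)
    actions = policy(states)                       # a = mu_theta(s)
    sa = torch.cat([states, actions], dim=-1)
    # Interpolate the two estimates of how well the agent is doing and ascend
    # their gradient w.r.t. theta; autograd composes grad_a with grad_theta mu_theta.
    # alpha = 0 recovers the vanilla DDPG actor update.
    objective = (1.0 - alpha) * critic(sa) + alpha * reward_net(sa)
    opt_policy.zero_grad()
    (-objective.mean()).backward()
    opt_policy.step()
    # Gradients also accumulate in the critic/reward parameters here; they are
    # cleared before those modules' own updates.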

Given that the policy is deterministic, we can adopt an off-policy TD-learning procedure to fit the critic (Lillicrap2015-xa), therefore learning the critic solely with samples from $\mathcal{R}$. The loss optimised by the critic, noted $\ell_Q$, involves three components: i) a 1-step Bellman residual $\ell_1$, ii) an $n$-step Bellman residual $\ell_n$, and iii) a weight decay regulariser ((Vecerik2017-ue) employs the same losses in the context of RLfD, but also uses a weight decay regulariser for the policy):

$$\ell_Q = \ell_1 + \ell_n + \nu \, \lVert \psi \rVert_2^2 \qquad (3)$$

where $\nu$ is a hyperparameter that determines how much decay is used. The losses i) and ii) are defined respectively based on the 1-step and $n$-step lookahead versions of the Bellman equation,

$$Q^{\mu_\theta}(s_t, a_t) = r_\varphi(s_t, a_t) + \gamma\, Q^{\mu_\theta}\big(s_{t+1}, \mu_\theta(s_{t+1})\big) \qquad (4)$$
$$Q^{\mu_\theta}(s_t, a_t) = \sum_{k=0}^{n-1} \gamma^{k}\, r_\varphi(s_{t+k}, a_{t+k}) + \gamma^{n}\, Q^{\mu_\theta}\big(s_{t+n}, \mu_\theta(s_{t+n})\big) \qquad (5)$$

yielding the critic losses:

$$\ell_1 = \mathbb{E}_{\rho^\beta}\big[(Q_\psi(s_t, a_t) - y^{(1)}_t)^2\big], \qquad \ell_n = \mathbb{E}_{\rho^\beta}\big[(Q_\psi(s_t, a_t) - y^{(n)}_t)^2\big] \qquad (6)$$

where $y^{(1)}_t$ and $y^{(n)}_t$ denote the targets obtained from the right-hand sides of (4) and (5). Both targets ((4), (5)) depend on $Q^{\mu_\theta}$, which might cause severe instability. In order to prevent the critic from diverging, we use separate target networks for both policy and critic ($\mu_{\theta'}$, $Q_{\psi'}$) to calculate the targets, which slowly track the learned parameters ($\theta$, $\psi$). We also found that using an $n$-step TD backup was necessary for our method to learn well-behaved policies.
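For concreteness, the targets and the critic loss of Equations (3)-(6) could be computed as in the following sketch; tensor shapes and the names `target_policy`, `target_critic` (standing for $\mu_{\theta'}$, $Q_{\psi'}$) are assumptions, the reward stored in the buffer is the learned one, and in practice the targets would be computed under torch.no_grad().

import torch
import torch.nn.functional as F

def one_step_target(r, s_next, done, target_policy, target_critic, gamma=0.995):
    # y^(1) = r + gamma * Q_psi'(s', mu_theta'(s')), zeroed at termination
    a_next = target_policy(s_next)
    q_next = target_critic(torch.cat([s_next, a_next], dim=-1))
    return r + gamma * (1.0 - done) * q_next

def n_step_target(rewards, s_n, done_n, target_policy, target_critic, gamma=0.995):
    # rewards: (batch, n) learned rewards along the sub-trajectory;
    # s_n, done_n: state and termination flag after n steps
    n = rewards.shape[1]
    discounts = gamma ** torch.arange(n, dtype=rewards.dtype)
    partial = (rewards * discounts).sum(dim=1, keepdim=True)
    a_n = target_policy(s_n)
    q_n = target_critic(torch.cat([s_n, a_n], dim=-1))
    return partial + (gamma ** n) * (1.0 - done_n) * q_n

def critic_loss(critic, s, a, y1, yn, wd=1e-3):
    # l_Q = l_1 + l_n + wd * ||psi||^2, cf. Eq. (3)
    q = critic(torch.cat([s, a], dim=-1))
    l2 = sum((p ** 2).sum() for p in critic.parameters())
    return F.mse_loss(q, y1) + F.mse_loss(q, yn) + wd * l2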

Initialise replay buffer $\mathcal{R}$
Initialise network parameters for each module ($\varphi$, $\theta$, $\psi$)
Initialise target network parameters ($\theta'$, $\psi'$) as respective copies of ($\theta$, $\psi$)
for each training iteration do
      for each collection step do
             // Collect and store samples from the environment
             Interact with the environment following $\tilde{\mu}_{\tilde{\theta}}$
             Store the experienced transitions in the replay buffer $\mathcal{R}$
      end for
      for each update round do
             // Update Sam
             for each discriminator training step do
                    // Update the (d)iscriminator (synthetic reward)
                    Sample a minibatch of previously experienced transitions from $\mathcal{R}$
                    Sample a minibatch of transitions from the expert dataset $\mathcal{D}$, of the same size
                    Update the synthetic reward parameter $\varphi$ on the equal mixture by ascending the gradient of (1) with respect to $\varphi$
             end for
             for each generator training step do
                    // Update the (g)enerator (actor-critic)
                    Sample a minibatch of previously experienced transitions from $\mathcal{R}$
                    Update the policy parameter $\theta$ by following the gradient in (2)
                    Update the critic parameter $\psi$ by minimizing the critic loss (3)
                    Update the target network parameters ($\theta'$, $\psi'$) to slowly track ($\theta$, $\psi$), respectively
             end for
      end for
end for
Algorithm 1 Sam: Sample-efficient Adversarial Mimic

5 Results

Our agents were trained in physics-based control environments, built with the MuJoCo physics engine (Todorov2012-gc) and wrapped via the OpenAI Gym (Brockman2016-sb) API. Tasks simulated in the environments range from legacy balance-oriented tasks to locomotion tasks of various complexities. Time and resource constraints led us to consider the first four environments when ordered by increasing complexity: InvertedPendulum (Cartpole), InvertedDoublePendulum (Acrobot), Reacher and Hopper. For each environment, an expert was designed by training an agent for 10M timesteps using the TRPO (Schulman2015-jt) implementation from (Dhariwal2017-kt). The episode horizon (maximum episode length) was left to its default value per environment. We created a dataset of expert trajectories per environment. Trajectories are not limited in length but often coincide with the episode horizon, due to the good performance of the experts. Moreover, they are extracted without randomisation, which ensures that two compared models are trained on exactly the same subset of extracted trajectories.
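For reference, expert trajectories can be gathered through the Gym API roughly as follows; this is a sketch using the Gym interface of that era, where `expert_policy` is a stand-in for the trained TRPO expert and the trajectory count is arbitrary, not the value used in the paper.

import gym

def collect_trajectories(env_name="Hopper-v2", n_traj=16, expert_policy=None):
    env = gym.make(env_name)
    dataset = []
    for _ in range(n_traj):
        obs, done, traj = env.reset(), False, []
        while not done:
            act = expert_policy(obs) if expert_policy else env.action_space.sample()
            next_obs, rew, done, _ = env.step(act)
            traj.append((obs, act, rew, next_obs))
            obs = next_obs
        dataset.append(traj)
    return dataset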

Figure 2: Evaluation of the reward module. The regression line supports our claim: the surrogate reward, learned solely from expert demonstrations, is a good proxy of the true simulator reward.
Figure 3: Performance comparison between Sam and GAIL. The markers represent the number of expert trajectories the agent had access to. The figure shows that our method has a considerably better sample efficiency than GAIL in the explored set of environments. Note that the horizontal axis has a logarithmic scale. We remind the reader that ‘demonstration’ here means ‘expert trajectory’. The number of interactions has been averaged over the pool of parallel workers used to run the experiment (32). High-resolution versions of these plots are provided in supplementary material.

Since we claim to improve the sample efficiency of GAIL (Ho2016-bv) and, to the best of our knowledge, do not share this claim with other works, we will compare Sam directly to GAIL as baseline. Implementation details are provided in supplementary material.

Our results are presented in Figure 3, which traces the evolution of the environmental reward collected by Sam and GAIL agents as they sequentially interact with the environment. For the first three environments (due to limited time and resources), we evaluated the performance of the agents when provided with various quantities of demonstrations. Sam agents consistently reach expert performance with fewer interactions than GAIL agents, which do not always achieve the demonstrator’s performance (e.g. in the Reacher environment). The sample efficiency we gain over GAIL is considerable: Sam needs more than one order of magnitude fewer interactions with the environment to attain asymptotic expert performance. Note that the horizontal axis is scaled logarithmically. While GAIL requires full traces of agent–environment interaction per iteration, as it relies on Monte Carlo estimates, Sam only requires a couple of transitions per iteration since it performs policy evaluation via Temporal Difference learning. Instead of sampling transitions from the environment, performing an update and discarding the transitions, Sam keeps experiential data in memory and can therefore leverage decorrelated transitions collected in previous iterations to perform an off-policy update. Our method therefore requires considerably fewer new samples (interactions) per iteration, as it can re-exploit previously experienced transitions.

For the first two environments, GAIL and Sam have comparable wall-clock time. As the tasks become more complex however, Sam’s wall-clock time becomes greater, which is explained by the number of sub-iterations our method performs on the replay buffer per iteration, not involving any new samples. On the Reacher task, GAIL took a tremendous number of iterations to assimilate the demonstrated skill and reproduce a similar behaviour, without even fully reaching the expert’s performance. Training durations were therefore in favour of Sam. For the Hopper environment however, while a GAIL agent takes on average 1.5 hours to adopt an expert-like behaviour, a Sam agent takes approximately 30 hours. While these results shed light on a clear trade-off between wall-clock time and number of interactions, we claim that in real-world scenarios (e.g. robotic manipulation, autonomous cars), reducing the required interaction with the world is significantly more desirable, for obvious safety and cost reasons. We report the hyperparameters used in our experiments in supplementary material.

The gradient in Equation (2) involves the interpolation hyperparameter $\alpha$. In this work, we only present results for $\alpha = 0$, as the preliminary experiments we conducted with $\alpha > 0$ have not led to expert-like policies. Time and resource constraints however did not allow for a full exploration of this trade-off. We therefore leave further investigation for future work.

6 Conclusion

In this work, we introduced a method, called Sample-efficient Adversarial Mimic (Sam), that meaningfully overcomes one considerable drawback of the Generative Adversarial Imitation Learning (Ho2016-bv) algorithm: the number of agent–environment interactions it requires to learn expert-like policies. We demonstrate that our method shrinks the number of interactions by an order of magnitude, and sometimes more. Leveraging an off-policy procedure was key to that success.

Appendix A Studied environments

The environments we dealt with were provided through the OpenAI Gym (Brockman2016-sb) API, building on the MuJoCo physics engine (Todorov2012-gc), to model physical interactive scenarios between an agent and the environment she is thrown into. The control tasks modelled by the environments involve locomotion tasks as well as tasks in which the agent must reach and remain in a state of dynamic balance.

Environment State DoFs Action DoFs
InvertedPendulum-v2
InvertedDoublePendulum-v2
Reacher-v2
Hopper-v2
Figure 4: Degrees of freedom (DoF) of the four considered MuJoCo simulated environments. DoFs of both the continuous action and state spaces are presented for the studied physical control tasks. Action spaces are bounded along every dimension, while the state spaces are unbounded.

Appendix B Reward function variants

The reward is defined as the negative of the generator loss. As for the latter, the former can be stated in two variants, the saturating version and the non-saturating version, respectively

$$r_\varphi(s, a) = -\log\big(1 - D_\varphi(s, a)\big) \qquad \text{and} \qquad r_\varphi(s, a) = \log D_\varphi(s, a) \qquad (7)$$

The non-saturating alternative is recommended in the original GAN paper as well as, more recently, in (Fedus2017-bk), as its generator loss suffers from vanishing gradients only in regions where the generated samples are already close to the real data. GAIL relies on policy optimization to update the generator, which makes this vanishing-gradient argument vacuous. Besides, in the context of simulated locomotion environments, the saturating version proved to prevail in our experiments, as our agents were unable to overcome the extremely low rewards incurred early in training when using the non-saturating rewards. With the saturating version, the signals obtained in early failure cases were close to zero, which was numerically more forgiving for our agents to kick off.
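The numerical argument above can be checked directly; in the sketch below (illustrative only) the discriminator output is interpreted as the probability that a pair comes from the expert, so that early in training, when D is close to 0 on agent samples, the saturating reward is close to zero while the non-saturating one is extremely low.

import numpy as np

def saturating_reward(d):        # r = -log(1 - D)
    return -np.log(1.0 - d + 1e-8)

def non_saturating_reward(d):    # r = log(D)
    return np.log(d + 1e-8)

d_early = 1e-4                   # agent easily told apart from the expert
print(saturating_reward(d_early))      # ~ 0.0: numerically forgiving
print(non_saturating_reward(d_early))  # ~ -9.2: extremely low reward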

Appendix C Experimental setup

Our algorithm and the baseline implement the MPI interface: each experiment has been launched concurrently with 32 parallel workers (32 different seeds), each having its own interaction with the environment, its own replay buffer, its own optimisers and its own network updates. However, at every iteration and for a given network (e.g. the critic), the gradients of the 32 Adam (Kingma2014-op) optimisers are pulled together and averaged, and a unique averaged gradient is distributed back to the workers for immediate use. We rely on the MPI-optimised Adam optimiser available in the OpenAI Baselines (Dhariwal2017-kt) repository, released to enhance and encourage reproducibility. Our experiments have all been conducted relying solely on a single 16-core CPU workstation (AMD Ryzen™ Threadripper 1950X CPU). We are currently working on gaining access to GPU infrastructure to fully take advantage of the off-policy nature of Sam, and will then repurpose the CPU threads to focus exclusively on simulator queries.
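The gradient-averaging step can be sketched with mpi4py as follows; flattening and unflattening of the per-parameter gradients is omitted, and the function name is ours, not part of the released tooling.

import numpy as np
from mpi4py import MPI

def average_gradient(local_flat_grad):
    # Sum the flattened gradients of all workers, then divide by the worker count
    comm = MPI.COMM_WORLD
    buf = np.zeros_like(local_flat_grad)
    comm.Allreduce(local_flat_grad, buf, op=MPI.SUM)
    return buf / comm.Get_size()

# Per iteration and per network, each worker feeds the averaged flat gradient
# to its local Adam optimiser so that all workers perform the same update.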

The implementation of GAIL we used takes inspiration from the original implementation as well as from the GAIL implementation recently integrated into (Dhariwal2017-kt), itself stemming from the original implementation. Hyperparameters of the core algorithm are preserved in our implementation of the baseline (table provided in supplementary material). Additionally, both the Sam and GAIL implementations use exactly the same discriminator implementation, highlighting that Sam’s better sample efficiency is due to its architecture and orchestration of modules, rather than to a better implementation of the reward module, which both models share (e.g. by leveraging a GAN improvement).

Appendix D Hyperparameters settings

In our training procedure, we adopted an alternating scheme consisting in performing 3 training iterations of the actor-critic architecture for each training iteration of the synthetic reward, in line with common practice in the GAN literature (the actor-critic acts as generator, while the synthetic reward plays the role of discriminator). This training pattern applies to both the GAIL baseline and our algorithm Sam.

As supported by the closing discussion of the GAIL paper, performing a behavioral cloning (Pomerleau1989-nh; Pomerleau1990-lm) pre-training step to warm-start GAIL can potentially yield expert-like policies in a smaller number of ensuing GAIL training iterations. It is especially appealing insofar as the behavioral cloning agent does not interact with the environment at all while training. We therefore intended to precede the training in our experiments (for both GAIL and Sam) with a behavioral cloning pre-training phase. However, although this training pipeline enables a reduction of training iterations for GAIL, we did not witness a consistent benefit for Sam in our preliminary experiments. Our proposed explanation of this phenomenon is that by pre-training both the policy and the critic individually as regression problems over the expert demonstration dataset, we hinder the entanglement of the policy and critic training procedures exploited in Sam. We believe that by adopting a more elaborate pre-training procedure, we will be able to overcome this issue, and therefore leave further exploration for future work.
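The behavioral cloning warm-start discussed here amounts to a plain regression of actions on states over the expert dataset, e.g. as in the following sketch; the tensors, epoch count and learning rate are assumptions.

import torch
import torch.nn.functional as F

def bc_pretrain(policy, expert_states, expert_actions, epochs=10, lr=3e-4):
    # Regress mu_theta(s) onto the expert actions, without any environment interaction
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        pred = policy(expert_states)
        loss = F.mse_loss(pred, expert_actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy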

Hyperparameter Value
# workers 32
num demos
policy # layers 2
policy layer widths
policy hidden activations tanh
discriminator # layers 2
discriminator layer widths
discriminator hidden activations leaky ReLU
discount factor 0.995
generator training steps 3
discriminator training steps 1
non-saturating reward? false
entropy regularization coefficient 0
# timesteps (interactions) per iteration 1000
minibatch size 128
normalize observations? true
# timesteps upper bound 25M (total across workers)
BC pre-training false
Figure 5: Hyperparameters used to train GAIL agents.
Hyperparameter Value
# workers 32
num demos
policy # layers 2
policy layer widths
policy hidden activations leaky ReLU
policy layer normalisation (Ba2016-bs) true
policy output activation tanh
critic # layers 2
critic layer widths
critic hidden activations leaky ReLU
critic layer normalisation (Ba2016-bs) true
discriminator # layers 2
discriminator layer widths
discriminator hidden activations leaky ReLU
discount factor 0.995
policy gradient interpolation factor $\alpha$ 0
generator training steps 3
discriminator training steps 1
non-saturating reward? false
entropy regularization coefficient 0
# timesteps (interactions) per iteration 4
minibatch size 32
# training steps per iteration 10
replay buffer size 100K
normalise observations? true
normalise returns? true
Pop-Art (Van_Hasselt2016-bh)? true
reward scaling factor 1000
critic weight decay regularization coefficient 0.001
critic 1-step TD loss coefficient 1
critic $n$-step TD loss coefficient 1
TD lookahead length $n$ 100
adaptive parameter noise for 0.2
Ornstein-Uhlenbeck additive noise for 0.2
# timesteps upper bound 25M (total across workers)
BC pre-training false
Figure 6: Hyperparameters used to train Sam agents.

Appendix E Enhanced plots

Figure 7: Performance comparison between Sam and GAIL. The markers represent the number of expert trajectories the agent had access to. The figure shows that our method has a considerably better sample efficiency than GAIL in the explored set of environments. Note that the horizontal axis has a logarithmic scale. We remind the reader that ‘demonstration’ here means ‘expert trajectory’. The number of interactions has been averaged over the pool of parallel workers used to run the experiment (32).
Figure 8: Performance comparison between Sam and GAIL. The markers represent the number of expert trajectories the agent had access to. The figure shows that our method has a considerably better sample efficiency than GAIL in the explored set of environments. Note that the horizontal axis has a logarithmic scale. We remind the reader that ‘demonstration’ here means ‘expert trajectory’. The number of interactions has been averaged over the pool of parallel workers used to run the experiment (32).