Imitation Learning (IL) addresses the task of learning a policy from a set of expert demonstrations. IL is effective on control problems that are challenging for traditional reinforcement learning methods, either due to reward function design challenges or the inherent difficulty of the task itself [abbeel2004apprenticeship, ross2011reduction].
Most IL work can be divided into two branches: behavioral cloning and inverse reinforcement learning. Behavioral cloning casts IL as a supervised learning objective and seeks to imitate the expert's actions using the provided demonstrations as a fixed dataset [pomerleau1991efficient]. This usually requires large amounts of expert data and results in agents that struggle to generalize. As an agent deviates from the demonstrated behaviors, straying outside the state distribution on which it was trained, the risk of making additional errors increases, a problem known as compounding error [ross2011reduction].
Inverse reinforcement learning aims to reduce compounding error by learning a reward function under which the expert policy is optimal [abbeel2004apprenticeship]. Once learned, an agent can be trained (with any RL algorithm) to learn how to act at any given state of the environment. Early methods were prohibitively expensive on large environments because they required training the RL agent to convergence at each learning step of the reward function [ziebart2008maximum, abbeel2004apprenticeship]. Recent approaches instead apply an adversarial formulation (Adversarial Imitation Learning, AIL) in which a discriminator learns to distinguish between expert and agent behaviors to learn the reward optimized by the expert. AIL methods require only a few policy improvement steps for each discriminator update [ho2016generative, fu2017learning, finn2016connection].
While these advances have allowed imitation learning to tackle bigger and more complex environments [kuefler2017imitating, ding2019goal], they have also significantly complicated the implementation and learning dynamics of imitation learning algorithms. It is worth asking how much of this complexity is actually mandated. For example, in recent work, reddy2019sqil have shown that competitive performance can be obtained by handcrafting a very simple reward function to incentivize expert-like behaviors and optimizing it with off-policy RL. reddy2019sqil therefore remove the reward learning component of AIL and focus on the RL loop, yielding a regularized version of Behavioral Cloning. Motivated by these results, we also seek to simplify the AIL framework, but in the opposite direction: keeping the reward learning module and removing the policy improvement loop.
We propose a simpler yet competitive AIL framework. Motivated by finn2016connection, who use the optimal discriminator form, we propose a structured discriminator that estimates the probability of demonstrated and generated behavior using a single parameterized maximum entropy policy. Discriminator learning and policy learning therefore occur simultaneously, rendering generator updates seamless: once the discriminator has been trained for a few epochs, we simply use its policy model to generate new rollouts. We call this approach Adversarial Soft Advantage Fitting (ASAF).
We make the following contributions:
Algorithmic: we present a novel algorithm (ASAF) designed to imitate expert demonstrations without any reinforcement learning step.
Theoretical: we show that our method retrieves the expert policy when trained to optimality.
Empirical: we show that ASAF outperforms prevalent IL algorithms on a variety of discrete and continuous control tasks. We also show that, in practice, ASAF can be easily modified to account for different trajectory lengths (from full length to transition-wise).
Markov Decision Processes (MDPs)
We use hazan2018provably's notation and consider the classic $T$-horizon $\gamma$-discounted MDP $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{P}_0, r, \gamma, T \rangle$. For simplicity, we assume that $\mathcal{S}$ and $\mathcal{A}$ are finite. Successor states are given by the transition distribution $\mathcal{P}(s'|s,a)$, and the initial state $s_1$ is drawn from $\mathcal{P}_0(s)$. Transitions are rewarded with $r(s,a)$, with $r$ being bounded. The discount factor and the episode horizon are $\gamma \in [0,1]$ and $T \in \mathbb{N} \cup \{\infty\}$, where $T < \infty$ for $\gamma = 1$. Finally, we consider stationary stochastic policies $\pi \in \Pi$ that produce trajectories $\tau = (s_1, a_1, s_2, a_2, \ldots, a_{T-1}, s_T)$ when executed on $\mathcal{M}$.
The probability of trajectory $\tau$ under policy $\pi$ is $P_\pi(\tau) \triangleq \mathcal{P}_0(s_1) \prod_{t} \pi(a_t|s_t)\,\mathcal{P}(s_{t+1}|s_t,a_t)$, and the corresponding marginals are defined as $d_{t,\pi}(s) \triangleq \sum_{\tau : s_t = s} P_\pi(\tau)$ and $d_{t,\pi}(s,a) \triangleq \sum_{\tau : s_t = s,\, a_t = a} P_\pi(\tau)$, respectively. With these marginals, we define the normalized discounted state and state-action occupancy measures as $d_\pi(s) \triangleq \frac{1}{Z(\gamma,T)} \sum_{t=1}^{T} \gamma^{t-1} d_{t,\pi}(s)$ and $d_\pi(s,a) \triangleq \frac{1}{Z(\gamma,T)} \sum_{t=1}^{T} \gamma^{t-1} d_{t,\pi}(s,a)$, where the partition function $Z(\gamma,T) = \sum_{t=1}^{T} \gamma^{t-1}$ is equal to $\frac{1-\gamma^T}{1-\gamma}$ if $\gamma < 1$ and to $T$ if $\gamma = 1$ and $T < \infty$. Intuitively, the state (or state-action) occupancy measure can be interpreted as the discounted visitation distribution of the states (or state-action pairs) that the agent encounters when navigating with policy $\pi$. The expected sum of discounted rewards can be expressed in terms of the occupancy measures as follows: $\mathbb{E}_{\tau \sim P_\pi}\!\left[\sum_{t=1}^{T} \gamma^{t-1} r(s_t,a_t)\right] = Z(\gamma,T)\, \mathbb{E}_{(s,a)\sim d_\pi}[r(s,a)]$.
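To make the occupancy measure concrete, here is a minimal numpy sketch that estimates the normalized discounted state-action occupancy from sampled trajectories. The function name and the (state, action) index encoding are our own illustration, not part of the paper.

```python
import numpy as np

def occupancy_measure(trajectories, n_states, n_actions, gamma):
    # Empirical normalized discounted state-action occupancy measure.
    # `trajectories` is a list of lists of (state, action) index pairs;
    # this helper and its interface are illustrative only.
    d = np.zeros((n_states, n_actions))
    Z = 0.0
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            d[s, a] += gamma ** t   # gamma^{t-1} with 0-based t
            Z += gamma ** t         # accumulate the partition function
    return d / Z                    # entries sum to 1 overall

# One 2-step trajectory: (s=0, a=1) then (s=1, a=0), with gamma = 0.5
d = occupancy_measure([[(0, 1), (1, 0)]], n_states=2, n_actions=2, gamma=0.5)
```

The second visit is down-weighted by the discount factor, so the first state-action pair receives twice the mass of the second.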
In the entropy-regularized reinforcement learning (RL) framework [haarnoja2018soft], the optimal policy maximizes its entropy at each visited state in addition to the standard RL objective:
As shown in [ziebart2010modeling, haarnoja2017reinforcement], the corresponding optimal policy is
$$\pi^*(a|s) = \exp\!\left(\tfrac{1}{\alpha} A^*_{\mathrm{soft}}(s,a)\right), \quad A^*_{\mathrm{soft}}(s,a) = Q^*_{\mathrm{soft}}(s,a) - V^*_{\mathrm{soft}}(s).$$
Maximum Causal Entropy Inverse Reinforcement Learning
In the problem of inverse reinforcement learning (IRL), it is assumed that the MDP's reward function is unknown but that demonstrations obtained using the expert's policy are provided. Maximum causal entropy IRL [ziebart2008maximum] proposes to fit a reward function from a set of reward functions and retrieve the corresponding optimal policy by solving the optimization problem
In brief, the problem reduces to finding a reward function for which the expert policy is optimal. In order to do so, the optimization procedure searches high entropy policies that are optimal with respect to and minimizes the difference between their returns and the return of the expert policy, eventually reaching a policy that approaches . Most of the proposed solutions [abbeel2004apprenticeship, ziebart2010modeling, ho2016generative] transpose IRL to the problem of distribution matching; abbeel2004apprenticeship and ziebart2008maximum used linear function approximation and proposed to match the feature expectation; ho2016generative proposed to cast Eq. (4) with a convex reward function regularizer into the problem of minimizing the Jensen-Shannon divergence between the state-action occupancy measures:
Connections between Generative Adversarial Networks (GANs) and IRL
For the data distribution $p$ and the generator distribution $q$ defined on the domain $\mathcal{X}$, the GAN objective [goodfellow2014generative] is
$$\min_{q} \max_{D} \; \mathbb{E}_{x \sim p}[\log D(x)] + \mathbb{E}_{x \sim q}[\log(1 - D(x))].$$
In goodfellow2014generative, the maximizer of the inner problem in Eq. (6) is shown to be the optimal discriminator
$$D^*(x) = \frac{p(x)}{p(x) + q(x)},$$
and the optimizer for Eq. (6) is the generator that matches the data distribution, $q^* = p$. Later, finn2016connection and ho2016generative concurrently proposed connections between GANs and IRL. The Generative Adversarial Imitation Learning (GAIL) formulation of ho2016generative is based on matching state-action occupancy measures, while finn2016connection considered matching trajectory distributions. Our work is inspired by the discriminator proposed and used by finn2016connection,
where the learned density is proportional to $\exp(R_\theta(\tau))$, with reward approximator $R_\theta$ motivated by maximum causal entropy IRL. Note that Eq. (8) matches the form of the optimal discriminator in Eq. (7). Although finn2016connection did not empirically demonstrate the effectiveness of their method, the adversarial IRL approach of fu2017learning (AIRL) successfully used a similar discriminator for state-action occupancy measure matching.
3 Imitation Learning without Policy Optimization
In this section, we derive Adversarial Soft Advantage Fitting (ASAF), our novel adversarial imitation learning approach. Specifically, in Section 3.1, we present the theoretical foundations for ASAF to perform imitation learning on full-length trajectories. Intuitively, our method is based on the use of structured discriminators that match the optimal discriminator form to fit the trajectory distribution induced by the expert policy. This approach requires being able to evaluate and sample from the learned policy, and allows us to learn that policy and train the discriminator simultaneously, thus drastically simplifying the training procedure. We present in Section 3.2 parametrization options that satisfy these requirements. Finally, in Section 3.3, we explain how to implement a practical algorithm that can be used for arbitrary trajectory lengths, including the transition-wise case.
3.1 Adversarial Soft Advantage Fitting – Theoretical setting
Before introducing our method, we derive GAN training with a structured discriminator.
GAN with structured discriminator
Suppose that we have a generator distribution $q$ and some arbitrary distribution $\tilde p$ that can both be evaluated efficiently, e.g., a categorical distribution or a probability density with normalizing flows [rezende2015variational]. We call a structured discriminator a function of the form $D_{\tilde p, q}(x) = \frac{\tilde p(x)}{\tilde p(x) + q(x)}$, which matches the optimal discriminator form for Eq. (7). Considering our new GAN objective, we get:
While the unstructured discriminator of Eq. (6) learns a mapping from $x$ to a Bernoulli distribution, we now learn a mapping from $x$ to an arbitrary distribution $\tilde p$ from which we can analytically compute $D_{\tilde p, q}(x)$. One can therefore say that the discriminator is parameterized by $\tilde p$. For the optimization problem of Eq. (9), we have the following optima:
The optimal discriminator parameter for any generator in Eq. (9) is equal to the expert’s distribution, , and the optimal discriminator parameter is also the optimal generator, i.e.,
See Appendix A.1
Intuitively, Lemma 1 shows that the optimal discriminator parameter is also the target data distribution of our optimization problem (i.e., the optimal generator). In other words, solving the inner optimization yields the solution of the outer optimization. In practice, we update $\tilde p$ to minimize the discriminator objective and use it directly as the generator to sample new data.
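The structured discriminator idea can be sketched in a few lines. The toy densities over a two-element domain below are hypothetical and only illustrate that $D$ is computed from the explicit densities $\tilde p$ and $q$ rather than learned as a free-form classifier:

```python
def structured_discriminator(p_tilde, q):
    # D is not a free-form classifier: it is assembled from an explicit
    # learnable density p_tilde and the fixed generator density q, so it
    # matches the optimal-discriminator form by construction.
    def D(x):
        return p_tilde(x) / (p_tilde(x) + q(x))
    return D

# Toy densities over the domain {0, 1} (illustrative only):
p_tilde = lambda x: (0.8, 0.2)[x]
q = lambda x: (0.5, 0.5)[x]
D = structured_discriminator(p_tilde, q)
# When p_tilde == q, D outputs 1/2 everywhere, the GAN equilibrium.
```

Minimizing the discriminator loss over `p_tilde` therefore drives `p_tilde` toward the data distribution, which is exactly the statement of Lemma 1.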
Matching trajectory distributions with structured discriminator
Motivated by the GAN with structured discriminator, we consider the trajectory distribution matching problem in IL. Here, we optimise Eq. (9) over trajectories, taking the expert's trajectory distribution as the data distribution and the generator policy's trajectory distribution as $q$, which yields the following objective:
with the structured discriminator:
Here we used the fact that the trajectory probability $P_\pi(\tau)$ decomposes into two distinct products: $\prod_{t} \pi(a_t|s_t)$, which depends on the stationary policy, and $\mathcal{P}_0(s_1)\prod_{t} \mathcal{P}(s_{t+1}|s_t,a_t)$, which accounts for the environment dynamics. Crucially, the dynamics term cancels out in the numerator and denominator, leaving the policy as the sole parameter of this structured discriminator. In this way, the discriminator can evaluate the probability of a trajectory being generated by the expert policy simply by evaluating products of stationary policy distributions. With this form, we can get the following result:
The optimal discriminator parameter for any generator policy in Eq. (10) is such that , and using generator policy minimizes , i.e.,
See Appendix A.2
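As a sketch of the cancellation at work above, the trajectory discriminator can be computed from per-step policy log-probabilities alone, since the shared dynamics factor drops out of the ratio. The helper below is illustrative; the two policies are stand-in callables, not the paper's models:

```python
import math

def trajectory_discriminator(logp_theta, logp_gen, traj):
    # Both trajectory probabilities share the same dynamics factor
    # P0(s1) * prod P(s'|s,a); it cancels in the ratio, so D only needs
    # per-step policy log-probs. logp_* map (state, action) -> log pi(a|s).
    lt = sum(logp_theta(s, a) for s, a in traj)
    lg = sum(logp_gen(s, a) for s, a in traj)
    # D = e^lt / (e^lt + e^lg), computed stably in log space
    return 1.0 / (1.0 + math.exp(lg - lt))

# With identical policies the discriminator cannot tell trajectories apart:
uniform = lambda s, a: math.log(0.25)
D_equal = trajectory_discriminator(uniform, uniform, [(0, 1), (1, 0)])
```

When the learned policy assigns higher probability to the demonstrated actions than the generator does, the discriminator's output moves above 1/2, exactly as the structured form prescribes.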
3.2 A Specific Policy Class
The derivations of Section 3.1 rely on the use of a learnable policy that can both be evaluated and sampled from in order to fit the expert policy. A number of parameterization options that satisfy these conditions are available.
First of all, we observe that since the entropy of the expert policy is independent of the learned reward and policy, we can add it to the MaxEnt IRL objective of Eq. (4) without modifying the solution to the optimization problem:
The max over policies implies that when optimising the reward, the policy has already been made optimal with respect to the causal-entropy-augmented reward function, and therefore it must be of the form presented in Eq. (2). Moreover, since the policy is optimal w.r.t. that reward, the difference in performance is always non-negative, and its minimum of 0 is reached only when the expert policy is also optimal w.r.t. the learned reward, in which case the expert policy must also be of the form of Eq. (2).
With discrete action spaces we propose to parameterize the MaxEnt policy defined in Eq. (2) with a categorical distribution, i.e., a softmax over the outputs of a model parameterized by $\theta$ that approximates the soft advantage.
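A minimal sketch of this categorical parameterization, with `f_values` standing in for the model's per-action outputs at one state (a hypothetical stand-in for the network):

```python
import numpy as np

def categorical_maxent_policy(f_values):
    # pi(a|s) proportional to exp(f_theta(s, a)): a softmax over the
    # model's per-action outputs at a single state.
    z = f_values - f_values.max()   # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

probs = categorical_maxent_policy(np.array([2.0, 2.0, 2.0]))
```

Because the softmax is invariant to adding a constant to all outputs, only advantage-like differences between actions matter, which is consistent with fitting soft advantages rather than absolute values.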
With continuous action spaces, the soft value function involves an intractable integral over the action domain. Therefore, we approximate the MaxEnt distribution with a Normal distribution with diagonal covariance matrix, as is commonly done in the literature [haarnoja2018soft, nachum2018trustpcl]. By parameterizing the mean and variance we obtain a learnable density function that can be easily evaluated and sampled from.
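A sketch of such a diagonal-Gaussian policy head, written as plain numpy for clarity (the closure-based interface is our own choice, not the paper's):

```python
import numpy as np

def gaussian_policy(mean, log_std):
    # Diagonal-Gaussian policy: both sampling and exact density
    # evaluation are cheap, which is all the method needs from a policy.
    std = np.exp(log_std)
    def sample(rng):
        return mean + std * rng.standard_normal(mean.shape)
    def log_prob(a):
        # Sum of independent 1-D Gaussian log-densities
        return float(-0.5 * np.sum(((a - mean) / std) ** 2
                                   + 2.0 * log_std + np.log(2.0 * np.pi)))
    return sample, log_prob

sample, log_prob = gaussian_policy(np.zeros(2), np.zeros(2))
a = sample(np.random.default_rng(0))
```

The log-density is maximal at the mean, and both operations are differentiable w.r.t. the mean and log-std, so the same head can be trained inside the discriminator objective.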
3.3 Adversarial Soft Advantage Fitting (ASAF) – practical algorithm
Section 3.1 shows that, assuming the learned policy can be evaluated and sampled from, we can use the structured discriminator of Eq. (11) to learn a policy that matches the expert's trajectory distribution. Section 3.2 proposes parameterizations for discrete and continuous action spaces that satisfy those assumptions.
In practice, as with GANs [goodfellow2014generative], we do not train the discriminator to convergence, as gradient-based optimisation cannot be expected to find the global optimum of non-convex problems. Instead, Adversarial Soft Advantage Fitting (ASAF) alternates between two simple steps: (1) training the discriminator by minimizing the binary cross-entropy loss on expert versus generated trajectories, and (2) using the learned policy as the generator to collect new rollouts.
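A sketch of step (1)'s loss, assuming the per-trajectory policy log-probabilities have already been summed; the argument names are our own (lt_* under the learned policy, lg_* under the current generator, for expert E and generated G batches):

```python
import numpy as np

def asaf_discriminator_loss(lt_E, lg_E, lt_G, lg_G):
    # Binary cross-entropy on the structured discriminator (sketch).
    # Each argument is an array of summed policy log-probs, one entry
    # per trajectory in the batch.
    def log_D(lt, lg):               # log D = -log(1 + exp(lg - lt))
        return -np.logaddexp(0.0, lg - lt)
    loss_expert = -log_D(lt_E, lg_E)        # push D(tau_E) -> 1
    loss_generated = -log_D(lg_G, lt_G)     # log(1 - D): roles swap
    return float(np.mean(loss_expert) + np.mean(loss_generated))

z = np.zeros(4)
loss_chance = asaf_discriminator_loss(z, z, z, z)   # D = 1/2 everywhere
```

At the indifference point D = 1/2 the loss equals 2 log 2, and it decreases as the learned policy assigns more probability to expert trajectories and less to generated ones.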
We derived ASAF considering full trajectories, yet it might be preferable in practice to split full trajectories into smaller chunks. This is particularly true in environments where trajectory length varies a lot or tends to infinity. To investigate whether the practical benefits of using partial trajectories hurt ASAF's performance, we also consider a variation, ASAF-w, where we treat trajectory-windows of size w as if they were full trajectories. Note that considering windows as full trajectories amounts to approximating that the initial states of these sub-trajectories are equally likely under the expert's and the generator's policies (this is easily seen when deriving Eq. (11)).
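The windowing itself is straightforward; a minimal sketch (the trailing window is simply kept shorter, one of several reasonable conventions):

```python
def split_into_windows(trajectory, w):
    # ASAF-w treats each length-w chunk of a trajectory as if it were
    # a full trajectory; the trailing chunk may be shorter than w.
    return [trajectory[i:i + w] for i in range(0, len(trajectory), w)]

windows = split_into_windows(list(range(7)), w=3)
```

With w = 1 this degenerates into the transition-wise ASAF-1 variant discussed next.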
In the limit, ASAF-1 (window-size of 1) becomes a transition-wise algorithm, which can be desirable if one wants to collect rollouts asynchronously or only has access to non-sequential expert data. While ASAF-1 may work well in practice, it essentially assumes that the expert's and the generator's policies have the same state occupancy measure, which is incorrect until the true expert policy is actually recovered.
Finally, to offer a complete family of algorithms based on the structured discriminator approach, we show in Appendix B that this assumption is not mandatory and derive a transition-wise algorithm based on Soft Q-function Fitting (rather than soft advantages) that also gets rid of the RL loop. We call this algorithm ASQF. While theoretically sound, we found that in practice, ASQF is outperformed by ASAF-1 in more complex environments (see Section 5.1).
4 Related works
ziebart2008maximum first proposed MaxEnt IRL, the foundation of modern IL. ziebart2010modeling further elaborated MaxEnt IRL and derived the optimal form of the MaxEnt policy at the core of our methods. finn2016connection proposed a GAN formulation of IRL that leveraged the energy-based models of ziebart2010modeling. finn2016guided's implementation of this method, however, relied on processing full trajectories with a Linear Quadratic Regulator and on optimizing with guided policy search to manage the high variance of trajectory costs. To retrieve robust rewards, fu2017learning proposed a straightforward transposition of [finn2016connection] to state-action transitions. In doing so, however, they had to do away with the GAN objective during policy optimization, consequently minimizing the Kullback–Leibler divergence from the expert occupancy measure to the policy occupancy measure (instead of the Jensen–Shannon divergence) [ghasemipour2019divergence].
Later works [sasaki2018sample, Kostrikov2020Imitation] move away from the Generative Adversarial formulation. To do so, sasaki2018sample directly express the expectation of the Jensen–Shannon divergence between the occupancy measures in terms of the agent's Q-function, which can then be used to optimize the agent's policy with off-policy Actor-Critic [degris2012off]. Similarly, Kostrikov2020Imitation use Dual Stationary Distribution Correction Estimation [nachum2019dualdice] to approximate the Q-function on the expert's demonstrations before optimizing the agent's policy under the initial state distribution using the reparametrization trick [haarnoja2018soft]. While [sasaki2018sample, Kostrikov2020Imitation] are related to our methods in their interest in learning the value function directly, they differ in their goal and thus in the resulting algorithmic complexity. Indeed, they aim at improving sample efficiency in terms of environment interactions and therefore move away from the algorithmically simple Generative Adversarial formulation towards more complicated divergence minimization methods. In doing so, they further complicate Imitation Learning methods while still requiring an explicitly learned policy. Yet, simply using the Generative Adversarial formulation with an Experience Replay Buffer can significantly improve sample efficiency [kostrikov2018discriminatoractorcritic]. Additionally, by focusing on distribution matching alone, they abandon the MaxEnt IRL framework and fall back into the policy ambiguity problem, as several policies can generate the same set of demonstrations [ziebart2008maximum]. For these reasons, and since our aim is to propose efficient yet simple methods, we focus on the Generative Adversarial formulation and the MaxEnt IRL framework.
While reddy2019sqil share our interest in simpler IL methods, they pursue an approach opposite to ours. They propose to eliminate the reward learning steps of IRL by simply hard-coding a reward of 1 for the expert's transitions and 0 for the agent's transitions. They then use Soft Q-learning [haarnoja2017reinforcement] to learn a value function by sampling transitions in equal proportion from the expert's and agent's buffers. Unfortunately, once the learner accurately mimics the expert, it collects expert-like transitions that are labeled with a reward of 0, since they are generated rather than coming from the demonstrations. This effectively causes the reward of expert-like behavior to decay as the agent improves, and can severely destabilize learning to the point where early stopping becomes required [reddy2019sqil].
5 Results and discussion
We evaluate our methods on a variety of discrete and continuous control tasks. Our results show that, in addition to drastically simplifying the adversarial IRL framework, our methods perform on par with or better than previous approaches on all but one environment. When trajectory length is very long or varies drastically across episodes (see MuJoCo experiments, Section 5.3), we find that using sub-trajectories with a fixed window-size (ASAF-w) significantly outperforms its full-trajectory counterpart ASAF.
5.1 Experimental setup
We compare our algorithms ASAF, ASAF-w and ASAF-1 against GAIL [ho2016generative], the predominant Adversarial Imitation Learning algorithm in the literature, and AIRL [fu2017learning], one of its variations that also leverages access to the generator's policy distribution. Additionally, we compare against SQIL [reddy2019sqil], a recent Reinforcement Learning-only approach to Imitation Learning that proved successful on high-dimensional tasks. Our implementations of GAIL and AIRL use PPO [schulman2017proximal] instead of TRPO [schulman2015trust], as it has been shown to improve performance [kostrikov2018discriminatoractorcritic].
For all tasks except MuJoCo, we selected the best performing hyperparameters through a random search of equal budget for each algorithm-environment pair (see Appendix C), and the best configuration is retrained on ten random seeds. For the MuJoCo experiments, hyperparameters were optimised by hand based on previous publications (see Appendix C). Notably, GAIL required extensive tuning of both its RL and IRL components to achieve satisfactory performance. Our method ASAF-w, on the other hand, proved much more stable and robust to hyperparameter choices, which is likely due to its simplicity. SQIL used the same SAC implementation [haarnoja2018soft] that was used to generate the expert demonstrations.
Finally, for each task, all algorithms use the same neural network architectures for their policy and/or discriminator (see full description in Appendix C). Expert demonstrations are either generated by hand (mountaincar), using open-source bots (Pommerman), or from our implementations of SAC and PPO (all remaining tasks). More details are given in Appendix D.
5.2 Experiments on classic control and Box2D tasks (discrete and continuous)
Figure 1 shows that ASAF and its approximate variations ASAF-1 and ASAF-w quickly converge to the expert's performance (here w was tuned to values between 32 and 200; see Appendix C for selected window-sizes). This indicates that the practical benefits of using shorter trajectories or even just transitions do not hinder performance on these simple tasks. Note that for Box2D and classic control environments, we retrain the best configuration of each algorithm for twice as long as was done in the hyperparameter search, which allows us to uncover unstable learning behaviors. Figure 1 shows that our methods display much more stable learning: their performance rises until it matches the expert's and does not decrease once it is reached. This is a highly desirable property for an imitation learning algorithm since, in practice, one does not have access to a reward function and thus cannot monitor the performance of the learning algorithm to trigger early stopping. The baselines, on the other hand, experience occasional performance drops. For GAIL and AIRL, this is likely due to the concurrent RL and IRL loops, whereas for SQIL, it has been noted that an effective reward decay can occur when accurately mimicking the expert [reddy2019sqil]. This instability is particularly severe in the continuous control case. In practice, all three baselines use early stopping to avoid performance decay [reddy2019sqil].
5.3 Experiments on MuJoCo (continuous control)
To scale up our evaluations in continuous control we use the popular MuJoCo simulator. In this domain, the trajectory length is either fixed at a large value (1000 steps on HalfCheetah) or varies a lot across episodes (Hopper and Walker2d). Figure 2 shows that these trajectory characteristics hinder ASAF's learning, as ASAF requires collecting multiple episodes for every update, while ASAF-w performs well and is more sample-efficient than ASAF in these scenarios. We also evaluate GAIL both with and without gradient penalty (GP) on discriminator updates [gulrajani2017improved, kostrikov2018discriminatoractorcritic]. While GAIL was originally proposed without GP [ho2016generative], we empirically found that GP prevents the discriminator from overfitting and enables RL to exploit dense rewards, which greatly improves its sample efficiency. Importantly, GAIL required complicated tuning, due to the interaction between RL and IRL, to obtain reasonable performance, while ASAF-w achieved better performance on both Hopper and HalfCheetah with significantly less hand-tuning. However, ASAF-w performs slightly worse than GAIL w/ GP on Walker2d. We hypothesize that this may be due to the high variance of the trajectory lengths in Walker2d, rather than to task complexity, since Walker2d and HalfCheetah are similarly complex in terms of observation and action spaces.
5.4 Experiments on Pommerman (discrete control)
Finally, to scale up our evaluations in discrete control environments, we consider the domain of Pommerman [resnick2018pommerman], a challenging and very dynamic discrete control environment with rich, high-dimensional observation spaces (see Appendix D). We evaluate all of our methods and baselines on a 1 vs 1 task where a learning agent plays against a random agent, the opponent. The goal for the learning agent is to navigate to the opponent and eliminate it, using expert demonstrations provided by the champion algorithm of the FFA 2018 competition [zhou2018hybrid]. We removed the opponent's ability to lay bombs so that it does not accidentally eliminate itself. Since it can still move around, it is nevertheless surprisingly tricky to eliminate: the expert has to navigate across the whole map, lay a bomb next to the opponent and retreat to avoid eliminating itself. This entire routine must then be repeated several times until finally succeeding. We refer to this task as Pommerman Random-Tag. Note that since we measure success of the imitation task with the win-tie-lose outcome (a sparse performance metric), a learning agent has to truly reproduce the expert behavior until the very end of trajectories to achieve higher scores. Figure 3 shows that all three variations of ASAF outperform all baselines and approach expert performance.
We propose an important simplification to the adversarial inverse reinforcement learning framework by removing the reinforcement learning optimisation loop altogether. We show that, by using a particular form for the discriminator, our method recovers a policy that matches the expert’s trajectory distribution. We evaluate our approach against prior works on many different benchmarking tasks and show that our method (ASAF) compares favorably to the predominant imitation learning algorithms. The approximate version, ASAF-w, that uses sub-trajectories yields a flexible algorithm that works well both on short and long time horizons. Finally, our approach still involves a reward learning module through its discriminator, and it would be interesting in future work to explore how ASAF can be used to learn robust rewards along the lines of fu2017learning.
Our contributions are mainly theoretical and aim at simplifying current Imitation Learning methods. We do not propose new applications, nor do we use sensitive data or simulators. Yet our method can ease and promote the use, design and development of Imitation Learning algorithms, and may eventually lead to applications outside of simple and controlled simulators. We do not pretend to discuss the ethical implications of the general use of autonomous agents; rather, we try to investigate some of the differences between using Imitation Learning and reward-oriented methods in the design of such agents.
Using only a scalar reward function to specify the desired behavior of an autonomous agent is a challenging task, as one must weigh different desiderata and account for unsuspected behaviors and situations. Indeed, it is well known in practice that Reinforcement Learning agents tend to find bizarre ways of exploiting the reward signal without solving the desired task. The fact that it is difficult to specify and control the behavior of an RL agent is a major flaw that prevents current methods from being applied to risk-sensitive situations. On the other hand, Imitation Learning offers a more natural way of specifying nuanced preferences by demonstrating desirable ways of solving a task. Yet IL also has its drawbacks. First of all, one needs to be able to demonstrate the desired behavior, and current methods tend to be only as good as the demonstrator. Second, it is a challenging problem to ensure that the agent will be able to adapt to new situations that do not resemble the demonstrations. For these reasons, it is clear to us that additional safeguards are required in order to apply Imitation Learning (and Reinforcement Learning) methods to any application that could effectively have a real-world impact.
We wish to thank Eloi Alonso, Olivier Delalleau, Félix G. Harvey and Maxim Peter as well as the researchers of Ubisoft Montreal - La Forge whose feedback and comments greatly improved the revised versions of this work. We would also like to thank Fonds de Recherche Nature et Technologies (FRQNT), Ubisoft Montreal and Mitacs Accelerate Program for providing funding for this work as well as Compute Canada for providing the computing resources.
Appendix A Proofs
a.1 Proof of Lemma 1
Starting with (a), we have:
Assuming infinite discriminator capacity, the discriminator's value can be optimised independently for each input, and we can construct our optimal discriminator as a look-up table, with the optimal discriminator for each input defined as:
with , and .
Recall that and that . Therefore the function is defined for . Since it is strictly monotonic over that domain we have that:
Taking the derivative and setting to zero, we get:
The second derivative test confirms that we have a maximum. The values of the objective at the boundaries of its domain of definition tend to $-\infty$; therefore the critical point is the global maximum. Finally, the optimal global discriminator is given by:
This concludes the proof for (a).
The proof for (b) can be found in the work of goodfellow2014generative. We reproduce it here for completeness. Since from (a) we know the form of the optimal discriminator, we can write the GAN objective for the optimal discriminator as:
where $D_{\mathrm{KL}}$ and $D_{\mathrm{JS}}$ are respectively the Kullback–Leibler and the Jensen–Shannon divergences. Since the Jensen–Shannon divergence between two distributions is always non-negative, and zero if and only if the two distributions are equal, the global minimum is attained if and only if the generator matches the data distribution.
This concludes the proof for (b). ∎
a.2 Proof of Theorem 1
As in Lemma 1, we can optimise for each trajectory individually. When doing so, the dynamics term can be omitted as it is constant w.r.t. the policy. The rest of the proof is identical to that of Lemma 1, with the trajectory distributions playing the roles of $\tilde p$ and $q$. It follows that the maximum is reached for the expert's trajectory distribution. From this we obtain that the policy that makes the discriminator optimal w.r.t. any generator is the expert policy.
The proof for (b) stems from the observation that choosing (the policy recovered by the optimal discriminator ) minimizes :
By multiplying the numerator and denominator of by it can be shown in exactly the same way as in Appendix A.1 that is the global minimum of . ∎
Appendix B Adversarial Soft Q-Fitting: transition-wise Imitation Learning without Policy Optimization
In this section we present Adversarial Soft Q-Fitting (ASQF), a principled approach to Imitation Learning without Reinforcement Learning that relies exclusively on transitions. Using transitions rather than trajectories has several practical benefits, such as the ability to deal with asynchronously collected data or non-sequential expert demonstrations. We first present the theoretical setting for ASQF and then test it on a variety of discrete control tasks. We show that, while theoretically sound, ASQF is often outperformed by ASAF-1, an approximation to ASAF that also relies on transitions instead of trajectories.
We consider the GAN objective of Eq. (6) over state-action transitions, with a discriminator of the form used by fu2017learning:
for which we present the following theorem.
For any generator policy , the optimal discriminator parameter for Eq. (31) is
Using , the optimal generator policy is
The beginning of the proof closely follows the proof of Appendix A.1.
We solve for each individual state-action pair and note that the objective is strictly monotonic over its domain, so
Taking the derivative and setting it to 0, we find that
We confirm that we have a global maximum with the second derivative test and the values at the border of the domain i.e.