Imitation Learning as f-Divergence Minimization

05/30/2019 · by Liyiming Ke, et al. · Carnegie Mellon University · University of Washington

We address the problem of imitation learning with multi-modal demonstrations. Instead of attempting to learn all modes, we argue that in many tasks it is sufficient to imitate any one of them. We show that state-of-the-art methods such as GAIL and behavior cloning, due to their choice of loss function, often incorrectly interpolate between such modes. Our key insight is to minimize the right divergence between the learner and the expert state-action distributions, namely the reverse KL divergence or I-projection. We propose a general imitation learning framework for estimating and minimizing any f-divergence. By plugging in different divergences, we are able to recover existing algorithms such as Behavior Cloning (Kullback-Leibler), GAIL (Jensen-Shannon) and DAgger (Total Variation). Empirical results show that our approximate I-projection technique is able to imitate multi-modal behaviors more reliably than GAIL and behavior cloning.




1 Introduction

We study the problem of imitation learning from demonstrations that have multiple modes. This is often the case for tasks with multiple, diverse near-optimal solutions. Here the expert has no clear preference between different choices (e.g. navigating left or right around obstacles (Ross et al., 2013)). Imperfect human-robot interfaces also lead to variability in inputs (e.g. kinesthetic demonstrations with robot arms (Finn et al., 2016b)). Experts may also vary in skill, preferences and other latent factors. We argue that in many such settings, it suffices to learn a single mode of the expert demonstrations to solve the task. How do state-of-the-art imitation learning approaches fare when presented with multi-modal inputs?

Consider the example of imitating a racecar driver navigating around an obstacle. The expert sometimes steers left, other times steers right. What happens if we apply behavior cloning (Pomerleau, 1988) on this data? The learner policy (a Gaussian with fixed variance) interpolates between the modes and drives into the obstacle.

Interestingly, this oddity is not restricted to behavior cloning. Li et al. (2017) show that a more sophisticated approach, GAIL (Ho and Ermon, 2016), also exhibits a similar trend. Their proposed solution, InfoGAIL (Li et al., 2017), tries to recover all the latent modes and learn a policy for each one. For demonstrations with several modes, recovering all such policies will be prohibitively slow to converge.

Figure 1: Behavior cloning fails with multi-modal demonstrations (a) Expert demonstrates going left or right around obstacle. (b) Learner interpolates between modes and crashes into obstacle.

Our key insight is to view imitation learning algorithms as minimizing a divergence between the expert and the learner trajectory distributions. Specifically, we examine the family of f-divergences. Since they cannot be minimized exactly, we adopt estimators from Nowozin et al. (2016). We show that behavior cloning minimizes the Kullback-Leibler (KL) divergence (M-projection), GAIL minimizes the Jensen-Shannon (JS) divergence and DAgger minimizes the Total Variation (TV) distance. Since both JS and KL divergence exhibit a mode-covering behavior, they end up interpolating across modes. On the other hand, the reverse-KL divergence (I-projection) has a mode-seeking behavior and elegantly collapses on a subset of modes fairly quickly.

The contributions and organization of the remainder of the paper are as follows. (We refer the reader to the supplementary material for appendices containing detailed exposition.)

  1. We introduce a unifying framework for imitation learning as minimization of the f-divergence between learner and expert trajectory distributions (Section 3).

  2. We propose algorithms for minimizing estimates of any f-divergence. Our framework is able to recover several existing imitation learning algorithms for different divergences. We closely examine the reverse KL divergence and propose efficient algorithms for it (Section 4).

  3. We argue for using reverse KL to deal with multi-modal inputs (Section 5). We empirically demonstrate that reverse KL collapses to one of the demonstrator modes on both bandit and RL environments, whereas KL and JS unsafely interpolate between the modes (Section 6).

2 Related Work

Imitation learning (IL) has a long-standing history in robotics as a tool to program desired skills and behavior in autonomous machines (Osa et al., 2018; Argall et al., 2009; Billard et al., 2016; Bagnell, 2015). Even though IL has of late been used to bootstrap reinforcement learning (RL) (Ross and Bagnell, 2014; Sun et al., 2017, 2018; Cheng et al., 2018; Rajeswaran et al., 2017), we focus on the original problem where an extrinsic reward is not defined. We ask the question – "what objective captures the notion of similarity to expert demonstrations?". Note that this question is orthogonal to other factors such as whether we are model-based / model-free or whether we use a policy / trajectory representation.

IL can be viewed as supervised learning (SL) where the learner selects the same action as the expert (referred to as behavior cloning (Pomerleau, 1989)). However, small errors lead to large distribution mismatch. This can be somewhat alleviated by interactive learning, such as DAgger (Ross et al., 2011). Although shown to be successful in various applications (Ross et al., 2013; Kim et al., 2013; Gupta et al., 2017), there are domains where it's impractical to have on-policy expert labels (Laskey et al., 2017b, 2016). More alarmingly, there are counter-examples where the DAgger objective results in undesirable behaviors (Laskey et al., 2017a). We discuss this further in Appendix C.

Another way is to view IL as recovering a reward (IRL) (Ratliff et al., 2009, 2006) or Q-value (Piot et al., 2017) that makes the expert seem optimal. Since this is overly strict, it can be relaxed to value matching which, for linear rewards, further reduces to matching feature expectations (Abbeel and Ng, 2004). Moment matching naturally leads to maximum entropy formulations (Ziebart et al., 2008), which have been used successfully in various applications (Finn et al., 2016b; Wulfmeier et al., 2015). Interestingly, our divergence estimators also match moments, suggesting a deeper connection.

The degeneracy issues of IRL can be alleviated by a game theoretic framework where an adversary selects a reward function and the learner must compete to do as well as the expert (Syed and Schapire, 2008; Ho et al., 2016). Hence IRL can be connected to min-max formulations (Finn et al., 2016a) like GANs (Goodfellow et al., 2014). GAIL (Ho and Ermon, 2016) and SAM (Blondé and Kalousis, 2018) use this to directly recover policies. AIRL (Fu et al., 2017) and EAIRL (Qureshi and Yip, 2018) use this to recover rewards. This connection to GANs leads to interesting avenues such as stabilizing min-max games (Peng et al., 2018b), learning from pure observations (Torabi et al., 2018b, a; Peng et al., 2018a) and links to f-divergence minimization (Nowozin et al., 2016; Nguyen et al., 2010).

In this paper, we view IL as f-divergence minimization between learner and expert. Our framework encompasses methods that look at specific measures of divergence such as minimizing relative entropy (Boularias et al., 2011) or symmetric cross-entropy (Rhinehart et al., 2018). Note that Ghasemipour et al. (2018) also independently arrive at such connections between f-divergence and IL (although the algorithms we propose for RKL and our analysis of DAgger differ significantly). We particularly focus on multi-modal expert demonstrations, which have generally been treated by clustering data and learning on each cluster (Babes et al., 2011; Dimitrakakis and Rothkopf, 2011). InfoGAN (Chen et al., 2016) formalizes this within the GAN framework to recover latent clusters, which is then extended to IL (Hausman et al., 2017; Li et al., 2017). Instead, we look at the role of the divergence with such inputs.

3 Problem Formulation


We work with a finite horizon Markov Decision Process (MDP) ⟨S, A, P, ρ₀, T⟩, where S is a set of states, A is a set of actions, and P(s′|s, a) is the transition dynamics. ρ₀ is the initial distribution over states and T is the time horizon. In the IL paradigm, the MDP does not include a reward function.

We examine stochastic policies π(a|s). Let a trajectory be a sequence of state-action pairs τ = (s₁, a₁, …, s_T, a_T). A policy π induces a distribution over trajectories ρ_π(τ) and a state distribution at time t, ρ_t^π(s).

The average state distribution across time can be computed as ρ^π(s) = (1/T) Σ_{t=1}^T ρ_t^π(s). (Alternatively, it is the tally count of states visited by induced trajectories; refer to Theorem B.1 in Appendix B.)

The f-divergence family

Divergences, such as the well-known Kullback-Leibler (KL) divergence, measure differences between probability distributions. We consider a broad class of such divergences called f-divergences (Csiszár et al., 2004; Liese and Vajda, 2006). Given probability distributions p and q over a finite set X, such that p is absolutely continuous w.r.t. q, we define the f-divergence

    D_f(p, q) = Σ_{x ∈ X} q(x) f( p(x) / q(x) )     (2)

where f : ℝ₊ → ℝ is a convex, lower semi-continuous function. Different choices of f recover different divergences, e.g. KL, Jensen-Shannon or Total Variation (see Nowozin et al. (2016) for a full list).
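As a concrete sketch (our own illustration, not code from the paper), the definition above can be evaluated directly for discrete distributions; the generator functions f below follow the conventions of Nowozin et al. (2016):

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(p, q) = sum_x q(x) f(p(x) / q(x)) for discrete distributions
    with full support (q(x) > 0 everywhere)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

# Generator functions f for common members of the family.
f_kl  = lambda u: u * np.log(u)                                  # KL(p || q)
f_rkl = lambda u: -np.log(u)                                     # reverse KL(q || p)
f_js  = lambda u: u * np.log(u) - (u + 1) * np.log((u + 1) / 2)  # 2 * Jensen-Shannon
f_tv  = lambda u: 0.5 * np.abs(u - 1)                            # Total Variation

p = np.array([0.5, 0.5])
q = np.array([0.25, 0.75])
kl_pq = f_divergence(p, q, f_kl)   # equals sum_x p(x) log(p(x)/q(x))
```

Note that the same generic routine covers every member of the family; only the convex generator f changes.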

Imitation learning as f-divergence minimization

Imitation learning is the process by which a learner tries to behave similarly to an expert based on inference from demonstrations or interactions. There are a number of ways to formalize “similarity” (Section 2) – either as a classification problem where learner must select the same action as the expert (Ross et al., 2011) or as an inverse RL problem where learner recovers a reward to explain expert behavior (Ratliff et al., 2009). Neither of the formulations is error free.

We argue that the metric we actually care about is matching the learner's distribution of trajectories ρ_π(τ) to the expert's ρ_{π*}(τ). One such reasonable objective is to minimize the f-divergence between these distributions:

    min_π D_f( ρ_{π*}(τ), ρ_π(τ) )     (3)

Interestingly, different choices of f-divergence lead to different learned policies (more in Section 5).

Since we have only sample access to the expert state-action distribution, the divergence between the expert and the learner has to be estimated. However, we need many samples to accurately estimate the trajectory distribution, as the size of the trajectory space grows exponentially with the time horizon. Instead, we can choose to minimize the divergence between the average state-action distributions:

    min_π D_f( ρ_{π*}(s, a), ρ_π(s, a) )     (4)
We show that this lower bounds the original objective, i.e. trajectory distribution divergence.

Theorem 3.1 (Proof in Appendix A).

Given two policies π and π′, the f-divergence between trajectory distributions is lower bounded by the f-divergence between average state-action distributions:

    D_f( ρ_π(τ), ρ_{π′}(τ) ) ≥ D_f( ρ_π(s, a), ρ_{π′}(s, a) )

4 Framework for Divergence Minimization

The key problem is that we don't know the expert policy; we only get to observe demonstrations from it. Hence we are unable to compute the divergence exactly and must instead estimate it from sample demonstrations. We build an estimator which lower bounds the state-action, and thus the trajectory, divergence. The learner then minimizes this estimate.

4.1 Variational approximation of divergence

Let's say we want to measure the f-divergence between two distributions p and q. Assume they are unknown, but we have i.i.d. samples from each, i.e., X_p ~ p and X_q ~ q. Can we use these to estimate the divergence? Nguyen et al. (2010) show that we can indeed estimate it by expressing f in its variational form, f(u) = sup_t ( tu − f*(t) ), where f* is the convex conjugate. (For a convex function f, the convex conjugate is f*(t) = sup_u ( ut − f(u) ); also f** = f.) Plugging this into the expression for the f-divergence (2) we have

    D_f(p, q) ≥ sup_{φ ∈ Φ} ( E_{x∼p}[ φ(x) ] − E_{x∼q}[ f*(φ(x)) ] )     (5)

Here φ : X → ℝ is a function approximator which we refer to as an estimator. The lower bound is both due to Jensen's inequality and the restriction to an estimator class Φ. Intuitively, we convert divergence estimation to a discriminative classification problem between two sample sets.

How should we choose the estimator class Φ? We can find the optimal estimator by taking the variation of the lower bound (5) to get φ*(x) = f′( p(x)/q(x) ). Hence Φ should be flexible enough to approximate the subdifferential f′ everywhere. Can we use neural network discriminators (Goodfellow et al., 2014) as our class Φ? Nowozin et al. (2016) show that to satisfy the range constraints of f*, we can parameterize φ = g_f(V_w), where V_w is an unconstrained discriminator and g_f is a divergence-specific activation function. We plug this into (5) and the result into (4) to arrive at the following problem.
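To make the estimation idea concrete, here is a small numerical sketch (ours, with made-up distributions): for the KL divergence, plugging the optimal estimator φ* = 1 + log(p/q) into the bound recovers the true divergence exactly, while a restricted (here, constant) estimator class yields a strict lower bound:

```python
import numpy as np

# Toy distributions (made up for illustration).
p = np.array([0.6, 0.3, 0.1])   # plays the role of the expert distribution
q = np.array([0.2, 0.3, 0.5])   # plays the role of the learner distribution
true_kl = float(np.sum(p * np.log(p / q)))

def kl_variational_bound(phi):
    # E_p[phi] - E_q[f*(phi)] with f(u) = u log u and f*(t) = exp(t - 1).
    return float(np.sum(p * phi) - np.sum(q * np.exp(phi - 1.0)))

phi_star = 1.0 + np.log(p / q)            # optimal estimator: f'(p/q)
tight = kl_variational_bound(phi_star)    # recovers the true KL exactly
loose = kl_variational_bound(np.ones(3))  # a restricted (constant) estimator
```

With samples instead of exact distributions, the two expectations are replaced by sample averages over X_p and X_q, which is what the algorithms below do.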

1: Sample trajectories from expert: D* ~ ρ_{π*}
2: Initialize learner and estimator parameters θ₀, w₀
3: for i = 0 to N − 1 do
4:     Sample trajectories from learner: D_θ ~ ρ_{π_θi}
5:     Update estimator w_{i+1} by gradient ascent on (6)
6:     Apply policy gradient to update learner θ_{i+1} by descent on (6)
7: end for
Algorithm 1 fVIM
Problem 1 (Variational Imitation (VIM)).

Given a divergence f, compute a learner θ and discriminator w as the saddle point of the following optimization:

    min_θ max_w  E_{(s,a)∼D*}[ g_f(V_w(s, a)) ] − E_{(s,a)∼D_θ}[ f*( g_f(V_w(s, a)) ) ]     (6)

where D* are sampled expert demonstrations and D_θ are sampled learner rollouts.

We propose the algorithmic framework fVIM (Algorithm 1), which solves (6) iteratively by updating the estimator via supervised learning and the learner via policy gradients. Algorithm 1 is a meta-algorithm: plugging in different f-divergences (Table 1), we obtain different algorithms.

  1. KLVIM: Minimizing the forward KL divergence,

    min_θ max_w  E_{D*}[ V_w(s, a) ] − E_{D_θ}[ exp( V_w(s, a) − 1 ) ]     (7)

  2. RKLVIM: Minimizing the reverse KL divergence (removing constant factors),

    min_θ max_w  −E_{D*}[ exp( −V_w(s, a) ) ] − E_{D_θ}[ V_w(s, a) ]     (8)

  3. JSVIM: Minimizing the Jensen-Shannon divergence,

    min_θ max_w  E_{D*}[ log D_w(s, a) ] + E_{D_θ}[ log( 1 − D_w(s, a) ) ]     (9)

    where D_w(s, a) = σ( V_w(s, a) ) is a sigmoid discriminator.
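The following toy sketch (illustrative, not the paper's experiment) runs an idealized RKLVIM loop on a single-state bandit, where the state-action distribution reduces to the action distribution. The estimator step is solved in closed form as a log density ratio and all expectations are exact, so it isolates the alternating structure of Algorithm 1 without sampling noise:

```python
import numpy as np

# Idealized RKLVIM on a 3-armed bandit: a single state, so the state-action
# distribution is just the action distribution. The estimator step is solved
# in closed form (a log density ratio) and expectations are computed exactly
# rather than sampled. All numbers are illustrative, not from the paper.
expert = np.array([0.45, 0.45, 0.10])   # multi-modal expert action distribution
theta = np.array([1.0, -1.0, 0.5])      # learner softmax logits

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.5
for _ in range(500):
    q = softmax(theta)
    phi = np.log(q / expert)                  # estimator step: log density ratio
    grad = q * (phi - float(np.dot(q, phi)))  # exact policy gradient of E_q[phi]
    theta = theta - lr * grad                 # learner step

learner = softmax(theta)
```

With the estimator solved exactly each round, the learner step is exactly gradient descent on KL(learner || expert), so the learner distribution converges to the expert's here; a restricted policy class (as in Section 5) is what forces mode collapse.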

4.2 Recovering existing imitation learning algorithms

We show how various existing IL approaches can be recovered deferring to Appendix C for details.

Behavior Cloning (Pomerleau, 1988) – Kullback-Leibler (KL) divergence. For KL, setting f(u) = u log u in (3) and applying the Markov property, the objective reduces to min_θ −E_{(s,a)∼ρ_{π*}}[ log π_θ(a|s) ], which is simply behavior cloning with a cross-entropy loss for multi-class classification.
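As a sketch of this reduction (our own toy setup, not from the paper), fitting a tabular softmax policy with the cross-entropy loss recovers the empirical conditional action distribution of the demonstrations, i.e. the forward-KL projection:

```python
import numpy as np

# Behavior cloning as forward-KL minimization: with a tabular softmax policy
# and the cross-entropy loss, the minimizer is the empirical conditional
# action distribution of the demonstrations. Toy random demonstrations.
rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
demos = [(int(rng.integers(n_states)), int(rng.integers(n_actions)))
         for _ in range(2000)]

counts = np.zeros((n_states, n_actions))
for s, a in demos:
    counts[s, a] += 1

def policy(logits):
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

logits = np.zeros((n_states, n_actions))
lr = 0.5
for _ in range(300):
    pi = policy(logits)
    # gradient of the average cross-entropy loss  -E_demo[log pi(a|s)]
    grad = (pi * counts.sum(axis=1, keepdims=True) - counts) / len(demos)
    logits -= lr * grad

empirical = counts / counts.sum(axis=1, keepdims=True)
```

The fitted policy matches the per-state empirical action frequencies, which is exactly the M-projection behavior discussed in Section 5.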

Generative Adversarial Imitation Learning (GAIL) (Ho and Ermon, 2016) – Jensen-Shannon (JS) divergence. We see that JSVIM (9) is exactly the GAIL optimization (without the entropic regularizer).

Divergence        | f(u)                           | f*(t)               | Optimal estimator φ*  | Activation g_f(v)
KL                | u log u                        | exp(t − 1)          | 1 + log(p/q)          | v
Reverse KL        | −log u                         | −1 − log(−t)        | −q/p                  | −exp(−v)
Jensen-Shannon    | u log u − (u+1) log((u+1)/2)   | −log(2 − exp(t))    | log( 2p/(p+q) )       | log 2 − log(1 + exp(−v))
Total Variation   | |u − 1|/2                      | t (for |t| ≤ 1/2)   | sign(p/q − 1)/2       | tanh(v)/2
Table 1: List of f-divergences used, with conjugates, optimal estimators and activation functions

Dataset Aggregation (DAgger) (Ross et al., 2011) – Total Variation (TV) distance. Using the fact that TV is a distance metric, and Pinsker's inequality, we have the following upper bound on the trajectory TV in terms of the expected action divergence on states visited by the learner:

    D_TV( ρ_{π*}(τ), ρ_{π_θ}(τ) ) ≤ Σ_{t=1}^T E_{s∼ρ_t^{π_θ}}[ D_TV( π*(a|s), π_θ(a|s) ) ]

DAgger solves this non-i.i.d. problem in an iterative supervised learning manner with an interactive expert. Counter-examples to DAgger (Laskey et al., 2017a) can now be explained as an artifact of this divergence.

4.3 Alternate techniques for Reverse KL minimization via interactive learning

We highlight the Reverse KL divergence, which has received relatively less attention in the IL literature. We briefly summarize our approaches, deferring to Appendix D and Appendix E for details. RKLVIM (8) has some shortcomings. First, it's a double lower-bound approximation, due to Theorem 3.1 and (5). Secondly, the optimal estimator is a state-action density ratio, which may be quite complex (Table 1). Finally, the optimization (6) may be slow to converge. Interestingly, Reverse KL has a special structure that we can exploit to do even better if we have an interactive expert!

For reverse KL, the trajectory divergence decomposes exactly into an expected action divergence (the dynamics terms cancel inside the log ratio):

    KL( ρ_{π_θ}(τ) ‖ ρ_{π*}(τ) ) = Σ_{t=1}^T E_{s∼ρ_t^{π_θ}}[ KL( π_θ(a|s) ‖ π*(a|s) ) ]

Hence we can directly minimize the action distribution divergence. Since this is on states induced by π_θ, this falls under the regime of interactive learning (Ross et al., 2011), where we query the expert on states visited by the learner. We explore two different interactive learning techniques for the I-projection.

Variational action divergence minimization. Apply the RKLVIM idea, but on the action divergence: at each iteration, estimate the reverse KL between action distributions on states sampled from the learner, using the variational lower bound (5). Unlike RKLVIM, we collect a fresh batch of data from both an interactive expert and the learner every iteration. We show that this estimator is far easier to approximate than RKLVIM's (Appendix D).

Density ratio minimization via no-regret online learning. We first upper bound the action divergence by an expected density ratio. Given a batch of data from an interactive expert and the learner, we invoke an off-the-shelf density ratio estimator (DRE) (Kanamori et al., 2012) to estimate this ratio. Since the optimization is a non-i.i.d. learning problem, we solve it by dataset aggregation. Note that this does not require invoking policy gradients. In fact, if we choose an expressive enough policy class, this method gives us a global performance guarantee which neither GAIL nor any fVIM provides (Appendix E).

5 Multi-modal Trajectory Demonstrations

(a) The ideal policy.

(b) RKL mode collapsing

(c) KL/JS mode covering
Figure 2: Illustration of the safety concerns of mode-covering behavior. (a) Expert demonstrations and policy roll-outs are shown in blue and red, respectively. (b) RKL receives only a small penalty for the safe behavior whereas KL receives an infinite penalty. (c) The opposite is true for the unsafe behavior where learner crashes.

We now examine multi-modal expert demonstrations. Consider the demonstrations in Fig. 2 which avoid colliding with a tree by turning left or right with equal probability. Depending on the policy class, it may be impossible to achieve zero divergence for any choice of f-divergence (Fig. 1(a)), e.g., when the policy is a Gaussian with fixed variance. Then the question becomes: if the globally optimal policy in our policy class achieves non-zero divergence, how should we design our objective to fail elegantly and safely? In this example, one can imagine two reasonable choices: (1) replicate one of the modes very well (i.e. mode-collapsing) or (2) cover both the modes plus the region between them (i.e. mode-covering). We argue that in many imitation learning tasks the former behavior is preferable, as it produces trajectories similar to previously observed demonstrations.

Mode-covering in KL.

This divergence exhibits strong mode-covering tendencies as in Fig. 1(c)

. Examining the definition of the KL divergence, we see that there is a significant penalty for failing to completely support the demonstration distribution, but no explicit penalty for generating outlier samples. In fact, if the learner assigns zero probability to a demonstrated state-action pair, then the divergence is infinite. However, the opposite does not hold. Thus, the KLVIM-optimal policy belongs to the second behavior class – which causes the agent to frequently crash into the tree.

Mode-collapsing in RKL.

At the other end of the multi-modal behavior spectrum lies the RKL divergence, which exhibits strong mode-seeking behavior as in Fig. 1(b), due to switching the expectation from the expert distribution to the learner distribution. Note there is no explicit penalty for failing to entirely cover the expert distribution, but an arbitrarily large penalty for generating samples that are improbable under the demonstrator distribution. This results in always turning left or always turning right around the tree, depending on the initialization and mode mixture. For many tasks, failing in such a manner is predictable and safe, as we have already seen similar trajectories from the demonstrator.


Behavior of JS. This divergence may fall into either behavior class, depending on the MDP, the demonstrations, and the optimization initialization. Examining the definition, we see the divergence is symmetric and expectations are taken over both the expert and the learner distributions. Thus, if either distribution is unsupported where the other has mass (or vice versa), the divergence remains finite. Later, we empirically show that although it is possible to achieve safe mode-collapse with JS on some tasks, this is not always the case.
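The mode-covering/mode-seeking contrast can be reproduced numerically with a toy setup (ours, not the paper's racecar example): a two-mode Gaussian expert and a fixed-variance Gaussian learner. Forward KL selects the unsafe interpolating mean near 0 (moment matching), while reverse KL collapses onto one of the modes:

```python
import numpy as np

# Expert: two-mode Gaussian mixture (modes at +/-2). Learner: fixed-variance
# Gaussian with mean mu. All numbers are illustrative.
x = np.linspace(-5.0, 5.0, 2001)
dx = x[1] - x[0]

def gauss(x, mu, sigma=0.5):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

expert = 0.5 * gauss(x, -2.0) + 0.5 * gauss(x, 2.0)

mus = np.linspace(-3.5, 3.5, 141)
# KL(expert || learner) and KL(learner || expert), by numerical integration.
fkl = [np.sum(expert * np.log(expert / gauss(x, m))) * dx for m in mus]
rkl = [np.sum(gauss(x, m) * np.log(gauss(x, m) / expert)) * dx for m in mus]

mu_fkl = float(mus[int(np.argmin(fkl))])   # mode-covering: lands between the modes
mu_rkl = float(mus[int(np.argmin(rkl))])   # mode-seeking: lands on a mode
```

For a fixed-variance Gaussian learner, the forward-KL minimizer is exactly the expert mean (here 0, i.e. inside the obstacle), while the reverse-KL minimizer sits on one of the modes at ±2.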

6 Experiments

In this section, we empirically validate the following hypotheses:


  • H1: The globally optimal policy for RKL imitates a subset of the demonstrator modes, whereas JS and KL tend to interpolate between them.

  • H2: The sample-based estimator for KL and JS underestimates the divergence more than RKL's.

  • H3: The policy gradient optimization landscape for KL and JS with continuously parameterized policies is more susceptible to local minima, compared to RKL.

To test these hypotheses, we introduce two environments, Bandit and GridWorld, and some useful policy classes. (1) The bandit environment consists of a single state and three actions: two modal actions, which the multi-modal expert chooses with equal probability as in Fig. 2(a), and a third action that interpolates between them. We choose a specific policy class containing a deterministic policy for each modal action and a policy that stochastically selects among all three actions. Later, we also consider a continuously parameterized policy class (Appendix G) for use with policy gradient methods.

(2) The GridWorld environment is a grid world (Fig. 2(b)) with a start (S) and a terminal (T) state. Its center state is undesirable, and the demonstrator moves left or right at S to avoid the center (Fig. 2(c)). The environment has control noise and transition noise. Fig. 2(d) shows the resulting multi-modal demonstrations. We specify a policy class such that agents can move up, right, down, or left at each state. Later, we also consider a continuously parameterized policy class (Appendix G) for use with policy gradient methods.

(a) Bandit expert policy
(b) Gridworld
(c) Expert policy
(d) Rollouts
Figure 3: The demonstrator policy in the Bandit environment (2(a)); the GridWorld environment (2(b)); the GridWorld demonstrator policy (2(c)); and example rollouts and state visitation frequency with noise (2(d))

Policy enumeration

To test H1, we enumerate all policies in the policy class, exactly compute their stationary distributions, and select the policy with the smallest exact f-divergence. Note that this is guaranteed to produce the optimal policy. Our results on the bandit and GridWorld (Tables 3(a) and 3(b)) show that the globally optimal solution to the RKL objective successfully collapses to a single mode (e.g. A and Right, respectively), whereas KL and JS interpolate between the modes (i.e. M and Up, respectively).

Whether the optimal policy is mode-covering or mode-collapsing depends on the stochasticity in the policy, parameterized by the control noise in the bandit case. Fig. 4 shows how the divergences and the resulting optimal policy change as a function of this noise. Note that RKL strongly prefers mode collapsing, KL strongly prefers mode covering, and JS is between the two other divergences.

Divergence estimation

To test H2, we compare the sample-based estimates of the f-divergences to the true values in Fig. 5. We highlight the preferred policies under each objective (those in the 1st percentile of estimates). For the highlighted group, the estimate is often much lower than the true divergence for KL and JS, perhaps due to the sampling issue discussed in Appendix F.

Policy gradient optimization landscape

To test H3, we compare KLVIM, RKLVIM and JSVIM, solving for a locally optimal policy with policy gradient in each case. Tables 3(c) and 3(d) show that RKLVIM empirically produces policies that collapse to a single mode, whereas JSVIM and KLVIM do not.

An interesting phenomenon is that the policies produced by RKLVIM typically have low JS divergence compared with policies produced by JSVIM, as shown in Fig. 6. This suggests that the RKL optimization landscape itself may be more amenable to imitation learning.

Table 2: Globally optimal policies produced by policy enumeration (3(a) and 3(b)), and locally optimal policies produced by policy gradient (3(c) and 3(d)). Note that in all cases, the RKL policy tends to collapse to one of the demonstrator modes, whereas the other policies interpolate between the modes, resulting in unsafe behavior.
(e) RKL
(f) JS
(g) KL
Figure 4: The true divergences and corresponding globally optimal bandit policy as a function of the control noise. Note that RKL strongly prefers the mode-collapsing policy (except at high control noise), KL strongly prefers the mode-covering policy, and JS is between the two.
(a) RKL
(b) JS
(c) KL
Figure 5: Comparing true f-divergences with the estimated values, each normalized to [0, 1]. Preferred policies under each objective (in the 1st percentile of estimates) are in red. The normalized estimates appear to be typically lower than the normalized true values for JS and KL.
(a) RKL
(b) JS
(c) KL
Figure 6: Divergences of locally optimal policies produced by RKLVIM, JSVIM and KLVIM. Each point is a policy produced by one of the fVIM. Interestingly, policies produced by RKLVIM also tend to have low JS divergence compared with policies produced by directly optimizing for JS divergences.


This work was (partially) funded by the National Institute of Health R01 (# R01EB019335), National Science Foundation CPS (#1544797), National Science Foundation NRI (#1637748), the Office of Naval Research, the RCTA, Amazon, and Honda Research Institute USA.


Appendix A Lower Bounding f-Divergence of Trajectory Distribution with State-Action Distribution

We begin with a lemma that relates -divergence between two vectors and their sum.

Lemma A.1 (Generalized log sum inequality).

Let a₁, …, a_n and b₁, …, b_n be non-negative numbers. Let a = Σᵢ aᵢ and b = Σᵢ bᵢ. Let f be a convex function. We have the following:

    Σᵢ bᵢ f( aᵢ / bᵢ ) ≥ b f( a / b )     (12)

Proof. Write b f(a/b) = b f( Σᵢ (bᵢ/b)(aᵢ/bᵢ) ) ≤ Σᵢ bᵢ f( aᵢ/bᵢ ), where (12) is due to Jensen's inequality, since f is convex, the weights bᵢ/b are non-negative, and Σᵢ bᵢ/b = 1. ∎
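A quick numerical sanity check of the lemma (our own, with the convex choice f(u) = u log u):

```python
import numpy as np

# Check: sum_i b_i f(a_i / b_i) >= b f(a / b) with a = sum a_i, b = sum b_i,
# for random non-negative inputs and the convex f(u) = u log u.
rng = np.random.default_rng(0)
f = lambda u: u * np.log(u)

violations = 0
for _ in range(1000):
    a = rng.uniform(0.01, 1.0, size=8)
    b = rng.uniform(0.01, 1.0, size=8)
    lhs = float(np.sum(b * f(a / b)))
    rhs = float(b.sum() * f(a.sum() / b.sum()))
    if lhs < rhs - 1e-12:
        violations += 1
```

No violations occur, as guaranteed by Jensen's inequality.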

We use Lemma A.1 to prove a more general lemma that relates the f-divergence defined over two spaces where one of the space is rich enough in information to explain away the other.

Lemma A.2 (Information loss).

Let X and Y be two random variables. Let p(x, y) and q(x, y) be joint probability distributions with marginals p(x), q(x) and p(y), q(y). Assume that X can explain away Y. This is expressed as follows – given the two probability distributions p, q, assume the following equality holds for all x, y:

    p(y|x) = q(y|x)     (13)

Under these conditions, the following inequality holds:

    D_f( p(x), q(x) ) ≥ D_f( p(y), q(y) )

Proof.

    D_f( p(x), q(x) ) = Σ_x q(x) f( p(x)/q(x) )
                      = Σ_x Σ_y q(x, y) f( ( p(x, y)/p(y|x) ) / ( q(x, y)/q(y|x) ) )     (17)
                      = Σ_x Σ_y q(x, y) f( p(x, y) / q(x, y) )     (18)
                      ≥ Σ_y q(y) f( p(y)/q(y) ) = D_f( p(y), q(y) )     (19)

We get (17) by applying p(x) = p(x, y)/p(y|x) and q(x) = q(x, y)/q(y|x). We get (18) by applying the equality constraint from (13). We get (19) from Lemma A.1 by setting aᵢ = p(x, y), bᵢ = q(x, y) and summing over all x keeping y fixed. ∎
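A numerical sanity check of the information-loss inequality (our own sketch), using random joints that share a conditional:

```python
import numpy as np

# Check Lemma A.2: if p and q share the conditional C(y|x), then
# D_f(p(x), q(x)) >= D_f(p(y), q(y)), here with f(u) = u log u (KL).
rng = np.random.default_rng(1)

def f_div(p, q, f=lambda u: u * np.log(u)):
    return float(np.sum(q * f(p / q)))

violations = 0
for _ in range(200):
    px = rng.dirichlet(np.ones(6))
    qx = rng.dirichlet(np.ones(6))
    C = rng.dirichlet(np.ones(4), size=6)   # shared conditional: C[x, y] = P(y|x)
    py, qy = px @ C, qx @ C                 # induced marginals over y
    if f_div(px, qx) < f_div(py, qy) - 1e-10:
        violations += 1
```

This is the data-processing property of f-divergences: pushing both distributions through the same channel can only lose information.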

We are now ready to prove Theorem 3.1 using Lemma A.2.

Proof of Theorem 3.1.

Let random variable X belong to the space of trajectories, x = τ. Let random variable Y belong to the space of state-action pairs, y = (s, a). Note that for any joint distribution induced by a policy, the conditional over (s, a) given τ is policy-independent:

    ρ_{π*}( (s, a) | τ ) = ρ_{π_θ}( (s, a) | τ )

This is because a trajectory contains all information about (s, a), i.e. the conditional is simply the empirical frequency of (s, a) within τ. Upon applying Lemma A.2 we have the inequality

    D_f( ρ_{π*}(τ), ρ_{π_θ}(τ) ) ≥ D_f( ρ_{π*}(s, a), ρ_{π_θ}(s, a) ) ∎
The bound is reasonable as it merely states that information gets lost when temporal information is discarded. Note that the theorem also extends to average state distributions, i.e.

Corollary A.1.

The divergence between trajectory distributions is lower bounded by the divergence between average state distributions.

How tight is this lower bound? We examine the gap

Corollary A.2.

The gap between the two divergences is

    D_f( ρ_{π*}(τ), ρ_{π_θ}(τ) ) − D_f( ρ_{π*}(s, a), ρ_{π_θ}(s, a) ) = Σ_{(s,a)} ρ_{π_θ}(s, a) D_f( ρ_{π*}(τ | s, a), ρ_{π_θ}(τ | s, a) )

where we use the factorization of the joint distribution over (τ, (s, a)). ∎

Let T_{(s,a)} be the set of trajectories that contain (s, a). The gap is the conditional f-divergence between trajectory distributions restricted to T_{(s,a)}, scaled by ρ_{π_θ}(s, a). The gap comes from whether we treat trajectories as separate events (in the case of trajectory divergence) or as the same event (in the case of state-action divergence).

Appendix B Relating f-Divergence of Trajectory Distribution with Expected Action Distribution

In this section we explore the relation of divergences between induced trajectory distribution and induced action distribution. We begin with a general lemma

Lemma B.1.

Given a policy π and a general feature function φ(s, a), the expected feature counts along induced trajectories are the same (up to the horizon factor) as the expected feature counts under the induced state-action distribution:

    E_{τ∼ρ_π}[ Σ_{t=1}^T φ(s_t, a_t) ] = T · E_{(s,a)∼ρ_π}[ φ(s, a) ]

Proof. Expanding the LHS we have

    E_{τ∼ρ_π}[ Σ_{t=1}^T φ(s_t, a_t) ] = Σ_{t=1}^T E_{(s_t, a_t)∼ρ_t^π}[ φ(s_t, a_t) ]     (27)
                                       = Σ_{t=1}^T Σ_{(s,a)} ρ_t^π(s, a) φ(s, a)     (28)
                                       = T Σ_{(s,a)} ρ_π(s, a) φ(s, a)     (29)

where (27) is due to marginalizing out the future, (28) is due to the fact that the state space is the same across time, and (29) results from applying the average state-action distribution definition. ∎

We can use this lemma to get several useful equalities such as the average state visitation frequency

Theorem B.1.

Given a policy π, if we do a tally count of states visited by induced trajectories, we recover the average state visitation frequency:

    (1/T) E_{τ∼ρ_π}[ Σ_{t=1}^T 1(s_t = s) ] = ρ^π(s)

Proof. Apply Lemma B.1 with φ(s_t, a_t) = 1(s_t = s). ∎

Unfortunately, Lemma B.1 does not hold for f-divergences in general. But we can analyze a subclass of f-divergences that satisfy the following triangle inequality: for any distributions p, q, r,

    D_f(p, q) ≤ D_f(p, r) + D_f(r, q)     (31)

Examples of such divergences are the Total Variation distance and the Squared Hellinger distance.

We now show that for such divergences (which are actually distances), we can upper bound the trajectory f-divergence. Contrast this to the lower bound discussed in Appendix A. The upper bound is attractive because the trajectory divergence is the term that we actually care about bounding.

Also note the implications of the upper bound – we now need expert labels on states collected by the learner. Hence we need an interactive expert that we can query from arbitrary states.

Theorem B.2 (Upper bound).

Given two policies π and π′, and f-divergences that satisfy the triangle inequality, the divergence between the trajectory distributions is upper bounded by the expected divergence between the action distributions on states induced by π′.

Proof of Theorem B.2.

We will introduce some notation to aid in explaining the proof. Let a trajectory segment be

    τ_{1:t} = (s₁, a₁, …, s_t, a_t)

Recall that the probability of a trajectory induced by policy π is

    ρ_π(τ) = ρ₀(s₁) Π_{h=1}^T π(a_h | s_h) P(s_{h+1} | s_h, a_h)

We also introduce a non-stationary policy π_t that executes π′ for the first t steps and then π thereafter. Hence, the probability of a trajectory induced by π_t is

    ρ_{π_t}(τ) = ρ₀(s₁) [ Π_{h=1}^{t} π′(a_h | s_h) ] [ Π_{h=t+1}^{T} π(a_h | s_h) ] Π_{h=1}^{T} P(s_{h+1} | s_h, a_h)

Let us consider the divergence between the distributions ρ_π(τ) and ρ_{π′}(τ) and apply the triangle inequality (31) with respect to the intermediate distributions ρ_{π_t}(τ)