1 Introduction
We study the problem of imitation learning from demonstrations that have multiple modes. This is often the case for tasks with multiple, diverse near-optimal solutions, where the expert has no clear preference between different choices (e.g. navigating left or right around obstacles (Ross et al., 2013)). Imperfect human-robot interfaces also lead to variability in inputs (e.g. kinesthetic demonstrations with robot arms (Finn et al., 2016b)). Experts may also vary in skill, preferences and other latent factors. We argue that in many such settings, it suffices to learn a single mode of the expert demonstrations to solve the task. How do state-of-the-art imitation learning approaches fare when presented with multimodal inputs?
Consider the example of imitating a racecar driver navigating around an obstacle. The expert sometimes steers left, other times steers right. What happens if we apply behavior cloning (Pomerleau, 1988)
on this data? The learner policy (a Gaussian with fixed variance) interpolates between the modes and drives into the obstacle.
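This failure mode can be reproduced in a few lines. The sketch below is our own illustration (the setup and numbers are not from the paper): maximum-likelihood fitting of a fixed-variance Gaussian policy to bimodal steering demonstrations reduces to taking the sample mean, which lands between the modes.

```python
import numpy as np

# Illustrative sketch (not the paper's experiment): behavior cloning
# with a fixed-variance Gaussian policy on bimodal steering data.
# The expert steers left (-1) or right (+1) with equal probability.
rng = np.random.default_rng(0)
left = rng.normal(-1.0, 0.05, size=500)
right = rng.normal(+1.0, 0.05, size=500)
demos = np.concatenate([left, right])

# The maximum-likelihood fit of the Gaussian mean is the sample mean,
# so the cloned policy steers near 0 -- straight into the obstacle.
mu_bc = demos.mean()
print(mu_bc)  # close to 0.0, between the two expert modes
```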
Interestingly, this oddity is not restricted to behavior cloning. Li et al. (2017) show that a more sophisticated approach, GAIL (Ho and Ermon, 2016), also exhibits a similar trend. Their proposed solution, InfoGAIL (Li et al., 2017), tries to recover all the latent modes and learn a policy for each one. For demonstrations with several modes, recovering all such policies will be prohibitively slow to converge.
Our key insight is to view imitation learning algorithms as minimizing a divergence between the expert and the learner trajectory distributions. Specifically, we examine the family of f-divergences. Since they cannot be minimized exactly, we adopt estimators from Nowozin et al. (2016). We show that behavior cloning minimizes the Kullback-Leibler (KL) divergence (M-projection), GAIL minimizes the Jensen-Shannon (JS) divergence, and DAgger minimizes the Total Variation (TV) distance. Since both the JS and KL divergences exhibit a mode-covering behavior, they end up interpolating across modes. On the other hand, the reverse-KL divergence (I-projection) has a mode-seeking behavior and elegantly collapses on a subset of modes fairly quickly.
The contributions and organization of the remainder of the paper are as follows.^1 (^1 We refer the reader to the supplementary material for appendices containing detailed exposition.)

We introduce a unifying framework for imitation learning as minimization of divergence between learner and expert trajectory distributions (Section 3).

We propose algorithms for minimizing estimates of any f-divergence. Our framework is able to recover several existing imitation learning algorithms for different divergences. We closely examine the reverse KL divergence and propose efficient algorithms for it (Section 4).
2 Related Work
Imitation learning (IL) has a long-standing history in robotics as a tool to program desired skills and behavior in autonomous machines (Osa et al., 2018; Argall et al., 2009; Billard et al., 2016; Bagnell, 2015). Even though IL has of late been used to bootstrap reinforcement learning (RL) (Ross and Bagnell, 2014; Sun et al., 2017, 2018; Cheng et al., 2018; Rajeswaran et al., 2017), we focus on the original problem where an extrinsic reward is not defined. We ask the question – "what objective captures the notion of similarity to expert demonstrations?". Note that this question is orthogonal to other factors such as whether we are model-based / model-free or whether we use a policy / trajectory representation.

IL can be viewed as supervised learning (SL) where the learner selects the same action as the expert (referred to as behavior cloning (Pomerleau, 1989)). However, small errors lead to large distribution mismatch. This can be somewhat alleviated by interactive learning, such as DAgger (Ross et al., 2011). Although shown to be successful in various applications (Ross et al., 2013; Kim et al., 2013; Gupta et al., 2017), there are domains where it is impractical to have on-policy expert labels (Laskey et al., 2017b, 2016). More alarmingly, there are counterexamples where the DAgger objective results in undesirable behaviors (Laskey et al., 2017a). We discuss this further in Appendix C.

Another way is to view IL as recovering a reward (IRL) (Ratliff et al., 2009, 2006) or Q-value (Piot et al., 2017) that makes the expert seem optimal. Since this is overly strict, it can be relaxed to value matching which, for linear rewards, further reduces to matching feature expectations (Abbeel and Ng, 2004). Moment matching naturally leads to maximum entropy formulations (Ziebart et al., 2008), which have been used successfully in various applications (Finn et al., 2016b; Wulfmeier et al., 2015). Interestingly, our divergence estimators also match moments, suggesting a deeper connection.

The degeneracy issues of IRL can be alleviated by a game-theoretic framework where an adversary selects a reward function and the learner must compete to do as well as the expert (Syed and Schapire, 2008; Ho et al., 2016). Hence IRL can be connected to min-max formulations (Finn et al., 2016a) like GANs (Goodfellow et al., 2014). GAIL (Ho and Ermon, 2016) and SAM (Blondé and Kalousis, 2018) use this to directly recover policies; AIRL (Fu et al., 2017) and EAIRL (Qureshi and Yip, 2018) use it to recover rewards. This connection to GANs leads to interesting avenues such as stabilizing min-max games (Peng et al., 2018b), learning from pure observations (Torabi et al., 2018b, a; Peng et al., 2018a) and links to f-divergence minimization (Nowozin et al., 2016; Nguyen et al., 2010).
In this paper, we view IL as divergence minimization between learner and expert. Our framework encompasses methods that look at specific measures of divergence such as minimizing relative entropy (Boularias et al., 2011) or symmetric cross-entropy (Rhinehart et al., 2018). Note that Ghasemipour et al. (2018) also independently arrive at such connections between f-divergence and IL,^2 (^2 although the algorithms we propose for RKL and our analysis of DAgger differ significantly.) We particularly focus on multimodal expert demonstrations, which have generally been treated by clustering data and learning on each cluster (Babes et al., 2011; Dimitrakakis and Rothkopf, 2011). InfoGAN (Chen et al., 2016) formalizes this within the GAN framework to recover latent clusters, which is then extended to IL (Hausman et al., 2017; Li et al., 2017). Instead, we look at the role of the divergence itself with such inputs.
3 Problem Formulation
Preliminaries
We work with a finite horizon Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \rho_0, T)$ where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, and $\mathcal{P}$ is the transition dynamics. $\rho_0$ is the initial distribution over states and $T$ is the time horizon. In the IL paradigm, the MDP does not include a reward function.

The f-divergence family
Divergences, such as the well-known Kullback-Leibler (KL) divergence, measure differences between probability distributions. We consider a broad class of such divergences called f-divergences (Csiszár et al., 2004; Liese and Vajda, 2006). Given probability distributions $p$ and $q$ over a finite set of random variables $\mathcal{X}$, such that $p$ is absolutely continuous w.r.t. $q$, we define the f-divergence:

$$D_f(p, q) = \sum_{x \in \mathcal{X}} q(x) f\left(\frac{p(x)}{q(x)}\right) \quad (2)$$

where $f$ is a convex, lower semi-continuous function with $f(1) = 0$. Different choices of $f$ recover different divergences, e.g. KL, Jensen-Shannon or Total Variation (see (Nowozin et al., 2016) for a full list).
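For discrete distributions, definition (2) is straightforward to compute directly. The sketch below assumes finite supports with $q > 0$ everywhere; the generator functions are the standard choices for forward KL, reverse KL, and total variation.

```python
import numpy as np

# Minimal sketch of the f-divergence definition for discrete p, q:
# D_f(p, q) = sum_x q(x) f(p(x) / q(x)), with f convex and f(1) = 0.
def f_divergence(p, q, f):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

kl  = lambda u: u * np.log(u)        # forward KL
rkl = lambda u: -np.log(u)           # reverse KL
tv  = lambda u: 0.5 * np.abs(u - 1)  # total variation

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
print(f_divergence(p, q, kl))   # matches sum_x p(x) log(p(x)/q(x))
print(f_divergence(p, q, tv))   # matches 0.5 * sum_x |p(x) - q(x)|
```

Note that plugging $f(u) = u\log u$ into (2) recovers the familiar form $\sum_x p(x)\log\frac{p(x)}{q(x)}$, since $q \cdot \frac{p}{q}\log\frac{p}{q} = p\log\frac{p}{q}$.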
Imitation learning as fdivergence minimization
Imitation learning is the process by which a learner tries to behave similarly to an expert based on inference from demonstrations or interactions. There are a number of ways to formalize "similarity" (Section 2) – either as a classification problem where the learner must select the same action as the expert (Ross et al., 2011), or as an inverse RL problem where the learner recovers a reward to explain expert behavior (Ratliff et al., 2009). Neither formulation is error-free.
We argue that the metric we actually care about is matching the distribution of learner trajectories $\rho_\pi(\tau)$ to that of the expert $\rho_{\pi^*}(\tau)$. One such reasonable objective is to minimize the divergence between these distributions:

$$\min_{\pi} \; D_f\left(\rho_{\pi^*}(\tau), \rho_{\pi}(\tau)\right) \quad (3)$$
Interestingly, different choices of divergence lead to different learned policies (more in Section 5).
Since we have only sample access to the expert state-action distribution, the divergence between the expert and the learner has to be estimated. However, we need many samples to accurately estimate the trajectory distribution, as the size of the trajectory space grows exponentially with time, i.e. $O(|\mathcal{S}|^T |\mathcal{A}|^T)$. Instead, we can choose to minimize the divergence between the average state-action distributions $\rho_\pi(s, a) = \frac{1}{T}\sum_{t=1}^{T} \rho^t_\pi(s, a)$:

$$\min_{\pi} \; D_f\left(\rho_{\pi^*}(s, a), \rho_{\pi}(s, a)\right) \quad (4)$$

We show that this lower bounds the original objective, i.e. the trajectory distribution divergence.
Theorem 3.1 (Proof in Appendix A).
Given two policies $\pi$ and $\tilde{\pi}$, the f-divergence between their trajectory distributions is lower bounded by the f-divergence between their average state-action distributions:

$$D_f\left(\rho_{\pi}(\tau), \rho_{\tilde{\pi}}(\tau)\right) \geq D_f\left(\rho_{\pi}(s, a), \rho_{\tilde{\pi}}(s, a)\right)$$
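The bound can be checked numerically on a toy case of our own devising (not from the paper): a single-state MDP where a policy picks one of three actions for $T$ steps. The trajectory distribution is then a product over independent action choices, and the average state-action distribution reduces to the action marginal.

```python
import numpy as np
from itertools import product

# Sanity check of Theorem 3.1 on a toy single-state MDP: compare the
# f-divergence of trajectory distributions (products over T action
# choices) with that of the average state-action distribution.
def f_div(p, q, f):
    return float(np.sum(q * f(p / q)))

kl = lambda u: u * np.log(u)
T = 3
pi_star = np.array([0.5, 0.2, 0.3])   # expert action distribution
pi      = np.array([0.4, 0.4, 0.2])   # learner action distribution

# Trajectory distribution: product over T i.i.d. action choices.
trajs = list(product(range(3), repeat=T))
p_traj = np.array([np.prod([pi_star[a] for a in tau]) for tau in trajs])
q_traj = np.array([np.prod([pi[a] for a in tau]) for tau in trajs])

d_traj = f_div(p_traj, q_traj, kl)
d_sa = f_div(pi_star, pi, kl)  # stationary policy => marginal = pi
print(d_traj >= d_sa)  # True: trajectory divergence dominates
```

For KL specifically, independence makes the trajectory divergence exactly $T$ times the marginal divergence, so the gap here is $(T-1) D_{KL}$.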
4 Framework for Divergence Minimization
The key problem is that we do not know the expert policy and only get to observe demonstrations from it. Hence we are unable to compute the divergence exactly and must instead estimate it from sample demonstrations. We build an estimator which lower bounds the state-action divergence, and thus the trajectory divergence. The learner then minimizes this estimate.
4.1 Variational approximation of divergence
Let's say we want to measure the divergence between two distributions $p$ and $q$. Assume they are unknown, but we have i.i.d. samples from each. Can we use these to estimate the divergence? Nguyen et al. (2010) show that we can indeed estimate it by expressing $f$ in its variational form, i.e. $f(u) = \sup_t \left(tu - f^*(t)\right)$, where $f^*$ is the convex conjugate.^4 (^4 For a convex function $f$, the convex conjugate is $f^*(t) = \sup_u \left(ut - f(u)\right)$. Also $f^{**} = f$.) Plugging this into the expression for the divergence (2) we have

$$D_f(p, q) \geq \sup_{\phi \in \Phi} \; \mathbb{E}_{x \sim p}\left[\phi(x)\right] - \mathbb{E}_{x \sim q}\left[f^*(\phi(x))\right] \quad (5)$$

Here $\phi$ is a function approximator which we refer to as an estimator. The lower bound is both due to Jensen's inequality and the restriction to an estimator class $\Phi$. Intuitively, we convert divergence estimation to a discriminative classification problem between two sample sets.
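When $p$ and $q$ are known discrete distributions, the tightness of bound (5) can be verified directly. The sketch below uses forward KL, where $f(u) = u\log u$, $f^*(t) = \exp(t-1)$, and the optimal estimator is $\phi^*(x) = 1 + \log\frac{p(x)}{q(x)}$; the distributions are our own examples.

```python
import numpy as np

# Sketch of the variational lower bound (5) for forward KL on a
# discrete space: the bound E_p[phi] - E_q[f*(phi)] is attained by the
# optimal estimator phi*(x) = 1 + log(p(x)/q(x)) and is loose otherwise.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
true_kl = np.sum(p * np.log(p / q))

def bound(phi):
    # E_p[phi(x)] - E_q[exp(phi(x) - 1)], with f*(t) = exp(t - 1)
    return np.sum(p * phi) - np.sum(q * np.exp(phi - 1))

phi_opt = 1 + np.log(p / q)       # optimal estimator: bound is tight
phi_bad = np.zeros(3)             # a poor estimator: bound is loose
print(bound(phi_opt) - true_kl)   # ~0: the bound is attained
print(bound(phi_bad) <= true_kl)  # True: any estimator lower-bounds
```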
How should we choose the estimator class $\Phi$? We can find the optimal estimator by taking the variation of the lower bound (5) to get $\phi^*(x) = f'\left(\frac{p(x)}{q(x)}\right)$. Hence $\Phi$ should be flexible enough to approximate the subdifferential $f'$ everywhere. Can we use neural network discriminators (Goodfellow et al., 2014) as our class $\Phi$? Nowozin et al. (2016) show that, to satisfy the range constraints of $f^*$, we can parameterize $\phi = g_f(V_w(x))$, where $V_w$ is an unconstrained discriminator and $g_f$ is an activation function. We plug this into (5) and the result into (4) to arrive at the following problem.

Problem 1 (Variational Imitation (VIM)).
Given a divergence $D_f$, compute a learner $\pi_\theta$ and discriminator $V_w$ as the saddle point of the following optimization:

$$\min_{\theta} \max_{w} \; \mathbb{E}_{(s,a) \sim \hat{\rho}_{\pi^*}}\left[g_f(V_w(s,a))\right] - \mathbb{E}_{(s,a) \sim \rho_{\pi_\theta}}\left[f^*(g_f(V_w(s,a)))\right] \quad (6)$$

where $\hat{\rho}_{\pi^*}$ are sample expert demonstrations and $\rho_{\pi_\theta}$ are sampled learner rollouts.
We propose the algorithmic framework f–VIM (Algorithm 1), which solves (6) iteratively by updating the estimator via supervised learning and the learner via policy gradients. Algorithm 1 is a meta-algorithm: plugging in different divergences (Table 1), we obtain different algorithms.

KL–VIM: Minimizing the forward KL divergence

$$\min_{\theta} \max_{w} \; \mathbb{E}_{(s,a) \sim \hat{\rho}_{\pi^*}}\left[V_w(s,a)\right] - \mathbb{E}_{(s,a) \sim \rho_{\pi_\theta}}\left[\exp\left(V_w(s,a) - 1\right)\right] \quad (7)$$

RKL–VIM: Minimizing the reverse KL divergence (removing constant factors)

$$\min_{\theta} \max_{w} \; \mathbb{E}_{(s,a) \sim \hat{\rho}_{\pi^*}}\left[-\exp\left(-V_w(s,a)\right)\right] - \mathbb{E}_{(s,a) \sim \rho_{\pi_\theta}}\left[V_w(s,a)\right] \quad (8)$$

JS–VIM: Minimizing the Jensen-Shannon divergence

$$\min_{\theta} \max_{w} \; \mathbb{E}_{(s,a) \sim \hat{\rho}_{\pi^*}}\left[\log D_w(s,a)\right] + \mathbb{E}_{(s,a) \sim \rho_{\pi_\theta}}\left[\log\left(1 - D_w(s,a)\right)\right] \quad (9)$$

where $D_w(s,a) = \sigma(V_w(s,a))$ is a sigmoid discriminator.
4.2 Recovering existing imitation learning algorithms
We show how various existing IL approaches can be recovered, deferring details to Appendix C.
Behavior Cloning (Pomerleau, 1988) – Kullback-Leibler (KL) divergence. For KL, setting $f(u) = u \log u$ in (3) and applying the Markov property of trajectories, we have $\min_\pi \mathbb{E}_{(s,a) \sim \rho_{\pi^*}}\left[-\log \pi(a|s)\right]$ (up to a constant), which is simply behavior cloning with a cross-entropy loss for multi-class classification.
Generative Adversarial Imitation Learning (GAIL) (Ho and Ermon, 2016) – Jensen-Shannon (JS) divergence. We see that JS–VIM (9) is exactly the GAIL optimization (without the entropic regularizer).
Table 1: f-divergences (Nowozin et al., 2016) and their optimal estimators $\phi^*(x) = f'\left(\frac{p(x)}{q(x)}\right)$.

Divergence | $f(u)$ | Optimal estimator $\phi^*(x)$
Kullback-Leibler | $u \log u$ | $1 + \log \frac{p(x)}{q(x)}$
Reverse KL | $-\log u$ | $-\frac{q(x)}{p(x)}$
Jensen-Shannon | $u \log u - (u+1) \log \frac{u+1}{2}$ | $\log \frac{2 p(x)}{p(x) + q(x)}$
Total Variation | $\frac{1}{2}\left|u - 1\right|$ | $\frac{1}{2} \mathrm{sign}\left(\frac{p(x)}{q(x)} - 1\right)$
Dataset Aggregation (DAgger) (Ross et al., 2011) – Total Variation (TV) distance. Using the fact that TV is a distance metric together with Pinsker's inequality, we have the following upper bound on TV:

$$D_{TV}\left(\rho_{\pi^*}(\tau), \rho_{\pi}(\tau)\right) \leq T \, \mathbb{E}_{s \sim \rho_{\pi}}\left[D_{TV}\left(\pi^*(\cdot|s), \pi(\cdot|s)\right)\right] \leq T \, \mathbb{E}_{s \sim \rho_{\pi}}\left[\sqrt{\tfrac{1}{2} D_{KL}\left(\pi^*(\cdot|s), \pi(\cdot|s)\right)}\right]$$

DAgger solves this non-i.i.d. problem in an iterative supervised learning manner with an interactive expert. Counterexamples to DAgger (Laskey et al., 2017a) can now be explained as an artifact of this divergence.
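Pinsker's inequality, the second step of the bound above, is easy to verify numerically; a quick check on random discrete distributions:

```python
import numpy as np

# Numerical check of Pinsker's inequality used in the DAgger bound:
# D_TV(p, q) <= sqrt(0.5 * D_KL(p || q)), on random discrete
# distributions drawn from a Dirichlet prior.
rng = np.random.default_rng(0)
for _ in range(100):
    p = rng.dirichlet(np.ones(4))
    q = rng.dirichlet(np.ones(4))
    tv = 0.5 * np.sum(np.abs(p - q))
    kl = np.sum(p * np.log(p / q))
    assert tv <= np.sqrt(0.5 * kl) + 1e-12
print("Pinsker holds on all samples")
```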
4.3 Alternate techniques for Reverse KL minimization via interactive learning
We highlight the reverse KL divergence, which has received relatively less attention in the IL literature. We briefly summarize our approaches, deferring details to Appendix D and Appendix E. RKL–VIM (8) has some shortcomings. First, it is a double lower bound approximation, due to Theorem 3.1 and (5). Secondly, the optimal estimator is a state-action density ratio, which may be quite complex (Table 1). Finally, the optimization (6) may be slow to converge. Interestingly, reverse KL has a special structure that we can exploit to do even better if we have an interactive expert!
The reverse KL divergence between trajectory distributions decomposes into divergences between action distributions on states visited by the learner, i.e. $D_{RKL}\left(\rho_{\pi}(\tau), \rho_{\pi^*}(\tau)\right) = T \, \mathbb{E}_{s \sim \rho_{\pi}}\left[D_{RKL}\left(\pi(\cdot|s), \pi^*(\cdot|s)\right)\right]$. Hence we can directly minimize the action distribution divergence. Since this is on states induced by $\pi$, this falls under the regime of interactive learning (Ross et al., 2011), where we query the expert on states visited by the learner. We explore two different interactive learning techniques for the I-projection.
Variational action divergence minimization. Apply the RKL–VIM idea, but on the action divergence:

$$\min_{\theta} \max_{w} \; \mathbb{E}_{s \sim \rho_{\pi_\theta}}\left[\mathbb{E}_{a \sim \pi^*(\cdot|s)}\left[-\exp\left(-V_w(s,a)\right)\right] - \mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[V_w(s,a)\right]\right] \quad (10)$$

Unlike RKL–VIM, we collect a fresh batch of data from both an interactive expert and the learner every iteration. We show that this estimator is far easier to approximate than the one in RKL–VIM (Appendix D).
Density ratio minimization via no-regret online learning. We first upper bound the action divergence with a density-ratio objective. Given a batch of data from an interactive expert and the learner, we invoke an off-the-shelf density ratio estimator (DRE) (Kanamori et al., 2012) to estimate the required ratio. Since the optimization is a non-i.i.d. learning problem, we solve it by dataset aggregation. Note this does not require invoking policy gradients. In fact, if we choose an expressive enough policy class, this method gives us a global performance guarantee which neither GAIL nor any f–VIM provides (Appendix E).
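As a stand-in for the DRE step (Kanamori et al. (2012) use kernel least squares, which we do not reproduce here), a simple empirical-frequency ratio illustrates the quantity being estimated for a discrete action space; the distributions below are hypothetical examples of ours.

```python
import numpy as np

# Hedged stand-in for density ratio estimation at a fixed state s:
# estimate r(a) = pi*(a|s) / pi(a|s) from samples of both policies.
rng = np.random.default_rng(0)
pi_star = np.array([0.50, 0.10, 0.40])  # hypothetical expert at s
pi      = np.array([0.25, 0.50, 0.25])  # hypothetical learner at s

expert_actions  = rng.choice(3, size=20000, p=pi_star)
learner_actions = rng.choice(3, size=20000, p=pi)

freq_star = np.bincount(expert_actions, minlength=3) / 20000
freq_pi   = np.bincount(learner_actions, minlength=3) / 20000
r_hat = freq_star / freq_pi
print(r_hat)  # approaches pi_star / pi = [2.0, 0.2, 1.6]
```

A real DRE avoids the poor sample complexity of separately estimating two densities in continuous spaces by fitting the ratio directly; the histogram version above only works for small discrete spaces.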
5 Multimodal Trajectory Demonstrations
We now examine multimodal expert demonstrations. Consider the demonstrations in Fig. 2, which avoid colliding with a tree by turning left or right with equal probability. Depending on the policy class $\Pi$, it may be impossible to achieve zero divergence for any choice of divergence (Fig. 1(a)), e.g. when $\Pi$ contains only Gaussians with fixed variance. The question then becomes: if the globally optimal policy in our policy class achieves nonzero divergence, how should we design our objective to fail elegantly and safely? In this example, one can imagine two reasonable choices: (1) replicate one of the modes very well (i.e. mode-collapsing) or (2) cover both the modes plus the region between them (i.e. mode-covering). We argue that in many imitation learning tasks the former behavior is preferable, as it produces trajectories similar to previously observed demonstrations.
Mode-covering in KL.
This divergence exhibits strong mode-covering tendencies, as in Fig. 1(c). Examining the definition of the KL divergence, we see that there is a significant penalty for failing to completely support the demonstration distribution, but no explicit penalty for generating outlier samples. In fact, if $\rho_{\pi^*}(x) > 0$ where $\rho_{\pi}(x) = 0$, the divergence is infinite. However, the opposite does not hold. Thus, the KL–VIM optimal policy in $\Pi$ belongs to the second behavior class, which causes the agent to frequently crash into the tree.

Mode-collapsing in RKL.
At the other end of the multimodal behavior spectrum lies the RKL divergence, which exhibits strong mode-seeking behavior as in Fig. 1(b), due to the expectation being taken over $\rho_{\pi}$ rather than $\rho_{\pi^*}$. Note there is no explicit penalty for failing to entirely cover $\rho_{\pi^*}$, but an arbitrarily large penalty for generating samples which are improbable under the demonstrator distribution. This results in always turning left or always turning right around the tree, depending on the initialization and mode mixture. For many tasks, failing in such a manner is predictable and safe, as we have already seen similar trajectories from the demonstrator.
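The contrast between the two behaviors can be reproduced with a one-dimensional stand-in of our own (not the paper's experiment): fit the mean of a fixed-variance Gaussian to a bimodal target by grid search under each objective.

```python
import numpy as np

# Illustrative numerics: fit the mean of a fixed-variance Gaussian to
# a bimodal "expert" distribution by exhaustively minimizing forward
# KL vs. reverse KL over a grid of candidate means.
x = np.linspace(-4, 4, 2001)
dx = x[1] - x[0]
def gauss(mu, sig):
    g = np.exp(-0.5 * ((x - mu) / sig) ** 2)
    return g / (g.sum() * dx)  # normalize on the grid

p = 0.5 * gauss(-1.0, 0.2) + 0.5 * gauss(+1.0, 0.2)  # two expert modes

mus = np.linspace(-2, 2, 201)
fkl = [np.sum(p * np.log(p / gauss(m, 0.2))) * dx for m in mus]
rkl = [np.sum(gauss(m, 0.2) * np.log(gauss(m, 0.2) / p)) * dx for m in mus]

mu_fkl = mus[np.argmin(fkl)]  # mode-covering: lands between the modes
mu_rkl = mus[np.argmin(rkl)]  # mode-seeking: collapses onto one mode
print(mu_fkl, abs(mu_rkl))    # ~0.0 and ~1.0
```

Forward KL is minimized by moment matching (the mean of the mixture, i.e. the "obstacle" at 0), while reverse KL pays an enormous price for placing mass where the expert has essentially none, so it collapses onto one of the modes.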
Jensen-Shannon.
This divergence may fall into either behavior class, depending on the MDP, the demonstrations, and the optimization initialization. Examining the definition, we see the divergence is symmetric and expectations are taken over both $\rho_{\pi^*}$ and $\rho_{\pi}$. Thus, if either distribution is unsupported somewhere (i.e. $\rho_{\pi^*}(x) > 0$ where $\rho_{\pi}(x) = 0$, or vice versa) the divergence remains finite. Later, we empirically show that although it is possible to achieve safe mode-collapse with JS on some tasks, this is not always the case.
6 Experiments
In this section, we empirically validate the following hypotheses:


H1: The globally optimal policy for RKL imitates a subset of the demonstrator modes, whereas JS and KL tend to interpolate between them.

H2: The sample-based estimator for KL and JS underestimates the divergence more than for RKL.

H3: The policy gradient optimization landscape for KL and JS with continuously parameterized policies is more susceptible to local minima, compared to RKL.
To test these hypotheses, we introduce two environments, Bandit and GridWorld, and some useful policy classes. (1) The bandit environment consists of a single state and three actions: A, M, and B. The expert is multimodal, i.e. it chooses A and B with equal probability as in Fig. 2(a). We choose a specific policy class $\Pi$ with three policies: one that always selects A, one that always selects B, and one that stochastically selects A, M, or B. Later, we also consider a continuously parameterized policy class (Appendix G) for use with policy gradient methods.
(2) The GridWorld environment is a grid world (Fig. 2(b)) with a start (S) and a terminal (T) state. Its center state is undesirable, and the demonstrator moves left or right at S to avoid the center (Fig. 2(c)). The environment has control noise and transition noise. Fig. 2(d) shows the resulting multimodal demonstrations. We specify a policy class $\Pi$ such that agents can move up, right, down, or left at each state. Later, we consider a continuously parameterized policy class (Appendix G) for use with policy gradients.
Policy enumeration
To test H1, we enumerate all policies in $\Pi$, exactly compute their stationary distributions, and select the policy with the smallest exact divergence. Note that this is guaranteed to produce the optimal policy in $\Pi$. Our results on the bandit and gridworld (Tables 3(a) and 3(b)) show that the globally optimal solution to the RKL objective successfully collapses to a single mode (e.g. A and Right, respectively), whereas KL and JS interpolate between the modes (i.e. M and Up, respectively).
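A miniature version of this enumeration can be sketched as follows; the expert and candidate distributions are illustrative stand-ins of ours, not the paper's numbers.

```python
import numpy as np

# Sketch of policy enumeration on a three-action bandit (A, M, B):
# the expert splits most of its mass across the two modes A and B.
def kl(p, q):  return float(np.sum(p * np.log(p / q)))
def rkl(p, q): return kl(q, p)

expert = np.array([0.45, 0.10, 0.45])
policies = {
    "peak_A": np.array([0.80, 0.15, 0.05]),
    "peak_B": np.array([0.05, 0.15, 0.80]),
    "middle": np.array([0.25, 0.50, 0.25]),  # interpolates between modes
}

best_kl  = min(policies, key=lambda k: kl(expert, policies[k]))
best_rkl = min(policies, key=lambda k: rkl(expert, policies[k]))
print(best_kl, best_rkl)  # KL covers ("middle"); RKL collapses to a peak
```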
Whether the optimal policy is mode-covering or mode-collapsing depends on the stochasticity in the policy, parameterized in the bandit case by the stochastic policy's mixing probability. Fig. 4 shows how the divergences and the resulting optimal policy change as a function of this parameter. Note that RKL strongly prefers mode-collapsing, KL strongly prefers mode-covering, and JS sits between the two other divergences.
Divergence estimation
To test H2, we compare the sample-based estimate of the divergence to its true value in Fig. 5. We highlight the preferred policies under each objective (in the 1st percentile of estimates). For the highlighted group, the estimate is often much lower than the true divergence for KL and JS, perhaps due to the sampling issue discussed in Appendix F.
Policy gradient optimization landscape
To test H3, we compare KL–VIM, RKL–VIM and JS–VIM, solving for a locally optimal policy using policy gradients. Tables 3(c) and 3(d) show that RKL–VIM empirically produces policies that collapse to a single mode, whereas JS–VIM and KL–VIM do not.
An interesting phenomenon is that the policies produced by RKL–VIM typically have lower JS divergence than the policies produced by JS–VIM, as shown in Fig. 6. This suggests that the RKL optimization landscape itself may be more amenable to imitation learning.




Acknowledgements
This work was (partially) funded by the National Institute of Health R01 (# R01EB019335), National Science Foundation CPS (#1544797), National Science Foundation NRI (#1637748), the Office of Naval Research, the RCTA, Amazon, and Honda Research Institute USA.
References

Abbeel and Ng [2004] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
 Argall et al. [2009] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.
 Babes et al. [2011] Monica Babes, Vukosi Marivate, Kaushik Subramanian, and Michael L Littman. Apprenticeship learning about multiple intentions. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 897–904, 2011.
 Bagnell [2015] J. Andrew (Drew) Bagnell. An invitation to imitation. Technical Report CMU-RI-TR-15-08, Carnegie Mellon University, Pittsburgh, PA, March 2015.
 Billard et al. [2016] Aude G Billard, Sylvain Calinon, and Rüdiger Dillmann. Learning from humans. In Springer handbook of robotics, pages 1995–2014. Springer, 2016.
 Blondé and Kalousis [2018] Lionel Blondé and Alexandros Kalousis. Sampleefficient imitation learning via generative adversarial nets. arXiv preprint arXiv:1809.02064, 2018.

Boularias et al. [2011] Abdeslam Boularias, Jens Kober, and Jan Peters. Relative entropy inverse reinforcement learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 182–189, 2011.
 Chen et al. [2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
 Cheng et al. [2018] ChingAn Cheng, Xinyan Yan, Nolan Wagener, and Byron Boots. Fast policy learning through imitation and reinforcement. arXiv preprint arXiv:1805.10413, 2018.
 Csiszár et al. [2004] Imre Csiszár, Paul C Shields, et al. Information theory and statistics: A tutorial. Foundations and Trends® in Communications and Information Theory, 1(4):417–528, 2004.
 Dimitrakakis and Rothkopf [2011] Christos Dimitrakakis and Constantin A Rothkopf. Bayesian multitask inverse reinforcement learning. In European Workshop on Reinforcement Learning, pages 273–284. Springer, 2011.
 Finn et al. [2016a] Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energybased models. arXiv preprint arXiv:1611.03852, 2016a.
 Finn et al. [2016b] Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58, 2016b.
 Fu et al. [2017] Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.
 Ghasemipour et al. [2018] Seyed Kamyar Seyed Ghasemipour, Shixiang Gu, and Richard Zemel. Understanding the relation between maximumentropy inverse reinforcement learning and behaviour cloning. Workshop ICLR, 2018.
 Goodfellow et al. [2014] Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.

Gupta et al. [2017] Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2616–2625, 2017.
 Hausman et al. [2017] Karol Hausman, Yevgen Chebotar, Stefan Schaal, Gaurav Sukhatme, and Joseph J Lim. Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. In Advances in Neural Information Processing Systems, pages 1235–1245, 2017.
 Ho and Ermon [2016] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.
 Ho et al. [2016] Jonathan Ho, Jayesh Gupta, and Stefano Ermon. Modelfree imitation learning with policy optimization. In International Conference on Machine Learning, pages 2760–2769, 2016.
 Kanamori et al. [2012] Takafumi Kanamori, Taiji Suzuki, and Masashi Sugiyama. Statistical analysis of kernelbased leastsquares densityratio estimation. Machine Learning, 86(3):335–367, 2012.
 Kim et al. [2013] Beomjoon Kim, Amirmassoud Farahmand, Joelle Pineau, and Doina Precup. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pages 2859–2867, 2013.

Laskey et al. [2016] Michael Laskey, Sam Staszak, Wesley Yu-Shu Hsieh, Jeffrey Mahler, Florian T Pokorny, Anca D Dragan, and Ken Goldberg. Shiv: Reducing supervisor burden in dagger using support vectors for efficient learning from demonstrations in high dimensional state spaces. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 462–469. IEEE, 2016.
 Laskey et al. [2017a] Michael Laskey, Caleb Chuck, Jonathan Lee, Jeffrey Mahler, Sanjay Krishnan, Kevin Jamieson, Anca Dragan, and Ken Goldberg. Comparing human-centric and robot-centric sampling for robot deep learning from demonstrations. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 358–365. IEEE, 2017a.
 Laskey et al. [2017b] Michael Laskey, Jonathan Lee, Wesley Hsieh, Richard Liaw, Jeffrey Mahler, Roy Fox, and Ken Goldberg. Iterative noise injection for scalable imitation learning. arXiv preprint arXiv:1703.09327, 2017b.
 Li et al. [2017] Yunzhu Li, Jiaming Song, and Stefano Ermon. Infogail: Interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems, pages 3812–3822, 2017.
 Liese and Vajda [2006] Friedrich Liese and Igor Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394–4412, 2006.
 Nguyen et al. [2010] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
 Nowozin et al. [2016] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. fgan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
 Osa et al. [2018] Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J. Andrew Bagnell, Pieter Abbeel, and Jan Peters. An algorithmic perspective on imitation learning. Foundations and Trends® in Robotics, 7(1-2):1–179, 2018. ISSN 1935-8253. doi: 10.1561/2300000053. URL http://dx.doi.org/10.1561/2300000053.
 Peng et al. [2018a] Xue Bin Peng, Angjoo Kanazawa, Jitendra Malik, Pieter Abbeel, and Sergey Levine. Sfv: Reinforcement learning of physical skills from videos. In SIGGRAPH Asia 2018 Technical Papers, page 178. ACM, 2018a.
 Peng et al. [2018b] Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, and Sergey Levine. Variational discriminator bottleneck: Improving imitation learning, inverse rl, and gans by constraining information flow. arXiv preprint arXiv:1810.00821, 2018b.
 Piot et al. [2017] Bilal Piot, Matthieu Geist, and Olivier Pietquin. Bridging the gap between imitation learning and inverse reinforcement learning. IEEE transactions on neural networks and learning systems, 28(8):1814–1826, 2017.
 Pomerleau [1988] Dean Pomerleau. ALVINN: an autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems 1, [NIPS Conference, Denver, Colorado, USA, 1988], pages 305–313, 1988. URL http://papers.nips.cc/paper/95-alvinn-an-autonomous-land-vehicle-in-a-neural-network.
 Pomerleau [1989] Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In Advances in neural information processing systems, pages 305–313, 1989.
 Qureshi and Yip [2018] Ahmed H Qureshi and Michael C Yip. Adversarial imitation via variational inverse reinforcement learning. arXiv preprint arXiv:1809.06404, 2018.
 Rajeswaran et al. [2017] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017.
 Ratliff et al. [2006] Nathan D Ratliff, J Andrew Bagnell, and Martin A Zinkevich. Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, pages 729–736. ACM, 2006.
 Ratliff et al. [2009] Nathan D Ratliff, David Silver, and J Andrew Bagnell. Learning to search: Functional gradient techniques for imitation learning. Autonomous Robots, 27(1):25–53, 2009.
 Rhinehart et al. [2018] Nicholas Rhinehart, Kris M. Kitani, and Paul Vernaza. R2p2: A reparameterized pushforward policy for diverse, precise generative path forecasting. In The European Conference on Computer Vision (ECCV), September 2018.
 Ross and Bagnell [2014] Stephane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive noregret learning. arXiv preprint arXiv:1406.5979, 2014.
 Ross et al. [2011] Stéphane Ross, Geoffrey J Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to noregret online learning. In AISTATS, 2011.
 Ross et al. [2013] Stéphane Ross, Narek MelikBarkhudarov, Kumar Shaurya Shankar, Andreas Wendel, Debadeepta Dey, J Andrew Bagnell, and Martial Hebert. Learning monocular reactive uav control in cluttered natural environments. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 1765–1772. IEEE, 2013.
 Sun et al. [2017] Wen Sun, Arun Venkatraman, Geoffrey J Gordon, Byron Boots, and J Andrew Bagnell. Deeply aggrevated: Differentiable imitation learning for sequential prediction. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 3309–3318. JMLR. org, 2017.
 Sun et al. [2018] Wen Sun, J Andrew Bagnell, and Byron Boots. Truncated horizon policy search: Combining reinforcement learning & imitation learning. arXiv preprint arXiv:1805.11240, 2018.
 Syed and Schapire [2008] Umar Syed and Robert E Schapire. A gametheoretic approach to apprenticeship learning. In Advances in neural information processing systems, pages 1449–1456, 2008.
 Torabi et al. [2018a] Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954, 2018a.
 Torabi et al. [2018b] Faraz Torabi, Garrett Warnell, and Peter Stone. Generative adversarial imitation from observation. arXiv preprint arXiv:1807.06158, 2018b.
 Wulfmeier et al. [2015] Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. Maximum entropy deep inverse reinforcement learning. arXiv preprint arXiv:1507.04888, 2015.
 Ziebart et al. [2008] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
Appendix A Lower Bounding Divergence of Trajectory Distribution with State Action Distribution
We begin with a lemma that relates a component-wise divergence of two vectors to the divergence of their sums.
Lemma A.1 (Generalized log sum inequality).
Let $a_1, \dots, a_n$ and $b_1, \dots, b_n$ be nonnegative numbers. Let $a = \sum_{i=1}^n a_i$ and $b = \sum_{i=1}^n b_i$. Let $f$ be a convex function. We have the following:

$$\sum_{i=1}^{n} b_i f\left(\frac{a_i}{b_i}\right) \geq b f\left(\frac{a}{b}\right) \quad (11)$$

Proof.

$$\sum_{i=1}^{n} b_i f\left(\frac{a_i}{b_i}\right) = b \sum_{i=1}^{n} \frac{b_i}{b} f\left(\frac{a_i}{b_i}\right) \geq b f\left(\sum_{i=1}^{n} \frac{b_i}{b} \frac{a_i}{b_i}\right) = b f\left(\frac{a}{b}\right) \quad (12)$$

where (12) is due to Jensen's inequality, since $f$ is convex, $\frac{b_i}{b} \geq 0$ and $\sum_i \frac{b_i}{b} = 1$. ∎
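A quick numerical check of Lemma A.1 on random nonnegative vectors:

```python
import numpy as np

# Numerical check of the generalized log sum inequality (Lemma A.1):
# sum_i b_i f(a_i / b_i) >= b f(a / b) for convex f and nonnegative a, b.
rng = np.random.default_rng(0)
f = lambda u: u * np.log(u)  # convex on (0, inf)
for _ in range(100):
    a = rng.uniform(0.1, 1.0, size=5)
    b = rng.uniform(0.1, 1.0, size=5)
    lhs = np.sum(b * f(a / b))
    rhs = b.sum() * f(a.sum() / b.sum())
    assert lhs >= rhs - 1e-12
print("inequality holds on all samples")
```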
We use Lemma A.1 to prove a more general lemma that relates the f-divergence defined over two spaces, where one of the spaces is rich enough in information to explain away the other.
Lemma A.2 (Information loss).
Let $X$ and $Y$ be two random variables with a joint probability distribution and corresponding marginal distributions. Assume that $Y$ can explain away $X$. This is expressed as follows – given any two joint probability distributions $p(x,y)$ and $q(x,y)$, assume the following equality holds for all $x, y$:

$$p(x|y) = q(x|y) \quad (13)$$

Under these conditions, the following inequality holds:

$$D_f\left(p(y), q(y)\right) \geq D_f\left(p(x), q(x)\right) \quad (14)$$
Proof.

$$D_f\left(p(x), q(x)\right) = \sum_{x} q(x) f\left(\frac{p(x)}{q(x)}\right) \quad (15)$$

$$= \sum_{x} \left(\sum_{y} q(x, y)\right) f\left(\frac{\sum_{y} p(x, y)}{\sum_{y} q(x, y)}\right) \quad (16)$$

$$\leq \sum_{x} \sum_{y} q(x, y) f\left(\frac{p(x, y)}{q(x, y)}\right) \quad (17)$$

$$= \sum_{x} \sum_{y} q(x|y) q(y) f\left(\frac{p(x|y) p(y)}{q(x|y) q(y)}\right) \quad (18)$$

$$= \sum_{y} q(y) f\left(\frac{p(y)}{q(y)}\right) \sum_{x} q(x|y) \quad (19)$$

$$= D_f\left(p(y), q(y)\right) \quad (20)$$

where (17) follows from Lemma A.1 and (19) uses assumption (13), i.e. $p(x|y) = q(x|y)$. ∎
Proof of Theorem 3.1.
Let the random variable $Y$ belong to the space of trajectories, $Y = \tau$. Let the random variable $X$ belong to the space of state-action pairs, $X = (s, a)$. Note that for any joint distributions $p(x, y)$ and $q(x, y)$ induced by policies, the following is true:

$$p(x|y) = q(x|y) \quad (21)$$

This is because a trajectory contains all information about its state-action pairs, i.e. the conditional is determined by the occurrences of $(s, a)$ in $\tau$. Upon applying Lemma A.2 we have the inequality

$$D_f\left(\rho_{\pi}(\tau), \rho_{\tilde{\pi}}(\tau)\right) \geq D_f\left(\rho_{\pi}(s, a), \rho_{\tilde{\pi}}(s, a)\right) \quad (22)$$

∎
The bound is reasonable, as it merely states that information is lost when temporal structure is discarded. Note that the theorem also extends to average state distributions, i.e.

Corollary A.1.
The divergence between trajectory distributions is lower bounded by the divergence between average state distributions.
How tight is this lower bound? We examine the gap below.
Corollary A.2.
The gap between the two divergences is
Proof.
where we use . ∎
Let $\tau \ni (s,a)$ denote the set of trajectories that contain the pair $(s, a)$. The gap is the conditional f-divergence over trajectories given $(s, a)$, scaled by the probability of $(s, a)$. The gap comes from whether we treat such trajectories as separate events (in the case of trajectory distributions) or as the same event (in the case of state-action distributions).
Appendix B Relating Divergence of Trajectory Distribution with Expected Action Distribution
In this section, we explore the relation between divergences of the induced trajectory distributions and the induced action distributions. We begin with a general lemma.
Lemma B.1.
Given a policy $\pi$ and a general feature function $\phi(s, a)$, the expected feature counts along induced trajectories equal ($T$ times) the expected feature counts under the induced average state-action distribution:

$$\mathbb{E}_{\tau \sim \rho_\pi}\left[\sum_{t=1}^{T} \phi(s_t, a_t)\right] = T \, \mathbb{E}_{(s,a) \sim \rho_\pi}\left[\phi(s, a)\right] \quad (23)$$
Proof.
We can use this lemma to get several useful equalities, such as the average state visitation frequency.
Theorem B.1.
Given a policy $\pi$, if we do a tally count of states visited by induced trajectories, we recover the average state visitation frequency:

$$\mathbb{E}_{\tau \sim \rho_\pi}\left[\sum_{t=1}^{T} \mathbb{1}(s_t = s)\right] = T \, \rho_\pi(s) \quad (30)$$

Proof.
Apply Lemma B.1 with $\phi(s_t, a_t) = \mathbb{1}(s_t = s)$. ∎
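Lemma B.1 can be checked by exact enumeration on a tiny MDP; the MDP below is an arbitrary example of ours, small enough that all trajectories can be listed.

```python
import numpy as np
from itertools import product

# Exact check of Lemma B.1 on a two-state, two-action MDP with
# horizon T = 2: expected feature counts along trajectories equal
# T times the expectation under the average state-action distribution.
T = 2
rho0 = np.array([0.6, 0.4])                 # initial state distribution
pi = np.array([[0.7, 0.3], [0.2, 0.8]])     # pi[s, a]
P = np.array([[[0.9, 0.1], [0.4, 0.6]],     # P[s, a, s']
              [[0.3, 0.7], [0.5, 0.5]]])
phi = np.array([[1.0, 2.0], [3.0, 4.0]])    # arbitrary feature phi[s, a]

lhs = 0.0
rho_sa = np.zeros((2, 2))                   # average state-action dist.
for s1, a1, s2, a2 in product(range(2), repeat=4):
    pr = rho0[s1] * pi[s1, a1] * P[s1, a1, s2] * pi[s2, a2]
    lhs += pr * (phi[s1, a1] + phi[s2, a2])  # feature count along tau
    rho_sa[s1, a1] += pr / T
    rho_sa[s2, a2] += pr / T

rhs = T * np.sum(rho_sa * phi)
print(abs(lhs - rhs) < 1e-12)  # True: the two expectations coincide
```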
Unfortunately, Lemma B.1 does not hold for divergences in general. But we can analyze a subclass of divergences that satisfy the following triangle inequality:

$$D_f(p, q) \leq D_f(p, r) + D_f(r, q) \quad (31)$$

Examples of such divergences are the Total Variation distance and the Squared Hellinger distance.
We now show that for such divergences (which are in fact distances), we can upper bound the trajectory divergence. Contrast this with the lower bound discussed in Appendix A. The upper bound is attractive because the trajectory divergence is the term that we actually care about bounding.
Also note the implications of the upper bound – we now need expert labels on states collected by the learner $\pi$. Hence we need an interactive expert that we can query from arbitrary states.
Theorem B.2 (Upper bound).
Given two policies $\pi$ and $\tilde{\pi}$, and an f-divergence that satisfies the triangle inequality, the divergence between the trajectory distributions is upper bounded by the expected divergence between the action distributions on states induced by $\pi$.
Proof of Theorem B.2.
We introduce some notation to aid in explaining the proof. Let a trajectory be

$$\tau = (s_1, a_1, s_2, a_2, \dots, s_T, a_T) \quad (32)$$

Recall that the probability of a trajectory induced by policy $\pi$ is

$$\rho_\pi(\tau) = \rho_0(s_1) \prod_{t=1}^{T} \pi(a_t | s_t) \mathcal{P}(s_{t+1} | s_t, a_t) \quad (33)$$

We also introduce a nonstationary policy $\pi_t$ that executes $\pi$ for the first $t$ steps and $\tilde{\pi}$ thereafter. Hence, the probability of a trajectory induced by $\pi_t$ is

$$\rho_{\pi_t}(\tau) = \rho_0(s_1) \prod_{k=1}^{t} \pi(a_k | s_k) \mathcal{P}(s_{k+1} | s_k, a_k) \prod_{k=t+1}^{T} \tilde{\pi}(a_k | s_k) \mathcal{P}(s_{k+1} | s_k, a_k) \quad (34)$$
Let us consider the divergence between the distributions $\rho_\pi(\tau)$ and $\rho_{\tilde{\pi}}(\tau)$ and apply the triangle inequality (31) with respect to the intermediate distributions $\rho_{\pi_t}(\tau)$:
(35)  
(36)  
(37)  
(38)  
(39)  