The Past and Present of Imitation Learning: A Citation Chain Study

01/08/2020
by   Nishanth Kumar, et al.
Brown University

Imitation Learning is a promising area of active research. Over the last 30 years, Imitation Learning has advanced significantly and been used to solve difficult tasks ranging from Autonomous Driving to playing Atari games. In the course of this development, different methods for performing Imitation Learning have fallen into and out of favor. In this paper, I explore the development of these different methods and attempt to examine how the field has progressed. I focus my analysis on surveying 4 landmark papers that sequentially build upon each other to develop increasingly impressive Imitation Learning methods.


I Introduction

Imitation Learning is a promising area of active research. Early research in 'programming by example' began in Software Development [8] before attracting the interest of Robotics and Artificial Intelligence (AI) researchers, who began using the terms 'Learning from Demonstration' and 'Imitation Learning' to describe their line of work. Over the last 30 years, Imitation Learning has advanced significantly and has been used to solve difficult tasks ranging from Autonomous Driving [11] to playing Atari games [4]. In the course of this development, different methods for performing Imitation Learning have fallen into and out of favor. In this paper, I will explore the development of these different methods and examine how the field has progressed.

I will be discussing 4 landmark papers that sequentially cite and inform each other. In discussing these papers, I will focus on their ’big ideas’ and how each idea influenced those that came after it. In order of their publication date, these are:

  1. 'ALVINN: An Autonomous Land Vehicle in a Neural Network' by Pomerleau [11]

  2. 'Apprenticeship Learning via Inverse Reinforcement Learning' by Abbeel and Ng [1]

  3. 'Generative Adversarial Imitation Learning' by Ho and Ermon [9]

  4. 'A Divergence Minimization Perspective on Imitation Learning Methods' by Ghasemipour, Zemel and Gu [6].

Before discussing the papers themselves, I will provide some essential context by describing the fundamental problem that Imitation Learning attempts to solve.

II Background: The Problem of Imitation Learning

(Material in this section is adapted from Levine [10].)

At a high level, ’Imitation Learning’ attempts to learn a general skill after observing some demonstrations of an expert performing the skill. Crucially, this learned skill is expected to generalize to situations where the learner has not observed the expert’s actions.

More formally, let $s_t$ denote the set of observations obtained from sensors at time $t$ that constitute the state. Let $a_t$ denote the set of actions (e.g. voltages sent to the motors of a robot) taken by the expert at time $t$. Let $\tau = \{(s_t, a_t)\}_{t=1}^{T}$ denote a trajectory, i.e. a sequence of state, expert-action pairs. Imitation Learning assumes it is given a set of $N$ such trajectories as input. The expected output is a policy $\pi_\theta(a_t \mid s_t)$ (where $\theta$ represents a set of policy parameters) that maps states to actions. Specifically, the policy is a probability distribution over the actions to be taken, conditioned on the state observed at each time step. Crucially, Imitation Learning hopes that the learned policy's distribution of actions will match the expert's distribution of actions (i.e., the distribution that the $a_t$ are drawn from).
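In symbols, the setup can be summarized as follows (a compact restatement of the description above; the symbol $\pi_{\mathrm{exp}}$ for the expert's unknown policy is introduced here for convenience):

\[
\mathcal{D} = \{\tau_1, \ldots, \tau_N\}, \qquad \tau_i = \big((s_1, a_1), \ldots, (s_T, a_T)\big) \sim \pi_{\mathrm{exp}},
\]
\[
\text{goal: find } \theta \text{ such that } \pi_\theta(a \mid s) \approx \pi_{\mathrm{exp}}(a \mid s).
\]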

III Pomerleau's Autonomous Driving Paper

Given the substantial research into and success of Machine Learning methods, the most natural thing to do might be to frame the Imitation Learning problem as a 'Supervised Learning' problem (for more information on Supervised Learning, refer to Wikipedia's comprehensive and well-written article [13]). This can be done rather simply: use any Supervised Learning algorithm to learn a function mapping states ($s_t$) to actions ($a_t$), training on the collected trajectories of observed states and expert actions $\{(s_t, a_t)\}$. This is the essence of Pomerleau's approach to using Imitation Learning for Autonomous Driving.
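As a minimal illustration of this framing, the sketch below performs generic behavior cloning with scikit-learn. It is not Pomerleau's setup; the array shapes, model choice, and hyperparameters are assumptions made for the example.

```python
# Generic behavior-cloning sketch: supervised regression from states to expert actions.
import numpy as np
from sklearn.neural_network import MLPRegressor

def behavior_clone(states: np.ndarray, actions: np.ndarray) -> MLPRegressor:
    """Fit a policy on (state, action) pairs gathered from expert trajectories.

    states:  array of shape (N, state_dim)
    actions: array of shape (N, action_dim)
    """
    policy = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
    policy.fit(states, actions)  # ordinary supervised learning on the demonstrations
    return policy

# At test time, the learned policy simply predicts an action for the current state:
# action = policy.predict(current_state.reshape(1, -1))
```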

Fig. 1: An image from Pomerleau’s paper depicting his neural network.

As can be seen from Fig. 1, Pomerleau used a Deep Neural Network (DNN) to perform his Supervised Learning of the expert actions. The network took two inputs: video and data from a laser range finder mounted on the car. It used these to produce a 45-unit vector representing the curvature the vehicle would need to travel along to reach the center of the road. It also produced an output indicating how strongly the road visually contrasted with its surroundings; this output was fed back into the network as an input to improve its prediction accuracy.
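To make the architecture concrete, here is a rough PyTorch sketch of an ALVINN-style network. This is not Pomerleau's original implementation (which predates modern frameworks); the input resolutions and hidden-layer width are assumptions chosen for illustration, while the 45-unit steering output and the fed-back road-intensity output follow the description above.

```python
import torch
import torch.nn as nn

class AlvinnStyleNet(nn.Module):
    def __init__(self, video_shape=(30, 32), range_shape=(8, 32), hidden=29):
        super().__init__()
        # Flattened video retina + flattened range-finder retina + previous intensity feedback.
        in_dim = video_shape[0] * video_shape[1] + range_shape[0] * range_shape[1] + 1
        self.hidden_layer = nn.Linear(in_dim, hidden)
        self.steer_head = nn.Linear(hidden, 45)      # 45 units encoding road curvature
        self.intensity_head = nn.Linear(hidden, 1)   # road-vs-surroundings contrast, fed back as input

    def forward(self, video, rangefinder, prev_intensity):
        x = torch.cat([video.flatten(1), rangefinder.flatten(1), prev_intensity], dim=1)
        h = torch.sigmoid(self.hidden_layer(x))
        steering = torch.softmax(self.steer_head(h), dim=1)
        intensity = torch.sigmoid(self.intensity_head(h))
        return steering, intensity
```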

Unlike many contemporary DNNs, Pomerleau did not use a real-world dataset to train his network because it was infeasible for him to collect, store and process the necessary amount of data. Instead, he created a 'road simulator' that would produce realistic images of roads with added noise and varying lighting conditions. Pomerleau's simulator also generated laser range-finder input corresponding to his images. Finally, the simulator would produce 'expert' actions based on knowing the ground-truth curvature of the road it produced.

Pomerleau's approach exhibited rather impressive performance. His trained DNN was able to drive CMU's NAVLAB (a modified Chevy van equipped with sensors and computers) at a speed of roughly half a meter per second along a 400-meter stretch of CMU's campus under sunny conditions. This performance was comparable to that achieved by the state-of-the-art hand-engineered vision and navigation algorithms of the time. Given that this happened in 1989, a time when most computers had only a few megabytes of RAM and DNNs were widely believed to be useless, Pomerleau's results are truly remarkable.

Even though Pomerleau only tested his algorithm on one small environment in optimal weather conditions, his results are still noteworthy. After having developed his road simulator, Pomerleau was able to accomplish in a matter of minutes of training time what took CMU's Vision and Navigation groups months of hand-tuning their algorithms. This work was one of Imitation Learning's first major successes and it seems to have catalyzed a strong research interest in the field [2] [3] [1].

IV Abbeel's Inverse Reinforcement Learning Paper

Inspired by the success of Pomerleau [11] and others in using Imitation Learning to solve complex problems, this paper sought to view the Imitation Learning Problem through a different lens. While almost all previous work had framed Imitation Learning as a Supervised Learning problem, this paper leveraged Reinforcement Learning (RL).

Roughly, RL attempts to learn a policy that maps the state an agent is in ($s_t$) to the action to be taken from that state ($a_t$) at every time step. Crucially, the policy is learned such that the actions it takes maximize some reward function $r(s, a)$. At the time this paper was written, a number of RL algorithms already existed to learn such a policy (for a more thorough treatment of RL, refer to the introduction and initial chapters of Sutton and Barto [14]).
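Written out, the RL objective is (standard notation, not quoted from the paper):

\[
\pi^{\ast} = \arg\max_{\pi_\theta} \; \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=1}^{T} r(s_t, a_t)\Big].
\]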

It is important to note that RL algorithms require a reward function ($r$) in order to produce the policy $\pi_\theta$. Abbeel and Ng's key idea in this paper is that such an $r$ can be derived from the expert trajectories that Imitation Learning assumes as input. This is what they call 'Inverse Reinforcement Learning' (IRL). Once this reward function ($r$) is obtained, one can use any RL algorithm to obtain a policy that maximizes it. If we assume that the expert was attempting to maximize $r$, then the policy returned by the RL algorithm will (roughly) match the expert's policy. In this way, the Imitation Learning problem can be solved by first using IRL to derive a reward function and then using RL to obtain a policy.

Fig. 2: A visual illustration of how Imitation Learning can be performed by the combination of RL and IRL, obtained from [12]. Since RL uses a reward function as input and outputs optimal behavior, the method that takes an expert’s optimal behavior as input and outputs a reward function is termed ’Inverse’ RL.

Unfortunately, Abbeel's IRL method cannot take only the expert trajectories as input. It also requires a set of features over states that it can use to learn the reward function. In the case of driving a car, these features might be specific quantities we want the car to attend to, such as whether it has just collided with another car or whether it is in the middle of a lane. Thus, Abbeel's method requires this extra human-specified input in addition to the expert trajectories that Pomerleau [11]'s method operates on directly.
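In Abbeel and Ng's formulation (paraphrased here in standard notation), the reward is assumed to be a linear combination of these hand-specified state features $\phi(s)$, and the algorithm seeks a policy whose expected discounted feature counts match the expert's:

\[
r(s) = w^{\top}\phi(s), \qquad \mu(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\,\phi(s_t)\Big], \qquad \text{find } \pi_\theta \text{ with } \big\lVert \mu(\pi_\theta) - \mu(\pi_{\mathrm{exp}}) \big\rVert \leq \epsilon.
\]

If the feature expectations match closely, the learned policy performs nearly as well as the expert under any reward that is linear in those features.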

Abbeel is able to use his method to train a car to perform various complicated behaviors in simulation. He demonstrates 5 different learned driving styles, ranging from simply avoiding all other cars on the road, to driving only within the right lane and going off-road to overtake other cars, to intentionally crashing into the first car detected.

Overall, this paper introduced a different way of viewing the Imitation Learning problem than that used by Pomerleau [11] and most others before it. It is important to note that the paper did not claim that using IRL and RL to perform Imitation Learning is necessarily better than using Supervised Learning; it is simply different. One aspect of Abbeel's method that was appealing to many in the community is that it learns an explicit reward function. By inspecting this function, one can determine the quantities the agent is attempting to optimize and thus gain some understanding of why the policy takes the actions it takes. In this manner, Abbeel's method is more human-understandable and explainable than the Supervised Learning approaches that came before it.

V Ho’s GAIL Paper

In the years after the publication of Abbeel and Ng [1]'s IRL paper, significant research interest developed in using IRL methods to learn complex, real-world skills. It was discovered that IRL methods are less error-prone than Supervised Learning methods for Imitation Learning. Since Supervised Learning methods are trained to mimic the decisions the expert made at each time step, small deviations made early in a trajectory can lead to compounding errors later in the trajectory. IRL methods, on the other hand, learn the expert's reward function and are thus able to correct small errors simply by optimizing for the maximum reward. Intuitively, IRL methods have an explicit notion of the expert's goal and can optimize for it, while direct Supervised Learning methods are simply attempting to copy what the expert did and generalize slightly [9].

However, IRL methods were found to be slow and inefficient with data. This is because they need to iteratively estimate the expert's reward function and run an RL algorithm to convergence. RL algorithms are notorious for requiring a large number of environment steps to converge, which is especially crippling when attempting to learn complex behaviors in large, high-dimensional environments. Ho's GAIL paper attempts to remedy precisely this efficiency issue.

Ho's paper is riddled with mathematical intricacies, but at a high level the key insight is this: the process of performing IRL and then RL implicitly seeks to produce a policy ($\pi_\theta$) whose distribution of state-action pairs is similar to the state-action pair distribution of the expert policy ($\pi_{\mathrm{exp}}$). This kind of distribution matching can instead be performed with a Generative Adversarial Network (GAN) [7], a neural network architecture built specifically to learn to match an arbitrary distribution. Performing the distribution matching with a GAN instead of with the combination of IRL and RL is significantly more data-efficient because it does not require running an RL algorithm to convergence during each training iteration. Using a GAN in this way to more efficiently achieve the effect of IRL followed by RL is what the authors call 'Generative Adversarial Imitation Learning' (GAIL). (Specifically, GAIL can be used to learn the same policy that Maximum Entropy IRL [15] followed by RL would learn.)
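As a rough illustration of the adversarial setup, GAIL trains a discriminator $D(s, a)$ to distinguish the learned policy's state-action pairs from the expert's, optimizing $\min_{\pi}\max_{D}\; \mathbb{E}_{\pi}[\log D(s, a)] + \mathbb{E}_{\pi_{\mathrm{exp}}}[\log(1 - D(s, a))] - \lambda H(\pi)$, and then uses the discriminator's output as a surrogate reward for a policy-gradient update (TRPO in the paper). The PyTorch sketch below shows only the discriminator step; it is illustrative rather than the authors' implementation, and the network sizes are assumptions.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states, actions):
        # D(s, a): probability that the pair came from the learned policy.
        return torch.sigmoid(self.net(torch.cat([states, actions], dim=1)))

def discriminator_step(disc, optimizer, expert_s, expert_a, policy_s, policy_a):
    # Maximize E_pi[log D] + E_exp[log(1 - D)], i.e. minimize its negation.
    d_policy = disc(policy_s, policy_a)
    d_expert = disc(expert_s, expert_a)
    loss = -(torch.log(d_policy + 1e-8).mean() + torch.log(1.0 - d_expert + 1e-8).mean())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```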

Fig. 3: An image from a presentation by one of the authors (slides available at http://efrosgans.eecs.berkeley.edu/CVPR18_slides/GAIL_by_Ermon.pdf). GAIL was able to learn to make a simulated humanoid and a simulated ant walk from only a small number of expert state-action pairs. Each of these bodies has many joints that the policy must learn to control individually to generate the desired behavior.

The paper experimentally demonstrates that GAIL is much more efficient with respect to expert data than 'Behavior Cloning' methods on a number of high-dimensional tasks in simulation. 'Behavior Cloning' methods are nothing but the direct Supervised Learning methods for Imitation Learning used by Pomerleau [11] and discussed in Section III. Furthermore, GAIL is compared against two state-of-the-art IRL algorithms, including a version of the algorithm from Abbeel and Ng [1], and is shown to achieve superior performance given significantly fewer expert trajectories.

In summary, this paper introduced a novel method called GAIL that can induce the same policy that IRL methods would, but in a more data-efficient manner. Importantly, even though GAIL uses a DNN to learn its policy, it still requires access to the environment just as IRL methods do. This contrasts with the Supervised Learning methods (such as Pomerleau's) that can be trained on the expert demonstrations alone, without needing access to the environment. Furthermore, GAIL is only more efficient than IRL methods in terms of expert demonstration data; the authors note that it is not very efficient in terms of the number of interactions required with the environment.

VI Ghasemipour's Divergence Minimization Perspective Paper

Ho and Ermon [9] experimentally demonstrated that GAIL is more data-efficient than direct Supervised Learning (or Behavior Cloning). However, they did not provide any strong theoretical justification for why GAIL is able to do this. In fact, as mentioned at the end of Section IV, IRL methods are simply different than direct Supervised Learning methods. Aside from some high-level intuitions, it was theoretically unclear why they might offer more robust performance. Ghasemipour et al. [6]’s recent paper sets out to understand the differences between the various approaches to Imitation Learning at a theoretical level to build a unifying perspective amongst them and hopefully use this perspective to develop novel methods.

Section V introduced Ho and Ermon [9]'s insight that IRL methods are really just attempting to learn a policy whose state-action distribution matches the state-action distribution of the expert's policy. Ghasemipour builds on this by showing that Imitation Learning methods can be viewed as minimizing some measure of divergence between the expert's state-action distribution and the learned policy's state-action distribution. Let $\rho^{\mathrm{exp}}(s, a)$ denote the distribution of states and actions encountered when following the expert's policy, and let $\rho^{\pi}(s, a)$ denote the distribution of states and actions encountered when following the learned policy. Ghasemipour specifically shows that GAIL and related methods seek to minimize a divergence between the joint distributions $\rho^{\pi}(s, a)$ and $\rho^{\mathrm{exp}}(s, a)$, whereas direct Supervised Learning methods seek to minimize a divergence between the conditional action distributions $\pi_\theta(a \mid s)$ and $\pi_{\mathrm{exp}}(a \mid s)$ under the states the expert visits. He experimentally demonstrates that it is precisely this difference - matching the expert's distribution of actions conditioned on states versus matching the expert's joint distribution of states and actions - that makes GAIL perform better than direct Supervised Learning methods on complex tasks in high-dimensional state spaces.
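In symbols (paraphrasing the paper's framing; the particular divergences shown, Jensen-Shannon for GAIL and a KL term for Behavior Cloning, are the standard instantiations):

\[
\text{GAIL-style methods:} \quad \min_{\theta}\; D_{\mathrm{JS}}\!\big(\rho^{\pi_\theta}(s, a)\,\big\|\,\rho^{\mathrm{exp}}(s, a)\big),
\]
\[
\text{Behavior Cloning:} \quad \min_{\theta}\; \mathbb{E}_{s \sim \rho^{\mathrm{exp}}}\Big[D_{\mathrm{KL}}\!\big(\pi_{\mathrm{exp}}(a \mid s)\,\big\|\,\pi_\theta(a \mid s)\big)\Big].
\]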

Additionally, given the insight that Imitation Learning methods just seek to match state-action distributions, Ghasemipour raises the question of whether we even need to infer these distributions from expert demonstrations. Instead, he introduces a variant of a state-of-the-art IRL method (AIRL, Fu et al. [5]) that he calls FAIRL, and uses it to learn tasks whose target distributions are specified directly: instead of providing expert demonstrations, Ghasemipour hand-specifies the distribution of states and actions he wants the learned policy to follow.

Fig. 4: An image from Ghasemipour's paper visualizing experiments in which a policy is learned to match a hand-specified distribution. Subfigures (a) through (d) show the hand-specified target distributions for movement tasks alongside the corresponding distributions induced by the learned policies. The images on the far left show the simulation environments for the 'Fetch' (top) and 'Pusher' (bottom) tasks respectively.

To summarize, this paper presented a theoretical justification for why recent IRL-based methods (such as GAIL) achieve higher performance on complex Imitation Learning tasks when trained on far fewer expert trajectories than direct Supervised Learning methods. After experimentally verifying their theoretical claim, the authors introduce a new way to perform Imitation Learning without expert trajectories by directly learning to match a hand-specified state-action distribution. In domains where such distributions are easy to hand-specify (such as simple pushing or movement domains), the authors' method is potentially easier than performing Imitation Learning with expert demonstrations. However, it is unlikely that such hand-specification is easy, or even possible, in all domains of interest.

VII Conclusion and Future Outlook

There has been much progress in Imitation Learning methods over the past 30 years. Such methods have gone from learning simple, specific tasks under ideal conditions with hand-engineered features (such as Pomerleau [11]'s early success with steering a car) to learning complex tasks in high-dimensional simulation environments (such as block-pushing or teaching an ant to walk) from very few expert demonstrations. Furthermore, we as a community have developed a wide array of methods - ranging from direct Supervised Learning to approaches built on Reinforcement Learning - to solve the Imitation Learning problem. These different methods have demonstrated different advantages and disadvantages, and we have recently developed a theoretical understanding of why such differences exist.

Despite all this progress, many open questions remain to be answered and much work remains to be done. Recent work has shown impressive experimental results in simulation domains; however, it remains to be seen whether such methods will transfer well to complex tasks in the real world. This might be especially difficult because, as noted by Ho and Ermon [9], GAIL and related state-of-the-art methods are not very efficient in terms of the environment interaction needed to learn new skills. Furthermore, as Ghasemipour et al. [6] point out, expert demonstrations may not be the easiest way to induce policies for certain tasks. Thus, the features of a task that make it amenable to solving via Imitation Learning with expert demonstrations should be studied. Additionally, novel ways of providing state-action distributions for a learner to match (for example, via language commands) can be explored. Finally, while Ghasemipour et al. [6]'s work helped deepen our theoretical understanding of Imitation Learning methods, much more remains to be understood - for example, the exact effect of the chosen divergence measure (KL versus Jensen-Shannon, say) on performance - which could in turn help develop novel methods that solve increasingly complex tasks in real-world settings.

References

  • [1] P. Abbeel and A. Y. Ng (2004) Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML ’04, New York, NY, USA, pp. 1–. External Links: ISBN 1-58113-838-5, Link, Document Cited by: item 2, §III, §V, §V.
  • [2] A. A. Baloch and A. M. Waxman (1991) Visual learning, adaptive expectations, and behavioral conditioning of the mobile robot mavin. Neural Networks 4 (3), pp. 271 – 302. External Links: ISSN 0893-6080, Document, Link Cited by: §III.
  • [3] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba (2016) End to end learning for self-driving cars. CoRR abs/1604.07316. External Links: Link, 1604.07316 Cited by: §III.
  • [4] X. Cai, Y. Ding, Y. Jiang, and Z. Zhou (2019) Expert-level atari imitation learning from demonstrations only. External Links: arXiv:1909.03773 Cited by: §I.
  • [5] J. Fu, K. Luo, and S. Levine (2017) Learning robust rewards with adversarial inverse reinforcement learning. CoRR abs/1710.11248. External Links: Link, 1710.11248 Cited by: §VI.
  • [6] S. K. S. Ghasemipour, R. Zemel, and S. Gu (2019) A divergence minimization perspective on imitation learning methods. External Links: 1911.02256 Cited by: item 4, §VI, §VII.
  • [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2672–2680. External Links: Link Cited by: §V.
  • [8] D. C. Halbert (1984) Programming by example. Ph.D. Thesis, University of California, Berkeley. Note: AAI8512843 Cited by: §I.
  • [9] J. Ho and S. Ermon (2016) Generative adversarial imitation learning. CoRR abs/1606.03476. External Links: Link, 1606.03476 Cited by: item 3, §V, §VI, §VI, §VII.
  • [10] S. V. Levine CS285. University of California at Berkeley. External Links: Link Cited by: footnote 1.
  • [11] D. A. Pomerleau (1989) ALVINN: an autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems 1, D. S. Touretzky (Ed.), pp. 305–313. External Links: ISBN 1-558-60015-9, Link Cited by: item 1, §I, §IV, §IV, §IV, §V, §VII.
  • [12] J. Rodriguez (2018-11) What's new in deep learning research: openai and deepmind join forces to achieve superhuman… Towards Data Science. External Links: Link Cited by: Fig. 2.
  • [13] (2019-10) Supervised learning. Wikimedia Foundation. External Links: Link Cited by: footnote 2.
  • [14] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. A Bradford Book, USA. External Links: ISBN 0262039249, 9780262039246 Cited by: footnote 3.
  • [15] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey (2008) Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 3, AAAI’08, pp. 1433–1438. External Links: ISBN 978-1-57735-368-3, Link Cited by: footnote 4.