Imitation Learning from Video by Leveraging Proprioception

05/22/2019 ∙ by Faraz Torabi, et al. ∙ The University of Texas at Austin

Classically, imitation learning algorithms have been developed for idealized situations, e.g., the demonstrations are often required to be collected in the exact same environment and usually include the demonstrator's actions. Recently, however, the research community has begun to address some of these shortcomings by offering algorithmic solutions that enable imitation learning from observation (IfO), e.g., learning to perform a task from visual demonstrations that may be in a different environment and do not include actions. Motivated by the fact that agents often also have access to their own internal states (i.e., proprioception), we propose and study an IfO algorithm that leverages this information in the policy learning process. The proposed architecture learns policies over proprioceptive state representations and compares the resulting trajectories visually to the demonstration data. We experimentally test the proposed technique on several MuJoCo domains and show that it outperforms other imitation from observation algorithms by a large margin.


1 Introduction

Imitation learning [Schaal1997, Argall et al.2009, Osa et al.2018] is a popular method by which artificial agents learn to perform tasks. In the imitation learning framework, an expert agent provides demonstrations of a task to a learning agent, and the learning agent attempts to mimic the expert. Unfortunately, many existing imitation learning algorithms have been designed for idealized situations, e.g., they require that the demonstrations be collected in the exact same environment as the one that the imitator is in and/or that the demonstrations include the demonstrator’s actions, i.e., the internal control signals that were used to drive the behavior. These limitations result in the exclusion of a large amount of existing resources, including a large number of videos uploaded to the internet. For example, 300 hours of video are uploaded to YouTube every minute (https://bit.ly/2quPG6O), many of which include different types of tasks being performed. Without new imitation learning techniques, none of this video can be used to instruct artificial agents.

Fortunately, the research community has recently begun to focus on addressing the above limitations by considering the specific problem of imitation from observation (IfO) [Liu et al.2018]. IfO considers situations in which agents attempt to learn tasks by observing demonstrations that contain only state information (e.g., videos). Among IfO algorithms that learn tasks by watching videos, most attempt to learn imitation policies that rely solely on self-observation through video, i.e., they use a convolutional neural network (CNN) that maps images of themselves to actions. However, in many cases, the imitating agent also has access to its own proprioceptive state information, i.e., direct knowledge of itself such as the joint angles and torques associated with its limbs. In this paper, we argue that IfO algorithms that ignore this information are missing an opportunity that could potentially improve the performance and the efficiency of the learning process. Therefore, we are interested here in IfO algorithms that can make use of both visual and proprioceptive state information.

In this paper, we build upon our previous work [Torabi et al.2018b], which proposed an algorithm that uses a GAN-like [Goodfellow et al.2014] architecture to perform IfO directly from videos. Unlike our prior work, however, our method also uses proprioceptive information from the imitating agent during the learning process. We hypothesize that the addition of such information will improve both learning speed and the final performance of the imitator, and we test this hypothesis experimentally in several standard simulation domains. We compare our method with other, state-of-the-art approaches that do not leverage proprioception, and our results validate our hypothesis, i.e., the proposed technique outperforms the others by a large margin.

The rest of this paper is organized as follows. In Section 2, we review related work in imitation from observation. In Section 3, we review technical details surrounding Markov decision processes, imitation learning, and IfO. The proposed algorithm is presented in Section 4, and we describe the experiments that we have performed in Section 5.

2 Related Work

In this section, we review research in imitation learning, plan/goal recognition by mirroring, and recent advances in imitation from observation (IfO).

Conventionally, imitation learning is used in autonomous agents to learn tasks from demonstrated state-action trajectories. The algorithms developed for this task can be divided into two general categories: (1) behavioral cloning [Bain and Sammut1999, Ross et al.2011, Daftry et al.2016], in which the agents learn a direct mapping from the demonstrated states to the actions, and (2) inverse reinforcement learning (IRL) [Abbeel and Ng2004, Bagnell et al.2007, Baker et al.2009], in which the agents first learn a reward function based on the demonstrations and then learn to perform the task using a reinforcement learning (RL) [Sutton and Barto1998] algorithm.

In contrast, imitation from observation (IfO) is a framework for learning a task from state-only demonstrations. This framework has recently received a great deal of attention from the research community. The IfO algorithms that have been developed can be categorized as either (1) model-based, or (2) model-free. Model-based algorithms require the agent to learn an explicit model of its environment as part of the imitation learning process. One algorithm of this type is behavioral cloning from observation (BCO) [Torabi et al.2018a], in which the imitator learns a dynamics model of its environment using experience collected by a known policy, and then uses this model to infer the missing demonstrator actions. Using the inferred actions, the imitator then computes an imitation policy using behavioral cloning [Bain and Sammut1995]. Another model-based approach to IfO is imitating latent policies from observation (ILPO) [Edwards et al.2019]. Given the current state of the expert, this approach predicts the next state using a latent policy and a forward dynamics model. It then uses the difference between the predicted state and the actual demonstrator next state to update both the model and the imitation policy. Afterwards, the imitator interacts with its environment to correct the action labels.

Model-free algorithms, on the other hand, do not require any sort of model to learn imitation policies. One set of approaches of this type learns a time-dependent representation of tasks and then relies on hand-designed, time-aligned reward functions to learn the task via RL. For example, Sermanet et al. [Sermanet et al.2018] propose an algorithm that learns an embedding function using a triplet loss that seeks to pull states that are close together in time closer together in the embedded space, while pushing other states further away. Liu et al. [Liu et al.2018] also propose a new architecture to learn a state representation, specifically one that is capable of handling viewpoint differences. Gupta et al. [Gupta et al.2017] also propose a neural network architecture that tries to learn a state representation that can overcome possible embodiment mismatch between the demonstrator and the imitator. Each of these approaches requires multiple demonstrations of the same task to be time-aligned, which is typically not a realistic assumption. Aytar et al. [Aytar et al.2018] propose an IfO algorithm that first learns an embedding using a self-supervised objective, and then constructs a reward function based on the difference in the embedding space between the current state of the imitator and a specific checkpoint generated from the visual demonstration. Goo and Niekum [Goo and Niekum2019] propose an algorithm that uses a shuffle-and-learn-style [Misra et al.2016] loss to train a neural network that can predict progress in the task, which can then be used as the reward function.

Another set of model-free algorithms follows a more end-to-end approach, learning policies directly from observations. An algorithm of this type is generative adversarial imitation from observation (GAIfO) [Torabi et al.2018b], which uses a GAN-like architecture to bring the state-transition distribution of the imitator closer to that of the demonstrator. Another approach of this type is the work of Merel et al. [Merel et al.2017], which is concerned instead with single-state distributions. Stadie et al. [Stadie et al.2017] also propose an algorithm in this space that combines adversarial domain-confusion methods [Ganin et al.2016] with adversarial imitation learning algorithms in an attempt to overcome changes in viewpoint. The method we propose in this paper also belongs to the category of end-to-end, model-free imitation from observation algorithms. However, it differs from the algorithms discussed above in that we explicitly incorporate the imitator’s proprioceptive information in the learning process in order to study the improvement such information can make with respect to the performance and speed of learning.

A framework that is closely related to imitation from observation is plan/goal recognition through mirroring [Vered et al.2016, Vered et al.2018], in that it attempts to infer higher-level variables such as the goal or the future plan by observing other agents. However, in plan and goal recognition the observer already has fixed controllers, and it uses these controllers to match and explain the observed agent's behavior in order to infer its goal or plan. In imitation from observation, on the other hand, the observing agent seeks to learn a controller that it can use to imitate the observed agent.

3 Background

In this section, we establish notation and provide background information about Markov decision processes (MDPs) and adversarial imitation learning.

3.1 Notation

We consider artificial learning agents operating in the framework of Markov decision processes (MDPs). An MDP can be described as a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $P(s_{t+1} \mid s_t, a_t)$ is a function which represents the probability of an agent transitioning from state $s_t$ at time $t$ to $s_{t+1}$ at time $t+1$ by taking action $a_t$, $r$ is a function that represents the reward feedback that the agent receives after taking a specific action at a given state, and $\gamma$ is a discount factor. In the context of the notation established above, we are interested here in learning a policy $\pi(a \mid s)$ that can be used to select an action at each state.

In this paper, we refer to $s_t$ as the proprioceptive state, i.e., $s_t$ is the most basic, internal state information available to the agent (e.g., the joint angles of a robotic arm). Since we are also concerned with visual observations of agent behavior, we denote these observations as $o$, i.e., an image of the agent at time $t$ is denoted as $o_t$. The visual observations of the agent are determined both by the agent’s current proprioceptive state $s_t$, and also by other factors relating to image formation such as camera position. Importantly, due to phenomena such as occlusion, it is not always possible to infer $s_t$ from $o_t$ alone.
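As a purely illustrative example of this distinction (not code from the paper), the sketch below shows one way to obtain both a proprioceptive state $s_t$ and a grayscale visual observation $o_t$ from a MuJoCo task in OpenAI Gym; it assumes the pre-0.26 Gym API and an environment that supports rgb_array rendering.

```python
# Minimal sketch (assumption, not the paper's code): collecting both the
# proprioceptive state s_t and a grayscale visual observation o_t.
import gym
import numpy as np

env = gym.make("Hopper-v2")

def grayscale(frame):
    # Standard luminance conversion from an RGB frame (H, W, 3) to (H, W).
    return np.dot(frame[..., :3], [0.299, 0.587, 0.114]).astype(np.float32)

s_t = env.reset()                              # proprioceptive state (joint angles, velocities, ...)
o_t = grayscale(env.render(mode="rgb_array"))  # visual observation of the same time step

a_t = env.action_space.sample()                # placeholder action
s_next, reward, done, info = env.step(a_t)     # reward is unused in the IfO setting
o_next = grayscale(env.render(mode="rgb_array"))
```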

In imitation learning (IL), agents do not receive reward feedback $r$. Instead, they have access to expert demonstrations of the task. These demonstrations are composed of the state and action sequences experienced by the demonstrator. Here, however, we specifically consider the problem of imitation from observation (IfO), in which the agent only has access to sequences of visual observations of the demonstrator performing the task, i.e., $\{o^e_t\}$.

3.2 Adversarial Imitation Learning

Generative adversarial imitation learning (GAIL) is a recent imitation learning algorithm developed by Ho and Ermon [Ho and Ermon2016] that formulates the problem of finding an imitating policy as that of solving the following optimization problem:

$$\min_{\pi}\ \max_{D}\ \mathbb{E}_{\pi}\big[\log D(s,a)\big] + \mathbb{E}_{\pi_E}\big[\log\big(1 - D(s,a)\big)\big] - \lambda H(\pi) \qquad (1)$$

where $\pi_E$ denotes the expert policy, $H$ is the entropy function, and the discriminator function $D$ can be thought of as a classifier trained to differentiate between the state-action pairs provided by the demonstrator and those experienced by the imitator. The objective in (1) is similar to the one used in generative adversarial networks (GANs) [Goodfellow et al.2014], and the associated algorithm can be thought of as trying to induce an imitator state-action occupancy measure that is similar to that of the demonstrator. Even more recently, there has been research on methods that seek to improve on GAIL by, e.g., increasing sample efficiency [Kostrikov et al.2019, Sasaki et al.2019] and improving reward representation [Fu et al.2018, Qureshi et al.2019].

The method we propose in this paper is most related to generative adversarial imitation from observation (GAIfO) [Torabi et al.2018b], which models the imitating policy using a randomly-initialized convolutional neural network, executes the policy to generate recorded video of the imitator’s behavior, and then trains a discriminator to differentiate between video of the demonstrator and video of the imitator. Next, it uses the discriminator as a reward function for the imitating agent (higher rewards corresponding to behavior the discriminator classifies as coming from the demonstrator), and uses a policy gradient technique (e.g., TRPO [Schulman et al.2015]) to update the policy. The process repeats until convergence. This algorithm differs from what we propose in that GAIfO uses visual data in both the discriminator learning and the policy learning processes. That is, the learned behavior policy maps images to actions using a convolutional neural network. The technique we propose, on the other hand, leverages proprioceptive information in the policy learning step, instead learning policies that map proprioceptive states to actions using a multilayer perceptron architecture.

Figure 1: A diagrammatic representation of our algorithm. A multilayer perceptron (MLP) is used to model the policy, which takes the proprioceptive features $s_t$ as input and outputs an action $a_t$. The agent then executes the action in its environment. While the agent executes the policy, a video of the resulting behavior is recorded. Stacks of four consecutive grayscale images from both the demonstrator and the imitator are then prepared as the input for the discriminator, which is trained to discriminate between data coming from these two sources. Finally, the discriminator is used as the reward function to train the policy using PPO (not shown).

4 Proposed Method

As presented in Section 3, we are interested in the problem of imitation from observation (IfO), where an imitating agent has access to visual demonstrations, $\{o^e_t\}$, of an expert performing a task, and seeks to learn a behavior that is approximately the same as the expert’s. In many previous approaches to this problem, the imitator selects actions on the basis of visual self-observation alone (i.e., using images of itself). We hypothesize that also leveraging available proprioceptive state information, $s_t$, during the learning process will result in better and faster learning.

Inspired by GAIL, our algorithm is comprised of two pieces: (1) a generator, which corresponds to the imitation policy, and (2) a discriminator, which serves as the reward function for the imitator. We model the imitation policy as a multilayer perceptron (MLP), $\pi_\theta$. The imitating agent, being aware of its own proprioceptive features $s_t$, feeds them into the policy network and receives as output a distribution over actions from which the selected action $a_t$ can be sampled. The imitator then executes this action, and we record a video of the resulting behavior. After several actions have been executed, we have accumulated a collection of visual observations of the imitator’s behavior, $\{o_t\}$.
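For concreteness, the following PyTorch sketch shows one way such an MLP policy over proprioceptive features could be structured; the hidden-layer sizes and the Gaussian output head are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal PyTorch sketch of an MLP Gaussian policy over proprioceptive
# features; hidden sizes are placeholders, not the paper's reported values.
import torch
import torch.nn as nn

class MLPPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean = nn.Linear(hidden, action_dim)
        # State-independent log standard deviation, a common choice for PPO-style training.
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, s):
        h = self.body(s)
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())

policy = MLPPolicy(state_dim=11, action_dim=3)   # e.g., Hopper-sized spaces
dist = policy(torch.randn(1, 11))
a_t = dist.sample()                              # action passed to env.step
log_prob = dist.log_prob(a_t).sum(-1)            # used later by the policy update
```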

Meanwhile, we use a convolutional neural network as the discriminator, $D_\phi$. Given visual observations of the demonstrator, $\{o^e_t\}$, and observations of the imitator, $\{o_t\}$, we train the discriminator to differentiate between the data coming from these different sources. Since single video frames lack observability in most cases, we instead stack four consecutive frames, $o_{t-3:t}$, and feed this stack as input to the discriminator.
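A corresponding discriminator over four-frame stacks might look like the sketch below; again, the filter sizes, strides, and channel counts are placeholders rather than the paper's reported values.

```python
# Minimal PyTorch sketch of a discriminator over stacks of four grayscale frames.
import torch
import torch.nn as nn

class FrameStackDiscriminator(nn.Module):
    def __init__(self, in_frames=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.out = nn.Linear(64, 1)   # single logit; sigmoid gives D in (0, 1)

    def forward(self, frames):        # frames: (batch, 4, H, W), pixel values in [0, 1]
        return torch.sigmoid(self.out(self.conv(frames)))

D = FrameStackDiscriminator()
stack = torch.rand(8, 4, 84, 84)      # a batch of 4-frame stacks
d_values = D(stack)                   # near 0 => expert-like, near 1 => imitator-like
```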

1:  Initialize the policy $\pi_\theta$ randomly
2:  Initialize the discriminator $D_\phi$ randomly
3:  Obtain visual demonstrations $\{o^e_t\}$
4:  for $i = 0$ to $N$ do
5:     Execute $\pi_\theta$ and record video observations $\{o_t\}$
6:     Update the discriminator $D_\phi$ using the loss in (2)
7:     Update $\pi_\theta$ by performing PPO updates with gradient steps of (4), where $Q$ is given by (5)
8:  end for
Algorithm 1

We train the discriminator to output values closer to zero for the transitions coming from the expert, and values closer to one for those coming from the imitator. Therefore, the discriminator aims to solve the following optimization problem:

$$\max_{D}\ \mathbb{E}_{\pi_\theta}\big[\log D(o_{t-3:t})\big] + \mathbb{E}_{\pi_E}\big[\log\big(1 - D(o^e_{t-3:t})\big)\big] \qquad (2)$$

The lower the value output by the discriminator, the higher the chance that the input came from the expert. Recall that the objective for the imitator is to mimic the demonstrator, which can be thought of as fooling the discriminator. Therefore, we use

$$r_t = -\log\big(D(o_{t-3:t})\big) \qquad (3)$$

as the reward to update the imitation policy using RL. In particular, we use proximal policy optimization (PPO) [Schulman et al.2017] with gradient steps of

$$\mathbb{E}_{\tau \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q(s_t, a_t)\big] \qquad (4)$$

where $Q(\hat{s}, \hat{a})$ is the state-action value, i.e., the expected reward that the agent receives when starting from state $\hat{s}$ and taking action $\hat{a}$:

$$Q(\hat{s}, \hat{a}) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ -\log D(o_{t-3:t}) \mid s_0 = \hat{s},\ a_0 = \hat{a} \big] \qquad (5)$$

As presented, our algorithm uses the visual information in order to learn the reward function by comparing visual data generated by the imitator and the demonstrator. It also takes advantage of proprioceptive state features in the process of policy learning by learning a mapping from those features to actions using a reinforcement learning algorithm. Pseudocode and a diagrammatic representation of our proposed algorithm are presented in Algorithm 1 and Figure 1, respectively.
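To make the interaction between these pieces concrete, the sketch below (an illustration, not the paper's implementation) shows one training iteration in the spirit of Algorithm 1, reusing the MLPPolicy and FrameStackDiscriminator modules sketched earlier: the discriminator is updated with a binary-cross-entropy form of (2), the imitator's frame stacks are scored with the reward in (3), and, for brevity, the full PPO update is replaced with a simple REINFORCE-style surrogate.

```python
# Sketch of one training iteration (assumes `policy` and `D` from the earlier sketches).
import torch
import torch.nn.functional as F

d_opt = torch.optim.Adam(D.parameters(), lr=1e-4)
pi_opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

def discriminator_update(imitator_stacks, demo_stacks):
    # Eq. (2): push D toward 1 on imitator frame stacks and toward 0 on expert frame stacks.
    d_imitator = D(imitator_stacks)
    d_expert = D(demo_stacks)
    loss = F.binary_cross_entropy(d_imitator, torch.ones_like(d_imitator)) + \
           F.binary_cross_entropy(d_expert, torch.zeros_like(d_expert))
    d_opt.zero_grad()
    loss.backward()
    d_opt.step()

def policy_update(states, actions, imitator_stacks):
    # Eq. (3): the reward is -log D of the imitator's own frame stacks.
    with torch.no_grad():
        rewards = -torch.log(D(imitator_stacks).clamp(min=1e-8)).squeeze(-1)
    # Simplified surrogate standing in for the full PPO update:
    # maximize E[log pi(a|s) * reward].
    log_probs = policy(states).log_prob(actions).sum(-1)
    loss = -(log_probs * rewards).mean()
    pi_opt.zero_grad()
    loss.backward()
    pi_opt.step()
```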

5 Experiments

The algorithm introduced above combines proprioceptive state information with video observations in an adversarial imitation learning paradigm. We hypothesize that using the extra state information in the proposed way will lead to both faster imitation learning and better performance on the imitated task when compared to similar techniques that ignore proprioception. In this section, we describe the experimental procedure by which we evaluated this hypothesis, and discuss the results.

5.1 Setup

We evaluated our method on a subset of the continuous control tasks available via OpenAI Gym [Brockman et al.2016] and the MuJoCo simulator [Todorov et al.2012]:

  • MountainCarContinuous: This environment includes a 2D path and a vehicle, and the task is for the vehicle to reach a certain target point (Figure 2(a)). The proprioceptive state space and action space are 2- and 1-dimensional, respectively.

  • InvertedPendulum: This environment includes a single pendulum on a bar, and the task is to keep the pendulum straight upward by controlling the left-right motion of its base (Figure 2(b)). The proprioceptive state space and action space are 4- and 1-dimensional, respectively.

  • InvertedDoublePendulum: This environment includes a double pendulum on a bar, and the task is to keep the pendulum straight upward by controlling the left-right motion of its base (Figure 2(c)). The proprioceptive state space and action space are 11- and 1-dimensional, respectively.

  • Hopper: This environment includes three connected rods, and the task is to control the rods so as to hop forward as fast as possible (Figure 2(d)). The proprioceptive state space and action space are 11- and 3-dimensional, respectively.

  • Walker2d: This environment is comprised of five connected rods, and the task is to control these rods so as to have the agent walk forward as fast as possible (Figure 2(e)). The proprioceptive state space and action space are 17- and 6-dimensional, respectively.

  • HalfCheetah: This environment is comprised of seven connected rods, and the task is to control these rods so as to have the agent walk forward as fast as possible (Figure 2(f)). The proprioceptive state space and action space are 17- and 6-dimensional, respectively.

Visual observations take a similar form in each environment, and are represented as grayscale images.

(a) MountainCarContinuous
(b) InvertedPendulum
(c) InvertedDoublePendulum
(d) Hopper
(e) Walker2d
(f) HalfCheetah
Figure 2: Screenshots of the experimental MuJoCo domains considered in this paper.
Figure 3: The rectangular bars and error bars represent the mean normalized return and the standard error, respectively, as measured over 1000 trials. The normalized values have been scaled in such a way that expert and random performance are 1.0 and 0.0, respectively. The x-axis represents the number of available video demonstration trajectories.

To generate the demonstration data, we first trained expert agents using pure reinforcement learning (i.e., not from imitation). More specifically, we used proximal policy optimization (PPO) [Schulman et al.2017] and the ground-truth reward function provided by OpenAI Gym. After the expert agents were trained, we recorded video demonstrations of their behavior.
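As an illustration only (the exact recording setup is not reproduced here), demonstration videos of this kind could be collected along the lines of the sketch below, where `expert_policy` is a placeholder for the trained PPO agent.

```python
# Sketch (assumption, not the paper's code): rolling out a trained expert policy
# and saving grayscale frames as an observation-only video demonstration.
import gym
import numpy as np

def record_demonstration(env_name, expert_policy, max_steps=1000):
    env = gym.make(env_name)
    s = env.reset()
    frames = []
    for _ in range(max_steps):
        frame = env.render(mode="rgb_array")
        frames.append(np.dot(frame[..., :3], [0.299, 0.587, 0.114]))  # grayscale
        a = expert_policy(s)
        s, _, done, _ = env.step(a)
        if done:
            break
    return np.stack(frames)   # (T, H, W) array: states only, no actions
```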

For our approach, we model imitation policies as multi-layer perceptrons (MLPs). We model the discriminator as a convolutional neural network with three convolutional layers, the last of which is fully connected to a single output unit that represents the discriminator value. To train these networks, we use the Adam variant of stochastic gradient descent [Kingma and Ba2015]. One important parameter in the training process is the number of updates performed on the discriminator and the policy at each iteration. Therefore, we performed an extensive hyper-parameter search to find the values that performed best for each experiment; these are presented in Table 1 in the Appendix.

We compared the proposed method with three other imitation from observation algorithms that do not exploit the imitator’s proprioceptive state information:

  • Time contrastive network (TCN) [Sermanet et al.2018]: TCN uses a convolutional neural network architecture to learn a time-dependent state representation of the task from video demonstrations. It then defines a reward function based on how far the imitator’s state representation is from that of the demonstrator at each time step. There are two versions of this algorithm: (1) multi-view TCN, and (2) single-view TCN. In this work, we only consider cases in which we have demonstrations from a single viewpoint, and so we compare only with single-view TCN.

  • Behavioral cloning from observation (BCO) [Torabi et al.2018a]: In BCO, the agent first interacts randomly with its environment and records video of the resulting state-action sequences. From these sequences, an inverse dynamics model is learned. The learned model is used to infer the actions that are missing in the video demonstration. Finally, using the demonstration observations and the inferred actions, the agent performs classical imitation learning using a supervised learning algorithm.

  • Generative adversarial imitation from observation (GAIfO) [Torabi et al.2018b, Torabi et al.2019]: GAIfO first initializes a convolutional neural network policy randomly, and then executes this policy in its environment while recording a video of the agent. It then trains a neural network classifier to discriminate between the visual data coming from the imitator and that coming from the demonstrator. Finally, it uses the output of the discriminator as the reward function to train the CNN policy. This process repeats until convergence.

For each of the baseline algorithms above, we used the hyper-parameters and architectures reported in the original papers.

5.2 Results

We hypothesized that our method would outperform the baselines with respect to two criteria: (1) the final performance of the trained imitator, i.e., how well the imitator performs the task compared to the demonstrator (as measured by the ground-truth reward functions), and (2) the speed of the imitation learning process as measured by the number of learning iterations. The results shown here were generated using ten independent trials, where each trial used a different random seed to initialize the environments, model parameters, etc.

Figure 3 depicts our experimental results pertaining to the first criterion, i.e., the final task performance of the trained imitating agents in each domain. The rectangular bars and error bars represent the mean return and the standard error, respectively, as measured over the evaluation trajectories. We report performance using a normalized task score, i.e., scores are scaled in such a way that the demonstrating agent’s performance corresponds to 1.0 and the performance of an agent with random behavior corresponds to 0.0. The x-axis represents the number of demonstration trajectories, i.e., videos, available to the imitator. In general, it can be seen that the proposed method indeed outperforms the baselines in almost all cases, which shows that using the available proprioceptive state information can make a remarkable difference in the final task performance achieved by imitation learning. In the particular case of InvertedPendulum, both GAIfO and the proposed method achieve a final task performance equal to that of the demonstrator, likely due to the simplicity of the task. For the rest of the tasks, however, it can be clearly seen that the proposed approach performs better than GAIfO. (Note that the performance of GAIfO on Hopper is different from what was presented in the GAIfO paper [Torabi et al.2018b]. We hypothesize that the reason is twofold: (1) different physics engines: MuJoCo is used in this paper, whereas the previous work [Torabi et al.2018b] used Pybullet [Coumans and Bai2016 2017]; and (2) differences in video appearance: in this work we do not alter the default simulator parameters, whereas in the previous work [Torabi et al.2018b] some parameters, such as the colors used in the video frames, were modified in order to increase the contrast between the agent and the background.) Further, we can see that increasing the number of demonstrated trajectories results in increased task performance.
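For reference, the normalization described above reduces to the following simple calculation (our restatement, with hypothetical return values):

```python
# Normalized score as described above: 1.0 corresponds to the demonstrator's
# mean return and 0.0 to a random policy's mean return.
def normalized_score(mean_return, random_return, expert_return):
    return (mean_return - random_return) / (expert_return - random_return)

# Hypothetical example: a mean return of 2500 with random = 100 and expert = 3200.
print(normalized_score(2500.0, 100.0, 3200.0))  # ~0.77
```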

To validate our hypothesis with respect to learning speed, we also studied the transient performance of the various learning algorithms. Because GAIfO was the only other method to reach expert-level performance, and it did so only in the InvertedPendulum domain, Figure 4 depicts the results for our method and GAIfO in that domain only. The x-axis shows the number of iterations, i.e., the number of update cycles for both the policy and the discriminator. Since updating the policy requires interaction with the environment, a smaller number of iterations also corresponds to less overhead during the learning process. As shown in the figure, our method converges to expert-level performance much faster than GAIfO, which supports our hypothesis that leveraging proprioception speeds up the imitation learning process.

In Figure 3, we can see that two of the baseline methods—BCO and TCN—do not achieve task performance anywhere near that of the expert.

For InvertedPendulum and InvertedDoublePendulum, we suspect that TCN performs poorly because the learned state embedding overfits to the specific demonstrations and therefore does not generalize well toward supporting the overall goal of keeping the pendulum balanced above the rod. For Hopper, Walker2d, and HalfCheetah, the poor performance of TCN may be due to the fact that the tasks are cyclical in nature and therefore not well-suited to the time-dependent learned state embedding. TCN performs relatively better in MountainCarContinuous than in the other domains because this domain does have the properties required by TCN. As for BCO, we posit that its low performance is due to the well-known compounding-error issue present in behavioral cloning.

One interesting thing to note is that Walker2d results in larger error bars for our technique than those seen for any of the other domains. We hypothesize that the reason for this is that the video frames provide very poor information regarding the state of the demonstrator—here, the agent has two legs, which sometimes results in occlusion and, therefore, uncertainty regarding which action the agent should take.

Finally, we can see that the proposed technique performs the most poorly in the HalfCheetah domain. We hypothesize that this is due to the speed at which the demonstrator acts: frame-to-frame differences are large, e.g., three to four consecutive frames cover a complete cycle of the agent jumping forwards. This rate of change may make it difficult for our discriminator to extract a pattern of behavior, which, consequently, would make it much more difficult for the agent to move its behavior closer to that of the demonstrator. Therefore, one way that performance might be improved is to increase the frame rate at which the demonstrations are sampled. Another way, as suggested by Figure 3, would be to increase the number of demonstration trajectories beyond what is shown here.

Figure 4: Performance of imitation agents with respect to the number of iterations (N). Solid colored lines represent the mean return and shaded areas represent standard errors. The returns are scaled so that the performance of the expert and random policies are one and zero, respectively.

6 Conclusion and Future Work

In this paper, we hypothesized that including proprioception would be beneficial to the learning process in the IfO paradigm. To test this hypothesis, we presented a new imitation from observation algorithm that leverages both available visual and proprioceptive information. It uses visual information to compare the imitator’s behavior to that of the demonstrator, and uses this comparison as a reward function for training a policy over proprioceptive states. We showed that leveraging this state information can significantly improve both the performance and the efficiency of the learning process.

However, to achieve the end-goal of true imitation from observation, several challenges remain. For example, IfO algorithms should be able to overcome embodiment mismatch (the imitator and the demonstrator have different embodiments) and viewpoint mismatch (the visual demonstrations are recorded from different viewpoints). Resolving these limitations is a natural next step for extending this research. Another way to improve upon the proposed method is to make the training more reliable by incorporating techniques developed to improve the stability of GANs, such as the work of Arjovsky et al. [Arjovsky et al.2017]. Further, to the best of our knowledge, GAN-like methods have not yet been deployed on real robots due to their high sample complexity. Therefore, techniques that seek to improve the learning process with respect to this metric should also be investigated further.

Acknowledgments

This work has taken place in the Learning Agents Research Group (LARG) at the Artificial Intelligence Laboratory, The University of Texas at Austin. LARG research is supported in part by grants from the National Science Foundation (IIS-1637736, IIS-1651089, IIS-1724157), the Office of Naval Research (N00014-18-2243), Future of Life Institute (RFP2-000), Army Research Lab, DARPA, Intel, Raytheon, and Lockheed Martin. Peter Stone serves on the Board of Directors of Cogitai, Inc. The terms of this arrangement have been reviewed and approved by the University of Texas at Austin in accordance with its policy on objectivity in research.

References

  • [Abbeel and Ng2004] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1. ACM, 2004.
  • [Argall et al.2009] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
  • [Arjovsky et al.2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proceedings of the International Conference on Machine Learning, pages 214–223, 2017.
  • [Aytar et al.2018] Yusuf Aytar, Tobias Pfaff, David Budden, Thomas Paine, Ziyu Wang, and Nando de Freitas. Playing hard exploration games by watching youtube. In Advances in Neural Information Processing Systems, pages 2935–2945, 2018.
  • [Bagnell et al.2007] JA Bagnell, Joel Chestnutt, David M Bradley, and Nathan D Ratliff. Boosting structured prediction for imitation learning. In Advances in Neural Information Processing Systems, pages 1153–1160, 2007.
  • [Bain and Sammut1995] Michael Bain and Claude Sammut. A framework for behavioral cloning. Machine Intelligence 14, 1995.
  • [Bain and Sammut1999] Michael Bain and Claude Sammut. A framework for behavioural cloning. Machine Intelligence 15, 15:103, 1999.
  • [Baker et al.2009] Chris L Baker, Rebecca Saxe, and Joshua B Tenenbaum. Action understanding as inverse planning. Cognition, 113(3):329–349, 2009.
  • [Brockman et al.2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
  • [Coumans and Bai2016 2017] Erwin Coumans and Yunfei Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. 2016-2017.
  • [Daftry et al.2016] Shreyansh Daftry, J Andrew Bagnell, and Martial Hebert. Learning transferable policies for monocular reactive mav control. In International Symposium on Experimental Robotics, pages 3–11. Springer, 2016.
  • [Edwards et al.2019] Ashley D Edwards, Himanshu Sahni, Yannick Schroeker, and Charles L Isbell. Imitating latent policies from observation. In Proceedings of International Conference on Machine Learning, 2019.
  • [Fu et al.2018] Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adverserial inverse reinforcement learning. In International Conference on Learning Representations, 2018.
  • [Ganin et al.2016] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
  • [Goo and Niekum2019] Wonjoon Goo and Scott Niekum. One-shot learning of multi-step tasks from observation via activity localization in auxiliary video. In Proceedings of International Conference on Robotics and Automation (ICRA), 2019.
  • [Goodfellow et al.2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
  • [Gupta et al.2017] Abhishek Gupta, Coline Devin, YuXuan Liu, Pieter Abbeel, and Sergey Levine. Learning invariant feature spaces to transfer skills with reinforcement learning. In International Conference on Learning Representations, 2017.
  • [Ho and Ermon2016] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.
  • [Kingma and Ba2015] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
  • [Kostrikov et al.2019] Ilya Kostrikov, Kumar Krishna Agrawal, Debidatta Dwibedi, Sergey Levine, and Jonathan Tompson. Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning. In International Conference on Learning Representations, 2019.
  • [Liu et al.2018] YuXuan Liu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Imitation from observation: Learning to imitate behaviors from raw video via context translation. In IEEE International Conference on Robotics and Automation (ICRA), 2018.
  • [Merel et al.2017] Josh Merel, Yuval Tassa, Sriram Srinivasan, Jay Lemmon, Ziyu Wang, Greg Wayne, and Nicolas Heess. Learning human behaviors from motion capture by adversarial imitation. arXiv preprint arXiv:1707.02201, 2017.
  • [Misra et al.2016] Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pages 527–544. Springer, 2016.
  • [Osa et al.2018] Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J Andrew Bagnell, Pieter Abbeel, Jan Peters, et al. An algorithmic perspective on imitation learning. Foundations and Trends® in Robotics, 7(1-2):1–179, 2018.
  • [Qureshi et al.2019] Ahmed H. Qureshi, Byron Boots, and Michael C. Yip. Adversarial imitation via variational inverse reinforcement learning. In International Conference on Learning Representations, 2019.
  • [Ross et al.2011] Stéphane Ross, Geoffrey J Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011.
  • [Sasaki et al.2019] Fumihiro Sasaki, Tetsuya Yohira, and Atsuo Kawaguchi. Sample efficient imitation learning for continuous control. In International Conference on Learning Representations, 2019.
  • [Schaal1997] Stefan Schaal. Learning from demonstration. In Advances in Neural Information Processing Systems, pages 1040–1046, 1997.
  • [Schulman et al.2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
  • [Schulman et al.2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • [Sermanet et al.2018] Pierre Sermanet, Corey Lynch, Jasmine Hsu, and Sergey Levine. Time-contrastive networks: Self-supervised learning from multi-view observation. In International Conference in Robotics and Automation (ICRA), 2018.
  • [Stadie et al.2017] Bradly C Stadie, Pieter Abbeel, and Ilya Sutskever. Third-person imitation learning. In International Conference on Learning Representations, 2017.
  • [Sutton and Barto1998] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
  • [Todorov et al.2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
  • [Torabi et al.2018a] Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 4950–4957, 2018.
  • [Torabi et al.2018b] Faraz Torabi, Garrett Warnell, and Peter Stone. Generative adversarial imitation from observation. arXiv preprint arXiv:1807.06158, 2018.
  • [Torabi et al.2019] Faraz Torabi, Garrett Warnell, and Peter Stone. Adversarial imitation learning from state-only demonstrations. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, 2019.
  • [Vered et al.2016] Mor Vered, Gal A Kaminka, and Sivan Biham. Online goal recognition through mirroring: Humans and agents. In The Fourth Annual Conference on Advances in Cognitive Systems, 2016.
  • [Vered et al.2018] Mor Vered, Ramon Fraga Pereira, Maurício C Magnaguagno, Gal A Kaminka, and Felipe Meneguzzi. Towards online goal recognition combining goal mirroring and landmarks. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 2112–2114. International Foundation for Autonomous Agents and Multiagent Systems, 2018.

Appendix A Appendix

Table 1 reports, for each domain (MountainCarContinuous, InvertedPendulum, InvertedDoublePendulum, Hopper, Walker2D, HalfCheetah) and for each number of available demonstration trajectories, the number of policy update steps and discriminator update steps performed at each iteration, both for our method and for GAIfO.

Table 1: Number of discriminator and policy updates performed at each iteration.