1 Introduction
One well-known way in which artificially intelligent agents are able to learn to perform tasks is via reinforcement learning (RL) (Sutton & Barto, 1998) techniques. Using these techniques, if agents are able to interact with the world and receive feedback (known as reward) based on how well they are performing with respect to a particular task, they are able to use their own experience to improve their future behavior. However, designing a proper feedback mechanism for complex tasks can sometimes prove to be extremely difficult for system designers. Moreover, learning based solely on one’s own experience can be exceedingly slow.
Concerns such as the ones above have given rise to the study of imitation learning (Schaal, 1997; Billard et al., 2008; Argall et al., 2009), where agents instead attempt to learn a task by observing another, more expert agent perform that task. Because the information about how to perform the task is communicated to the imitating agent via a demonstration, this paradigm does not require the explicit design of a reward function. Moreover, because the demonstrations directly provide rich information regarding how to perform the task correctly, imitation learning is typically faster than RL. While there are multiple ways that this problem can be formulated, one general approach is referred to as inverse reinforcement learning (IRL) (Russell, 1998). IRL-based techniques aim to first infer the expert agent’s reward function, and then learn imitating behavior using RL techniques that utilize the inferred function.
Importantly, most of the imitation learning literature has thus far concentrated only on situations in which the imitator not only has the ability to observe the demonstrating agent’s states (e.g., observable quantities such as spatial location), but also the ability to observe the demonstrator’s actions (e.g., internal control signals such as motor commands). While this extra information can make the imitation learning problem easier, requiring it is also limiting. In particular, requiring action observations makes a large number of valuable learning resources – e.g., vast quantities of online videos of people performing different tasks (Zhou et al., 2017) – useless. For the demonstrations present in such resources, the actions of the expert are unknown. This limitation has recently motivated work in the area of imitation from observation (IfO) (Liu et al., 2017), in which agents seek to perform imitation learning using state-only demonstrations.
Broadly speaking, the IfO problem consists of two major subproblems: (1) perception of the demonstrations, i.e., extracting useful features from raw visual data, and (2) learning a control policy using the extracted features. Most IfO work thus far (Liu et al., 2017; Sermanet et al., 2017) has focused on perception and not on control. While powerful methods for perceiving the demonstrations have been developed, the control problem is solved via relatively simple means, i.e., reinforcement learning over a predefined reward function. Depending on the defined reward function, this approach could be restrictive, as discussed further in the next section. Therefore, we seek a more sophisticated control algorithm that is able to learn the task automatically from the demonstrations without explicitly defining a reward function.
In this paper, we propose a general framework for the control aspect of IfO in which we characterize the cost as a function of state transitions only. Under this framework, the IfO problem becomes one of trying to recover the state-transition cost function of the expert. Inspired by the work of Ho & Ermon (2016), we introduce a novel, model-free algorithm called generative adversarial imitation from observation (GAIfO) and prove that it is a specific version of the general framework proposed for IfO. We then experimentally evaluate GAIfO in high-dimensional simulation environments in two different settings: (1) demonstrations and states of the imitator are manually-defined features, and (2) demonstrations and states of the imitator come exclusively from raw visual observation. We show that the proposed method compares favorably to other recently-developed methods for IfO and also that it performs comparably to state-of-the-art conventional imitation learning methods that do have access to the demonstrator’s actions.
The rest of this paper is organized as follows. In Section 2, we cover related work in imitation learning and review existing research in imitation from observation. Then, we present the notation and background needed in Section 3. In Section 4, we introduce our proposed general framework for IfO problems and, in Sections 5 and 6, we discuss our IfO algorithm, GAIfO. Finally, we describe and discuss our experiments in Sections 7 and 8, respectively.
2 Related Work
Because our work is related to imitation learning (Schaal et al., 2003), we first discuss here different approaches and recent advancements in this area. In general, existing work in imitation learning can be split into two categories: (1) behavioral cloning (BC) (Bain & Sammut, 1995; Pomerleau, 1989), and (2) inverse reinforcement learning (IRL) (Ng et al., 2000; Abbeel & Ng, 2004; Ziebart et al., 2008; Fu et al., 2017).
Behavioral cloning methods use supervised learning as a means by which to find a direct mapping from states to actions. BC approaches have been used to successfully learn many different tasks such as navigation for quadrotors (Giusti et al., 2016) or autonomous ground vehicles (Bojarski et al., 2016). Inverse reinforcement learning (IRL) techniques, on the other hand, seek to learn the demonstrator’s cost function and then use this learned cost function in order to learn an imitation policy through RL techniques. IRL methods have been used for interesting tasks such as dish placement and pouring (Finn et al., 2016). To the best of our knowledge, the current state of the art in imitation learning is an IRL-based technique called generative adversarial imitation learning (GAIL) (Ho & Ermon, 2016). GAIL uses generative adversarial networks (GANs) (Goodfellow et al., 2014) as a means by which to bring the distribution of state and action pairs of the imitator and the demonstrator closer together.
Most existing imitation learning approaches require demonstrations that include the expert actions. However, these actions are not always observable, and sometimes it is more practical to be able to imitate state-only demonstrations. One step towards this goal is the work of Finn et al. (2017), where a meta-learning imitation learning method is proposed that enables a robot to reuse past experience and learn new skills from a single demonstration. In particular, raw pixel videos are used as the source of demonstration information. However, it is still assumed that the expert actions are available during meta-training; the requirement for actions is only lifted at test time when learning the new task.
One way to approach the aforementioned problem is to “learn to imitate” (as opposed to imitation learning), i.e., by doing some preprocessing, enable the agent to follow a single demonstration exactly. Two such approaches are proposed by Nair et al. (2017) and Pathak et al. (2018). These methods first learn an inverse dynamics model through self-supervised exploration, and then use it to infer the demonstrator’s action at each step and perform that action in the environment. These approaches mimic the one demonstration that they are exposed to exactly (as opposed to learning and generalizing a task from multiple different demonstrations).
A second approach to imitation from action-free demonstrations is behavioral cloning from observation (BCO) (Torabi et al., 2018). This method also learns an inverse dynamics model through self-supervised exploration, which is then used to infer actions from demonstrations. The problem is then treated as a regular imitation learning problem, and behavioral cloning is used to learn an imitation policy that maps states to the inferred actions. Therefore, this method is able to learn and generalize from different demonstrations but, since it is based on behavioral cloning, it may suffer from the well-studied compounding error caused by covariate shift (Ross & Bagnell, 2010; Ross et al., 2011; Laskey et al., 2016).
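The BCO pipeline described above can be illustrated with a small toy sketch. This is not the implementation of Torabi et al. (2018): the linear dynamics `A`, `B`, the expert gain `K`, and the use of least squares for both the inverse dynamics model and the cloned policy are assumptions chosen only so that the example is short and exactly solvable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear dynamics s' = A s + B a (hidden from the learner).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0, 0.5], [1.0, 0.0]])

def step(s, a):
    return A @ s + B @ a

# 1) Self-supervised exploration: random actions, recorded as (s, a, s').
S = rng.normal(size=(500, 2))
U = rng.normal(size=(500, 2))
S_next = S @ A.T + U @ B.T

# 2) Inverse dynamics model: least-squares map from [s, s'] to a.
W_inv, *_ = np.linalg.lstsq(np.hstack([S, S_next]), U, rcond=None)

# 3) State-only demonstrations from a hypothetical expert a = -K s.
#    true_a is kept only to check the result; the learner never uses it.
K = np.array([[0.5, 0.2], [0.1, 0.3]])
demo_s, demo_s_next, true_a = [], [], []
for _ in range(20):
    s = rng.normal(size=2)
    for _ in range(10):
        a = -K @ s
        s_next = step(s, a)
        demo_s.append(s)
        demo_s_next.append(s_next)
        true_a.append(a)
        s = s_next
demo_s, demo_s_next, true_a = map(np.array, (demo_s, demo_s_next, true_a))

# 4) Infer the missing actions, then do behavioral cloning
#    (here a linear regression from states to inferred actions).
inferred_a = np.hstack([demo_s, demo_s_next]) @ W_inv
W_pi, *_ = np.linalg.lstsq(demo_s, inferred_a, rcond=None)

print(np.abs(inferred_a - true_a).max())  # near zero in this noise-free toy case
```

In this noise-free linear setting the inferred actions match the expert's almost exactly; with learned, imperfect models the cloned policy inherits the compounding-error issue mentioned above.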
A third class of techniques that is able to perform imitation learning without requiring knowledge of actions includes those that first focus on learning a representation of the task and then use an RL method with a predefined surrogate reward over that representation. For example, Gupta et al. (2017) have proposed an invariant feature space to transfer skills between agents with different embodiments, Liu et al. (2017) have presented a network architecture which is capable of handling differences in viewpoints and contexts between the imitator and the demonstrator, and Sermanet et al. (2017) have proposed a time-contrastive network which is invariant to both different embodiments and viewpoints. While these techniques represent significant advances in representation learning, each of them uses the same surrogate reward function, i.e., the proximity of the imitator’s and demonstrator’s encoded representation at each time step. One of the downsides of this reward function is that each provided demonstration needs to be time-aligned, i.e., at every time step, each demonstration needs to have advanced to the same point of the task. Other approaches, developed by Merel et al. and Henderson et al., aim to imitate the state distribution of the expert. However, the state distribution does not represent the demonstrator policy, and the learned policy may fail in tasks such as cyclic ones. Moreover, these approaches have thus far focused mostly on experimentation and less on the theoretical underpinnings of the control problem. In our work, we propose a new algorithm to remove the constraints mentioned above, and also provide theoretical analysis of this approach.
3 Preliminaries
Notation
We consider agents within the framework of Markov decision processes (MDPs). In this framework, $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, respectively. An agent at a particular state $s \in \mathcal{S}$ chooses an action $a \in \mathcal{A}$ based on a policy $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$ and transitions to state $s'$ with probability $P(s' \mid s, a)$, which is predefined by the environment transition dynamics. In this process, the agent receives feedback from a cost function $c : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. In this paper, $\overline{\mathbb{R}}$ denotes the extended real numbers $\mathbb{R} \cup \{\infty\}$, and expectation over a policy means the expectation over all the trajectories that it generates.
Inverse Reinforcement Learning (IRL)
As described earlier, one general approach to imitation learning is based on IRL. The first step of this approach is to learn a cost function based on the given state-action demonstrations. This cost function is learned such that it is minimal for the trajectories demonstrated by the expert and maximal for every other policy (Abbeel & Ng, 2004). However, since the problem is under-constrained, in that many policies can lead to the same (demonstrated) trajectories, another constraint is usually imposed as well, which chooses the policy that has the maximum entropy. This method is called maximum entropy inverse reinforcement learning (MaxEnt IRL) (Ziebart et al., 2008). A very general form of this framework can be described as
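$$\text{IRL}_\psi(\pi_E) = \arg\max_{c \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}} -\psi(c) + \Big( \min_{\pi \in \Pi} -\lambda_H H(\pi) + \mathbb{E}_\pi[c(s, a)] \Big) - \mathbb{E}_{\pi_E}[c(s, a)] \tag{1}$$
where $\psi : \mathbb{R}^{\mathcal{S} \times \mathcal{A}} \to \overline{\mathbb{R}}$ is a convex cost function regularizer, $\pi_E$ is the expert policy, $\Pi$ is the space of all the possible policies, and $H(\pi)$ and $\lambda_H$ are the entropy function of the policy and its weighting parameter, respectively. The output here is the desired cost function. The second step of this framework is to input the learned cost function into a standard reinforcement learning problem. An entropy-regularized version of RL can be described as
$$\text{RL}(c) = \arg\min_{\pi \in \Pi} -\lambda_H H(\pi) + \mathbb{E}_\pi[c(s, a)] \tag{2}$$
which aims to find a policy that minimizes the cost function and maximizes the entropy.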
Generative Adversarial Imitation Learning (GAIL)
Recently, Ho & Ermon have shown that by considering a specific function as the cost regularizer $\psi$, the described pipeline ((1) and (2)) can be solved instead as
$$\min_{\pi \in \Pi} \max_{D \in (0,1)^{\mathcal{S} \times \mathcal{A}}} -\lambda_H H(\pi) + \mathbb{E}_\pi[\log(D(s, a))] + \mathbb{E}_{\pi_E}[\log(1 - D(s, a))] \tag{3}$$
where $D : \mathcal{S} \times \mathcal{A} \to (0, 1)$ is a classifier trained to discriminate between the state-action pairs that arise from the demonstrator and the imitator. Excluding the entropy term, the loss function in (3) is similar to the loss of generative adversarial networks (Goodfellow et al., 2014). Instead of first learning the cost function and then learning the policy on top of that, this method directly learns the optimal policy by bringing the distribution of the state-action pairs of the imitator as close as possible to that of the demonstrator.
4 A General Framework for Imitation from Observation
In IRL, both states and actions are available and the goal is to find a cost function that on average has a smaller value for the trajectories generated by the expert policy compared to the ones generated by any other policy. In the case of imitation from observation, however, the demonstrations that the agent receives are limited to the expert’s state-only trajectories. In the context of the IRL-based approaches to imitation learning discussed above, this lack of action information makes it impossible to calculate the term $\mathbb{E}_{\pi_E}[c(s, a)]$ in (1). Consequently, none of the approaches described in Section 3 is directly applicable in this setting.
In imitation from observation, the goal is for the imitator to perform similarly to the expert in the environment, i.e., for the actions of the demonstrator and imitator to have the same effect on the environment (performing the task), rather than taking exactly the same actions. Therefore, instead of characterizing the cost signal as a function of states and actions, $c(s, a)$, we define it as a function of the state transitions, $c(s, s')$. Based on this characterization, we formulate inverse reinforcement learning from observation as
$$\text{IRLfO}_\psi(\pi_E) = \arg\max_{c \in \mathbb{R}^{\mathcal{S} \times \mathcal{S}}} -\psi(c) + \Big( \min_{\pi \in \Pi} \mathbb{E}_\pi[c(s, s')] \Big) - \mathbb{E}_{\pi_E}[c(s, s')] \tag{4}$$
which outputs $\tilde{c}$. Note that in (4) we ignore the entropy term so as to simplify the theoretical analysis presented in Section 5. Evidence from Ho & Ermon suggests that doing so is acceptable from an empirical perspective (they set $\lambda_H = 0$ in most of their successful experiments). We leave detailed analysis of the effect of this term to future work. From a high-level perspective, in imitation from observation, the goal is to enable the agent to extract what the task is by observing some state sequences. Intuitively, this extraction is possible because we expect the beneficial state transitions for any given task to form a low-dimensional manifold within the $\mathcal{S} \times \mathcal{S}$ space. Thus, the intuition behind our definition of the cost function is to penalize based on how close each transition is to that manifold.
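Now, using an RL algorithm with the cost function $\tilde{c}$ amounts to solving:
$$\text{RL}(\tilde{c}) = \arg\min_{\pi \in \Pi} \mathbb{E}_\pi[\tilde{c}(s, s')] \tag{5}$$
where the output, $\tilde{\pi}$, is the imitation policy.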
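To make the RL step with a state-transition cost concrete, the following is a minimal tabular sketch. It is an illustration rather than the method used in the paper: it assumes a known transition tensor `P[s, a, s']`, a discount factor `gamma`, and plain value iteration, none of which are specified in the text. The only point is that a cost on $(s, s')$ reduces to an expected cost on $(s, a)$ once the dynamics are averaged over.

```python
import numpy as np

def solve_with_transition_cost(P, c, gamma=0.9, iters=500):
    """Value iteration for a cost defined on state transitions.

    P[s, a, s2] are transition probabilities, c[s, s2] is the transition cost.
    Returns the greedy (cost-minimizing) policy and the optimal cost-to-go.
    """
    # Expected one-step cost of each (s, a): sum over s2 of P(s2|s,a) * c(s, s2).
    c_sa = np.einsum("sat,st->sa", P, c)
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = c_sa + gamma * (P @ V)  # (P @ V)[s, a] = sum over s2 of P[s, a, s2] * V[s2]
        V = Q.min(axis=1)
    return Q.argmin(axis=1), V

# Hypothetical 3-state chain: action 0 moves left, action 1 moves right.
n = 3
P = np.zeros((n, 2, n))
for s in range(n):
    P[s, 0, max(s - 1, 0)] = 1.0
    P[s, 1, min(s + 1, n - 1)] = 1.0

# Transitions that end in the last state are free; all others cost 1.
c = np.ones((n, n))
c[:, n - 1] = 0.0

policy, V = solve_with_transition_cost(P, c)
print(policy, V)
```

Here the cost-minimizing policy moves right in every state, since only transitions into the last state are free.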
5 Generative Adversarial Imitation from Observation
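Having developed the general framework in (4), we now propose a specific algorithm, generative adversarial imitation from observation (GAIfO). To this end, we first define the state-transition occupancy measure, $\rho^s_\pi : \mathcal{S} \times \mathcal{S} \to \mathbb{R}$, as
$$\rho^s_\pi(s_i, s_j) = \sum_a P(s_j \mid s_i, a)\, \pi(a \mid s_i) \sum_{t=0}^{\infty} \gamma^t P(s_t = s_i \mid \pi) \tag{6}$$
This occupancy measure corresponds to the distribution of state transitions that an agent encounters when using policy $\pi$, where $\gamma$ is the discount factor. We define the set of valid state-transition occupancy measures as $\mathcal{P}^s \triangleq \{\rho^s_\pi : \pi \in \Pi\}$.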
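As a sanity check on this definition, the sketch below computes a discounted state-transition occupancy measure for a small random tabular MDP and confirms that the expected discounted transition cost under the policy equals the sum of occupancy times cost. The initial-state distribution `mu0`, the discount `gamma`, and the random MDP are assumptions made for the example only.

```python
import numpy as np

def state_transition_occupancy(P, pi, mu0, gamma):
    """rho[s, s2]: discounted occupancy of the transition s -> s2 under pi.

    P[s, a, s2] are transition probabilities, pi[s, a] is the policy,
    and mu0[s] is an (assumed) initial-state distribution.
    """
    # State-to-state kernel under pi: T[s, s2] = sum_a pi(a|s) P(s2|s, a).
    T = np.einsum("sa,sat->st", pi, P)
    # Discounted visitation d[s] = sum_t gamma^t Pr(s_t = s | pi),
    # which satisfies d = mu0 + gamma * T^T d.
    d = np.linalg.solve(np.eye(len(mu0)) - gamma * T.T, mu0)
    return d[:, None] * T

rng = np.random.default_rng(0)
n_s, n_a, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # shape (n_s, n_a, n_s)
pi = rng.dirichlet(np.ones(n_a), size=n_s)        # shape (n_s, n_a)
mu0 = np.full(n_s, 1.0 / n_s)
c = rng.normal(size=(n_s, n_s))                   # arbitrary transition cost

rho = state_transition_occupancy(P, pi, mu0, gamma)
cost_via_rho = np.sum(rho * c)

# Same quantity by propagating the state distribution step by step.
T = np.einsum("sa,sat->st", pi, P)
mu, cost_via_rollout = mu0.copy(), 0.0
for t in range(1000):
    cost_via_rollout += gamma**t * mu @ (T * c).sum(axis=1)
    mu = mu @ T

print(cost_via_rho, cost_via_rollout)
```

The two printed values agree up to numerical precision, which is the identity that lets an expectation over a policy be rewritten as a linear function of its occupancy measure.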
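We now introduce a proposition which is the foundation of our algorithm. In the following proposition we use the convex conjugate concept, which is defined as follows: for a function $f : \mathbb{R}^{\mathcal{S} \times \mathcal{S}} \to \overline{\mathbb{R}}$, the convex conjugate $f^* : \mathbb{R}^{\mathcal{S} \times \mathcal{S}} \to \overline{\mathbb{R}}$ is defined as $f^*(x) = \sup_{y \in \mathbb{R}^{\mathcal{S} \times \mathcal{S}}} x^\top y - f(y)$.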
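Proposition 5.1.
$\text{RL} \circ \text{IRLfO}_\psi(\pi_E)$ and $\arg\min_{\pi \in \Pi} \psi^*(\rho^s_\pi - \rho^s_{\pi_E})$ induce policies that have the same state-transition occupancy measure, $\tilde{\rho}^s$.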
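In the rest of this section, we prove this proposition and then, by choosing a specific regularizer, we present our algorithm. At the end we propose a practical implementation of the algorithm. To prove the proposition, we first define another problem, $\overline{\text{RL}} \circ \overline{\text{IRLfO}}_\psi(\pi_E)$, and argue that it outputs a state-transition occupancy measure which is the same as that induced by $\text{RL} \circ \text{IRLfO}_\psi(\pi_E)$. We define
$$\overline{\text{IRLfO}}_\psi(\pi_E) = \arg\max_{c \in \mathbb{R}^{\mathcal{S} \times \mathcal{S}}} -\psi(c) + \Big( \min_{\rho^s_\pi \in \mathcal{P}^s} \sum_{s, s'} \rho^s_\pi(s, s')\, c(s, s') \Big) - \sum_{s, s'} \rho^s_{\pi_E}(s, s')\, c(s, s') \tag{7}$$
where the output is a cost function $\tilde{c}$. Note that $\mathbb{E}_\pi[c(s, s')] = \sum_{s, s'} \rho^s_\pi(s, s')\, c(s, s')$, so (4) and (7) are similar except that the former is optimized over $\pi \in \Pi$ and the latter over $\rho^s_\pi \in \mathcal{P}^s$. If we consider using an RL method to find a state-transition occupancy measure under $\tilde{c}$, (5) can be rewritten as
$$\overline{\text{RL}}(\tilde{c}) = \arg\min_{\rho^s_\pi \in \mathcal{P}^s} \sum_{s, s'} \rho^s_\pi(s, s')\, \tilde{c}(s, s') \tag{8}$$
which would now output the desired state-transition occupancy measure $\tilde{\rho}^s$.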
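Lemma 5.1.
$\overline{\text{RL}} \circ \overline{\text{IRLfO}}_\psi(\pi_E)$ outputs a state-transition occupancy measure, $\tilde{\rho}^s$, which is the same as that induced by $\text{RL} \circ \text{IRLfO}_\psi(\pi_E)$.
Proof.
From the definition of $\mathcal{P}^s$, the mapping from $\Pi$ to $\mathcal{P}^s$ is surjective, i.e., for every $\rho^s_\pi \in \mathcal{P}^s$, there exists at least one $\pi \in \Pi$. Therefore, we can say $\rho^s_{\tilde{\pi}} = \tilde{\rho}^s$ (where $\tilde{\pi}$ and $\tilde{\rho}^s$, as already defined, are the outputs of (5) and (8), and $\rho^s_{\tilde{\pi}}$ is the state-transition occupancy measure that corresponds to $\tilde{\pi}$). Therefore, solving $\text{RL} \circ \text{IRLfO}_\psi(\pi_E)$ results in the same $\tilde{\rho}^s$ as applying $\overline{\text{RL}}$ using the cost function returned by $\overline{\text{IRLfO}}_\psi(\pi_E)$ in (7). ∎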
Note that, in this lemma, the returned policies from these two problems are not necessarily the same. The reason is that the mapping from $\Pi$ to $\mathcal{P}^s$ is not injective, i.e., there could be one or multiple policies $\pi$ that correspond to the same $\rho^s_\pi$. Consequently, it is not necessarily the case that a policy that gives rise to $\tilde{\rho}^s$ is the same as $\tilde{\pi}$. However, as we discussed in the previous section, in imitation from observation, we are primarily concerned with the effect of the policy on the environment, so this situation is acceptable.
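The lack of injectivity can be seen in a small constructed example. In the hypothetical two-state MDP below, both actions have identical effects, so two clearly different policies produce exactly the same distribution over state transitions. The MDP, the initial-state distribution, and the discount factor are assumptions for illustration.

```python
import numpy as np

def occupancy(P, pi, mu0, gamma=0.9):
    """Discounted state-transition occupancy rho[s, s2] under policy pi."""
    T = np.einsum("sa,sat->st", pi, P)  # state-to-state kernel under pi
    d = np.linalg.solve(np.eye(len(mu0)) - gamma * T.T, mu0)
    return d[:, None] * T

# Two states, two actions; either action deterministically switches the state.
P = np.zeros((2, 2, 2))
P[0, :, 1] = 1.0
P[1, :, 0] = 1.0
mu0 = np.array([1.0, 0.0])

pi_a = np.array([[1.0, 0.0], [1.0, 0.0]])  # always picks action 0
pi_b = np.array([[0.0, 1.0], [0.3, 0.7]])  # a different, stochastic policy

same_occupancy = np.allclose(occupancy(P, pi_a, mu0), occupancy(P, pi_b, mu0))
print(same_occupancy)
```

Both policies have the same effect on the environment, which is the sense in which they are interchangeable for imitation from observation.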
Now we introduce another lemma that helps us in the proof of Proposition 5.1.
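Lemma 5.2.
$\overline{\text{RL}} \circ \overline{\text{IRLfO}}_\psi(\pi_E) = \arg\min_{\rho^s_\pi \in \mathcal{P}^s} \psi^*(\rho^s_\pi - \rho^s_{\pi_E})$.
This lemma is proven in the appendix (presented anonymously at https://tinyurl.com/ybkn8v7n) using the minimax principle (Millar, 1983). Thus far, by combining Lemmas 5.1 and 5.2, we can conclude that $\tilde{\rho}^s$ induced by $\text{RL} \circ \text{IRLfO}_\psi(\pi_E)$ is the same as the output of $\arg\min_{\rho^s_\pi \in \mathcal{P}^s} \psi^*(\rho^s_\pi - \rho^s_{\pi_E})$. Now, we only need one more step to prove Proposition 5.1: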
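Lemma 5.3.
$\arg\min_{\pi \in \Pi} \psi^*(\rho^s_\pi - \rho^s_{\pi_E})$ is a policy that has a state-transition occupancy measure that is the same as the output of $\arg\min_{\rho^s_\pi \in \mathcal{P}^s} \psi^*(\rho^s_\pi - \rho^s_{\pi_E})$.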
The proof of Lemma 5.3 is similar to that of Lemma 5.1. Now based on Lemmas 5.1, 5.2, and 5.3 we can conclude that Proposition 5.1 holds.
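Having proved this proposition, we can solve $\arg\min_{\pi \in \Pi} \psi^*(\rho^s_\pi - \rho^s_{\pi_E})$ instead of $\text{RL} \circ \text{IRLfO}_\psi(\pi_E)$. To this end, we consider the generative adversarial regularizer
$$\psi_{GA}(c) = \begin{cases} \mathbb{E}_{\pi_E}[g(c(s, s'))] & \text{if } c < 0 \\ +\infty & \text{otherwise} \end{cases} \tag{9}$$
where
$$g(x) = \begin{cases} -x - \log(1 - e^x) & \text{if } x < 0 \\ +\infty & \text{otherwise} \end{cases} \tag{10}$$
which is a closed, proper, convex function and has convex conjugate
$$\psi^*_{GA}(\rho^s_\pi - \rho^s_{\pi_E}) = \max_{D \in (0,1)^{\mathcal{S} \times \mathcal{S}}} \sum_{s, s'} \rho^s_\pi(s, s') \log(D(s, s')) + \rho^s_{\pi_E}(s, s') \log(1 - D(s, s')) \tag{11}$$
where $D : \mathcal{S} \times \mathcal{S} \to (0, 1)$ is a discriminative classifier. A similar convex conjugate is derived in Ho & Ermon; however, for the sake of completeness, we prove the properties claimed for (9) and show that (11) is its convex conjugate in the appendix. (This proof closely follows the proofs of Proposition A.1 and Corollary A.1.1 of Ho & Ermon. The only substantive difference is that in our case we consider the state-transition occupancy measure $\rho^s_\pi$ instead of the state-action occupancy measure $\rho_\pi$.)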
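Using the above, the imitation from observation problem can be solved as:
$$\min_{\pi \in \Pi} \max_{D \in (0,1)^{\mathcal{S} \times \mathcal{S}}} \mathbb{E}_\pi[\log(D(s, s'))] + \mathbb{E}_{\pi_E}[\log(1 - D(s, s'))] \tag{12}$$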
We can see that the loss function in (12) is similar to the generative adversarial loss. We can connect this to general GANs if we interpret the expert’s demonstrations as the real data, and the data coming from the imitator as the generated data. The discriminator seeks to distinguish the source of the data, and the imitator policy (i.e., the generator) seeks to fool the discriminator to make it look like the state transitions it generates are coming from the expert. The entire process can be interpreted as bringing the distribution of the imitator’s state transitions closer to that of the expert. We call this process Generative Adversarial Imitation from Observation (GAIfO).
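The discriminator side of this adversarial objective can be sketched on toy data as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the discriminator is a logistic model rather than a neural network, it is trained with plain gradient descent rather than Adam, the transition data are synthetic Gaussians, and the reward `-log D(s, s')` is one plausible choice consistent with rewarding small discriminator outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 1-D states; each row is one transition (s, s').
expert = rng.normal(loc=[0.0, 1.0], scale=0.3, size=(256, 2))
imitator = rng.normal(loc=[0.0, -1.0], scale=0.3, size=(256, 2))

# Maximizing E_pi[log D] + E_piE[log(1 - D)] is the same as minimizing
# cross-entropy with label 1 for imitator data and label 0 for expert data.
X = np.vstack([imitator, expert])
y = np.concatenate([np.ones(len(imitator)), np.zeros(len(expert))])

w, b = np.zeros(2), 0.0
for _ in range(500):
    p = sigmoid(X @ w + b)
    w -= 0.5 * X.T @ (p - y) / len(y)  # gradient of the mean cross-entropy
    b -= 0.5 * np.mean(p - y)

def D(transitions):
    """Discriminator output in (0, 1); small values look expert-like."""
    return sigmoid(transitions @ w + b)

def reward(transitions):
    """Assumed imitation reward: large when D is small."""
    return -np.log(D(transitions) + 1e-8)

print(D(expert).mean(), D(imitator).mean())
```

After training, expert transitions receive low discriminator outputs and therefore high reward, so a policy optimizer driven by this reward is pushed toward expert-like state transitions.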
6 Practical Implementation
Based on the preceding analysis, we now specify our practical implementation of the GAIfO algorithm. We represent the discriminator, $D$, using a multi-layer perceptron with parameters $\theta$ that takes as input a state transition and outputs a value between 0 and 1. We represent the policy, $\pi$, using a multi-layer perceptron with parameters $\phi$ that takes as input a state and outputs an action. We begin by randomly initializing each of these networks, after which the imitator selects an action according to $\pi_\phi$ and executes that action. This action leads to a new state, and we feed both this state transition and the entire set of expert state transitions to the discriminator. The discriminator is updated using the Adam optimization algorithm (Kingma & Ba, 2014), with a cross-entropy loss that seeks to push the output for expert state transitions closer to 0 and the output for the imitator’s state transitions closer to 1. After the discriminator update, we perform trust region policy optimization (TRPO) (Schulman et al., 2015) to improve the policy using a reward function that encourages state transitions that yield small outputs from the discriminator (i.e., those that appear to be from the demonstrator). This process continues until convergence. The algorithm is shown in Algorithm 1 and the framework is summarized in Figure 1.
The implementation described above is only effective for cases in which the demonstration consists of low-dimensional state representations. In particular, the imitation policy maps a single state to the imitating action and the reward function operates on a single state transition. This approach is feasible for cases in which (a) the states can be assumed to be fully observable, and (b) the system is strictly Markovian. However, when considering visual state representations, neither of these assumptions is necessarily valid. Therefore, agents operating in such state spaces are typically provided instead with a recent state history. This is useful because, for example, having knowledge about the velocity of the agent at each time step is important in order to select the correct action, and velocity information is not available when considering a single image.
Therefore, we propose here a second implementation of GAIfO that enables imitation from visual demonstration data. It modifies the implementation used for low-dimensional state representations by adding convolutional layers and using images from multiple time steps as the input to the generator and discriminator. This implementation is summarized in Figure 2.
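One common way to provide images from multiple time steps to a network is to keep a short buffer of recent frames and stack them along a channel axis. The sketch below shows that idea only; the class name `FrameStack`, the number of frames `k`, and the 8x8 frame size are hypothetical and are not taken from the paper.

```python
from collections import deque

import numpy as np

class FrameStack:
    """Keeps the k most recent frames and returns them as one array."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        # At the start of an episode, repeat the first frame k times.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(frame)
        return self.observation()

    def push(self, frame):
        # The deque drops the oldest frame automatically once it is full.
        self.frames.append(frame)
        return self.observation()

    def observation(self):
        # Shape (k, height, width): oldest frame first, newest frame last.
        return np.stack(list(self.frames), axis=0)

stack = FrameStack(k=4)
obs = stack.reset(np.zeros((8, 8)))
for t in range(1, 5):
    obs = stack.push(np.full((8, 8), float(t)))
print(obs.shape)
```

A stacked observation of this kind carries implicit velocity information, since differences between consecutive frames are available to the convolutional layers.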
7 Experimental Setup and Implementation Details
We evaluate our algorithm in domains from OpenAI Gym (Brockman et al., 2016) based on the PyBullet simulator (Coumans & Bai, 2016–2017). In each of the domains, we used trust region policy optimization (TRPO) (Schulman et al., 2015) to train the expert agents, and we recorded the demonstrations using the resulting policy.
The results shown in the figures are the average over ten independent trials. We compare our algorithm against three baselines:

Behavioral Cloning from Observation (BCO) (Torabi et al., 2018): BCO first learns an inverse dynamics model through self-supervised exploration, and then uses that model to infer the missing actions from state-only demonstrated trajectories. BCO then uses the inferred actions to learn an imitation policy using conventional behavioral cloning.

Time Contrastive Networks (TCN) (Sermanet et al., 2017): TCNs use a triplet loss to train a neural network to learn an encoded form of the task at each time step. This loss function brings the states that occur in a small time window closer together in the embedding space and pushes the ones from distant time steps far apart. A reward function is then defined based on the Euclidean distance between the embedded demonstration and the embedded agent’s state at each time step. The imitation policy is learned using RL techniques that seek to optimize this reward function.

Generative Adversarial Imitation Learning (GAIL) (Ho & Ermon, 2016): GAIL is the method reviewed in Section 3; unlike GAIfO, it has access to the demonstrator’s actions.
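The two ingredients of the TCN baseline described above, a triplet loss on embeddings and a distance-based reward, can be sketched as follows. This is a schematic illustration, not the implementation of Sermanet et al. (2017): the embeddings are given directly rather than produced by a network, the margin value is arbitrary, and the reward is taken to be the negative Euclidean distance, which is an assumption about the sign convention.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that pulls the positive toward the anchor and pushes the negative away."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

def tcn_reward(agent_emb, demo_emb):
    """Per-time-step reward: negative distance between agent and demonstration embeddings."""
    return -np.linalg.norm(agent_emb - demo_emb, axis=-1)

# A temporally close frame (positive) and a distant frame (negative).
anchor = np.array([0.0, 0.0])
near = np.array([0.1, 0.0])
far = np.array([2.0, 0.0])
loss_good = triplet_loss(anchor, near, far)   # already satisfied, so zero
loss_bad = triplet_loss(anchor, far, near)    # violated, so positive

# Time alignment: the same behavior delayed by one step is penalized.
demo = np.arange(5, dtype=float)[:, None]               # embeddings 0, 1, 2, 3, 4
aligned = demo.copy()
delayed = np.array([0.0, 0.0, 1.0, 2.0, 3.0])[:, None]  # one step behind
return_aligned = tcn_reward(aligned, demo).sum()
return_delayed = tcn_reward(delayed, demo).sum()
print(loss_good, loss_bad, return_aligned, return_delayed)
```

The last comparison shows why a per-time-step distance reward depends on time alignment: the delayed trajectory visits the same embeddings but is still scored worse.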
8 Results and Discussion
In this section, we present the results of the two sets of experiments described above.
8.1 Low-dimensional State Representations
Figure 3 illustrates the comparative performance of GAIfO in our experimental domains using the low-dimensional state representations. We can see that, for the domains considered here, GAIfO (a) performs very well compared to other IfO techniques, and (b) is surprisingly comparable to GAIL even though GAIfO lacks access to explicit action information.
Figure 3 compares the final performance of the imitation policies learned by different algorithms. We can clearly see that GAIfO outperforms the other imitation from observation algorithms by a large margin in most of the experiments. For the InvertedDoublePendulum domain, we can see that the TCN method does not perform well at all. We hypothesize that this is the case because TCN relies on time synchronization in order to find the imitating policy, i.e., it learns what the state should be at each time step. However, successfully performing the InvertedDoublePendulum task requires the agent to simply keep the pendulum upright, and requiring it to time synchronize with the demonstrator may be too restrictive a requirement. BCO, on the other hand, performs very well in this domain, which demonstrates that, here, the inverse dynamics model learned by BCO is accurate and that the compounding error problem is negligible. We can see that GAIfO also performs very well here, achieving performance similar to that of the expert, which shows that the algorithm has been able to extract the goal of the task and find a reasonable cost function from which to learn the policy.
For the InvertedPendulumSwingup domain, we can see that TCN again does not perform well, perhaps because the goal of the task is not well-represented in the encoding-learning phase. BCO also does not perform well. We hypothesize that this is the case because of the compounding error problem, since performing this task successfully is contingent on taking several specific actions consecutively – deviation from those actions would cause the pendulum to drop down and not reach the goal. GAIfO and GAIL, on the other hand, perform as well as the expert, which reveals that these algorithms have successfully extracted the goal and learned the task.
For both the Hopper and Walker2D domains, it can be seen that, again, TCN does not work well. We posit that this might be due to the fact that these tasks require behavior that is cyclic in nature, i.e., the expert demonstrations contain repeated states. Because TCN learns a time-dependent representation of the task, it cannot appropriately handle this periodicity and, therefore, the learned representations are not sufficient. GAIfO, however, learns a distribution of the state transitions that is not time-dependent; therefore, periodicity does not affect its performance. BCO also does not perform well in either of these two domains, perhaps again due to the compounding error problem. Learning in these domains has two steps: first, the agent needs to learn to stand, and then the agent needs to learn to walk or hop. With BCO, it would seem that the imitating agent begins to deviate from the expert early in the task, and this early deviation ultimately leads to the imitating agent being unable to learn the secondary walking and hopping behaviors. GAIfO, on the other hand, does not suffer from this issue because it learns by executing its own policy in the environment (on-policy learning) and is therefore able to address deviation from the expert during the learning process.
8.2 Visual State Representations
In this section, we discuss the results of the experiments performed on the cases where the states are represented using the raw visual data. Figure 4 illustrates the comparison between the performance of GAIfO, BCO and TCN. (Here, we do not compare against GAIL because doing so would require a drastic change to the structure of its discriminator in order to process raw visual data, i.e., the discriminator would need to be altered to appropriately mix action and visual data.) In these experiments, like the ones done using the lower-dimensional state representations, the expert is trained with TRPO using low-level state features, and the quantities 0 and 1 represent the performance of a random agent and the expert, respectively. The demonstrations, though, consist of visual recordings using the trained policy. Accordingly, for a more representative baseline, we also learn a policy with TRPO using visual states only (as opposed to the low-dimensional state observations) and represent the performance of that agent using a black dotted line on the plots. This line is important in our comparison because it shows (everything else being similar to the IfO methods) what the resulting performance would have been if the agent had access to the reward. Figure 4 shows that GAIfO outperforms other approaches by a large margin.
It is interesting to notice that, even though GAIfO (like the other IfO techniques) does not achieve the performance of the expert agent (solid line), it does achieve the performance of the TRPO-trained agent that used visual state representations. This suggests that, in these cases, the drop in imitation performance is perhaps due to a fundamental limitation of learning the task from visual data (i.e., partial state observability).
Finally, it can be seen that BCO does not perform well in any of the domains, perhaps due to (a) the complexity of learning dynamics models over visual states, and (b) compounding error. TCN also does not work well, perhaps due to the demonstrations not being time-synchronized.
9 Conclusion and Future Work
In this paper, we presented a general framework for imitation from observation (IRLfO) and then proposed a specific algorithm (GAIfO) for doing so. GAIfO removes the need for several restrictive assumptions that are required for some other IfO techniques, including the need for multiple demonstrations to be time-synchronized. Moreover, the on-policy nature of GAIfO allows it to avoid the compounding error problem experienced by more brittle imitation techniques. The result is an approach that is able to find better imitation policies without the need for action information, and is also able to find imitation policies that perform very close to those found by techniques that do have access to this information.
Regarding future work, note that, in our analysis, we did not consider policy entropy terms in either the IRLfO step or the RL step. Therefore, it would be interesting to include entropy in these equations – as has been shown to be beneficial in some cases (Haarnoja et al., 2017, 2018) – and investigate its effects on the overall problem and results.
Acknowledgements
This work has taken place in the Learning Agents Research Group (LARG) at the Artificial Intelligence Laboratory, The University of Texas at Austin. LARG research is supported in part by grants from the National Science Foundation (IIS-1637736, IIS-1651089, IIS-1724157), the Office of Naval Research (N00014-18-2243), Future of Life Institute (RFP2-000), Army Research Lab, DARPA, Intel, Raytheon, and Lockheed Martin. Peter Stone serves on the Board of Directors of Cogitai, Inc. The terms of this arrangement have been reviewed and approved by the University of Texas at Austin in accordance with its policy on objectivity in research.
References
 Abbeel & Ng (2004) Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning, pp. 1. ACM, 2004.
 Argall et al. (2009) Argall, B. D., Chernova, S., Veloso, M., and Browning, B. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
 Bain & Sammut (1995) Bain, M. and Sammut, C. A framework for behavioral cloning. Machine Intelligence 14, 1995.
 Billard et al. (2008) Billard, A., Calinon, S., Dillmann, R., and Schaal, S. Robot programming by demonstration. In Springer handbook of robotics, pp. 1371–1394. Springer, 2008.
 Bojarski et al. (2016) Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Muller, U., Zhang, J., et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.
 Coumans & Bai (2016–2017) Coumans, E. and Bai, Y. PyBullet, a Python module for physics simulation for games, robotics and machine learning. 2016–2017. URL http://pybullet.org/.
 Finn et al. (2016) Finn, C., Levine, S., and Abbeel, P. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pp. 49–58, 2016.
 Finn et al. (2017) Finn, C., Yu, T., Zhang, T., Abbeel, P., and Levine, S. One-shot visual imitation learning via meta-learning. In Conference on Robot Learning, pp. 357–368, 2017.
 Fu et al. (2017) Fu, J., Luo, K., and Levine, S. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.
 Giusti et al. (2016) Giusti, A., Guzzi, J., Cireşan, D. C., He, F.-L., Rodríguez, J. P., Fontana, F., Faessler, M., Forster, C., Schmidhuber, J., Di Caro, G., et al. A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters, 1(2):661–667, 2016.
 Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
 Gupta et al. (2017) Gupta, A., Devin, C., Liu, Y., Abbeel, P., and Levine, S. Learning invariant feature spaces to transfer skills with reinforcement learning. arXiv preprint arXiv:1703.02949, 2017.
 Haarnoja et al. (2017) Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.
 Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
 Henderson et al. (2018) Henderson, P., Chang, W.-D., Bacon, P.-L., Meger, D., Pineau, J., and Precup, D. OptionGAN: Learning joint reward-policy options using generative adversarial inverse reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 Ho & Ermon (2016) Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

 Laskey et al. (2016) Laskey, M., Staszak, S., Hsieh, W. Y.-S., Mahler, J., Pokorny, F. T., Dragan, A. D., and Goldberg, K. SHIV: Reducing supervisor burden in DAgger using support vectors for efficient learning from demonstrations in high dimensional state spaces. In International Conference on Robotics and Automation (ICRA), pp. 462–469. IEEE, 2016.
 Liu et al. (2017) Liu, Y., Gupta, A., Abbeel, P., and Levine, S. Imitation from observation: Learning to imitate behaviors from raw video via context translation. arXiv preprint arXiv:1707.03374, 2017.
 Merel et al. (2017) Merel, J., Tassa, Y., Srinivasan, S., Lemmon, J., Wang, Z., Wayne, G., and Heess, N. Learning human behaviors from motion capture by adversarial imitation. arXiv preprint arXiv:1707.02201, 2017.
 Millar (1983) Millar, P. W. The minimax principle in asymptotic statistical theory. In Ecole d’Eté de Probabilités de Saint-Flour XI-1981, pp. 75–265. Springer, 1983.
 Nair et al. (2017) Nair, A., Chen, D., Agrawal, P., Isola, P., Abbeel, P., Malik, J., and Levine, S. Combining self-supervised learning and imitation for vision-based rope manipulation. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 2146–2153. IEEE, 2017.
 Ng et al. (2000) Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In ICML, pp. 663–670, 2000.
 Pathak et al. (2018) Pathak, D., Mahmoudieh, P., Luo, G., Agrawal, P., Chen, D., Shentu, Y., Shelhamer, E., Malik, J., Efros, A. A., and Darrell, T. Zero-shot visual imitation. arXiv preprint arXiv:1804.08606, 2018.
 Pomerleau (1989) Pomerleau, D. A. Alvinn: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, pp. 305–313, 1989.
 Ross & Bagnell (2010) Ross, S. and Bagnell, D. Efficient reductions for imitation learning. In Proceedings of the thirteenth International Conference on Artificial Intelligence and Statistics, pp. 661–668, 2010.
 Ross et al. (2011) Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635, 2011.

 Russell (1998) Russell, S. Learning agents for uncertain environments. In Proceedings of the eleventh annual conference on computational learning theory, pp. 101–103. ACM, 1998.
 Schaal (1997) Schaal, S. Learning from demonstration. In Advances in Neural Information Processing Systems, pp. 1040–1046, 1997.
 Schaal et al. (2003) Schaal, S., Ijspeert, A., and Billard, A. Computational approaches to motor learning by imitation. Philosophical Transactions of the Royal Society B: Biological Sciences, 358(1431):537–547, 2003.
 Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.
 Sermanet et al. (2017) Sermanet, P., Lynch, C., Hsu, J., and Levine, S. Time-contrastive networks: Self-supervised learning from multi-view observation. arXiv preprint arXiv:1704.06888, 2017.
 Sutton & Barto (1998) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
 Torabi et al. (2018) Torabi, F., Warnell, G., and Stone, P. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954, 2018.
 Zhou et al. (2017) Zhou, L., Xu, C., and Corso, J. J. Towards automatic learning of procedures from web instructional videos. arXiv preprint arXiv:1703.09788, 2017.
 Ziebart et al. (2008) Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pp. 1433–1438. Chicago, IL, USA, 2008.