Humans have an extraordinary ability to perform complex operations by watching others. How do we achieve this? Imitation requires inferring the goal or intention of the person being imitated, translating these goals into one’s own context, mapping the third-person actions to first-person actions, and finally using these translated goals and mapped actions to perform low-level control. For example, as shown in Figure 1, imitating the pouring task involves not only understanding how to change object states (tilting one glass over another), but also imagining how to adapt goals to novel objects in the scene, followed by low-level control to accomplish the task.
†Work done at CMU and UC Berkeley. Correspondence to firstname.lastname@example.org
As one can imagine, simultaneously learning these functions is extremely difficult. Therefore, most of the classical work in robotics has focused on a much more restricted version of the problem. One of the most common setups is learning from demonstration (LfD) (Pomerleau, 1989; Argall et al., 2009; Schaal, 1999; Ng and Russell, 2000; Akgun et al., 2012; Zhang et al., 2017), where demonstrations are collected either by manually actuating the robot, i.e., kinesthetic demonstrations, or by controlling it via teleoperation. LfD involves learning a policy from such demonstrations in the hope that it will generalize to new locations/poses of the objects in unseen scenarios. Some recent works explore a somewhat more general version in which a robot learns to imitate a video of the demonstration collected either from the robot’s own viewpoint (Pathak et al., 2018) or from an expert viewpoint that differs only slightly (Yu et al., 2018).
In this paper, we tackle the generalized setting of learning from third-person demonstrations. Our agent first observes a video of a human demonstrating the task in front of it, and then performs that task by itself. We do not assume any access to the state-space information of the environment and learn directly from raw camera images. To be successful, the robot needs to translate the observed goal states to its own context (imagine the goals in its viewpoint) as well as map the third-person actions to its trajectory. One way to solve this would be to use classical vision methods that estimate the location/pose of objects as well as the human expert and then map the keypoints to robot actions. However, hard-coding the correspondence from human keypoints to robot morphology is often non-trivial, and this overall multi-stage approach is difficult to generalize to unseen object/task categories. Another way is to leverage modern deep learning algorithms to learn an end-to-end function that maps video frames of the human demonstration to the series of joint angles required to perform the task. This function can be trained in a supervised manner with ground truth kinesthetic demonstrations. Unfortunately, today’s deep learning vision algorithms require millions of images for training. While recent approaches (Yu et al., 2018) attempt to handle this challenge via meta-learning, the models for each of the tasks are trained separately and are difficult to generalize to new tasks.
We propose an alternative approach that injects hierarchical structure into the learning process, between inferring the high-level intention of the demonstrator and learning the low-level controller that performs the desired task. We decouple the end-to-end pipeline into two modules. First, a high-level module generates a goal conditioned on the human demonstration video (third-person view) and the robot’s current observation (first-person view). It predicts a visual sub-goal in the first-person view that roughly corresponds to an intermediate way-point in achieving the intended task described in the demonstration video. Generating a visual sub-goal is a difficult learning problem and, hence, we employ a conditional variant of Generative Adversarial Networks (GANs) to generate realistic renderings (Goodfellow et al., 2014; Mirza and Osindero, 2014; Pathak et al., 2016; Isola et al., 2017). Second, a low-level controller module outputs a sequence of actions to achieve this visual sub-goal from its current observation. Both modules are trained in a supervised manner using human videos and robot joint-angle trajectories, which are paired (with respect to objects and tasks) but unaligned (with respect to time sequence). Our overall approach is summarized in Figure 2. The key advantage of this modular separation into a task-specific goal-generator and a task-independent low-level controller is improved efficiency: the data-hungry low-level controller is shared across all tasks, which makes it (a) sample-efficient (in terms of data required per task) and (b) robust and less prone to overfitting.
We show experiments on a real robotic platform using Baxter across two scenarios: pouring and placing objects in a box. We first systematically evaluate the quality of both the high-level and low-level modules individually given perfect information on held-out test examples of human video and robot trajectories. We then ablate the generalization properties of these modules across the same task with different scenarios and different tasks with different scenarios. Finally, we deploy the complete system on the Baxter robot for performing tasks with novel objects and demonstrations.
2 Problem Setup: Third Person Visual Imitation
Consider a robotic agent observing its current observation state at time t. The action space of the robot is a vector of joint angles. The robot also observes, in third-person view, a sequence of images (i.e., a video) of a human demonstrating the task. Our goal is to train the agent such that, at inference, it can follow a video of a novel human demonstration, starting from its initial state, by predicting a sequence of joint angle configurations.
Our goal is to learn an agent that can imitate the action performed by the human expert in the third-person video. We want to imitate only from raw pixels, without access to full-state information about the environment. At training time, we have access to a video of the human expert demonstrating an object manipulation task, the same demonstration performed kinesthetically on the robot and recorded as a sequence of joint angle states, and a time series of the robot’s first-person image observations. We leverage a recently released dataset of human demonstration videos and robot trajectories (Sharma et al., 2018) in which the demonstrations and trajectories are paired, but not exactly aligned in time. We sub-sample the robot and human demonstration sequences, which helps to roughly align them. In our setup, we have access to all three time series at training time, but only to the human demonstration image sequence at test time. The other two time series are predicted or generated by our algorithm.
3 Hierarchical Controllers for Imitation
An end-to-end model that maps the human demonstration video and the robot’s current observation directly to robot trajectories would require a large number of human demonstrations. Instead, we inject structure into the learning process by decoupling the imitation signal into what needs to be done and how it needs to be done. Decoupling makes our approach modular and more sample-efficient than end-to-end learning. It also makes the system more interpretable: goal inference is disentangled from the control task, which allows us to visualize the intermediate sub-goals.
Our approach is a two-level hierarchy of modules. The high-level module is a goal generator that infers the goal in pixel space from a human video demonstration and translates it into the context of the robot’s environment in the form of a pixel-level representation. The low-level module is an inverse controller, which follows up on the cues generated by the visual goal inference model and produces an action for the robot to execute. These models are trained independently and, at test time, are executed alternately for the robot to accomplish the multi-step manipulation task, as illustrated in Figure 2.
3.1 High-Level Module: Goal Generator
The role of the high-level module is to translate the human demonstration images into sub-goal images that are understandable to the robot. This high-level goal-generator can be learned by leveraging the paired examples of human demonstration video and robot demonstration video in our training data. The most straightforward formulation is to express the goal-generator as image translation, i.e., translating a human demonstration image into a robot demonstration image. Image translation is a well-studied problem in computer vision, and approaches like Pix2Pix (Isola et al., 2017) and CycleGAN (Zhu et al., 2017) could be deployed as-is. However, human and robot demonstration images differ starkly in viewpoint (third-person vs. first-person) and appearance (human arm vs. robotic arm), which makes these models much harder to train and difficult to generalize, as shown in Section 6.
We propose to handle this issue by translating the change in the human demonstration image instead of the image itself. In particular, we task the goal-generator with translating the current robot observation image in the same manner as the corresponding human demonstration image is translated into the next image in the sequence. This forces the goal-generator to focus on how the pixels should move (re-rendering) instead of the much harder task of generating the entire pixel distribution in the first place (generation). An illustration is shown in Figure 3. Further, in order to generate realistic-looking sub-goals, we represent the goal-generator via a conditional version of generative adversarial networks (Goodfellow et al., 2014; Mirza and Osindero, 2014; Pathak et al., 2016; Isola et al., 2017) with a U-Net (Ronneberger et al., 2015) style architecture.
At any particular instant t, the input to the goal generator is the visual state of the robot at time t together with two visual states of the human demonstration: the frame at time t and a later frame a fixed number of steps ahead. The model is trained to generate the visual state of the robot at that later step. The overall objective combines an adversarial loss with an L1 reconstruction loss.
The discriminator is the GAN classification network; its real examples are sampled from the set of real robot observations in the training data, and the generator’s input triplets (robot observation, current human frame, later human frame) are randomly sampled from the time series of human demonstrations and corresponding robot observations. In practice, we resort to using a wider context around the human demonstration images, for instance, more frames surrounding the two human frames, especially when the human and robot demonstrations are not aligned. The L1 loss ensures that the correct frame is generated, while the adversarial discriminator loss ensures that the generated samples are realistic (Pathak et al., 2016).
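One way to write an objective consistent with this description is the standard conditional GAN loss with an added L1 term, in the style of Isola et al. (2017). This is a sketch rather than the exact formulation: the symbols are assumptions introduced here for illustration, with $G$ the goal generator, $D$ the discriminator, $s_t$ the robot observation at time $t$, $h_t$ and $h_{t+n}$ the human demonstration frames separated by an offset $n$, and $\lambda$ a weighting factor.

```latex
\min_{G}\,\max_{D}\;
\mathbb{E}_{s}\big[\log D(s)\big]
+ \mathbb{E}_{(s_t,\,h_t,\,h_{t+n})}\big[\log\big(1 - D(G(s_t, h_t, h_{t+n}))\big)\big]
+ \lambda\,\mathbb{E}_{(s_t,\,h_t,\,h_{t+n},\,s_{t+n})}\big[\lVert G(s_t, h_t, h_{t+n}) - s_{t+n}\rVert_1\big]
```

The first two terms form the adversarial game that pushes generated sub-goals toward the distribution of real robot observations; the last term ties each generated sub-goal to the ground-truth robot frame at the later step.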
3.2 Low-Level Module: Inverse Controller
The main purpose of the low-level inverse controller is to achieve the goals set by the goal generator. The inverse controller takes as input the present visual state of the robot along with the predicted visual state of the robot for the next time step, and predicts the action that the robot should take to make the transition to that next state. Since the task we test on may be performed by the left or the right arm of the robot depending on the human demonstration, we concatenate the seven joint angle states of the left and the right arm of the Baxter robot. In our case, the predicted action is therefore a 14-dimensional tuple of the joint angles of the robot’s arms. The inverse model uses spatial information from the images of the present visual state of the robot and the generated goal visual state to predict the action. The network is inspired by the ResNet-18 model (He et al., 2016b) and is initialized with weights obtained by pretraining the network on ImageNet. An illustration of our controller is shown in Figure 3.
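As a small illustration of the action representation described above, the sketch below builds a 14-dimensional target from two 7-dimensional arm vectors. The function name `make_action` and the left-then-right ordering are assumptions made for this example; the text only states that the two sets of seven joint angles are concatenated.

```python
from typing import List, Sequence

JOINTS_PER_ARM = 7  # seven joint angles per Baxter arm, as stated in the text


def make_action(left_arm: Sequence[float], right_arm: Sequence[float]) -> List[float]:
    """Concatenate left- and right-arm joint angles into one 14-D action.

    The left-then-right ordering is an assumption made for illustration.
    """
    if len(left_arm) != JOINTS_PER_ARM or len(right_arm) != JOINTS_PER_ARM:
        raise ValueError("each arm must provide exactly 7 joint angles")
    return list(left_arm) + list(right_arm)


# Example: a regression target for the inverse controller.
action = make_action([0.1] * 7, [-0.2] * 7)
```

Regressing both arms jointly lets a single network handle demonstrations performed with either arm, at the cost of also predicting the (mostly static) angles of the idle arm.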
An appealing aspect of decoupling goals from the controller is that the controller need not be specific to a particular task. We can share the inverse controller across different types of tasks such as pouring, picking, and sliding. A further advantage of decoupling goal inference from the inverse model is the ability to utilize additional self-supervised data (tuples of current observation, action, and next observation), so that training does not have to rely only on perfectly curated demonstrations. We leave self-supervised training for future work.
3.3 Inference: Third-person Imitation
At inference, we run our high-level goal-generator and low-level inverse model in an alternating manner. Given the robot’s current observation and the human demonstration sequence, the goal-generator first generates a sub-goal. The low-level controller then outputs the series of robot joint angles needed to reach the sub-goal state. This process continues until the final image of the human demonstration.
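The alternating procedure can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the released implementation: `imitate`, `goal_generator`, `inverse_model`, `execute`, and `step` are hypothetical names, and the two learned modules are passed in as plain callables so the control flow can be read on its own.

```python
from typing import Any, Callable, List, Sequence


def imitate(
    human_frames: Sequence[Any],
    first_observation: Any,
    goal_generator: Callable[[Any, Any, Any], Any],
    inverse_model: Callable[[Any, Any], Any],
    execute: Callable[[Any], Any],
    step: int = 1,
) -> List[Any]:
    """Alternate sub-goal generation and control until the demo video ends."""
    observation = first_observation
    actions = []
    for t in range(0, len(human_frames) - step, step):
        # High level: translate the change between two human frames into a
        # first-person visual sub-goal for the robot.
        sub_goal = goal_generator(observation, human_frames[t], human_frames[t + step])
        # Low level: predict the joint angles that move the robot toward it.
        action = inverse_model(observation, sub_goal)
        # Act (on a robot or simulator) and read back the new observation.
        observation = execute(action)
        actions.append(action)
    return actions
```

Because the goal generator is conditioned on the robot's latest real observation rather than on its own previous output, errors made by the controller at one step are visible to the generator at the next.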
4 Implementation Details and Baselines
We use the MIME dataset (Sharma et al., 2018) of human demonstrations to train our decoupled hierarchical controllers. The dataset is collected using a Baxter robot and contains 8260 paired human and kinesthetic robot demonstrations spanning 20 tasks. For the pouring task, we train on 230 demonstrations, validate on 29, and test on 30. For the models trained on multiple tasks, 6632 demonstrations were used for training, 829 for validation, and 829 for testing. Each example contains a triplet of the human demonstration image sequence, the robot demonstration images, and the robot’s joint angle states. We sub-sampled the trajectories (both images and joint angle states) to a fixed length of 200 time steps for training our models. For training the low-level inverse model, we perform regression in the action space of the robot, which is a fourteen-dimensional joint angle state. All training and implementation details related to our hierarchical controllers are provided in Section A.1 of the supplementary material.
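The fixed-length sub-sampling step can be illustrated as follows. The text does not say how the 200 time steps are chosen, so the uniform spacing used here, along with the function name `subsample`, is an assumption made for this sketch.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample(sequence: Sequence[T], target_len: int = 200) -> List[T]:
    """Pick `target_len` evenly spaced elements, keeping the first and last.

    Uniform spacing is an assumption; sequences shorter than `target_len`
    are padded by repeating elements.
    """
    if target_len < 2 or len(sequence) < 2:
        raise ValueError("need at least two samples and two target steps")
    last = len(sequence) - 1
    indices = [round(i * last / (target_len - 1)) for i in range(target_len)]
    return [sequence[i] for i in indices]


# Applying the same routine to the human video, the robot video, and the
# joint-angle sequence gives three streams of equal length.
frames = subsample(list(range(1000)))
```

Sub-sampling every stream to the same length is what gives the paired but unaligned human and robot sequences a rough temporal correspondence.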
We first perform ablations of our modules and compare them to different possible architectures, including CycleGAN (Zhu et al., 2017) and prediction models based on L1 and L2 losses. We then compare our joint approach to two baselines: (a) End-to-end baseline (Sharma et al., 2018): In this approach, both inference and control are handled by a single network. The inputs to the network are consecutive frames of the human demonstration around a time step t, along with the image of the robot demonstration at time step t. The network predicts the action that the robot must take at time step t to transition to its state at time step t+1. (b) DAML (Yu et al., 2018): The second baseline we compare against is Domain Adaptive Meta-Learning (DAML). The algorithm aims to recover the best network parameters for a task via a single gradient update at test time using meta-learning.
5 Results: Generalization of Individual Hierarchical Modules
The modules of the hierarchy run alternately at test time, and hence each module relies on the other’s performance at the previous step. Therefore, in this section, we evaluate the generalization ability of each module individually while assuming ground truth access to the other. We evaluate the top-level goal generator assuming the inverse model is perfect, and evaluate the inverse model assuming access to a perfect goal generator. We study generalization across three different scenarios: new locations, new objects, and new tasks.
5.1 Generalization to new positions of the same object
Goal Generator: The ability to condition inferred goals on the robot’s own setting is a crucial aspect of our approach. A sensitivity analysis of the goal generator with respect to the position of the objects helps us understand how well it generalizes across object positions. In Figure 4(b), we show a scenario where the input human demonstration is fixed, but the positions of the objects are varied at test time. The predictions of the goal generator show that it responds appropriately to changes in object position. A quantitative analysis of this positional generalization is performed jointly with the evaluation of generalization to new objects in Table 2.
Inverse model: To check the ability of the inverse model to generalize to new object positions (given a perfect goal generator), we test the inverse model using ground truth images from the test set. This quantitative evaluation is performed jointly with the evaluation of generalization to novel objects in Table 2 and is discussed in the next sub-section.
5.2 Generalization to new objects
We now evaluate the ability of our models to generalize manipulation skills to unseen objects.
Goal Generator: Figure 4(a) shows the ability of the goal generator to generate meaningful sub-goals given a demonstration with novel objects. A quantitative evaluation of goal generation when tested with novel objects in different configurations is shown in Table 2. Our approach outperforms the baselines on all four metrics and generalizes better to new objects both quantitatively (Table 2) and qualitatively (Figure 4(a)). In addition to the baselines shown in Table 2, we also tried an optical flow baseline, which did not perform well and was unable to account for the in-plane rotations that tasks like pouring require. Its performance (L1: 127.28, SSIM: 0.81) is significantly worse than that of the other methods.
Inverse model: A quantitative evaluation of generalization to new objects and locations is shown in Table 2. Our model outperforms all other baselines by a significant margin. The generalization to diverse positions of objects of the inverse model can be attributed to its training across many different positions of diverse objects.
In addition to the baselines in Table 2, we also compare against two feature-matching approaches. First, we compute trajectory-based features of the frames of the human demonstration and then find the nearest neighbors among the other demonstrations in the training set. The joint angles corresponding to the nearest demonstrations are then taken as the prediction. The trajectory-based features were computed using state-of-the-art temporal deep visual features trained on video action datasets (Carreira and Zisserman, 2017). Using these features as keys to match the nearest neighbors resulted in an rMSE of 22.20 with a stderr of 2.14. Second, we used a static feature-based model in which we align human demonstration frames with robot frames in SIFT feature space. This resulted in an rMSE of 45.32 with a stderr of 6.12. Both baselines perform significantly worse than our results shown in Table 2. In particular, SIFT features did not perform well in finding correspondences between the human and robot demonstrations because of the large domain gap.
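The first feature-matching baseline amounts to nearest-neighbor retrieval. The sketch below shows the retrieval step only, assuming each demonstration has already been reduced to a fixed-length feature vector; the function name, the data layout, and the use of Euclidean distance are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Trajectory = List[List[float]]  # one list of joint angles per time step


def nearest_neighbor_actions(
    query_feature: Sequence[float],
    train_set: Sequence[Tuple[Sequence[float], Trajectory]],
) -> Trajectory:
    """Return the joint-angle trajectory of the closest training demonstration.

    `train_set` pairs a video feature vector with the robot trajectory
    recorded for the same demonstration. Euclidean distance is an assumption.
    """
    if not train_set:
        raise ValueError("training set is empty")
    _, best_trajectory = min(
        train_set, key=lambda item: math.dist(query_feature, item[0])
    )
    return best_trajectory
```

Such a baseline can only replay trajectories it has already seen, which is one plausible reason it struggles when objects appear at new positions.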
5.3 Generalization to new tasks
So far, we have tested generalization with respect to objects and their positions. We now evaluate the ability of our approach to generalize across tasks.
Goal Generator: The goal generator is not task-agnostic. We leave training a task-agnostic goal generator for future work. In principle, since both the goal generator and inverse model don’t depend on temporal information, it should potentially be possible to train a task-agnostic Goal Generator.
|Method|Train (15 Tasks)|Test (5 Tasks)|
|---|---|---|
|End to End (Sharma et al., 2018)|23.63 ± 1.06|24.83 ± 1.56|
|DAML (Yu et al., 2018)|35.90 ± 1.56|36.45 ± 1.55|
|Inv. Model (Ours)|18.05 ± 0.76|16.90 ± 1.04|
Inverse Model: The inverse model is not trained to perform a particular task. No temporal knowledge of trajectories is used while training the module. This ensures that, while the model predicts every step of the trajectory, it has no preconceived notion of what the entire trajectory will be. Hence, the role of the low-level controller (inverse model) is decoupled from the intent of the task (goal-generator), making it agnostic to the task. The ability of the model to generalize to new tasks is demonstrated in Table 3. We train on the first 15 tasks from the MIME dataset and test on held-out data from the 15 training tasks as well as from 5 novel tasks. Our model has a much lower error than the baseline methods on both the trained tasks and the novel tasks. We want to note that DAML (Yu et al., 2018) is a generic approach, not primarily designed for task transfer in the third-person setting, and the results in the original paper were shown in the context of single planar-manipulation tasks. It has not been shown to scale to training on multiple task categories together. Hence, further changes might be required to scale DAML for transfer across tasks.
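For readers who want to reproduce this kind of summary statistic, the sketch below computes an rMSE per trajectory and then a mean with its standard error across trajectories. How the errors in the table were actually aggregated is not specified in the text, so this per-trajectory averaging is an assumption.

```python
import math
from typing import Sequence, Tuple


def rmse(predicted: Sequence[Sequence[float]], target: Sequence[Sequence[float]]) -> float:
    """Root mean squared error over all time steps and joints of one trajectory."""
    squared = [
        (p - t) ** 2
        for p_step, t_step in zip(predicted, target)
        for p, t in zip(p_step, t_step)
    ]
    return math.sqrt(sum(squared) / len(squared))


def mean_and_stderr(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error (sample std / sqrt(n)); needs at least 2 values."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)
```

A typical use would be to call `rmse` once per test demonstration and pass the resulting list to `mean_and_stderr`.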
6 Results: Generalization and Evaluation of Joint Hierarchical Model
The final test of our approach is to evaluate how the decoupled models perform when run together. Robot demo videos are on the project website https://pathak22.github.io/hierarchical-imitation/.
We look at two tasks: pouring and placing in a box. In the pouring task, the robot is required to start at a given location and then move to the location of the cup that needs to be poured into. This task requires the model to correctly predict the different parts of the task, namely reaching the goal cup and pouring into it. Since the controller of the robot is imperfect and the predictions can be slightly noisy, we consider a reach to be successful if the robot reaches within 5 cm of the cup. Similarly, we consider pouring to be successful if the robot reaches and performs the pouring action within a 5 cm radius of the cup. These evaluation metrics are similar to those used by Yu et al. (2018).
|End to End (Sharma et al., 2018)|20%|8%|20%|10%|
|DAML (Yu et al., 2018)|25%|15%|20%|10%|
For the task of placing in a box, we consider a trial successful if the robot is able to reach within 5 cm of the box and is then able to drop the object within 5 cm of the box. Further, the models are trained on the task of pouring alone, and we evaluate how they generalize to the task of placing.
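The distance-based success criteria for both tasks can be expressed as a simple check. This sketch assumes positions are given in metres in a common frame; which points on the robot and the object are measured is not stated in the text, so `reach_point` and `action_point` are hypothetical inputs.

```python
import math
from typing import Sequence

SUCCESS_RADIUS_M = 0.05  # the 5 cm threshold stated in the text


def within_radius(point: Sequence[float], target: Sequence[float],
                  radius: float = SUCCESS_RADIUS_M) -> bool:
    """True if `point` lies within `radius` metres of `target`."""
    return math.dist(point, target) <= radius


def task_success(reach_point: Sequence[float], action_point: Sequence[float],
                 target: Sequence[float]) -> bool:
    """A trial counts as a full success only if both the reach and the
    subsequent pour or drop happen within the radius of the target."""
    return within_radius(reach_point, target) and within_radius(action_point, target)
```

Reporting the reach check separately from the full check makes it possible to tell apart failures to approach the object from failures in the final pouring or dropping motion.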
For the high-level goal generator, it is crucial to generate good-quality results over a long horizon to ensure successful execution of the task. Our approach of alternating between a goal generator that predicts high-level goals and an inverse model that follows up on the generated goals outperforms the other approaches, as shown in Table 4. The test sets comprised demonstrations with novel objects placed in random locations. The test required the individual models not only to generalize well but also to work well in tandem, given the possibility of imperfect predictions and actions from one another.
7 Related Work
Inferring the intent of interaction from a human demonstration and successfully enabling a robot to replicate the task in its own environment ties into several related areas, discussed as follows.
Domain Adaptation: Addressing the domain shift between the human demonstrator and the robot (e.g., appearance, viewpoint) is one of the goals of our setup. There has been previous work on transfer in visual space (Zhou et al., 2016; Isola et al., 2017) and on tackling domain shift from simulation environments to the real world (Dosovitskiy et al., 2017; OpenAI, 2018). Some of these approaches map data points from one domain to another (Isola et al., 2017; Zhou et al., 2016). Other approaches aid the transfer by finding domain-invariant representations (Tzeng et al., 2014; Sadeghi and Levine, 2016). Along similar lines, Sermanet et al. (2018) look at learning viewpoint-invariant representations that are then used for third-person imitation. Training such a system would require videos collected from multiple viewpoints. Moreover, learning task-invariant features alone might not be enough to aid the transfer to the robot’s setting because of the differences in physical configuration. Our approach handles these issues via modular controllers.
Learning from Demonstrations (LfD): LfD generally uses demonstrations obtained from trajectories collected by kinesthetic teaching, teleoperation, or using motion capture technology on the robot arm (Pomerleau, 1989; Argall et al., 2009; Schaal, 1999; Ng and Russell, 2000; Akgun et al., 2012; Zhang et al., 2017). LfD has been successful in learning complex tasks from expert human trajectories, for instance, playing table-tennis (Mülling et al., 2013), autonomous helicopter aerobatics, and drone flying (Abbeel and Ng, 2004). Most of these focus on learning a task from a handful of expert demonstrations for a single task. Our goal is to start by using demonstration data collected across some objects and tasks but enable the robot to imitate the task by just watching one video of a human demonstrating the task with new objects.
Explicitly Inferring Rewards: Other approaches explicitly infer the reward associated with performing a task from the human demonstrations through techniques such as inverse reinforcement learning (Rhinehart and Kitani, 2017; Sermanet et al., 2017). The rewards become representations of the sequence of goals of the task. After construction of the reward functions, the robot is trained using reinforcement learning by collecting samples in its environment to maximize the reward. However, such systems end up needing significantly large amounts of real-world data and have to be re-trained from scratch for every new task, which makes them difficult to scale in the real world. In contrast, our supervised learning approach is trained via maximum likelihood and is thus efficient enough to scale to real robots.
Visual Foresight: Visual foresight has been popular for self-supervised robot manipulation (Ebert et al., 2017b; Finn and Levine, 2016; Ebert et al., 2017a; Watter et al., 2015), but it relies on task specification in the form of dots in the image space and on action-conditioned predictions in visual space. Our setting relies on no hand-specified goals; the goals are specified directly from the human demonstration videos. This flexibility lets us specify harder tasks such as pouring, which would be difficult to specify from dots on images alone.
We present decoupled hierarchical controllers for third-person imitation learning. Our approach is capable of inferring the task from a single third-person human demonstration and executing it on a real robot from first-person perspective. Our approach works from raw pixel input and does not make any assumption about the problem setup. Our results demonstrate the advantage of using a decoupled model over an end-to-end approach and other baselines in terms of improved generalization to novel objects in unseen configurations.
Future Directions: Our high-level and low-level modules currently operate at a per-time step level and don’t make use of temporal information, which results in the predicted trajectories being shaky. A naive inverse controller modeled via LSTM could incorporate the temporal information but it easily learns to cheat by memorizing the mean trajectory making it hard to generalize to novel tasks. However, training on lots of tasks together could potentially alleviate this limitation. An added advantage of the explicit decoupling of the models is the ability to utilize additional self-supervised data to train the low-level controller and make it robust to failure and different types of joint configurations. We leave these directions for future work to explore.
We would like to thank David Held, Aayush Bansal, members of the CMU visual learning lab and Berkeley AI Research lab for fruitful discussions. The work was carried out when PS was at CMU and DP was at UC Berkeley. This work was supported by ONR MURI N000141612007 and ONR Young Investigator Award to AG. DP is supported by the Facebook graduate fellowship.
Appendix A Supplementary Material
A.1 Implementation Details
Goal Generator (high-level)
The goal generator uses a pix2pix (Isola et al., 2017) inspired framework. The generator network is a U-Net 128 block with skip connections between the i-th and (n−i)-th layers, where n is the number of layers in the U-Net block. The encoder and decoder architectures are as shown in Figure 3 of the main paper. The input images have a spatial resolution of 128×128: they are randomly jittered by resizing to 140×140 and then cropping back to 128×128. The network is optimized using Adam (Kingma and Ba, 2015) with a learning rate of 0.0002 and momentum parameters β1 and β2.
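The resize-then-crop jitter can be illustrated with a small, dependency-free sketch. It operates on a single-channel image stored as nested lists and uses nearest-neighbor resizing purely for brevity; the interpolation method, the helper names, and the data layout are assumptions, and a real pipeline would use an image library.

```python
import random
from typing import List, Optional

Image = List[List[float]]  # H x W single-channel image, for illustration only


def resize_nearest(image: Image, size: int) -> Image:
    """Nearest-neighbor resize to size x size."""
    h, w = len(image), len(image[0])
    return [
        [image[r * h // size][c * w // size] for c in range(size)]
        for r in range(size)
    ]


def random_jitter(image: Image, load_size: int = 140, crop_size: int = 128,
                  rng: Optional[random.Random] = None) -> Image:
    """Enlarge to load_size, then take a random crop_size x crop_size crop."""
    rng = rng or random.Random()
    enlarged = resize_nearest(image, load_size)
    top = rng.randint(0, load_size - crop_size)
    left = rng.randint(0, load_size - crop_size)
    return [row[left:left + crop_size] for row in enlarged[top:top + crop_size]]
```

When several images form one training example, the same crop offsets would normally be applied to all of them so that they stay spatially registered.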
The input to the network contains the human demonstration images at the current time step and at a later time step, along with the robot demonstration image at the current time step. The output of the network is the robot goal state at the later time step. While we want precise goal predictions, which would require the long multi-step task to be broken into smaller steps, we also require the goal generator to predict goals that look significantly different from the currently observed state so that the inverse controller can predict a change in state. Empirically, after subsampling the trajectories to 200 time steps, we choose the offset between the current and later time steps to best handle this trade-off.
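The input/output pairing described above can be sketched as a sampling routine over one paired demonstration. The function name and signature are hypothetical, and the offset is left as a required argument because its value is not given here; the `offset=5` in the usage line is purely illustrative.

```python
import random
from typing import Any, Optional, Sequence, Tuple


def sample_training_example(
    human_frames: Sequence[Any],
    robot_frames: Sequence[Any],
    offset: int,
    rng: Optional[random.Random] = None,
) -> Tuple[Tuple[Any, Any, Any], Any]:
    """Draw ((robot_t, human_t, human_{t+offset}), robot_{t+offset})."""
    rng = rng or random.Random()
    length = min(len(human_frames), len(robot_frames))
    if not 0 < offset < length:
        raise ValueError("offset must be positive and shorter than the sequences")
    t = rng.randrange(length - offset)  # ensures t + offset is a valid index
    inputs = (robot_frames[t], human_frames[t], human_frames[t + offset])
    target = robot_frames[t + offset]
    return inputs, target


# Usage with placeholder frames; offset=5 is an arbitrary illustrative value.
human = [f"h{i}" for i in range(200)]
robot = [f"r{i}" for i in range(200)]
inputs, target = sample_training_example(human, robot, offset=5, rng=random.Random(1))
```

A larger offset makes the goal image more distinct from the current observation, while a smaller one keeps each predicted step short, which is the trade-off discussed in the paragraph above.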
Inverse Controllers (low-level)
The inverse model, or local controller, consists of the 4 convolution blocks of ResNet-18 (He et al., 2016a) followed by three fully connected layers. The ResNet blocks are initialized with weights pre-trained on ImageNet. The input to the network is the robot state at the current time step and the goal state. The action predicted by the network is a fourteen-dimensional tuple of the joint angle states of both the left and right arms of Baxter. The input images were jittered by randomly cropping 85% of the image to make the model robust to vibrations in the robot arms and camera. The model was trained using Adam (Kingma and Ba, 2015) with a learning rate of 0.001.
A.2 Generalization of Inverse Model: Simulation Experiments
In addition to the real-world experiments discussed in Section 5.2 of the main paper, we also trained an inverse model in simulation with the Sawyer robot. The trajectories used to train the Sawyer model were obtained from a policy trained on reaching, with different objects placed in front of the robot. Demonstrations were created by training policies using proximal policy optimization (PPO). The policies were trained on a diverse set of objects to collect 500 demonstrations. For new objects at different locations at test time, our learned controller achieves a mean RMSE of 6.09 with a stderr of 2.8, which suggests that the controller is robust.
References
- Abbeel and Ng [2004] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.
- Akgun et al. [2012] B. Akgun, M. Cakmak, J. W. Yoo, and A. L. Thomaz. Trajectories and keyframes for kinesthetic teaching: A human-robot interaction perspective. In HRI, March 2012.
- Argall et al. [2009] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. RAS, 2009.
- Carreira and Zisserman [2017] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
- Dosovitskiy et al. [2017] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun. CARLA: An open urban driving simulator. In CoRL, 2017.
- Ebert et al. [2017a] F. Ebert, C. Finn, A. X. Lee, and S. Levine. Self-supervised visual planning with temporal skip connections. CoRR, abs/1710.05268, 2017a.
- Ebert et al. [2017b] F. Ebert, C. Finn, A. X. Lee, and S. Levine. Self-supervised visual planning with temporal skip connections. arXiv preprint arXiv:1710.05268, 2017b.
- Finn and Levine [2016] C. Finn and S. Levine. Deep visual foresight for planning robot motion. CoRR, abs/1610.00696, 2016.
- Goodfellow et al. [2014] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
- He et al. [2016a] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016a.
- He et al. [2016b] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, pages 630–645. Springer, 2016b.
- Isola et al. [2017] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
- Kingma and Ba [2015] D. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
- Mirza and Osindero [2014] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
- Mülling et al. [2013] K. Mülling, J. Kober, O. Kroemer, and J. Peters. Learning to select and generalize striking movements in robot table tennis. Int. J. Rob. Res., 2013.
- Ng and Russell [2000] A. Y. Ng and S. J. Russell. Algorithms for inverse reinforcement learning. In ICML, pages 663–670, 2000.
- OpenAI [2018] OpenAI. Learning dexterous in-hand manipulation. CoRR, abs/1808.00177, 2018.
- Pathak et al. [2016] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
- Pathak et al. [2018] D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell. Zero-shot visual imitation. In ICLR, 2018.
- Pomerleau [1989] D. A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In NIPS, 1989.
- Rhinehart and Kitani [2017] N. Rhinehart and K. M. Kitani. First-person activity forecasting with online inverse reinforcement learning. In ICCV, Oct 2017.
- Ronneberger et al. [2015] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
- Sadeghi and Levine [2016] F. Sadeghi and S. Levine. CAD2RL: Real single-image flight without a single real image. CoRR, abs/1611.04201, 2016.
- Schaal [1999] S. Schaal. Is imitation learning the route to humanoid robots? Trends in cognitive sciences, 1999.
- Sermanet et al. [2017] P. Sermanet, K. Xu, and S. Levine. Unsupervised perceptual rewards for imitation learning. In RSS, 2017.
- Sermanet et al. [2018] P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, and S. Levine. Time-contrastive networks: Self-supervised learning from video. In ICRA, 2018.
- Sharma et al. [2018] P. Sharma, L. Mohan, L. Pinto, and A. Gupta. Multiple interactions made easy (MIME): large scale demonstrations data for imitation. CoRL, 2018.
- Tzeng et al. [2014] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. CoRR, abs/1412.3474, 2014.
- Watter et al. [2015] M. Watter, J. T. Springenberg, J. Boedecker, and M. A. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. CoRR, abs/1506.07365, 2015.
- Yu et al. [2018] T. Yu, C. Finn, A. Xie, S. Dasari, T. Zhang, P. Abbeel, and S. Levine. One-shot imitation from observing humans via domain-adaptive meta-learning. arXiv preprint arXiv:1802.01557, 2018.
- Zhang et al. [2017] T. Zhang, Z. McCarthy, O. Jow, D. Lee, K. Goldberg, and P. Abbeel. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. CoRR, abs/1710.04615, 2017.
- Zhou et al. [2016] T. Zhou, P. Krahenbuhl, M. Aubry, Q. Huang, and A. A. Efros. Learning dense correspondence via 3d-guided cycle consistency. In CVPR, 2016.
- Zhu et al. [2017] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. ICCV, 2017.