In the machine learning community, solving complex sequential prediction problems usually follows one of two approaches: reinforcement learning (RL) or supervised learning (SL), more specifically imitation learning (IL). On the one hand, the learner in conventional IL is required to trust and replicate authoritative behaviors of a teacher. The drawbacks are primarily the need for extensive manually collected training data and the inherent susceptibility to negative behaviors of teachers, since in many realistic scenarios teachers are imperfect. IL often does not generalize well to completely new environments or to control scenarios that are not well represented during training. On the other hand, RL does not specifically require supervision by a teacher, as it searches for an optimal policy that leads to the highest eventual reward. However, a good reward function, which offers the agent the opportunity to learn desirable behaviors, requires tedious and meticulous reward shaping. Recent methods have used RL to learn simpler tasks without supervision, but they require excessive training time and a very fast simulation (high fps). In this paper, we demonstrate that state-of-the-art performance can be achieved by incorporating RL concepts into direct imitation to learn only the successful actions of multiple teachers. We call this approach Observational Imitation Learning (OIL). Unlike conventional IL, OIL enables learning from multiple teachers, with a method for discarding bad maneuvers by using a reward-based online evaluation of the teachers at training time. Furthermore, this approach enables greater control over training in diverse environments that require different control dynamics. Our experiments show that this approach leads to greater robustness and improved performance compared to various state-of-the-art IL and RL methods.
Moreover, OIL allows control networks to be trained in a fully self-supervised fashion: no human annotation is required, since training can rely on automated agents.
We apply OIL to both autonomous driving and high-speed UAV racing in order to demonstrate the diverse scenarios in which it can be applied to solve sequential prediction problems. We follow recent work that tests AI systems through the use of computer games. We use Sim4CV, based on Unreal Engine 4 (UE4), which has both a residential driving environment with a physics-based car and gated racing tracks for UAV racing. The simulator is multi-purpose, as it enables the generation of synthetic image data, reinforcement-based training in real-time, and evaluation on unseen tracks. We demonstrate that OIL makes possible a self-supervised and modular neural network capable of learning how to control both vehicular driving and the more complex task of UAV racing in the simulated Sim4CV environment. Through extensive experiments, we demonstrate that OIL outperforms its teachers, conventional IL and RL approaches, and even humans in simulation.
Contributions. (1) We propose Observational Imitation Learning (OIL) as a new approach for training a stationary deterministic policy that overcomes shortcomings of conventional imitation learning by incorporating reinforcement learning ideas. It learns from an ensemble of imperfect teachers, but only updates the policy with the best maneuvers of each teacher, eventually outperforming all of them. (2) We introduce a flexible network architecture that adapts well to different control scenarios and complex navigation tasks (e.g. autonomous driving and UAV racing) using OIL in a self-supervised manner without any human training demonstrations. (3) To the best of our knowledge, this paper is the first to apply imitation learning to multiple teachers while being robust to teachers that exhibit bad behavior.
2 Related Work
The task of training an actor (e.g. ground vehicle, human, or UAV) to physically navigate through an unknown environment has traditionally been approached through supervised learning (SL), in particular imitation learning (IL), through reinforcement learning (RL), or through a combination of the two. A key challenge is learning a high-dimensional representation of raw sensory input within a complex 3D environment. Similar to our approach in this paper, many recent works such as [7, 22, 23, 31, 11] use modern game engines or driving simulators, where the 3D environment can be controlled, synthetic datasets for self-driving can be generated, and other navigation-based tasks can be evaluated. In the following, we give a brief overview of relevant work in IL, RL, and their intersection.
Imitation Learning (IL). For the particular case of physics-based navigation, IL can be advantageous when high-dimensional feedback can be recorded. It has been applied to both autonomous driving [2, 4, 32, 37] and UAV navigation [32, 9, 19]. In particular, DAGGER has been widely used for many robotic control tasks. However, some limitations prevent DAGGER from scaling up to certain problems in practice. Data augmentation is only corrective: it cannot predict the controls a human driver may choose, and it requires significant fine-tuning, which may not transfer to different speeds or environments. Obtaining optimal or near-optimal expert data can be costly or even infeasible. Moreover, the flaws and mistakes of the teachers are learned along with their differing responses. See Section 5.2 for a comparison of two popular IL methods (Behavioral Cloning and DAGGER) to OIL. Follow-up work such as AggreVaTe and Deeply AggreVaTeD attempts to mitigate this problem by introducing exploratory actions and measuring the actual induced cost instead of optimizing for expert imitation only; these methods also claim exponentially higher sample efficiency than many classical RL methods. A number of improvements in other respects have been published, such as SafeDAgger, which aims to make DAGGER more (policy) query-efficient, and LOLS, which aims to improve upon cases where the reference policy is suboptimal.
Reinforcement Learning (RL). RL provides an alternative to IL by using rewards and many iterations of exploration to help discover the proper response through interactive trial and error. Recent work on autonomous car driving has employed RL [15, 14, 10, 28, 27, 8, 6]. One of the main disadvantages is that RL networks may not be able to discover the optimal outputs in higher-order control tasks. For example, Dosovitskiy et al. find RL to under-perform in vehicular navigation due to the extensive hyperparameter space. RL methods can be divided into three classes: value-based, policy-based, and actor-critic methods. In particular, actor-critic methods, e.g. A3C and DDPG, are among the most popular algorithms in the RL community. However, achieving strong results with RL is difficult, since it is very sensitive to the reward function, it can be sample-inefficient, and it requires extensive training time due to the large policy space (see Section 5.2 for a comparison of DDPG to OIL). Methods such as TRPO have been developed to provide monotonic policy improvements in most cases, but they still require extensive training time.
Combined approaches. Several methods combine the advantages of IL and RL. Most of them tackle the problem of low RL sample efficiency by pre-initializing with suitable expert demonstrations (e.g. CIRL and Guided Policy Search [12, 5]). Others focus on risk awareness, as real-world deployment failures can be costly. We draw inspiration for OIL from these hybrid approaches. In contrast to pure IL, OIL can prevent itself from learning bad demonstrations from imperfect teachers by observing the teachers' behaviors and estimating the advantage or disadvantage of imitating them. Unlike pure RL, it converges to a high-performance policy without excessive exploration, since it is guided by the best teacher behaviors. While sharing the advantage of higher sample efficiency with other hybrid approaches, our method has the specific advantage of inherently dealing well with bad demonstrations, which are a common occurrence in real-world applications.
3 Methodology
After giving a brief review of related learning strategies for sequential decision making (i.e. Markov Decision Processes, imitation learning, and reinforcement learning), we introduce our proposed Observational Imitation Learning (OIL), which enables automatic selection of the best teacher (from multiple teachers) at each time step of online learning.
3.1 Markov Decision Process
OIL is a method that enables a learner to learn from multiple sub-optimal or imperfect teachers and eventually to outperform all of them. To achieve this goal, training needs to be done by repeatedly interacting with an environment. We consider the problem as a Markov Decision Process (MDP) consisting of an agent and an environment. At every time step $t$, the agent observes a state $s_t$ or a partial observation of the state. In our setting, we assume the environment is partially observed, but we use $s_t$ to represent the state or the partial observation for simplicity. Given $s_t$, the agent performs an action $a_t$ within the available action space $\mathcal{A}$ based on its current policy $\pi_\theta$, where $\theta$ is the parameter of the policy. Then, the environment provides the agent a scalar reward $r_t$ according to a reward function $r(s_t, a_t)$ and transitions to a new state $s_{t+1}$ within the state space $\mathcal{S}$ under the environment transition distribution $p(s_{t+1} \mid s_t, a_t)$. After receiving an initial state $s_0$, the agent generates a trajectory $\tau = (s_0, a_0, s_1, a_1, \dots)$ after $T$ time steps. The trajectory can end after a certain number of time steps or after the agent reaches a goal or terminal state. The objective of solving an MDP is to find an optimal policy $\pi^{*}$ from the policy space $\Pi$ that maximizes the expected sum of discounted future rewards at time $t$, popularly known as the value function:

$$V^{\pi}(s_t) = \mathbb{E}\!\left[\sum_{i=t}^{T} \gamma^{\,i-t}\, r(s_i, a_i)\right]$$
where $\gamma \in [0, 1)$ is a discount factor that trades off the importance of immediate and future rewards, and $T$ is the time step at which the trajectory terminates. The optimal policy $\pi^{*}$ maximizes the value function for all $s_t$:

$$\pi^{*} = \arg\max_{\pi \in \Pi} V^{\pi}(s_t), \quad \forall s_t \in \mathcal{S}$$
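As a concrete illustration, the discounted return and its Monte-Carlo estimate can be sketched in a few lines of Python (the reward values and discount below are illustrative, not those used in our experiments):

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of discounted future rewards: R_t = sum_i gamma^(i-t) * r_i."""
    ret = 0.0
    # iterate backwards so each step folds in the already-discounted future
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

def mc_value_estimate(rollouts, gamma=0.99):
    """Monte-Carlo estimate of V(s): average the returns of several rollouts."""
    return sum(discounted_return(r, gamma) for r in rollouts) / len(rollouts)
```

For example, with `gamma=0.5` a reward sequence `[1, 1, 1]` yields a return of `1 + 0.5 + 0.25 = 1.75`.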
3.2 Imitation Learning
Imitation learning (IL) is a supervised learning approach to solving sequential decision-making problems by mimicking an expert policy. Instead of directly optimizing the value function, IL minimizes a surrogate loss function. Let $d_\pi$ denote the average distribution of observations visited when an arbitrary policy $\pi$ is executed for $T$ time steps. Behavioral Cloning, one of the simplest IL algorithms, trains a learner policy network $\pi_\theta$ to fit the inputs (observations) and outputs (actions) of the expert policy $\pi^{\star}$ by minimizing the surrogate loss $\mathbb{E}_{s \sim d_{\pi^{\star}}}\!\left[\ell\big(\pi_\theta(s), \pi^{\star}(s)\big)\right]$.
However, this leads to poor performance because the encountered observation spaces of the learner and the expert are different, thus violating the independent, identically distributed (i.i.d.) assumption of statistical learning approaches and causing compounding errors. DAGGER (Dataset Aggregation) alleviates these errors in an iterative fashion by collecting the state-action pairs visited by the learned policy but labeled by the expert. Its goal is to find a policy that minimizes the surrogate loss under the observation distribution induced by the current policy:

$$\hat{\pi} = \arg\min_{\pi} \; \mathbb{E}_{s \sim d_{\pi}}\!\left[\ell\big(\pi(s), \pi^{\star}(s)\big)\right]$$
A major drawback of DAGGER is that it depends heavily on the expert's performance, while a near-optimal expert is hard to acquire in most tasks.
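The DAGGER procedure above can be sketched as follows; the `expert`, `fit`, and `rollout` callables are stand-ins of ours for an expert policy, a supervised training step, and an environment rollout, not parts of any specific implementation:

```python
def dagger(expert, fit, rollout, n_iters=5):
    """Dataset Aggregation: states come from the learner, labels from the expert.

    expert(s)   -> expert action for state s
    fit(D)      -> policy trained on the labeled pairs in D
    rollout(pi) -> list of states visited when executing policy pi
    """
    dataset = []
    policy = expert                  # iteration 0: behave like the expert
    for _ in range(n_iters):
        states = rollout(policy)     # visit states under the *current* policy
        dataset += [(s, expert(s)) for s in states]  # but label with the expert
        policy = fit(dataset)        # retrain on the aggregated dataset
    return policy
```

The key point is visible in the loop: the training distribution tracks the learner's own visited states, which mitigates compounding errors, but every label still comes from the expert, so the learner can never surpass it.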
3.3 Reinforcement Learning
Reinforcement Learning (RL) is another category of methods for solving the MDP problem by trial and error. RL methods can be divided into three classes: value-based, policy-based, and actor-critic methods. Specifically, actor-critic methods, e.g. A3C and DDPG, are currently among the most popular algorithms in the RL community. The actor-critic method most related to our work is the advantage actor-critic approach, which uses an advantage function for the policy update instead of the typical value function or Q-function. Intuitively, this advantage function evaluates how much improvement is obtained if action $a$ is taken at state $s$ compared to the expected value. It is formally defined as:

$$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$$
3.4 Observational Imitation Learning (OIL)
As discussed in Section 3.2, imitation learning requires a near-optimal teacher and extensive augmentation for exploration, and labeled expert data are expensive to obtain and not scalable. While RL approaches (Section 3.3) do not require supervision and can freely explore an environment, they are time-consuming and may never learn a good policy unless the reward is very well designed. In an effort to combine the strengths of both approaches, we propose Observational Imitation Learning (OIL). Inspired by advantage actor-critic learning, we learn to imitate only the best behaviors of several sub-optimal teachers. We do so by estimating the value function of each teacher and keeping only the best to imitate. As a result, we can learn a policy that outperforms each of its teachers. Learning from multiple teachers also allows for exploration, but only of feasible states, leading to faster learning than in RL.
Since we do not require expert teachers, we can obtain labeled data much more cheaply. We assume easy access to cheap sub-optimal teacher policies (e.g. simple PID controllers). We denote the teacher policy set as $\{\pi^{t_1}, \dots, \pi^{t_K}\}$, where $\pi^{t_k}$ is the policy of teacher $k$. Let $\pi_\theta$ denote the learner policy. We denote the advantage function of the current learner policy compared to teacher policy $\pi^{t_k}$ at state $s$ as:

$$A^{k}(s) = V^{\pi_\theta}(s) - V^{\pi^{t_k}}(s)$$
The advantage function determines the advantage of following the learner policy at state $s$ compared to the teacher policy, where $\pi_\theta$ and $\pi^{t_k}$ are analogous to an actor and a critic policy, respectively. Note that the advantage function in regular RL settings is defined with respect to the learner policy only. In contrast, our advantage function in Equation 6 considers both the learner policy and the teacher policy. In multi-teacher scenarios, we select the most critical teacher, i.e. the one with the highest estimated value, as the critic policy (refer to Algorithm 1).
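A minimal sketch of this teacher selection (the value estimates would come from Monte-Carlo rollouts as in the observing phase below; the function and variable names are ours):

```python
def select_critic(learner_value, teacher_values):
    """Pick the most critical teacher: the one with the highest estimated value.

    Returns (index, advantage), where advantage = V_learner - V_teacher.
    A negative advantage means some teacher still outperforms the learner,
    so the learner should rehearse with that teacher.
    """
    best_k = max(range(len(teacher_values)), key=lambda k: teacher_values[k])
    return best_k, learner_value - teacher_values[best_k]
```

For instance, a learner valued at 5.0 against teachers valued `[4.0, 6.0, 3.0]` selects teacher index 1 with advantage `-1.0`, triggering rehearsal.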
We define our training in terms of observation rounds, each of which is divided into three phases, observing, rehearsing and acting.
Observe. We estimate the value functions of the learner policy and of all teacher policies using Monte-Carlo sampling (rolling out the trajectory of each policy to obtain the returns). We then select the most critical teacher policy as the critic and compute the advantage function between the learner policy and this teacher. If the advantage is negative (i.e. there exists a sub-optimal teacher policy with a higher estimated value than the current learner policy), we enter the rehearsing phase. Otherwise, we go directly to the acting phase.
Rehearse. After computing the advantage, we could optimize the policy with actor-critic methods or by optimizing a surrogate loss. In order to benefit from the fast convergence of IL, instead of optimizing the advantage function directly we optimize the surrogate loss in Equation 7 iteratively as follows:

$$\pi_{\theta_{i+1}} = \arg\min_{\pi} \; \mathbb{E}_{s \sim d_{\pi_{\theta_i}}}\!\left[\ell\big(\pi(s), \pi^{t_{k^{*}}}(s)\big)\right]$$
where $\pi_{\theta_i}$ is the learner policy at the $i$-th iteration. In our implementation, we use a DNN to represent our learner policy as $\pi_\theta$, where $\theta$ is the parameter vector of the neural network. In order to minimize the surrogate loss, we roll out the learner (actor) policy and use the selected teacher (critic) to correct the learner's actions. In other words, we minimize the surrogate loss on states encountered by the learner policy, with actions labeled by the most critical teacher policy. We minimize the surrogate loss by performing gradient descent with respect to $\theta$ on the collected data and update the learner policy until the advantage becomes non-negative or a maximum number of episodes is reached.
Act. After rehearsing, the learner policy performs well on the current trajectory; we then roll out the current policy to a new state for a fixed number of acting steps.
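The three phases can be sketched together as one observation round. The policy and environment objects below are illustrative stand-ins of ours; the actual rollout horizons and episode limits used in our experiments are given in Section 5:

```python
def oil_round(learner, teachers, evaluate, rehearse_step, act, max_episodes=10):
    """One OIL observation round: observe, (maybe) rehearse, then act.

    evaluate(policy)       -> Monte-Carlo value estimate from a rollout
    rehearse_step(l, t)    -> one surrogate-loss update of learner l toward teacher t
    act(l)                 -> roll the learner forward to a new state
    """
    # Observe: value-estimate everyone and pick the most critical teacher.
    v_teachers = [evaluate(t) for t in teachers]
    k = max(range(len(teachers)), key=lambda i: v_teachers[i])

    # Rehearse: imitate the critical teacher until the advantage is non-negative
    # or the episode budget is spent.
    episodes = 0
    while evaluate(learner) < v_teachers[k] and episodes < max_episodes:
        rehearse_step(learner, teachers[k])
        episodes += 1

    # Act: advance with the (possibly improved) learner policy.
    act(learner)
    return k, episodes
```

Note that only the most critical teacher supplies labels in a given round, which is how bad maneuvers of weaker teachers are discarded rather than imitated.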
4 Network and Training Details
In this section, we present a modular network architecture for autonomous driving and UAV racing that decomposes the navigation task into a high-dimensional perception module (trained with self-supervision) and a low-dimensional control module (trained with OIL).
4.1 Modular Architecture
The fundamental modules of our proposed system are summarized in Figure 2. We use a modular architecture to reduce the required training time by introducing an intermediate representation layer. The overall neural network consists of two modules: a Perception Network and a Control Network. The input state includes the image and the physical measurements of the vehicle's state (i.e. current orientation and velocity). The action is a control signal for the car (G: Gas/Brake, S: Steering) or the UAV (T: Throttle, A: Aileron/Roll, E: Elevator/Pitch, R: Rudder/Yaw). The Perception Network $P$ is parameterized by $\theta_p$ and the Control Network $C$ by $\theta_c$. The control network takes the intermediate representation predicted by the perception network and the vehicle's state as input, and outputs the final control predictions. The whole policy can be described as follows:

$$\pi_\theta(s) = C_{\theta_c}\big(P_{\theta_p}(I),\, m\big), \quad s = (I, m)$$

where $I$ is the input image and $m$ the physical measurements.
The overall loss is defined in Equation 9 as a weighted sum of the perception and control losses. Note that the perception loss comes from self-supervision, by minimizing the difference between the ground-truth intermediate representation and the predicted intermediate representation, while the control loss in Equation 10 comes from applying OIL to learn from multiple imperfect teachers (automated PID controllers in our case) by minimizing the surrogate loss in Equation 7. In our experiments, we choose waypoints as the intermediate representation. Note that it could also be segmentations, depth images, affordances, or a combination of them.
In general, this optimization problem can be solved by minimizing the overall loss with respect to $\theta_p$ and $\theta_c$ simultaneously.
A good perception network is essential to achieving good control. To maintain modularity and reduce training time, we first optimize only for $\theta_p$ while ignoring the control loss. After the perception network converges, we fix $\theta_p$ and optimize for $\theta_c$.
This modular approach has several advantages over an end-to-end approach (see also [17, 9]). Since only the control module is specific to the vehicle's dynamics, the perception module can simply be swapped out, allowing the vehicle to navigate in completely different environments without any modification to the control module. Similarly, if the reward function changes across tasks, we can simply retrain the control module to learn a different policy. In this case, it is possible to add links between the perception and control networks and then finetune the joint network in an end-to-end fashion. One could also connect the two networks and use the waypoint labels as intermediate supervision for the perception part while training the joint model end-to-end. While these variants are interesting, we specifically refrain from such connections to safeguard the attractive modularity properties.
In what follows, we provide details of the architecture, implementation, and training procedure for both the perception and control modules. Note that OIL and the proposed architecture can also be applied to other types of MDP problems (e.g. other vision-based sequential decision-making problems).
4.2 Perception
In our case, the perception module takes raw RGB images as input and predicts a trajectory of waypoints relative to the vehicle's current position, which remains unknown in 3D. The waypoints predicted by the perception module are input to the control network along with the current vehicle state (velocity and orientation).
Waypoint Encoding. The mapping from image to waypoints is deterministic and unique. For every camera view, the corresponding waypoints can easily be determined and are independent of the vehicle state. We define waypoints along the track by a vertical offset, measured as the distance between the vehicle position and the projected point along the viewing axis, and a horizontal offset, defined as the distance between the original and projected point along the viewing-axis normal. We then encode these waypoints relative to the vehicle position and orientation by projecting them onto the viewing axis. Predicting waypoints rather than controls not only facilitates network training, but also allows for the automatic collection of training data without human intervention (self-supervision). Within the simulator, we simply sample/render the entire training track from multiple views and calculate the corresponding waypoints along the track. Note that it is still possible to use recordings from teachers, since one can use future positions to determine the waypoints for the current frame. Please refer to the supplementary material for further details and an illustration of the waypoint encoding method.
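A simplified 2-D sketch of this projection (the variable names are ours, and the planar setting is a simplification of the full encoding):

```python
import math

def encode_waypoint(vehicle_xy, heading, waypoint_xy):
    """Encode a world-space waypoint relative to the vehicle.

    Returns (forward, lateral): the distance along the viewing axis
    (vertical offset) and the signed offset along its normal
    (horizontal offset).
    """
    dx = waypoint_xy[0] - vehicle_xy[0]
    dy = waypoint_xy[1] - vehicle_xy[1]
    cos_h, sin_h = math.cos(heading), math.sin(heading)
    forward = dx * cos_h + dy * sin_h   # projection onto the viewing axis
    lateral = -dx * sin_h + dy * cos_h  # projection onto the axis normal
    return forward, lateral
```

For a vehicle at the origin facing along +x, a waypoint at (3, 1) encodes as 3 units ahead and 1 unit to the side, independent of world coordinates.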
Network Architecture. We choose a regression network architecture similar in spirit to the one used by Bojarski et al. Our DNN architecture is shown in Figure 2 as the perception module. It consists of eight layers: five convolutional and three fully-connected. The DNN consumes a single RGB image at 180×320 pixel resolution and is trained to regress the next five waypoints (x-offset and y-offset with respect to the local position of the vehicle) using a standard regression loss and a dropout ratio of 0.5 in the fully-connected layers. Compared to related methods [2, 32], we find that the relatively high input resolution (and equivalently high network capacity) improves the network's ability to look further ahead, which affords more robustness for long-term trajectory stability.
4.3 OIL for Control
Here, we present the details of training the control network using OIL, including network architecture and learning strategy.
Teachers and Learner. In our experiments, we use multiple naive PID controllers as teachers for the control policy. The PID parameters are tuned in only a couple of minutes and validated to perform well on a training track. Since the system is robust to learning from imperfect teachers, we do not need to spend much effort tuning these parameters or achieving optimal teacher performance on all training tracks. Although an unlimited number of PID-based teachers can be created, we empirically find five to be sufficient for the two control scenarios (autonomous driving and UAV racing); see the evaluations for how the number of teachers affects learning.
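Each teacher amounts to a PID loop on a tracking error such as the distance to the center line; a minimal single-axis sketch (the gains and time step below are illustrative, not the tuned values used in our experiments):

```python
class PIDTeacher:
    """Naive PID controller used as an imperfect teacher policy."""

    def __init__(self, kp, ki, kd, dt=1.0 / 60.0):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def act(self, error):
        """Map a tracking error (e.g. distance to the center line) to a control."""
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)
```

Different gain settings yield teachers with different (imperfect) driving styles, which is exactly the diversity OIL exploits.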
We use a three-layer fully-connected network to approximate the control policy of the learner. The MDP state of the learner is a vector concatenation of the predicted intermediate representation (e.g. waypoints) and the vehicle state (physical measurements of the vehicle's speed and orientation).
Network Architecture. The goal of the control network is to find a control policy that minimizes the control loss. The network consists of three fully-connected layers, with dropout in the second layer at a ratio of 0.5. The loss function is a standard regression loss optimized by the Adam optimizer. The control network is updated by OIL in an online fashion, while the vehicle runs through a set of training tracks.
As such, the control network learns from experience (akin to reinforcement learning), but it is supervised throughout by multiple teachers to minimize the surrogate loss. An advantage of our approach is that multiple teachers are able to teach throughout the control network's exploration. The control network never becomes dependent on the teachers; it gradually becomes independent and eventually learns to outperform them.
5 Experiments
5.1 Experimental Setup
The Sim4CV  (see Figure 1) environment provides capabilities to generate labeled datasets for offline training (e.g. imitation learning), and interact with the simulation environment for online learning (e.g. OIL or reinforcement learning) as well as online evaluation. For each application, we design six environments for training and four for testing. For fair comparison, human drivers/pilots are given as much time as needed to practice on the training tracks, before attempting the test tracks.
For the autonomous driving scenario, the PID teachers, learned baselines, and human drivers have to complete one lap. They are scored based on the average error to the center line and the time needed to complete the course. In addition, they need to pass through invisible checkpoints placed every 50 meters to make sure they stay on the course and do not take any shortcuts. The vehicle is reset at the next checkpoint if it does not reach it within 15 seconds.
For the UAV racing task, all pilots have to complete two laps on the test tracks and are scored based on the percentage of gates they maneuver through and the overall time. Similar to the autonomous driving scenario, we reset the UAV at the next checkpoint if it is not reached within 10 seconds. Here, the checkpoints correspond to the racing gates.
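The reset protocol for both tasks can be sketched as follows (the scalar track position and the 1-meter reach tolerance are simplifications of ours; the timeouts are those stated above):

```python
def update_checkpoint(position, checkpoints, next_idx, elapsed, timeout):
    """Advance, hold, or reset at the next checkpoint.

    Returns (next_idx, reset_position, elapsed): reaching the next
    checkpoint advances it; exceeding the timeout resets the vehicle there.
    """
    reached = abs(position - checkpoints[next_idx]) < 1.0  # within 1 m (illustrative)
    if reached:
        return next_idx + 1, None, 0.0
    if elapsed > timeout:
        # timed out: reset the vehicle at the missed checkpoint
        return next_idx + 1, checkpoints[next_idx], 0.0
    return next_idx, None, elapsed
```

This makes the scoring comparable across agents: every pilot visits the same checkpoints in order, with or without resets.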
Training Details. For fair comparison, we train each network until convergence or for at most 800k steps. During the observing phase, we estimate the value functions of all teachers and of the learner with fixed-length rollouts (different horizons for the car and the UAV) in order to select the most critical teacher. During the rehearsing phase, we execute the learner policy for the same number of steps until the advantage function becomes non-negative or a maximum number of episodes is reached; this episode limit differs between the single-teacher experiments and the three- and five-teacher experiments. The acting step size is likewise fixed.
For OIL we score both the learner and the teachers using a reward function that trades off between the trajectory error and speed. For the UAV we simply use its forward speed and penalize it for going outside the track. For the car we compute the progression along the center of the road and the average error. More details of the reward function are provided in the supplementary material.
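In outline, the two reward functions can be sketched as follows (the weight and penalty values below are placeholders; the exact terms are given in the supplementary material):

```python
def car_reward(progress, center_error, w_err=0.5):
    """Car: progression along the road center, penalized by center-line error."""
    return progress - w_err * center_error

def uav_reward(forward_speed, off_track, penalty=10.0):
    """UAV: forward speed, with a penalty for leaving the track."""
    return forward_speed - (penalty if off_track else 0.0)
```

The same scalar reward is used to score both the learner and the teachers, so teacher selection and learner evaluation share a single notion of "better".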
[Figure: trajectory plots comparing the PID-controller teachers with learned policies trained on the best teacher only (Teacher 1 / Teacher 3) and on all teachers.]
Table 3: Ablation study on the number of teachers and the trajectory length.

|3 Teachers (1, 3, 4)|25.79|73.60|
|Trajectory Length (60 steps)|15.30|80.48|
|Trajectory Length (180 steps)|16.04|80.72|
|Trajectory Length (600 steps)|13.79|82.17|
Comparison to State-of-the-Art Baselines. We compare OIL for both autonomous driving and UAV racing to several DNN baselines: the IL approaches Behavioral Cloning and DAGGER, and the RL approach DDPG. For each comparison, we implement both learning from the single best teacher and learning from an ensemble of teachers. This allows a broad comparison against six state-of-the-art learning configurations spanning IL and RL.
In Tables 1 and 2, the learned approaches are compared to OIL. The evaluations demonstrate that OIL outperforms all learned baselines in both score and timing. Both Behavioral Cloning and DAGGER become worse in score or timing with more teachers, as they learn both good and bad behaviors. Moreover, they do not achieve scores better than their teachers in single-teacher training. In contrast, OIL improves upon the teachers' scores with both a single teacher and multiple teachers. Compared to DDPG, OIL converges quickly without an extensive hyperparameter search and still learns to fly/drive much more precisely and faster.
Comparison to Teachers and Human Performance. We compare our OIL-trained control network to the teachers it learned from and to human control. The perception network is kept the same for all learned models. A summary of this comparison is given in Tables 1 and 2. The evaluations demonstrate that OIL outperforms all teachers as well as novice and intermediate human pilots/drivers. Compared to the expert driver, OIL is 3.83 seconds slower but has only 35.91% of the error in terms of distance to the center line.
Teacher 1 has the least error of all teachers and is the only one to pass all gates in the UAV racing evaluation. However, OIL not only completes all gates at higher speeds but is also even more precise in centering along the middle of the road in the autonomous driving evaluation. In comparison to humans, OIL is better than novice- and intermediate-level pilots and drivers but still slower than the expert. A notable difference between the expert driver and OIL is that OIL has a much lower driving error: it maintains high speeds while staying closest to the center of the track.
Ablation study. We investigate the importance of the trajectory length and the number of teachers and report the results in Table 3.
6 Conclusions and Future Work
In this paper, we present Observational Imitation Learning (OIL), a new approach for training a stationary deterministic policy that is not bound by imperfect or inexperienced teachers but rather updates its policy by selecting only the best maneuvers, leading to improved performance. The flexible network architecture affords modularity in perception and control, and can be applied to many different types of complex vision-based sequential prediction problems. We demonstrate the ability of the OIL framework to train, without supervision, a DNN to autonomously drive a car and to fly an unmanned aerial vehicle (UAV) through challenging tracks. Extensive experiments demonstrate that OIL outperforms single- and multiple-teacher IL methods (Behavioral Cloning, DAGGER) and RL approaches (DDPG). OIL's performance is better than that of its teachers and of experienced human pilots/drivers.
OIL provides a new learning-based approach that can replace traditional control methods, especially in robotics and control systems. We expect our framework can be adapted to other robotic tasks, such as visual grasping or placing, where teachers may be imperfect or unavailable except through automation.
One very interesting avenue for future work is to apply OIL outside simulation. The transferability of the perception network can be tested with a conservative PID controller handling controls. Since OIL online training functions at 60fps, the control module can be trained using an IMU with GPS to calculate waypoints and state, then feed them to a real-time control network in the field. Although Sim4CV uses a high quality gaming engine for rendering, the differences in appearance between the simulated and real-world will need to be reconciled. Moreover, real-world physics, weather and road conditions, and sensor noise will present new challenges to training the control network.
Since OIL is generic in nature as an improved DNN learning approach, we expect it will open up unique opportunities for the community to develop better self-navigation and control systems, expanding its reach to other fields of autonomous navigation and benefiting other complex AI sequential prediction problems (e.g. obstacle avoidance).
-  O. Andersson, M. Wzorek, and P. Doherty. Deep learning quadcopter control via risk-aware active learning. In Thirty-First AAAI Conference on Artificial Intelligence (AAAI), San Francisco, February 4–9, 2017.
-  M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba. End to end learning for self-driving cars. CoRR, abs/1604.07316, 2016.
-  K.-W. Chang, A. Krishnamurthy, A. Agarwal, H. Daume III, and J. Langford. Learning to search better than your teacher. 2015.
-  C. Chen, A. Seff, A. Kornhauser, and J. Xiao. Deepdriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), pages 2722–2730, Washington, DC, USA, 2015. IEEE Computer Society.
-  A. Dosovitskiy and V. Koltun. Learning to act by predicting the future. CoRR, abs/1611.01779, 2017.
-  A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun. CARLA: An open urban driving simulator. In S. Levine, V. Vanhoucke, and K. Goldberg, editors, Proceedings of the 1st Annual Conference on Robot Learning, volume 78 of Proceedings of Machine Learning Research, pages 1–16. PMLR, 13–15 Nov 2017.
-  A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4340–4349, 2016.
-  D. Isele, A. Cosgun, K. Subramanian, and K. Fujimura. Navigating intersections with autonomous vehicles using deep reinforcement learning. CoRR, abs/1705.01196, 2017.
-  E. Kaufmann, A. Loquercio, R. Ranftl, A. Dosovitskiy, V. Koltun, and D. Scaramuzza. Deep drone racing: Learning agile flight in dynamic environments. In A. Billard, A. Dragan, J. Peters, and J. Morimoto, editors, Proceedings of The 2nd Conference on Robot Learning, volume 87 of Proceedings of Machine Learning Research, pages 133–145. PMLR, 29–31 Oct 2018.
-  J. Koutník, J. Schmidhuber, and F. Gomez. Online Evolution of Deep Convolutional Network for Vision-Based Reinforcement Learning, pages 260–269. Springer International Publishing, Cham, 2014.
-  A. Lerer, S. Gross, and R. Fergus. Learning Physical Intuition of Block Towers by Example, 2016. arXiv:1603.01312v1.
-  S. Levine and V. Koltun. Guided policy search. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1–9, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
-  X. Liang, T. Wang, L. Yang, and E. Xing. CIRL: Controllable imitative reinforcement learning for vision-based self-driving. arXiv preprint arXiv:1807.03776, 2018.
-  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. ICLR, abs/1509.02971, 2016.
-  V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
-  M. Mueller, A. Dosovitskiy, B. Ghanem, and V. Koltun. Driving policy transfer via modularity and abstraction. In A. Billard, A. Dragan, J. Peters, and J. Morimoto, editors, Proceedings of The 2nd Conference on Robot Learning, volume 87 of Proceedings of Machine Learning Research, pages 1–15. PMLR, 29–31 Oct 2018.
-  M. Müller, V. Casser, J. Lahoud, N. Smith, and B. Ghanem. Sim4cv: A photo-realistic simulator for computer vision applications. Int. J. Comput. Vision, 126(9):902–919, Sept. 2018.
-  M. Müller, V. Casser, N. Smith, D. L. Michels, and B. Ghanem. Teaching UAVs to Race: End-to-End Regression of Agile Controls in Simulation. In European Conference on Computer Vision Workshop (ECCVW), Sept. 2018.
-  A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999.
-  J. Peters, S. Vijayakumar, and S. Schaal. Natural actor-critic. In European Conference on Machine Learning, pages 280–291. Springer, 2005.
-  S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In ECCV, 2016.
-  G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. Lopez. The SYNTHIA Dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  S. Ross and D. Bagnell. Efficient reductions for imitation learning. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 661–668, 2010.
-  S. Ross and J. A. Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014.
-  S. Ross, G. J. Gordon, and J. A. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. CoRR, abs/1011.0686, 2010.
-  A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani. End-to-end deep reinforcement learning for lane keeping assist. CoRR, abs/1612.04340, 2016.
-  A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani. Deep reinforcement learning framework for autonomous driving. Electronic Imaging, 2017(19):70–76, 2017.
-  A. Sauer, N. Savinov, and A. Geiger. Conditional affordance learning for driving in urban environments. arXiv preprint arXiv:1806.06498, 2018.
-  J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
-  S. Shah, D. Dey, C. Lovett, and A. Kapoor. Airsim: High-fidelity visual and physical simulation for autonomous vehicles, 2017.
-  N. Smolyanskiy, A. Kamenev, J. Smith, and S. Birchfield. Toward Low-Flying Autonomous MAV Trail Navigation using Deep Neural Networks for Environmental Awareness. ArXiv e-prints, May 2017.
-  W. Sun, A. Venkatraman, G. J. Gordon, B. Boots, and J. A. Bagnell. Deeply aggrevated: Differentiable imitation learning for sequential prediction. arXiv preprint arXiv:1703.01030, 2017.
-  R. S. Sutton and A. G. Barto. Introduction to reinforcement learning, volume 135. MIT press Cambridge, 1998.
-  F. Torabi, G. Warnell, and P. Stone. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954, 2018.
-  H. Xu, Y. Gao, F. Yu, and T. Darrell. End-to-end learning of driving models from large-scale video datasets. CoRR, abs/1612.01079, 2016.
-  H. Xu, Y. Gao, F. Yu, and T. Darrell. End-to-end learning of driving models from large-scale video datasets. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  J. Zhang and K. Cho. Query-efficient imitation learning for end-to-end autonomous driving. arXiv preprint arXiv:1605.06450, 2016.