Autonomous driving consists of many complex sub-tasks that must account for the dynamics of an environment and often lack precise definitions of the various driving behaviors involved. These characteristics cause conventional control methods to deliver subpar performance on the task [2, 3]. However, driving, like many other tasks, can be easily demonstrated by human experts. This observation inspires imitation learning, which leverages expert demonstrations to synthesize a controller.
While imitation learning has many advantages, it also has drawbacks. For autonomous driving, the most critical one is covariate shift: the training and test distributions differ. This can lead autonomous vehicles (AVs) into accidents, since a learned policy may fail to respond to unseen scenarios, including dangerous situations that occur only rarely.
To mitigate this issue, the training dataset needs to be augmented with more expert demonstrations covering a wide spectrum of driving scenarios, especially ones posing significant safety threats to passengers, so that a policy can learn how to recover from its own mistakes. This is emphasized by Pomerleau, who synthesized a neural-network-based controller for AVs: "the network must not solely be shown examples of accurate driving, but also how to recover (i.e. return to the road center) once a mistake has been made."
Although critical, obtaining recovery data from accidents in the physical world is impractical, due to the high cost of vehicles and the potential for injuries to both passengers and pedestrians. In addition, even if one managed to collect accident data, human experts would usually be needed to label them, which is inefficient and may be subject to judgment errors.
These difficulties naturally lead us to the virtual world, where accidents can be simulated and analyzed. We have developed ADAPS (Autonomous Driving Via Principled Simulations) to achieve this goal. ADAPS consists of two simulation platforms and a memory-enabled hierarchical control policy based on deep neural networks (DNNs). The first simulation platform, referred to as SimLearner, runs in a 3D environment and is used to test a learned policy, simulate accidents, and collect training data. The second, referred to as SimExpert, acts in a 2D environment and serves as the "expert" to analyze and resolve an accident via principled simulations, planning alternative safe trajectories for a vehicle by taking its physical, kinematic, and geometric constraints into account.
Furthermore, ADAPS represents a more efficient online learning mechanism than existing methods such as DAGGER. This matters because learning to drive requires iterative testing and updating of a control policy. Ideally, we want to obtain a robust policy using minimal iterations, since each iteration corresponds to an incident. This requires the training data generated at each iteration to be accurate, efficient, and sufficient, so that the policy gains a large improvement going into the next iteration. ADAPS helps achieve this goal.
The main contributions of this research are: (1) accidents generated in SimLearner are analyzed by SimExpert to produce alternative safe trajectories; (2) these trajectories are automatically processed to generate a large number of annotated and segmented training examples. Because SimExpert is parameterized and takes the physical, kinematic, and geometric constraints of a vehicle into account (i.e., it is principled), the resulting training examples are more heterogeneous than data collected by running a learned policy multiple times, and more effective than data collected through random sampling; (3) we present both theoretical and experimental results demonstrating that ADAPS is an efficient online learning mechanism.
The Appendix, which contains supporting material, can be found at http://gamma.cs.unc.edu/ADAPS/.
II Related Work
We sample previous studies that are related to each aspect of our framework and discuss the differences within.
Autonomous Driving. Among the various methods to plan and control an AV, we focus on end-to-end imitation learning, as it avoids manually designed features and leads to a more compact policy than conventional mediated perception approaches. Early studies by Pomerleau and LeCun et al. showed that neural networks can enable an AV to achieve lane-following and off-road obstacle avoidance. With the advancement of deep neural networks (DNNs), a number of studies have emerged [10, 11, 12, 13]. While significant improvements have been made, these results mainly assume normal driving conditions and restrict a vehicle to lane-following behavior. Our policy, in contrast, learns from accidents and enables a vehicle to achieve on-road collision avoidance with both static and dynamic obstacles.
Hierarchical Control Policy. There have been many efforts to construct hierarchical policies that control an agent at different stages of a task. Example studies include the options framework and transferable motor skills. When combined with DNNs, the hierarchical approach has been adopted for virtual characters learning locomotion tasks. In these studies, the goal is to discover a hierarchical relationship from complex sensorimotor behaviors. We instead apply a hierarchical, memory-enabled policy to autonomous driving based on multiple DNNs. Our policy enables an AV to continuously categorize the road condition as safe or dangerous, and to execute corresponding control commands to achieve accident-free driving.
Generative Policy Learning. Using principled simulations to assist learning is essentially a generative-model approach. Several studies have adopted the same philosophy to learn (near-)optimal policies; examples include function approximation, Sparse Sampling, and Fitted Value Iteration. These studies leverage a generative model to stochastically generate training samples; the emphasis is on simulating the feedback from an environment rather than the dynamics of an agent, assuming the reward function is known. Our system, on the other hand, does not assume any reward function for driving behavior but models the physical, kinematic, and geometric constraints of a vehicle, and uses simulations to plan trajectories w.r.t. environment characteristics. In essence, our method learns from expert demonstrations rather than self-exploration as in the previous studies.
Autonomous driving is a sequential prediction and control (SPC) task, in which a system must predict a sequence of control commands based on inputs that depend on past predicted control commands. Because the control and prediction processes are intertwined, SPC tasks often suffer from covariate shift, meaning the training and test distributions differ. In this section, we first introduce notation and definitions to formulate an SPC task and then briefly discuss existing solutions.
III-A Notation and Definitions
The problem we consider is a $T$-step control task. Given the observation $o_t$ of a state $s_t$ at each step $t \in \{1, \ldots, T\}$, the goal of a learner is to find a policy $\pi \in \Pi$ such that its produced action $a_t$ will lead to the minimal cost:

$$\hat{\pi} = \arg\min_{\pi \in \Pi} \sum_{t=1}^{T} \mathbb{E}_{s_t}\left[C(s_t, a_t)\right],$$

where $C(s_t, a_t)$ is the expected immediate cost of performing $a_t$ in $s_t$. For many tasks such as driving, we may not know the true cost $C$. So, we instead minimize the observed surrogate loss $\ell(o_t, \pi)$, which is assumed to upper bound $C$, based on the approximation of the learner's action $a_t = \pi(o_t)$ to the expert's action $a_t^* = \pi^*(o_t)$. We denote the distribution of observations at step $t$ as $d_\pi^t$, which is the result of executing $\pi$ from step $1$ to $t-1$. Consequently, $d_\pi = \frac{1}{T}\sum_{t=1}^{T} d_\pi^t$ is the average distribution of observations from executing $\pi$ for $T$ steps. Our goal is to solve an SPC task by obtaining $\hat{\pi}$ that minimizes the observed surrogate loss under its own induced observations w.r.t. the expert's actions in those observations:

$$\hat{\pi} = \arg\min_{\pi \in \Pi} \mathbb{E}_{o \sim d_\pi}\left[\ell(o, \pi)\right].$$
We further denote $\epsilon = \mathbb{E}_{o \sim d_{\pi^*}}[\ell(o, \pi)]$ as the expected loss under the training distribution induced by the expert's policy $\pi^*$, and the cost-to-go over $T$ steps of $\pi$ as $J(\pi)$ and of $\pi^*$ as $J(\pi^*)$. It has been shown that by simply treating expert demonstrations as i.i.d. samples, the discrepancy between $J(\pi)$ and $J(\pi^*)$ is $O(T^2\epsilon)$ [22, 1]. Given that the error of typical supervised learning is $O(T\epsilon)$, this demonstrates the additional cost due to covariate shift when solving an SPC task via standard supervised learning. (The proofs regarding the $O(T^2\epsilon)$ and $O(T\epsilon)$ results can be found in Appendix IX-A.)
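To make the gap concrete, the following toy simulation (our illustration, not from the paper) contrasts a policy that compounds errors once it leaves the training distribution with one that can recover; the former's expected cost grows roughly as $T^2\epsilon$, the latter's as $T\epsilon$.

```python
import random

def episode_cost(T, eps, can_recover, rng):
    """Simulate a T-step task. The policy errs with probability eps on
    in-distribution states; without recovery data, every step after the
    first error also incurs cost (compounding covariate shift)."""
    cost, off_distribution = 0, False
    for _ in range(T):
        if off_distribution and not can_recover:
            cost += 1  # stuck in unseen states: keeps making mistakes
        elif rng.random() < eps:
            cost += 1
            off_distribution = True
    return cost

def mean_cost(T, eps, can_recover, n=20000, seed=0):
    """Monte Carlo estimate of the expected cost-to-go."""
    rng = random.Random(seed)
    return sum(episode_cost(T, eps, can_recover, rng) for _ in range(n)) / n
```

With $T = 50$ and $\epsilon = 0.02$, the non-recovering policy's mean cost is many times larger than the recovering one's (which stays near $T\epsilon = 1$), mirroring the $O(T^2\epsilon)$ versus $O(T\epsilon)$ bounds.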
III-B Existing Techniques
Several iterative techniques, such as DAGGER, have been proposed to mitigate covariate shift. Essentially, these methods reduce an SPC task to online learning. By further leveraging interactions with experts and no-regret algorithms that have strong guarantees on convex loss functions, at each iteration these methods train one or more policies using standard supervised learning and improve the trained policies as the iterations continue.
To illustrate, we denote the best policy at the $i$th iteration (trained using all observations from the previous $i-1$ iterations) as $\pi_i$, and for any policy $\pi$ we write its expected loss under the observation distribution induced by $\pi_i$ as $\ell_i(\pi) = \mathbb{E}_{o \sim d_{\pi_i}}[\ell(o, \pi)]$. (In online learning, the surrogate loss can be seen as chosen by an adversary and varies at each iteration.) In addition, we denote the minimal loss in hindsight after $N \geq 1$ iterations as $\epsilon_N = \min_{\pi \in \Pi} \frac{1}{N}\sum_{i=1}^{N} \ell_i(\pi)$ (i.e., the training loss after using all observations from $N$ iterations). Then, we can represent the average regret of this online learning program as $\gamma_N = \frac{1}{N}\sum_{i=1}^{N} \ell_i(\pi_i) - \epsilon_N$. Using DAGGER as an example method, the accumulated error becomes the summation of three terms:

$$J(\hat{\pi}) \leq T\epsilon_N + T\gamma_N + O\!\left(\frac{f(T, \ell_{\max})}{N}\right),$$

where $f(T, \ell_{\max})$ is a function of the fixed task horizon $T$ and the upper bound $\ell_{\max}$ of the surrogate loss. As $N \to \infty$, the third term tends to $0$, and so does the second term if a no-regret algorithm such as Follow-the-Leader is used.
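The no-regret behavior can be checked numerically. The sketch below (our illustration, with quadratic losses standing in for the abstract convex losses) runs Follow-the-Leader, which for losses $\ell_i(p) = (p - z_i)^2$ plays the running mean of the past targets; its average regret $\gamma_N$ shrinks as $N$ grows.

```python
def ftl_average_regret(zs):
    """Average regret of Follow-the-Leader on losses l_i(p) = (p - z_i)^2.
    At round i the learner plays the minimizer of the cumulative past loss,
    which for quadratics is the running mean of z_1 .. z_{i-1}."""
    running_sum, learner_loss = 0.0, 0.0
    for i, z in enumerate(zs):
        play = running_sum / i if i else 0.0   # FTL prediction this round
        learner_loss += (play - z) ** 2
        running_sum += z
    best = sum(zs) / len(zs)                   # best fixed policy in hindsight
    hindsight_loss = sum((best - z) ** 2 for z in zs)
    return (learner_loss - hindsight_loss) / len(zs)
```

With i.i.d. targets in $[0, 1]$, the average regret decays on the order of $\log N / N$, which is the property the second term of the bound relies on.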
The aforementioned approach provides a practical way to solve SPC tasks. However, it may require many iterations to obtain a good policy. In addition, human experts or pre-defined controllers are usually needed to label the generated training data, which can be inefficient or difficult to generalize. For autonomous driving, we want the number of iterations to be minimal, since it directly corresponds to the number of accidents. This requires the generation of training data to be accurate, efficient, and sufficient.
In the following, we present the theoretical analysis of our framework and then introduce its pipeline.
IV-A Theoretical Analysis
We now compare our approach to existing learning mechanisms such as DAGGER and explain why it is more effective. Specifically, DAGGER assumes that the underlying learning algorithm only has access to a reset model: training examples can be obtained only online, by resetting an agent to its initial state distribution and executing a learned policy, thus achieving "small changes" at each iteration [1, 23, 26, 27]. In comparison, our method allows the learning algorithm to access a generative model, so training examples can be acquired offline by putting the agent into arbitrary states during the analysis of an accident and letting the generative model simulate its behavior. This results in massive training data, thus achieving "large changes" of a policy in one iteration.
Additionally, existing techniques such as DAGGER usually incorporate the demonstrations of only a few experts into training. Because of the reset-model assumption and the lack of a diversity requirement on experts, these demonstrations can be homogeneous. In contrast, by using our parameterized model to retrace and analyze each accident, the number of recovery actions obtained can be multiple orders of magnitude larger. Subsequently, we can treat the generated trajectories, and the additional data generated from them (described in Section VI-B), as the result of running a learned policy to sample independent expert trajectories at different states, since 1) a policy learned using DNNs can achieve a small training error and 2) our model provides near-exhaustive coverage of the configuration space of a vehicle. With these assumptions, we derive the following theorem.
Theorem 1. If the surrogate loss $\ell$ upper bounds the true cost $C$, then by collecting $m$ trajectories using ADAPS at each iteration, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have the following guarantee:

$$J(\hat{\pi}) \leq T\hat{\epsilon}_N + T\hat{\gamma}_N + O\!\left(T\ell_{\max}\sqrt{\frac{\log(1/\delta)}{mN}}\right).$$
Proof. See Appendix IX-A3.
Theorem 1 provides a bound on the expected cost-to-go of the best learned policy based on the empirical error of the best policy in $\Pi$ (i.e., $\hat{\epsilon}_N$) and the empirical average regret of the learner (i.e., $\hat{\gamma}_N$). The second term can be eliminated if a no-regret algorithm such as Follow-the-Leader is used, and the third term suggests that we need the number of training examples to be $O(T^2\log(1/\delta))$ in order to have a negligible generalization error, which is easily achievable using ADAPS. Summarizing these changes, we derive the following corollary.
Corollary 1. If $\ell$ is convex in $\pi$ for any $o$ and upper bounds $C$, and Follow-the-Leader is used to select the learned policy, then for any $\delta \in (0, 1)$, after collecting $O(T^2\log(1/\delta))$ training examples, with probability at least $1 - \delta$, we have the following guarantee:

$$J(\hat{\pi}) \leq T\hat{\epsilon}_N + O(1).$$
Proof. This follows from Theorem 1 and the aforementioned deduction.
Now we only need the best policy to have a small training error $\hat{\epsilon}_N$. This can be achieved using DNNs, since they have rich representational capabilities.
IV-B Framework Pipeline
The pipeline of our framework is as follows. First, in SimLearner, we test a learned policy by letting it control an AV. During testing, an accident may occur, in which case the trajectory of the vehicle and the full specifications of the situation (e.g., positions of obstacles, road configuration, etc.) are known. Next, we switch to SimExpert and replicate the specifications of the accident so that we can "solve" it (i.e., find alternative safe trajectories and dangerous zones). After obtaining the solutions, we use them to generate additional training data in SimLearner, which are combined with previously generated data to update the policy. Finally, we test the updated policy again.
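The loop just described can be sketched in code (a toy illustration with our own stand-in components; the real SimLearner and SimExpert are full simulators, not these stubs):

```python
OBSTACLE = 5  # toy road: positions 0..9, obstacle at position 5

def rollout(policy):
    """Test a policy on the toy road; return the position of the first
    accident, or None if the run is accident-free (stand-in for SimLearner)."""
    for pos in range(10):
        if pos == OBSTACLE and policy.get(pos, "keep") != "swerve":
            return pos
    return None

def sim_expert_solve(accident_pos):
    """Stand-in for SimExpert: label recovery actions around the accident."""
    return {p: "swerve" for p in (accident_pos - 1, accident_pos, accident_pos + 1)}

def adaps_iteration(policy, dataset):
    """One ADAPS iteration: test, analyze the accident, augment data, retrain."""
    accident = rollout(policy)
    if accident is None:
        return policy, dataset
    dataset.update(sim_expert_solve(accident))  # offline, generative data
    return dict(dataset), dataset               # "retraining" = fit the dataset

policy, dataset = {}, {p: "keep" for p in range(10)}
policy, dataset = adaps_iteration(policy, dataset)
```

After a single iteration the toy policy avoids the obstacle, reflecting the "large change per iteration" that the generative data enables.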
V Policy Learning
In this section, we detail our control policy, first explaining our design rationale, then formulating our problem and introducing the training data collection.
Driving is a hierarchical decision process. In its simplest form, a driver needs to constantly monitor the road condition, decide whether it is "safe" or "dangerous", and make corresponding maneuvers. When designing a control policy for AVs, we need to account for this hierarchical aspect. In addition, driving is a temporal behavior: drivers need reaction time to respond to various road situations [28, 29]. A Markovian control policy does not model this aspect and is instead likely to give a vehicle jerky motions. Considering these factors, we propose a hierarchical, memory-enabled control policy.
The task we consider is autonomous driving via a single front-facing camera. Our control policy consists of three modules: Detection, Following, and Avoidance. The Detection module keeps monitoring road conditions and activates either Following or Avoidance to produce a steering command. All these modules are trained via end-to-end imitation learning and share a similar network specification which is detailed in Appendix IX-B.
V-A End-to-end Imitation Learning
The objective of imitation learning is to train a model that behaves or makes decisions like an expert through demonstrations. The model could be a classifier or a regressor $\pi_\theta$ parameterized by $\theta$:

$$\hat{\theta} = \arg\min_{\theta} \mathbb{E}_{o \sim d_{\pi^*}}\left[\mathcal{L}\!\left(\pi^*(o), \pi_\theta(o)\right)\right],$$

where $\mathcal{L}$ is a distance function.
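As a concrete instance (our illustration, assuming a mean-squared distance $\mathcal{L}$ and a one-parameter linear policy), the minimization has a closed-form least-squares solution:

```python
def fit_imitation_linear(observations, expert_actions):
    """Least-squares fit of a one-parameter policy a = theta * o, i.e.
    argmin_theta sum_i (expert_actions[i] - theta * observations[i])^2."""
    num = sum(o * a for o, a in zip(observations, expert_actions))
    den = sum(o * o for o in observations)
    return num / den

def surrogate_loss(theta, observations, expert_actions):
    """Empirical surrogate loss: mean squared distance to the expert."""
    return sum((a - theta * o) ** 2
               for o, a in zip(observations, expert_actions)) / len(observations)
```

A DNN replaces the linear map in practice, but the objective, minimizing the average distance to the expert's actions over demonstration observations, is the same.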
The end-to-end aspect denotes the direct mapping from raw observations to decision/control commands. Our policy requires one decision module (Detection) and two control modules (Following and Avoidance). The input to Detection is a sequence of annotated images, and the output is a binary label indicating whether the road condition is dangerous or safe. The inputs to Following and Avoidance are likewise sequences of annotated images, and the outputs are steering angles. Together, these learned modules form a hierarchical control mechanism enabling an AV to drive safely on roads and avoid obstacles when needed.
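The gating logic can be sketched as follows (a minimal illustration; in the paper each module is a DNN, stubbed here as a callable, and the frame buffer stands in for the policy's memory):

```python
from collections import deque

class HierarchicalPolicy:
    """Detection gates two control modules; a frame buffer provides memory."""

    def __init__(self, detect, follow, avoid, history=5):
        self.detect, self.follow, self.avoid = detect, follow, avoid
        self.frames = deque(maxlen=history)  # recent observations (temporal memory)

    def act(self, frame):
        self.frames.append(frame)
        seq = list(self.frames)
        if self.detect(seq) == "DANGER":
            return self.avoid(seq)   # steer to avoid the obstacle
        return self.follow(seq)      # keep lane-following
```

Feeding a short sequence of frames rather than a single image is what lets the policy model reaction time instead of acting on instantaneous (Markovian) input.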
V-B Training Data Collection
For training Following, inspired by the technique of Bojarski et al., we collect images from three front-facing cameras behind the main windshield: one at the center, one on the left side, and one on the right side. The image from the center camera is labeled with the exact steering angle, while the images from the other two cameras are labeled with adjusted steering angles. Once Following is learned, however, it only needs images from the center camera to operate.
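The labeling scheme might look as follows in code (the correction offset and the sign convention are our assumptions for illustration, not values from the paper):

```python
def label_three_cameras(center_angle, correction=0.08):
    """Label one capture from the three front-facing cameras.
    Convention (assumed): positive angle = steer right. The left camera
    views the scene as if the car had drifted left, so its image is paired
    with a label that steers back to the right, and symmetrically for the
    right camera. The offset value is a hypothetical constant."""
    return {
        "center": center_angle,
        "left":   center_angle + correction,
        "right":  center_angle - correction,
    }
```

The side cameras thus provide synthetic "drifted" viewpoints with recovery labels, which is what lets a policy trained this way correct small lateral errors.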
For training Avoidance, we rely on SimExpert, which can generate numerous intermediate collision-free trajectories between the first moment and the last moment of a potential accident (see Section VI-A). By positioning an AV on these trajectories, we collect images from the center front-facing camera along with the corresponding steering angles. The training of Detection requires a more sophisticated mechanism and is the subject of the next section.
VI Learning from Accidents
We now explain how we analyze an accident in SimExpert and use the generated data to train the Avoidance and Detection modules of our policy. SimExpert is built on the multi-agent simulator WarpDriver.
VI-A Solving Accidents
When an accident occurs, we know the trajectory of the tested vehicle for the latest frames, which we note as a collection of states, where each state contains the 2-dimensional position and velocity vectors of the vehicle. There are three notable states on this trajectory that we need to track. The first is the earliest state in which the vehicle is involved in the accident (i.e., is in a collision). The second is the last state from which the expert algorithm can still avoid the collision. The final one is the first state, before the accident, in which the expert algorithm perceives the interaction with the other involved agent that leads to the accident.
In order to compute these notable states, we briefly recall the high-level components of WarpDriver. This collision-avoidance algorithm consists of two parts. The first is a function that, given the current state of an agent and any prediction point in 2-dimensional space and time (in the agent's referential), gives the probability of that agent colliding with any neighbor. The second is the solver which, based on this function, computes the agent's probability of colliding with neighbors along its future trajectory starting from a given state (the function evaluated over points spanning the agent's future predicted trajectory), and then proposes a new velocity to lower this probability. Consequently, we can initialize an agent in this algorithm at any state and compute a new trajectory consisting of new states.
Additionally, since the origin of space and time in an agent's referential represents the agent's position at the current time, we can use this point with the collision-probability function to determine whether the agent is currently colliding with anyone; applying this test along the recorded trajectory yields the earliest colliding state. We note that a trajectory produced by the expert algorithm could still contain collisions (accounting for vehicle dynamics) depending on the state it was initialized from, so the last avoidable state is the latest recorded state from which the expert's proposed trajectory is collision-free. Finally, the first perceiving state is the earliest recorded state, at or before the last avoidable one, at which the computed collision probability with the other involved agent becomes non-zero.
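Given predicate functions standing in for these collision-probability computations (hypothetical callbacks, not WarpDriver's actual API), locating the three notable states reduces to a scan over the recorded trajectory:

```python
def notable_states(states, is_colliding, can_avoid, perceives):
    """Locate the three notable frames on a recorded accident trajectory.
    is_colliding(s): the vehicle is in collision at state s.
    can_avoid(s):    the expert, started from s, still finds a collision-free path.
    perceives(s):    the expert already perceives the interaction leading
                     to the accident at state s."""
    t_collision = next(t for t, s in enumerate(states) if is_colliding(s))
    t_last_avoidable = max(t for t in range(t_collision) if can_avoid(states[t]))
    t_first_perceived = min(t for t in range(t_last_avoidable + 1)
                            if perceives(states[t]))
    return t_collision, t_last_avoidable, t_first_perceived
```

In the real system each predicate invokes the solver, so the scan is a sequence of principled simulations rather than cheap boolean checks.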
Knowing these notable states, we can solve the accident by computing a set of collision-free trajectories; an example can be found in Appendix IX-C. These trajectories can then be used to generate training examples in SimLearner to train the Avoidance module.
VI-B Additional Data Coverage
The previous step generated collision-free trajectories between the first and last moments of the potential accident. It is possible to build on these trajectories if the tested steering algorithm has particular data/training requirements. Here we detail the data we derive in order to train the Detection module, whose task is to determine whether a situation is dangerous and tell Avoidance to address it.
To proceed, we essentially generate a number of trajectories parallel to the original one, and for each position on them, generate several images for various orientations of the vehicle. These images are then labeled as under-steering or over-steering compared to the "ideal" collision-free trajectories. This way, we scan the region of the road before the accident locus, generating several images (with different vehicle orientations) for each point in that region.
In summary (a thorough version can be found in Appendix IX-D), and as depicted in Figure 1, at each state we construct a line perpendicular to the original trajectory. On this line, we define three points and a margin. The first point is the furthest intersection between this line and the collision-free trajectories; the other two points are the intersections between the constructed line and the left and right road borders, respectively. From these points, a generated image at a position along the constructed line and with a given direction vector receives either a DANGER or SAFE label (red and green ranges in Figure 1), depending on whether the direction vector lies to the "left" or "right" of the vector obtained by interpolating the velocity vectors of states belonging to nearby collision-free trajectories (bilinear interpolation if the position is between two collision-free trajectories, linear otherwise).
If a point is on the same side of the original trajectory as the collision-free trajectories (in Figure 1, either "outside" but within the margin of the collision-free trajectories, or "inside" them), the label is SAFE when the direction points to the exterior of the avoidance maneuver, and DANGER otherwise.
If a point is on the other side of the original trajectory from the collision-free trajectories (in Figure 1), then inside the road the label is always DANGER, while outside the road but within its margin, the label is DANGER when the direction points towards the road, and SAFE otherwise.
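The left/right test against the interpolated safe velocity amounts to a 2D cross-product sign check, sketched below (which side maps to DANGER is an assumption for illustration, not the paper's convention):

```python
def label_direction(direction, safe_velocity, danger_on_left=True):
    """Classify a sampled camera heading by which side of the interpolated
    collision-free velocity it lies on. The sign of the 2D cross product
    tells whether `direction` points left or right of `safe_velocity`."""
    cross = safe_velocity[0] * direction[1] - safe_velocity[1] * direction[0]
    on_left = cross > 0
    return "DANGER" if on_left == danger_on_left else "SAFE"
```

Applied over a grid of positions and orientations, this single check produces the dense DANGER/SAFE supervision the Detection module is trained on.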
We test our framework in three scenarios: a straight road representing a linear geometry, a curved road representing a non-linear geometry, and an open ground. The first two scenarios demonstrate on-road situations with a static obstacle while the last one demonstrates an off-road situation with a dynamic obstacle. The specifications of our experiments are detailed in Appendix IX-E.
For evaluation, we compare our policy to a "flat policy" consisting essentially of a single DNN [8, 11, 31, 13]. Usually, this type of policy contains a few convolutional layers followed by a few dense layers. Although the specifications may vary, without human intervention these policies are mainly limited to single-lane following. In this work, we select the network of Bojarski et al. as the example, as it is one of the most tested control policies. In the following, we first demonstrate the effectiveness of our policy and then qualitatively illustrate the efficiency of our framework.
VII-A Control Policy
We derive our training datasets from the straight road with or without an obstacle and the curved road with or without an obstacle. This separation allows us to train multiple policies and test the effect of learning from accidents using our policy compared to that of Bojarski et al. By progressively increasing the training datasets, we obtain six policies for evaluation: our own policy trained with only lane-following data, additionally trained after analyzing one accident on the straight road, and additionally trained after producing one accident on the curved road; and, similarly, the three counterparts of the policy from Bojarski et al.
We first evaluate the two policies trained with only lane-following data, using both the straight and curved roads, by counting how many laps (out of 50) the AV can finish. Both policies managed to finish all laps while keeping the vehicle in the lane. We then test these two policies on the straight road with a static obstacle added. Both policies result in the vehicle colliding with the obstacle, which is expected since no accident data were used during training.
Having the resulting accident, we can now use SimExpert to generate additional training data and retrain both policies. (The accident data are only used for the regression task, as the policy of Bojarski et al. does not have a classification module.) As a result, the policy of Bojarski et al. continues to cause collisions, while ours avoids the obstacle. Nevertheless, when tested on the curved road with an obstacle, an accident still occurs, because the corresponding accident data are not yet included in training.
By further including the accident data from the curved road in training, we obtain the final pair of policies. Ours manages to perform both lane-following and collision avoidance in all runs; the policy of Bojarski et al., on the other hand, leads the vehicle to drift away from the road.
For the studies involving an obstacle, we uniformly sampled 50 obstacle positions on a line segment that is perpendicular to the direction of the road and in the same lane as the vehicle. We compute the success rate as the fraction of runs in which a policy avoids the obstacle (while staying in the lane) and resumes lane-following afterwards. The results are shown in Table II, and example trajectories are shown in Figure 2 LEFT and CENTER.
Table I: Training data (per module) and other specifications for each scenario.

| Scenario | Following (#Images) | Avoidance (#Images) | Detection (#Images) | Total | Data Augmentation | #Safe Trajectories | Road Type | Obstacle |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Straight road | 33 642 | 34 516 | 32 538 | 97 854 | x | 74 | on-road | static |
| Curved road | 31 419 | 33 624 | 71 859 | 136 855 | x | 40 | on-road | static |
| Open ground | 30 000 | 33 741 | 67 102 | 130 843 | x | 46 | off-road | dynamic |
We further test our method on an open ground that involves a dynamic obstacle. The AV is trained to head towards a green sphere while an adversary vehicle is scripted to collide with the AV on its default course. The results show that our policy can steer the AV away from the adversary vehicle and resume its course toward the sphere target. This can be seen in Figure 2 RIGHT.
Table II: Test policies and success rates (out of 50 runs). B denotes the policy of Bojarski et al. and P denotes ours; subscript 0 = lane-following data only, 1 = plus straight-road accident data, 2 = plus curved-road accident data.

| Scenario | B0 | P0 | B1 | P1 | B2 | P2 |
| --- | --- | --- | --- | --- | --- | --- |
| Straight rd. / Curved rd. | 100% | 100% | 100% | 100% | 100% | 100% |
| Straight rd. + Obst. | 0% | 0% | 0% | 100% | 0% | 100% |
| Curved rd. + Obst. | 0% | 0% | 0% | 0% | 0% | 100% |
VII-B Algorithm Efficiency
The key to rapid policy improvement is to generate training data accurately, efficiently, and sufficiently. Using principled simulations covers the first two criteria; we now demonstrate the third. Compared to the average number of training examples collected by DAGGER in one iteration, our method achieves over 200 times more training examples per iteration. (This factor is computed by dividing the total number of training images produced by our method by the average number of training examples collected using the safe trajectories in each scenario.) This is shown in Table I.
In Figure 3, we show the visualization results of images collected using our method and DAGGER within one iteration, obtained by progressively increasing the number of sampled trajectories. Our method generates much more heterogeneous training data, which, when produced in large quantities, can greatly facilitate the update of a control policy.
In this work, we have proposed ADAPS, a framework that consists of two simulation platforms and a control policy. Using ADAPS, one can easily simulate accidents. ADAPS will then automatically retrace each accident, analyze it, and plan alternative safe trajectories. With the additional data generation technique, our method produces a large number of heterogeneous training examples compared to existing methods such as DAGGER, thus representing a more efficient learning mechanism. Our hierarchical, memory-enabled policy offers robust collision-avoidance behaviors that previous policies fail to achieve. We have evaluated our method in multiple simulated scenarios, in which it shows a variety of benefits.
There are many future directions. First of all, we would like to incorporate long-range vision into ADAPS so that an AV can plan ahead in time. Secondly, the generation of accidents can be parameterized using knowledge from traffic engineering studies. Lastly, we would like to combine more sensors and fuse their inputs so that an AV can navigate in more complicated traffic scenarios.
The authors would like to thank US Army Research Office and UNC Arts & Science Foundation, and Dr. Feng “Bill” Shi for insightful discussions.
-  S. Ross, G. Gordon, and D. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 627–635.
-  N. Ratliff, “Learning to search: structured prediction techniques for imitation learning,” Ph.D. dissertation, Carnegie Mellon University, 2009.
-  D. Silver, “Learning preference models for autonomous mobile robots in complex domains,” Ph.D. dissertation, 2010.
-  D. Pomerleau, “ALVINN: An autonomous land vehicle in a neural network,” in Advances in neural information processing systems, 1989, pp. 305–313.
-  S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert, “Learning monocular reactive uav control in cluttered natural environments,” in Robotics and Automation, 2013 IEEE International Conference on. IEEE, 2013, pp. 1765–1772.
-  Q. Chao, H. Bi, W. Li, T. Mao, Z. Wang, M. C. Lin, and Z. Deng, "A survey on visual traffic simulation: Models, evaluations, and applications in autonomous driving," Computer Graphics Forum, 2019.
-  W. Schwarting, J. Alonso-Mora, and D. Rus, “Planning and decision-making for autonomous vehicles,” Annual Review of Control, Robotics, and Autonomous Systems, 2018.
-  C. Chen, A. Seff, A. Kornhauser, and J. Xiao, “Deepdriving: Learning affordance for direct perception in autonomous driving,” in Computer Vision, 2015 IEEE International Conference on, 2015, pp. 2722–2730.
-  Y. LeCun, U. Muller, J. Ben, E. Cosatto, and B. Flepp, “Off-road obstacle avoidance through end-to-end learning,” in Advances in neural information processing systems, 2005, pp. 739–746.
-  M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al., “End to end learning for self-driving cars,” arXiv preprint arXiv:1604.07316, 2016.
-  H. Xu, Y. Gao, F. Yu, and T. Darrell, "End-to-end learning of driving models from large-scale video datasets," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3530–3538.
-  Y. Pan, C.-A. Cheng, K. Saigol, K. Lee, X. Yan, E. Theodorou, and B. Boots, “Agile off-road autonomous driving using end-to-end deep imitation learning,” in Robotics: Science and Systems, 2018.
-  F. Codevilla, M. Müller, A. Dosovitskiy, A. López, and V. Koltun, “End-to-end driving via conditional imitation learning,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 746–753.
-  A. G. Barto and S. Mahadevan, "Recent advances in hierarchical reinforcement learning," Discrete Event Dynamic Systems, vol. 13, no. 4, pp. 341–379, 2003.
-  R. S. Sutton, D. Precup, and S. Singh, “Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning,” Artificial intelligence, vol. 112, no. 1-2, pp. 181–211, 1999.
-  G. Konidaris, S. Kuindersma, R. Grupen, and A. Barto, “Robot learning from demonstration by constructing skill trees,” The International Journal of Robotics Research, vol. 31, no. 3, pp. 360–375, 2012.
-  S. Levine and V. Koltun, "Guided policy search," in Proceedings of the 30th International Conference on Machine Learning (ICML), 2013, pp. 1–9.
-  G. J. Gordon, “Stable function approximation in dynamic programming,” in Machine Learning Proceedings 1995. Elsevier, 1995, pp. 261–268.
-  M. Kearns, Y. Mansour, and A. Y. Ng, "A sparse sampling algorithm for near-optimal planning in large markov decision processes," Machine Learning, vol. 49, no. 2-3, pp. 193–208, 2002.
-  C. Szepesvári and R. Munos, “Finite time bounds for sampling based fitted value iteration,” in Proceedings of the 22nd international conference on Machine learning, 2005, pp. 880–887.
-  L.-J. Lin, “Self-improving reactive agents based on reinforcement learning, planning and teaching,” Machine learning, vol. 8, no. 3-4, pp. 293–321, 1992.
-  U. Syed and R. E. Schapire, “A reduction from apprenticeship learning to classification,” in Advances in Neural Information Processing Systems, 2010, pp. 2253–2261.
-  H. Daumé, J. Langford, and D. Marcu, “Search-based structured prediction,” Machine learning, vol. 75, no. 3, pp. 297–325, 2009.
-  S. M. Kakade and A. Tewari, “On the generalization ability of online strongly convex programming algorithms,” in Advances in Neural Information Processing Systems, 2009, pp. 801–808.
-  E. Hazan, A. Agarwal, and S. Kale, “Logarithmic regret algorithms for online convex optimization,” Machine Learning, vol. 69, no. 2-3, pp. 169–192, 2007.
-  S. Kakade and J. Langford, “Approximately optimal approximate reinforcement learning,” in Proceedings of the 19th International Conference on Machine Learning (ICML), 2002, pp. 267–274.
-  J. A. Bagnell, S. M. Kakade, J. G. Schneider, and A. Y. Ng, “Policy search by dynamic programming,” in Advances in neural information processing systems, 2004, pp. 831–838.
-  G. Johansson and K. Rumar, “Drivers’ brake reaction times,” Human factors, vol. 13, no. 1, pp. 23–27, 1971.
-  D. V. McGehee, E. N. Mazzae, and G. S. Baldwin, “Driver reaction time in crash avoidance research: validation of a driving simulator study on a test track,” in Proceedings of the human factors and ergonomics society annual meeting, vol. 44, no. 20, 2000.
-  D. Wolinski, M. Lin, and J. Pettré, “Warpdriver: context-aware probabilistic motion prediction for crowd simulation,” ACM Transactions on Graphics (TOG), vol. 35, no. 6, 2016.
-  J. Zhang and K. Cho, “Query-efficient imitation learning for end-to-end simulated driving,” in AAAI, 2017, pp. 2891–2897.
-  L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
-  W. Li, D. Wolinski, and M. C. Lin, “City-scale traffic animation using statistical learning and metamodel-based optimization,” ACM Trans. Graph., vol. 36, no. 6, pp. 200:1–200:12, Nov. 2017.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” ICLR, 2015.
IX-A Solving An SPC Task
We show the proofs of solving an SPC task using standard supervised learning, DAGGER, and ADAPS, respectively. We use “state” and “observation” interchangeably here, as for these proofs we can always find a deterministic function that maps between the two.
IX-A1 Supervised Learning
The following proof is adapted and simplified from Ross et al. We include it here for completeness.
Consider a $T$-step control task. Let $\epsilon$ be the observed surrogate loss under the training distribution induced by the expert's policy $\pi^*$. We assume the surrogate loss $\ell$ upper bounds the 0-1 loss. $J(\pi)$ and $J(\pi^*)$ denote the cost-to-go over $T$ steps of executing the learned policy $\pi$ and the expert's policy $\pi^*$, respectively. Then, we have the following result:

$J(\pi) \le J(\pi^*) + C_{\max}\,T^2\epsilon.$
In order to prove this theorem, we introduce the following notation and definitions:

- $d_t$: the state distribution at time $t$ as a result of the following event: $\pi$ is executed and has been choosing the same actions as $\pi^*$ from time $1$ to $t-1$.
- $p_{t-1}$: the probability that the above-mentioned event holds true.
- $d_t'$: the state distribution at time $t$ as a result of the following event: $\pi$ is executed and has chosen at least one different action than $\pi^*$ from time $1$ to $t-1$.
- $1-p_{t-1}$: the probability that the above-mentioned event holds true.
- $d_t^{\pi} = p_{t-1}\,d_t + (1-p_{t-1})\,d_t'$: the state distribution at time $t$.
- $\epsilon_t$: the probability that $\pi$ chooses a different action than $\pi^*$ in $d_t$.
- $\epsilon_t'$: the probability that $\pi$ chooses a different action than $\pi^*$ in $d_t'$.
- $\bar\epsilon_t$: the probability that $\pi$ chooses a different action than $\pi^*$ in $d_t^{\pi}$.
- $c_t$: the expected immediate cost of executing $\pi$ in $d_t$.
- $c_t^*$: the expected immediate cost of executing $\pi^*$ in $d_t$.
- $c_t'$: the expected immediate cost of executing $\pi$ in $d_t'$.
- $\bar c_t$: the expected immediate cost of executing $\pi$ in $d_t^{\pi}$.
- $C_{\max}$: the upper bound of an expected immediate cost.
- $J(\pi)$: the cost-to-go of executing $\pi$ for $T$ steps.
- $J(\pi^*)$: the cost-to-go of executing $\pi^*$ for $T$ steps.
The probability that the learner chooses at least one different action than the expert in the first $t$ steps is:

$1 - p_t = (1 - p_{t-1}) + p_{t-1}\,\epsilon_t.$

This gives us $p_t \ge p_{t-1} - \epsilon_t$ since $p_{t-1} \le 1$. Solving this recurrence we arrive at:

$p_t \ge 1 - \sum_{i=1}^{t}\epsilon_i.$

Now consider in state distribution $d_t$: if $\pi$ chooses a different action than $\pi^*$ with probability $\epsilon_t$, then $\pi$ will incur a cost at most $\epsilon_t C_{\max}$ more than $\pi^*$. This can be represented as:

$c_t \le c_t^* + \epsilon_t C_{\max}.$

Thus, we have:

$\bar c_t = p_{t-1}\,c_t + (1-p_{t-1})\,c_t' \le p_{t-1}\,(c_t^* + \epsilon_t C_{\max}) + (1-p_{t-1})\,C_{\max}.$

We sum the above result over $T$ steps and use the fact $1 - p_{t-1} \le \sum_{i=1}^{t-1}\epsilon_i$:

$J(\pi) = \sum_{t=1}^{T}\bar c_t \le \sum_{t=1}^{T} c_t^* + C_{\max}\sum_{t=1}^{T}\sum_{i=1}^{t}\epsilon_i \le J(\pi^*) + C_{\max}\,T\sum_{t=1}^{T}\epsilon_t = J(\pi^*) + C_{\max}\,T^2\epsilon,$

where $\epsilon = \frac{1}{T}\sum_{t=1}^{T}\epsilon_t$ and $\sum_{t=1}^{T}c_t^* = J(\pi^*)$, because conditioned on agreement up to time $t-1$ the distribution $d_t$ coincides with the state distribution induced by the expert.
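The quadratic compounding in this bound can be illustrated numerically. The sketch below is a hypothetical toy setup, not from the paper: a policy that errs with probability ε per step on expert-visited states and receives the maximum immediate cost once it leaves the training distribution, with no recovery. Doubling the horizon then roughly quadruples the cost-to-go.

```python
import random

def rollout_cost(T, eps, rng):
    """Cost-to-go of a policy that errs w.p. eps on expert-visited states
    and incurs cost 1 on every step after its first mistake (no recovery)."""
    cost = 0.0
    off_track = False
    for _ in range(T):
        if not off_track and rng.random() < eps:
            off_track = True   # first mistake: leave the training distribution
        if off_track:
            cost += 1.0        # worst-case immediate cost C_max = 1 thereafter
    return cost

def mean_cost(T, eps, trials=20000, seed=0):
    rng = random.Random(seed)
    return sum(rollout_cost(T, eps, rng) for _ in range(trials)) / trials

c10 = mean_cost(10, 0.01)
c20 = mean_cost(20, 0.01)
print(round(c10, 3), round(c20, 3), round(c20 / c10, 2))
# doubling T roughly quadruples the cost, consistent with the T^2*eps bound
```

This is exactly the covariate-shift failure mode described in the introduction: a small per-step error rate compounds because the learner never saw recovery behavior in training.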
Lemma 1. Let $p$ and $q$ be any two distributions over elements $s \in \mathcal{S}$, and $f(s)$ any bounded function such that $f(s) \in [a, b]$ for all $s$. Let the range $c = b - a$. Then $|\mathbb{E}_{s\sim p}[f(s)] - \mathbb{E}_{s\sim q}[f(s)]| \le \frac{c}{2}\,\|p - q\|_1$.

Taking $g(s) = f(s) - \frac{a+b}{2}$ leads to $|g(s)| \le \frac{c}{2}$; since $\sum_s \big(p(s) - q(s)\big) = 0$, we have $|\mathbb{E}_{s\sim p}[f(s)] - \mathbb{E}_{s\sim q}[f(s)]| = |\sum_s (p(s)-q(s))\,g(s)| \le \frac{c}{2}\,\|p-q\|_1$, which proves the lemma.
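The lemma is easy to verify numerically. The following sketch (illustrative only) draws random distribution pairs and bounded functions and checks that the gap in expectations never exceeds (c/2)·||p − q||₁.

```python
import random

def check_lemma(n=6, trials=1000, seed=1):
    rng = random.Random(seed)
    for _ in range(trials):
        # random distributions p and q over n elements
        p = [rng.random() for _ in range(n)]; s = sum(p); p = [x / s for x in p]
        q = [rng.random() for _ in range(n)]; s = sum(q); q = [x / s for x in q]
        # bounded function f with range c = b - a
        a, b = -2.0, 3.0
        f = [rng.uniform(a, b) for _ in range(n)]
        gap = abs(sum(pi * fi for pi, fi in zip(p, f)) -
                  sum(qi * fi for qi, fi in zip(q, f)))
        l1 = sum(abs(pi - qi) for pi, qi in zip(p, q))
        assert gap <= (b - a) / 2 * l1 + 1e-12  # the lemma's bound
    return True

print(check_lemma())
```

The c/2 factor (rather than c) comes from centering f at the midpoint of its range, which is exactly the trick used in the proof.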
Lemma 2. Let $\hat\pi_i$ be the learned policy, $\pi^*$ be the expert's policy, and $\pi_i$ be the policy used to collect training data, which with probability $\beta_i$ executes $\pi^*$ and with probability $1-\beta_i$ executes $\hat\pi_i$ over $T$ steps. Then, we have $\|d_{\pi_i} - d_{\hat\pi_i}\|_1 \le 2\,T\beta_i$.

In contrast to $d_{\hat\pi_i}$, which is the state distribution as the result of solely executing $\hat\pi_i$, we denote $d$ as the state distribution as the result of executing $\pi^*$ at least once over $T$ steps. This gives us $d_{\pi_i} = (1-\beta_i)^T\,d_{\hat\pi_i} + \big(1-(1-\beta_i)^T\big)\,d$. We also have the facts that for any two distributions $p$ and $q$, $\|p - q\|_1 \le 2$, and that $(1-\beta)^T \ge 1 - \beta T$ for any $\beta \in [0, 1]$. Then, we have $1-(1-\beta_i)^T \le T\beta_i$ and can further show:

$\|d_{\pi_i} - d_{\hat\pi_i}\|_1 = \big(1-(1-\beta_i)^T\big)\,\|d - d_{\hat\pi_i}\|_1 \le 2\,T\beta_i.$
If the surrogate loss $\ell$ is the same as the cost function $C$ or upper bounds it, then after $N$ iterations of DAGGER:

$J(\hat\pi) \le J(\bar\pi) \le T\,\epsilon_N + T\,\gamma_N + O\!\left(\frac{\ell_{\max}\,T\log T}{N}\right).$

Let $\ell_i(\pi) = \mathbb{E}_{s\sim d_{\pi_i}}[\ell(s,\pi)]$ be the expected loss of any policy $\pi$ under the state distribution induced by the policy used at the $i$th iteration, and $\epsilon_N = \min_{\pi}\frac{1}{N}\sum_{i=1}^{N}\ell_i(\pi)$ be the minimal loss in hindsight after $N$ iterations. Then, $\gamma_N = \frac{1}{N}\sum_{i=1}^{N}\ell_i(\hat\pi_i) - \epsilon_N$ is the average regret of this online learning program. In addition, we denote the expected loss of any policy under its own induced state distribution as $\ell(\pi) = \mathbb{E}_{s\sim d_{\pi}}[\ell(s,\pi)]$ and consider $\bar\pi$ as the mixed policy that samples the policies $\hat\pi_{1:N}$ uniformly at the beginning of each trajectory. Using Lemma 1 and Lemma 2, we can show:

$\ell(\bar\pi) = \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{s\sim d_{\hat\pi_i}}[\ell(s,\hat\pi_i)] \le \frac{1}{N}\sum_{i=1}^{N}\Big(\ell_i(\hat\pi_i) + \ell_{\max}\min(1,\,T\beta_i)\Big).$

By further assuming $\beta_i$ is monotonically decreasing and defining $n_\beta$ as the largest $n \le N$ such that $\beta_n > \frac{1}{T}$, we have the following:

$\sum_{i=1}^{N}\min(1,\,T\beta_i) \le n_\beta + T\sum_{i=n_\beta+1}^{N}\beta_i.$

Summing over $N$ gives us:

$\ell(\bar\pi) \le \epsilon_N + \gamma_N + \frac{\ell_{\max}}{N}\Big[n_\beta + T\sum_{i=n_\beta+1}^{N}\beta_i\Big].$

Define $\beta_i = (1-\alpha)^{i-1}$; in order to have $\beta_n \le \frac{1}{T}$, we need $(1-\alpha)^{n-1} \le \frac{1}{T}$, which gives us $n_\beta \le \frac{\log T}{\alpha} + 1$. In addition, note now $\sum_{i=n_\beta+1}^{N}\beta_i \le \frac{\beta_{n_\beta+1}}{\alpha} \le \frac{1}{\alpha T}$; continuing the above derivation, we have:

$\ell(\bar\pi) \le \epsilon_N + \gamma_N + \frac{\ell_{\max}}{N}\Big[\frac{\log T}{\alpha} + 1 + \frac{1}{\alpha}\Big].$

Given the fact $J(\bar\pi) \le T\,\ell(\bar\pi)$ and representing the third term as $O\!\left(\frac{\ell_{\max}\,T\log T}{N}\right)$, we have proved the theorem.
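For concreteness, the DAGGER interaction loop can be sketched on a toy 1-D steering task. This is a hypothetical illustration, not the paper's setup: the linear policy class, the toy dynamics, and the expert gain of -0.5 are all assumptions. The loop structure is the point: roll out the current policy, have the expert label the visited states, aggregate, and refit.

```python
import random

def expert_action(x):
    return -0.5 * x  # hypothetical expert: steer back toward the lane center

def rollout(w, T, rng):
    """Execute the learned linear policy a = w*x and record visited states."""
    x, states = 2.0, []
    for _ in range(T):
        states.append(x)
        x = x + w * x + rng.gauss(0.0, 0.05)
    return states

def fit(data):
    """Least-squares fit of a = w*x on the aggregated (state, label) pairs."""
    sxx = sum(x * x for x, _ in data)
    sxa = sum(x * a for x, a in data)
    return sxa / sxx if sxx > 0 else 0.0

rng = random.Random(0)
dataset, w = [], 0.0              # start from an untrained policy
for _ in range(10):               # DAGGER iterations
    states = rollout(w, T=20, rng=rng)
    dataset += [(x, expert_action(x)) for x in states]  # expert labels visited states
    w = fit(dataset)              # refit on all data aggregated so far
print(round(w, 2))                # w recovers the expert's gain
```

Because the learner trains on states induced by its own policy, the mismatch between training and execution distributions shrinks over iterations, which is exactly what the regret-based bound above formalizes.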
With the assumption that we can treat the trajectories generated by our model, together with the additional data generated based on them, as the result of running a learned policy that samples independent expert trajectories at different states during policy roll-out, we have the following guarantee for ADAPS. To better understand the following theorem and its proof, we recommend interested readers read the proofs of Theorems 2 and 3 first.
If the surrogate loss $\ell$ upper bounds the true cost $C$, by collecting $K$ trajectories using ADAPS at each iteration, then for any $\delta \in (0,1)$, with probability at least $1-\delta$, we have the following guarantee:

$J(\hat\pi) \le J(\bar\pi) \le T\,\hat\epsilon_N + T\,\hat\gamma_N + T\,\ell_{\max}\sqrt{\frac{2\log(1/\delta)}{NK}}.$

Assume at the $i$th iteration, our model generates $K$ trajectories. These trajectories are independent from each other since they are generated using different parameters and at different states during the analysis of an accident. For the $k$th trajectory, $k \in \{1,\dots,K\}$, we can construct an estimate $\hat\ell_{ik}$ of the expected loss of $\hat\pi_i$, where $\hat\pi_i$ is the learned policy from data gathered in previous iterations. Then, the approximated expected loss is the average of these estimates: $\hat\ell_i = \frac{1}{K}\sum_{k=1}^{K}\hat\ell_{ik}$. We denote $\hat\epsilon_N$ as the approximated minimal loss in hindsight after $N$ iterations; then $\hat\gamma_N = \frac{1}{N}\sum_{i=1}^{N}\hat\ell_i - \hat\epsilon_N$ is the approximated average regret.

Define random variables $X_{ik} = \mathbb{E}_{s\sim d_{\hat\pi_i}}[\ell(s,\hat\pi_i)] - \hat\ell_{ik}$, for $i \in \{1,\dots,N\}$ and $k \in \{1,\dots,K\}$. Consequently, the partial sums of $X_{11},\dots,X_{NK}$ form a martingale and $|X_{ik}| \le \ell_{\max}$. By the Azuma-Hoeffding inequality, with probability at least $1-\delta$, we have $\frac{1}{NK}\sum_{i=1}^{N}\sum_{k=1}^{K}X_{ik} \le \ell_{\max}\sqrt{\frac{2\log(1/\delta)}{NK}}$.

Next, we denote the expected loss of any policy under its own induced state distribution as $\ell(\pi) = \mathbb{E}_{s\sim d_{\pi}}[\ell(s,\pi)]$ and consider $\bar\pi$ as the mixed policy that samples the policies $\hat\pi_{1:N}$ uniformly at the beginning of each trajectory. At each iteration, during the data collection, we only execute the learned policy instead of mixing it with the expert's policy, which leads to $d_{\pi_i} = d_{\hat\pi_i}$. Finally, we can show:

$\ell(\bar\pi) = \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{s\sim d_{\hat\pi_i}}[\ell(s,\hat\pi_i)] \le \frac{1}{NK}\sum_{i=1}^{N}\sum_{k=1}^{K}\hat\ell_{ik} + \ell_{\max}\sqrt{\frac{2\log(1/\delta)}{NK}} = \hat\epsilon_N + \hat\gamma_N + \ell_{\max}\sqrt{\frac{2\log(1/\delta)}{NK}}.$

Summing over $T$ steps proves the theorem.
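The concentration step can be sanity-checked empirically. The sketch below is a toy setup with a fixed seed: the uniform increments are an assumption for illustration, standing in for the bounded martingale differences. Averages of N·K such increments should fall below the Azuma-Hoeffding bound ℓ_max·sqrt(2·log(1/δ)/(N·K)) in at least a 1−δ fraction of repetitions.

```python
import math
import random

def avg_deviation(nk, rng, lmax=1.0):
    """Average of nk independent zero-mean increments bounded by lmax."""
    return sum(rng.uniform(-lmax, lmax) for _ in range(nk)) / nk

rng = random.Random(42)
delta, lmax = 0.05, 1.0
for nk in (100, 1000, 10000):
    # Azuma-Hoeffding deviation bound for the average of nk bounded differences
    bound = lmax * math.sqrt(2.0 * math.log(1.0 / delta) / nk)
    hits = sum(avg_deviation(nk, rng) <= bound for _ in range(200))
    print(nk, hits)  # the bound holds in (at least) ~95% of the 200 repetitions
```

Note the 1/sqrt(nk) decay of the bound: collecting more trajectories per iteration (larger K) tightens the high-probability term in the theorem at exactly this rate.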
IX-B Network Specification
All modules within our control mechanism share a similar network architecture that combines Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNNs). Each image first goes through the CNN and is then combined with other images to form a training sample, which goes through the LSTM. The number of images per training sample is empirically set to 5. We use the many-to-many mode of the LSTM and set the number of its hidden units to 100. The output is the average value of the output sequence.
The CNN consists of eight layers: the first five are convolutional layers and the last three are dense layers. The kernel size is 5×5 in the first three convolutional layers and 3×3 in the other two. The first three convolutional layers have a stride of two, while the last two are non-strided. The numbers of filters for the five convolutional layers are 24, 36, 48, 64, and 64, respectively. All convolutional layers use VALID padding. The three dense layers have 100, 50, and 10 units, respectively. We use ELU as the activation function and $L_2$ kernel regularization with a weight of 0.001 for all layers.
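The resulting feature-map sizes follow from the VALID-padding formula out = ⌊(in − kernel)/stride⌋ + 1. The sketch below traces them through five convolutional layers; the 66×200 input resolution and the 5×5/3×3 kernel sizes are assumptions for illustration, borrowed from a common end-to-end driving network of this shape.

```python
def valid_out(size, kernel, stride):
    # VALID padding: output = floor((input - kernel) / stride) + 1
    return (size - kernel) // stride + 1

# five conv layers: three 5x5 with stride 2, then two 3x3 non-strided (assumed)
layers = [(5, 2), (5, 2), (5, 2), (3, 1), (3, 1)]
h, w = 66, 200  # hypothetical input resolution, for illustration only
for k, s in layers:
    h, w = valid_out(h, k, s), valid_out(w, k, s)
    print(h, w)  # feature-map height and width after each conv layer
```

Under these assumptions the final feature map is 1×18 with 64 filters, i.e. a 1152-dimensional vector entering the first dense layer.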
We train our model using Adam with the initial learning rate set to 0.0001. The batch size is 128 and the number of epochs is 500. For training Detection (a classification task), we use Softmax for generating the output and categorical cross-entropy as the loss function. For training Following and Avoidance (regression tasks), we use mean squared error (MSE) as the loss function. We have also adopted cross-validation with a 90/10 split. The input images are in RGB channels.
IX-C Example Expert Trajectories
Figure 4 shows a set of generated trajectories for a situation where the vehicle had collided with a static obstacle in front of it after driving on a straight road. As expected, the trajectories feature sharper turns (red trajectories) as the starting state tends towards the last moment that the vehicle can still avoid the obstacle.
IX-D Learning From Accidents
For the following paragraph, we abusively note $p_s$ the position coordinates at state $s$, and $v_s$ the velocity vector coordinates at state $s$. Then, for any state $s$ we can define a line $l_s$ through $p_s$ perpendicular to $v_s$. On this line, we note $f_s$ the furthest point on $l_s$ from $p_s$ which is at an intersection between $l_s$ and a collision-free trajectory from $s$. This point determines how far the vehicle can be expected to stray from the original trajectory before the accident, if it followed an arbitrary trajectory from $s$. We also note $e_s^l$ and $e_s^r$ the two intersections between $l_s$ and the road edges ($e_s^l$ is on the “left” of the original trajectory, and $e_s^r$ is on the “right”). These two points delimit how far from the original trajectory the vehicle could be. Finally, we define a user-set margin $m$ as outlined below.
Altogether, these points and the margin are the limits of the region along the original trajectory wherein we generate images for training: a point is inside the region if it is between the original trajectory and the furthest collision-free trajectory plus a margin (if $f_s$ and the point are on the same side, i.e.