Urban Driver: Learning to Drive from Real-world Demonstrations Using Policy Gradients

09/27/2021 ∙ by Oliver Scheel, et al. ∙ 0

In this work we are the first to present an offline policy gradient method for learning imitative policies for complex urban driving from a large corpus of real-world demonstrations. This is achieved by building a differentiable data-driven simulator on top of perception outputs and high-fidelity HD maps of the area. It allows us to synthesize new driving experiences from existing demonstrations using mid-level representations. Using this simulator we then train a policy network in closed-loop employing policy gradients. We train our proposed method on 100 hours of expert demonstrations on urban roads and show that it learns complex driving policies that generalize well and can perform a variety of driving maneuvers. We demonstrate this in simulation as well as deploy our model to self-driving vehicles in the real-world. Our method outperforms previously demonstrated state-of-the-art for urban driving scenarios – all this without the need for complex state perturbations or collecting additional on-policy data during training. We make code and data publicly available.



There are no comments yet.


page 2

page 3

page 5

page 6

page 7

page 9

page 10

page 14

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Self-driving has the potential to revolutionize transportation and is a major field of AI applications. Even though already in 1990 there were prototypes capable of driving on highways [13]

, technology is still not widespread, especially in the context of urban driving. In the past decade, the availability of large datasets and high-capacity neural networks has enabled significant progress in perception

[26, 12]

and the vehicles’ ability to understand their surrounding environment. Self-driving decision making, however, has seen very little benefit from machine learning or large datasets. State-of-the-art planning systems used in industry

[15] still heavily rely on trajectory optimisation techniques with expert-defined cost functions. These cost functions capture desirable properties of the future vehicle path. However, engineering these cost functions scales poorly with the complexity of driving situations and the long tail of rare events.

Due to this, learning a driving policy directly from expert demonstrations is appealing, since performance scales to new domains by adding data rather than via additional human engineering effort. In this paper we focus specifically on learning rich driving policies for urban driving from large amounts of real-world collected data. Unlike highway driving [19], urban driving requires performing a variety of maneuvers and interactions with, e.g., traffic lights, other cars and pedestrians.

Recently, rich mid-level representations powered by large-scale datasets [21, 11], HD-maps and high-performance perception systems enabled capturing nuances of urban driving. This led to new methods achieving high performance for motion prediction [16, 29]. Furthermore, [3] demonstrated that leveraging these representations and behavioral cloning with state perturbations leads to learning robust driving policies. While promising, difficulty of this approach lies in engineering the perturbation noise mechanism required to avoid covariate shift between training and testing distribution.

Inspired by this approach, we present the first results on offline learning of imitating driving policies using mid-level representations, a closed-loop simulator and a policy gradient method. This formulation has several benefits: it can successfully learn high-complexity maneuvers without the need for perturbations, implicitly avoid the problem of covariate shift, and directly optimize imitation as well as auxiliary costs. The proposed simulator is constructed directly from the collected logs of real-world demonstrations and HD maps of the area, and can synthesize new realistic driving episodes from past experiences (see Figure 1

for an overview of our method). Furthermore, for training on large datasets reducing the computational complexity is paramount. We leverage vectorized representations and show how this allows for computing policy gradients quickly using backpropagation through time. We demonstrate how these choices lead to superior performance of our method over the existing state-of-the-art in imitation learning for real-world self-driving planning in urban areas.

Our contributions are four-fold:

  • The first demonstration of policy gradient learning of imitative driving policies for complex urban driving from a large corpus of real-world demonstrations. We leverage a closed-loop simulator and rich, mid-level vectorized representations to learn policies capable of performing a variety of maneuvers.

  • A new differentiable simulator that enables efficient closed-loop simulation of realistic driving experiences based on past demonstrations, and quickly compute policy gradients by backpropagation through time, allowing fast learning.

  • A comprehensive qualitative and quantitative evaluation of the method and its performance compared to existing state-of-the-art. We show that our approach, trained purely in simulation can control a real-world self-driving vehicle, outperforms other methods, generalizes well and can effectively optimize both imitation and auxiliary costs.

  • Source code and data are made available to the public111Code and video available at https://planning.l5kit.org..

2 Related Work

In this section we summarize different approaches for solving self-driving vehicle (SDV) decision-making in both academia and industry. In particular, we focus on both optimisation-based and ML-based systems. Furthermore, we discuss the role of representations and datasets as enablers in recent years, to tackle progressively more complex Autonomous Driving (AD) scenarios.

Trajectory-based optimization is still a dominant approach used in industry for both highway and urban-driving scenarios. It relies on manually defined costs and reward functions that describe good driving behavior. Such cost can then be optimized using a set of classical optimization techniques (A* [47], RRTs [22], POMDP with solver [2], or dynamic programming [31]). Appealing properties of these approaches are their interpretability and functional guarantees, which are important for AD safety. These methods, however, are very difficult to scale. They rely on human engineering rather than on data to specify functionality. This becomes especially apparent when tackling complex urban driving scenarios, which we address in this work.

Reinforcement learning (RL) [40] removes some complexity of human engineering by providing a reward (cost) signal and uses ML to learn an optimal policy to maximize it. Directly providing the reward through real-time disengagements [23], however, is impractical due to a low sample-efficiency of RL and the involved risks. Therefore, most approaches [24] rely on constructing a simulator and explicitly encoding and optimising a reward signal [37]. A limiting factor of these approaches is that the simulator often is hand-engineered [14, 30], limiting its ability to capture long-tail real-world scenarios. Recent examples of sim-to-real policy transfer (e.g. [43], [33], [1]) were not focused on evaluating scenarios typical to urban driving, in particular interacting with other agents. In our work, we construct the simulator directly from real-world logs through mid-level representations. This allows training in a variety of real-world scenarios with other agents present, while employing efficient policy-gradient learning.

Imitation learning (IL) and Inverse Reinforcement Learning (IRL) [35, 46] are more scalable ML approaches that leverage expert demonstrations. Instead of learning from negative events, aim is to directly copy expert behavior or recover the underlying expert costs. Simple behavioral cloning was applied already back in 1989 [34] on rural roads, more recently by [8] on highways and [17] in urban driving. Naive behavioral cloning, however, suffers from covariate shift [35]. This issue has been successfully tackled for highway lane-following scenarios by reframing the problem as classification task [7] or employing a simple simulator [19], constructed from highway cameras. We take inspiration from these approaches but focus on the significantly more complex task of urban driving. Theoretically, our work is motivated by [42], as we employ a similar principle of generating synthetic corrections to simulate querying an expert. Due to this, identical proven guarantees hold for our method, namely the ideal linear regret bound, mitigating the problem of covariate shift. Adversarial Imitation Learning comprises another important field [4, 20, 6], but, to the best of our knowledge, has seen little application to autonomous driving and no actual SDV deployment yet.

Neural Motion Planners are another approach used for autonomous driving. In [45]

raw sensory input and HD-maps are used to estimate cost volumes of the goodness of possible future SDV positions. Based on these cost volumes, trajectories can be sampled and the lowest-cost one is selected to be executed. This was further improved in

[10], where the dependency on HD-maps was dropped. To the best of our knowledge, these promising methods have not yet been demonstrated to drive a car in the real-world though.

Mid-representations and the availability of large-scale real-world AD datasets [21, 11] have been major enablers in recent years for tackling complex urban scenarios. Instead of learning policies directly from sensor data, the input of the model comprises the output of the perception system as well as an HD map of the area. This representation compactly captures the nuance of urban scenarios and allows large-scale training on hundreds or thousands of hours of real driving situations. This led to new state-of-the-art solutions for motion forecasting [16, 29]. Moreover, [3] demonstrated that using mid-representations, large-scale datasets and simple behavioral cloning with perturbations [27] can scale and learn robust planning policies. The difficulty of this approach, however, is in engineering the noise model to overcome the covariate shift problem. In our work we are inspired by this approach, but attempt to learn robust policies using policy gradient optimisation [18] featuring unrolling and evaluating the policy during training. This implicitly avoids the problem of covariate shift and leads to superior results. This approach is, however, more computationally expensive and requires a simulator. To solve this, we show how a fast and powerful simulator can be constructed directly from real-world logs enabling scalability of this approach.

Data-driven simulation.

A realistic simulator is useful for both training and validation of ML models. However, many current simulators (e.g. [14, 28]

) depend on heuristics for vehicle control and do not capture the diversity of real world behaviours. Data-driven simulators are designed to alleviate this problem.

[1] created a photo-realistic simulator for training an end-to-end RL policy. [19] simulated a bird’s-eye view of dense traffic on a highway. Finally, two recent works [5, 39] developed data-driven simulators and showed their usefulness for training and validating ML planners. In this work we show that a simpler, differentiable simulator based on replaying logs is effective for training.

3 Differentiable Traffic Simulator from Real-world Driving Data

In this section we describe a differentiable simulator that approximates new driving experiences based on an experience collected in the real world. This simulator is used during policy learning for the closed-loop evaluation of the current policy’s performance and computing the policy gradient. As shown in Section 5, differentiability is an important building block for achieving good results, especially when employing auxiliary costs.

We represent the real-world experience as a sequence of state observations around the vehicle over time:


We use a vectorized representation based on [16], in which each state observation consists of a collection of static and dynamic elements around the vehicle pose , with denoting the special Euclidean group. Static elements include traffic lanes, stop signs and pedestrian crossings. These elements are extracted from the underlying HD semantic map of the area using the localisation system. The dynamic elements include traffic lights status and traffic participants (other cars, buses, pedestrians and cyclists). These are detected in real-time using the on-board perception system. Each element includes a pose relative to the SDV pose , as well as additional features, such as the element type, time of observation, and other optional attributes, e.g. the color of associated traffic lights, recent history of moving vehicles, etc. The full details of this representation are provided in Appendix C.

Goal of the simulation is to iteratively generate a sequence of state observations that corresponds to a different sequence of driver actions in the scenario. This is done by first computing the corresponding SDV trajectory and then locally transforming states .

Figure 2: Differentiable simulator from observed real-world logs: based on a ground truth log (top row), we unroll a new trajectory corresponding to different SDV actions (e.g. given by a planner) in the simulator approximating the vectorized world representation (bottom row)

Updated poses of the SDV are determined by a kinematic model , which is assumed to be differentiable. The state observation is then obtained by computing the new position for each state element using a transformation along the differences of the original and updated pose:


See Figure 2 for an illustrative example. It is worth noting that this approximation is effective if the distance between the original and generated SDV pose is not too large.

We denote performing these steps in sequence with the step-by-step simulation transition function . Moreover, since both Equation (2) and vehicle dynamics are fully differentiable, we can compute gradients with respect to both the state () and action (). This is critical for the efficient computation of policy gradients using backpropagation through time as described in the next section.

4 Imitation Learning Using a Differentiable Simulator

In this part, we detail how we use the simulator described in the previous section to learn a deterministic policy to drive a car using closed-loop policy learning.

We frame the imitation learning problem as minimisation of the L1 pose distance between the expert and learner on a sequence of collected real-world demonstrations . Note that with a slight abuse of notation we use the poses here to refer to 3D vectors , instead of roto-translation matrices in – yielding the common L1 norm and loss. This can be expressed as a discounted cumulative expected loss [32] on the set of collected expert scenarios:


Optimising this objective pushes the trajectory taken by the learned policy as close as possible to the one of the expert, as well as limiting the trajectory to the region where the approximation given by the simulator is effective. In Appendix B we further extend this to include auxiliary cost functions with the aim of optimising additional objectives.

We can use any policy optimisation method [36, 25] to optimize Equation (3). However, given that the transition is differentiable, we can exploit it for a more effective training that does not require a separate estimation of a value function. As shown in [4, 18, 38], this results into an order of magnitude more efficient training. The optimisation process consists of repeatedly sampling pairs of expert and policy trajectories , and computing the policy gradient for these samples to minimize Equation (3). We describe both steps in detail in the following subsections.

Figure 3: One iteration of policy gradient update. Given a real-world expert trajectory we sample a policy state by unrolling the policy for steps. We then compute optimal policy update by backpropagation through time.

4.1 Sampling from a Policy Distribution

Input: Expert policy samples
Output: Learned policy
random ;
for  do
       for t = 1 to T do
       end for
       for  downto  do
       end for
end for
Algorithm 1 Imitation learning from expert demonstrations

In this subsection we detail sampling pairs of expert () and corresponding policy trajectory () drawn from policy .

Sampling expert trajectories consists of simply sampling from the collected dataset of expert demonstrations. To generate the policy sample we acquire an expert state , and then unroll the current policy for steps using the simulator .

This naive method, however, introduces bias, as the initial state of the trajectory is always drawn from the expert and not from the policy distribution . As shown in Appendix B, this results in the under-performance of the method. To remove this bias we discard the first timesteps from both trajectories and use only the remaining timesteps to estimate the policy gradient as described next (see Figure 3 for a visualization).

4.2 Computing Policy Gradient

Here we describe the computation of the policy gradient around the rollout trajectory given by the current policy. This gradient can be computed for deterministic policies using backpropagation through time leveraging the differentiability of the simulator . Note that we denote partial differentiation with subscripts, i.e. . We follow the formulation in [18] and express the gradient by a pair of recursive formulas:


The resulting algorithm is outlined in Algorithm 1 and illustrated in Figure 3. It can be implemented simply as one forward pass of length and one backward pass of length . To compute the policy gradient we use equations (4) and (5) recursively from to and use it to update policy parameters .

5 Experiments

In this section we evaluate our proposed method and benchmark it against existing state-of-the-art systems. In particular, we are interested in: its ability to learn robust policies dealing with various situations observed in the real world; its ability to tailor performance using auxiliary costs; the sensitivity of key hyper-parameters; and the impact on performance with increasing amounts of training data. Additional results can be found in the appendix and the accompanied video.

5.1 Dataset

For training and testing our models we use the Lyft Motion Prediction Dataset [21]. This dataset was recorded by a fleet of Autonomous Vehicles and contains samples of real-world driving on a complex, urban route in Palo Alto, California. The dataset captures various real-world situations, such as driving in multi-lane traffic, taking turns, interactions with vehicles at intersections, etc. Data was preprocessed by a perception system, yielding the precise position of nearby vehicles, cyclists and pedestrians over time. In addition, a high-definition map provides locations of lane markings, crosswalks and traffic lights. All models are trained on a 100h subset, and tested on 25h. The training dataset is identical to the publicly available one, whereas for the sake of execution speed for testing we use a random, but fixed, subset of the listed test dataset, which is roughly in size.

5.2 Baselines

We compare our proposed algorithm against three state-of-the-art baselines:

  • Naive Behavioral Cloning (BC): we implement standard behavioral cloning using our vectorized backbone architecture. We do not use the SDV’s history as an input to the model to avoid causal confusion (compare [3]).

  • Behavioral Cloning + Perturbations (BC-perturb): we re-implement a vectorized version of ChauffeurNet [3] using our backbone network. As in the original paper, we add noise in the form of perturbations during training, but do not employ any auxiliary losses. We test two versions: without the SDV’s history, and using the SDV’s history equipped with history dropout.

  • Multi-step Prediction (MS Prediction): we apply the meta-learning framework proposed in [42] to train our vectorized network. We observe that a version of this algorithm can conveniently be expressed within our framework; we obtain it by explicitly detaching gradients between steps (i.e. ignoring the full differentiability of our simulation environment). Differently from the original work [42], we do not save past unrolls as new dataset samples over time.

5.3 Implementation

Inspired by [16, 41], we use a graph neural network for parametrizing our policy. It combines a PointNet-like architecture for local inputs processing followed by an attention mechanism for global reasoning. In contrast to [16], we use points instead of vectors. Given the set of points corresponding to each input element, we employ 3 PointNet layers to calculate a 128-dimensional feature descriptor. Subsequently, a single layer of scaled dot-product attention performs global feature aggregation, yielding the predicted trajectory. We found and to work well, i.e. we use 20 timesteps for the initial sampling and effectively predict 12 trajectory steps. is set to 0.8. In total, our model contains around 3.5 million trainable parameters, and training takes 30h on 32 Tesla V100 GPUs. For more details we refer to Appendix C.

For the vehicle kinematics model we use an unconstrained model with . This allows for a fair comparisons with the baselines as both BC-perturb and MS Prediction assume the possibility of arbitrary pose corrections. Other kinematics models, such as unicycle or bicycle models, could be used with our method as well.

All baseline methods share the same network backbone as ours, with model specific differences as described above – and BC and BC-perturb predicting a full T-step trajectory with a single forward, while MS Prediction and ours are calling the model times. To ensure a fair comparison, also for MS Prediction we use our proposed sampling procedure, i.e. use the first

steps for sampling only. We train all models for 61 epochs with a learning rate of

, and drop it to after 54 epochs. We note that we achieve best results for our proposed method by disabling dropout, and hypothesize this is related to similar issues observed for RNNs [44].

We refer the reader to Appendix B for ablations on the influence on data and sampling.

5.4 Metrics

We implement the metrics describe below to evaluate the planning performance. These capture key imitation performance, safety and comfort. In particular, we report the following, which are normalized – if applicable – per 1000 miles driven by the respective planner:

  • L2: L2 distance to the underlying expert position in the driving log in meters.

  • Off-road events: we report a failure if the planner deviates more than 2m laterally from the reference trajectory – this captures events such as running off-road and into opposing traffic.

  • Collisions: collisions of the SDV with any other agent, broken down into front, side and rear collisions w.r.t. the SDV.

  • Comfort: we monitor the absolute value of acceleration, and raise a failure should this exceed 3 m/s.

  • I1K: we accumulate safety-critical failures (collisions and off-road events) into one key metric for ease of comparison, namely Interventions per 1000 Miles (I1K).

Configuration Collisions Imitation
Model SDV history Front Side Rear Off-road L2 Comfort I1K
BC 79 23 395 170 997 74 1618 459 1.57 0.27 93K 3K 3,091 601
BC-perturb 16 2 56 6 411 146 82 11 0.74 0.01 203K 6K 567 128
BC-perturb 14 4 73 7 678 11 77 6 0.77 0.01 636K 22K 843 6
MS Prediction 18 6 55 4 125 14 141 31 0.46 0.02 595K 49K 341 39
Ours 15 7 46 5 101 13 97 6 0.42 0.00 637K 41K 260 9
Table 1:

Normalized metrics for all baselines and our method – reporting mean and standard deviation for each as obtained from 3 runs. For all, lower is better. Our method overall yields best performance and lowest I1K.

Figure 4: Qualitative results of our method controlling the SDV. Every row depicts two scenes, images are 2s apart. The SDV is drawn in red, other agents in blue and crosswalks in yellow. Traffic lights colors are projected onto the affected lanes. Best view on a screen.

5.5 Imitation Results

We evaluate our method and all the baselines by unrolling the policy on 3600 sequences of 25 seconds length from the test set and measure the above metrics.

Table 1 reports performance when all methods are trained to optimize the imitation loss alone. Behavioral cloning yields a high number of trajectory errors and collisions. This is expected, as this approach is known to suffer from the issue of covariate shift [35]. Including perturbation during training dramatically improves performance as it forces the method to learn how to recover from drifting. We further observe that MS Prediction yields comparable results for many categories, while yielding less rear collisions. We attribute this to the further reduction of covariate shift when compared to the previous methods: the training distribution is generated on-policy instead of being synthesized by adding noise. Finally, our method yields best results overall. It is worth noting that all models share a high number of comfort failures, due to the fact that they are all trained for imitation performance alone, which does not optimize for comfort, but only positional accuracy of the driven vehicle – which we address in the appendix.

5.6 In-car Testing

In addition to above stated simulation results, we further deployed our planner on SDVs in the real. For this, a Ford Fusion equipped with 7 camera, 3 LiDAR and 11 Radar sensors was employed. The sensor setup thus equals the one used for data collection, and during road-testing our perception and data-processing stack is run in real-time to generate the desired scene representation on the fly. For this, vehicles are equipped with 8 Nvidia 2080 TIs. Experiments were conducted on a private test track, including other traffic participants and reproducing challenging driving scenarios. Furthermore, this track was never shown to the network before, and thus offers valuable insights into generalization performance. Figure

5 shows our model successfully crossing a signaled intersection, for more results we refer to the appendix and our supplementary video.

Figure 5: Front-camera footage of our planner stopping for a red light, and subsequently crossing the intersection when the signal turns green (images from left to right, recorded several seconds apart).

6 Conclusion

In this work we have introduced a method for learning an autonomous driving policy in an urban setting, using closed-loop training, mid-level representations with a data-driven simulator and a large corpus of real world demonstrations. We show this yields good generalization and performance for complex, urban driving. In particular, it can control a real-world self-driving vehicle, yielding better driving performance than other state-of-the-art ML methods.

We believe this approach can be further extended towards production-grade real-world driving requirements of L4 and L5 systems – in particular, for improving performance in novel or rarely seen scenarios and to increase sample efficiency, allowing further scaling to millions of hours of driving.

We would like to thank everyone at Level 5 working on data-driven planning, in particular Sergey Zagoruyko, Yawei Ye, Moritz Niendorf, Jasper Friedrichs, Li Huang, Qiangui Huang, Jared Wood, Yilun Chen, Ana Ferreira, Matt Vitelli, Christian Perone, Hugo Grimmett, Parth Kothari, Stefano Pini, Valentin Irimia and Ashesh Jain. Further we would like to thank Alex Ozark, Bernard Barcela, Alex Oh, Ervin Vugdalic, Kimlyn Huynh and Faraz Abdul Shaikh for deploying our planner to SDVs in the real.


  • [1] A. Amini, I. Gilitschenski, J. Phillips, J. Moseyko, R. Banerjee, S. Karaman, and D. Rus (2020) Learning robust control policies for end-to-end autonomous driving from data-driven simulation. Robotics and Automation Letters. Cited by: §2, §2.
  • [2] T. Bandyopadhyay, K. S. Won, E. Frazzoli, D. Hsu, W. S. Lee, and D. Rus (2013) Intention-aware motion planning. In Algorithmic Foundations of Robotics X, E. Frazzoli, T. Lozano-Perez, N. Roy, and D. Rus (Eds.), Cited by: §2.
  • [3] M. Bansal, A. Krizhevsky, and A. Ogale (2018-12) ChauffeurNet: learning to drive by imitating the best and synthesizing the worst. Cited by: §1, §2, 1st item, 2nd item.
  • [4] N. Baram, O. Anschel, I. Caspi, and S. Mannor (2017) End-to-end differentiable adversarial imitation learning. In Int. Conf. on Machine Learning, Cited by: §2, §4.
  • [5] L. Bergamini, Y. Ye, O. Scheel, L. Chen, C. Hu, L. Del Pero, B. Osinski, H. Grimmett, and P. Ondruska (2021) SimNet: learning reactive self-driving simulations from real-world observations. Int. Conf. on Robotics and Automation. Cited by: §2.
  • [6] R. P. Bhattacharyya, B. Wulfe, D. J. Phillips, A. Kuefler, J. Morton, R. Senanayake, and M. J. Kochenderfer (2020) Modeling human driving behavior through generative adversarial imitation learning. ArXiv. Cited by: §2.
  • [7] M. Bojarski, C. Chen, J. Daw, A. Değirmenci, J. Deri, B. Firner, B. Flepp, S. Gogri, J. Hong, L. Jackel, et al. (2020) The nvidia pilotnet experiments. arXiv preprint arXiv:2010.08776. Cited by: §2.
  • [8] M. Bojarski, D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba (2016) End to end learning for self-driving cars. ArXiv. Cited by: §2.
  • [9] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. Cited by: Appendix C: Policy architecture and state representation.
  • [10] S. Casas, A. Sadat, and R. Urtasun (2021) MP3: a unified model to map, perceive, predict and plan. In

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    pp. 14403–14412. Cited by: §2.
  • [11] M. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, P. C. De Wang, S. Lucey, D. Ramanan, and J. Hays Argoverse: 3d tracking and forecasting with rich maps supplementary material. Int. Conf. on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §2.
  • [12] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas (2017)

    PointNet: deep learning on point sets for 3d classification and segmentation

    In Int. Conf. on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [13] E. Dieter. Dickmanns (2010) Dynamic vision for perception and control of motion. Cited by: §1.
  • [14] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun (2017) CARLA: An open urban driving simulator. In 1st Annual Conference on Robot Learning, Cited by: §2, §2.
  • [15] H. Fan, F. Zhu, C. Liu, L. Zhang, L. Zhuang, D. Li, W. Zhu, J. Hu, H. Li, and Q. Kong (2018) Baidu apollo em motion planner. ArXiv. Cited by: §1.
  • [16] J. Gao, C. Sun, H. Zhao, Y. Shen, D. Anguelov, C. Li, and C. Schmid (2020) VectorNet: encoding hd maps and agent dynamics from vectorized representation. In Int. Conf. on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §3, §5.3.
  • [17] J. Hawke, R. Shen, C. Gurau, S. Sharma, D. Reda, N. Nikolov, P. Mazur, S. Micklethwaite, N. Griffiths, A. Shah, and A. Kendall (2020) Urban driving with conditional imitation learning. In Int. Conf. on Robotics and Automation (ICRA), Cited by: §2.
  • [18] N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa (2015) Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, Cited by: §2, §4.2, §4.
  • [19] M. Henaff, A. Canziani, and Y. LeCun (2019) Model-predictive policy learning with uncertainty regularization for driving in dense traffic. ArXiv. Cited by: §1, §2, §2.
  • [20] J. Ho and S. Ermon (2016) Generative adversarial imitation learning. arXiv preprint arXiv:1606.03476. Cited by: §2.
  • [21] J. Houston, G. Zuidhof, L. Bergamini, Y. Ye, A. Jain, S. Omari, V. Iglovikov, and P. Ondruska (2020) One thousand and one hours: self-driving motion prediction dataset. Conference on Robot Learning (CoRL). Cited by: §1, §2, §5.1.
  • [22] J. J. K. Jr. and S. M. Lavalle (2000) RRT-connect: an efficient approach to single-query path planning. In Int. Conf. on Robotics and Automation, Cited by: §2.
  • [23] A. Kendall, J. Hawke, D. Janz, P. Mazur, D. Reda, J. Allen, V. Lam, A. Bewley, and A. Shah (2019) Learning to drive in a day. In Int. Conf. on Robotics and Automation (ICRA), Cited by: §2.
  • [24] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. Al Sallab, S. Yogamani, and P. Pérez (2021) Deep reinforcement learning for autonomous driving: a survey. Transactions on Intelligent Transportation Systems. Cited by: §2.
  • [25] V. Konda and J. Tsitsiklis (2000) Actor-critic algorithms. In SIAM Journal on Control and Optimization, Cited by: §4.
  • [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, Cited by: §1.
  • [27] M. Laskey, J. Lee, R. Fox, A. Dragan, and K. Goldberg (2017) DART: noise injection for robust imitation learning. In 1st Annual Conference on Robot Learning, Cited by: §2.
  • [28] E. Leurent (2018) An environment for autonomous driving decision-making. Cited by: §2.
  • [29] M. Liang, B. Yang, R. Hu, Y. Chen, R. Liao, S. Feng, and R. Urtasun (2020) Learning lane graph representations for motion forecasting. Cited by: §1, §2.
  • [30] P. A. Lopez, M. Behrisch, L. Bieker-Walz, J. Erdmann, Y. Flötteröd, R. Hilbrich, L. Lücken, J. Rummel, P. Wagner, and E. Wießner (2018) Microscopic traffic simulation using sumo. In Int. Conf. on Intelligent Transportation Systems (ITSC), Cited by: §2.
  • [31] M. Montemerlo, J. Becker, S. Bhat, H. Dahlkamp, D. Dolgov, S. Ettinger, D. Haehnel, T. Hilden, G. Hoffmann, B. Huhnke, D. Johnston, S. Klumpp, D. Langer, A. Levandowski, J. Levinson, J. Marcil, D. Orenstein, J. Paefgen, I. Penny, A. Petrovskaya, M. Pflueger, G. Stanek, D. Stavens, A. Vogt, and S. Thrun (2009) Junior: the stanford entry in the urban challenge. In The DARPA Urban Challenge: Autonomous Vehicles in City Traffic, Cited by: §2.
  • [32] A. Y. Ng, S. J. Russell, et al. (2000) Algorithms for inverse reinforcement learning.. In Icml, Cited by: §4.
  • [33] B. Osiński, A. Jakubowski, P. Zięcina, P. Miłoś, C. Galias, S. Homoceanu, and H. Michalewski (2020) Simulation-based reinforcement learning for real-world autonomous driving. In Int. Conf. on Robotics and Automation (ICRA), Cited by: §2.
  • [34] D. A. Pomerleau (1989) ALVINN: an autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [35] S. Ross, G. Gordon, and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In

    Fourteenth Int. Conf. on Artificial Intelligence and Statistics

    Cited by: §2, §5.5.
  • [36] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §4.
  • [37] S. Shalev-Shwartz, S. Shammah, and A. Shashua (2016) Safe, multi-agent, reinforcement learning for autonomous driving. ArXiv. Cited by: §2.
  • [38] W. Sun, A. Venkatraman, G. J. Gordon, B. Boots, and J. A. Bagnell Deeply AggreVaTeD: differentiable imitation learning for sequential prediction. In Int. Conf. on Machine Learning, Cited by: §4.
  • [39] S. Suo, S. Regalado, S. Casas, and R. Urtasun (2021) TrafficSim: learning to simulate realistic multi-agent behaviors. Cited by: §2, Results for Optimizing Auxiliary Costs, Appendix D: Differentiable Collision Loss.
  • [40] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §2.
  • [41] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, Cited by: §5.3.
  • [42] A. Venkatraman, M. Hebert, and J. Bagnell (2015) Improving multi-step prediction of learned time series models. In AAAI, Cited by: §2, 3rd item.
  • [43] Y. You, X. Pan, Z. Wang, and C. Lu (2017) Virtual to real reinforcement learning for autonomous driving. Cited by: §2.
  • [44] W. Zaremba, I. Sutskever, and O. Vinyals (2014) Recurrent neural network regularization. Cited by: §5.3.
  • [45] W. Zeng, W. Luo, S. Suo, A. Sadat, B. Yang, S. Casas, and R. Urtasun (2019) End-to-end interpretable neural motion planner. Int. Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.
  • [46] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey (2008) Maximum entropy inverse reinforcement learning. In National Conference on Artificial Intelligence, Cited by: §2.
  • [47] J. Ziegler, M. Werling, and J. Schroder (2008) Navigating car-like robots in unstructured environments using an obstacle sensitive cost function. In Intelligent Vehicles Symposium, Cited by: §2.

Appendix A: Qualitative results

Figure 6 shows our method handling diverse, complex traffic situations well - it is identical to Figure 4 of the paper, but enlarged. For more qualitative results we refer to the supplementary video.

Figure 6: Qualitative results of our method controlling the SDV. Every row depicts one scene, images are 2s apart. The SDV is drawn in red, other agents in blue and crosswalks in yellow. Traffic lights colors are projected onto the affected lanes. Best view on a screen.

In-car testing

In this section we report additional results of deploying our trained policy to SDVs. Figure 7 shows our planner navigating through a multitude of challenging scenarios. For more results we refer to the supplementary video - where she show additional results in the form of videos, which also contain more information, namely different camera angles, the resulting scene understanding and planned trajectory of the SDV.

Figure 7: Qualitative results of our method controlling an SDV in the real. Every row depicts one scene, read left to right.

Appendix B: Additional Quantitative Results

Results for Optimizing Auxiliary Costs

In this section we investigate the ability to not only imitate expert behavior, but also to directly optimize metrics of interest. This mode blends pure imitation learning with reinforcement learning and allows tailoring certain aspects of the behavior, i.e. to optimize comfort or safety. To illustrate this, we consider optimising a mixed cost function that optimizes both L1 imitation loss and auxiliary losses:


Here is the magnitude of the acceleration at time and is a differentiable collision indicator, with denoting the set of other vehicles. This loss is based on [39], more details can be found in Appendix D. , allow to weigh the influence of these different losses.

The ability to succeed on this task requires optimally trading-off short- and long-term performance between pure imitation and other goals. Tables 2 and 3 summarize performance when including acceleration and collision loss, respectively. When including the acceleration term, we note our method is the only one to successfully trade-off performance between imitation and comfort cost, thanks to its capability to directly optimize over the full distribution: while I1K slightly increases with growing – which is expected – we can push comfort failures down to arbitrary levels. All other models fail for at least one of these metrics, and / or are insensitive to . When including the collision loss, results are closer together. We hypothesize this is due to , allowing one-step corrections and thus requiring less reasoning over the full time horizon.

Configuration Metrics Model I1K Comfort 0.01 0 BC-perturb 10,553 23,526 MS Prediction 1,428 20,980 Ours 2,512 10,168 0.03 0 BC-perturb 11,026 9,815 MS Prediction 2,205 15,546 Ours 2,147 4,670 0.1 0 BC-perturb 11,068 7,679 MS Prediction 2,316 28,780 Ours 2,737 3,307 Configuration Metrics Model Collisions Comfort 0 0 BC-perturb 772 600,778 MS Prediction 1,654 188,189 Ours 2,055 205,131 0 1000 BC-perturb 264 858,546 MS Prediction 612 388,632 Ours 765 258,114 0 10000 BC-perturb 568 943,144 MS Prediction 380 599,985 Ours 669 508,679
Table 2: Left: influence of the acceleration term weight . Only ours manages to find trade-offs and yields reasonable I1K and Comfort values. Right: influence of the collision term weight . For simplicity both experiments were run with , note that larger further improves performance of ours (compare Table 1 of the paper and Appendix B 2.2).
Configuration Metrics Model I1K Jerk Lat. Acc. 0.01 0 BC-perturb 10,553 47,579 921 MS Prediction 1,428 632,608 118 Ours 2,512 621,180 128 0.03 0 BC-perturb 11,026 19,033 931 MS Prediction 2,205 578,189 133 Ours 2,147 354,341 138 0.1 0 BC-perturb 11,068 18,599 2062 MS Prediction 2,316 691,540 133 Ours 2,737 131,589 128 Configuration Metrics Model Collisions Jerk Lat. Acc. 0 0 BC-perturb 772 1,914,230 8,052 MS Prediction 1,654 1,315,563 133 Ours 2,055 1,311,728 607 0 1000 BC-perturb 264 1,922,438 29,234 MS Prediction 612 1,455,627 1879 Ours 765 1,935,049 261 0 10000 BC-perturb 568 1,764,526 155,852 MS Prediction 380 1,580,696 3,131 Ours 669 1,498,776 1,158
Table 3: Repeating Table 2, but listing (longitudinal) jerk and lateral acceleration for comfort.

Ablation Studies

Figure 8 shows the impact of training dataset size on performance. We see the performance of the method improving with more data. Figure 9 demonstrates the effect of different on the performance of closed-loop training and thus demonstrates the importance of proper sampling.

Figure 8: Influence of training data on our planner’s performance: more data helps, but we seem to be reaching a plateau in performance. Figure 9: Importance of proper sampling: performance increases with growing K.

Discussion on Used Metrics

Metrics and their definition are naturally crucial for evaluating experiments - thus in the remainder of this section we list additional results using different thresholds and metrics. As reported in the paper, our default threshold for capturing deviations from the expert trajectory is 2m - which is based on average lane widths. Still, one can image wider lanes and less regulated traffic scenarios. Due to this Table 4 shows results of all examined methods using a threshold of 4m. Naturally, off-road failures increase, while other metrics improve due to our process of resetting after interventions. Still, one can observe that the reported results are relatively robust against such changes, i.e. the differences are small and relative trends still hold.

In the paper, for simplicity we measure comfort with one value, namely acceleration - which itself is based on differentiating speed, i.e. the travelled lateral and longitudinal distance divided by time. However, to reflect actual felt driving comfort, (longitudinal) jerk and lateral acceleration are better suited and more common in the industry. Therefore, Table 5 contains these additional values, and otherwise is identical to Table 1 of the original paper. These values yield more interesting insights into obtained driving behaviour, for example indicating that most discomfort is caused by longitudinal acceleration and jerk, while the lateral movement for all methods is much smoother. We further observe a similar theme as reported in the paper - namely that our method is the only one to be able to jointly optimize for performance and comfort, and that larger yield smoother driving. Still, we note that the number of jerk failures is higher than the number of acceleration failures - which leads to promising future experiments in the form of explicitly penalizing jerk instead, or in addition to, acceleration.

To complete this excursion on metrics, we briefly discuss rear collisions. Often, they can be attributed to mistakes of other traffic participants, or non-reactive simulation (consider choosing a slightly different velocity profile, resulting in a rear collision over time). Still, rear collisions can indicate severe misbehavior, such as not starting at green traffic lights, or sudden, unsafe braking maneuvers. See Figure 10 for an example. Due to this, we do include rear collisions in our aggregation metric I1K - however note that we report all metrics separately, as well, to allow a detailed, customized performance analysis.

Figure 10: Showing one example of a critical rear-collision: in this case, the planner controlling the SDV (BC-perturb without ego history) decides to abruptly stop after short turn, causing the trailing car to crash into it.
Configuration Collisions Imitation
Model SDV history Front Side Rear Off-road L2 Comfort I1K
BC 153 42 482 203 1,043 67 974 298 8.27 1.75 102K 1K 2,653 483
BC-perturb 22 4 57 8 414 142 27 5 3.06 0.06 204K 6K 512 127
BC-perturb 14 6 74 10 680 12 27 6 3.18 0.02 629K 23K 796 12
MS Prediction 22 3 55 3 125 12 60 13 2.07 0.14 598K 49K 265 17
Ours 17 7 51 5 102 12 40 6 1.83 0.04 638K 41K 210 9
Table 4: Repeating Table 1 of the paper, but with a threshold of 4m for off-road failures.
Configuration Collisions Imitation Comfort
Model SDV history Front Side Rear Off-road L2 Jerk Lat. Acc. I1K
BC 79 23 395 170 997 74 1618 459 1.57 0.27 958K 46K 71 23 3,091 601
BC-perturb 16 2 56 6 411 146 82 11 0.74 1,15 0.01 1,156K 672K 1,115 278 567 128
BC-perturb 14 4 73 7 678 11 77 6 0.77 0.01 1,862K 46 K 7,285 593 843 6
MS Prediction 18 6 55 4 125 14 141 31 0.46 0.02 1,600K 14K 211 21 341 39
Ours 15 7 46 5 101 13 97 6 0.42 0.00 1,750K 196K 507 321 260 9
Table 5: Repeating Table 1 of the paper, but listing more fine-grained comfort metrics, namely (longitudinal) jerk and lateral acceleration.

Appendix C: Policy architecture and state representation

In this section we disclose full details of the proposed network architecture, shown in Figure 11: Each high level object (such as an agent, lane, cross walk) is comprised of a certain number of points of feature dimension . All points are individually embedded into a 128-dimensional space. We then add a sinusoidal embedding to points of each object to introduce an understanding of order to the model, and feed this to our PointNet implementation. This consists of 3 PointNet layers, in the end producing a descriptor of size 128 for each object. We follow this up with one layer of scaled dot-product attention: for this, the feature descriptor corresponding to ego is used as key, while all feature descriptors are taken as keys / value. We add an additional type embedding to the keys, s.t. the model can attend the values using also the object types – inspired by [9]. Via a final MLP the output is projected to the desired shape, i.e. for a trajectory of length , in which each step is described via position and a yaw angle.

A full description of our model input state is included in Table 6

. We define the state as the whole set of static and dynamic elements the model receive as input. Each element is composed of a variable number of points, which can represent both time (e.g. for agents) and space (e.g. for lanes). The number of features per point depends on the element type. We pad all features to a fixed size

to ensure they can share the first fully connected layer. We include all elements up to the listed maximal number in a circular FOV of radius 35m around the SDV. Note that for performance and simplicity we only execute this query once, and then unroll within this world state.

Figure 11: Overview of our policy model. Each element in the state is independently forwarded to a set of PointNet layers. The resulting features go through a Multi-Head Attention layer which takes into account their relations to output the final action for the SDV. The bird’s-eye-view image on the left is only for illustrative purposes; we do not employ any rasterizations in our pipeline.
State element(s) Elements per state Points per element Point features description
SDV 1 4 SDV X, Y, yaw pose of the current time step and in recent past
Agents up to 30 4 other agents X, Y, yaw poses of the current time step and in recent past
Lanes mid up to 30 20 interpolated X,Y points of the mid lanes with optional traffic light signal
Lanes left up to 30 20 interpolated X,Y points of the left lanes boundaries
Lanes right up to 30 20 interpolated X,Y points of the right lanes boundaries
Crosswalks up to 20 up to 20 crosswalks polygons boundaries X,Y points
Table 6: Model input state description. The state is composed of multiple elements (e.g. agents and lanes) and each of them has multiple points according to its type. Each point is a multi-dimensional feature vector.

Appendix D: Differentiable Collision Loss

We use a similar differentiable collision loss as proposed in [39]: idea is approximating each vehicle via circles, and checking these for intersections. Assume loss calculation for timesteps to , we then define our collision loss as:


Here, describes a pair-wise collision term between our SDV and vehicle at timestep . Assume, and SDV (given by pose ) are represented via circles and , then is calculated as the maximum intersection of any two such circles:




in which denotes the distance between the respective circles’ centers and their radius. Thus, this term is 0 when the circles do not intersect, and otherwise grows linearly to a maximum value of 1.