Self-Driving Vehicles (SDVs) have been an active research area for decades and have made regular headlines since the DARPA Grand Challenges in 2005-2007. Many companies are attempting to develop the first level 4+ SDVs, some for more than a decade. We have seen small-scale SDV testing, but despite numerous unrealised predictions that ubiquitous SDVs are ‘only 5 years away’ [39, 37, 15], production-level deployment still seems a distant prospect. Given the limited rate of progress, some questions naturally arise: (a) why did the research community underestimate the problem’s difficulty? (b) are there fundamental limitations in today’s approaches to SDV development?
After the DARPA challenges, most of the industry decomposed the SDV technology stack into HD mapping, localisation, perception, prediction, and planning. Following breakthroughs enabled by ImageNet, the perception and prediction parts started to become primarily machine-learned. However, behaviour planning and simulation are still largely rule-based: performance improves by having humans write increasingly detailed rules that dictate how the SDV should drive. The prevailing belief has been that, given very accurate perception, rule-based approaches to planning may suffice for human-level performance. We refer to this approach as Autonomy 1.0.
Production-level performance requires significant scale to discover and appropriately handle the ‘long tail’ of rare events. We argue that Autonomy 1.0 will not achieve this due to three scalability bottlenecks: (i) rule-based planners and simulators do not effectively model the complexity and diversity of driving behaviours, need to be re-tuned for different geographical regions, and fundamentally have not benefited from the breakthroughs in deep learning (Fig. 2); (ii) due to the limited efficacy of rule-based simulators, evaluation is done mainly via road-testing, which lengthens the development cycle; and (iii) road-testing operations are expensive and scale poorly with SDV performance.
Our proposed solution to these scale bottlenecks is to turn the entire SDV stack into an ML system that can be trained and validated offline using a large dataset of diverse, real-world data of human driving. We call this Autonomy 2.0. Autonomy 2.0 is a data-first paradigm: ML turns all parts of the stack (including planning and simulation) into data problems, and performance improves with better data sets rather than by designing new driving rules (Fig. 1). This unlocks the scalability required for mastering the long tail of rare events and scaling to new geographies. All that is needed is to collect large enough datasets and retrain the system.
The key challenges to Autonomy 2.0 are (i) formulating the stack as an end-to-end differentiable network, (ii) validating it offline in a closed loop with a machine-learned simulator, and (iii) collecting the large amounts of human driving data required to train both.
In the next section, we explain and analyse the bottlenecks in Autonomy 1.0 as typically adopted in the industry, then explain the benefits of Autonomy 2.0, and end by highlighting the open questions that must be answered for SDVs to truly be 5 years away.
2 Autonomy 1.0
In this section we survey the state-of-the-art Autonomy 1.0 technology used in a typical self-driving program. The typical stack (Fig. 3) is composed of three components: perception, prediction and planning, which respectively answer the questions “what is around the car?”, “what is likely to happen next?” and “what should the car do?”. An essential part of the development cycle of the stack is testing, which answers “what is the performance of the system?”.
Perception. The typical solution consists of neural networks processing raw camera, LIDAR [33, 45] and radar data. Observations from different sensors are implicitly or explicitly fused [23, 10], producing the 3D positions and other attributes (e.g. size, orientation) of the traffic participants around the SDV (e.g. vehicles, pedestrians and cyclists). These systems are trained on hand-labelled sensor data. They also typically rely on HD maps [6, 22] containing information about the static environment and its traffic rules.
Prediction. Today, most rule-based methods that extrapolate the observed motion of traffic participants are being replaced with ML motion forecasting, where neural networks are trained to predict a future trajectory from a few seconds of observations [25, 36, 7, 31]. These predictions are used to estimate future space occupancy and the likelihood of collisions.
Planning. Despite a lot of progress in the field of reinforcement learning [29, 40], Autonomy 1.0 still relies heavily on rule-based trajectory optimisation techniques. These techniques combine global search and local optimisation to find a trajectory that minimises an expert-designed cost function [46, 20, 1, 30], which is then fed to a controller that converts it into speed and steering signals. The cost function consists of various comfort and safety terms, such as average acceleration, distance to other vehicles and adherence to traffic rules, hand-engineered by domain experts to achieve desirable behaviour.
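As a minimal sketch of what such a hand-engineered cost function can look like, the Python snippet below scores candidate trajectories with a weighted sum of comfort and safety terms and keeps the cheapest one. The term definitions, the weights and all names (`WEIGHTS`, `total_cost`, `select_trajectory`, `constant_accel_trajectory`) are illustrative assumptions, not any production planner, and scoring a fixed candidate set stands in for the global search and local optimisation used in practice.

```python
from typing import Dict, List, Tuple

# A trajectory is a list of (position_m, speed_mps) samples at a fixed time step.
Trajectory = List[Tuple[float, float]]

# Hypothetical hand-tuned weights; these are what an engineer would re-balance.
WEIGHTS: Dict[str, float] = {"comfort": 1.0, "clearance": 5.0, "speed_limit": 10.0}


def constant_accel_trajectory(v0: float, accel: float, steps: int, dt: float) -> Trajectory:
    """Generate a candidate trajectory under constant acceleration."""
    traj, position, speed = [], 0.0, v0
    for _ in range(steps):
        traj.append((position, speed))
        position += speed * dt
        speed = max(0.0, speed + accel * dt)
    return traj


def comfort_cost(traj: Trajectory, dt: float) -> float:
    """Mean absolute acceleration along the trajectory."""
    accels = [abs(b[1] - a[1]) / dt for a, b in zip(traj, traj[1:])]
    return sum(accels) / len(accels) if accels else 0.0


def clearance_cost(traj: Trajectory, lead_position_m: float, safe_gap_m: float = 10.0) -> float:
    """Largest shortfall below the safe gap to a stationary obstacle ahead."""
    worst_shortfall = max(safe_gap_m - (lead_position_m - p) for p, _ in traj)
    return max(0.0, worst_shortfall)


def speed_limit_cost(traj: Trajectory, limit_mps: float) -> float:
    """How far the peak speed exceeds the speed limit."""
    return max(0.0, max(v for _, v in traj) - limit_mps)


def total_cost(traj: Trajectory, dt: float, lead_position_m: float, limit_mps: float) -> float:
    """Weighted sum of the individual expert-designed terms."""
    terms = {
        "comfort": comfort_cost(traj, dt),
        "clearance": clearance_cost(traj, lead_position_m),
        "speed_limit": speed_limit_cost(traj, limit_mps),
    }
    return sum(WEIGHTS[name] * value for name, value in terms.items())


def select_trajectory(candidates: List[Trajectory], dt: float,
                      lead_position_m: float, limit_mps: float) -> Trajectory:
    """Pick the candidate with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(t, dt, lead_position_m, limit_mps))


if __name__ == "__main__":
    cruise = constant_accel_trajectory(10.0, 0.0, steps=5, dt=1.0)
    speeding = constant_accel_trajectory(10.0, 3.0, steps=5, dt=1.0)
    best = select_trajectory([speeding, cruise], dt=1.0, lead_position_m=100.0, limit_mps=15.0)
    print("selected cruise:", best is cruise)
```

In this style, covering a new scenario means adding another term to `total_cost` and re-balancing `WEIGHTS` by hand.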
Testing estimates “what is the performance of the system?” Typically, the system performance is road-tested by deploying it under the supervision of a safety driver, who is ready to disengage the autonomy system at any time, i.e. take control in the case of incorrect behaviour. Such disengagements are an important part of the development cycle (Fig. 1). They provide a valuable source of information on which part of the self-driving stack to improve. For example, errors caused by planning lead to modifying the expert cost function by adding new weights.
3 Scalability bottlenecks of Autonomy 1.0
The Autonomy 1.0 stack described in the previous section can perform well under regular conditions [4, 30]. Attaining L4-L5 production-level performance, however, requires scaling the paradigm to cover the ’long tail’ of rare events such as road closures, road accidents, other agents breaking the rules-of-the-road etc. At the same time, the solution needs to scale to multiple cities with diverse agent behaviours. In this section we explore the scalability bottlenecks that make it challenging for Autonomy 1.0 to handle this long tail of events.
3.1 Trying to capture complex and diverse behaviours with rule-based systems
Rule-based systems seek to optimise a cost function designed by domain experts, who write new terms and balance their weights. New scenarios require new terms and weights, so development requires large engineering teams and becomes a major bottleneck. Operating SDVs in a new location can mean person-years of re-tuning. Designing interfaces between the Autonomy 1.0 components also comes with a high engineering and performance cost: even though these interfaces provide a human-interpretable abstraction between different tasks, they limit the ability to express the nuance of the encountered driving situations.
3.2 Reliance on road-testing and low-realism offline simulation
On-road testing provides the highest possible realism, and safety driver disengagements can give actionable feedback to improve the system. However, the time between cost-engineering and road testing makes for a long development cycle: days or weeks. Moreover, it is non-reproducible: one cannot repeat scenarios exactly, preventing direct comparison between versions.
One strategy for offline evaluation is to replay sensor logs to evaluate the performance of a new system version. However, if the SDV actions during replay differ from those in the log, other traffic participants do not react accordingly. For example, an SDV driving slower than in the original log can cause a false-positive rear-end collision, because the logged vehicle behind it does not slow down. This hinders fair comparison between software versions.
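The false-positive collision can be reproduced in a few lines. The sketch below is a one-dimensional toy, assuming a single follower whose positions are replayed verbatim from a log and an SDV driving at constant speed; `first_collision_time` and the numbers used are hypothetical and chosen only for illustration.

```python
from typing import List, Optional


def first_collision_time(sdv_speed_mps: float, follower_log_m: List[float],
                         initial_gap_m: float, dt: float) -> Optional[float]:
    """Replay a logged follower behind an SDV driving at constant speed.

    The follower's positions come straight from the log, so it never reacts
    to what the SDV does during replay. Returns the time of the first
    rear-end contact, or None if the vehicles never touch.
    """
    sdv_position_m = follower_log_m[0] + initial_gap_m
    for step, follower_position_m in enumerate(follower_log_m):
        if follower_position_m >= sdv_position_m:
            return step * dt
        sdv_position_m += sdv_speed_mps * dt
    return None


if __name__ == "__main__":
    dt = 0.5
    # Logged follower drove at 10 m/s for 10 s behind an SDV that also drove at 10 m/s.
    follower_log = [10.0 * dt * k for k in range(21)]

    # Replaying the logged SDV speed reproduces the log: no collision.
    print(first_collision_time(10.0, follower_log, initial_gap_m=10.0, dt=dt))
    # A new, more cautious SDV version at 8 m/s gets "rear-ended" by the replayed log.
    print(first_collision_time(8.0, follower_log, initial_gap_m=10.0, dt=dt))
```

With these numbers the first call returns `None` and the second returns `5.0`: the slower SDV is flagged as colliding even though a real driver behind it would have braked.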
A degree of reactivity can be achieved by adding rule-based behaviours to the other traffic participants or by employing a synthetic simulator. However, rule-based simulation has the same limitations as rule-based planning: hand-encoding diverse and realistic behaviours for pedestrians, cyclists, and other vehicles scales poorly.
3.3 Limited fleet deployment scale
As requirements on system performance, scenario rarity, and the statistical significance of evaluation increase, so does the amount of required road-testing. Increasing fleet size can speed up the development cycle. Still, Autonomy 1.0 SDVs rely on expensive sensors and compute to satisfy the stringent requirements on perception accuracy imposed by rule-based planning. This makes SDVs expensive, and so the available budget puts an upper bound on the development speed and statistical performance guarantees.
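To make the link between statistical significance and road-testing volume concrete, the sketch below assumes failures occur as a Poisson process with a constant per-mile rate. This is a simplifying modelling assumption introduced here, not something stated in the text, and `failure_free_miles_needed` is a hypothetical helper name. Under that assumption the probability of seeing zero failures in m miles is exp(-rate · m), which gives the number of failure-free miles needed to bound the rate at a given confidence.

```python
import math


def failure_free_miles_needed(max_rate_per_mile: float, confidence: float = 0.95) -> float:
    """Miles to drive with zero failures to claim, at the given confidence,
    that the true failure rate is below max_rate_per_mile.

    Solves exp(-max_rate_per_mile * miles) = 1 - confidence for miles,
    assuming Poisson-distributed failures with a constant rate.
    """
    if max_rate_per_mile <= 0.0:
        raise ValueError("max_rate_per_mile must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return -math.log(1.0 - confidence) / max_rate_per_mile


if __name__ == "__main__":
    for rate in (1e-4, 1e-5, 1e-6):
        miles = failure_free_miles_needed(rate)
        print(f"rate below {rate:g} per mile: about {miles:,.0f} failure-free miles")
```

The required mileage grows inversely with the target failure rate, so each tenfold improvement in the performance to be demonstrated needs roughly ten times more road-testing.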
4 Autonomy 2.0
In this section, we propose Autonomy 2.0, an ML-first approach to self-driving focused on achieving high scalability. It is based on three key principles: (i) closed-loop simulation learned from collected real-world driving logs; (ii) formulating the SDV stack as an end-to-end differentiable neural network; and (iii) collecting the data needed to train the planner and simulator at large scale using commodity sensors.
4.1 Data-driven closed-loop reactive simulations
Most of the evaluation in Autonomy 2.0 is done offline in simulation. This is in contrast to the Autonomy 1.0 reliance on road-testing due to the limitations of rule-based simulation. This does not mean ceasing road-testing altogether, but road-testing plays a less prominent role in the development cycle and is mainly used to verify simulator performance. For simulation to be an effective replacement for road-testing during development, it needs three properties:
- an appropriate simulation state representation for the task;
- the ability to synthesise diverse and realistic driving scenarios with high fidelity and reactivity;
- when applied to new scenarios and geographies, performance increases with the amount of data.
Simulation outcomes must be very realistic, since any difference between simulation and reality results in inaccurate performance estimates. The simulation does not, however, need to be photo-realistic; it can instead focus only on the representation consumed by the planner. We reason that, in order to achieve a high level of realism, the simulation itself must be learned directly from the real world. Recent work showed how realistic and reactive simulations can be constructed from previously collected real-world logs using bird’s-eye-view representations. As shown in Fig. 4, this approach can then be deployed to turn any log into a reactive simulator for testing autonomous driving policies. The appealing property of such an approach is that simulation fidelity scales with data: by collecting more logs, the coverage of simulation grows proportionally, as each log can be turned into a simulation case. However, further work is needed to address behavioural multi-modality, since we cannot know the intentions of other traffic participants ahead of time and the same action from an SDV can lead to many different outcomes. There are initial works [34, 7, 42], but more work is required (Sec. 5).
4.2 A fully differentiable stack trained from human demonstrations
Autonomy 1.0 has rule-based components that are hand-engineered, as are the human-interpretable interfaces between perception, prediction, planning, and simulation (Sec. 2). Unlike Autonomy 1.0, the Autonomy 2.0 stack is fully trainable from human demonstrations, and so its complexity scales in proportion to the amount of training data. In order to train such a system, several conditions need to be met:
- each component, including planning, needs to be trainable and end-to-end differentiable;
- the stack needs to be trainable from human demonstrations;
- performance needs to scale with the amount of training data.
End-to-end differentiability is essential as it allows gradients to be back-propagated across the stack. Perception and prediction are already differentiable in Autonomy 1.0, but planning is not. Recent work has shown that a high-capacity driving policy can be represented by a neural network and trained from a collection of human demonstrations. The appealing property is that this avoids the need to hand-engineer a cost function, since the training signal is implicitly given by the demonstrations.
Training the system requires a proper formulation that reflects the sequential nature of the decision-making task. Imitation learning, while simple, suffers from covariate shift caused by the accumulation of errors. To avoid the problems of covariate shift and causal confusion, we argue that the system needs to be trained in a closed loop with the reactive data-driven simulation outlined in Sec. 4.1. We can generate new driving episodes under the current policy without further data collection and then update the policy based on the imitation loss.
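The closed-loop idea can be illustrated with a deliberately tiny example. The sketch below assumes a one-dimensional state (say, lateral offset from the lane centre), a linear policy `a = gain * x`, and a hand-written differentiable transition `x' = x + a` that stands in for the learned simulator. All names and dynamics are hypothetical and far simpler than a real stack. The policy is unrolled through the simulator, the rolled-out states are compared with a logged human trajectory, and the gradient of that imitation loss is propagated through the whole rollout.

```python
from typing import List, Tuple


def rollout(gain: float, x0: float, horizon: int) -> List[float]:
    """Unroll the policy a = gain * x in the toy simulator x' = x + a."""
    states = [x0]
    for _ in range(horizon):
        states.append(states[-1] + gain * states[-1])
    return states


def closed_loop_loss_and_grad(gain: float, logged_states: List[float]) -> Tuple[float, float]:
    """Imitation loss between the policy's own rollout and the logged states,
    with its gradient w.r.t. the policy parameter propagated through the rollout."""
    x, dx_dgain = logged_states[0], 0.0
    loss, grad = 0.0, 0.0
    for target in logged_states[1:]:
        # Differentiate x' = (1 + gain) * x, using the state before the update.
        dx_dgain = x + (1.0 + gain) * dx_dgain
        x = x + gain * x
        error = x - target
        loss += error ** 2
        grad += 2.0 * error * dx_dgain
    return loss, grad


def train(logged_states: List[float], gain: float = 0.0,
          lr: float = 0.01, steps: int = 500) -> float:
    """Plain gradient descent on the closed-loop imitation loss."""
    for _ in range(steps):
        _, grad = closed_loop_loss_and_grad(gain, logged_states)
        gain -= lr * grad
    return gain


if __name__ == "__main__":
    # "Human" demonstration: a driver who halves the lateral offset at every step.
    demonstration = rollout(gain=-0.5, x0=1.0, horizon=5)
    learned_gain = train(demonstration)
    print(f"learned gain: {learned_gain:.4f}")
```

Because the loss is evaluated on states that the policy itself reached, errors that accumulate along the rollout are penalised directly, which is the property that open-loop imitation on logged states lacks.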
Furthermore, inspired by prior work, we propose that the networks for each component (i.e. driving policy, perception, and simulation) can be coupled and trained together as outlined in Fig. 5. The policy generates the action that the SDV should take at a given state, and a (learnable) simulator (Sec. 4.1) transitions the state conditioned on the action and the current state. One way to enable co-training is for the entire stack to be fully differentiable. The ML planner is then plugged into a differentiable architecture and is trained jointly with perception. Importantly, this allows the required quality of perception output to be determined by what is most suitable for training driving policies, rather than by a more arbitrary perception requirement such as centimetre-level accurate 3D bounding boxes. Interpretability is nevertheless maintained, as the learned perception representation can be back-projected into an interpretable space for examination. This approach to formulating the autonomy problem has many similarities to MuZero.
4.3 Large-scale low-cost data collection
The systems discussed so far use human demonstrations as training data, i.e. sensor data with the corresponding trajectory chosen by a human driver serving as supervision. To unlock production-level performance, these data need to have:
- enough size and diversity to include the long tail of rare events (Sec. 3);
- sufficient sensor fidelity, i.e. the sensors used to collect the data need to be accurate enough to train the planner and simulator effectively; and
- cheap enough to collect at this scale and fidelity.
While we have recently seen the first public datasets with human demonstrations, these are limited to a few thousand miles of data. Observing the long tail will likely require collecting hundreds of millions of miles of data because most driving is uneventful, e.g. there are roughly 5 crashes per million miles driven in the US.
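A short back-of-the-envelope calculation shows where these orders of magnitude come from. The sketch below takes the rate of roughly 5 crashes per million miles quoted above and assumes events occur independently at a constant rate (a Poisson simplification). The 5,000-mile dataset size and the target of 1,000 examples are hypothetical values chosen for illustration.

```python
import math

# Rough US figure quoted in the text.
CRASHES_PER_MILLION_MILES = 5.0


def expected_events(miles: float, events_per_million_miles: float = CRASHES_PER_MILLION_MILES) -> float:
    """Expected number of events observed in a dataset covering `miles`."""
    return miles * events_per_million_miles / 1e6


def prob_at_least_one(miles: float, events_per_million_miles: float = CRASHES_PER_MILLION_MILES) -> float:
    """Probability of observing at least one event, under a Poisson assumption."""
    return 1.0 - math.exp(-expected_events(miles, events_per_million_miles))


def miles_for_examples(target_examples: float,
                       events_per_million_miles: float = CRASHES_PER_MILLION_MILES) -> float:
    """Miles needed so that the expected number of events equals target_examples."""
    return target_examples / events_per_million_miles * 1e6


if __name__ == "__main__":
    small_dataset_miles = 5_000  # hypothetical "few thousand miles" dataset
    print(f"expected crashes in {small_dataset_miles:,} miles: "
          f"{expected_events(small_dataset_miles):.3f}")
    print(f"chance of at least one crash: {prob_at_least_one(small_dataset_miles):.1%}")
    print(f"miles for 1,000 crash examples: {miles_for_examples(1_000):,.0f}")
```

At the quoted rate, a dataset of a few thousand miles is unlikely to contain even one crash, while collecting on the order of a thousand examples already requires about 200 million miles. Events rarer than crashes push the requirement higher still.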
Collecting these volumes constrains us to rely on crowd-sourcing, as using cars and drivers dedicated solely for data collection is not economically viable at this scale. It also constrains us to use only low-cost commodity sensors like cameras. These, however, produce lower perception accuracy than high fidelity sensors (like HD LIDAR) used in Autonomy 1.0 to power rule-based planning. Hence, in choosing sensors for data collection, we face a trade-off between the scalability (how much data we collect) and the fidelity (the resulting perception accuracy) we can expect from them.
| Sensor type | Perception accuracy on KITTI (car mAP, moderate) | Scalability |
|---|---|---|
| Monocular cameras | 22 | High |
| Stereo cameras | 52 | Medium |
| Stereo + sparse LIDAR | 64 | Medium |
| HD LIDAR | 75 | Low |
Which sensors should we use? Recent progress in perception algorithms has reduced the gap in perception accuracy between HD sensors and commodity sensors like cameras [27, 9] and sparse LIDAR on the KITTI benchmark (Table 1). Moreover, the actual perception accuracy requirements for training Autonomy 2.0 systems might be relaxed by training perception and planning jointly (Sec. 4.2). This is corroborated by recent results suggesting that training ML planners on large amounts of data with lower-accuracy perception might be preferable to training on smaller amounts with higher accuracy. As sensor prices stand today, this points to camera-only setups (optionally with sparse LIDAR) as the most promising way to strike a balance between cost and fidelity, but this remains an open problem.
5 Open questions
In this work, we outlined the Autonomy 2.0 paradigm, which is designed to solve self-driving using an ML-first approach. By removing the human-in-the-loop, this paradigm is significantly more scalable, and we argue that lack of scalability is the main limitation to achieving high SDV performance. While promising, this direction leaves many open questions that are yet to be answered, such as:
- What is the appropriate state representation for simulation and planning? How much should it be learned vs interpretable?
- What are the limits of what can be trained offline from human demonstrations, and what needs real-time reasoning using search?
- How much do we need to simulate? How should we measure the performance of the offline simulation itself?
- How much data do we need to train high-performing planning and simulation components? What sensors should we use for large-scale data collection?
Answering these questions is critical to self-driving and other real-world robotic problems and can stimulate the research community to unlock high-performing SDVs sooner rather than later.
We would like to thank the following contributors who work on Autonomy 2.0 at Lyft Level 5: Moritz Niendorf, Qiangui Huang, Oliver Scheel, Matt Vitelli, Błażej Osiński, Ana Ferreira, Luca Bergamini, Ray Gao, Yawei Ye, Sammy Sidhu, Lukas Platinsky, Christian Pereone, Jasper Friedrichs, Maciej Wołczyk, Long Chen, and Sacha Arnoud.
- [1] (2013) Intention-aware motion planning. In Algorithmic Foundations of Robotics X, E. Frazzoli, T. Lozano-Perez, N. Roy, and D. Rus (Eds.). Cited by: §2.
- [2] (2019) ChauffeurNet: learning to drive by imitating the best and synthesizing the worst. Cited by: §4.2.
- [3] (2021) SimNet: learning reactive self-driving simulations from real-world observations. Cited by: Figure 4, §4.1.
- [4] (2009) The DARPA urban challenge. Cited by: §3.
- [5] Bureau of transportation statistics motor vehicle safety data. Note: https://www.bts.gov/content/motor-vehicle-safety-data Cited by: §4.3.
- [6] (2019) NuScenes: a multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027. Cited by: §2.
- [7] Cited by: §2, §4.1.
- [8] (2021) What data do we need for training an AV motion planner?. Cited by: §4.3.
- [9] (2020-06) DSGN: deep stereo geometry network for 3D object detection. Cited by: §4.3, Table 1.
- [10] (2014) A multi-sensor fusion system for moving object detection and tracking in urban driving environments. In Int. Conf. on Robotics and Automation (ICRA). Cited by: §2.
- [11] (2019) Causal confusion in imitation learning. CoRR. Cited by: §4.2.
- [12] (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §1.
- [13] (2017) CARLA: an open urban driving simulator. In 1st Annual Conference on Robot Learning. Cited by: §3.2.
- [14] (2017) CARLA: an open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pp. 1–16. Cited by: §4.1.
- [15] (2016) Ford autonomy team predicts self driving vehicles are five years away. Note: https://www.zdnet.com/article/ford-self-driving-cars-are-five-years-away-from-changing-the-world/ Cited by: §1.
- [16] (2013) Vision meets robotics: the KITTI dataset. International Journal of Robotics Research (IJRR). Cited by: §4.3, Table 1.
- [17] (1998) Social force model for pedestrian dynamics. Physical Review E. Cited by: §2.
- [18] (2016) Generative adversarial imitation learning. CoRR. Cited by: §4.2.
- [19] (2020) One thousand and one hours: self-driving motion prediction dataset. Cited by: §4.3.
- [20] (2000) RRT-connect: an efficient approach to single-query path planning. In Int. Conf. on Robotics and Automation. Cited by: §2.
- [21] SSD-6D: making RGB-based 3D detection and 6D pose estimation great again. Cited by: §2.
- [22] (2019) Lyft level 5 perception dataset 2020. Note: https://level5.lyft.com/dataset/ Cited by: §2.
- [23] (2018) Sensors and sensor fusion in autonomous vehicles. In 2018 26th Telecommunications Forum (TELFOR). Cited by: §2.
- [24] (2019-06) PointPillars: fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: Table 1.
- [25] (2017) DESIRE: distant future prediction in dynamic scenes with interacting agents. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2165–2174. Cited by: §2.
- [26] Levels of autonomy. Note: https://www.sae.org/binaries/content/assets/cm/content/blog/sae-j3016-visual-chart_5.3.21.pdf Cited by: §1.
- [27] (2021-06) M3DSSD: monocular 3D single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6145–6154. Cited by: §4.3, Table 1.
- [28] (2019) Deep learning based 3D object detection for automotive radar and camera. In 2019 16th European Radar Conference (EuRAD). Cited by: §2.
- [29] (2013-12) Playing Atari with deep reinforcement learning. Cited by: §2.
- [30] (2009) Junior: the Stanford entry in the urban challenge. In The DARPA Urban Challenge: Autonomous Vehicles in City Traffic. Cited by: §2, §3.
- [31] Cited by: §2.
- [32] (2020) Learning to evaluate perception models using planner-centric metrics. Cited by: §4.3.
- [33] (2016) PointNet: deep learning on point sets for 3D classification and segmentation. Cited by: §2.
- [34] Cited by: §4.1.
- [35] A reduction of imitation learning and structured prediction to no-regret online learning. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635. Cited by: §4.2.
- [36] (2019) SoPhie: an attentive GAN for predicting paths compliant to social and physical constraints. Cited by: §2.
- [37] (2015) Sam Altman predicts self driving vehicles are four years away. Note: https://techcrunch.com/2015/10/06/elon-musk-sam-altman-say-self-driving-cars-are-going-to-be-on-the-road-in-just-a-few-years/ Cited by: §1.
- [38] (2020) Mastering Atari, Go, chess and shogi by planning with a learned model. Nature. Cited by: §4.2.
- [39] (2012) Sergey Brin predicts self driving vehicles are five years away. Note: https://www.youtube.com/watch?v=BEjxQd219is Cited by: §1.
- [40] (2016-01) Mastering the game of Go with deep neural networks and tree search. Nature 529, pp. 484–489. Cited by: §2.
- [41] (2018) A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science. Cited by: Figure 5, §4.2.
- [42] (2021) TrafficSim: learning to simulate realistic multi-agent behaviors. Cited by: §4.1.
- [43] (2005) Probabilistic robotics (intelligent robotics and autonomous agents). The MIT Press. Cited by: §1.
- [44] (2019-06) Pseudo-LiDAR from visual depth estimation: bridging the gap in 3D object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §4.3, Table 1.
- [45] (2018) VoxelNet: end-to-end learning for point cloud based 3D object detection. In Int. Conf. on Computer Vision and Pattern Recognition. Cited by: §2.
- [46] (2008) Navigating car-like robots in unstructured environments using an obstacle sensitive cost function. In Intelligent Vehicles Symposium. Cited by: §2.