The devkit of the nuScenes dataset.
Robust detection and tracking of objects is crucial for the deployment of autonomous vehicle technology. Image-based benchmark datasets have driven the development in computer vision tasks such as object detection, tracking and segmentation of agents in the environment. Most autonomous vehicles, however, carry a combination of cameras and range sensors such as lidar and radar. As machine learning based methods for detection and tracking become more prevalent, there is a need to train and evaluate such methods on datasets containing range sensor data along with images. In this work we present nuTonomy scenes (nuScenes), the first dataset to carry the full autonomous vehicle sensor suite: 6 cameras, 5 radars and 1 lidar, all with full 360 degree field of view. nuScenes comprises 1000 scenes, each 20s long and fully annotated with 3D bounding boxes for 23 classes and 8 attributes. It has 7x as many annotations and 100x as many images as the pioneering KITTI dataset. We also define a new metric for 3D detection which consolidates the multiple aspects of the detection task: classification, localization, size, orientation, velocity and attribute estimation. We provide careful dataset analysis as well as baseline performance for lidar and image based detection methods. Data, development kit, and more information are available at www.nuscenes.org.
Autonomous driving technology has the potential to radically change the cityscape and save many human lives. A crucial part of safe navigation is the detection and tracking of agents in the environment surrounding the vehicle. To achieve this, a modern self-driving vehicle deploys several sensors along with sophisticated detection and tracking algorithms. Such algorithms rely increasingly on machine learning, which drives the need for benchmark datasets for training and evaluation. While there is a plethora of image datasets for this purpose, there is a lack of large-scale multi-modal datasets that cover . We release the nuScenes dataset to address this gap.
Multimodal datasets are of particular importance as no single type of sensor is sufficient and the sensor types are complementary. Cameras allow accurate measurements of edges, color and lighting, enabling classification and localization on the image plane. However, 3D localization from images is challenging [7, 6, 39, 54, 46, 43, 50]. Lidar pointclouds, on the other hand, contain less semantic information but provide highly accurate localization in 3D. Furthermore, the reflectance of lidar is an important feature [29, 35]. However, lidar data is sparse and the range is typically limited to 50-100m. Radar sensors achieve a range of 200-300m and measure the object velocity through the Doppler effect. However, the returns are even sparser than lidar and less precise in terms of localization. While radar has been used for decades [1, 3], we are not aware of any autonomous driving datasets that provide radar data.
Since the three sensor types have different failure modes during difficult conditions, the joint treatment of sensor data is essential for agent detection and tracking. The literature even suggests that multimodal sensor configurations are not just complementary, but provide redundancy in the face of sabotage, failures, adverse conditions and blind spots. And while several works have proposed fusion methods based on cameras and lidar [34, 8, 42, 36, 55, 51], PointPillars has recently shown that a lidar-only method currently performs on par with, or even better than, fusion-based methods. This suggests that more work is required to combine the multimodal measurements in a principled manner.
In order to train such fusion-based methods, quality data annotations are required. Most datasets provide 2D semantic annotations as either boxes or masks (class or instance) [5, 12, 23, 57, 38]. Only a few datasets annotate objects using 3D boxes [22, 30, 41], and they do not provide the full sensor suite. To the best of our knowledge, none of these 3D datasets provide annotation of object attributes, such as pedestrian pose or vehicle state.
Existing AV datasets and vehicles are highly specialized for particular operational design domains. As suggested in , more research is required on how to generalize to “complex, cluttered and unseen environments”. Therefore, we need to study how fusion-based methods will generalize to different countries, different lighting (daytime vs nighttime), driving directions, signage, road markings, vegetation, precipitation and previously unseen object types.
Contextual knowledge using semantic maps is also an important prior for scene understanding [56, 2, 24]. For example, one would expect to find cars on the road, but not on the sidewalk or inside buildings. While semantic categories can be inferred at inference time, manual labeling of the semantic labels of static background is typically more accurate. With the notable exception of , the majority of the AV datasets do not provide semantic maps.
| Dataset | Year | # scenes | Size (hr) | # rgb imgs | # pc lidar | # pc radar | # ann. frames | # 3D boxes | Night | Rain | Snow | Locations |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cityscapes | 2016 | - | - | 25k | 0 | 0 | 25k | 0 | No | No | No | 50 cities |
| Vistas | 2017 | - | - | 25k | 0 | 0 | 25k | 0 | Yes | Yes | Yes | 6 continents |
| BDD100k | 2017 | 100k | 1k | 100M | 0 | 0 | 100k | 0 | Yes | Yes | Yes | NY, SF |
| M. obj. det. | 2017 | - | - | 7.5k | 0 | 0 | 7.5k | 0 | Yes | No | No | Tokyo |
| M. sem. seg. | 2017 | - | - | 1.6k | 0 | 0 | 1.6k | 0 | Yes | No | No | - |
| AS lidar | 2018 | - | 2 | 0 | 20k | 0 | 20k | 475k | - | - | - | - |
The last decade has seen the release of several driving datasets which have played a huge role in scene-understanding research for Autonomous Vehicles (AV). Most of these datasets have focused on 2D annotations (bounding boxes, segmentation polygons) for RGB camera images. The CamVid dataset released four HD videos with semantic segmentation annotations for 701 images. Cityscapes released stereo video sequences captured from 50 different cities with high quality pixel-level annotations for 5k images. Mapillary Vistas, BDD100k and Apolloscape released even larger datasets containing segmentation masks for 25k, 100k, and 144k images respectively. Vistas and BDD100k also contain images captured during different weather and illumination settings. Other datasets [13, 18, 53, 17, 59, 15, 40] focus exclusively on pedestrian annotations on images.
The ease of capturing and annotating RGB images has made the release of these large-scale image-only datasets possible. On the other hand, multimodal datasets, which typically comprise images, range sensor (lidar, radar) data and GPS/IMU data, are expensive to collect and annotate due to the difficulties of integrating, synchronizing, and calibrating multiple sensors. KITTI was the pioneering multimodal dataset, providing dense pointclouds from a lidar sensor as well as front-facing stereo images and GPS/IMU data. It provides 200k 3D boxes over 22 scenes, which helped advance the state of the art in 3D object detection. The recent H3D dataset includes 160 crowded scenes with a total of 1.1M 3D boxes annotated over 27k frames. The objects are annotated in the full view, as opposed to KITTI where an object is only annotated if it is present in the frontal view. H3D provides data from lidar, 3 cameras and GPS/IMU. However, the 3 cameras are all front facing, which means that the vision sensors only provide frontal coverage. Further, neither KITTI nor H3D provides any radar or nighttime data. The KAIST multispectral dataset is a multimodal dataset that consists of RGB/thermal camera, RGB stereo, 3D lidar and GPS/IMU. It provides nighttime data, but the size of the dataset is limited and annotations are in 2D. Other notable multimodal datasets provide driving behavior labels, place categorization labels, or raw data without semantic labels [4, 38].
An alternative to collecting real-world multimodal driving data is to generate synthetic data via simulation. CARLA, SYNTHIA, and Virtual KITTI simulate virtual cities using game engines. Playing for Benchmarks retrieves renderings and annotations from GTA without access to its source code. These have the advantage of simulating arbitrary situations and avoiding the cost of human annotation. However, for the foreseeable future the generated images are not photo-realistic and therefore cannot replace real-world datasets.
Given the complexities of the multimodal 3D detection challenge and the limitations of current AV datasets, a large-scale multimodal dataset with coverage across all vision and range sensors, collected from diverse situations alongside map information, would boost AV scene-understanding research further. nuScenes does just that, and it is the main contribution of this work.
nuScenes represents a large leap forward in terms of data volumes and complexities (Table 1), and is the first dataset to provide sensor coverage from the entire sensor suite. It is also the first AV dataset to include radar data and the first captured using an AV approved for public roads. It is further the first multimodal dataset that contains data from nighttime and rainy conditions, and that carries annotation of object attributes in addition to class and location. nuScenes thus enables research on 3D object detection, 3D tracking, behavior modeling and prediction and trajectory estimation.
Our second contribution is a new detection metric that summarizes all aspects of 3D object detection into a single metric. We also train 3D object detectors as a baseline reference and to help guide future avenues of research. These baselines include a novel approach of using multiple lidar sweeps to enhance object detection. All data, code, and information are made available at www.nuscenes.org.
Here we describe how we plan drives, set up our vehicles, select interesting scenes, annotate the dataset and protect the privacy of third parties.
We drive in Boston (Seaport and South Boston) and Singapore (One North, Holland Village and Queenstown), two cities that are known for their dense traffic and highly challenging driving situations. We emphasize the diversity across locations in terms of vegetation, buildings, vehicles, road markings and right versus left-hand traffic. From a large body of training data we manually select 84 logs with 15h of driving data (242km travelled at an average of 16km/h). Driving routes are carefully chosen to capture a diverse set of locations (urban, residential, nature and industrial), times (day and night) and weather conditions (sun, rain and clouds).
| 6x Camera | RGB, capture frequency, CMOS sensor, resolution, auto exposure, JPEG compressed |
| 1x Lidar | Spinning, beams, capture frequency, horizontal FOV, to vertical FOV, range, accuracy, up to points per second |
| 5x Radar | range, , FMCW, capture frequency, vel. accuracy |
Front and side cameras have a FOV and are offset by . The rear camera has a FOV of .
To achieve a high quality multi-sensor dataset, careful calibration of sensor intrinsic and extrinsic parameters is required. We express the extrinsic coordinates of each sensor relative to the ego frame, i.e. the midpoint of the rear vehicle axle, using tools like laser liner and calibration target boards.
In order to achieve good cross-modality data alignment between the lidar and the cameras, the exposure of a camera is triggered when the top lidar sweeps across the center of the camera's FOV. The timestamp of the image is the exposure trigger time, and the timestamp of the lidar scan is the time when the full rotation of the current lidar frame is achieved. Given that the camera's exposure time is nearly instantaneous, this method generally yields good data alignment. (Footnote 1: The cameras run at a lower capture frequency than the lidar. The camera exposures are spread as evenly as possible across the lidar scans, so not all lidar scans have a corresponding camera frame.)
Most existing datasets provide the vehicle location based on GPS+IMU [22, 30]. As we operate in dense urban areas, we find that GPS signals are not always reliable. To accurately localize our vehicle, we create a detailed prior map of lidar points in an offline step. On the car we use a Monte Carlo Localization scheme from lidar and odometry information . This method is very robust and we achieve localization errors of . We also provide highly accurate semantic maps of the relevant areas with a resolution of . These human-annotated maps provide information on roads and sidewalks. We encourage the use of localization and semantic maps as strong priors for object detection, tracking and other tasks (e.g. pedestrians are typically found on sidewalks or crosswalks).
After collecting the raw sensor data, we manually select interesting scenes of 20s duration each. Interesting scenes include scenes with high traffic density (e.g. intersections, construction sites), rare classes (e.g. ambulances, animals), potentially dangerous traffic situations (e.g. jaywalkers, incorrect behavior), maneuvers (e.g. lane change, turning, stopping) and situations that may be difficult for an AV. We also select some scenes to encourage diversity in terms of spatial coverage, different scene types, as well as different weather and lighting conditions. Expert annotators write textual descriptions or captions for each scene (e.g.: “Wait at intersection, peds on sidewalk, bicycle crossing, jaywalker, turn right, parked cars, rain”).
Having selected the scenes, we sample keyframes (image, lidar, radar) at . We annotate each of the 23 object classes in every keyframe in the form of cuboids modeled as x, y, z, width, length, height and yaw angle. We annotate objects continuously throughout each 20s scene if they are covered by at least one lidar or radar point. This provides temporal context so we can exploit multiple lidar sweep configuration in pointclouds and velocity/trajectory estimation. Using expert annotators and multiple validation steps, we achieve highly accurate annotations. All objects in the nuScenes dataset come with a semantic category, a 3D bounding box, and attributes (visibility, activity and pose) for each frame they occur in.
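To make the cuboid parameterization concrete, the following is a minimal sketch that expands (x, y, z, width, length, height, yaw) into the eight box corners. The axis convention (length along the box's local x axis, width along its local y axis, yaw as a rotation about the vertical axis) and the function name `cuboid_corners` are assumptions for illustration, not taken from the nuScenes devkit.

```python
import numpy as np

def cuboid_corners(x, y, z, width, length, height, yaw):
    """Return the 8 corners (8x3 array) of a yaw-rotated cuboid centred at (x, y, z).

    Assumed convention: length runs along the box's local x axis, width along
    its local y axis, and yaw is a rotation about the vertical (z) axis.
    """
    # Corner offsets in the box's own frame.
    dx = length / 2.0 * np.array([1, 1, 1, 1, -1, -1, -1, -1])
    dy = width / 2.0 * np.array([1, -1, -1, 1, 1, -1, -1, 1])
    dz = height / 2.0 * np.array([1, 1, -1, -1, 1, 1, -1, -1])
    local = np.stack([dx, dy, dz], axis=1)

    # Rotate about z by yaw, then translate to the box centre.
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return local @ rot.T + np.array([x, y, z], dtype=float)
```

For example, `cuboid_corners(0, 0, 0, 2.0, 4.0, 1.5, 0.0)` describes a car-sized box aligned with the x axis; with a yaw of a quarter turn the long side lies along y instead.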
It is our priority to protect the privacy of third parties. As manual labeling of faces and license plates is prohibitively expensive for 1.4M images, we use state-of-the-art object detection techniques. Specifically for plate detection, we use Faster R-CNN with a ResNet-101 backbone trained on Cityscapes (https://github.com/bourdakos1/Custom-Object-Detection). For face detection, we use MTCNN (https://github.com/TropComplique/mtcnn-pytorch). We set the classification threshold to achieve an extremely high recall. To increase the precision, we remove predictions that do not overlap with the reprojections of the known pedestrian and vehicle boxes in the image. Finally, we use the predicted boxes to blur faces and license plates in the images.
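The precision filter described here can be sketched as a simple 2D overlap test. This is an illustrative sketch only: it assumes axis-aligned image boxes given as (x1, y1, x2, y2) in pixels, and the helper names `boxes_overlap` and `keep_supported` are hypothetical.

```python
def boxes_overlap(a, b):
    """True if two axis-aligned boxes (x1, y1, x2, y2) share any area."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2

def keep_supported(predictions, reprojected_boxes):
    """Keep face/plate predictions that overlap at least one reprojected
    pedestrian or vehicle box; all other predictions are discarded."""
    return [
        pred for pred in predictions
        if any(boxes_overlap(pred, ref) for ref in reprojected_boxes)
    ]
```

A high-recall detector produces many false positives on background texture; requiring support from a known 3D annotation removes those without lowering recall on annotated pedestrians and vehicles.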
Contrary to most existing datasets [22, 41, 30], we store the annotations and metadata (e.g. localization, timestamps, calibration data) in a relational database, which avoids redundancy and allows for efficient access. The nuScenes devkit, taxonomy and annotator instructions are available at https://github.com/nutonomy/nuscenes-devkit.
We analyze statistics of the annotations in nuScenes. Our dataset has 23 categories including different vehicles, types of pedestrians, mobility devices and other objects as seen in Figure 4. Statistics on geometry and frequencies of different classes are shown in Figure 5. Per keyframe there are 7 pedestrians and 20 vehicles on average. Moreover, 40k keyframes were taken from four different scene locations (Boston: 55%, SG-OneNorth: 21.5%, SG-Queenstown: 13.5%, SG-HollandVillage: 10%) with various weather and lighting conditions (rain: 19.4%, night: 11.6%). Figure 7 shows the map locations with spatial coverage across all scenes where the most coverage comes from intersections.
Figure 6 shows that car annotations are seen at varying distances, as far as 80m from the ego-vehicle. Box orientation also varies, with most cars oriented at vertical and horizontal angles, as expected due to parked cars and cars in the same lane.
Lidar and radar points statistics inside each box annotation are shown in Figure 8. Our annotations have up to 100 lidar points even at a radial distance of 80m and at most 12k lidar points at 3m. At the same time they contain up to 40 radar returns at 10m and 10 at 50m. The radar range far exceeds the lidar range at up to 200m.
To analyze the quality of our localization data, we compute the merged point cloud of an entire scene by registering approximately 800 point clouds in global coordinates. We remove points corresponding to the ego vehicle and assign to each point the mean color value of the closest camera pixel that the point is reprojected to. The resulting scene reconstruction can be seen in Figure 9, which demonstrates accurate synchronization and localization.
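The color assignment step can be illustrated with a pinhole projection into a single camera. This is a simplified sketch under stated assumptions: points are already expressed in the camera frame (z pointing forward), `intrinsic` is a 3x3 pinhole matrix, and the nearest pixel is used directly. The function name `colorize_points` is hypothetical and the averaging across cameras mentioned above is omitted.

```python
import numpy as np

def colorize_points(points_cam, intrinsic, image):
    """Assign each 3D point the color of the pixel it projects to.

    points_cam: (N, 3) points in the camera frame, z forward (assumption).
    intrinsic:  (3, 3) pinhole camera matrix.
    image:      (H, W, 3) color image.
    Returns (colors, mask); mask marks the points that landed inside the image.
    """
    points_cam = np.asarray(points_cam, dtype=float).reshape(-1, 3)
    height, width = image.shape[:2]
    colors = np.zeros((len(points_cam), 3), dtype=image.dtype)
    mask = np.zeros(len(points_cam), dtype=bool)

    # Only points in front of the camera can be seen.
    in_front = points_cam[:, 2] > 0.0
    proj = (intrinsic @ points_cam[in_front].T).T
    u = np.rint(proj[:, 0] / proj[:, 2]).astype(int)  # column index
    v = np.rint(proj[:, 1] / proj[:, 2]).astype(int)  # row index

    # Discard projections that fall outside the image bounds.
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    idx = np.flatnonzero(in_front)[inside]
    colors[idx] = image[v[inside], u[inside]]
    mask[idx] = True
    return colors, mask
```

If localization or synchronization were poor, the same static surface would receive inconsistent positions and colors across sweeps, so a sharp colored reconstruction is indirect evidence that both are accurate.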
This section outlines the metrics for the nuScenes detection task. In the future we may add other tasks and metrics.
The nuScenes detection task requires detecting 10 object classes with full 3D bounding boxes, attributes, and velocities. The 10 classes are a subset of all 23 classes annotated in nuScenes (details in the devkit).
We design metrics for each of these aspects and a schema for consolidation into a scalar score indicating method performance, the nuScenes detection score (NDS):
Here mAP is mean Average Precision (2), and TP is the set of the five mean True Positive metrics (3). Half of NDS is thus based on the detection performance while the other half quantifies the quality of the detections in terms of box location, size, orientation, attributes, and velocity.
Since the nuScenes dataset contains continuous 20s scenes, one could use data from the beginning of the scene up to time t to determine the object locations at time t. Indeed, this is what a production system would, and should, do. However, for the purpose of this benchmark, and in order to separate out the performance of a tracker from that of a detector, we define the detection task to only operate on sensor data between [t − 0.5, t] seconds. Subsequent tracking tasks will allow data between [0, t].
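Written out, a form of the score that is consistent with this description would be the following. It is a reconstruction from the surrounding text rather than a verbatim copy of equation (1): it assumes that mAP carries half the weight, that each of the five mean TP errors contributes equally to the other half, and that each error is clipped to [0, 1] before being converted into a score.

```latex
\mathrm{NDS} = \frac{1}{10}\left[\, 5\,\mathrm{mAP}
  + \sum_{\mathrm{mTP} \in \mathbb{TP}} \bigl(1 - \min(1,\ \mathrm{mTP})\bigr) \right]
```

Under these assumptions a perfect detector (mAP = 1, all errors 0) scores 1, and a detector with no true positives scores 0.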
We use the Average Precision (AP) metric [22, 19], but define a match by thresholding the 2D center distance on the ground plane instead of intersection over union. This is done in order to decouple detection from object size and orientation but also because small objects, like pedestrians, have such small footprints that a small translation error results in a zero intersection over union, which makes it hard to compare the performance of vision-only methods which tend to have large location errors .
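Center-distance matching can be sketched as a greedy assignment in order of decreasing confidence. The greedy strategy, the function name `match_by_center_distance` and its argument layout are assumptions for illustration; the text above only fixes the matching criterion (2D center distance on the ground plane below a threshold).

```python
import numpy as np

def match_by_center_distance(pred_xy, pred_scores, gt_xy, dist_thresh):
    """Greedily match predictions to ground truth by 2D center distance.

    Returns (order, is_tp): prediction indices sorted by descending score and,
    for each of them, whether it was matched to an unused ground-truth box
    closer than dist_thresh (in meters).
    """
    pred_xy = np.asarray(pred_xy, dtype=float).reshape(-1, 2)
    gt_xy = np.asarray(gt_xy, dtype=float).reshape(-1, 2)
    order = np.argsort(-np.asarray(pred_scores, dtype=float))

    taken = set()  # ground-truth boxes that already have a match
    is_tp = []
    for i in order:
        best_j, best_d = None, np.inf
        for j in range(len(gt_xy)):
            if j in taken:
                continue
            d = float(np.linalg.norm(pred_xy[i] - gt_xy[j]))
            if d < best_d:
                best_j, best_d = j, d
        if best_j is not None and best_d < dist_thresh:
            taken.add(best_j)
            is_tp.append(True)
        else:
            is_tp.append(False)
    return order.tolist(), is_tp
```

Because only the centers are compared, a pedestrian prediction that is offset by a few tens of centimeters still counts as a match, whereas its intersection over union with the ground-truth footprint could already be zero.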
We then calculate AP as the normalized area under the precision-recall curve for recall and precision over 10%. Operating points where recall or precision is less than 10% are removed for two reasons. First, the measurement can be noisy in these regions, in particular for low recalls. Second, such extreme operating points would be highly unsuitable for deployment on public roads. If no operating point in this region is achieved, the AP for that class is set to zero. We finally average over the matching thresholds (in meters) and the set of classes to obtain mAP.
In addition to AP, we measure a set of true positive metrics (TP metrics) for each prediction that was matched with a ground truth box. All TP metrics are calculated using a fixed center distance threshold (in meters) during matching, and they are all designed to be positive scalars. Matching and scoring happen independently per class, and each metric is the average of the cumulative mean at each achieved recall level above 10%. If 10% recall is not achieved for a particular class, all TP errors for that class are set to 1. The following TP errors are defined:
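The AP computation can be sketched as follows. This is a minimal illustration, assuming a 101-point recall grid and linear interpolation of precision; the function names and the nested-dictionary layout of `ap_table` are hypothetical, and the thresholds used in any example call are placeholders rather than the benchmark's values.

```python
import numpy as np

def average_precision(is_tp, num_gt, min_recall=0.1, min_precision=0.1):
    """Normalized area under the precision-recall curve above the cut-offs.

    is_tp:  true/false-positive flags, sorted by descending confidence.
    num_gt: number of ground-truth boxes for this class and threshold.
    """
    if num_gt == 0 or len(is_tp) == 0:
        return 0.0
    flags = np.asarray(is_tp, dtype=float)
    tp = np.cumsum(flags)
    fp = np.cumsum(1.0 - flags)
    recall = tp / num_gt
    precision = tp / (tp + fp)

    # Resample precision on a fixed recall grid (assumed resolution: 1%).
    grid = np.linspace(0.0, 1.0, 101)
    prec = np.interp(grid, recall, precision, right=0.0)

    # Drop low-recall points, subtract the precision floor, and renormalize
    # so that a perfect detector scores 1.
    prec = prec[grid > min_recall]
    prec = np.clip(prec - min_precision, 0.0, None)
    return float(np.mean(prec) / (1.0 - min_precision))

def mean_average_precision(ap_table):
    """Average AP over all classes and matching thresholds.

    ap_table maps class name -> {distance threshold -> AP}.
    """
    values = [ap for per_class in ap_table.values() for ap in per_class.values()]
    return float(np.mean(values)) if values else 0.0
```

A detector that never reaches the recall cut-off gets an AP of exactly zero in this sketch, because every grid point beyond its maximum recall is assigned zero precision.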
Average Translation Error (ATE) is the Euclidean center distance in 2D (in meters). Average Scale Error (ASE) is based on the 3D IOU after aligning orientation and translation (1 − IOU). Average Orientation Error (AOE) is the smallest yaw angle difference between prediction and ground truth. All angles are measured on a full 360° period except for barriers, where they are measured on a 180° period. Average Velocity Error (AVE) is the absolute velocity error as the L2 norm of the velocity differences in 2D (in m/s). Average Attribute Error (AAE) is defined as 1 minus attribute classification accuracy (1 − acc). Finally, the mTP for each metric is the mean of that metric over all classes.
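Two of these errors benefit from a worked sketch: the scale error after alignment and the periodic yaw difference. The functions below are illustrative assumptions (names and signatures are not from the devkit); the scale error is written as one minus IOU so that lower is better, and the yaw period is a parameter so that a shorter period can be used for symmetric objects such as barriers.

```python
import numpy as np

def scale_error(size_pred, size_gt):
    """1 - IOU of two boxes that share the same centre and orientation.

    Sizes are (width, length, height). With centres and yaw aligned, the
    intersection is simply the box of per-axis minimum extents.
    """
    size_pred = np.asarray(size_pred, dtype=float)
    size_gt = np.asarray(size_gt, dtype=float)
    intersection = np.prod(np.minimum(size_pred, size_gt))
    union = np.prod(size_pred) + np.prod(size_gt) - intersection
    return float(1.0 - intersection / union)

def yaw_error(yaw_pred, yaw_gt, period=2.0 * np.pi):
    """Smallest absolute yaw difference (radians) on the given period."""
    diff = (yaw_pred - yaw_gt + period / 2.0) % period - period / 2.0
    return float(abs(diff))

def velocity_error(vel_pred, vel_gt):
    """L2 norm of the 2D velocity difference."""
    return float(np.linalg.norm(np.asarray(vel_pred[:2], dtype=float)
                                - np.asarray(vel_gt[:2], dtype=float)))
```

For instance, yaws of 3.1 and -3.1 radians differ by only about 0.083 radians once the wrap-around is taken into account, not by 6.2.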
Here we omit a few measurements that are not well defined for each class: AVE for cones and barriers since they are stationary; AOE of cones since they do not have a well defined orientation; and AAE for cones and barriers since there are no attributes defined on these classes (Figure 4).
Note that since mAVE, mAOE and mATE can be larger than 1, we bound each metric between 0 and 1 when calculating NDS (1).
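Putting the pieces together, the consolidation into a single score can be sketched in a few lines. This is a sketch under assumptions drawn from the description of NDS in this section: mAP receives half the weight, the five mean TP errors share the other half equally, and each error is clipped to [0, 1] before being turned into a score. The function name `nds` and the dictionary keys are illustrative.

```python
def nds(mean_ap, mean_tp_errors):
    """Consolidate mAP and the five mean TP errors into one scalar.

    mean_ap:        mean Average Precision in [0, 1].
    mean_tp_errors: dict with the five mean TP errors, e.g. keys
                    "ATE", "ASE", "AOE", "AVE", "AAE" (lower is better).
    """
    if len(mean_tp_errors) != 5:
        raise ValueError("expected exactly five mean TP errors")
    # Clip each error to [0, 1] (assumed bounds) and convert it to a score.
    tp_scores = [1.0 - min(1.0, max(0.0, err)) for err in mean_tp_errors.values()]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0
```

Clipping matters because unbounded errors such as velocity could otherwise push a term below zero and dominate the score.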
In this section we present object detection experiments on the nuScenes dataset to serve as reference baselines and suggest avenues for future research.
To demonstrate the performance of a leading algorithm on nuScenes, we train a lidar only 3D object detector, PointPillars . We take advantage of temporal data available in nuScenes by accumulating lidar sweeps for a richer pointcloud input to the point pillar encoder. A single PointPillars network was trained to predict 3D boxes for all classes. The previously published PointPillars network was modified to also learn velocities as an additional regression target for each 3D box. In these experiments we set the box attributes to the most common attribute for each class in the train data, and future work will explore jointly learning attributes with the other outputs.
We investigate PointPillars performance by varying two important hyperparameters: the number of lidar sweeps and the type of pre-training.
According to our evaluation protocol (Section 4), one is only allowed to use 0.5s of previous data to make a detection decision. This corresponds to 10 previous lidar sweeps since the lidar is sampled at 20 Hz. Accumulation is implemented by moving all point clouds to the coordinate system of the keyframe and appending a scalar timestamp to each point indicating the time delta in seconds from the keyframe. The PointPillars encoder includes the time delta as an extra decoration for the lidar points. Aside from the advantage of a richer point cloud input for detection, this also provides inherent temporal information in a single input, which helps the network in localization and enables velocity prediction. We experiment with three different numbers of lidar sweeps. (Footnote 5: In practice, the average number of accumulated sweeps is lower than the nominal number: (1) the first keyframe of each scene has no previous sweeps and (2) limiting sweeps to the past 0.5s may discard the 10th sweep.)
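The accumulation step can be sketched with homogeneous transforms. This is a minimal sketch, not the baseline's actual implementation: it assumes each sweep comes with a 4x4 pose that maps its points into a shared global frame, and that the time delta is stored as keyframe time minus sweep time (so older sweeps get larger positive values). The function and argument names are hypothetical.

```python
import numpy as np

def accumulate_sweeps(sweeps, keyframe_from_global, keyframe_time):
    """Merge several lidar sweeps into the keyframe's coordinate system.

    sweeps: iterable of (points, global_from_sweep, timestamp) where points
            is (N, 3), global_from_sweep is a 4x4 pose and timestamp is in s.
    keyframe_from_global: 4x4 transform from the global to the keyframe frame.
    Returns an (M, 4) array of x, y, z and time delta in seconds.
    """
    merged = []
    for points, global_from_sweep, timestamp in sweeps:
        points = np.asarray(points, dtype=float).reshape(-1, 3)
        homogeneous = np.hstack([points, np.ones((len(points), 1))])
        # Chain sweep -> global -> keyframe into a single transform.
        keyframe_from_sweep = keyframe_from_global @ global_from_sweep
        moved = (keyframe_from_sweep @ homogeneous.T).T[:, :3]
        # Decorate every point with its age relative to the keyframe.
        delta = np.full((len(points), 1), keyframe_time - timestamp)
        merged.append(np.hstack([moved, delta]))
    return np.vstack(merged) if merged else np.zeros((0, 4))
```

Static structure lines up across sweeps after this transform, while moving objects leave a short trail whose spacing, together with the time delta, carries the velocity information the network can learn from.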
We examine whether features obtained from other domains or datasets generalize to nuScenes. No pretraining means weights are initialized randomly using a uniform distribution. ImageNet pretraining uses a backbone that was first trained to accurately classify images. KITTI pretraining uses a backbone that was trained on the lidar pointclouds to predict 3D boxes.
As shown in Table 3, increasing the number of lidar sweeps leads to better detection performance although the performance saturates with increasing number of sweeps. The increased point density provided by extra sweeps leads to higher mean average precision of %, %, and % for , , and sweeps respectively. Additionally, the temporal information provides context for learning velocities with an AVE of m/s, m/s, and m/s respectively.
Interestingly, while the KITTI pretrained network did converge faster, the final performance of the network only marginally varied between different pretrainings. Since the ImageNet pretraining was the best, we will use that to examine performance in more detail and all analysis refers to this network. The per class performance on the nuScenes detection metrics is shown in Table 4 and Figure 10. The network performed best overall on cars and pedestrians, which are the two most common categories. The worst performing categories were bicycles and construction vehicles, two of the rarest categories that also present additional challenges. However, the network achieved a bicycle mAP of approximately when filtering for only predictions and ground truth on the semantic map. Bicycles that are not on the semantic map are especially difficult to detect because they are usually parked and often occluded or only visible from the front or back. Construction vehicles pose a unique challenge due to their high variation in size and shape. While the translational error is similar for cars and pedestrians, the orientation error for pedestrians (21 deg) is higher than that of cars (11 deg). This smaller orientation error for cars is expected since cars have a greater distinction between their front and side profile relative to pedestrians. The vehicle velocity estimates are promising (e.g. m/s AVE for the car class) considering the typical speed of a vehicle in the city would be to m/s.
To examine image-only 3D object detection, we adapt and train a leading algorithm, Orthographic Feature Transform (OFT), on nuScenes. A single OFT network was used for all classes. We modified the original OFT implementation to use an SSD detection head and confirmed that this architecture matched published results on KITTI. The network takes in a single camera image, and the full predictions were obtained by using non-maximum suppression (NMS) to combine the independent predictions from all 6 cameras. In this experiment we set the box velocity to zero and attributes to the most common attribute for each class in the train data.
As shown in Figure 11, the OFT baseline achieved promising performance on the car category, and future work will be required to adapt OFT to the complexities of nuScenes to achieve higher performance on all categories. Comparing OFT and PointPillars in Figure 11 shows that PointPillars achieved a significantly higher average precision and max recall. However, OFT and PointPillars achieved a similar scale error over all recalls, demonstrating that object scale is equally well inferred from images or lidar. As expected, PointPillars has lower localization error than OFT since lidar points provide range information while OFT has to learn to associate range information with image-only features. When averaged over all recalls, PointPillars and OFT had similar orientation error, but as shown in Figure 11, PointPillars achieved lower orientation errors when compared over the same recall. This shows that it is important either to compare the true positives over the same recall or to consider true positive metrics and average precision in one metric as in NDS (1).
The two baselines demonstrate that while lidar-only and image-only detectors are both able to achieve promising detection results on cars, lidar-only networks currently provide superior performance. Each sensor modality provides complementary features for training 3D object detection, and we encourage research on a fusion network that uses all sensor data (image, lidar, radar) and exploits prior information from semantic maps to achieve the best performance.
In this paper we present the nuScenes dataset, metrics, and baseline results. This is the only dataset collected from an autonomous vehicle on public roads and the only dataset to contain the full sensor suite (lidar, images, and radar). nuScenes has the largest collection of 3D box annotations of any public dataset. To spur research on 3D object detection for autonomous vehicles, we introduce a new detection metric that balances all aspects of detection performance. We demonstrate novel adaptations of leading lidar only and image only 3D object detectors on nuScenes. We hope this dataset will help accelerate research and development of autonomous vehicle technology.
The nuScenes dataset was annotated by Scale.ai and we thank Alexandr Wang and Dave Morse for their support. We also thank Sun Li and Karen Ngo at nuTonomy for data inspection and quality control, and Bassam Helou for OFT baseline results. We thank Thomas Roddick for useful discussions about OFT.
ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
3D bounding box estimation using deep learning and geometry. In CVPR, 2017.