Automated vehicles (AVs) have the potential to radically impact our society[societal_impact] by improving safety and reducing congestion and energy consumption. Reliable AV operations require reliable sensing and perception of the surrounding environment, e.g., to understand the presence and future motions of road users and the governing traffic rules. Robust perception is the basis of safe and proper trajectory planning and control. To achieve reliable perception, deep neural networks are frequently used, which require large sets of data. In recent years, many open datasets have been created and shared, first from universities and more recently from companies. Not all datasets include all three of the common AV sensor types, and the tags/labels vary considerably among those datasets.
All three types of commonly used AV sensors (cameras, lidars, and radars) have strengths and weaknesses. In addition, even within the lidar family, the mechanical scanning Velodyne lidar we used has 32 beams and covers a much wider horizontal and vertical field of view, while the Ibeo lidar only has 4 beams and a limited field of view. Because both lidar sensors are widely used in automotive applications but commonly for different purposes (e.g., Level 4 vs. Level 2), comparing and contrasting their performance is of interest [dingzhao_multiple_lidar].
In the past, vehicle controls were largely designed based on model-based algorithms through mathematically rigorous processes [Peng_preview_control, dong_control_1, dong_control_2]. Recent advances, however, have indicated the potential of data-driven approaches [dingzhao_driving_behavior, xianan_paper], which require a large amount of training (and validation) data. Quite a few open datasets have been published, which have helped to elevate the state of the art of data-driven approaches tremendously. Nevertheless, many of them seem to capture only naturalistic driving, i.e., they do not deliberately focus on challenging scenarios. Based on our previous work on accelerated evaluation, we believe challenging driving behaviors should be emphasized more; collecting them naturalistically is time-consuming and costly. A deliberate, choreography-designed set of scenarios conducted inside a safe and closed test facility can provide a different and useful set of data that is complementary to naturalistic data.
The overall guiding principle of our data collection effort is completeness, including deploying a wide set of sensors; covering a wide array of weather and lighting conditions, diverse lane markings on diverse road topologies, and situations involving challenging interactions with other road users (vehicles, bicycles, pedestrians). In addition to collecting naturalistic data on open roads, we also capture designed scenarios inside the Mcity test facility, with a focus on intersections.
Controlled diversity: We repeatedly collect naturalistic driving data on fixed routes with deliberate variation in lighting, weather, traffic, and human driver characteristics.
Designed choreography: We designed representative urban driving scenarios with the host vehicle interacting with other vehicle/pedestrian/cyclist inside the Mcity test facility. The test case parameters were selected to cover both normal (courteous, law-abiding) and abnormal (aggressive, against traffic law) conditions.
Completeness: As shown in Table I, we use a comprehensive set of sensors.
Mcity dataset | 2019 | AA, Mcity | 50 h / 3k mi | 17.5k frames

Notes: (1) All listed datasets have front-facing camera(s). We define Lighting Diversity as whether both daytime and night data were collected, and Weather Diversity as whether both clear and rainy/snowy/foggy weather are involved. Driving Behavior indicates whether the driver's facial/posture/steering commands are studied. Designed Choreography refers to designed vehicle-vehicle/pedestrian/bicyclist interactions. (2) In the table, a check mark denotes Yes, a cross denotes No, and "-" indicates that no information is provided. (3) AA: Ann Arbor, SG: Singapore, CA: California, NY: New York, SF: San Francisco, AZ: Arizona, WA: Washington, AS: ApolloScape, H3D: Honda Research Institute 3D Dataset, HDD: Honda Research Institute Driving Dataset.
The remainder of this paper is organized as follows: Section II describes related literature. Section III introduces our vehicle platform and sensor setup. Section IV outlines our effort in data calibration, tagging, and labeling. Section V presents example data analytics. Finally, Section VI points out general conclusions and ongoing/future efforts of our work.
II Related Work
II-A Image Datasets
Many image datasets have been openly released for AV development. Examples such as ImageNet[imagenet] and COCO[coco] provide a seminal starting point for large-scale AI study. CamVid[camvid] offers semantic segmentation for 701 images, and Cityscapes[Cityscapes], captured in 50 cities, includes pixel-level annotations for 5k images. More recent datasets include Vistas[mapillary], BDD100k[bdd], and ApolloScape[appolo]. Some datasets were designed to capture particular diversities/challenges in driving: Vistas and BDD100k target large-scale naturalistic driving from many drivers with a wide variety of weather and lighting conditions, [scnn] focuses on data for lane lines, and [cityperson, nightperson] focus on pedestrians. There have also been efforts that rely exclusively on camera images for AV perception; however, 3D localization using images only is challenging [39, 54, 46, 50]. This motivates a more comprehensive sensor setup that combines semantic sensors (cameras) with ranging sensors (Radar/Lidar/Ibeo), a combination that provides better performance and redundancy under hardware failure [nuscenes]. Many datasets released recently include both semantic and ranging sensors.
II-B Multimodal Datasets
The seminal work that conveys the strength of multimodal sensors is KITTI[kitti], which provides Lidar scans as well as stereo images and GPS/IMU data. The H3D dataset[h3d] provides annotations in a 360° view, not just of the front objects. The KAIST dataset adds a thermal camera for nighttime perception [kaist], Oxford RobotCar studies repeated driving on the same route[oxford], ApolloScape captures Lidar scans in dense traffic [aslidar], and nuScenes focuses on 360° semantic views[nuscenes]. Very recent datasets also include work from industrial entities such as Waymo[waymo_dataset] and Lyft[lyft_dataset].
II-C Driving Behavior Datasets
The aforementioned datasets primarily focus on data collection for different road environments. We believe another important aspect is the interaction with other road users and the driver's steering and speed control inputs. A prominent example of data collection not focused on AV development is the University of Michigan Safety Pilot project data (http://safetypilot.umtri.umich.edu/index.php?content=video). This dataset captures vehicle speed, location, and front perception using a MobilEye camera, and many data analysis results based on it have been published[dingzhao_driving_behavior, xianan_paper, xianan_paper_2]. Multimodal datasets also usually include accurate GPS positions, thus providing the possibility of extracting vehicle speed, acceleration, and heading angle for human driver modeling [kitti, nuscenes, oxford]. Datasets focused on human driver behavior also include [honda_dataset, brain4car].
In the literature, different annotation and labeling strategies have been used. For images, 2D bounding boxes[imagenet, bdd], 3D bounding boxes[kitti, nuscenes, Cityscapes], and pixel-level segmentation[coco, cam_radar_fusion, bdd] are the most common formats. When (360°) Lidar points are available, 3D bounding boxes may be provided[kitti, aslidar]. For Ibeo and Radar, annotations are usually not provided because the sensor outputs are too sparse or too complicated to annotate.
III Vehicle Platform and Sensors
We collect the data by manually driving an instrumented Lincoln MKZ. This vehicle is equipped with the following sensors:
3 Velodyne Ultra Puck VLP-32C Lidars, horizontal angular resolution 0.2°, vertical 0.33°, range 200 m, 10 Hz.
2 forward-facing cameras, 60° and 30° FOV, 1080p, 30 Hz.
1 backward-facing camera, 90° FOV, 1080p, 30 Hz.
1 cabin pose camera, 1280×1080, 30 Hz.
1 cabin head/eyeball camera, 640p, 30 Hz.
1 Ibeo four-beam LUX sensor, horizontal angular resolution 0.25°, vertical 0.8°, range 50 m, 25 Hz.
1 Delphi ESR 2.5 Radar, range 60 m, 90° FOV, 20 Hz.
1 NovAtel FlexPak6 with IMU-IGM-S1 and 4G cellular for RTK GPS, single antenna, 1 Hz.
The locations and example outputs of the sensors are shown in Fig. 1-2. We use two cameras for forward perception: one with a wide FOV for general object detection/tracking, the other with a narrower FOV for traffic signs and signals. A wide-FOV Logitech BRIO camera is used for backward monitoring. The internal cabin cameras capture the body pose and head/eye movement of the human driver. We use three mechanical scanning Lidars, all on the rooftop, to capture objects in front, rear left, and rear right of the vehicle.
We record all sensors and critical vehicle CAN bus data. The CAN bus reports throttle, brake, and steering commands from the human driver, as well as turn signals and high/low beam states. All the external cameras are connected to a laptop with a GPU, and the videos are recorded via the FFmpeg software. Other sensors (including the head/eyeball movement camera) and the CAN bus data are logged in ROS format.
IV Data Collection and Annotation
IV-A Data Collection Overview
The data is collected both on open roads and inside Mcity. On open roads, we focus on highways and major local roads. Three human drivers drive manually on these routes with different lighting, weather, road, and traffic conditions. See Fig. 3-4 for the four routes and an example of the collected scenes. We select routes that take roughly 1 hour round-trip. In total, over 3,000 miles have been covered. In the near future, we plan to focus on urban environments.
The second set of data focuses on designed choreography inside Mcity. We refer particularly to the challenging scenarios in the Mcity ABC testing [mcity_abc], and design 18 scenarios to study vehicle-to-vehicle and vehicle-to-pedestrian/bicyclist interactions; see Fig. 6 and Table II. We record the sensor data with both normal (obeying traffic rules) and abnormal (disobeying traffic rules) driving behaviors involved. We swap the roles of the interacting vehicles when appropriate, and repeat the collection runs three times for each scenario.
Vehicle-vehicle interactions
Scenario 1 | Low speed merge
Scenario 2 | Vehicle cuts in
Scenario 3 | Parked vehicle door ajar*
Scenario 4 | Pass parallel parked vehicle
Scenario 5 | Roadside parked vehicle start up
Scenario 6 | Inclined parked vehicle start up
Scenario 7 | Intersection right turn, other straight
Scenario 8 | Intersection left turn, other right turn/straight
Scenario 9 | Vehicle entering round-about

Vehicle-pedestrian (P)/bicyclist (B) interactions
Scenario 1 | Vehicle driving straight at intersection†
Scenario 2 | Vehicle right turn at intersection†
Scenario 3 | Vehicle left turn at intersection†
Scenario 4&5 | Vehicle follows & passes P/B on road
Scenario 6 | Pedestrian yields to vehicle driving on road
Scenario 7 | P/B emerges from behind occlusion
Scenario 8&9 | Vehicle entering & exiting round-about†

*: Other vehicle door ajar; no role swap for the recording MKZ.
†: For Scenarios 1-3, 8, and 9, P/B uses the crosswalk to cross the road.
Overall we have collected more than 50 hours of naturalistic driving data covering more than 3,000 miles, and 255 runs for the designed choreography. In total we have roughly 8 TB of ROS files and 3 TB of FFmpeg video.
IV-B Data Synchronization and Calibration
The synchronization of the data mainly consists of temporal and spatial calibration. For temporal calibration, we synchronize using the UTC timestamp. For video recording, we modified the FFmpeg software to report the UTC timestamp when each frame is written to disk. For ROS-formatted files, the ROS time is equivalent to the UTC timestamp.
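With a common UTC timebase, frames from different sensors can be associated by nearest-timestamp matching. The sketch below is our own minimal illustration of this idea, not the released toolchain; the 50 ms tolerance is an assumed value chosen for illustration.

```python
import bisect

def nearest_timestamp_match(cam_ts, lidar_ts, tol=0.05):
    """For each camera frame timestamp (seconds, UTC), find the index of
    the closest lidar scan timestamp, or None if no scan lies within the
    tolerance. Both lists are assumed sorted in ascending order."""
    matches = []
    for t in cam_ts:
        i = bisect.bisect_left(lidar_ts, t)
        # Candidates are the neighbors on either side of the insertion point.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(lidar_ts)]
        best = min(candidates, key=lambda j: abs(lidar_ts[j] - t))
        matches.append(best if abs(lidar_ts[best] - t) <= tol else None)
    return matches
```

For example, a 30 Hz camera frame at t = 0.033 s matches the 10 Hz lidar scan at t = 0.0 s, while a frame with no scan within 50 ms is left unmatched.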
The spatial calibration mainly includes calibration of the camera intrinsic/extrinsic parameters, and camera-Lidar, camera-Radar, and camera-Ibeo calibrations. For the camera parameters, we adopt open tools in ROS and use checkerboards for the calibration. For camera-Lidar and camera-Ibeo, we follow a methodology similar to KITTI, i.e., using marker boards with some manual effort [yuanxin_calib]. For camera-Radar calibration, we follow [cam_radar_fusion]. See Fig. 7 for an illustration of the camera and front Lidar alignment results.
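Once the extrinsics (rotation R, translation t) and intrinsics (matrix K) are known, Lidar points can be overlaid on camera images via a standard pinhole projection. The helper below is a generic plain-Python sketch of that projection, not the calibration code used for this dataset; the calibration values in the usage note are hypothetical.

```python
def project_lidar_point(p, R, t, K):
    """Project a 3D lidar point into image pixel coordinates.
    p: (x, y, z) in the lidar frame; R: 3x3 rotation and t: 3-vector
    (extrinsics, lidar-to-camera); K: 3x3 camera intrinsic matrix.
    Returns (u, v) in pixels, or None if the point is behind the camera."""
    # Transform into the camera frame: p_cam = R p + t
    pc = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
    if pc[2] <= 0:  # behind the image plane, cannot be projected
        return None
    u = K[0][0] * pc[0] / pc[2] + K[0][2]
    v = K[1][1] * pc[1] / pc[2] + K[1][2]
    return (u, v)
```

For instance, with an identity extrinsic and an assumed intrinsic of focal length 1000 px and principal point (640, 360), a point 10 m ahead and 1 m to the right projects to pixel (740, 360).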
IV-C Data Tagging
We tag the data (images) for two purposes: ease of data query, and balancing diversity against the labor of annotation. We devise four tags for each frame: road type, road (surface) condition, weather condition, and image quality. Associated labels are then assigned to each tag. See Fig. 8 for the tagging hierarchy. More explanation and tag distribution analysis can be found in Section V.
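As a small illustration of the ease-of-query purpose, per-frame tags can be filtered with a simple keyword query. The records and tag values below are hypothetical examples, not entries from the actual dataset.

```python
# Hypothetical per-frame tag records; the four fields follow the tagging
# hierarchy described in the text (road type, road condition, weather, quality).
frames = [
    {"id": 1, "road_type": "highway", "road_cond": "dry",  "weather": "clear", "quality": "normal"},
    {"id": 2, "road_type": "urban",   "road_cond": "snow", "weather": "snowy", "quality": "blur"},
    {"id": 3, "road_type": "highway", "road_cond": "wet",  "weather": "rainy", "quality": "normal"},
]

def query(frames, **tags):
    """Return the frames whose tag values match all given keyword filters."""
    return [f for f in frames if all(f.get(k) == v for k, v in tags.items())]

rainy_highway = query(frames, road_type="highway", weather="rainy")
```

Queries such as "all rainy highway frames" then reduce to one call, which is the kind of retrieval the tags are meant to enable.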
IV-D Data Annotations
We divide our open road data over the year into 5 stages (batches). Currently we provide results primarily for the front 60° FOV camera. We have annotated more than 17.5k image frames. The annotation class list mainly consists of different objects and traffic signs. An individual file is generated for each frame, describing the segmentation boundaries of the listed objects/traffic signs. See Fig. 4 for example results.
V Data Analysis and Use
V-A Image Annotations
We perform diversity analysis on our current image annotations. We divide our annotations into 4 groups, i.e., objects, traffic lights, traffic signs, and lanes. Fig. 10 shows the statistics of the different labels in each group, along with example label counts for the object group. For the other three annotation groups, we label traffic lights and traffic signs, visually discernible lane lines/markers/road curbs/stop lines, and each individual lane line segment, which makes ours among the most elaborately annotated datasets we are aware of.
Statistics pertaining to the image tagging are shown in Fig. 11. For weather conditions, most of the data were captured in normal or sunny weather, but rainy, foggy, and snowy days are also included. Road condition mainly depicts the lighting/friction condition of the road surface; we mainly consider surface deterioration, material change, and snow coverage in our tagging. Statistics of the image quality and road type tags are also shown in Fig. 11. While many other datasets include data only when the camera works perfectly, poor image quality due to weather or hardware malfunction should be considered, so we include normal, adverse-lighting, lens-condensation, and blurred images. Road type describes the shape of the collected road/lane lines; see the statistics in Fig. 11. This tagging is among the most elaborate across open datasets.
V-B Driving Behaviors
Our data includes open naturalistic driving and designed choreography inside the Mcity test facility. For the latter, the behaviors are illustrated in Table II and Fig. 6; the analysis in this section focuses on the open road data. Following [kitti, honda_dataset], we discuss the ego motions of the recording vehicle; however, we analyze the driving behaviors separately for different scenarios. Following previous research efforts on highway entrance/exit in [highway_enter_1, highway_exit], lane change in [LC_2], and intersection interactions in [Intersection_1], we split the recorded data in each run into 7 scenarios, i.e., left turn, left lane change (LC), ramp entrance, right turn, right LC, highway exit, and lane keeping, and organize the data accordingly. See the distribution of the 7 scenarios in Fig. 12, where we also mark the total number of recordings for each scenario. To the best of our knowledge, our dataset is the only one organized according to driving scenarios.
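To make the idea of a scenario split concrete, a toy heuristic could combine the CAN-bus turn signal with the net heading change over a maneuver: turns accumulate a large heading change, while lane changes end with little net change. This sketch is purely illustrative and is not the segmentation procedure used for the dataset; the 45° threshold is an assumption.

```python
def classify_maneuver(signal, heading_change_deg):
    """Toy scenario classifier (illustrative only).
    signal: turn signal state from the CAN bus ('left', 'right', 'none').
    heading_change_deg: net vehicle heading change over the maneuver window.
    Turns produce a large net heading change; lane changes end nearly
    parallel to the original heading, so the net change stays small."""
    if signal == "none":
        return "lane keeping"
    side = "left" if signal == "left" else "right"
    if abs(heading_change_deg) > 45:  # assumed threshold separating turns from LCs
        return f"{side} turn"
    return f"{side} lane change"
```

A real pipeline would also need map context (e.g., to tell a right LC from a highway exit, which the paper notes share the same turn signal), which is precisely why signal state alone is insufficient.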
The human commands and vehicle motion for the right LC and highway exit scenarios are shown in Fig. 13. Although both scenarios use the right turn signal, the distributions of the states and commands are distinctly separated. This indicates the need to organize data based on driving scenarios. We are currently annotating the collected data to summarize the perception data.
V-C The Complete Driving Flow
Our data also provides the complete flow of the human driver's actions. We show two examples in Fig. 14-15. Fig. 14 depicts an unprotected right turn on open roads, where the human driver catches a safe gap to make the turn. In Fig. 15, we illustrate a trip wherein the driver disobeys traffic rules in an unprotected left turn. In both figures, we record and plot the visual perception direction (gray arrow) and the throttle, brake, and steering commands; the ego vehicle speed is also shown. We believe that in addition to being efficient and safe, being naturalistic is also a desired trait for an AV. The complete data capture will be useful for such analysis.
VI Conclusion and Future Work
This paper presents the ongoing data collection effort at Mcity. Compared to existing datasets, our data is complete with all commonly used sensor types. We collect data both naturalistically on open roads and inside the Mcity test facility with designed choreography. We perform preliminary analysis of our data, using tags to indicate different driving scenarios and conditions.
We want to thank many students/engineers at Mcity for the vehicle platform development and their thoughtful suggestions on data analysis. We also thank Seres for providing the vehicle and funding the project, and Might AI for image labeling.