The rapid improvement of machine learning and computer vision systems has spurred the development of self-driving vehicles, which have already covered millions of kilometers in real-world scenarios. It appears that the development of processing technology and algorithms currently advances at greater speed than the development of sensing hardware for capturing the necessary information from the surroundings of the vehicle, such as obstacles, traffic, marks, and signs. Automotive image sensors are being intensively developed to deal with the conflicting requirements for low cost, high dynamic range, high sensitivity, and resistance to artifacts from flickering light sources such as LED traffic signs and car taillights. Operation under bad weather and/or lighting conditions is a primary requirement for automotive self-driving or automatic driver assistance systems (ADAS). However, current ADAS sensors and systems still face many problems compared to human driver performance in challenging situations. Since event cameras have been proposed as possible ADAS sensors (Posch et al., 2014), we collected data to study the use of an event camera to augment conventional imager technology.
Rather than providing frame-based video as output, the dynamic vision sensor (DVS) event camera detects local changes in the brightness of individual pixels and asynchronously outputs those changes at the time of occurrence (Lichtsteiner et al., 2008; Posch et al., 2014). Thus, only the parts of the scene that change produce data. Because changes in pixel brightness are streamed out of the camera as they occur, the output data rate is lower, the temporal resolution is higher, and the latency is shorter than in frame-based systems. The local instantaneous gain control increases usability under uncontrolled lighting conditions. The higher temporal resolution and limited data rate make the DVS well suited for autonomous driving applications, where both latency and power consumption are important. A dynamic and active-pixel vision sensor (DAVIS) has pixels that concurrently output DVS events and standard image sensor intensity frames (Brandli et al., 2014).
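The change-detection principle described above can be illustrated with a minimal single-pixel sketch. It assumes a commonly used simplified model in which a pixel emits an ON or OFF event whenever its log intensity has moved by a fixed threshold since the last event. The function name `dvs_pixel_events` and the threshold value are hypothetical and are not taken from the sensor or its software.

```python
import math


def dvs_pixel_events(intensities, threshold=0.3):
    """Return (sample_index, polarity) events for one pixel.

    Simplified model (an assumption, not the sensor circuit): an event is
    emitted each time the log intensity differs from the value stored at
    the last event by at least `threshold`. Polarity is +1 (ON) for an
    increase and -1 (OFF) for a decrease.
    """
    reference = math.log(intensities[0])
    events = []
    for index, value in enumerate(intensities[1:], start=1):
        change = math.log(value) - reference
        # A large change produces several events in a row.
        while abs(change) >= threshold:
            polarity = 1 if change > 0 else -1
            events.append((index, polarity))
            reference += polarity * threshold
            change = math.log(value) - reference
    return events


# Hypothetical intensity trace: constant, doubled, then quartered.
trace = [100.0, 100.0, 200.0, 200.0, 50.0]
events = dvs_pixel_events(trace)
```

In this model a constant input produces no events, and scaling the whole trace by a constant factor leaves the events unchanged, which is one way to picture the local gain control mentioned above.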
Recent studies have shown the utility of using DVS in data-driven convolutional neural network (CNN) real-time applications (Moeys et al., 2016; Lungu et al., 2017). In these applications, DVS input frames typically consist of a 2D histogram image of a constant number of a few thousand DVS events. Because the DVS event rate is proportional to the rate of change of brightness, i.e. scene reflectance (Lichtsteiner et al., 2008), the CNN frame rate is variable, ranging from about 1 fps up to 1000 fps. Moeys et al. (2016) showed that combining the standard image sensor frames from the sensor with the DVS frames resulted in higher accuracy and lower average reaction time. Here we extend this work to real-world driving in the first published end-to-end dataset of DVS or DAVIS driving data.
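A minimal sketch of constant-event-count framing is given below. It assumes events are available as `(timestamp, x, y)` tuples sorted by time and ignores event polarity for brevity. The function name `events_to_frames` and the tuple layout are illustrative assumptions, not the format used by the cited works.

```python
def events_to_frames(events, events_per_frame, width, height):
    """Group events into 2D histograms of a fixed event count.

    `events` is a time-sorted list of (timestamp, x, y) tuples
    (hypothetical layout). Returns the histograms and the time span
    covered by each one. Trailing events that do not fill a complete
    frame are dropped.
    """
    frames = []
    durations = []
    last_start = len(events) - events_per_frame
    for start in range(0, last_start + 1, events_per_frame):
        chunk = events[start:start + events_per_frame]
        frame = [[0] * width for _ in range(height)]
        for _, x, y in chunk:
            frame[y][x] += 1
        frames.append(frame)
        durations.append(chunk[-1][0] - chunk[0][0])
    return frames, durations


# Hypothetical events: a fast burst followed by slow activity.
demo_events = [
    (0, 0, 0), (10, 1, 0), (20, 1, 0),
    (30, 2, 1), (1000, 0, 1), (2000, 0, 1),
    (3000, 3, 1),
]
demo_frames, demo_durations = events_to_frames(
    demo_events, events_per_frame=3, width=4, height=2
)
```

Each histogram holds the same number of events, while the time span per frame depends on scene activity. This is why the resulting frame rate is variable.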
2 DAVIS Driving Dataset 2017 (DDD17)
DDD17 is available from sensors.ini.uzh.ch/databases. The data were collected while driving on Swiss and German roads under various conditions and include DAVIS data and car data. Since the main aim of this dataset is to enable studying the fusion of APS and DVS data for ADAS, we did not include other sensors such as LIDAR.
2.1 DAVIS data
Visual data was captured using a DAVIS346B prototype camera, containing a DAVIS APS+DVS sensor, such that event-based and traditional frame-based data could be recorded at the same time, through the same optics. The camera resolution is 346×260 pixels. The camera architecture is similar to Brandli et al. (2014), but the sensor has 2.1× more pixels and includes on-chip column-parallel analog-to-digital converters (ADCs) for frame-based APS output up to 50 fps. The DAVIS346B also has optimized buried photodiodes with microlenses that increase fill factor and reduce dark current, thereby improving operation at low light intensities by a factor of about 4 compared with the Brandli et al. (2014) DAVIS240C. A fixed focal length lens (C-mount, 6 mm) was used for all recordings, providing a horizontal field of view of 56°. The aperture was set manually, depending on lighting conditions. The APS frame rate depended on the exposure duration and ranged between 10 fps and 50 fps; in some recordings it varied with the duration chosen by the auto-exposure algorithm. The frames were captured using the DAVIS global shutter mode to minimize motion artifacts. The camera was mounted using a glass suction tripod mount behind the windshield, just below the rear-view mirror, and aligned to point to the center of the hood. Markers on the car hood were used to initially align the camera for the first recording session and the camera was never moved from this position. These markers were left on the hood throughout the entire recording period for control. A polarization filter was used in some of the recordings to reduce windshield and hood glare. The camera was powered by and connected to a laptop computer through high-speed USB 2.0. The raw data was read out using the inilabs cAER software and streamed to the custom recording framework described in Sec. 2.3 for further processing.
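The stated field of view can be sanity-checked with the pinhole relation between sensor width and focal length. The sketch below assumes a sensor width of 346 pixels and a pixel pitch of 18.5 µm. Both values are assumptions about the DAVIS346 that are not given in the text above, and the function name is hypothetical.

```python
import math


def horizontal_fov_deg(pixels_wide, pixel_pitch_um, focal_length_mm):
    """Horizontal field of view of an ideal pinhole camera, in degrees."""
    sensor_width_mm = pixels_wide * pixel_pitch_um / 1000.0
    half_angle = math.atan(sensor_width_mm / (2.0 * focal_length_mm))
    return math.degrees(2.0 * half_angle)


# Assumed sensor geometry (346 px, 18.5 um pitch); 6 mm lens as stated.
fov = horizontal_fov_deg(pixels_wide=346, pixel_pitch_um=18.5,
                         focal_length_mm=6.0)
```

Under these assumptions the result is close to the 56° quoted for the 6 mm lens. Lens distortion is ignored in this approximation.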
2.2 Vehicle control and diagnostic data
Data was acquired using a Ford Mondeo MK 3 European Model. We used the OpenXC Ford Reference vehicle interface, which plugs into the passenger compartment OBDII port, to read out control and diagnostic data from the car’s CAN bus. The vehicle interface connects to a host USB port.
The vehicle interface was programmed with the vendor-provided firmware for the Ford Mondeo MK 3 car model (“type 3” firmware) and read out using the OpenXC Python library. The raw data was passed to the custom recording software described in Sec. 2.3. The following quantities were read out at rates of about 10 Hz each. Likely targets for experiments in end-to-end learning are in boldface.
steering wheel angle (degrees, up to 720),
accelerator pedal position (% pressed),
brake pedal status (pressed/not pressed),
engine speed (rpm),
vehicle speed (km/h),
headlamp status (on/off),
high beam status (on/off),
windshield wiper status (on/off),
torque at transmission,
transmission gear position (gear no.),
fuel consumed since restart,
fuel level (%),
parking brake status (on/off).
2.3 Recording and viewing software
A Python software framework (ddd17-utils) for recording, viewing, and exporting the data was created for the main purpose of combining and synchronizing the data from the different input devices and storing it in a standardized file format. In particular, since the APS frames and DVS data are microsecond time-stamped on the camera using its own local clock, whereas the data provided by the vehicle interface is not, both data streams were augmented with the millisecond system time of the recording computer, which could then be used for synchronization. With the vehicle interface streaming data at rates of only around 10 Hz per recorded variable, such off-device time-stamping is justified. The computer time was synchronized to a standard time server before recordings. The data was stored in HDF5 format, for which widely used libraries for various environments exist. Each data type (e.g. DVS events, steering wheel angle, vehicle speed…) was stored in a separate container, each containing one container for the system timestamp and one for the data. In this way, the system timestamp can be used for fast indexing and for synchronizing the data when reading. With data being provided at irregular intervals by the recording devices, each data type was stored in an event-driven fashion, such that different containers contain different numbers of samples. The DAVIS data was stored in its native cAER AER-DAT3.1 format in each HDF5 container.
In addition to the recording framework, a Python-based viewer view.py visualizes the recorded DAVIS data together with selected vehicle data such as the steering angle or speed (Fig. 1). The script export.py exports the data into frames to prepare it for further processing by machine learning algorithms.
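One way to use the system timestamps for synchronization when reading is sketched below. It assumes each data type has already been loaded as a sorted list of millisecond timestamps plus a parallel list of values, and it pairs each query time with the most recent earlier sample (a zero-order hold). The hold strategy and the function name `latest_sample_before` are illustrative choices, not necessarily what the ddd17-utils framework does.

```python
import bisect


def latest_sample_before(sample_times, sample_values, query_times):
    """For each query time, return the last sample at or before it.

    `sample_times` must be sorted in ascending order. Queries that fall
    before the first sample yield None.
    """
    matched = []
    for query in query_times:
        # Index of the last sample whose timestamp is <= query.
        index = bisect.bisect_right(sample_times, query) - 1
        matched.append(sample_values[index] if index >= 0 else None)
    return matched


# Hypothetical steering samples at about 10 Hz and frame times (ms).
steering_times = [0, 100, 200]
steering_angles = [1.0, 2.0, 3.0]
frame_times = [-5, 0, 50, 100, 250]
angles_at_frames = latest_sample_before(
    steering_times, steering_angles, frame_times
)
```

Because the timestamps are sorted, each lookup is a binary search. This is the kind of fast indexing that a per-type timestamp container makes possible even when the streams hold different numbers of samples.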
3 Recorded data
In total, over 12 h of data were recorded under various weather, driving, road, and lighting conditions on six consecutive days, covering over 1000 km of different types of roads in Switzerland and Germany. Recordings were started and stopped manually and typically have durations of between a minute and an hour. The resulting recordings are summarized in Table 1. Fig. 2 shows the distributions of several recorded variables over the whole dataset. Steering angles are dominated by straight driving and small deviations. Speed is uniformly distributed over the range 0-160 km/h. The automatically controlled headlight is on about half the time, indicating that a substantial fraction of the data was captured in low-light conditions.
4 Experiments: Steering prediction network
End-to-end learning of a control model is an attractive approach for self-driving applications, since it eliminates the need for tedious hand-labeling of the data or features – a task which is prohibitive in the face of the enormous amounts of data acquired by today’s vehicles (Bojarski et al., 2016). The presented dataset has clear limitations, since it does not include other sensors such as LIDAR, does not include route information that would allow better prediction of user intentions, and the data tends to be unbalanced. Nevertheless, under certain conditions such as highway driving, driving along roads without turns onto other roads, or unpredictable user actions, it can be used to study the utility of the data for prediction of measured user actions.
We trained simple steering prediction networks. These networks take input APS and/or DVS data and attempt to predict the instantaneous steering wheel angle. They are inspired by LeCun’s early work (LeCun et al., 2005), the seminal open dataset from comma.ai (Santana & Hotz, 2016), and by recent Nvidia (Bojarski et al., 2016) and unpublished VW studies.
Our results compare the steering prediction accuracy of networks operating on pure APS data to that of networks operating on pure DVS data. Our example implementation should be regarded as a preliminary study to validate the usability of the data and associated software. In particular, the experiments presented here are based on a small subset of the whole dataset (recordings 1487858093 and 1487433587 in Table 1). Work is ongoing to train more architectures using more of the data.
Our first results were obtained from a CNN with 4 convolutional layers, each with 8 feature maps and 3x3 kernels, trained on a single 1.5 h recording. Each convolutional layer is followed by a 2x2 max pooling layer. The final feature map layer is mapped to a 64-unit fully connected (FC) layer, which is mapped to an output steering angle. The DVS and APS inputs were subsampled to 80x60 images. Input frame normalization was done as in Moeys et al. (2016).
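As a rough check on the size of such a network, the sketch below tracks the feature map dimensions through the four convolution and pooling stages and counts trainable parameters. Several details are assumptions that the text does not specify: a single input channel, bias terms in every layer, a single scalar output, and the padding mode (both options are shown). The function name `network_summary` is hypothetical.

```python
def network_summary(height=60, width=80, in_channels=1, feature_maps=8,
                    conv_layers=4, kernel=3, pool=2, fc_units=64,
                    same_padding=True):
    """Return the final feature map size and the parameter count.

    Each stage is a kernel x kernel convolution followed by pool x pool
    max pooling. With same_padding=False the convolution shrinks each
    dimension by kernel - 1 before pooling.
    """
    parameters = 0
    channels = in_channels
    for _ in range(conv_layers):
        # Convolution weights plus one bias per output feature map.
        parameters += kernel * kernel * channels * feature_maps
        parameters += feature_maps
        channels = feature_maps
        if not same_padding:
            height -= kernel - 1
            width -= kernel - 1
        height //= pool
        width //= pool
    flattened = height * width * channels
    parameters += flattened * fc_units + fc_units  # FC layer
    parameters += fc_units + 1                     # scalar output unit
    return (height, width), parameters


padded_shape, padded_params = network_summary(same_padding=True)
valid_shape, valid_params = network_summary(same_padding=False)
```

Under either padding assumption the network has fewer than ten thousand parameters, which is very small by the standards of current driving networks.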
Our quantitative accuracy results are too inconclusive to report but we have verified the usability of the dataset and tools. Further analysis is necessary and the subject of ongoing work.
The main result of this paper is the introduction of DDD17, the first open dataset of DAVIS driving data with end-to-end labeling, along with the necessary software tools. A preliminary study of end-to-end steering angle prediction by a CNN shows the usability of the data.
- Bojarski et al. (2016) Bojarski, Mariusz, Del Testa, Davide, Dworakowski, Daniel, Firner, Bernhard, Flepp, Beat, Goyal, Prasoon, Jackel, Lawrence D, Monfort, Mathew, Muller, Urs, Zhang, Jiakai, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
- Brandli et al. (2014) Brandli, C., Berner, R., Yang, M., Liu, S-C., and Delbruck, T. A 240×180 130 dB 3 µs latency global shutter spatiotemporal vision sensor. IEEE Journal of Solid-State Circuits, 49(10):2333–2341, 2014.
- LeCun et al. (2005) LeCun, Yann, Muller, Urs, Ben, Jan, Cosatto, Eric, and Flepp, Beat. Off-Road Obstacle Avoidance through End-to-End Learning. In Advances in Neural Information Processing Systems, pp. 739–746, 2005.
- Lichtsteiner et al. (2008) Lichtsteiner, P., Posch, C., and Delbruck, T. A 128×128 120 dB 15 µs latency asynchronous temporal contrast vision sensor. IEEE Journal of Solid-State Circuits, 43(2):566–576, Feb 2008.
- Lungu et al. (2017) Lungu, Iulia-Alexandra, Corradi, Federico, and Delbruck, Tobias. Live Demonstration: Convolutional Neural Network Driven by Dynamic Vision Sensor Playing RoShamBo. In 2017 IEEE Symposium on Circuits and Systems (ISCAS 2017), Baltimore, MD, USA, 2017.
- Moeys et al. (2016) Moeys, D. P., Corradi, F., Kerr, E., Vance, P., Das, G., Neil, D., Kerr, D., and Delbrück, T. Steering a predator robot using a mixed frame/event-driven convolutional neural network. In 2016 Second International Conference on Event-based Control, Communication, and Signal Processing (EBCCSP), pp. 1–8, June 2016.
- Posch et al. (2014) Posch, C., Serrano-Gotarredona, T., Linares-Barranco, B., and Delbruck, T. Retinomorphic Event-Based Vision Sensors: Bioinspired Cameras With Spiking Output. Proceedings of the IEEE, 102(10):1470–1484, October 2014.
- Santana & Hotz (2016) Santana, Eder and Hotz, George. Learning a driving simulator. arXiv preprint arXiv:1608.01230, 2016.