Advanced driving assistance systems (ADAS) and autonomous driving (AD) require a detailed comprehension of complex driving scenes. With safety as the main objective, complementary and redundant sensors are mobilized to tackle this challenge. The best current systems rely on deep learning models that are trained on large annotated datasets for tasks such as object detection or semantic segmentation in video images and lidar point clouds. To further improve the performance of AD systems, extending the size and scope of open annotated datasets is a key challenge.
ADAS and AD-enabled vehicles are usually equipped with cameras, lidars and radars to gather complementary information from their environment and, thus, to allow as good a scene understanding as possible in all situations. Unfortunately, bad weather conditions are challenging for most sensors: lidars have poor robustness to fog [bijelic_benchmark_2018], rain or snow; cameras behave poorly in low lighting conditions or in case of sun glare. Radar sensors, on the other hand, generate electromagnetic wave signals that are not affected by weather conditions or darkness. Also, radar informs not only about the 3D position of other objects, as lidar does, but also about their relative speed (radial velocity). However, in comparison to other sensory data, radar signals are difficult to interpret, very noisy and have a low angular resolution. This is one reason why cameras and lidars have been preferred in recent years.
In this paper, we present two main contributions. First, we introduce CARRADA, a dataset with synchronized camera data and raw radar sequences together with range-angle-Doppler annotations for scene understanding (see a sample in Fig. 1). Annotations with bounding boxes, sparse points and dense masks are provided for range-Doppler and range-angle representations of the radar data. Each object has a unique identifier and is categorized as a pedestrian, a car or a cyclist. This dataset could be used for object detection, semantic segmentation (as illustrated in our segmentation baseline) or tracking in raw radar signals. It should also encourage sensor fusion in temporal data. Secondly, we describe a semi-automatic method to generate the radar annotations using only the camera information, rather than a lidar as is usually done [weston_probably_2019, lim_radar_2019, major_vehicle_2019]. The aim of this contribution is to reduce annotation time and cost by exploiting visual information without the need for an expensive sensor. A baseline for radar semantic segmentation is also proposed and evaluated on well-known metrics. We hope that it will encourage deep learning research applied to raw radar representations.
The paper is organised as follows. In Section II, we discuss the related work and provide background on radar signals. Section III introduces the proposed dataset and its acquisition setup. Section IV explains our semi-automatic annotation method, from visual and physical information to the labelling of radar signals with temporal tracking. Section V details a baseline for radar semantic segmentation on raw representations. Finally, we discuss the proposed dataset and its current limitations in Section VI, before concluding in Section VII.
II-A Related Work
Previous works have applied deep learning algorithms to range-Doppler radar representation. Indoor activity recognition [kim_human_2016] and gait classification [klarenbeek_multi-target_2017] have been explored. Privacy motivates this application, as cameras can thus be avoided for scene understanding. Hand gesture recognition has been an active field of research using millimeter wave radar for classification [wang_interacting_2016, kim_hand_2016, dekker_gesture_2017, zhang_latern_2018, zhang_u-deephand_2019] or object signature recognition [sun_automatic_2019, wang_rammar_2019]. Sensor fusion using radar and cameras has been studied for hand gesture classification [molchanov_multi-sensor_2015]. Outdoor applications have also been considered to classify models of Unmanned Aerial Vehicles (UAVs) [brooks_temporal_2018].
Driving applications have recently shown interest in radar sensors, using representations that depend on the target task: Doppler spectrograms for vehicle classification [capobianco_vehicle_2018], range-angle for object classification [patel_deep_2019] or odometry [aldera_what_2019], range-angle and range-Doppler for object detection [major_vehicle_2019]. Radar data are also used for position prediction with range-Doppler [zhang_object_2019] or object box detection in images [nabati_rrpn_2019] with only a few data points. Sensor fusion of radar and camera is considered for driving applications such as occupancy grid segmentation [lekic_automotive_2019] or object detection in range-angle [lim_radar_2019].
Scene understanding for autonomous driving using deep learning requires a large amount of annotated data. This challenge is well known by the community and open source datasets have emerged in the past few years, e.g., [geiger_vision_2013, yu_bdd100k_2018, cordts_cityscapes_2016, huang_apolloscape_2019, sun_scalability_2019]. Several types of annotations are usually provided, notably 2D or 3D bounding boxes and semantic segmentation masks for each object. They describe video frames from cameras or 3D point clouds from lidar sensors. None of these datasets provides raw radar data recordings synchronized with the other sensors. Only very recent datasets include radar signals, but they are usually pre-processed and barely annotated.
The nuScenes [caesar_nuscenes_2020] dataset is the first large scale dataset providing radar data alongside lidar and camera data. However, the radar data are released with a non-annotated processed representation with only tens of points per frame.
The Oxford Radar RobotCar dataset [barnes_oxford_2020] combines camera, lidar and radar data for odometry. Only raw radar data with a range-angle representation is available, and it is not annotated for scene understanding.
Astyx has released a small dataset with a camera, a lidar and a high definition radar [meyer_automotive_2019]. Annotations with 3D bounding boxes are provided on each modality by using the lidar sensor for calibration. Raw radar data are processed and provided as a point cloud with a high resolution comparable to a lidar with longer range. However, the number of frames is limited to a few hundred.
In [gao_experiments_2019], the authors describe a partially annotated dataset with synchronised camera and raw radar data. A single object is recorded during each sequence. Bounding boxes are provided on both camera and range-angle representations with a calibration made by a lidar sensor.
Range-angle segmented radar data are provided by [nowruzi_deep_2020] for occupancy grid map in Cartesian coordinates. The annotations are generated using odometry from scene reconstruction of camera images.
To the best of our knowledge, range-angle and range-Doppler raw radar data have not been previously released together, nor have the corresponding annotations for object detection, semantic segmentation and tracking been provided. Moreover, no related work proposes deep learning algorithms exploiting both range-angle and range-Doppler annotations at the same time. This dataset will encourage exploration of advanced neural network architectures.
II-B Radar Sensor and Effects
A radar sensor emits electromagnetic waves via one or several transmitter antennas (Tx). The waves are reflected by an object and received by the radar via one or several receiver antennas (Rx). Comparing the transmitted and the received waveforms yields the distance, the relative velocity, the azimuth angle and the elevation of the reflector relative to the radar position [ghaleb_micro-doppler_2009]. Most automotive radars use Multiple Input Multiple Output (MIMO) systems: each Tx/Rx pair receives the reflected signal assigned to a specific Tx transmitting a waveform.
Frequency-Modulated Continuous Wave (FMCW) radar transmits a signal, called a chirp [brooker_understanding_2005], whose frequency is linearly modulated over the sweeping period $T_s$: at time $t \in [0, T_s]$, the emitted sinusoidal signal has frequency
$$f_E(t) = f_c + \frac{B}{T_s}\, t,$$
where $f_c$ is the carrier frequency and $B$ the bandwidth, and its phase reads
$$\phi_E(t) = 2\pi f_c\, t + \pi \frac{B}{T_s}\, t^2.$$
After reflection on an object at distance $r(t)$ from the emitter, the received signal has phase:
$$\phi_R(t) = 2\pi f_c\,(t - \tau) + \pi \frac{B}{T_s}\,(t - \tau)^2 = \phi_E(t) - \phi(t),$$
where $\tau = \frac{2 r(t)}{c}$ is the time delay of the signal round trip, with $c$ the velocity of the wave through the air considered as constant, and $\phi(t)$ is the phase shift:
$$\phi(t) = 2\pi f_c\, \tau + 2\pi \frac{B}{T_s}\, t\, \tau - \pi \frac{B}{T_s}\, \tau^2.$$
Measuring this phase shift (or equivalently the time delay between the transmitted and the reflected signal) grants access to the distance between the sensor and the reflecting object.
Its relative velocity is accessed through the frequency shift between the two signals, a.k.a. the Doppler effect. Indeed, the phase shift varies when the target is moving:
$$f_d = \frac{1}{2\pi} \frac{\mathrm{d}}{\mathrm{d}t}\left(2\pi f_c\, \tau\right) = \frac{2 v_R}{c}\, f_c,$$
where $v_R = \frac{\mathrm{d}r}{\mathrm{d}t}$ is the radial velocity of the target object w.r.t. the radar. This yields the frequency Doppler effect whereby the frequency change rate between transmitted and received signals, $\frac{f_d}{f_c} = \frac{2 v_R}{c}$, depends linearly on the relative speed of the reflector. Measuring this Doppler effect hence amounts to recovering the radial speed $v_R = \frac{c\, f_d}{2 f_c}$.
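As a numerical illustration of the two relations discussed above (range from the round-trip delay, radial speed from the Doppler shift), here is a minimal sketch. The 77 GHz carrier, the 30 m distance and the 5 m/s speed are hypothetical values chosen for the example, not parameters taken from the dataset.

```python
C = 299_792_458.0  # propagation speed of the wave, taken as the speed of light (m/s)


def range_from_delay(tau):
    """Distance (m) to a reflector given the round-trip delay tau (s): r = c * tau / 2."""
    return C * tau / 2.0


def doppler_shift(v_radial, f_carrier):
    """Doppler frequency (Hz) produced by a radial speed (m/s): f_d = 2 * f_c * v_R / c."""
    return 2.0 * f_carrier * v_radial / C


def radial_velocity(f_doppler, f_carrier):
    """Invert the Doppler relation to recover the radial speed (m/s)."""
    return C * f_doppler / (2.0 * f_carrier)


# Hypothetical example: 77 GHz carrier, target at 30 m moving at 5 m/s.
F_CARRIER = 77e9
tau = 2.0 * 30.0 / C
f_d = doppler_shift(5.0, F_CARRIER)
print(range_from_delay(tau), f_d, radial_velocity(f_d, F_CARRIER))
```

The Doppler shift of a few kilohertz is tiny compared with the carrier frequency, which is why it is measured across successive chirps rather than within a single one.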
The transmitted and received signals are compared with a mixer that generates a so-called Intermediate Frequency (IF) signal. The transmitted signal term is filtered using a low-pass filter, and the result is digitized by an Analog-to-Digital Converter (ADC). This way, the recorded signal carries the Doppler frequencies and ranges of all reflectors.
Using the MIMO system with multiple Rx antennas, the time delay between the signals received by each Rx from a given Tx carries the orientation information of the object. Depending on the positioning of the antennas, the azimuth angle and the elevation of the object are respectively deduced from the horizontal and vertical pairs of Tx/Rx. The azimuth angle $\alpha$ is deduced from the variation $\Delta\phi$ between the phase shifts of adjacent pairs of Rx. We have $\Delta\phi = \frac{2\pi f_c}{c}\, h \sin\alpha$, where $h$ is the distance separating the adjacent receivers, $f_c$ the carrier frequency and $c$ the velocity of the wave.
Consecutive filtered IF signals are stored in a frame buffer which is a time-domain 3D tensor: the first dimension corresponds to the chirp index; the second one is the chirp sampling defined by the linearly modulated frequency range; the third tensor dimension indexes Tx/Rx antenna pairs.
The Fast Fourier Transform (FFT) algorithm computes a Discrete Fourier Transform (DFT), which takes the recorded data from the time domain to the frequency domain. The 3D tensor is processed using a 3D-FFT: a Range-FFT along the rows resolving the object range, a Doppler-FFT along the columns resolving the object relative velocity and an Angle-FFT along the depth resolving the angle between two objects.
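The 3D-FFT described above can be sketched with NumPy as follows. This is an illustrative sketch, not the processing chain used for the dataset: it assumes the frame buffer is a complex array laid out as (chirp index, sample index, antenna pair), and the function names are hypothetical. Windowing and calibration, which a real pipeline would typically apply, are omitted.

```python
import numpy as np


def radar_cube_fft(frame):
    """Apply range, Doppler and angle FFTs to a (n_chirps, n_samples, n_antennas) frame."""
    cube = np.fft.fft(frame, axis=1)  # Range-FFT over the samples of each chirp
    cube = np.fft.fft(cube, axis=0)   # Doppler-FFT across consecutive chirps
    cube = np.fft.fft(cube, axis=2)   # Angle-FFT across antenna pairs
    # Center the zero-Doppler and zero-angle bins.
    return np.fft.fftshift(cube, axes=(0, 2))


def range_doppler_map(cube):
    """Aggregate magnitudes over the angle axis to obtain a (Doppler, range) map."""
    return np.abs(cube).sum(axis=2)


def range_angle_map(cube):
    """Aggregate magnitudes over the Doppler axis to obtain a (range, angle) map."""
    return np.abs(cube).sum(axis=0)
```

A single synthetic reflector (a complex exponential across samples and chirps) produces one peak in the range-Doppler map, which is a convenient sanity check for the axis conventions.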
The range, velocity and angle bins in the output tensor correspond to discretized values defined by the resolution of the radar. The range resolution is defined as $\delta r = \frac{c}{2B}$, with $c$ the velocity of the wave and $B$ the bandwidth. The relative velocity resolution $\delta v$ is inversely proportional to the frame duration time. The angle resolution $\delta\alpha = \frac{c}{f_c\, N_{Rx}\, h \cos\alpha}$ is the minimum angle separation between two objects to be distinguished, with $f_c$ the carrier frequency, $h$ the distance separating adjacent receivers, $N_{Rx}$ the number of Rx antennas and $\alpha$ the azimuth angle between the radar and an object at distance $r$ reflecting the signal.
The next section will describe the settings of the radar sensor used and the recorded dataset.
The dataset has been recorded in Canada on a test track to reduce environmental noise. The acquisition setup consists of an FMCW radar and a camera mounted on a stationary car. Both sensors are synchronised. The radar uses the MIMO system configuration with 2 Tx and 4 Rx, producing a total of 8 virtual antennas. The parameters and specifications of the sensor are provided in Table I. The recorded image data from the camera and the radar data are synchronized to have the same frame rate in the dataset. The sensors are also calibrated to have the same Cartesian coordinate system. The image resolution is pixels.
|Parameter||Value|
|Sweep Bandwidth||4 GHz|
|Maximum Range||50 m|
|Range Resolution||0.20 m|
|Maximum Radial Velocity||13.43 m/s|
|Radial Velocity Resolution||0.42 m/s|
|Field of View|
|Number of Chirps per Frame||64|
|Number of Samples per Chirp||256|
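The tabulated resolutions can be cross-checked against the bin counts with a few lines of arithmetic. The sketch below assumes, which the text does not state explicitly, that the range bins evenly span 0 to the maximum range and that the velocity bins evenly span minus to plus the maximum radial velocity.

```python
# Values copied from Table I.
MAX_RANGE_M = 50.0
MAX_RADIAL_VELOCITY_MS = 13.43
N_CHIRPS_PER_FRAME = 64
N_SAMPLES_PER_CHIRP = 256

# Assumption: one range bin per sample, covering [0, MAX_RANGE_M].
range_bin_m = MAX_RANGE_M / N_SAMPLES_PER_CHIRP

# Assumption: one velocity bin per chirp, covering [-v_max, +v_max].
velocity_bin_ms = 2.0 * MAX_RADIAL_VELOCITY_MS / N_CHIRPS_PER_FRAME

print(f"range bin: {range_bin_m:.3f} m, velocity bin: {velocity_bin_ms:.3f} m/s")
```

Under these assumptions, both values round to the resolutions listed in the table (0.20 m and 0.42 m/s).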
Scenarios with cars, pedestrians and cyclists have been recorded. The distribution of the object categories is illustrated in Figure 2. One or several objects are moving in the scene at the same time with various trajectories to simulate complex urban driving scenarios. The objects are moving in front of the sensors: approaching, moving away, going from right to left or from left to right (see examples in Figure 3). Each object is an instance tracked in the sequence. Statistics about the recordings are provided in Table II.
|Number of sequences||30|
|Total number of instances||78|
|Total number of frames||12726 (21.2 min)|
|Maximum number of frames per sequence||1018 (1.7 min)|
|Minimum number of frames per sequence||157 (0.3 min)|
|Mean number of frames per sequence||424 (0.7 min)|
|Total number of annotated frames with instance(s)||7545 (12.6 min)|
Object signatures are annotated in both range-angle and range-Doppler radar representations for each sequence. Each instance has an identification number, a category and a localization in the data. Three types of annotation format for localization are provided: sparse points, boxes and dense masks.
The next section will describe the pipeline used to generate the annotations.
IV Pipeline for Annotation Generation
Automotive radar representations are difficult to understand compared to natural images. Objects are represented by shapes with varying sizes carrying physical measures. It is not a trivial task to produce good quality annotations on this data. This section details a semi-automatic pipeline based on video frames to provide annotations on radar representations.
IV-A From vision to physical measurements
The camera and radar recordings are synchronized. Visual information in the natural images is used to get physical prior knowledge about an instance as well as its category. The real world coordinates of the instance and its radial velocity are estimated to generate the annotation in the radar representation. This first step initializes a tracking pipeline that propagates the annotation through the entire radar sequence.
Each camera image sequence is processed by a Mask R-CNN [he_mask_2017] model to detect and classify each instance. The Simple Online and Realtime Tracking (SORT) algorithm [bewley_simple_2016] is used jointly, between consecutive frames, to track the detected instances. It computes the overlap between the predicted boxes and the tracked boxes of each instance at the previous frame. The selected boxes are those most likely to contain the same instance, i.e. the boxes with the highest overlap.
The center of mass of each segmented instance is projected on the top-down pixel coordinates of the segmentation mask. This projected pixel localized on the ground is considered as the reference point of the instance.
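The overlap criterion mentioned above is commonly measured with the Intersection over Union (IoU) of two boxes. The sketch below only illustrates that criterion with a greedy best-match rule; it is not the SORT implementation, which additionally relies on motion prediction and a global assignment step. The function names and the `min_iou` threshold are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def best_match(tracked_box, detections, min_iou=0.3):
    """Index of the detection overlapping the tracked box most, or None if all are below min_iou."""
    best_idx, best_iou = None, min_iou
    for idx, det in enumerate(detections):
        iou = box_iou(tracked_box, det)
        if iou >= best_iou:
            best_idx, best_iou = idx, iou
    return best_idx
```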
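Using the intrinsic and extrinsic parameters of the camera, pixel coordinates of a point in the real world space are expressed as:
$$s\, p = \mathbf{A}\, \mathbf{B}\, c,$$
where $p = [p_x, p_y, 1]^\top$ and $c = [c_x, c_y, c_z, 1]^\top$ are respectively the pixel coordinates in the image and the real world point coordinates, $s$ a scaling factor, and $\mathbf{A}$ and $\mathbf{B}$ are the intrinsic and extrinsic parameters of the camera defined as:
$$\mathbf{A} = \begin{bmatrix} f_x & 0 & a_x \\ 0 & f_y & a_y \\ 0 & 0 & 1 \end{bmatrix}, \qquad \mathbf{B} = \begin{bmatrix} r_{11} & r_{12} & r_{13} & m_1 \\ r_{21} & r_{22} & r_{23} & m_2 \\ r_{31} & r_{32} & r_{33} & m_3 \end{bmatrix},$$
with $(f_x, f_y)$ the focal lengths, $(a_x, a_y)$ the principal point, $r_{ij}$ the rotation entries and $m_i$ the translation entries.
Using this equation, one can determine $c$ knowing $p$ with a fixed value of elevation.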
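Recovering a real-world point from a pixel at a fixed elevation amounts to inverting a 3x3 homography. The sketch below is one possible way to do it under the standard pinhole model; the names `A`, `R`, `t` (intrinsic matrix, rotation, translation) and the choice of the world z-axis as the elevation axis are assumptions made for the illustration.

```python
import numpy as np


def world_to_pixel(c, A, R, t):
    """Project a 3D world point c onto the image with intrinsics A and extrinsics [R | t]."""
    q = A @ (R @ np.asarray(c, dtype=float) + t)
    return q[:2] / q[2]


def pixel_to_world(p_xy, A, R, t, elevation=0.0):
    """Back-project pixel (p_x, p_y) onto the plane z = elevation (assumed elevation axis)."""
    # With z fixed, s * p = A @ [r1, r2, r3 * z + t] @ [x, y, 1]^T, a 3x3 linear system.
    H = A @ np.column_stack((R[:, 0], R[:, 1], R[:, 2] * elevation + t))
    w = np.linalg.solve(H, np.array([p_xy[0], p_xy[1], 1.0]))
    return np.array([w[0] / w[2], w[1] / w[2], elevation])


# Hypothetical camera: 1.5 m above the ground, looking along the world y-axis.
A = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 480.0], [0.0, 0.0, 1.0]])
R = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]])
t = np.array([0.0, 1.5, 0.0])
pixel = world_to_pixel([2.0, 10.0, 0.0], A, R, t)
print(pixel, pixel_to_world(pixel, A, R, t, elevation=0.0))
```

Fixing the elevation removes the depth ambiguity of a single camera, which is why the reference point of each instance is taken on the ground.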
Regarding a given time interval $\delta t$ separating two frames $t - \delta t$ and $t$, the velocity vector $v^t$ is defined as:
$$v^t = \frac{c^t - c^{t - \delta t}}{\delta t},$$
where $c^t$ is the real-world coordinate in frame $t$. The time interval chosen in practice is second.
The Doppler effect recorded by the radar is equivalent to the radial velocity of the instance reflecting the signal. The radial velocity at a given frame $t$ is defined as:
$$v_R^t = \cos(\theta^t)\, \lVert v^t \rVert,$$
where $\theta^t$ is the angle formed by $v^t$ and the straight line between the radar and the instance. The quantization of the radial velocity is illustrated in Figure 4.
This way, each instance detected in the frame is characterized by a feature point $f^t = [c^t, v_R^t]^\top$. This point will be projected in a radar representation to annotate the raw data and track it in this representation.
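The radial velocity of an instance can be computed from two successive real-world positions as the projection of its velocity onto the line of sight. The sketch below assumes the radar sits at the origin of the shared coordinate system and that a positive value means the instance is moving away; both conventions, and the function name, are assumptions for this illustration.

```python
import numpy as np


def radial_velocity_from_track(c_prev, c_curr, dt, radar_pos=(0.0, 0.0, 0.0)):
    """Radial velocity cos(theta) * ||v||, with v estimated by finite differences."""
    c_prev = np.asarray(c_prev, dtype=float)
    c_curr = np.asarray(c_curr, dtype=float)
    velocity = (c_curr - c_prev) / dt
    line_of_sight = c_curr - np.asarray(radar_pos, dtype=float)
    distance = np.linalg.norm(line_of_sight)
    if distance == 0.0:
        return 0.0  # the direction is undefined when the instance is at the radar position
    # cos(theta) * ||v|| equals the dot product of v with the unit line-of-sight vector.
    return float(np.dot(velocity, line_of_sight) / distance)
```

For an instance crossing the scene from left to right in front of the sensor, most of the motion is tangential, so the radial velocity stays small even when the speed is high.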
IV-B DoA clustering and centroid tracking
The range-angle representation is a radar scene in polar coordinates. Its transformation into Cartesian coordinates is called Direction of Arrival (DoA). Points are filtered by a Constant False Alarm Rate (CFAR) algorithm [rohling_radar_1983] keeping the highest intensity values while taking into account the local relation between points. The DoA is then a sparse point cloud in a 2D coordinate space similar to a Bird's Eye View (BEV) representation.
The representation is enhanced using the recorded Doppler for each point. The 3D point cloud combines the Cartesian coordinates of the reflected point and its Doppler value. This representation helps to distinguish the signature boundaries of different objects. The feature point is projected in this space and assigned to a cluster of points considered as the reflection of the targeted instance. It is then tracked in the past and future using the following process, illustrated in Figure 5.
At a given timestamp chosen by the user, a 3D DoA-Doppler point cloud is clustered using the Mean Shift algorithm [comaniciu_mean_2002]. Let $\{x_1, \dots, x_n\}$ be a point cloud of $n$ points. For a given starting point $x$, the algorithm iteratively computes a weighted mean of the current local neighborhood and updates the point until convergence. Each iteration reads:
$$x \leftarrow \frac{\sum_{i=1}^{n} x_i\, K(x - x_i; \sigma)}{\sum_{i=1}^{n} K(x - x_i; \sigma)},$$
where $K(a; \sigma) = \exp\left(-\frac{\lVert a \rVert^2}{2\sigma^2}\right)$ is a multivariate spherical Gaussian kernel with bandwidth parameter $\sigma$. All initial points leading to close final locations at convergence are considered as belonging to the same cluster.
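The Mean Shift update with a spherical Gaussian kernel can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used for the annotations; the iteration cap, the tolerance and the rule used to merge nearby modes (`merge_dist`) are assumptions.

```python
import numpy as np


def mean_shift_point(x, points, sigma, n_iter=100, tol=1e-6):
    """Iterate the Gaussian-weighted mean update from x until it stops moving."""
    x = np.asarray(x, dtype=float)
    for _ in range(n_iter):
        weights = np.exp(-np.sum((points - x) ** 2, axis=1) / (2.0 * sigma ** 2))
        x_new = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x


def mean_shift_labels(points, sigma, merge_dist=None):
    """Assign the same label to points whose modes end up close to each other."""
    points = np.asarray(points, dtype=float)
    merge_dist = sigma / 2.0 if merge_dist is None else merge_dist
    centers, labels = [], []
    for p in points:
        mode = mean_shift_point(p, points, sigma)
        for k, center in enumerate(centers):
            if np.linalg.norm(mode - center) < merge_dist:
                labels.append(k)
                break
        else:
            # No existing mode is close enough: this point starts a new cluster.
            centers.append(mode)
            labels.append(len(centers) - 1)
    return np.array(labels), np.array(centers)
```

With a small bandwidth, a single object may be split into several clusters, while a large one may merge neighboring objects, which motivates the bandwidth selection discussed next.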
Mean Shift clustering is sensitive to the bandwidth parameter. Its value should depend on the point cloud distribution and it is usually defined with prior knowledge about the data. In our application, it is not straightforward to group points belonging to the same object in the DoA-Doppler point cloud representation. The number of points and their distribution depend on the distance and the surface of reflectivity of the target. Moreover, these characteristics change during a sequence while the instance is moving in front of the radar. Inspired by [bugeau_bandwidth_2007], an optimal bandwidth is automatically selected for each instance contained in each point cloud.
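For a given DoA-Doppler point cloud, the closest cluster to the feature point $f^t$ is associated to an instance. Let $\sigma_m \in \{\sigma_1, \dots, \sigma_M\}$ be a bandwidth in a range of $M$ ordered values. A Mean Shift algorithm noted $\mathrm{MS}_{\sigma_m}$ selects the closest cluster $C_m$ to $f^t$, containing $n_m$ points. After computing the algorithm with all bandwidth values, $M$ optimal clusters are found. The optimal bandwidth is selected by comparing the stability of the probability distribution of the points between the selected clusters.
For each $m \in \{1, \dots, M\}$, the probability distribution estimated with the points of the cluster $C_m$ is the Gaussian distribution $p_m = \mathcal{N}(\mu_m, \Sigma_m)$ with expectation $\mu_m = \frac{1}{n_m} \sum_{x \in C_m} x$ and covariance $\Sigma_m = \frac{1}{n_m} \sum_{x \in C_m} (x - \mu_m)(x - \mu_m)^\top$.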
Using these fitted distributions, the bandwidth is selected by choosing the one which is the most “stable” with respect to a varying bandwidth:
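where $\mathrm{JS}$ is the Jensen-Shannon divergence [endres_new_2003]. It is a symmetric measure, whose square root is a proper metric, derived from the Kullback-Leibler ($\mathrm{KL}$) divergence [kullback_information_1951] as $\mathrm{JS}(p, q) = \frac{1}{2} \mathrm{KL}\left(p \,\Vert\, \frac{p + q}{2}\right) + \frac{1}{2} \mathrm{KL}\left(q \,\Vert\, \frac{p + q}{2}\right)$, for two probability distributions $p$ and $q$.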
Once the optimal bandwidth $\sigma^\star$ is found, the closest cluster to the feature point $f^t$ using $\mathrm{MS}_{\sigma^\star}$ is considered as belonging to the targeted instance. The feature points of the previous and next frames are set with the centroid of this cluster. The process is then iterated in the previous and next frames to track the center of the initial cluster until the end of the sequence.
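The Jensen-Shannon divergence used to compare distributions can be illustrated on discrete distributions. Note the simplification: the method above compares Gaussians fitted to clusters, whereas this sketch takes normalized histograms as input, which is an assumption made to keep the example short and exact.

```python
import numpy as np


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p = 0 contribute nothing."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence to the mixture (p + q) / 2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    mixture = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)
```

Unlike the KL divergence, this quantity is symmetric and bounded by log 2 (with natural logarithms), which makes values comparable across bandwidths.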
IV-C Projections and annotations
We recall that $C^\star$ is the cluster associated to the feature point $f^t$ at time $t$ using $\mathrm{MS}_{\sigma^\star}$, where $\sigma^\star$ is the estimated optimal bandwidth. This cluster is considered as belonging to the tracked object. A category is associated to it by using the segmentation model on the image (Section IV-A). The points are projected onto the range-Doppler representation using the radial velocity and the distance computed with the real world coordinates. They are also projected onto the range-angle representation by converting the Cartesian coordinates to polar coordinates.
Let $g_{RD}$ be the function which projects a point from the DoA-Doppler representation into the range-Doppler representation. Similarly, we denote with $g_{RA}$ the projection into the range-angle representation. The sets of points $g_{RD}(C^\star)$ and $g_{RA}(C^\star)$ correspond, respectively, to the range-Doppler and range-angle representations of $C^\star$. They are called the sparse points annotations.
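The two projections can be sketched as simple coordinate conversions. The sketch assumes a DoA-Doppler point is given as (x, y, Doppler) with the radar at the origin, x lateral and y pointing forward, so that the angle is measured from the forward axis; these conventions and the function names are assumptions, and the discretization into bins is left out.

```python
import math


def to_range_doppler(x, y, doppler):
    """Project a DoA-Doppler point to (range, Doppler); the range is the distance to the radar."""
    return (math.hypot(x, y), doppler)


def to_range_angle(x, y):
    """Project a DoA point to polar coordinates (range, angle in radians from the forward axis)."""
    return (math.hypot(x, y), math.atan2(x, y))
```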
The bounding box of a set of points in $\mathbb{R}^2$ (either from the range-Doppler or from the range-angle sparse points) is defined as a rectangle parameterized by $\{(x_{\min}, y_{\min}), (x_{\max}, y_{\max})\}$ where $x_{\min}$ is the minimum $x$-coordinate of the set, $x_{\max}$ is the maximum, and similarly for the $y$-coordinates.
Finally, the dense mask annotation is obtained by dilating the sparse annotated set with a circular structuring element: given the sparse set of points $S = \{(x_i, y_i)\}_i$, the associated dense mask is the set of discrete coordinates in $\bigcup_i D_r(x_i, y_i)$, where $D_r(x, y)$ is the disk of radius $r$ centered at $(x, y)$.
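Both annotation formats follow directly from the sparse points. The sketch below assumes the points are integer bin indices (row, column) in a radar map of known shape; the function names and the radius value used in practice are not taken from the text.

```python
import numpy as np


def bounding_box(points):
    """Return (row_min, col_min, row_max, col_max) enclosing a set of (row, col) points."""
    pts = np.asarray(points)
    return (int(pts[:, 0].min()), int(pts[:, 1].min()),
            int(pts[:, 0].max()), int(pts[:, 1].max()))


def dense_mask(points, shape, radius):
    """Union of discrete disks of the given radius centered on each sparse point."""
    rows, cols = np.ogrid[:shape[0], :shape[1]]
    mask = np.zeros(shape, dtype=bool)
    for r, c in points:
        # A bin belongs to the disk when its squared distance to the center is within radius^2.
        mask |= (rows - r) ** 2 + (cols - c) ** 2 <= radius ** 2
    return mask
```

A larger radius yields masks that are easier to learn from but less precise, so the choice trades localization accuracy against annotation density.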
In the following section, we propose a baseline for radar semantic segmentation trained and evaluated on the annotations detailed above.
We propose a baseline for semantic segmentation using range-Doppler or range-angle radar representation to detect and classify annotated objects. Fully Convolutional Networks (FCNs) [long_fully_2015] are used here to learn features at different scales by processing the input data with convolutions and down-sampling. Feature maps from convolutional layers are up-sampled with up-convolutions to recover the original input size. Each bin of the output segmentation mask is then classified. The particularity of FCN is to use skip connections from features learnt at different levels of the network to generate the final output. We denote FCN-32s a network where the output mask is generated only by up-sampling and processing feature maps with $1/32$ resolution of the input. Similarly, FCN-16s is a network where $1/32$ and $1/16$ resolution feature maps are used to generate the output mask. In the same way, FCN-8s fuses $1/32$, $1/16$ and $1/8$ resolution feature maps for output prediction.
The models are trained to recover dense mask annotations with four categories: background, pedestrian, cyclist and car. The background corresponds to speckle noise, sensor noise and artefacts which are covering most of the raw radar data. Parameters are optimized for 100 epochs using a categorical cross entropy loss function and the Adam optimizer [kingma_adam_2015] (, and ). The batch size is fixed to 20 for the range-Doppler representation and to 10 for the range-angle representation. For both representations, the learning rate is initialized to for FCN-8s and for FCN-16s and FCN-32s. The learning rate has an exponential decay of 0.9 every 10 epochs. Training has been completed using the PyTorch framework with a single GeForce RTX 2080 Ti GPU.
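The learning rate schedule can be written as a one-line function. The sketch interprets "a decay of 0.9 every 10 epochs" as a step-wise schedule (the rate is constant within each block of 10 epochs), which is an assumption; the initial value used in the example is hypothetical.

```python
def learning_rate(initial_lr, epoch, decay=0.9, step=10):
    """Learning rate at a given epoch, multiplied by `decay` once every `step` epochs."""
    return initial_lr * decay ** (epoch // step)


# Hypothetical initial value, for illustration only.
schedule = [learning_rate(1e-3, epoch) for epoch in range(100)]
```

Over 100 epochs the rate is reduced nine times, ending at roughly 39% of its initial value.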
Performances are evaluated for each radar representation using the Intersection over Union (IoU), the Pixel Accuracy (PA) and the Pixel Recall (PR) for each category. Metrics by category are aggregated using arithmetic and harmonic means. To ensure consistency of the results, all performances are averaged from three trained models initialized with different seeds. Results are presented in Table III. Models are trained on dense mask annotations and evaluated on both dense mask (top values) and sparse points (bottom values in parentheses) annotations. Sparse points are more accurate than dense masks, therefore evaluation on this type of annotation provides information on the behaviour of predictions on key points. However, localization should not be evaluated for sparse points using a model trained on dense masks, therefore IoU performances are not reported. The background category cannot be assessed for the sparse points because some of the points should belong to an object but are not annotated per se. Thus, arithmetic and harmonic means of sparse points evaluations are computed for only three classes, against four for the dense masks.
The baseline shows that meaningful representations are learnt by a popular visual semantic segmentation architecture. These models succeed in detecting and classifying shapes of moving objects in raw radar representations even with sparse points annotations. Performances on range-angle are not as good as in range-Doppler because the angular resolution of the sensor is low, resulting in less precise generated annotations. An extension to improve performances on this representation could be to transform it into Cartesian coordinates as done in [major_vehicle_2019]. For both representations, results are promising since the temporal dimension of the object signatures has not yet been taken into account.
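The per-category IoU and the two aggregation rules can be sketched as follows. The function names are hypothetical, and the PA and PR metrics are left out because their exact per-category definitions are not given in the text. The harmonic mean is dominated by the worst category, which is why it is much lower than the arithmetic mean when one class is poorly segmented.

```python
import numpy as np


def class_iou(pred, target, k):
    """IoU of category k between predicted and ground-truth label maps."""
    pred, target = np.asarray(pred), np.asarray(target)
    tp = np.sum((pred == k) & (target == k))
    fp = np.sum((pred == k) & (target != k))
    fn = np.sum((pred != k) & (target == k))
    denom = tp + fp + fn
    return float(tp / denom) if denom > 0 else float("nan")


def arithmetic_mean(values):
    return float(np.mean(values))


def harmonic_mean(values):
    """Harmonic mean of strictly positive per-category scores."""
    values = np.asarray(values, dtype=float)
    return float(len(values) / np.sum(1.0 / values))


# Per-category IoU values of the first row of Table III (RD, FCN-32s, dense masks).
ious = [99.6, 36.4, 10.3, 41.4]
print(round(arithmetic_mean(ious), 1), round(harmonic_mean(ious), 1))
```

Applied to that row, the arithmetic mean reproduces the tabulated 46.9 and the harmonic mean lands close to the tabulated 25.1.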
|Data||Model||IoU Background||IoU Pedestrian||IoU Cyclist||IoU Car||IoU arithmetic mean||IoU harmonic mean||PA Background||PA Pedestrian||PA Cyclist||PA Car||PA arithmetic mean||PA harmonic mean||PR Background||PR Pedestrian||PR Cyclist||PR Car||PR arithmetic mean||PR harmonic mean|
|RD||FCN-32s||99.6 (-)||36.4 (-)||10.3 (-)||41.4 (-)||46.9 (-)||25.1 (-)||99.8 (-)||57.0 (66.5)||23.8 (31.6)||59.8 (77.8)||60.1 (58.2)||45.4 (49.1)||99.8 (-)||51.4 (9.1)||16.3 (3.6)||57.4 (13.2)||56.2 (8.6)||36.4 (6.4)|
|RD||FCN-16s||99.6 (-)||38.0 (-)||11.8 (-)||46.3 (-)||48.9 (-)||27.6 (-)||99.7 (-)||59.6 (68.0)||33.7 (42.7)||71.0 (86.8)||66.0 (65.8)||53.3 (56.8)||99.9 (-)||51.3 (8.8)||16.3 (3.5)||57.3 (12.4)||56.2 (8.3)||36.7 (6.3)|
|RD||FCN-8s||99.6 (-)||42.0 (-)||16.4 (-)||52.2 (-)||52.6 (-)||34.9 (-)||99.7 (-)||67.9 (76.1)||36.8 (47.8)||79.4 (94.1)||71.0 (72.7)||57.3 (62.1)||99.9 (-)||53.6 (9.1)||28.6 (6.4)||60.6 (12.7)||60.7 (9.4)||48.5 (8.4)|
|RA||FCN-32s||99.8 (-)||4.5 (-)||0.9 (-)||30.1 (-)||33.8 (-)||2.8 (-)||99.9 (-)||5.3 (9.3)||2.1 (4.0)||47.4 (60.5)||38.7 (24.6)||5.1 (7.3)||99.9 (-)||28.6 (6.3)||1.7 (0.4)||45.9 (7.0)||44.0 (4.6)||6.0 (1.2)|
|RA||FCN-16s||99.8 (-)||10.4 (-)||1.4 (-)||30.3 (-)||35.5 (-)||4.1 (-)||99.9 (-)||16.3 (23.2)||3.8 (5.5)||43.7 (58.3)||40.9 (29.0)||8.9 (10.2)||99.9 (-)||23.7 (4.1)||2.1 (0.4)||49.7 (7.9)||43.9 (4.1)||6.8 (1.0)|
|RA||FCN-8s||99.8 (-)||8.5 (-)||1.7 (-)||28.5 (-)||34.7 (-)||5.3 (-)||99.9 (-)||11.4 (17.9)||5.2 (8.6)||39.9 (55.8)||39.1 (27.4)||11.3 (14.5)||99.9 (-)||28.2 (5.4)||2.8 (0.7)||50.3 (8.4)||45.3 (4.8)||9.4 (1.6)|
The semi-automatic algorithm presented in Section IV generates precise annotations on raw radar data, but it has limitations. Occlusion phenomena are problematic for tracking, since they lead to a disappearance of the object point cloud in the DoA-Doppler representation. An improvement could be to detect this occlusion in the natural image data and include it in the tracking pipeline. The clustering in the DoA-Doppler representation is also a difficult task in specific cases. When objects are close to each other with a similar radial velocity, their point clouds are difficult to distinguish. Further work on the bandwidth selection and on the optimisation of this selection could be explored.
The CARRADA dataset provides precise annotations to explore a range of supervised learning tasks. Object detection could be considered by using bounding boxes to detect and classify object signatures. We propose a simple baseline for semantic segmentation trained on dense mask annotations. It could be extended by using temporal information or both dense mask and sparse points annotations at the same time during training. Current architectures and loss functions could also be optimized for semantic segmentation of sparse ground-truth points. By identifying and tracking specific instances of objects, other opportunities are opened. Tracking of sparse points or bounding boxes could also be considered.
The CARRADA dataset contains synchronised video frames, range-angle and range-Doppler raw radar representations. Radar data are annotated with sparse points, bounding boxes and dense masks to localize and categorize the object signatures. A unique identification number is also provided for each instance.
Annotations are generated using a semi-automatic algorithm based on visual and physical knowledge. The pipeline could be used to annotate any camera-radar recordings with similar settings.
The dataset, code for the annotation algorithm and code for dataset visualisation will be released. We hope that this work will encourage other teams to record and release annotated radar datasets combined with other sensors. This work also aims to motivate deep learning research applied to radar sensor and multi-sensor fusion for scene understanding.
The authors would like to express their thanks to the Sensor Cortex team, which has recorded these data and spent time to answer their questions, and to Gabriel de Marmiesse for his valuable technical help.