Off-the-shelf sensor vs. experimental radar – How much resolution is necessary in automotive radar classification?

06/09/2020 ∙ by Nicolas Scheiner, et al. ∙ Daimler AG 0

Radar-based road user detection is an important topic in the context of autonomous driving applications. The resolution of conventional automotive radar sensors results in a sparse data representation which is tough to refine during subsequent signal processing. On the other hand, a new sensor generation is waiting in the wings for its application in this challenging field. In this article, two sensors of different radar generations are evaluated against each other. The evaluation criterion is the performance on moving road user object detection and classification tasks. To this end, two data sets originating from an off-the-shelf radar and a high resolution next generation radar are compared. Special attention is given to how the two data sets are assembled in order to make them comparable. The utilized object detector consists of a clustering algorithm, a feature extraction module, and a recurrent neural network ensemble for classification. For the assessment, all components are evaluated both individually and, for the first time, as a whole. This allows for indicating where overall performance improvements have their origin in the pipeline. Furthermore, the generalization capabilities of both data sets are evaluated and important comparison metrics for automotive radar object detection are discussed. Results show clear benefits of the next generation radar. Interestingly, those benefits do not actually occur due to better performance at the classification stage, but rather because of the vast improvements at the clustering stage.







I Introduction

Fig. 1: Two radar sensor recordings and a camera image of the same scene. On the left, the radar detections of the off-the-shelf radar are visualized. On the right, the same scene is captured by a next generation radar. Through visual augmentation, the superiority of the next generation radar is clearly visible as many more detection points fall on the two objects of interest. The additional clutter points are far less stable than the real object detections and can be easily distinguished from the real objects in moving scenarios. In this article, the general validity of this superiority for machine learning algorithms is investigated.

Driven by market demands for advanced driver assistance systems and autonomous vehicles, automotive radar sensors are continuously evolving. One important step in their development was the utilization of printed-circuit-board antenna arrays. By abandoning waveguides and mechanically steered antennas, the sensors can be manufactured with more antenna elements while being cheaper and more compact than their predecessors. Nowadays, there is a trend towards using more and more antenna elements in a sensor to increase the radar’s angular ambiguity range. Moreover, additional elements can be used to measure not only the incident angle of a reflecting signal in azimuth, but also in elevation, or even to measure polarimetric information [1, 2]. Besides the antennas, other internal hardware components are also continuously evolving. Hence, it is now possible to utilize, for example, higher sampling rates or steeper frequency ramps, leading to an improved sensor resolution.

It was shown, e.g. in [3] and [4], that these improved sensor specifications can serve well for the discrimination between two target objects and even allow for discerning body parts of pedestrians [5]. Furthermore, current research showed that automotive object detection tasks can be eased using these new types of radar sensors [6].

Probably due to the lack of publicly available data sets, a systematic evaluation with state-of-the-art classifiers on real world automotive radar scenes is currently not available. The necessity for such an evaluation is apparent for a simple reason. Every new generation of radar sensors produces a lot of work and costs, starting from the development, the integration in a car, and finally the collection of new data sets along with extensive testing. A major concern in this process is that even though the data might seem superior to the data engineers, e.g., as visible in Fig. 1, an object detector may actually not benefit from it. Modern classifiers for automotive radar data are often based on machine learning methods, e.g. [7, 8, 9, 10, 11, 12]. For these methods it is more complicated to predict in what way the algorithms actually benefit from higher resolved data.

In this article, the preconception “high resolution sensors are always better” is investigated. To this end, two data sets from two different cars are put together in a manner that allows for a fair comparison later on. Moreover, a modularized object detector consisting of a clustering algorithm, a feature extractor, and a recurrent neural network ensemble is described and optimized separately on both sensor data sets. The object detection results are evaluated based on the respective sensor. Finally, an estimation of the quantitative benefits of using a next generation radar is presented for all components of the object detector. In addition to a sensor comparison, the combined assessment of recently published object detection modules allows for the first in-depth investigation of the interaction of those modules.

In summary, the following contributions are made:

  • Two sensors from different generations are evaluated using real world data recordings.

  • Every detection module is compared individually to highlight the exact sensor advantages.

  • The utilized object detector framework is presented as a whole for the first time.

  • Important assessment factors are identified to ease the evaluation for future radar generations.

II Data Sets

For the comparison of the two radar sensors, two different proprietary data sets are used. Both contain recordings from different real world scenes and were collected with two individual test vehicles. Please note that there are no open radar data sets which can be used for the purposes of this article. The only currently available data sets are nuScenes [13], Astyx [14], and Oxford Radar RobotCar [15]. However, the nuScenes radar has a far lower data density than even the off-the-shelf sensor compared here. The Astyx data set has a much higher data density, but it is too small overall. Lastly, the Oxford Radar RobotCar data set comprises automotive scenarios but does not use an automotive radar sensor, which defeats the purpose of this article.

In this section, an overview is given of the vehicle setups, the sensor specifications, and how the recorded scenes of both data sets compare. The first data set, recorded by vehicle using four off-the-shelf radar sensors is referred to as and the data set recorded by vehicle with two experimental next generation radars as . The sensor specifications for both sensors are listed in Tab. I. The upper half of the table represents the frequency range of the emitted signal and the operational bands for range (distance) , azimuth angle , and radial (Doppler) velocity respectively. In the second part the resolutions and for , , , and time are noted. All values are given separately for both sensors, the conventional sensor in vehicle and the next generation sensor in vehicle . Both sensors operate at the same frequency range. Sensors in vehicle have a wider Doppler velocity ambiguity range and a faster sensor update rate, i.e., a higher time resolution. The latter is actually not a sensor property but a practical choice during recording with vehicle in order to decrease the data load. Theoretically, the maximum update rate for the next generation radar lies at . In all other categories, the experimental sensors in vehicle are superior to the off-the-shelf sensors in . For the recorded scenarios, the most important aspects are the two to five times smaller resolution values for range, angle, and Doppler. The detection filtering thresholds have been set to well-fitting values for both sensor types, i.e., to find a good compromise between false positive and false negative detections. Hence, no sensor is favored over the other.

TABLE I: Radar sensor specification for both compared sensors.

The sensor setups for both test vehicles are displayed in Fig. 2. For , four sensors are distributed over the front bumper of the car. For , however, only two sensors are mounted in the front of the vehicle, as their wider field of view suffices for a coverage of the entire vehicle front. Due to data sparsity, overlapping regions in the sensors’ fields of view can be superposed by simple data accumulation. In order to keep the sensors from interfering with each other, the sensor cycles are interleaved.

Fig. 2: Schematic sensor distribution of both test vehicles. Test car was used to collect the data for . was recorded with car . Both vehicles contain only a single radar sensor type.

Both, and , are subsets of larger proprietary data sets. In order to make both data sets comparable, special attention has to be paid to the selection of appropriate recordings. Especially for vehicle , the original database has a strong imbalance between classes and currently contains only very few examples of, e.g., cars, trucks, and motorcycles. To make both data sets comparable, solely recordings with pedestrians, bicycles, and background detections are used for the evaluation in this article. The background class consists of static points, measurement artifacts, and other clutter. Furthermore, vehicle remains stationary during a large percentage of the recordings, i.e., no ego-motion compensation is necessary. This is a potentially advantageous factor for later classification, and, therefore, for both data sets only data from scenarios without ego-motion are chosen.

This results in a total of 53 selected sequences for data set with a total length of about . All data were manually labeled by human experts. The second data set contains 25 sequences with a total of 105 repetitions, i.e., many sequences have been recorded multiple times at the same locations, but with different road user trajectories. The recording time of adds to about . consists of 5 purely manually labeled recordings and 100 further ones, where a hand-held global navigation satellite system reference was used for automatic labeling according to [16]. All automatically labeled data were manually checked and corrected if necessary. The distributions of object samples and detection points within both data sets are given in Tab. II. “Objects” refers to the number of time windows during which the actual object instances are present in the data. For the background class, object samples are created by removing all ground truth objects from the data and afterwards clustering the remainder of the detection points with the same clustering algorithm as discussed in Sec. III-A. The reported numbers are obtained after application of the data filter also discussed in Sec. III-A, which basically reduces the amount of background detections to roughly one tenth. One important observation from Tab. II is the strong data imbalance between the classes in both data sets. Set is a lot larger with respect to object samples. However, the amount of detections on the actual road users is similar, i.e., the amount of detections per road user is much larger for the next generation sensor in due to the higher sensor resolution. Despite the remaining differences between both data sets, their sizable quantities together with the described sequence selection strategy allow for a suitable comparison.

Data Set Pedestrian Bicycle Static/Garbage
Objects 22424 5810 66020
Objects 2751 1809 13248
TABLE II: Data set distribution comparison after filtering. For each data set and class, the amounts of object samples and detection points are listed.

III Comparison Methodology

A basic overview of the utilized detection framework is given in Fig. 3.

Fig. 3: Modularized object detection framework: The data are first structured using a clustering algorithm. Every cluster formation is subject to feature extraction. For each cluster, six feature sets are extracted – three of those are visualized. Each feature set is optimized for one of the six classifiers in the classification stage. Recurrent neural networks with a single layer of long short-term memory (LSTM) cells are used as classifiers. The set of classifiers consists of three one-vs-all classifiers and three one-vs-one classifiers which make a combined final class decision for each cluster sample.

The pipeline consists of three stages which expect radar detection points as input. These points are resolved in range, azimuth angle, Doppler velocity, and time. Furthermore, for each detection an amplitude is measured, which is an estimate of the radar cross section of the part of the object the detection belongs to. The main components of the framework serve to:

  1. Clustering: merge radar detections to object instances

  2. Feature extraction: enrich the feature space of the data clusters by collecting cluster meta information

  3. Classification: make a class decision for each cluster

The remainder of this section gives details about the three parts of the algorithm.

III-A Data Clustering

The clustering method uses a DBSCAN algorithm [17], following the findings from [18]. First, the data points are transformed to Cartesian coordinates and , and pre-filtered in order to ease the actual clustering process. For filtering, the clustering algorithm is altered to consider only points above a certain velocity threshold which depends on the number of detection points in a close distance according to:


The neighbor threshold , the velocity threshold , and the spatial search radius are parameters which have to be optimized.

Another difference from conventional DBSCAN is an adaptive number of minimum points required to form a cluster core point, which depends on the range, i.e., the distance between detection and sensor. Therefore, the sensor-specific range information of each detection has to be kept in the data to avoid extra calculations. This adjustment is based on the fact that remote objects have a smaller maximum number of possible detections due to the range-independent angular resolution, as discussed in [19]. Since the physical extents of road users do not change, the minimum point property of DBSCAN is set to:


Both and are tuning parameters which represent a minimum point baseline at and the slope of the reciprocal relation. Furthermore, only detections that exceed a certain radial velocity threshold can become core points in accordance with [20].
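The range-dependent minimum point count can be illustrated with a short sketch. The exact equation is not reproduced in this text, so the reciprocal form below is an assumption that merely matches the description (a baseline at a reference range and a slope of the reciprocal relation). The names `n_ref`, `slope`, `r_ref`, and `floor`, as well as the default reference range, are hypothetical placeholders.

```python
def adaptive_min_points(r, n_ref, slope, r_ref=50.0, floor=1.0):
    """Hypothetical range-adaptive minimum point count for DBSCAN core points.

    r      : range of the detection in meters (must be positive)
    n_ref  : baseline minimum point count at the reference range r_ref
    slope  : strength of the reciprocal (1/r) dependence
    r_ref  : reference range; the default is a placeholder, not a source value
    floor  : lower bound so that distant detections can still form clusters
    """
    if r <= 0:
        raise ValueError("range must be positive")
    # At r == r_ref the bracket equals 1, so the baseline is returned.
    # Closer detections (r < r_ref) require more points, remote ones fewer.
    value = n_ref * (1.0 + slope * (r_ref / r - 1.0))
    return max(floor, value)


if __name__ == "__main__":
    for distance in (25.0, 50.0, 100.0):
        print(distance, adaptive_min_points(distance, n_ref=4, slope=0.5))
```

With a slope of zero the rule degenerates to the constant minimum point count of conventional DBSCAN, which makes the effect of the adaptation easy to isolate during tuning.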

Lastly, the distance region (also known as the ε region) of the DBSCAN algorithm is customized to cluster points that have low spatial distances and low differences in Doppler values. The whole neighborhood criterion can be expressed as:


In this case and have a scaling effect rather than representing absolute maximum velocity or spatial distance thresholds as in conventional DBSCAN processing. The scaling of allows for better tuning capabilities than, e.g., normalizing all values to the same range. As amplitudes often have very high variations even on a single object, they are completely neglected during clustering. The time is not included in the Euclidean distance, i.e., the time difference is always required to be smaller than or equal to its corresponding threshold due to real-time processing constraints. In offline processing, this is addressed by using a sliding window of length and an update rate of . All tuning parameters are adjusted using Bayesian Optimization [21] with the V-measure [22] as optimization score. More details on the V-measure are given in Sec. IV-A.
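A neighborhood test of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' exact criterion: the Doppler difference is divided by a scaling parameter and added to the squared spatial distance, while the time difference is checked separately against its own threshold. The parameter names `eps_xyv`, `eps_v`, and `eps_t` and the tuple layout `(x, y, v_r, t)` are hypothetical.

```python
import math


def is_neighbor(p, q, eps_xyv, eps_v, eps_t):
    """Return True if detections p and q fall into each other's neighborhood.

    p, q    : tuples (x, y, v_r, t) with Cartesian position in meters,
              radial (Doppler) velocity in m/s, and time stamp in seconds
    eps_xyv : combined distance threshold (assumed name)
    eps_v   : Doppler scaling factor, must be positive (assumed name)
    eps_t   : maximum allowed time difference (assumed name)
    """
    dx, dy = p[0] - q[0], p[1] - q[1]
    dv = (p[2] - q[2]) / eps_v  # scaled Doppler difference
    dt = abs(p[3] - q[3])
    # Time is not part of the Euclidean distance; it is a separate hard limit.
    combined = math.sqrt(dx * dx + dy * dy + dv * dv)
    return combined <= eps_xyv and dt <= eps_t


if __name__ == "__main__":
    a = (0.0, 0.0, 1.0, 0.00)
    b = (0.3, 0.4, 1.0, 0.05)
    print(is_neighbor(a, b, eps_xyv=1.0, eps_v=1.0, eps_t=0.1))
```

Keeping the Doppler scaling as its own parameter lets an optimizer trade spatial against velocity proximity without renormalizing the data.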

III-B Feature Selection And Extraction

For feature extraction, all labeled cluster sequences are first sampled in time using a non-overlapping sliding window of . The feature extraction window is chosen smaller than during the clustering process because this allows the subsequent classifier to better capture the variation between subsequent time frames. In the next step, the features are extracted from each of the cluster samples so that with this increased number of feature vectors, the classifier can learn from more data. The extracted features can be roughly divided into six groups. The first four groups contain statistical values such as the minimum and maximum, the spread, and the standard deviation of the four base units (range, angle, amplitude, and Doppler). The fifth group consists of geometric features describing the spatial distribution of detections in a cluster sample, e.g., the circularity or the size of a convex hull. The final group addresses the “micro-Doppler” characteristics, i.e., the distribution of Doppler values within a cluster. In total, 98 features are extracted from each cluster. The entire list can be found in [7].

From the total of 98 features, only a subset is passed to each of the models in the following classifier ensemble. Every classifier in that stage has its own task, hence, a different feature subset results in optimal performance. Finding the exact optimal feature set for each classifier is an NP-hard problem. Therefore, a guided backward elimination algorithm is used. Backward elimination is a wrapper method which repeatedly tests the utilized classifier and then eliminates the least fitting feature in a greedy fashion until a stopping criterion is reached [23]. A complete backward elimination run for all 98 features is computationally too expensive for each classifier in the ensemble. Thus, for the backward elimination every feature is only assessed once but in a fixed order. After each examination a feature is either dropped or kept, which drastically reduces the computational effort. The evaluation order is determined by the combination of two other feature selection techniques: the Joint Mutual Information (JMI) [24] and the Relief-based MultiSURF algorithm [25]. Both algorithms belong to the group of filters. As filter methods do not require multiple classifier trainings, they are usually much faster than wrapper methods, even though the latter often show superior results. The combined approach of filter and wrapper methods is a compromise which yields well performing feature sets at a reasonable computational effort.
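The single-pass variant of backward elimination described above can be sketched as follows. The function `score_fn` is a hypothetical stand-in for training and validating one classifier on a feature subset, and `order` is assumed to be the precomputed ranking from the filter methods (least promising feature first). The tolerance `tol` and the rule "drop if the score does not decrease" are assumptions, since the text does not state the exact acceptance criterion.

```python
def guided_backward_elimination(features, order, score_fn, tol=0.0):
    """Assess every feature exactly once, in a fixed order.

    features : iterable of feature names (the full starting set)
    order    : feature names in the order in which they are examined
    score_fn : callable(list_of_features) -> float, higher is better
               (hypothetical wrapper around classifier training)
    tol      : allowed score loss when dropping a feature (assumed)
    """
    selected = list(features)
    best = score_fn(selected)
    for feature in order:
        if feature not in selected or len(selected) == 1:
            continue
        candidate = [f for f in selected if f != feature]
        score = score_fn(candidate)
        # Keep the reduced set only if performance does not suffer.
        if score >= best - tol:
            selected, best = candidate, score
    return selected, best


if __name__ == "__main__":
    useful = {"r_spread", "v_std"}  # toy example, not real feature names

    def toy_score(subset):
        # Reward useful features, slightly penalize the subset size.
        return len(useful & set(subset)) - 0.01 * len(subset)

    feats = ["r_spread", "amp_max", "v_std", "hull_area"]
    print(guided_backward_elimination(
        feats, ["hull_area", "amp_max", "v_std", "r_spread"], toy_score))
```

For n features this requires n + 1 classifier evaluations instead of the roughly n²/2 of a full greedy backward elimination, which is the computational saving the text refers to.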

III-C Classification

The classifier units used in this article are long short-term memory (LSTM) cells. LSTMs are a special kind of recurrent neural network which introduce gating functions in order to avoid the vanishing gradient problem during training.


A fixed configuration of 80 LSTM cells followed by a softmax layer is used for all classifiers in the ensemble. The LSTM network is configured to accept up to eight consecutive feature vectors from the same cluster instance, if available.

The performance of the LSTM network is further improved by adding a few tweaks to the standard implementation:

Multiclass binarization is a widespread technique for improving the classification performance on moving road users, especially for unbalanced data sets. The classification stage uses a combined one-vs-one (OVO) and one-vs-all (OVA) approach. Class membership is estimated by summing all pairwise class posterior probabilities from corresponding OVO classifiers. Additionally, each OVO classifier is weighted by the sum of corresponding OVA classifier outputs . Subscripts and denote the corresponding class identifiers for which the classifier is trained. During testing, this limits the influence of OVO classifiers which were not trained on the same class as the regarded sample, i.e., the OVA classifiers act as correction classifiers [27]. The final class decision for a feature vector is then calculated as:


where is the number of classes in the training set. This combined approach of OVO and OVA yields a total of classifiers and feature sets, also indicated in Fig. 3. Moreover, during training, class weighting is used to further reduce data imbalance effects. Therefore, the influence of all training samples is adjusted inversely to their share in the total class distribution. More details on the classification network and the multiclass binarization techniques for moving road users are given in [28].
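The combined decision rule can be sketched from the verbal description above: for each class, the pairwise OVO posteriors are summed, and each term is weighted by the sum of the two corresponding OVA outputs. The symbols are not reproduced in this text, so the dictionary layout and the convention that the reverse pairing is one minus the stored probability are assumptions for illustration.

```python
def combined_decision(ova, ovo):
    """Combine one-vs-all and one-vs-one outputs into one class decision.

    ova : dict class -> OVA probability that the sample belongs to the class
    ovo : dict (i, j) -> probability of class i in the i-vs-j classifier;
          the probability of j in the same classifier is taken as 1 - value
    Returns the winning class and the per-class scores.
    """
    def pairwise(i, j):
        if (i, j) in ovo:
            return ovo[(i, j)]
        return 1.0 - ovo[(j, i)]

    scores = {}
    for i in ova:
        # Each OVO vote is weighted by the OVA outputs of both involved
        # classes, which damps classifiers not trained on the sample's class.
        scores[i] = sum(pairwise(i, j) * (ova[i] + ova[j])
                        for j in ova if j != i)
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    ova = {"P": 0.7, "B": 0.2, "S": 0.1}
    ovo = {("P", "B"): 0.8, ("P", "S"): 0.9, ("B", "S"): 0.6}
    print(combined_decision(ova, ovo))
```

For the three classes used here, this setup needs three OVA and three OVO classifiers, matching the six classifiers shown in Fig. 3.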

IV Results

For the evaluation of the sensors, based on both data sets, each stage of the object detection framework is at first evaluated individually. This is done in a ceiling analysis fashion, i.e., all steps are first computed assuming perfect accuracy of all other components. Then, a combined result is estimated by evaluating the framework as a whole. A summary of all results is presented in Tab. IV. All reported results are calculated based on distinct test sets which are not used for hyperparameter tuning or model training. The two test sets consist of roughly of their corresponding data set. The split is sequence-based in order to avoid having the same object instances in the training and test split. To this end, several million random permutations of all recorded sequences were tested for sequence combinations that yield class proportions similar to the full data set but at only of their size. At the feature extraction stage, the results are the found parameter sets and their degree of equivalence. Hence, the training set, i.e., the remaining of the data is used for the evaluation at this stage.
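The sequence-based split search can be sketched as a random search over permutations. The test fraction is not reproduced in this text, so `test_fraction` is a free parameter here. The cost function (largest deviation of any class's test share from the target fraction) and all names are assumptions chosen for illustration.

```python
import random


def find_sequence_split(seq_counts, test_fraction, n_trials=1000, seed=0):
    """Search for a set of test sequences with representative class shares.

    seq_counts    : dict sequence_id -> dict class -> number of samples
    test_fraction : desired share of samples in the test set (assumed value)
    n_trials      : number of random permutations to try
    Returns the best test sequence ids and the corresponding cost.
    """
    rng = random.Random(seed)
    classes = sorted({c for counts in seq_counts.values() for c in counts})
    total = {c: sum(counts.get(c, 0) for counts in seq_counts.values())
             for c in classes}
    target = test_fraction * sum(total.values())
    ids = sorted(seq_counts)
    best_ids, best_cost = None, float("inf")
    for _ in range(n_trials):
        rng.shuffle(ids)
        test = {c: 0 for c in classes}
        chosen = []
        # Whole sequences are assigned, so no object instance is shared
        # between the training and the test split.
        for sid in ids:
            chosen.append(sid)
            for c in classes:
                test[c] += seq_counts[sid].get(c, 0)
            if sum(test.values()) >= target:
                break
        cost = max((abs(test[c] / total[c] - test_fraction)
                    for c in classes if total[c]), default=0.0)
        if cost < best_cost:
            best_ids, best_cost = sorted(chosen), cost
    return best_ids, best_cost


if __name__ == "__main__":
    demo = {f"seq{i}": {"pedestrian": 10 + i, "bicycle": 5} for i in range(8)}
    print(find_sequence_split(demo, test_fraction=0.25, n_trials=200))
```

All sequences not returned by the function form the training set.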

IV-A Clustering

An important step towards a meaningful cluster evaluation is the choice of a good optimization score. Compared to other clustering applications, this method is used to identify object instances and separate them from background points. It is important to represent every road user with an individual cluster containing as many points as possible from the original one. Additionally, the clustering process needs to stop before merging clusters from different road users or before adding background points to the object instance. The majority of the data points in a radar scene are background detections. Clusters containing only background detections and clutter are not desired in this application, but also not critical because the classifier stage should be able to distinguish road users from such unwanted cluster formations.

The V-measure [22] combines two intuitive clustering criteria, homogeneity and completeness. Completeness aims to assign all points from a single ground truth cluster into a single cluster prediction. Contrary to that, homogeneity is maximal when a predicted cluster only contains points from a single ground truth cluster.

The V-measure is the harmonic mean of homogeneity and completeness:


To avoid penalizing the creation of background clusters, the completeness score is calculated assuming perfect matching of the detections that belong to a labeled object in the ground truth. This adaptation makes the score’s objective sufficiently similar to the requirements for automotive radar clustering.
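For reference, the standard (unadapted) V-measure can be computed with a few lines of code. The sketch below uses the usual entropy-based definitions of homogeneity and completeness and combines them with the harmonic mean. It does not implement the background-specific adaptation of the completeness score described above, and the function names are chosen for illustration only.

```python
import math
from collections import Counter


def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(a, b):
    """Entropy of labeling a given labeling b, H(a | b)."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum((c / n) * math.log(c / b_counts[bv])
                for (_, bv), c in joint.items())


def v_measure(truth, pred):
    """Return (homogeneity, completeness, V-measure) for two labelings."""
    h_truth, h_pred = _entropy(truth), _entropy(pred)
    homogeneity = (1.0 if h_truth == 0
                   else 1.0 - _conditional_entropy(truth, pred) / h_truth)
    completeness = (1.0 if h_pred == 0
                    else 1.0 - _conditional_entropy(pred, truth) / h_pred)
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2.0 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    print(v_measure(truth, [1, 1, 0, 0]))  # perfect up to label names
    print(v_measure(truth, [0, 1, 2, 3]))  # over-segmented clustering
```

An over-segmented result keeps full homogeneity but loses completeness, while merging everything into one cluster does the opposite, which is why the harmonic mean is a useful single tuning target.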

Four configurations are evaluated: Both data sets are first optimized and evaluated individually ( and ). Then, the optimal configurations for both data sets are used to evaluate the other one ( and ). This gives a deeper insight into how well the data sets allow for generalization. The results are listed in Tab. IV. has the highest score of and scores much lower with . Similar to the scores, the cluster parameters differ a lot. The best configuration for has a setting of , , , , and , whereas uses the following parameters: , , , , and . It is hence possible to use much smaller regions for data of which is apparently beneficial for the overall performance. Unfortunately, the largely different cluster parameterizations lead to massive decreases in the scores (both around ) when using them to segment the respective other data set, i.e., generalization is very low.

IV-B Feature Selection

When comparing the feature selection stage for both data sets, three factors are of interest for this evaluation: First, does either of the two data sets require substantially more or fewer features? Second, are specific features more important for one data set than for the other? Third, how similar are the optimized feature sets of one sensor to the ones of the other sensor? Fig. 4 aims to answer the first two questions. The diagram shows the number of features presented to each classifier in the ensemble separately for both data sets. Furthermore, the features are grouped into the six categories mentioned in Sec. III-B. In comparison to previous studies (e.g. [7]), the total number of utilized features is rather high (averages: and out of in total). However, this number remains more or less constant over different classifiers and data sets. Also, there is no category for which a clear preference is visible. The degree of equivalence can be estimated using the Jaccard index (also known as intersection over union, IoU) between matching classifiers for both data sets. The number of common features for each classifier pair is divided by the size of their union, yielding the results in Tab. III. This summarizes to a mean IoU of with standard deviation which shows great conformity of both feature sets. Thus, it is concluded that the choice of data set corresponding to a specific sensor has no remarkable impact on the feature extraction stage.
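The Jaccard index between two selected feature sets is straightforward to compute. The feature names in the usage example are hypothetical placeholders and not taken from the actual feature list in [7].

```python
def jaccard_index(features_a, features_b):
    """Intersection over union of two feature sets (1.0 if both are empty)."""
    a, b = set(features_a), set(features_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    # Hypothetical feature subsets selected for the same classifier
    # on two different data sets.
    sensor_1 = {"r_min", "r_max", "v_std"}
    sensor_2 = {"r_max", "v_std", "hull_area"}
    print(jaccard_index(sensor_1, sensor_2))
```

Applying this function to each of the six matching classifier pairs and averaging the results gives the kind of summary statistic reported in Tab. III.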

Fig. 4: Feature distribution over classifiers in the ensemble. OVA classifiers are identified by their corresponding “one” class, i.e., pedestrian (P), bicycle (B), or static/garbage (S). OVO classifiers are indicated likewise by two letters. The full feature set distribution is displayed on the right for comparison.
P B S PvB PvS BvS Mean & StdDev
TABLE III: Jaccard index between each pair of classifiers in the ensembles for both data sets. OVA classifiers are identified by their corresponding “one” class, i.e., pedestrian (P), bicycle (B), or static/garbage (S). OVO classifiers are indicated likewise by two letters.

IV-C Classification

As mentioned before, both data sets contain strong imbalances between classes. To preserve the influence of each individual class, all classification scores are reported as macro-averaged F1 scores. The F1 score is the harmonic mean of precision (true positive / predicted positive) and recall (true positive / condition positive). Macro-averaging uses the mean value of all classes’ individual scores:
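A minimal sketch of this macro-averaging is given below. The per-class score is computed as the harmonic mean of precision and recall, written in its equivalent count form, and the unweighted mean over classes is returned. The convention of assigning a score of zero to a class without any true or predicted samples is an assumption.

```python
def macro_f1(y_true, y_pred, classes=None):
    """Unweighted mean of the per-class F1 scores."""
    if classes is None:
        classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # 2TP / (2TP + FP + FN) equals the harmonic mean of precision
        # and recall; classes without any samples score 0 (assumption).
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    truth = ["P", "P", "B", "S", "S", "S"]
    pred = ["P", "B", "B", "S", "S", "P"]
    print(macro_f1(truth, pred))
```

Because every class contributes equally, a classifier that ignores the rare pedestrian and bicycle classes is penalized even if the dominant background class is predicted perfectly.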


All classification results are listed in Tab. IV. The model trained and tested on performs best, with a score of compared to . The gain in at total score of would be a good result if the data sets contained identical scenarios. As this is not the case, the difference has to be regarded as too low to make the strong claim that the classifier generally benefits from the increased sensor resolution. In order to look closer into the classification results, Fig. 5 shows the confusion matrices for the two single data set experiments. When comparing both matrices, it is apparent that even though the scores for both experiments are not remarkably different, the formation of those scores is. The most decisive factor is that for , the confusion of both vulnerable road user (VRU) classes with the background class is much higher than for . In turn, the latter has a higher confusion between VRUs. In practice, the second behavior is more desirable as VRUs are not overlooked.

Fig. 5: Confusion matrices of the two best performing classification experiments. On top: , and on the bottom: .

In terms of generalization between data sets, the classification stage behaves similarly to the clustering stage. While still manages to make some useful predictions, cannot overcome the class imbalance and simply predicts the static/garbage class for almost all samples resulting in . This leads to the conclusion that the features from the previous stage, even though important for both data sets, are not robust enough to cover the difference in data. This may be the case, e.g., for radar cross section (amplitude) estimates which are often part of the internal sensor processing, i.e., the processing may differ between sensor types. Another example is the number of detection points in a cluster which varies heavily between sensors. The coverage of all specific features is, however, not in the scope of this article. Summarizing the classification stage, it can be stated that there seem to be some small benefits in classification performance when using a high resolution radar. However, the improvements are less distinct than for the clustering stage.

IV-D Object Detection Framework

To evaluate the whole object detection pipeline at once, new metrics are necessary. The important difference to previous steps is that only fractions of objects may be classified correctly or different object instances might be merged during clustering. As the choice of evaluation metric has a big influence on the final results, four different metrics are proposed.

Point-wise score

Adopted from the classification stage, a macro-averaged F1 score is calculated, this time based on the prediction of all detection points instead of cluster samples. The advantage of this score is that it gives comprehensible feedback about how well the scene was segmented.

Instance-based score

While the simplicity of the first score may be advantageous to gain a general understanding, it does not capture correct instance segmentation. In image-based object detection, the usual way to decide if an object is detected or not is by calculating the pixel-based IoU [29]. This can easily be adapted to radar point clouds by calculating the intersection and union based on detection points instead of pixels:


A VRU instance is defined as correctly detected if the cluster’s IoU is greater than or equal to for a ground truth instance with the same label. This corresponds to a true positive (TP). Other predicted instances on the same ground truth object make up false positives (FP). Finally, non-detected VRU ground truth instances count as false negatives (FN) and everything else as true negatives (TN). By using an alternative notation of ,


this can easily be used to calculate an instance-based score, which is also macro-averaged according to Eq. 6. Most recently, this metric was also used and illustrated in [12].
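The instance-based evaluation can be sketched as follows. The IoU threshold is not reproduced in this text and is therefore a free parameter. The greedy one-to-one matching and the class-agnostic counting are simplifying assumptions. In the article the score is additionally macro-averaged over classes, which is omitted here for brevity.

```python
def point_iou(pred_points, truth_points):
    """IoU of two detection point sets, identified by their point ids."""
    pred, truth = set(pred_points), set(truth_points)
    union = pred | truth
    return len(pred & truth) / len(union) if union else 0.0


def count_instances(pred_clusters, truth_instances, iou_threshold):
    """Count TP, FP, FN for labeled clusters versus ground truth instances.

    pred_clusters, truth_instances : lists of (label, set_of_point_ids)
    iou_threshold                  : minimum IoU for a match (assumed value)
    """
    matched = set()
    tp = fp = 0
    for p_label, p_points in pred_clusters:
        hit = None
        for idx, (t_label, t_points) in enumerate(truth_instances):
            if (idx not in matched and t_label == p_label
                    and point_iou(p_points, t_points) >= iou_threshold):
                hit = idx
                break
        if hit is None:
            fp += 1  # duplicate or unmatched prediction
        else:
            matched.add(hit)
            tp += 1
    fn = len(truth_instances) - len(matched)  # overlooked instances
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 score written in terms of TP, FP, and FN."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    truth = [("P", {1, 2, 3, 4}), ("B", {10, 11, 12})]
    pred = [("P", {1, 2, 3}), ("P", {4}), ("B", {20, 21})]
    counts = count_instances(pred, truth, iou_threshold=0.5)
    print(counts, f1_from_counts(*counts))
```

In the usage example, the split-off pedestrian fragment and the misplaced bicycle cluster both count as false positives, and the missed bicycle counts as a false negative.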

Binary VRU detection score

The second criterion can be eased in order to not punish the object detector if it correctly segments an object, but assigns the wrong VRU label, i.e., pedestrian instead of bicycle or the other way round. In this case all cluster samples with for any VRU object count as TP if either VRU label is predicted. If one is only interested in how well road users are recognized by the proposed object detector, a VRU-based true positive rate (TPR or recall) can be calculated as:


On its own, the recall can be easily misleading, since it does not account for FP. In addition to the other scores, however, this gives extra information about where fine-tuning of the object detector is most appropriate.

VRU Balanced Accuracy

A more general version of a VRU-based score is the Balanced Accuracy (BAAC), which is calculated as:


where the true negative rate also takes into account the performance of the background class rejection. Similar to the F1 score, BAAC is an indicator for classification or object detection tasks, which works particularly well on imbalanced data sets.
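Both VRU-based scores follow directly from the instance counts. The sketch below uses the standard definitions of true positive rate, true negative rate, and balanced accuracy as their mean. The counts in the usage example are made-up numbers for illustration, not results from this article.

```python
def true_positive_rate(tp, fn):
    """Recall: share of ground truth VRU instances that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def true_negative_rate(tn, fp):
    """Share of background samples that were correctly rejected."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def balanced_accuracy(tp, fp, tn, fn):
    """Mean of TPR and TNR, robust against class imbalance."""
    return 0.5 * (true_positive_rate(tp, fn) + true_negative_rate(tn, fp))


if __name__ == "__main__":
    # Made-up counts: few VRU instances, many background samples.
    tp, fn, tn, fp = 30, 10, 900, 100
    print(true_positive_rate(tp, fn),
          true_negative_rate(tn, fp),
          balanced_accuracy(tp, fp, tn, fn))
```

With many background samples the TNR tends to be high almost regardless of VRU performance, so reporting the TPR separately shows where the detector still has room for improvement.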

All results are presented in Tab. IV. It is clearly visible that the next generation radar in experiment outperforms the conventional sensor in with big margins in all categories. Despite their similarity in terms of pure classification performance on perfect data samples, the differences are very distinct on an object detection level. Both scores improve roughly and , the TPR even goes up by . This is a strong indicator that the improved results at the clustering stage are extremely beneficial for the overall object detection performance. However, the generalization ability for the cross data set experiments and is now close to non-existent. When evaluating the two VRU-based scores in any data set combination, it is clearly visible that the TNR part of BAAC is rather high compared to the TPR part. This stresses the advantage of having the recall as a separate score, as this is the one to optimize further.

TABLE IV: Evaluation result summary for all evaluated categories and experiments. Details on all scores are given in the corresponding sections. Column groups: Clustering Results, Classification Results, and Combined Detection Results (point-based and instance-based).

V Discussion And Conclusion

In this article, two data sets from different radar sensor generations have been tested against each other. For this purpose, properties for a fair comparison of two radar data sets have been worked out and applied to crop two proprietary data sets to comparable subsets. Both subsets are processed using an identical radar object detection framework consisting of a clustering algorithm, a feature selection stage, and a recurrent neural network ensemble. The interaction of all modules in the framework is an important part of the evaluation since, to date, most research has focused solely on single components. Results are reported for each intermediate category as well as for the whole framework. The main question, “Does an object detector benefit from next generation radar sensors?”, can be answered in the affirmative. The instance-based object detection score improves from with an off-the-shelf sensor to using a high resolution next generation radar. Even though these numbers do not seem very high when compared to modern image-based object detectors, this is an excellent result for solely radar-based VRU detection.

The greatest improvements are made in the clustering stage. Formerly the weak point of the pipeline, the clustering errors are now at a moderate level. The main reason for this improvement is that, for the next generation radar, the detection points on an object are located close enough together for the clustering algorithm to use much smaller neighborhood regions. This effectively allows for better separation of real objects and background clutter. In the classification stage, only a slight improvement in scores is obtained. Regarding the confusion between classes, VRU discrimination is considerably better with the new sensor technology.

While no big differences can be observed in the feature extraction module, tests at all other stages of the framework showed that the generalization capability from one sensor to the other is minimal. Nevertheless, it is highly likely, and shall be tested in future work, that transfer learning, i.e., pre-training a classifier on one data set and fine-tuning it with the other, leads to improved performance. Furthermore, the weak generalization capability without fine-tuning motivates the search for more robust features. As an example, a principal component analysis could be used to obtain lower-dimensional features with a possibly higher generalization capability. Another approach is to adjust existing features by calibrating them with sensor properties, e.g., the number of detection points in a cluster may be scaled by the maximum possible number of points during one sensor scan. This might be of even greater interest in cases where different types of sensors, e.g., short-range radars and long-range radars, complement each other in the same vehicle.
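
The feature calibration idea mentioned above can be sketched in a few lines of Python. The function name, the sensor keys, and the per-scan point limits below are made-up values for illustration; the article does not state the actual sensor limits.

```python
def normalized_point_count(num_points, max_points_per_scan):
    """Scale a cluster's detection count by the sensor's per-scan maximum."""
    if max_points_per_scan <= 0:
        raise ValueError("max_points_per_scan must be positive")
    return num_points / max_points_per_scan


# Made-up per-scan limits, for illustration only.
MAX_POINTS = {"off_the_shelf": 64, "next_generation": 512}

# The same object yields very different raw counts on the two sensors,
# but the calibrated feature ends up on a comparable scale.
sparse = normalized_point_count(4, MAX_POINTS["off_the_shelf"])
dense = normalized_point_count(32, MAX_POINTS["next_generation"])
```

With these assumed limits, both clusters map to the same feature value, which is the kind of sensor-independent scale that might help a classifier trained on one sensor transfer to the other.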


  • [1] F. Weishaupt, K. Werber, J. Tilly, J. Dickmann, and D. Heberling, “Polarimetric Radar for Automotive Self-Localization,” in 2019 20th International Radar Symposium (IRS).   Ulm, Germany: IEEE, jun 2019, pp. 1–8.
  • [2] J. F. Tilly, F. Weishaupt, O. Schumann, J. Klappstein, J. Dickmann, and G. Wanielik, “Polarimetric Signatures of a Passenger Car,” in 2019 Kleinheubach Conference, Miltenberg, Germany, sep 2019, pp. 1–4.
  • [3] S. Brisken, J. Gütlein-Holzer, and F. Höhne, “Elevation Estimation with a High Resolution Automotive Radar,” in 2019 IEEE Radar Conference (RadarConf), apr 2019, pp. 1–5.
  • [4] E. Schubert, F. Meinl, M. Kunert, and W. Menzel, “High resolution automotive radar measurements of vulnerable road users – pedestrians cyclists,” in 2015 IEEE MTT-S International Conference on Microwaves for Intelligent Mobility (ICMIM), apr 2015, pp. 1–4.
  • [5] D. Steinhauser, P. Held, A. Kamann, A. Koch, and T. Brandmeier, “Micro-Doppler Extraction of Pedestrian Limbs for High Resolution Automotive Radar,” in 2019 IEEE Intelligent Vehicles Symposium (IV).   IEEE, jun 2019, pp. 764–769.
  • [6]

    M. Meyer and G. Kuschk, “Deep Learning Based 3D Object Detection for Automotive Radar and Camera,” in

    2019 16th European Radar Conference (EuRAD), oct 2019, pp. 133–136.
  • [7]

    N. Scheiner, N. Appenrodt, J. Dickmann, and B. Sick, “Radar-based Road User Classification and Novelty Detection with Recurrent Neural Network Ensembles,” in

    2019 IEEE Intelligent Vehicles Symposium (IV).   Paris, France: IEEE, jun 2019, pp. 642–649.
  • [8] O. Schumann, M. Hahn, J. Dickmann, and C. Wöhler, “Semantic Segmentation on Radar Point Clouds,” in 2018 21st International Conference on Information Fusion (FUSION).   Cambridge, UK: IEEE, jul 2018, pp. 2179–2186.
  • [9]

    O. Schumann, J. Lombacher, M. Hahn, C. Wohler, and J. Dickmann, “Scene Understanding with Automotive Radar,”

    IEEE Transactions on Intelligent Vehicles, vol. 5, no. 2, 2019.
  • [10] A. Danzer, T. Griebel, M. Bach, and K. Dietmayer, “2D Car Detection in Radar Data with PointNets,” in 2019 IEEE 22nd Intelligent Transportation Systems Conference (ITSC).   Auckland, New Zealand: IEEE, oct 2019, pp. 61–66.
  • [11] J. Lombacher, K. Laudt, M. Hahn, J. Dickmann, and C. Wöhler, “Semantic radar grids,” in 2017 IEEE Intelligent Vehicles Symposium (IV).   Redondo Beach, USA: IEEE, jun 2017, pp. 1170–1175.
  • [12] A. Palffy, J. Dong, J. Kooij, and D. Gavrila, “CNN based Road User Detection using the 3D Radar Cube CNN based Road User Detection using the 3D Radar Cube,” IEEE Robotics and Automation Letters, vol. PP, 2020.
  • [13] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuScenes: A multimodal dataset for autonomous driving,” arXiv preprint arXiv:1903.11027, 2019.
  • [14] M. Meyer and G. Kuschk, “Automotive Radar Dataset for Deep Learning Based 3D Object Detection,” in 2019 16th European Radar Conference (EuRAD).   Paris, France: IEEE, oct 2019, pp. 129–132.
  • [15] D. Barnes, M. Gadd, P. Murcutt, P. Newman, and I. Posner, “The oxford radar robotcar dataset: A radar extension to the oxford robotcar dataset,” in IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 2020.
  • [16] N. Scheiner, S. Haag, N. Appenrodt, B. Duraisamy, J. Dickmann, M. Fritzsche, and B. Sick, “Automated Ground Truth Estimation For Automotive Radar Tracking Applications With Portable GNSS And IMU Devices,” in 2019 20th International Radar Symposium (IRS).   Ulm, Germany: IEEE, jun 2019, pp. 1–10.
  • [17] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,” in 1996 2nd International Conference on Knowledge Discovery and Data Mining (KDD).   Portland, OR, USA: AAAI Press, aug 1996, pp. 226–231.
  • [18] N. Scheiner, N. Appenrodt, J. Dickmann, and B. Sick, “A Multi-Stage Clustering Framework for Automotive Radar Data,” in 2019 IEEE 22nd Intelligent Transportation Systems Conference (ITSC).   Auckland, New Zealand: IEEE, oct 2019, pp. 2060–2067.
  • [19] D. Kellner, J. Klappstein, and K. Dietmayer, “Grid-based DBSCAN for clustering extended objects in radar data,” in 2012 IEEE Intelligent Vehicles Symposium (IV).   Alcala de Henares, Spain: IEEE, jun 2012, pp. 365–370.
  • [20] O. Schumann, M. Hahn, J. Dickmann, and C. Wöhler, “Supervised Clustering for Radar Applications: On the Way to Radar Instance Segmentation,” in 2018 IEEE MTT-S International Conference on Microwaves for Intelligent Mobility (ICMIM).   Munich, Germany: IEEE, apr 2018.
  • [21] J. Mockus, “On Bayesian Methods for Seeking the Extremum,” in IFIP Technical Conference.   Nowosibirsk, USSR: Springer-Verlag, jul 1974, pp. 400–404.
  • [22] A. Rosenberg and J. Hirschberg, “V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure,” in

    Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

    .   Prague, Czech Republic: Association for Computational Linguistics, jun 2007, pp. 410–420.
  • [23] R. Kohavi and G. H. John, “Wrappers for Feature Subset Selection,” Artificial Intelligence, vol. 97, no. 1-2, pp. 273–324, 1997.
  • [24] H. H. Yang and J. Moody, “Feature Selection Based on Joint Mutual Information,” in International ICSC Symposium on Advances in Intelligent Data Analysis, Rochester, NY, USA, jun 1999, pp. 22–25.
  • [25] R. J. Urbanowicz, M. Meeker, W. L. Cava, R. S. Olson, and J. H. Moore, “Relief-based feature selection: Introduction and review,” Journal of Biomedical Informatics, vol. 85, pp. 189–203, 2018.
  • [26] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [27] M. Moreira and E. Mayoraz, “Improved pairwise coupling classification with correcting classifiers,” in 1998 10th European Conference on Machine Learning (ECML).   Chemnitz, Germany: Springer-Verlag, apr 1998, pp. 160–171.
  • [28] N. Scheiner, N. Appenrodt, J. Dickmann, and B. Sick, “Radar-based Feature Design and Multiclass Classification for Road User Recognition,” in 2018 IEEE Intelligent Vehicles Symposium (IV).   Changshu, China: IEEE, jun 2018, pp. 779–786.
  • [29] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The Pascal Visual Object Classes Challenge: A Retrospective,”

    International Journal of Computer Vision

    , vol. 111, no. 1, pp. 98–136, 2015.