For autonomous driving, the perception of the vehicle's surroundings is an important task. In particular, tracking multiple objects using noisy data from sensors such as radar, lidar and camera is crucial. A standard approach is to preprocess the received sensor data in order to generate object detections, which are used as input for a tracking algorithm. For camera and lidar sensors, various object detectors have already been developed to obtain object hypotheses in the form of classified rectangular 2D or 3D bounding boxes. To the best of the authors' knowledge, no object detection method for radar has been presented in the literature so far which performs a classification as well as a bounding box estimation. Although modern high-resolution radar sensors generate multiple detections per object, the received radar data is extremely sparse compared to lidar point clouds or camera images. For this reason, recognizing different objects solely from radar data is a challenging task.
This contribution presents a method to detect objects in high-resolution radar data using a machine learning approach. As shown in Figure 1
, radar data is represented as a point cloud, called a radar target list, consisting of two spatial coordinates, the ego-motion compensated Doppler velocity and the radar cross section (RCS). Since radar targets are represented as a point cloud, it is desirable to use point clouds directly as input for a neural network. Existing approaches that process radar data apply certain representation transformations in order to use neural networks. However, PointNets [qi2017pointnet, qi2017pointnetplusplus] make it possible to process point clouds directly and are therefore suitable for this data format. Frustum PointNets [qi2018frustum] extend the concept of PointNets to object detection by combining a 2D object detector with a 3D instance segmentation and a 3D bounding box estimation. The proposed object detection method for radar data is based on the Frustum PointNets approach and performs 2D object detection, i.e., object classification and segmentation of radar target lists together with a 2D bounding box regression. The 3D lidar point clouds used in [qi2017pointnet, qi2017pointnetplusplus, qi2018frustum] are dense and even capture fine-grained structures of objects. Although radar data is sparse compared to lidar data, it contains strong features in the form of Doppler and RCS information. For example, the wheels of a vehicle generate notable Doppler velocities, and license plates result in targets with high RCS values. Another advantage of radar data is that it often contains reflections from object parts which are not directly visible, e.g., wheel houses on the opposite side of a vehicle. All these features can be very beneficial for classifying and segmenting radar targets, and above all for estimating bounding boxes of objects.
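As a minimal illustration of the data format described above, a radar target list can be held as an n × 4 array; all values below are made up for illustration, and the variable names are assumptions:

```python
import numpy as np

# Hypothetical sketch of a radar target list: each radar target is a
# four-dimensional point (x, y, ego-motion compensated Doppler velocity v_r,
# radar cross section). Values are illustrative only.
radar_targets = np.array([
    # x [m], y [m], v_r [m/s], rcs [dBsm]
    [12.3,  1.8, -4.2, 10.5],   # e.g., a license plate: high RCS
    [12.9,  2.1, -6.0,  2.1],   # e.g., a rotating wheel: notable Doppler
    [30.4, -5.2,  0.1, -8.0],   # e.g., a clutter target
])
print(radar_targets.shape)  # (n_targets, 4)
```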
This work is structured as follows. Section II
reviews related work in the field of object classification and bounding box estimation using radar data. Furthermore, deep learning approaches on point clouds using PointNets are introduced. Section III states the problem addressed by this contribution. Section IV presents the proposed method to detect 2D object hypotheses in radar data. In addition, the automatically generated radar dataset is described, and the training process is explained. Section V shows results evaluated on real-world radar data. Finally, a short conclusion is given.
II. Related Work
II-A Object Classification in Radar Data
For the object classification task in radar data, Heuel and Rohling [heuel2011, heuel2012, heuel2013] present approaches with hand-crafted features to recognize and classify pedestrians and vehicles. Wöhler et al. [woehler2017] and Lombacher et al. [lombacher2016] accumulate raw radar data over time and transform them into a grid map representation [werber2015]. Then, windows around potential objects are cut out and used as input for a deep neural network. Furthermore, Lombacher et al. [lombacher2017_semanticradargrids]
infer a semantic representation for stationary objects using radar data. To this end, a convolutional neural network is fed with an occupancy radar grid map.
II-B Bounding Box Estimation in Radar Data
Instead of classifying objects, another task is to estimate a bounding box, i.e., the position, orientation and dimensions of an object. Roos et al. [roos2016] present an approach to estimate the orientation as well as the dimensions of a vehicle using high-resolution radar. For this purpose, single measurements of two radars are collected, and enhanced versions of oriented bounding box algorithms and the L-fit algorithm are applied. Schlichenmaier et al. [schlichenmaier2016] show another approach to estimate bounding boxes using high-resolution radar data, in which the position and dimensions of vehicles are estimated using a variant of the k-nearest-neighbors method. Furthermore, Schlichenmaier et al. [schlichenmaier2017] present an algorithm to estimate bounding boxes representing vehicles using template matching. Especially in challenging scenarios, the template matching algorithm outperforms the oriented bounding box methods. A disadvantage of the proposed method is that clutter points not belonging to the vehicle are nevertheless taken into account for the bounding box estimation.
The input of most neural networks has to follow a regular structure, e.g., image grids or grid map representations. This requires that data such as point clouds or radar targets be transformed into a regular format before being fed into a neural network. The PointNet architecture overcomes this constraint and supports point clouds as input. Qi et al. [qi2017pointnet] present a 3D classification and a semantic segmentation of 3D lidar point clouds using PointNet. Since the PointNet architecture does not capture local structures induced by the metric space, its ability to capture details is limited. For this reason, Qi et al. [qi2017pointnetplusplus] also propose a hierarchical neural network, called PointNet++, which applies PointNet recursively on small regions of the input point set. The PointNet++ architecture makes it possible to leverage neighborhoods at multiple scales, resulting in the ability to learn deep point set features, i.e., to capture fine-grained patterns, and in increased robustness. Schumann et al. [schumann2018] use the same PointNet++ architecture for semantic segmentation on radar point clouds. For this purpose, the architecture is modified to handle point clouds with two spatial and two further feature dimensions. The radar data is accumulated over a fixed time window to obtain a denser point cloud with more reflections per object. Subsequently, each radar target is classified into one of six classes. Thus, only a semantic segmentation is performed, but no semantic instance segmentation or bounding box estimation. Utilizing image and lidar data, Qi et al. [qi2018frustum] present Frustum PointNets to detect 3D objects. First, a 3D frustum point cloud including an object is extracted using a 2D bounding box from an image-based object detector. Second, a 3D instance segmentation in the frustum is performed using a segmentation PointNet. Third, an amodal 3D bounding box is estimated using a regression PointNet.
Hence, Frustum PointNets are the first method to perform object detection, including bounding box estimation, using PointNets on unstructured data.
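The core PointNet idea referenced above, a shared per-point MLP followed by a symmetric max-pooling, can be sketched in a few lines of NumPy. This is a toy illustration with random weights standing in for a trained network, not the authors' architecture; it only demonstrates why the resulting global feature is invariant to the ordering of the input points:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in weights for two shared per-point layers (assumed sizes).
W1 = rng.normal(size=(4, 16))   # 4 input features (x, y, v_r, rcs) -> 16
W2 = rng.normal(size=(16, 32))  # 16 -> 32

def pointnet_feature(points):
    """points: (n, 4) radar targets -> (32,) global feature vector."""
    h = np.maximum(points @ W1, 0.0)   # same MLP applied to every point
    h = np.maximum(h @ W2, 0.0)
    return h.max(axis=0)               # symmetric aggregation over points

pts = rng.normal(size=(10, 4))
feat = pointnet_feature(pts)
shuffled = pts[rng.permutation(10)]
print(np.allclose(feat, pointnet_feature(shuffled)))  # True: order-invariant
```

A classification head on this global feature, or a per-point head that concatenates it back to each point, yields the classification and segmentation networks, respectively.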
III. Problem Statement
Given radar point clouds, the goal of the presented method is to detect objects, i.e., to classify and localize objects in 2D space (Figure 1). The radar point cloud is represented as a set of four-dimensional points p_1, ..., p_n, where n denotes the number of radar targets. Each point contains the spatial coordinates x and y, the ego-motion compensated Doppler velocity v_r, and the radar cross section σ. The radar data is generated from one measurement cycle of a single radar and is not accumulated over time.
For the object classification task, this contribution distinguishes between two classes, car and clutter. Note that this method can easily be extended to multiple classes. Additionally, for radar targets that are assigned to the class car, an amodal 2D bounding box is predicted, i.e., the entire object is estimated even if only parts of it are captured by the radar sensor. The 2D bounding box is described by its center (x_c, y_c), its heading angle θ in the x-y plane, and its size, consisting of length l and width w.
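The box parameterization above can be made concrete with a small helper that turns (center, heading, size) into the four corner points. This is a sketch; the function name and corner ordering are assumptions:

```python
import numpy as np

def box_corners(center, heading, size):
    """Corners of an amodal 2D box given center (x_c, y_c), heading angle
    theta and size (length l, width w), as parameterized in the text."""
    l, w = size
    # Corners in the box's local frame, counter-clockwise.
    local = np.array([[ l/2,  w/2], [ l/2, -w/2],
                      [-l/2, -w/2], [-l/2,  w/2]])
    c, s = np.cos(heading), np.sin(heading)
    R = np.array([[c, -s], [s, c]])          # rotation by the heading angle
    return local @ R.T + np.asarray(center)  # rotate, then translate

corners = box_corners(center=(5.0, 2.0), heading=0.0, size=(4.5, 1.8))
```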
IV. 2D Object Detection with PointNets
An overview of the proposed 2D object detection system using radar data is shown in Figure 2. This section introduces the three major modules of the system: patch proposal, classification and segmentation, and amodal 2D bounding box estimation.
IV-A Patch Proposal
The patch proposal divides the radar point cloud into regions of interest. For this purpose, a patch with a specific length and width is placed around each radar target. The length and width of the patch must be selected such that the patch comprises the entire object of interest, here a car. The patch proposal generates multiple patches containing the same object. As a result, the final 2D object detector provides multiple hypotheses for a single object. This behavior is desirable because the object tracking system in the subsequent processing chain for environmental perception deals with multiple hypotheses per object. Note that the tracking system is not part of this work. As described in [qi2018frustum], the patches are normalized to a center view, which ensures rotation invariance of the algorithm. Finally, all radar targets within a patch are forwarded to the classification and segmentation network.
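The two steps above, cutting a patch around every radar target and rotating it to the center view, could be sketched as follows. The patch dimensions and the axis-aligned cut are assumptions made for illustration:

```python
import numpy as np

def propose_patches(targets, length=6.0, width=6.0):
    """Sketch of the patch proposal: around each radar target, collect all
    targets inside a length x width patch, then rotate the patch so that
    its center lies on the x-axis (center view). Patch size is assumed."""
    patches = []
    for c in targets:
        m = (np.abs(targets[:, 0] - c[0]) <= length / 2) & \
            (np.abs(targets[:, 1] - c[1]) <= width / 2)
        patch = targets[m].copy()
        phi = np.arctan2(c[1], c[0])          # viewing angle of patch center
        rot = np.array([[np.cos(-phi), -np.sin(-phi)],
                        [np.sin(-phi),  np.cos(-phi)]])
        patch[:, :2] = patch[:, :2] @ rot.T   # normalize to center view
        patches.append(patch)
    return patches
```

After this normalization, a car seen at any azimuth presents itself to the network in a canonical orientation, which is what makes the downstream networks rotation-invariant.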
IV-B Classification and Segmentation
The classification and object segmentation module consists of a classification and a segmentation network to obtain classified objects with segmented radar targets. To this end, entire patches are classified using the classification network to distinguish between car and clutter
patches. For car patches, the segmentation network predicts a probability score for each radar target, which indicates the probability of the radar target belonging to a car. In the masking step, radar targets which are classified as car targets are extracted. As presented in [qi2018frustum], the coordinates of the segmented radar targets are normalized to ensure translational invariance of the algorithm.
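The masking and normalization steps just described can be sketched as follows; the threshold value and function name are assumptions:

```python
import numpy as np

def mask_and_normalize(patch, scores, threshold=0.5):
    """Sketch of the masking step: keep radar targets whose predicted
    per-target car probability exceeds a threshold, then subtract the
    centroid of the segmented targets from their coordinates so that the
    subsequent box estimation is translation-invariant."""
    seg = patch[scores > threshold].copy()
    centroid = seg[:, :2].mean(axis=0)
    seg[:, :2] -= centroid        # local coordinates around the segment
    return seg, centroid
```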
Note that the classification and segmentation module can easily be extended to multiple classes. For this purpose, the patch is classified as a certain class and, consequently, the predicted class information is used for the segmentation step.
IV-C Amodal 2D Bounding Box Estimation
After the segmentation of object points, this module estimates the associated amodal 2D bounding box. First, a light-weight regression PointNet, called Transformer PointNet (T-Net), estimates the center of the amodal bounding box and transforms the radar targets into a local coordinate system relative to the predicted center. This step and the T-Net architecture are described in detail in [qi2018frustum]. The transformation using the T-Net is reasonable because, depending on the viewing angle, the centroid of the segmented points can differ from the true center of the amodal bounding box.
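The effect of the T-Net step can be sketched as follows: the network output is a residual from the centroid of the segmented points to the true box center, and the points are shifted into that local frame. The predicted residual below is a stand-in for the actual network output:

```python
import numpy as np

def apply_tnet(seg_points, predicted_delta):
    """Sketch of the T-Net step: shift the segmented targets into a local
    frame around the estimated amodal box center, which is the centroid of
    the segmented points plus a predicted residual (stand-in value here)."""
    centroid = seg_points[:, :2].mean(axis=0)
    center_estimate = centroid + predicted_delta
    local = seg_points.copy()
    local[:, :2] -= center_estimate
    return local, center_estimate
```

The residual matters because, e.g., a car seen from behind only returns targets from its rear, so the centroid of the segment lies behind the true amodal center.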
For the 2D bounding box estimation, a box regression PointNet, which is conceptually the same as proposed in [qi2018frustum], is used. The regression network predicts the parameters of a 2D bounding box: its center (x_c, y_c), its heading angle θ and its size (l, w). For the box center estimation, the residual-based 2D localization of [qi2018frustum] is performed. The heading angle and the size of the bounding box are predicted using a combination of a classification and a regression approach, as explained in [qi2018frustum]. More precisely, for the size estimation, predefined size templates are used as categories for the classification task, and residual values with respect to those categories are predicted.
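Decoding this hybrid classification/regression output can be sketched as below. The number of heading bins and the size template are illustrative assumptions, not the paper's values:

```python
import numpy as np

# Assumed discretization: heading as a bin index plus a residual angle, size
# as a template class plus per-dimension residuals.
NUM_HEADING_BINS = 12
SIZE_TEMPLATES = np.array([[4.5, 1.8]])   # assumed mean car length/width [m]

def decode_box(heading_bin, heading_residual, size_cls, size_residual):
    """Turn the network's classified bin/template plus residuals into the
    final heading angle and box size."""
    bin_width = 2 * np.pi / NUM_HEADING_BINS
    heading = heading_bin * bin_width + heading_residual
    size = SIZE_TEMPLATES[size_cls] + size_residual
    return heading, size

heading, size = decode_box(3, 0.05, 0, np.array([0.2, -0.1]))
```

Classifying a coarse bin and regressing only a small residual is easier to learn than regressing the full angle or size directly, which is the motivation behind this hybrid formulation in [qi2018frustum].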
In the case of multiple classes, the box estimation network also uses the classification information for the bounding box regression. Therefore, the size templates have to be extended with additional classes, e.g., pedestrians or cyclists.
IV-D Network Architectures
For the object detection task in radar data, two network architectures were considered: the v1 model, based on the concepts of PointNet [qi2017pointnet], and the v2 model, based on the concepts of PointNet++ [qi2017pointnetplusplus]. Figure 3 shows both network architectures. For the classification and segmentation network, the architecture is conceptually similar to PointNet [qi2017pointnet] and PointNet++ [qi2017pointnetplusplus]. The network for amodal 2D bounding box estimation is the same as proposed in [qi2018frustum]. Since this work uses radar data as input, the input of both networks is extended. For the classification and segmentation network, the input is the four-dimensional radar target list of a patch, containing points with 2D spatial data, ego-motion compensated Doppler velocity and RCS information. For the bounding box regression network, the input is the spatial information of the segmented radar target list.