RAANet: Range-Aware Attention Network for LiDAR-based 3D Object Detection with Auxiliary Density Level Estimation

3D object detection from LiDAR data for autonomous driving has made remarkable strides in recent years. Among state-of-the-art methodologies, encoding point clouds into a bird's-eye view (BEV) has been demonstrated to be both effective and efficient. Different from perspective views, BEV preserves rich spatial and distance information between objects; and while farther objects of the same type do not appear smaller in the BEV, they contain sparser point cloud features. This fact weakens BEV feature extraction using shared-weight convolutional neural networks. In order to address this challenge, we propose the Range-Aware Attention Network (RAANet), which extracts more powerful BEV features and generates superior 3D object detections. The range-aware attention (RAA) convolutions significantly improve feature extraction for near as well as far objects. Moreover, we propose a novel auxiliary loss for density estimation to further enhance the detection accuracy of RAANet for occluded objects. It is worth noting that our proposed RAA convolution is lightweight and can be integrated into any CNN architecture used for BEV detection. Extensive experiments on the nuScenes dataset demonstrate that our proposed approach outperforms state-of-the-art methods for LiDAR-based 3D object detection, with a real-time inference speed of 16 Hz for the full version and 22 Hz for the lite version. The code is publicly available at an anonymous GitHub repository: https://github.com/anonymous0522/RAAN.


1 Introduction

Figure 1: Motivation. Different from camera images with perspective distortion, in BEV, objects that are farther from the ego-vehicle do not appear in smaller sizes, but contain a sparser set of LiDAR points instead. Thus, an object in BEV may appear different depending on its distance to the ego-vehicle, which weakens BEV feature extraction using shared-weight CNNs.

With the rapid improvement of processing units, and thanks to the extraordinary success of deep neural networks, perception for autonomous driving has been flourishing in recent years. 3D object detection from LiDAR sensors is one of the key capabilities for autonomous driving. Early works employ 3D convolutional neural networks, which have slow processing speeds and large memory requirements. In order to decrease the memory requirements and provide real-time processing, recent methodologies leverage voxelization and bird’s-eye view (BEV) projection. Voxelization is widely implemented as a preprocessing method for 3D point clouds due to the computational efficiency provided by more structured data, as well as its accuracy [lang2019pointpillars][zhou2018voxelnet][zhou2020endtoend-multiview-fusion][xuetao2019segmentation]. In general, voxelization divides a point cloud into an evenly spaced grid of voxels, and then assigns 3D LiDAR points to their respective voxels. The output space preserves the Euclidean distance between objects and avoids overlapping of bounding boxes. This keeps object size variation in a relatively small range, regardless of the distance from the LiDAR, which benefits shape regression during training. However, as shown in Figure 1, voxels that are farther away from the ego-vehicle contain significantly fewer LiDAR points than the near ones. This leads to a situation where different representations may be extracted for an object at different distances to the ego-vehicle. In contrast to the perspective view, where feature elements are location-agnostic, BEV feature maps are location-sensitive. Thus, different convolutional kernel weights should be applied to feature elements at different locations of the feature map. In other words, location information should be introduced into the feature maps, and the convolutional kernels should be adjustable to the location information of the corresponding feature maps.
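As a point of reference for the voxelization step described above, the following is a minimal sketch of BEV voxel assignment; the region bounds and voxel size are hypothetical placeholders, not the configuration used by RAANet.

```python
import numpy as np

def voxelize_bev(points, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0), voxel_size=0.25):
    """Assign LiDAR points (N, 3+) to an evenly spaced BEV grid.

    Returns a dict mapping (row, col) voxel indices to the list of point
    indices that fall inside that voxel. Ranges and voxel size are
    illustrative placeholders, not the paper's configuration.
    """
    x, y = points[:, 0], points[:, 1]
    # keep only points inside the BEV region
    mask = (x >= x_range[0]) & (x < x_range[1]) & (y >= y_range[0]) & (y < y_range[1])
    idx = np.flatnonzero(mask)
    cols = ((x[idx] - x_range[0]) / voxel_size).astype(np.int64)
    rows = ((y[idx] - y_range[0]) / voxel_size).astype(np.int64)
    voxels = {}
    for i, r, c in zip(idx, rows, cols):
        voxels.setdefault((r, c), []).append(int(i))
    return voxels

# Example: 10k random points in a 100 m x 100 m area
pts = np.random.uniform(-50, 50, size=(10000, 3))
grid = voxelize_bev(pts)
print(len(grid), "occupied voxels")
```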

In this paper, we propose the Range-Aware Attention Network (RAANet), which contains a novel Range-Aware Attention Convolutional layer (RAAConv) designed for object detection from the LiDAR BEV. RAAConv is composed of two independent convolutional branches with attention maps, making it sensitive to the location information of the input feature map. Our approach is inspired by the properties of BEV images, which are illustrated in Fig. 1. Points become sparser as the distance between an object and the ego-vehicle increases. Ideally, for BEV feature maps, elements at different locations should be processed by different convolution kernels. However, applying different kernels would significantly increase the computational expense. In order to utilize the location information during BEV feature extraction, while avoiding heavy computation at the same time, we regard a BEV feature map as a composition of sparse features and dense features. We apply two different convolution kernels to simultaneously extract sparse and dense features. Each extracted feature map has half the channel size of the final output. Meanwhile, range and position encodings are generated based on the input shape. Then, each range-aware attention heatmap is computed from the corresponding feature map and the range and position encodings. Finally, the attention heatmaps are applied to the feature maps to enhance the feature representation. The feature maps generated from the two branches are concatenated channel-wise as the RAAConv output. Details are presented in Sec. 3.2.

In addition, the effect of occlusion on the LiDAR data cannot be ignored, since an object can have different point distributions under different amounts of occlusion. Thus, we propose an efficient auxiliary branch, referred to as the Auxiliary Density Level Estimation Module (ADLE), allowing RAANet to take occlusion into consideration. Since annotating various occlusions is a time-consuming and expensive task, instead of estimating an occlusion instance, we design the ADLE to estimate a density level for each object. If there is no occlusion, the point density levels for near objects are higher than those of far objects. However, if a near object is occluded, its point density level decreases. Therefore, by combining range information and density level information, we are able to approximate the occlusion information. ADLE is only used in the training stage to provide density information guidance, and can be removed at inference time for computational efficiency.

Contributions. The main contributions of this work include the following:


  • We propose the RAAConv layer, which allows LiDAR-based detectors to extract more representative BEV features. In addition, the RAAConv layer can be integrated into any CNN architecture used for the LiDAR BEV.

  • We propose a novel auxiliary loss for density estimation to help the main network learn occlusion related features. This proposed density level estimator further enhances the detection accuracy of RAANet on occluded objects as well.

  • We propose the Range-Aware Attention Network (RAANet), which integrates the aforementioned RAA and ADLE modules. RAANet is further optimized by generating an anisotropic Gaussian heatmap based on the ground truth, which is introduced in Sec. 3.4.

  • The code is available at an anonymous GitHub repo. [our-code].

Figure 2: Overview of the proposed network. The RPN is designed to extract the shared feature from a BEV pseudo image. Then, the main detection heads and an auxiliary density estimation head are appended. The auxiliary head only exists during training and is omitted during inference. Finally, 3D bounding boxes are obtained from the main heads. Our proposed RAAConv layer, illustrated in the lower part of the figure, replaces the original Conv2D layer. In general, this layer contains two branches, and each branch applies position and range encodings to generate its range-aware attention heatmap. Then, extracted features in each branch are enhanced by the heatmap with a residual architecture. Those enhanced features are concatenated for the final output features.

2 Related Work

In recent years, many methods have been presented advancing the field of 3D object detection. Pixor [yang2018pixor] performs voxelization via a customized encoding algorithm for the point clouds. However, the hand-crafted design of the voxelization limits the generalization and adaptation of Pixor. To overcome this shortcoming, VoxelNet [zhou2018voxelnet] combines a PointNet-style encoder with 3D convolutional layers to learn voxel features from the point clouds automatically. Compared to VoxelNet, SECOND [yan2018second] improves the inference time with sparsely embedded convolutional layers. Although VoxelNet-based methodologies provide powerful encoding schemes for point clouds, researchers are still exploring new effective methods. In particular, PointPillars [lang2019pointpillars] discards 3D convolutional layers and fine-grained 3D voxelization, and presents a new encoding scheme called pillars. It collapses the points along the z-axis and then projects them into the BEV space, so that the points can be processed by 2D convolutions. The resulting end-to-end pipeline with only 2D convolutional layers significantly improves the processing speed.

The aforementioned studies have provided effective and fast methodologies to encode point clouds. However, how to effectively process these encodings and perform precise bounding box regression still remains to be explored. CenterPoint [yin2021center-point] provides a regression method that focuses on the center points of all bounding boxes, and hence reduces the complexity of rotation. Although the rotation invariance of the center point makes this method accurate and fast, a significant amount of positional information is lost because of the top-view encoding and the isotropic Gaussian mask. To avoid these drawbacks, a series of works consider positional information in the feature extractor. In particular, CBAM [woo2018cbam] adds spatial attention and channel attention modules into the convolutional layers. Although this scheme provides a new insight into adding attention mechanisms, it still cannot effectively encode the complicated spatial information from LiDAR point clouds. Thus, it is necessary to design a novel convolutional network that utilizes the location and density information from the point clouds.

3 Range-Aware Attention Network

3.1 Overview of RAANet

The main architecture of our proposed Range-Aware Attention Network (RAANet) is presented in Fig. 2. We incorporate ideas from CenterNet [zhou2019centernet][yin2021center-point] to build an anchor-free detector, and also introduce two novel modules: the Range-Aware Attention Convolutional Layer (RAAConv) and the Auxiliary Density Level Estimation Module (ADLE).

RAANet takes the 3D LiDAR points as input and generates a set of 3D oriented bounding boxes as the output. Inspired by CenterPoint [yin2021center-point], we implement a feature extractor with a 3D sparse convolutional network to extract the BEV feature map from the voxelized point clouds. The resulting feature map is reshaped into a BEV pseudo image, which can be regarded as a multi-channel 2D image. The Region Proposal Network (RPN) takes this BEV pseudo image as input and employs multiple down-sample and up-sample blocks to generate a high-dimensional feature map. In addition to the detection heads of the main task, we propose an auxiliary task for density level estimation to achieve better detection performance. Meanwhile, RAAConv layers are utilized in all convolutional modules shown in Fig. 2. A center probability heatmap is generated from one of the detector heads, and local maxima in this heatmap are treated as the centers of detected bounding boxes (bbox). The regression of bbox attributes, including offsets, dimensions and orientations, is computed by the other heads of the main detection task. Parallel to the main task, the proposed ADLE, as an auxiliary task, estimates a density level for each bounding box. In general, ADLE classifies each bounding box into different density levels based on the number of points in each bbox. ADLE is a portable block for the whole network, i.e., it is employed only in the training process and is not used during inference, for higher computational efficiency.
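As an illustration of the center-decoding step described above, the sketch below extracts local maxima from a center heatmap in the usual CenterNet style; the kernel size and top-k value are hypothetical, and the exact decoding used by RAANet may differ.

```python
import torch
import torch.nn.functional as F

def extract_centers(heatmap, k=100, kernel=3):
    """Pick local maxima of a center heatmap as candidate object centers.

    heatmap: (B, num_classes, H, W) tensor of center probabilities.
    Returns top-k scores and their (class, y, x) indices per batch item.
    This follows the usual CenterNet-style decoding; the exact procedure
    used by RAANet may differ in details.
    """
    pad = (kernel - 1) // 2
    # a pixel is a peak if it equals the max of its local neighborhood
    pooled = F.max_pool2d(heatmap, kernel, stride=1, padding=pad)
    peaks = heatmap * (pooled == heatmap).float()

    b, c, h, w = peaks.shape
    scores, flat_idx = peaks.view(b, -1).topk(k)
    cls = flat_idx // (h * w)
    ys = (flat_idx % (h * w)) // w
    xs = flat_idx % w
    return scores, cls, ys, xs

# usage: hm = torch.sigmoid(model_output); extract_centers(hm)
```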

For the training phase, we denote the losses of the heatmap classification head, the bbox regression head and the auxiliary density level head as $\mathcal{L}_{hm}$, $\mathcal{L}_{box}$ and $\mathcal{L}_{aux}$, respectively. The total loss function of RAANet is formulated as

$$\mathcal{L} = \mathcal{L}_{hm} + \lambda_{box}\,\mathcal{L}_{box} + \lambda_{aux}\,\mathcal{L}_{aux}, \tag{1}$$

where $\lambda_{box}$ and $\lambda_{aux}$ are scalar weights for balancing the multi-task loss. $\mathcal{L}_{hm}$ and $\mathcal{L}_{aux}$ are designed as penalty-reduced focal losses [lin2017focalloss][zhou2019centernet], and $\mathcal{L}_{box}$ is designed as a smooth L1 loss.
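A minimal sketch of this multi-task combination is given below; the default weight values are placeholders rather than the paper's settings.

```python
import torch

def raanet_total_loss(loss_hm, loss_box, loss_aux, lambda_box=1.0, lambda_aux=1.0):
    """Eq. (1): weighted sum of the heatmap, box-regression and auxiliary
    density losses. lambda_box / lambda_aux are placeholder weights."""
    return loss_hm + lambda_box * loss_box + lambda_aux * loss_aux

# usage with dummy scalar losses
print(raanet_total_loss(torch.tensor(0.7), torch.tensor(1.2), torch.tensor(0.3)))
```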

3.2 Range-Aware Attention Convolutional Layer (RAAConv)

We propose the Range-Aware Attention Convolutional Layer (RAAConv), which, by leveraging specially designed attention blocks, becomes sensitive to range and position information. This is the major difference between the proposed RAAConv and a traditional convolutional layer. RAAConv is employed in the aforementioned RPN and detection heads. As shown in Fig. 2, RAAConv first utilizes two sets of convolutional kernels to extract an intermediate feature map for each branch. Then, the position encodings are generated and embedded into the intermediate feature maps. Two range-aware heatmaps are calculated by a series of convolution and pooling operations using the intermediate feature maps and range encodings. More specifically, given an input feature map $\mathbf{X} \in \mathbb{R}^{H \times W \times C_{in}}$, two separate intermediate feature maps, $\mathbf{F}_1, \mathbf{F}_2 \in \mathbb{R}^{H \times W \times C_{out}/2}$, are generated by

$$\mathbf{F}_i = \mathbf{X} \ast \mathbf{K}_i, \quad i \in \{1, 2\}, \tag{2}$$

where $\ast$ represents the convolution operation and $\mathbf{K}_i$ denotes the corresponding convolution kernel.

Meanwhile, the position encoding map $\mathbf{P}$ and the range encoding map $\mathbf{R}$ are generated for the intermediate feature maps. $\mathbf{P}$ and $\mathbf{R}$ are calculated from shape information alone and do not depend on the values inside the tensor $\mathbf{F}_i$. $\mathbf{P}$ has two components: the row encoding $\mathbf{P}_{row}$ and the column encoding $\mathbf{P}_{col}$, which are generated as follows:

(3)

The values of $\mathbf{P}$ and $\mathbf{R}$ are bounded within fixed normalized ranges.
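Since Eq. (3) is not reproduced in this text, the sketch below shows one plausible way to construct normalized row/column position encodings and a range encoding from the feature-map shape; the exact formulation in the paper may differ.

```python
import torch

def make_encodings(height, width, device="cpu"):
    """Build a plausible position encoding P (row/col) and range encoding R
    for an H x W BEV feature map, from shape information only.

    This is an illustrative guess at Eq. (3): rows/cols are normalized to
    [-1, 1] and the range channel is the normalized Euclidean distance of
    each cell from the map center (ego-vehicle), in [0, 1].
    """
    ys = torch.linspace(-1.0, 1.0, height, device=device)
    xs = torch.linspace(-1.0, 1.0, width, device=device)
    p_row = ys.view(height, 1).expand(height, width)      # row encoding
    p_col = xs.view(1, width).expand(height, width)       # column encoding
    r = torch.sqrt(p_row ** 2 + p_col ** 2)
    r = r / r.max()                                        # range encoding in [0, 1]
    return p_row, p_col, r

p_row, p_col, r = make_encodings(128, 128)
print(p_row.shape, p_col.shape, r.shape)
```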

Since the proposed network is designed to use two convolution kernels to extract dense and sparse features separately, we append the position encoding $\mathbf{P}$ to the intermediate feature map $\mathbf{F}_1$ channel-wise, and reversely append it to $\mathbf{F}_2$. The generated features are denoted as $\tilde{\mathbf{F}}_1$ and $\tilde{\mathbf{F}}_2$. Maximum pooling, mean pooling and conv2D are applied on $\tilde{\mathbf{F}}_1$ and $\tilde{\mathbf{F}}_2$ separately to obtain the spatial embeddings $\mathbf{S}_1$ and $\mathbf{S}_2$. Then, similar to the positional appending, the range encoding map $\mathbf{R}$ is appended to $\mathbf{S}_1$ and to $\mathbf{S}_2$. The appended features are processed by a conv2D layer followed by a sigmoid activation to obtain the range-aware attention heatmaps $\mathbf{A}_1$ and $\mathbf{A}_2$. The heatmaps $\mathbf{A}_1$ and $\mathbf{A}_2$ are then multiplied by learnable scalars $\gamma_1$ and $\gamma_2$ separately. $\gamma_1$ and $\gamma_2$ are initialized as zero and gradually learn the weights during training [zhang2019self]. The output feature maps $\mathbf{Y}_1$ and $\mathbf{Y}_2$ are calculated from $\mathbf{F}_i$, $\mathbf{A}_i$ and $\gamma_i$ using a residual connection, which is defined as:

$$\mathbf{Y}_i = \mathbf{F}_i + \gamma_i\,(\mathbf{A}_i \otimes \mathbf{F}_i), \quad i \in \{1, 2\}, \tag{4}$$

where $\otimes$ is the operation performing channel-wise stacking and element-wise multiplication. The final output of RAAConv is the channel-wise concatenation of $\mathbf{Y}_1$ and $\mathbf{Y}_2$, which has size $H \times W \times C_{out}$.

In addition, we customize the initialization of the convolutional kernel weights that are related to the range encoding channel, taking the absolute values of the original initialization. The reason is that $\mathbf{R}$ is normalized within a non-negative range. An ablation study demonstrates that this initialization method improves the average precision (AP) performance of RAANet.

It is worth noting that the proposed RAAConv can be readily plugged into any convolutional network for LiDAR BEV detection.
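To make the description above concrete, the following is a simplified PyTorch sketch of an RAAConv-style layer. It compresses several details (the position-encoding appending, the exact attention head, kernel sizes, and the range-encoding formula) into assumptions, so it should be read as an illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class RAAConv(nn.Module):
    """Sketch of a Range-Aware Attention convolution, assuming details
    (encoding formulas, pooling layout, kernel sizes) that the paper only
    describes at a high level; it is not the authors' reference code."""

    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        assert out_ch % 2 == 0, "each branch produces half of the output channels"
        half = out_ch // 2
        pad = kernel_size // 2
        # two independent branches for sparse / dense feature extraction
        self.branch1 = nn.Conv2d(in_ch, half, kernel_size, padding=pad)
        self.branch2 = nn.Conv2d(in_ch, half, kernel_size, padding=pad)
        # attention heads: [max-pool, mean-pool, range] channels -> 1 heatmap
        self.att1 = nn.Conv2d(3, 1, kernel_size, padding=pad)
        self.att2 = nn.Conv2d(3, 1, kernel_size, padding=pad)
        # learnable residual scalars, initialized to zero as in [zhang2019self]
        self.gamma1 = nn.Parameter(torch.zeros(1))
        self.gamma2 = nn.Parameter(torch.zeros(1))

    @staticmethod
    def _range_encoding(h, w, device):
        # normalized distance of each BEV cell from the map center (ego)
        ys = torch.linspace(-1.0, 1.0, h, device=device).view(h, 1).expand(h, w)
        xs = torch.linspace(-1.0, 1.0, w, device=device).view(1, w).expand(h, w)
        r = torch.sqrt(ys ** 2 + xs ** 2)
        return (r / r.max()).view(1, 1, h, w)

    def _attend(self, feat, att_conv, gamma, rng):
        # spatial embeddings from max/mean pooling over channels, plus range
        s = torch.cat([feat.max(1, keepdim=True).values,
                       feat.mean(1, keepdim=True),
                       rng.expand(feat.size(0), -1, -1, -1)], dim=1)
        a = torch.sigmoid(att_conv(s))              # range-aware attention heatmap
        return feat + gamma * (a * feat)            # residual enhancement, Eq. (4)

    def forward(self, x):
        b, _, h, w = x.shape
        rng = self._range_encoding(h, w, x.device)
        f1, f2 = self.branch1(x), self.branch2(x)
        y1 = self._attend(f1, self.att1, self.gamma1, rng)
        y2 = self._attend(f2, self.att2, self.gamma2, rng)
        return torch.cat([y1, y2], dim=1)           # channel-wise concatenation

# usage
layer = RAAConv(64, 128)
out = layer(torch.randn(2, 64, 128, 128))
print(out.shape)  # torch.Size([2, 128, 128, 128])
```

Because each branch produces only half of the output channels, such a layer adds little overhead compared to a standard convolution with the same output width.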

Figure 3: Illustration of the Anisotropic 2D Gaussian Mask. The traditional isotropic Gaussian mask neglects the rotation of the rectangular bounding box. We propose an anisotropic 2D Gaussian mask, allowing the mask to better fit the ground truth. The anisotropic mask is also able to introduce heading angle information, which is beneficial to bbox regression.

3.3 Auxiliary Density Level Estimation Module (ADLE)

The Auxiliary Density Level Estimation Module (ADLE) is an additional classification head that is parallel to the main detection heads. The ADLE indicates the density level of each bbox. In general, we design ADLE to estimate the number of LiDAR points inside a detected bbox. We have empirically found that this estimation is straightforward yet helpful to the detector. Moreover, instead of making ADLE perform a precise value regression, we divide the number of points into three density bins and let ADLE perform a classification task, which is more achievable. More specifically, we divide the point density level into three bins: sparse, adequate and dense. We have statistically analyzed the distribution of the number of points and employ two thresholds, $t_1^c$ and $t_2^c$, to determine the density level for each object class $c$. The three density levels are defined as:

$$d^c = \begin{cases} \text{sparse}, & n < t_1^c, \\ \text{adequate}, & t_1^c \le n < t_2^c, \\ \text{dense}, & n \ge t_2^c, \end{cases} \tag{5}$$

where $n$ denotes the number of points and $d^c$ represents the density level for the $c$-th class. As shown in Fig. 3, with no occlusion, objects near the ego-vehicle have high point density levels and faraway objects have low density levels. However, when an object is occluded, its density level will be lower than in the unoccluded case at the same distance. Since we incorporate position and range information into RAAConv, ADLE is able to further help RAANet extract occlusion information by estimating the point density levels instead of the actual occlusion severity, which is hard to annotate from the ground truth.
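The following sketch mirrors the binning in Eq. (5); the class names, threshold values and integer labels are hypothetical placeholders, since the paper derives its thresholds from dataset statistics.

```python
import numpy as np

# Hypothetical per-class point-count thresholds (t1, t2); placeholders only.
DENSITY_THRESHOLDS = {"car": (20, 100), "pedestrian": (5, 25)}

def density_level(num_points, class_name):
    """Map the number of LiDAR points inside a ground-truth box to one of
    three density bins, mirroring Eq. (5): 0 = sparse, 1 = adequate, 2 = dense.
    The integer labels are an assumption for illustration.
    """
    t1, t2 = DENSITY_THRESHOLDS[class_name]
    if num_points < t1:
        return 0  # sparse
    if num_points < t2:
        return 1  # adequate
    return 2      # dense

print([density_level(n, "car") for n in (8, 50, 300)])  # [0, 1, 2]
```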

Method Car Truck Bus Trailer Cons.Veh. Ped. M.cyc. Bicycle TrafficCone Barrier mAP
PointPillars [vora2020pointpainting][lang2019pointpillars] 0.760 0.310 0.321 0.366 0.113 0.640 0.342 0.140 0.456 0.564 0.401
SA-Det3D [bhattacharyya2021sa] 0.812 0.438 0.572 0.478 0.113 0.733 0.321 0.079 0.606 0.553 0.470
InfoFocus [wang2020infofocus] 0.779 0.314 0.448 0.373 0.107 0.634 0.290 0.061 0.465 0.478 0.395
CyliNetRG-single [rapoport2021cylinet] 0.850 0.502 0.569 0.526 0.191 0.843 0.586 0.298 0.791 0.690 0.585
SARPNET [YE2020sarpnet] 0.599 0.187 0.194 0.180 0.116 0.694 0.298 0.142 0.446 0.383 0.324
PanoNet3D [chen2020panonet3d] 0.801 0.454 0.540 0.517 0.151 0.791 0.531 0.313 0.719 0.629 0.545
PolarStream [chen2021polarstream] 0.809 0.381 0.471 0.414 0.195 0.802 0.614 0.299 0.753 0.640 0.538
MEGVII [zhu2019MEGVII] 0.811 0.485 0.549 0.429 0.105 0.801 0.515 0.223 0.709 0.657 0.528
PointRCNN[shi2020pcdet][shi2019pointrcnn] 0.810 0.472 0.563 0.510 0.141 0.766 0.422 0.134 0.667 0.614 0.510
SSN-v2 [zhu2020ssn] 0.824 0.418 0.461 0.480 0.175 0.756 0.489 0.246 0.601 0.612 0.506
ReconfigPP-v2 [wang2020reconfigurable] 0.758 0.272 0.395 0.380 0.065 0.625 0.152 0.002 0.257 0.349 0.325
ReconfigPP-v3 [wang2020reconfigurable][wang2021probabilistic] 0.814 0.389 0.430 0.470 0.153 0.724 0.449 0.226 0.583 0.614 0.485
HotSpotNet-0.1m [chen2020hotspots] 0.831 0.509 0.564 0.533 0.230 0.813 0.635 0.366 0.730 0.716 0.593
MMDetection3D [mmdet3d2020] 0.847 0.490 0.541 0.528 0.216 0.793 0.560 0.387 0.714 0.674 0.575
CenterPoint [yin2021center-point] 0.852 0.535 0.636 0.560 0.200 0.846 0.595 0.307 0.784 0.711 0.603
CenterPoint-R [Lertn2018centerpr] 0.854 0.539 0.614 0.575 0.203 0.845 0.591 0.329 0.784 0.706 0.604
CVCNet-single [chen2020everyCVC] 0.827 0.461 0.458 0.467 0.207 0.810 0.613 0.343 0.697 0.699 0.558
CVCNet-ens [chen2020everyCVC] 0.826 0.495 0.594 0.511 0.162 0.830 0.618 0.388 0.697 0.697 0.582
RAANet-lite (Ours) 0.858 0.543 0.634 0.552 0.235 0.855 0.620 0.319 0.789 0.740 0.615
RAANet (Ours) 0.860 0.546 0.636 0.553 0.237 0.856 0.633 0.340 0.794 0.749 0.620
Table 1: Comparison with the state-of-the-art models for 3D detection on the nuScenes test dataset. We show AP scores for each class and overall mAP scores. Best results are marked in bold and 2nd best results are marked in blue color. The RAANet refers to the network that implements RAAConv layers for all modules. The RAANet-lite refers to the network that only uses RAAConv layers in the detection heads.

3.4 Anisotropic Gaussian Mask

The Anisotropic Gaussian Mask represents the centerness probability for each position in an annotated bbox. The centerness heatmap is generated as the ground truth for the classification head. In this work, we propose an Anisotropic Gaussian Mask, shown in Fig. 3, as the method for generating the 2D Gaussian distribution that is used as the centerness probability of a given oriented bbox. More specifically, the Gaussian distribution $\mathcal{N}(\mu, \Sigma)$ is designed in an anisotropic way: the variances of the two dimensions, denoted as $\sigma_l^2$ and $\sigma_w^2$, which are the diagonal entries of $\Sigma$, are different, and their values are determined from the length and the width of the bbox, respectively. The procedure of generating the center heatmap, denoted by $\mathbf{M}$, can be formulated as:

(6)

where,

(7)

and $\mu$ and $(l, w)$ denote the center location and the dimensions of a bbox, respectively; $c$ indicates the ground-truth class of the bbox; $\mathbb{1}[\cdot]$ is the indicator function and $\Omega$ denotes the set of locations inside the bbox; $\delta$ is a decay factor that decides the sharpness of the generated heatmap, and its value depends on the class of the given bbox.
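Equations (6) and (7) are not reproduced in this text; the sketch below therefore shows one plausible construction of a rotated, anisotropic Gaussian centerness mask consistent with the description above, with the exact scaling (here sigma = delta * dimension) being an assumption.

```python
import numpy as np

def anisotropic_gaussian_mask(grid_h, grid_w, cx, cy, length, width, yaw, delta=0.1):
    """Plausible construction of an anisotropic, rotated 2D Gaussian
    centerness mask for one oriented box; how the decay factor delta enters
    Eqs. (6)-(7) is assumed here for illustration.
    """
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    dx, dy = xs - cx, ys - cy
    # rotate offsets into the box frame so sigmas align with length/width
    c, s = np.cos(yaw), np.sin(yaw)
    u = c * dx + s * dy          # along the box length
    v = -s * dx + c * dy         # along the box width
    sigma_l, sigma_w = delta * length, delta * width
    heat = np.exp(-0.5 * ((u / sigma_l) ** 2 + (v / sigma_w) ** 2))
    inside = (np.abs(u) <= length / 2) & (np.abs(v) <= width / 2)
    return heat * inside          # indicator restricts mask to the bbox

mask = anisotropic_gaussian_mask(128, 128, cx=64, cy=64, length=40, width=18, yaw=0.6)
print(mask.max(), mask.shape)
```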

3.5 Loss Functions

Following Eq. (1), the center heatmap loss is defined as:

$$\mathcal{L}_{hm} = -\frac{1}{N}\sum_{p} \begin{cases} \left(1-\hat{Y}_p\right)^{\alpha}\log\left(\hat{Y}_p + \epsilon\right), & Y_p = 1, \\ \left(1-Y_p\right)^{\beta}\,\hat{Y}_p^{\alpha}\log\left(1-\hat{Y}_p + \epsilon\right), & \text{otherwise}, \end{cases} \tag{8}$$

where $\hat{Y}_p$ and $Y_p$ are the predicted and ground-truth heatmap values at pixel $p$, respectively, and $N$ is the number of object centers. $\epsilon$ is a small value for numerical stability. $\alpha$ and $\beta$ are the parameters that control the penalty-reduced focal loss. For all the experiments, we follow the settings of $\alpha$, $\beta$ and $\epsilon$ in [sun2021rsn].
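A minimal sketch of this penalty-reduced focal loss is given below; the default alpha/beta follow the common CenterNet choice and stand in for the values taken from [sun2021rsn].

```python
import torch

def penalty_reduced_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """CenterNet-style penalty-reduced focal loss for the center heatmap,
    matching the structure of Eq. (8).

    pred, gt: tensors of the same shape; gt holds Gaussian-smoothed targets.
    """
    pos = gt.eq(1.0).float()
    neg = 1.0 - pos
    pos_loss = ((1.0 - pred) ** alpha) * torch.log(pred + eps) * pos
    neg_loss = ((1.0 - gt) ** beta) * (pred ** alpha) * torch.log(1.0 - pred + eps) * neg
    num_pos = pos.sum().clamp(min=1.0)
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos

# usage: loss = penalty_reduced_focal_loss(torch.sigmoid(logits), target_heatmap)
```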

The bounding box head is responsible for location, dimension and orientation regression, which are represented by the following parameters: $(\delta_x, \delta_y, \delta_z, l, w, h, \theta)$, where $(\delta_x, \delta_y, \delta_z)$ are the box center offsets with respect to the corresponding voxel centers, and $l$, $w$, $h$ and $\theta$ denote the length, width, height and yaw of a bounding box. We apply smoothed L1 (SL1) losses to regress $\sin\theta$ and $\cos\theta$ for the $\theta$ attribute, and directly regress actual values for the other attributes:

$$\mathcal{L}_{\theta} = \mathrm{SL1}\left(\sin\theta, \sin\theta^{*}\right) + \mathrm{SL1}\left(\cos\theta, \cos\theta^{*}\right), \tag{9}$$
$$\mathcal{L}_{reg} = \sum_{r \,\in\, \{\delta_x, \delta_y, \delta_z, l, w, h\}} \mathrm{SL1}\left(r, r^{*}\right) + \mathcal{L}_{\theta}, \tag{10}$$

where $(\cdot)^{*}$ denotes the ground-truth value corresponding to the original symbol. The total loss for box regression is:

$$\mathcal{L}_{box} = \frac{1}{N_b}\sum_{j} \mathbb{1}\left[Y_j > \tau\right]\,\mathcal{L}_{reg}^{(j)}, \tag{11}$$

where $N_b$ is the number of box samples and $\mathbb{1}\left[Y_j > \tau\right]$ equals 1 if the ground-truth heatmap value of the $j$-th feature map pixel is larger than a threshold $\tau$, which is set to 0.2 in the experiments.
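The sketch below illustrates the masked smooth-L1 box regression loss in the spirit of Eqs. (9)-(11). The tensor layouts are assumptions: pred/target are (N, 8) with [dx, dy, dz, l, w, h, sin_yaw, cos_yaw], and center_heatmap holds the ground-truth heatmap value at each of the N feature-map pixels.

```python
import torch
import torch.nn.functional as F

def box_regression_loss(pred, target, center_heatmap, tau=0.2):
    """Masked smooth-L1 box regression loss; layouts are illustrative."""
    mask = (center_heatmap > tau).float()           # indicator from Eq. (11)
    per_box = F.smooth_l1_loss(pred, target, reduction="none").sum(dim=1)
    num_boxes = mask.sum().clamp(min=1.0)
    return (per_box * mask).sum() / num_boxes

# usage
pred = torch.randn(16, 8)
tgt = torch.randn(16, 8)
hm = torch.rand(16)
print(box_regression_loss(pred, tgt, hm))
```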

4 Experiments

We introduce the implementation details of RAANet and evaluate its performance in multiple experiments. The inference results are uploaded to and evaluated by the official nuScenes benchmark [nusceneslink]. A diagnostic analysis of each component is presented in the ablation studies.

4.1 NuScenes Dataset

We primarily perform the evaluation on the nuScenes dataset [nuscenes2019], which is a public, large-scale and challenging dataset for autonomous driving developed by the team at nuTonomy. The full dataset includes approximately 390k LiDAR sweeps and 1.4M object bounding boxes in 40k key frames. The dataset contains 1000 scenes in total, split into 700 for training, 150 for validation and 150 for testing. There are a total of 10 classes: car, truck, bus, trailer, construction vehicle, traffic cone, barrier, motorcycle, bicycle and pedestrian. The 20-second scenes are manually selected to show a diverse and interesting set of driving maneuvers, traffic situations and unexpected behaviors. The dataset was captured with a 32-channel spinning LiDAR sensor with a 360-degree horizontal and -30 to +10 degree vertical FOV. The sensor captures points within 70 meters with centimeter-level accuracy and returns up to 1.39M points per second.

Model RAAConv (Dual Branch / Attention / W. Init) 2D Gaussian ADLE AP (%) (Car / Pedestrian)
Baseline - - - - - - 85.23 84.62
85.30 84.71
85.40 84.95
+ RAAConv 85.44 85.19
85.51 85.30
85.64 85.41
+ 2D Gaussian 85.90 85.42
+ auxiliary loss 86.14 85.64
Table 2: Ablation study of the Range-Aware Attention Network. Various models are built with different combinations of three components: RAAConv, Anisotropic Gaussian Mask (2D Gaussian for short) and ADLE. For RAAConv, there are four sub-components: the dual-branch structure, the range-aware attention heatmap, the initialization of range-aware kernel weights (Sec. 3.2) and the learnable scalar (Fig. 2). The baseline is the original CenterPoint [yin2021center-point].

4.2 Implementation Details

RAANet is implemented in the PyTorch framework [paszke2017Pytorch] with an anchor-free object detection architecture. All classes are trained together in a single end-to-end network. The loss weights $\lambda_{box}$ and $\lambda_{aux}$ in Eq. (1), the voxelization resolution and region, the decay factor $\delta$ of Sec. 3.4 (one value for the large-size classes car, truck, bus, trailer and construction_vehicle, and another for the small-size classes traffic_cone, barrier, motorcycle, bicycle and pedestrian), and the focal-loss parameters of Sec. 3.5 are provided in the configuration file of our anonymous GitHub repository [our-code]. Two versions of RAANet are designed in this work: (i) the full version, RAANet, implements RAAConv layers in all modules; (ii) the lite version, RAANet-lite, implements RAAConv layers only in the detection heads. Both designs outperform the SOTA methods. Compared to RAANet, RAANet-lite has a faster inference speed, with a small drop in mAP performance.

4.3 Training and Inference

RAANet is trained end-to-end from scratch on 4 NVIDIA RTX-6000 GPUs. The ADAM optimizer [kingma2014adam] and the One-Cycle learning rate scheduler [smith2019onecycle] are used during training. A total of 20 epochs are trained and the batch size for each GPU is set to 4. Random flipping and global rotation around the z-axis, with a randomly sampled angle, are implemented as the augmentation strategies in the training process.
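For reference, the following is a minimal sketch of the flip and global-rotation augmentation; the flip axis and rotation-angle range are placeholders, since the paper's exact settings are not reproduced in the text.

```python
import numpy as np

def augment_points_and_boxes(points, boxes, rng=np.random):
    """Minimal flip + global-rotation augmentation sketch.

    points: (N, 3+) array; boxes: (M, 7) array [x, y, z, l, w, h, yaw].
    """
    pts, bxs = points.copy(), boxes.copy()
    # random flip over the x-axis (mirror y coordinates)
    if rng.rand() < 0.5:
        pts[:, 1] *= -1.0
        bxs[:, 1] *= -1.0
        bxs[:, 6] *= -1.0
    # global rotation around the z-axis by a random yaw angle
    theta = rng.uniform(-np.pi / 4, np.pi / 4)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    pts[:, :2] = pts[:, :2] @ rot.T
    bxs[:, :2] = bxs[:, :2] @ rot.T
    bxs[:, 6] += theta
    return pts, bxs
```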

During inference, we use the checkpoint with the best mean average precision (mAP) value and evaluate it with the official nuScenes detection metrics. A detection score threshold and an IoU threshold for non-maximum suppression are applied to the raw predictions. The inference latency of the full version of RAANet and of RAANet-lite is 62 ms and 45 ms, respectively. Both versions satisfy the real-time 3D detection criteria.

Figure 4: AP results at different loss weights for ADLE ($\lambda_{aux}$ in Eq. (1)). Experiments are based on the full version of RAANet.
Figure 5: AP results at different Gaussian decay factors, denoted as $\delta$ in Eq. (7). Experiments are based on the baseline.

4.4 Results

We present the results obtained on the test set with both our lite-version and full-version models in Table 1. The architecture of the full version model is illustrated in Fig. 2. All detection results are evaluated by average precision (AP) for 3D object detection. We choose CenterPoint [yin2021center-point] as the primary baseline for attribute comparison and ablation studies. Besides CenterPoint, we compare our work with other state-of-the-art models that have either published papers or open-source code. As can be seen in Table 1, RAANet outperforms all the other models in mAP, and RAANet-lite takes second place. For individual classes, RAANet achieves the best AP for 7 out of 10 classes, and takes 2nd place for one of the remaining 3 classes.

4.5 Ablation Studies

In this section, we provide results of the ablation studies for the components and critical parameters, which are the attention module, the decay factor $\delta$ (for the anisotropic Gaussian mask) and $\lambda_{aux}$ (the weight for the density level estimation loss). The models are evaluated on the nuScenes validation set. For computational efficiency, we perform all experiments in this section on the 'Car' and 'Pedestrian' classes to represent large and small objects, respectively.

Figure 6: Qualitative results. (Please zoom-in for details). For each image pair, images on the left are the results inferred from the baseline, and the images on the right are the results from the proposed RAANet. Red and blue boxes denote the ground truth and predictions, respectively.

Anisotropic Gaussian Mask. We evaluate the performance of the anisotropic 2D Gaussian mask by varying the decay factor $\delta$. The results in Fig. 5 show that the variation of the decay factor has a smaller impact on the pedestrian class than on the car class. The best-performing values of $\delta$ for large and small objects are used in the main experiments.

Auxiliary Density Level Estimation Module. The AP results obtained with various weights of the auxiliary loss ($\lambda_{aux}$) are shown in Fig. 4. The loss weight is varied over a range of values, and the best-performing $\lambda_{aux}$ is used for the main experiments.

Effects of different components. Tab. 2 shows the performance of the baseline as well as several variants of the Range-Aware Attention Network. For RAAConv, there are four sub-components: the dual-branch structure, the range-aware attention heatmap, the initialization of range-aware kernel weights (Sec. 3.2) and the learnable scalar (Fig. 2). The combination of those four sub-components generates the optimal gain for RAAConv, and each aforementioned component contributes to the improvement of detection performance. It should be noted that the pedestrian AP shows only a marginal improvement with the anisotropic 2D Gaussian mask. One probable reason is that the annotated bboxes for the pedestrian class have similar width and length, which makes the anisotropic 2D Gaussian mask fall back to the original isotropic 2D Gaussian mask.

Models RAAConv 2D Gaus. ADLE AP(%)
Car Ped.
Baseline 83.98 82.80
RAANet 84.32 82.82
84.43 83.61
84.69 83.62
84.80 83.79
Table 3: AP results for RAANet with the PointPillars backbone. The baseline and comparison models are built with the backbone proposed by PointPillars. Comparison models are composed of various combinations of RAAConv, the 2D Gaussian Mask and ADLE.

4.6 Generalizability

The proposed RAANet is further evaluated using a different backbone, taken from PointPillars. The pillar size and perception range follow [lang2019pointpillars]. The baseline is trained with the original CenterPoint-PointPillars version. As shown in Tab. 3, our proposed RAANet still generates competitive performance using a different backbone.

4.7 Qualitative Analysis

Pairs of example outputs obtained by the baseline and the proposed RAANet are provided in Fig. 6 for qualitative comparison. The sample data is selected from the nuScenes validation set. In each pair, the images on the left are the results inferred from the baseline, and the images on the right are the results from the proposed RAANet. It can be observed that RAANet increases true positives and suppresses false positives in far and occluded regions. Fig. 1 in the supplementary material shows more visualization examples of all-class detection results of RAANet, as well as the attention heatmaps generated by RAAConv.

5 Conclusion

In this paper, we have introduced the Range-Aware Attention Network (RAANet) to improve the performance of 3D object detection from LiDAR point clouds. The motivation behind RAANet is that, in bird's-eye view (BEV) LiDAR images, objects appear very different at various distances to the ego-vehicle, and thus there is a need to avoid purely shared-weight convolutional feature extractors. In particular, we have leveraged position and range encodings to design the RAAConv layer with two independent convolution branches, which separately focus on sparse and dense feature extraction. Moreover, the Auxiliary Density Level Estimation Module (ADLE) has been proposed to further help RAANet extract occlusion information of objects during training. Since RAAConv sets the channel number for each branch to half of the final output channels, and ADLE is not attached during inference, the proposed RAANet is able to run in real time. Evaluations on the nuScenes dataset have shown that our proposed RAANet outperforms the SOTA works. The code is available at the following anonymous link: https://github.com/anonymous0522/RAAN.

References