Pyramid R-CNN: Towards Better Performance and Adaptability for 3D Object Detection

09/06/2021 · by Jiageng Mao, et al.

We present a flexible and high-performance framework, named Pyramid R-CNN, for two-stage 3D object detection from point clouds. Current approaches generally rely on the points or voxels of interest for RoI feature extraction in the second stage, but cannot effectively handle the sparsity and non-uniform distribution of those points, which may result in failures in detecting objects that are far away. To resolve these problems, we propose a novel second-stage module, named pyramid RoI head, to adaptively learn the features from the sparse points of interest. The pyramid RoI head consists of three key components. Firstly, we propose the RoI-grid Pyramid, which mitigates the sparsity problem by extensively collecting points of interest for each RoI in a pyramid manner. Secondly, we propose RoI-grid Attention, a new operation that can encode richer information from sparse points by incorporating conventional attention-based and graph-based point operators into a unified formulation. Thirdly, we propose the Density-Aware Radius Prediction (DARP) module, which can adapt to different point density levels by dynamically adjusting the focusing range of RoIs. Combining the three components, our pyramid RoI head is robust to sparse and imbalanced point distributions, and can be applied upon various 3D backbones to consistently boost the detection performance. Extensive experiments show that Pyramid R-CNN outperforms the state-of-the-art 3D detection models by a large margin on both the KITTI dataset and the Waymo Open dataset.


1 Introduction

3D object detection is a key component of perception systems for robotics and autonomous driving, aiming at detecting vehicles, pedestrians, and other objects with 3D point clouds as input. In this paper, we propose a general two-stage 3D detection framework, named Pyramid R-CNN, which can be applied on multiple 3D backbones to enhance the detection adaptability and performance.

Figure 1: Statistical results on the KITTI dataset. Blue bars denote the distribution of the number of object points. Orange bars denote the distribution of the number of points gathered by RoIs in Pyramid R-CNN. Our approach can mitigate the sparsity and imbalanced distribution problems of point clouds.

Among the existing 3D detection frameworks, two-stage detection models [40, 31, 28, 5, 29] surpass most single-stage 3D detectors [46, 38, 14, 20, 30] by remarkable margins owing to the RoI refinement stage. Different from their 2D counterparts [8, 9, 27, 11, 2], which apply RoIPool [9] or RoIAlign [11] to crop dense feature maps in the second stage, 3D detection models generally perform various RoI feature extraction operations on the Points of Interest. For example, Point R-CNN [30] utilizes a point-based backbone to generate 3D proposals, treats the points near the proposals as Points of Interest, and applies Region Pooling on those sparse points for box refinement; Part-A^2 Net [31] utilizes a voxel-based backbone for proposal generation, uses the upsampled voxel points as Points of Interest, and applies sparse convolutions on those voxel points for each RoI; PV-RCNN [28] encodes the whole scene into a set of keypoints and utilizes the keypoints as Points of Interest for RoI-grid Pooling. Those Points of Interest originate from raw point clouds and contain rich fine-grained information, which is required for the RoI refinement stage.

However, the Points of Interest inevitably suffer from the sparsity and non-uniform distribution of input point clouds, as demonstrated by the statistical results on the KITTI dataset [7] in Figure 1: 1) Point clouds of certain objects can be quite sparse. A substantial fraction of objects contain fewer than 10 points, and their visualized shapes are mostly incomplete, so it is hard to identify their categories without enough context information. 2) The distribution of object points is extremely imbalanced. The number of points per object varies over a very wide range on KITTI, and current RoI operations cannot handle the imbalanced conditions effectively. 3) The Points of Interest account for only a small proportion of the input points or voxels (e.g., the keypoints in [28] relative to the total input points), which exacerbates the above problems.

To overcome the above limitations, we propose Pyramid R-CNN, a general two-stage 3D detection framework that can effectively detect objects and adapt to environmental changes. Our main contribution lies in the design of a novel RoI feature extraction head, named the pyramid RoI head, which can be applied on multiple 3D backbones and Points of Interest. The pyramid RoI head consists of three key components. Firstly, we propose the RoI-grid Pyramid. Given the observation that the Points of Interest inside RoIs are too sparse for object recognition, our RoI-grid Pyramid captures more Points of Interest outside RoIs while still maintaining fine-grained geometric details, by extending the standard one-level RoI-grid to a pyramid structure. Secondly, we propose RoI-grid Attention, an effective operation to extract RoI-grid features from Points of Interest. RoI-grid Attention leverages the advantages of the graph-based and attention-based point operators by combining their formulas into a unified formulation, and it can adapt to different sparsity situations by dynamically attending to the crucial Points of Interest near the RoIs. Thirdly, we propose the Density-Aware Radius Prediction (DARP) module, which predicts the feature extraction radius of each RoI, conditioned on the neighboring distribution of Points of Interest. Thus we can address the imbalanced distribution problem by adaptively adjusting the focusing range for each RoI. Combining all the above components, the pyramid RoI head shows adaptability to different point cloud sparsity levels and can accurately detect 3D objects with only a few points. Our Pyramid R-CNN is compatible with the point-based [30], voxel-based [31], and point-voxel-based [28] frameworks, and significantly boosts the detection accuracy.

We summarize our key contributions as follows:
1) We propose Pyramid R-CNN, a general two-stage framework that can be applied on multiple backbones for accurate and robust 3D object detection.
2) We propose the pyramid RoI head, which combines the RoI-grid Pyramid, RoI-grid Attention, and the Density-Aware Radius Prediction (DARP) module together to mitigate the sparsity and non-uniform distribution problems.
3) Pyramid R-CNN consistently outperforms the baselines, achieves state-of-the-art moderate car mAP on the KITTI dataset, and leads the LiDAR-only methods on the Waymo test leaderboard for vehicle detection.

2 Related Work

Single-stage 3D Object Detection. Single-stage methods can be divided into three streams: point-based, voxel-based, and pillar-based. The point-based single-stage detectors generally take the raw points as input and apply set abstraction [26, 21] to obtain the point features for box prediction. 3DSSD [39] introduces Feature-FPS as a new sampling strategy for raw point clouds. Point-GNN [32] proposes a graph operator to aggregate point information for object detection. The voxel-based single-stage detectors typically rasterize point clouds into voxel grids and then apply 2D and 3D CNNs to generate 3D proposals. VoxelNet [46] divides points into voxels and leverages a 3D CNN to aggregate voxel features for proposal generation. SECOND [38] improves the voxel feature learning process by introducing 3D sparse convolutions. CenterPoint [42] proposes a center-based assignment that can be applied on feature maps for accurate location prediction. Pillar-based approaches generally transform point clouds into Bird's-Eye-View (BEV) pillars and apply 2D CNNs for 3D object detection. PointPillars [14] is the first work that introduces the pillar representation. The pillar-based network of [36] extends the idea by proposing a cylindrical view projection. Unlike the two-stage approaches, the single-stage methods cannot benefit from fine-grained point information, which is crucial for accurate box prediction.

Two-stage 3D object detection. Two-stage approaches can be divided into three streams based on the representation of the Points of Interest: point-based, voxel-based, and point-voxel-based. Point-based approaches treat the sampled point clouds as Points of Interest. PointRCNN [30] generates 3D proposals from raw point clouds and proposes Region Pooling to extract RoI features for the second-stage refinement. STD [40] proposes a sparse-to-dense strategy and uses the PointsPool operation for RoI refinement. Voxel-based methods use the voxel points from 3D CNNs as Points of Interest. Part-A^2 Net [31] applies 3D sparse convolutions on the upsampled voxel points for RoI refinement. Voxel R-CNN [5] utilizes Voxel RoI Pooling to extract RoI features from voxels. Point-voxel-based approaches use the keypoints that encode the whole scene as Points of Interest. PV-RCNN [28] designs RoI-grid Pooling to aggregate keypoint features near RoIs. PV-RCNN++ [29] proposes Vector-Pooling to efficiently collect keypoint features from different orientations. Compared with the previous methods, our Pyramid R-CNN shows better performance and robustness, and is compatible with all the representations of Points of Interest.

Figure 2: The overall architecture. Our Pyramid R-CNN can be plugged onto diverse backbones (point-based, voxel-based, and point-voxel-based networks), which generate 3D proposals and Points of Interest (yellow points) in the first stage. In the second stage, we propose the pyramid RoI head, which can be applied upon the 3D proposals and Points of Interest. In the pyramid RoI head, an RoI-grid Pyramid is first built to capture more context information. Then, for each RoI-grid point (red point), a focusing radius (red dashed circle) is learned by the Density-Aware Radius Prediction module. Finally, RoI-grid Attention is performed on the Points of Interest within this radius for box refinement.

3 Pyramid R-CNN

In this section, we detail the design of Pyramid R-CNN, a general two-stage framework for 3D object detection. We first introduce the overall architecture in Section 3.1. Then we introduce the three key components of the pyramid RoI head: the RoI-grid Pyramid in Section 3.2, RoI-grid Attention in Section 3.3, and the Density-Aware Radius Prediction (DARP) module in Section 3.4.

3.1 Overall Architecture

Here, we present a new two-stage framework for accurate and robust 3D object detection, named Pyramid R-CNN, as shown in Figure 2. The framework is compatible with multiple backbones: the point-based backbone, the voxel-based backbone, or the point-voxel-based backbone. In the first stage, those backbones output 3D proposals and the corresponding Points of Interest: the point clouds near RoIs in [30], the upsampled voxels in [31], and the keypoints in [28]. In the second stage, we propose a novel pyramid RoI head, which consists of three key components: the RoI-grid Pyramid, RoI-grid Attention, and the Density-Aware Radius Prediction (DARP) module. For each RoI, we first build an RoI-grid Pyramid by gradually enlarging the size of the original RoI at each pyramid level; the coordinates of the RoI-grid points are determined by the enlarged RoI size and the grid size. At each pyramid level, the focusing radius of the RoI-grid points is predicted from a global context vector by the Density-Aware Radius Prediction module. Then RoI-grid Attention, parameterized by this radius, is performed to aggregate the features of the Points of Interest into the RoI-grids. Finally, the RoI-grid features are enhanced and fed into two individual heads for classification and regression. We describe those key components in the following sections.

3.2 RoI-grid Pyramid

In this section, we present the RoI-grid Pyramid, a simple and effective module that captures rich context while still maintaining internal structural information. Different from the 2D feature pyramid [18], which hierarchically encodes context information upon dense backbone features, our RoI-grid Pyramid is applied on each RoI by gradually placing grid points outside the RoI in a pyramid manner. The idea behind this design is based on the observation that image features inside RoIs generally contain sufficient semantic context, while point clouds inside RoIs contain quite limited information, since object points are naturally sparse and incomplete. Even though each point has a large receptive field, the sparse compositional 3D shapes inside RoIs are hard to recognize. In the following parts we introduce the detailed formulation.

RoI feature extraction generally relies on an RoI-grid for each RoI, and the RoI-grid points collect the features of adjacent pixels or neighboring Points of Interest in the 2D or 3D case respectively. Suppose we have an RoI with a given width, length, and height and a given bottom-left corner; in the standard RoI-grid representation, the RoI-grid point locations can be computed as:

(1)

where the grid sizes in the three dimensions determine the number of grid points, and all grid points are generated inside the RoI.
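To make the construction concrete, here is a minimal NumPy sketch of the standard RoI-grid; the helper name, the argument names, and the cell-center placement are our own assumptions, since the original equation is not reproduced in this copy.

```python
import numpy as np

def standard_roi_grid(roi_corner, roi_dims, grid_size=(6, 6, 6)):
    """Place RoI-grid points at the cell centers of a regular grid inside one RoI.

    roi_corner: (3,) bottom-left corner of the RoI.
    roi_dims:   (3,) width, length, height of the RoI.
    grid_size:  number of grid cells along each of the three dimensions.
    """
    roi_corner = np.asarray(roi_corner, dtype=np.float32)
    roi_dims = np.asarray(roi_dims, dtype=np.float32)
    # Normalized cell-center offsets in (0, 1) along each axis.
    axes = [(np.arange(n) + 0.5) / n for n in grid_size]
    gx, gy, gz = np.meshgrid(*axes, indexing="ij")
    offsets = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)  # (Nx*Ny*Nz, 3)
    return roi_corner + offsets * roi_dims  # all grid points lie inside the RoI
```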

Utilizing features only inside the RoIs works well in 2D detection models, mainly owing to two facts: the input feature map is dense, and the collected pixels have large receptive fields. However, the case is different in 3D models. As shown in Figure 3, the Points of Interest are naturally sparse and non-uniformly distributed inside the RoIs, and the object shape is extremely incomplete. Thus it is hard to accurately infer the sizes and categories of objects by solely collecting the features of a few individual points without referring to enough neighboring information.

To resolve the above problems, we propose the RoI-grid Pyramid, which balances fine-grained and large-scale context information. The detailed structure is shown in Figure 3. The key idea is to construct a pyramid grid structure that contains RoI-grid points both inside and outside RoIs, so that the grid points inside RoIs can capture fine-grained shape structures for accurate box refinement, while the grid points outside RoIs can obtain large context information to identify incomplete objects. The grid points for a pyramid level can be computed as:

(2)

where the enlarging ratio scales the original RoI size. The ratio starts from 1 at the bottom level to maintain fine-grained details, and becomes larger as the level goes higher to capture more context information. The grid size is initialized with the same value as the original one at the bottom level and gets smaller at higher levels to save computational resources. For each pyramid level, the features of the grid points are then aggregated by RoI-grid Attention from the features of the Points of Interest. Finally, the features of all pyramid levels are combined for box refinement.
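A sketch of the pyramid construction follows, reusing the standard_roi_grid helper from the previous sketch; enlarging is assumed to keep the RoI center fixed, and the ratios and grid sizes here are placeholders rather than the exact configuration reported in Section 4.

```python
def pyramid_roi_grid(roi_center, roi_dims,
                     ratios=(1.0, 1.0, 1.5, 2.0, 4.0),
                     grid_sizes=((6, 6, 6), (4, 4, 4), (4, 4, 4), (4, 4, 4), (1, 1, 1))):
    """Build RoI-grid points for every level of the RoI-grid Pyramid of one RoI.

    Each level enlarges the RoI dimensions by its ratio (center kept fixed) and uses
    a coarser grid at higher levels to save computation.
    """
    roi_center = np.asarray(roi_center, dtype=np.float32)
    roi_dims = np.asarray(roi_dims, dtype=np.float32)
    levels = []
    for ratio, grid_size in zip(ratios, grid_sizes):
        dims = roi_dims * ratio
        corner = roi_center - dims / 2.0
        levels.append(standard_roi_grid(corner, dims, grid_size))
    return levels  # list of (N_level, 3) grid-point arrays, one per pyramid level
```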

(a) standard RoI-grid
(b) RoI-grid Pyramid
(c) object/context points in (a)
(d) object/context points in (b)
Figure 3: Illustration of the RoI-grid Pyramid. Red points in (a) are the RoI-grid points, and different colors represent different pyramid levels in (b). In (c) and (d), red points are object points and blue points are context points captured by the RoI. Compared to the standard RoI-grid, our RoI-grid Pyramid can capture more context points while maintaining fine-grained internal structures; by looking at the neighboring vehicle and traffic sign (blue context points) outside the RoI, the cluster of red object points is easier to recognize as a car.

3.3 RoI-grid Attention

In this section, we introduce RoI-grid Attention, a novel RoI feature extraction operation that combines the state-of-the-art graph-based and attention-based point operators [37, 35, 44] into a unified framework, and RoI-grid Attention can serve as a better substitute for conventional pooling-based operations [28, 5, 29] in 3D detection models. We first discuss the formulas of pooling-based, graph-based and attention-based point operators, and then we derive the formulation of RoI-grid Attention.

Preliminary. Consider an RoI-grid point with a given coordinate, and let the Points of Interest near it be given by their coordinates and corresponding feature vectors. The RoI feature extraction operation aims to obtain the feature vector of the RoI-grid point, using the information of the neighboring coordinates and features.

Pooling-based Operators. The pooling-based operators are extensively applied for RoI feature extraction in most two-stage 3D detection models [28, 5, 29]. The neighboring feature and the relative location are first concatenated and passed through an MLP layer to obtain a transformed feature vector, and then a max-pooling operation is applied over all the transformed features to obtain the RoI-grid feature:

(3)

where the pooling is taken over the Points of Interest within a fixed radius of the RoI-grid point. The pooling-based operators only retain the maximum channel response, which results in a loss of much semantic and geometric information.
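As a baseline for the operators discussed next, here is a minimal PyTorch sketch of such a pooling-based aggregation for a single grid point; the MLP width and the fixed radius are placeholder values, and the class name is ours.

```python
import torch
import torch.nn as nn

class PoolingRoIFeature(nn.Module):
    """Max-pool the MLP-transformed features of Points of Interest around one grid point."""

    def __init__(self, feat_dim, out_dim=64):
        super().__init__()
        # Input: point feature concatenated with its offset to the grid point.
        self.mlp = nn.Sequential(nn.Linear(feat_dim + 3, out_dim), nn.ReLU())
        self.out_dim = out_dim

    def forward(self, grid_point, points, feats, radius=1.0):
        # points: (N, 3) coordinates, feats: (N, C) features of the Points of Interest.
        offsets = points - grid_point                 # relative locations
        mask = offsets.norm(dim=-1) < radius          # keep points inside the fixed sphere
        if not mask.any():
            return feats.new_zeros(self.out_dim)
        x = self.mlp(torch.cat([feats[mask], offsets[mask]], dim=-1))
        return x.max(dim=0).values                    # only the maximum channel response survives
```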

Graph-based Operators. Graph-based operators model the grid points and the Points of Interest as a graph. Each graph node represents the transformed feature of a Point of Interest, and each edge can be formulated as a linear projection of the location difference between two nodes. For the graph node of a grid point, the feature is collected from the adjacent nodes by a weighted combination. Following the same notation as Eq. 3, the general formula can be represented as

(4)

where a weighting function projects the graph edge embedding into a scalar or vector weight space, and the combination denotes either the Hadamard product, dot product, or scalar-vector product between the learned weights and the graph nodes.
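Since the display equations are not reproduced in this copy, a generic form consistent with the description above, using notation we introduce here (g for the grid-point coordinate, p_i and f_i for a neighboring Point of Interest and its feature, r for the radius), would be

\[ f_g \;=\; \sum_{i\,:\,\|p_i - g\| < r} W\!\big(E(g, p_i)\big) \,\odot\, V(f_i), \qquad E(g, p_i) = W_e\,(g - p_i), \quad V(f_i) = W_v\, f_i , \]

where W(·) maps the edge embedding to a scalar or vector weight and ⊙ stands for the chosen product (Hadamard, dot, or scalar-vector). This is a reconstruction for readability, not the paper's exact Eq. 4.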

Attention-based Operators. Attention-based operators can also be applied upon the grid points and the Points of Interest. The edge embedding in Eq. 4 can be viewed as the query embedding from the grid point to a Point of Interest, the transformed point feature in Eq. 4 serves as the value embedding, and the key embedding can be formulated as another projection of the point feature. Thus standard attention can be formulated as

(5)

An additional normalization function, softmax, is applied to the attention weights. The recently proposed Point Transformer [44] extends the idea of standard attention, and its formula can be represented as

(6)
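For reference, standard attention over the same neighborhood, again in notation we introduce (Q, K, V for the query, key, and value projections), takes the form

\[ f_g \;=\; \sum_{i\,:\,\|p_i - g\| < r} \mathrm{softmax}_i\!\big( Q(g, p_i)^{\top} K(f_i) \big)\, V(f_i), \]

and a Point-Transformer-style variant replaces the dot product with a vector attention over the difference of the query and key embeddings,

\[ f_g \;=\; \sum_{i\,:\,\|p_i - g\| < r} \mathrm{softmax}_i\!\big( \gamma\big( Q(g, p_i) - K(f_i) \big) \big) \odot V(f_i), \]

where γ is a small MLP. Both are simplified reconstructions rather than the paper's exact Eq. 5 and Eq. 6.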
Figure 4: Illustration of RoI-grid Attention. RoI-grid Attention introduces learnable gated functions to dynamically select the attention components, and it provides a unified formulation that includes the conventional graph and attention operators.

RoI-grid Attention. In our approach, we analyze the structural similarity of Eq. 4, Eq. 5, and Eq. 6, and find that those formulas share common basic elements and operators. It is therefore natural to merge them into a unified framework with gated functions. We name this new formulation RoI-grid Attention:

(7)

where each gate is a learnable gated function that can be implemented as a linear projection of the respective embedding followed by a sigmoid activation. RoI-grid Attention is a generalized formulation combining graph-based and attention-based operations: the graph operator of Eq. 4, the standard attention of Eq. 5, and the Point Transformer of Eq. 6 can all be derived from Eq. 7 by fixing the gates to particular binary values.

RoI-grid Attention is a flexible and effective operation for RoI feature extraction. With the learnable gated functions, RoI-grid Attention is able to learn which points are significant to the RoI-grid points, from the geometric information, the semantic information, and their combinations, adaptively. With the gates, RoI-grid Attention can also learn to balance the ratio of geometric features and semantic features used in feature aggregation. Compared with the pooling-based methods, only a few linear projection layers are added in RoI-grid Attention, which maintains the computational efficiency. Replacing pooling-based operators with RoI-grid Attention consistently boosts the detection performance.
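Below is a compact single-head PyTorch sketch of the gated aggregation described above. It is a plausible reading of the description, not the reference implementation: the exact query/key/value projections, the placement of the gates, and the optional weights argument (used later for the differentiable radius of Section 3.4) are our assumptions.

```python
import torch
import torch.nn as nn

class RoIGridAttention(nn.Module):
    """Gated attention that aggregates Points of Interest into one RoI-grid point (single head)."""

    def __init__(self, feat_dim, embed_dim=64):
        super().__init__()
        self.q = nn.Linear(3, embed_dim)          # query from the relative location (geometric term)
        self.k = nn.Linear(feat_dim, embed_dim)   # key from the point feature (semantic term)
        self.v = nn.Linear(feat_dim, embed_dim)   # value from the point feature
        # Learnable gates: sigmoid-activated linear projections of the respective embeddings.
        self.gate_q = nn.Linear(embed_dim, 1)
        self.gate_k = nn.Linear(embed_dim, 1)

    def forward(self, grid_point, points, feats, weights=None):
        # points: (N, 3), feats: (N, C); weights: optional (N,) soft radius coefficients.
        q = self.q(points - grid_point)                     # (N, E)
        k = self.k(feats)                                   # (N, E)
        v = self.v(feats)                                   # (N, E)
        gq = torch.sigmoid(self.gate_q(q))                  # gate on the geometric term, (N, 1)
        gk = torch.sigmoid(self.gate_k(k))                  # gate on the semantic term,  (N, 1)
        logits = (gq * q * gk * k).sum(dim=-1)              # gated query-key interaction, (N,)
        attn = torch.softmax(logits, dim=0)
        if weights is not None:                             # soft reweighting by the predicted radius
            attn = attn * weights
        return (attn.unsqueeze(-1) * v).sum(dim=0)          # aggregated feature of this grid point, (E,)
```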

3.4 Density-Aware Radius Prediction

In this section, we investigate the learning problem of the radius, which determines the range of neighboring Points of Interest that participate in the feature extraction process. The radius is a hyper-parameter used in all the point operators of Section 3.3, and in previous approaches it has to be determined manually. A fixed and predefined radius cannot adapt to the density changes of point clouds, and may lead to empty spherical ranges if not set properly. In this paper, we make the prediction of the radius a fully-differentiable process and further propose the Density-Aware Radius Prediction (DARP) module, aiming at learning an adaptive neighborhood for RoI feature extraction. We first introduce the general formulation of RoI-grid Attention from a probabilistic perspective. Next, we propose a novel method to make the learning of the radius differentiable. Finally, we introduce the design of the DARP module.

RoI-grid Attention is composed of two steps: it first selects the Points of Interest within the radius, and then performs a weighted combination on those points. With the same notation as Section 3.3, we can reformulate the first step as sampling from a conditional distribution:

(8)

Then the second step can be represented as calculating the probabilistic expectation:

(9)

where the notation is slightly abused for brevity.

We propose a new probability distribution as a substitute for the original one, and it should satisfy two requirements: i) it should have similar characteristics to the original distribution, which means that most points sampled from it should lie inside the radius; ii) it should also leave a few points outside the radius, mainly for the exploration of the surrounding environment. Thus we formulate the probability as:

(10)

where the temperature controls the decay rate of the probability. With a small temperature, the probability is close to 1 when a point is inside the radius and close to 0 when it is outside, while near the spherical boundary the sampling probability lies between 0 and 1. With this smooth approximation, we want to compute the gradient of the radius from the approximated RoI-grid Attention:

(11)

However, taking the derivative is still infeasible, since we cannot directly compute the gradient with respect to the parameters of the distribution we sample from. The reparameterization trick [12] offers a possible solution to the problem. The key insight is to sample from a basic distribution and then move the original distribution parameters inside the expectation as coefficients. The gradient of the radius can be computed as:

(12)

where the coefficient is the same as in Eq. 10, and the theoretical distribution means that points can be sampled from the whole 3D space. In practice, considering the fact that the coefficient is close to 0 when a point lies far outside the radius, we apply an approximation and restrict the sampling range to a sphere with a radius slightly larger than the predicted one in our experiments. This approximation reduces the computational overhead to the same level as vanilla RoI-grid Attention. Since the coefficient is a differentiable function of the radius, we are able to compute the gradient of the radius in a differentiable manner using Eq. 12. The new formulation of RoI-grid Attention can be represented as

(13)

Compared with vanilla RoI-grid Attention in Eq. 7, a slightly larger sampling range is used and a radius-dependent coefficient is added to the original formula, which costs few additional resources. Although several approximations are applied, we found in our experiments that they did not hamper training but instead boosted performance.
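A short sketch of the soft radius coefficient and its use as a reweighting of the attention (reusing the RoIGridAttention sketch from Section 3.3) is given below; the sigmoid form and the slightly enlarged sampling range follow the description above, while the temperature value and the enlargement factor are placeholder assumptions.

```python
import torch

def soft_radius_coefficient(dists, radius, temperature=0.1):
    """Sigmoid-shaped coefficient: close to 1 inside the radius, close to 0 outside.

    dists:  (N,) distances from the grid point to the Points of Interest.
    radius: scalar tensor predicted by the DARP module; kept in the graph so that
            gradients flow back into the radius predictor.
    """
    return torch.sigmoid((radius - dists) / temperature)

def darp_roi_grid_attention(attn_module, grid_point, points, feats, radius, enlarge=1.2):
    """Run RoI-grid Attention over a slightly enlarged neighborhood, reweighted by the coefficient."""
    dists = (points - grid_point).norm(dim=-1)
    mask = dists < radius.detach() * enlarge      # hard selection on a slightly larger sphere (no gradient)
    coeff = soft_radius_coefficient(dists[mask], radius)
    return attn_module(grid_point, points[mask], feats[mask], weights=coeff)
```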

Figure 5: Illustration of dynamic radius predicted by the Density-Aware Radius Prediction module. For each RoI, an adaptive focusing radius is learned based on the sparsity conditions.

We further propose the DARP module based on Eq. 13. For each pyramid level, a context embedding is obtained by summarizing the information of the Points of Interest near the RoI, and the embedding is then utilized to predict the radius shared by all grid points in this level. The radius is further transformed into a coefficient by Eq. 10 and participates in the computation of RoI-grid Attention. Since the context embedding captures point cloud information such as density and shape, the predicted radius is able to adapt to environmental changes and is more robust than its hand-crafted counterpart.

4 Experiments

In this section, we evaluate our Pyramid R-CNN on the commonly used Waymo Open dataset [33] and the KITTI dataset [7]. We first introduce the experimental settings in Section 4.1, and then compare our approach with previous state-of-the-art methods on the Waymo Open dataset in Section 4.2 and the KITTI dataset in Section 4.3. Finally, we conduct ablation studies to evaluate the efficacy of each component in Section 4.4.

4.1 Experimental Setup

Methods LEVEL_1 LEVEL_2 LEVEL_1 3D mAP/mAPH by Distance
3D mAP/mAPH 3D mAP/mAPH 0-30m 30-50m 50m-Inf
PointPillars [14] 63.3/62.7 55.2/54.7 84.9/84.4 59.2/58.6 35.8/35.2
MVF [45] 62.93/- - 86.30/- 60.02/- 36.02/-
Pillar-OD [36] 69.8/- - 88.5/- 66.5/- 42.9/-
AFDet [6] 63.69/- - 87.38/- 62.19/- 29.27/-
LaserNet [22] 52.1/50.1 - 70.9/68.7 52.9/51.4 29.6/28.6
CVCNet [3] 65.2/- - 86.80/- 62.19/- 29.27/-
StarNet [23] 64.7/56.3 45.5/39.6 83.3/82.4 58.8/53.2 34.3/25.7
RCD [1] 69.0/68.5 - 87.2/86.8 66.5/66.1 44.5/44.0
Voxel R-CNN [5] 75.59/- 66.59/- 92.49/- 74.09/- 53.15/-
PointRCNN [30] 45.05/44.25 37.41/36.74 72.24/71.31 31.21/30.41 23.77/23.15
Pyramid-P (ours) 47.02/46.58 39.10/38.76 74.24/73.78 32.49/31.96 25.68/25.24
Part-A^2 Net [31] 71.69/71.16 64.21/63.70 91.83/91.37 69.99/69.37 46.26/45.41
Pyramid-V (ours) 75.83/75.29 66.77/66.28 92.63/92.20 74.46/73.84 53.40/52.44
PV-RCNN [28] 70.3/69.7 65.4/64.8 91.9/91.3 69.2/68.5 42.2/41.3
Pyramid-PV (ours) 76.30/75.68 67.23/66.68 92.67/92.20 74.91/74.21 54.54/53.45
Table 1: Performance comparison on the Waymo Open Dataset with 202 validation sequences for vehicle detection. Marked baselines are re-implemented by ourselves with the official code.
Methods LEVEL_1 LEVEL_2 LEVEL_1 3D mAP/mAPH by Distance
3D mAP/mAPH 3D mAP/mAPH 0-30m 30-50m 50m-Inf
CenterPoint [42] 81.05/80.59 73.42/72.99 92.52/92.13 79.94/79.43 61.06/60.42
PV-RCNN [28] 81.06/80.57 73.69/73.23 93.40/92.98 80.12/79.57 61.22/60.47
Pyramid-PV (ours) 81.77/81.32 74.87/74.43 93.19/92.80 80.53/80.04 64.55/63.84
Table 2: Performance comparison on the Waymo Open Dataset test leaderboard for vehicle detection. The listed test submissions are modified versions of the original architectures; for ours, we append another frame following [28] and use a larger voxel backbone.

Waymo Open Dataset. The Waymo Open Dataset contains 1,000 sequences in total, including 798 sequences (around 158k point cloud samples) in the training set and 202 sequences (around 40k point cloud samples) in the validation set. The official evaluation metrics are the standard 3D mean Average Precision (mAP) and mAP weighted by heading accuracy (mAPH). Both metrics are based on an IoU threshold of 0.7 for vehicles and 0.5 for other categories. The testing samples are split in two ways. The first way is based on the distances of objects to the sensor: 0-30m, 30-50m, and 50m-Inf. The second way is according to the difficulty levels: LEVEL_1 for boxes with more than five LiDAR points and LEVEL_2 for boxes with at least one LiDAR point.

KITTI Dataset. The KITTI dataset contains 7,481 training samples and 7,518 test samples, and the training samples are further divided into the train split (3,712 samples) and the val split (3,769 samples). The official evaluation metric is mean Average Precision (mAP) with a rotated IoU threshold of 0.7 for cars. On the test set, mAP is calculated with 40 recall positions by the official server. The results on the val set are calculated with 11 recall positions for a fair comparison with other approaches.

We provide three architectures of Pyramid R-CNN, compatible with the point-based, the voxel-based, and the point-voxel-based backbone, respectively. We refer readers to [34] for the detailed design of those backbones.

Pyramid-P. Pyramid R-CNN for Points is built upon the point-based method PointRCNN [30]. In particular, we replace the Canonical 3D Box Refinement module of PointRCNN with our proposed pyramid RoI head, and we still use the sampled points in [30] as Points of Interest. The point cloud backbone and other configurations are kept the same for a fair comparison.

Pyramid-V. Pyramid R-CNN for Voxels is built upon the voxel-based method Part-A^2 Net [31]. Specifically, we replace the 3D sparse convolutional head of Part-A^2 Net with our proposed pyramid RoI head, and we still use the upsampled voxels as Points of Interest. The voxel-based backbone and other configurations are kept the same for a fair comparison.

Pyramid-PV. Pyramid R-CNN for Point-Voxels is built upon the point-voxel-based method PV-RCNN [28]. In particular, we replace the RoI-grid Pooling module of PV-RCNN with our proposed pyramid RoI head, and we still use the keypoints as Points of Interest. The keypoint encoding process, the 3D sparse convolutional networks, and other configurations are kept the same for a fair comparison.

Implementation Details. Here we only introduce the architecture of Pyramid-PV on the Waymo Open dataset. The implementations of the other models are similar and can be found in the supplementary materials. RoI-grid Attention uses multiple attention heads, each with a fixed number of feature channels. In the DARP module, the context embedding is extracted from the neighboring Points of Interest within two spheres of different radii. The temperature starts from a larger value and exponentially decays to a smaller one by the end of training. The RoI-grid Pyramid consists of five levels, each with its own number of grid points, and for each pyramid level a focusing radius is predicted and shared across all the grid points in that level. The enlarging ratio and grid size are set per level of the RoI-grid Pyramid, and the maximum number of points that participate in RoI-grid Attention for each grid point is also set per pyramid level.

Training and Inference Details. Our Pyramid R-CNN is trained from scratch with the ADAM optimizer. On the KITTI dataset, Pyramid-P, Pyramid-V and Pyramid-PV are trained with the same batch size , the learning rate respectively for epochs on V100 GPUs. On the Waymo Open dataset, we uniformly sample frames for training and use the full validation set for evaluation following [28]. Pyramid-P, Pyramid-V and Pyramid-PV are trained with the same batch size , the learning rate for epochs. The cosine annealing learning rate strategy is adopted for the learning rate decay. Other configurations are kept the same as the corresponding baselines [30, 31, 28] for a fair comparison.

4.2 Comparisons on the Waymo Open Dataset

We evaluate the performance of Pyramid R-CNN on the Waymo Open dataset. The validation results in Table 1 show that our Pyramid-P, Pyramid-V, and Pyramid-PV significantly outperform their baseline methods, with gains of about 2.0, 4.1, and 6.0 LEVEL_1 mAP respectively, and achieve superior mAP on all difficulty levels and all distance ranges, which demonstrates the effectiveness and generalizability of our approach. It is worth noting that Pyramid-V surpasses PV-RCNN by more than 11 mAP in detecting objects that are more than 50 meters away, which indicates the adaptability of our approach to extremely sparse conditions. Our Pyramid-PV outperforms all the previous approaches by a remarkable margin, and achieves new state-of-the-art performance of 76.30 mAP and 67.23 mAP for the LEVEL_1 and LEVEL_2 difficulty. In Table 2, our Pyramid-PV achieves 81.77 LEVEL_1 mAP, ranks among the top entries on the Waymo vehicle detection leaderboard as of March 10th, 2021, and surpasses all the LiDAR-only approaches.

4.3 Comparisons on the KITTI Dataset

We evaluate our Pyramid R-CNN on the KITTI dataset. The test results in Table 3 show that our Pyramid-P, Pyramid-V, and Pyramid-PV consistently outperform their baseline methods, with gains of 4.66, 2.79, and 0.65 mAP respectively on the moderate car class, and Pyramid-PV achieves 82.08 mAP, becoming the new state of the art. The validation results in Table 4 show that Pyramid-P, Pyramid-V, and Pyramid-PV improve the baselines by 4.47, 3.67, and 0.69 mAP on the moderate car class, and by 1.06, 0.07, and 0.14 mAP on the hard car class respectively. We note that the performance gains mainly come from the hard cases, which indicates the adaptability of our approach, and the observations on the KITTI dataset are consistent with those on the Waymo Open dataset.

Methods Modality 3D AP (%)
Easy Mod. Hard
MV3D [4] R+L 74.97 63.63 54.00
AVOD-FPN [13] R+L 83.07 71.76 65.73
F-PointNet [25] R+L 82.19 69.79 60.59
MMF [16] R+L 88.40 77.43 70.22
3D-CVF [43] R+L 89.20 80.05 73.11
CLOCs [24] R+L 88.94 80.67 77.15
ContFuse [17] R+L 83.68 68.78 61.67
VoxelNet [46] L 77.47 65.11 57.73
PointPillars [14] L 82.58 74.31 68.99
SECOND [38] L 84.65 75.96 68.71
STD [40] L 87.95 79.71 75.09
Patches [15] L 88.67 77.20 71.82
3DSSD [39] L 88.36 79.57 74.55
SA-SSD [10] L 88.75 79.79 74.16
TANet [19] L 85.94 75.76 68.32
Voxel R-CNN [5] L 90.90 81.62 77.06
HVNet [41] L 87.21 77.58 71.79
PointGNN [32] L 88.33 79.47 72.29
PointRCNN [30] L 86.96 75.64 70.70
Pyramid-P (ours) L 87.03 80.30 76.48
Part-A^2 Net [31] L 87.81 78.49 73.51
Pyramid-V (ours) L 87.06 81.28 76.85
PV-RCNN [28] L 90.25 81.43 76.82
Pyramid-PV (ours) L 88.39 82.08 77.49
Table 3: Performance comparison on the KITTI test set with AP calculated by 40 recall positions for the car category. R+L denotes methods that combine RGB data and point clouds. L denotes LiDAR-only approaches.
Methods 3D AP (%)
Easy Mod. Hard
PointRCNN [30] 88.88 78.63 77.38
Pyramid-P (ours) 88.47 83.10 78.44
Part-A^2 Net [31] 89.47 79.47 78.54
Pyramid-V (ours) 88.44 83.14 78.61
PV-RCNN [28] 89.35 83.69 78.70
Pyramid-PV (ours) 89.37 84.38 78.84
Table 4: Performance comparison on the KITTI val split with AP calculated by 11 recall positions for the car category.

4.4 Ablation Studies

The effects of different components. As shown in Table 5, on the Waymo validation set, the RoI-grid Pyramid of the Pyramid-PV model improves over the baseline by 1.2 mAP, mainly because the RoI-grid Pyramid is able to capture large context information, which benefits the detection of the hard cases. On top of the RoI-grid Pyramid, replacing RoI-grid Pooling with RoI-grid Attention further boosts the performance, which indicates that RoI-grid Attention is a more effective operation than RoI-grid Pooling. Using the adaptive radius instead of a fixed radius brings an additional gain, which demonstrates the efficacy of the DARP module.

The effects of different pyramid configurations. As shown in Table 6, we found that the RoI-grid Pyramid, whose higher levels use enlarging ratios greater than 1, enhances the performance compared with the standard RoI-grid whose single level uses a ratio of 1, mainly because placing some grid points outside RoIs encodes richer context. The total number of grid points used is comparable to the number of grid points used in [28].

Inference speed analysis. We test the inference speed of different frameworks with a fixed batch size on a single V100 GPU and report the average running speed over all samples in the KITTI val split. Table 7 shows that our models maintain computational efficiency compared to the baselines, and the pyramid RoI head adds only a small latency per frame.

Methods R.P. D.A.R.P. R.A. LEVEL_1 mAP
PV-RCNN 70.30
PV-RCNN 74.06
(a) 75.26
(b) 75.63
(c) 75.77
(d) 76.30
Table 5: Effects of different components in Pyramid-PV on the Waymo dataset. R.P.: the RoI-grid Pyramid. D.A.R.P.: the Density-Aware Radius Prediction module. R.A.: RoI-grid Attention. : re-implemented by ourselves with the official code.
Methods grid size enlarging ratio LEVEL_1 mAP
PV-RCNN [6, 6] [1, 1] 74.06
(a) [6,4,4] [1,1,2] 74.55
(b) [6,4,4,4] [1,1,2,4] 74.71
(c) [6,4,4,4,1] [1,1,1.5,2,4] 75.26
Table 6: Effects of different RoI pyramids in Pyramid-PV on the Waymo dataset. Each element in [] stands for the respective parameter of a pyramid level.
Methods Inference speed (Hz)
PointRCNN [30] 10.08
Pyramid-P (ours) 8.92
Part-A^2 Net [31] 11.75
Pyramid-V (ours) 9.68
PV-RCNN [28] 9.25
Pyramid-PV (ours) 7.86
Table 7: Comparisons on the inference speeds of different detection models on the KITTI dataset.

5 Conclusion

We present Pyramid R-CNN, a general two-stage framework that can be applied upon diverse backbones. Our framework handles the sparsity and non-uniform distribution problems of point clouds by introducing the pyramid RoI head. For future work, we plan to optimize Pyramid R-CNN for efficient inference.

References

  • [1] A. Bewley, P. Sun, T. Mensink, D. Anguelov, and C. Sminchisescu (2020) Range conditioned dilated convolutions for scale invariant 3d object detection. arXiv preprint arXiv:2005.09927. Cited by: Table 1.
  • [2] Z. Cai and N. Vasconcelos (2018) Cascade r-cnn: delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6154–6162. Cited by: §1.
  • [3] Q. Chen, L. Sun, E. Cheung, and A. L. Yuille (2020) Every view counts: cross-view consistency in 3d object detection with hybrid-cylindrical-spherical voxelization. Advances in Neural Information Processing Systems 33. Cited by: Table 1.
  • [4] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia (2017) Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1907–1915. Cited by: Table 3.
  • [5] J. Deng, S. Shi, P. Li, W. Zhou, Y. Zhang, and H. Li (2020) Voxel r-cnn: towards high performance voxel-based 3d object detection. arXiv preprint arXiv:2012.15712. Cited by: §1, §2, §3.3, §3.3, Table 1, Table 3.
  • [6] R. Ge, Z. Ding, Y. Hu, Y. Wang, S. Chen, L. Huang, and Y. Li (2020) Afdet: anchor free one stage 3d object detection. arXiv preprint arXiv:2006.12671. Cited by: Table 1.
  • [7] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237. Cited by: §1, §4.
  • [8] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §1.
  • [9] R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §1.
  • [10] C. He, H. Zeng, J. Huang, X. Hua, and L. Zhang (2020) Structure aware single-stage 3d object detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11873–11882. Cited by: Table 3.
  • [11] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §1.
  • [12] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §3.4.
  • [13] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander (2018) Joint 3d proposal generation and object detection from view aggregation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–8. Cited by: Table 3.
  • [14] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom (2019) Pointpillars: fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12697–12705. Cited by: §1, §2, Table 1, Table 3.
  • [15] J. Lehner, A. Mitterecker, T. Adler, M. Hofmarcher, B. Nessler, and S. Hochreiter (2019) Patch refinement–localized 3d object detection. arXiv preprint arXiv:1910.04093. Cited by: Table 3.
  • [16] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun (2019) Multi-task multi-sensor fusion for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7345–7353. Cited by: Table 3.
  • [17] M. Liang, B. Yang, S. Wang, and R. Urtasun (2018) Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 641–656. Cited by: Table 3.
  • [18] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §3.2.
  • [19] Z. Liu, X. Zhao, T. Huang, R. Hu, Y. Zhou, and X. Bai (2020) Tanet: robust 3d object detection from point clouds with triple attention. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 11677–11684. Cited by: Table 3.
  • [20] J. Mao, M. Niu, C. Jiang, H. Liang, X. Liang, Y. Li, C. Ye, W. Zhang, Z. Li, J. Yu, et al. (2021) One million scenes for autonomous driving: once dataset. arXiv preprint arXiv:2106.11037. Cited by: §1.
  • [21] J. Mao, X. Wang, and H. Li (2019) Interpolated convolutional networks for 3d point cloud understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1578–1587. Cited by: §2.
  • [22] G. P. Meyer, A. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K. Wellington (2019) Lasernet: an efficient probabilistic 3d object detector for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12677–12686. Cited by: Table 1.
  • [23] J. Ngiam, B. Caine, W. Han, B. Yang, Y. Chai, P. Sun, Y. Zhou, X. Yi, O. Alsharif, P. Nguyen, et al. (2019) Starnet: targeted computation for object detection in point clouds. arXiv preprint arXiv:1908.11069. Cited by: Table 1.
  • [24] S. Pang, D. Morris, and H. Radha (2020) CLOCs: camera-lidar object candidates fusion for 3d object detection. arXiv preprint arXiv:2009.00784. Cited by: Table 3.
  • [25] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2018) Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 918–927. Cited by: Table 3.
  • [26] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413. Cited by: §2.
  • [27] S. Ren, K. He, R. Girshick, and J. Sun (2016) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence 39 (6), pp. 1137–1149. Cited by: §1.
  • [28] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li (2020) Pv-rcnn: point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10529–10538. Cited by: §1, §1, §1, §2, §3.1, §3.3, §3.3, §4.1, §4.1, §4.4, Table 1, Table 2, Table 3, Table 4, Table 7.
  • [29] S. Shi, L. Jiang, J. Deng, Z. Wang, C. Guo, J. Shi, X. Wang, and H. Li (2021) PV-rcnn++: point-voxel feature set abstraction with local vector representation for 3d object detection. arXiv preprint arXiv:2102.00463. Cited by: §1, §2, §3.3, §3.3.
  • [30] S. Shi, X. Wang, and H. Li (2019) Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 770–779. Cited by: §1, §1, §2, §3.1, §4.1, §4.1, Table 1, Table 3, Table 4, Table 7.
  • [31] S. Shi, Z. Wang, J. Shi, X. Wang, and H. Li (2020) From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1, §1, §2, §3.1, §4.1, §4.1, Table 1, Table 3, Table 4, Table 7.
  • [32] W. Shi and R. Rajkumar (2020) Point-gnn: graph neural network for 3d object detection in a point cloud. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1711–1719. Cited by: §2, Table 3.
  • [33] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al. (2020) Scalability in perception for autonomous driving: waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2446–2454. Cited by: §4.
  • [34] O. D. Team (2020) OpenPCDet: an open-source toolbox for 3d object detection from point clouds. Note: https://github.com/open-mmlab/OpenPCDet Cited by: §4.1.
  • [35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. arXiv preprint arXiv:1706.03762. Cited by: §3.3.
  • [36] Y. Wang, A. Fathi, A. Kundu, D. Ross, C. Pantofaru, T. Funkhouser, and J. Solomon (2020) Pillar-based object detection for autonomous driving. arXiv preprint arXiv:2007.10323. Cited by: §2, Table 1.
  • [37] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog) 38 (5), pp. 1–12. Cited by: §3.3.
  • [38] Y. Yan, Y. Mao, and B. Li (2018) Second: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337. Cited by: §1, §2, Table 3.
  • [39] Z. Yang, Y. Sun, S. Liu, and J. Jia (2020) 3dssd: point-based 3d single stage object detector. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11040–11048. Cited by: §2, Table 3.
  • [40] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia (2019) Std: sparse-to-dense 3d object detector for point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1951–1960. Cited by: §1, §2, Table 3.
  • [41] M. Ye, S. Xu, and T. Cao (2020) Hvnet: hybrid voxel network for lidar based 3d object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1631–1640. Cited by: Table 3.
  • [42] T. Yin, X. Zhou, and P. Krähenbühl (2020) Center-based 3d object detection and tracking. arXiv preprint arXiv:2006.11275. Cited by: §2, Table 2.
  • [43] J. H. Yoo, Y. Kim, J. S. Kim, and J. W. Choi (2020) 3d-cvf: generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. arXiv preprint arXiv:2004.12636 3. Cited by: Table 3.
  • [44] H. Zhao, L. Jiang, J. Jia, P. Torr, and V. Koltun (2020) Point transformer. arXiv preprint arXiv:2012.09164. Cited by: §3.3, §3.3.
  • [45] Y. Zhou, P. Sun, Y. Zhang, D. Anguelov, J. Gao, T. Ouyang, J. Guo, J. Ngiam, and V. Vasudevan (2020) End-to-end multi-view fusion for 3d object detection in lidar point clouds. In Conference on Robot Learning, pp. 923–932. Cited by: Table 1.
  • [46] Y. Zhou and O. Tuzel (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499. Cited by: §1, §2, Table 3.

Appendix A Approximation in Radius Prediction

In this section, we explain why the soft coefficient is used to approximate the hard spherical indicator. As shown in Figure 6, the coefficient can be viewed as a soft approximation to the indicator function, and the sharpness of the curve is controlled by the temperature. When the temperature approaches 0, the coefficient becomes more similar to the hard indicator. In this paper, we set the initial temperature to a relatively large value for exploration, and gradually decrease it to obtain a better approximation.

Appendix B Implementation of the DARP Module

In this section, we provide the detailed implementation of the Density-Aware Radius Prediction (DARP) module. Inspired by the design of Deformable Convolutions, which utilize standard convolutions to predict the deformable offsets, we first use a fixed sphere to aggregate a context embedding and then use this embedding to predict a dynamic radius offset for all grid points in a pyramid level. In particular, for each pyramid level of an RoI-grid Pyramid, we use two spheres of different radii centered at the RoI for context aggregation. The aggregated context embedding is then fed into an MLP to predict the dynamic radius offset. A predefined radius plus the dynamic offset is utilized to obtain the coefficient in Eq. 10, and the Points of Interest within the slightly enlarged range are selected for the computation of RoI-grid Attention in Eq. 13. The predefined radius is set per pyramid level in this paper. We note that all grid points in a pyramid level share the same predicted radius, and its prediction adds little computational overhead. It is worth noting that we can easily extend this idea to settings where each grid point has its own predicted radius, or we can additionally predict the centers of the spheres.
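A minimal sketch of this prediction path is given below; the mean-pooling used to summarize the two spheres, the MLP width, and the bounded offset range are our assumptions rather than the reported configuration.

```python
import torch
import torch.nn as nn

class RadiusPredictor(nn.Module):
    """Predict a dynamic radius offset for one pyramid level from a context embedding."""

    def __init__(self, feat_dim, max_offset=1.0):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.max_offset = max_offset

    def _pool(self, feats, mask):
        # Mean-pool the features of the points selected by the mask (zeros if the sphere is empty).
        return feats[mask].mean(dim=0) if mask.any() else feats.new_zeros(feats.shape[-1])

    def forward(self, roi_center, points, feats, r_small, r_large, r_predefined):
        # Aggregate context from two fixed spheres centered at the RoI.
        dists = (points - roi_center).norm(dim=-1)
        ctx = torch.cat([self._pool(feats, dists < r_small), self._pool(feats, dists < r_large)])
        offset = torch.tanh(self.mlp(ctx)) * self.max_offset   # bounded dynamic radius offset
        return r_predefined + offset.squeeze(-1)                # shared by all grid points of this level
```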

Figure 7: 3D backbones of the enlarged Pyramid-PV and the vanilla Pyramid-PV.

Appendix C Backbones of Pyramid R-CNN

In this section, we provide additional information for some backbones of Pyramid R-CNN. We note that other backbones that are not mentioned are directly referred from the official source-code repositories. Pyramid RoI head is kept the same upon all the backbones in this paper.

Pyramid-P. We re-implement the backbone of PointRCNN on the Waymo Open dataset. Different from the original version on the KITTI dataset, we enlarge the number of input sampled points and the number of downsampled points kept in each layer. We note that this modification enlarges the number of kept points, since the number of input points is much larger than on the KITTI dataset. Pyramid-P and our re-implemented PointRCNN share the same backbone configurations on the Waymo Open dataset.

Pyramid-PV. For the test submission of Pyramid-PV, we implement a larger backbone with a modified input voxel size. The backbones of the enlarged Pyramid-PV and the vanilla Pyramid-PV (PV-RCNN) are shown in Figure 7. Our Pyramid R-CNN is compatible with a small backbone for a fair comparison with baseline methods, or a large backbone to further enhance the detection performance.

Appendix D Qualitative Results

In this section, we provide the qualitative results on the KITTI dataset in Figure 8, and the Waymo Open dataset in Figure 9. The figures show that our proposed Pyramid R-CNN can accurately detect 3D objects which are far away and have only a few points.

Figure 8: Visualization of detection results on the KITTI dataset. Blue boxes are the ground truth boxes, and red boxes are the boxes predicted by Pyramid-PV.
Figure 9: Visualization of detection results on the Waymo Open dataset. Blue boxes are the ground truth boxes, and red boxes are the boxes predicted by Pyramid-PV.