1 Introduction
3D object detection is a key component of perception systems for robotics and autonomous driving, aiming to detect vehicles, pedestrians, and other objects from 3D point clouds. In this paper, we propose a general two-stage 3D detection framework, named Pyramid R-CNN, which can be applied on multiple 3D backbones to enhance detection adaptability and performance.
Among the existing 3D detection frameworks, two-stage detection models [40, 31, 28, 5, 29] surpass most single-stage 3D detectors [46, 38, 14, 20, 30] by remarkable margins owing to the RoI refinement stage. Different from their 2D counterparts [8, 9, 27, 11, 2], which apply RoIPool [9] or RoIAlign [11] to crop dense feature maps in the second stage, 3D detection models generally perform various RoI feature extraction operations on the Points of Interest. For example, PointRCNN [30] utilizes a point-based backbone to generate 3D proposals, treats the points near the proposals as Points of Interest, and applies Region Pooling on those sparse points for box refinement; Part-A2 Net [31] utilizes a voxel-based backbone for proposal generation, uses the upsampled voxel points as Points of Interest, and applies sparse convolutions on those voxel points for each RoI; PV-RCNN [28] encodes the whole scene into a set of keypoints, and utilizes the keypoints as Points of Interest for RoI-grid Pooling. Those Points of Interest originate from raw point clouds and contain rich fine-grained information, which is required by the RoI refinement stage.
However, the Points of Interest inevitably suffer from the sparsity and non-uniform distribution of input point clouds, as demonstrated by the statistical results on the KITTI dataset [7] in Figure 1: 1) Point clouds can be quite sparse for certain objects. A considerable fraction of objects contain fewer than 10 points, and their visualized shapes are mostly incomplete, so it is hard to identify their categories without enough context information. 2) The distribution of object points is extremely imbalanced: the number of points per object spans several orders of magnitude on KITTI, and current RoI operations cannot handle such imbalanced conditions effectively. 3) The Points of Interest account for only a small proportion of the input points or voxels, e.g., the keypoints in [28] are a tiny fraction of the total input points, which exacerbates the above problems.
To overcome the above limitations, we propose Pyramid R-CNN, a general two-stage 3D detection framework that can effectively detect objects and adapt to environmental changes. Our main contribution lies in the design of a novel RoI feature extraction head, named the pyramid RoI head, which can be applied on multiple 3D backbones and Points of Interest. The pyramid RoI head consists of three key components. First, we propose the RoI-grid Pyramid. Given the observation that the Points of Interest inside RoIs are often too sparse for object recognition, our RoI-grid Pyramid captures more Points of Interest outside RoIs while still maintaining fine-grained geometric details, by extending the standard one-level RoI-grid to a pyramid structure. Second, we propose RoI-grid Attention, an effective operation to extract RoI-grid features from Points of Interest. RoI-grid Attention leverages the advantages of the graph-based and attention-based point operators by combining their formulas into a unified formulation, and it can adapt to different sparsity situations by dynamically attending to the crucial Points of Interest near the RoIs. Third, we propose the Density-Aware Radius Prediction (DARP) module, which predicts the feature extraction radius of each RoI conditioned on the neighboring distribution of Points of Interest, so that the imbalanced distribution problem can be addressed by adaptively adjusting the focusing range for each RoI. Combining all the above components, the pyramid RoI head adapts to different point cloud sparsity levels and can accurately detect 3D objects with only a few points. Our Pyramid R-CNN is compatible with the point-based [30], voxel-based [31] and point-voxel-based [28] frameworks, and significantly boosts detection accuracy.
We summarize our key contributions as follows:
1) We propose Pyramid R-CNN, a general two-stage framework that can be applied on multiple backbones for accurate and robust 3D object detection.
2) We propose the pyramid RoI head, which combines the RoI-grid Pyramid, RoI-grid Attention, and the Density-Aware Radius Prediction (DARP) module to mitigate the sparsity and non-uniform distribution problems.
3) Pyramid R-CNN consistently outperforms the baselines, achieves 82.08 mAP on the moderate car class of the KITTI test set, and surpasses all LiDAR-only methods on the Waymo test leaderboard for vehicle detection.
2 Related Work
Single-stage 3D Object Detection. Single-stage methods can be divided into three streams, i.e., point-based, voxel-based and pillar-based. The point-based single-stage detectors generally take the raw points as input, and apply set abstraction [26, 21] to obtain the point features for box prediction. 3DSSD [39] introduces Feature-FPS as a new sampling strategy for raw point clouds. Point-GNN [32] proposes a graph operator to aggregate point information for object detection. The voxel-based single-stage detectors typically rasterize point clouds into voxel grids and then apply 2D and 3D CNNs to generate 3D proposals. VoxelNet [46] divides points into voxels and leverages a 3D CNN to aggregate voxel features for proposal generation. SECOND [38] improves the voxel feature learning process by introducing 3D sparse convolutions. CenterPoint [42] proposes a center-based assignment that can be applied on feature maps for accurate location prediction. Pillar-based approaches generally transform point clouds into Bird's-Eye-View (BEV) pillars and apply 2D CNNs for 3D object detection. PointPillars [14] is the first work that introduces the pillar representation. Pillar-based network [36] extends the idea by proposing the cylindrical view projection. Unlike the two-stage approaches, the single-stage methods cannot benefit from the fine-grained point information, which is crucial for accurate box prediction.
Two-stage 3D object detection. Two-stage approaches can be divided into three streams based on the representation of Points of Interest, i.e., point-based, voxel-based and point-voxel-based. Point-based approaches treat the sampled point clouds as Points of Interest. PointRCNN [30] generates 3D proposals from raw point clouds and proposes Region Pooling to extract RoI features for the second-stage refinement. STD [40] proposes a sparse-to-dense strategy and uses the PointsPool operation for RoI refinement. Voxel-based methods use the voxel points from 3D CNNs as Points of Interest. Part-A2 Net [31] applies 3D sparse convolutions on the upsampled voxel points for RoI refinement. Voxel R-CNN [5] utilizes Voxel RoI Pooling to extract RoI features from voxels. Point-voxel-based approaches use the keypoints that encode the whole scene as Points of Interest. PV-RCNN [28] designs RoI-grid Pooling to aggregate keypoint features near RoIs. PV-RCNN++ [29] proposes VectorPool to efficiently collect keypoint features from different orientations. Compared with the previous methods, our Pyramid R-CNN shows better performance and robustness, and is compatible with all the representations of Points of Interest.
3 Pyramid RCNN
In this section, we detail the design of Pyramid R-CNN, a general two-stage framework for 3D object detection. We first introduce the overall architecture in Section 3.1, and then the three key components of the pyramid RoI head: the RoI-grid Pyramid in Section 3.2, RoI-grid Attention in Section 3.3, and the Density-Aware Radius Prediction (DARP) module in Section 3.4.
3.1 Overall Architecture
Here, we present a new two-stage framework for accurate and robust 3D object detection, named Pyramid R-CNN, as shown in Figure 2. The framework is compatible with multiple backbones, e.g., the point-based backbone, the voxel-based backbone, or the point-voxel-based backbone. In the first stage, those backbones output 3D proposals and the corresponding Points of Interest: point clouds near RoIs in [30], upsampled voxels in [31], and keypoints in [28]. In the second stage, we propose a novel pyramid RoI head, which consists of three key components: the RoI-grid Pyramid, RoI-grid Attention, and the Density-Aware Radius Prediction (DARP) module. For each RoI, we first build an RoI-grid Pyramid by gradually enlarging the size of the original RoI at each pyramid level; the coordinates of the RoI-grid points are determined by the enlarged RoI size and the grid size. At each pyramid level, the focusing radius r of the RoI-grid points is predicted from a global context vector through the Density-Aware Radius Prediction module. Then RoI-grid Attention, parameterized by r, is performed to aggregate the features of the Points of Interest into the RoI-grids. Finally, the RoI-grid features are enhanced and fed into two individual heads for classification and regression. We describe those key components in the following sections.
3.2 RoIgrid Pyramid
In this section, we present the RoI-grid Pyramid, a simple and effective module that captures rich context while still maintaining internal structural information. Different from the 2D feature pyramid [18], which hierarchically encodes context information upon dense backbone features, our RoI-grid Pyramid is applied on each RoI by gradually placing grid points outside the RoI in a pyramid manner. The idea behind this design is based on the observation that image features inside RoIs generally contain sufficient semantic context, while point clouds inside RoIs contain quite limited information, since object points are naturally sparse and incomplete. Even though each point has a large receptive field, the sparse compositional 3D shapes inside RoIs are hard to recognize. In the following parts we introduce the detailed formulations.
RoI feature extraction generally relies on an RoI-grid for each RoI, and the RoI-grid points collect the features of adjacent pixels or neighboring Points of Interest in the 2D and 3D cases respectively. Supposing we have an RoI with $(w, l, h)$ as width, length, and height and $(x_0, y_0, z_0)$ as the bottom-left corner, in the standard RoI-grid representation the RoI-grid point location $p_{ijk}$ can be computed as:

$p_{ijk} = (x_0, y_0, z_0) + \left( \frac{(i+0.5)\,w}{N_w},\; \frac{(j+0.5)\,l}{N_l},\; \frac{(k+0.5)\,h}{N_h} \right)$  (1)

where $(N_w, N_l, N_h)$ are the grid sizes in the three dimensions, $0 \le i < N_w$, $0 \le j < N_l$, $0 \le k < N_h$, and all grid points are generated inside RoIs.
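As a concrete illustration of Eq. 1, here is a minimal NumPy sketch that enumerates the grid-point coordinates for one RoI (the function name and the cell-center placement convention are our assumptions, not from the paper):

```python
import numpy as np

def roi_grid_points(corner, size, grid_size):
    """Standard one-level RoI-grid: grid-point coordinates for one RoI.

    corner:    (3,) bottom-left corner (x0, y0, z0) of the RoI
    size:      (3,) RoI extents (w, l, h)
    grid_size: (3,) number of grid points (Nw, Nl, Nh) per axis
    Returns (Nw*Nl*Nh, 3) coordinates; each point sits at the center of
    its cell, so all points fall strictly inside the RoI (Eq. 1).
    """
    corner, size = np.asarray(corner, float), np.asarray(size, float)
    axes = [(np.arange(n) + 0.5) / n * s for n, s in zip(grid_size, size)]
    gx, gy, gz = np.meshgrid(*axes, indexing="ij")
    return corner + np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)

pts = roi_grid_points(corner=(0, 0, 0), size=(4, 2, 1), grid_size=(2, 2, 2))
# 8 points, all inside the RoI; the first one sits at (1.0, 0.5, 0.25)
```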
Utilizing features only inside the RoIs works well in 2D detection models, mainly owing to two facts: the input feature map is dense, and the collected pixels have large receptive fields. However, the case is different in 3D models. As shown in Figure 3, the Points of Interest are naturally sparse and non-uniformly distributed inside the RoIs, and the object shape is often extremely incomplete. Thus it is hard to accurately infer the sizes and categories of objects by solely collecting the features of a few individual points without referring to enough neighboring point information.
To resolve the above problems, we propose the RoI-grid Pyramid, which balances fine-grained and large-context information. The detailed structure is shown in Figure 3. The key idea is to construct a pyramid grid structure that contains RoI-grid points both inside and outside RoIs, so that the grid points inside RoIs can capture fine-grained shape structures for accurate box refinement, while the grid points outside RoIs can obtain large context information to identify incomplete objects. The grid points $p^{(s)}_{ijk}$ for pyramid level $s$ can be computed as:

$p^{(s)}_{ijk} = (x_c, y_c, z_c) + \rho_s \left( \frac{(i+0.5)\,w}{N^{(s)}_w} - \frac{w}{2},\; \frac{(j+0.5)\,l}{N^{(s)}_l} - \frac{l}{2},\; \frac{(k+0.5)\,h}{N^{(s)}_h} - \frac{h}{2} \right)$  (2)

where $(x_c, y_c, z_c)$ is the RoI center and $\rho_s$ is the enlarging ratio of the original RoI size at level $s$. $\rho_s$ starts from 1 at the bottom level for maintaining fine-grained details (in which case Eq. 2 reduces to Eq. 1), and becomes larger as the level goes higher to capture more context information. The grid size $(N^{(s)}_w, N^{(s)}_l, N^{(s)}_h)$ is initialized with the same value as the original at the bottom level and gets smaller at higher levels to save computational resources. For each pyramid level, features of grid points are then aggregated by RoI-grid Attention from the features of Points of Interest. Finally, the features of all pyramid levels are combined for box refinement.
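The pyramid construction can be sketched the same way: each level re-centers and scales the RoI before laying out its grid. Cubic per-level grids and the helper name are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def pyramid_grid_points(center, size, ratios, grid_sizes):
    """RoI-grid Pyramid sketch: one grid per level, with the RoI enlarged
    by ratio rho around its center. rho = 1 keeps the grid inside the RoI;
    rho > 1 places grid points outside to gather context."""
    center, size = np.asarray(center, float), np.asarray(size, float)
    levels = []
    for rho, n in zip(ratios, grid_sizes):
        ext = rho * size                 # enlarged extents at this level
        corner = center - ext / 2.0      # enlargement is center-aligned
        axes = [(np.arange(n) + 0.5) / n * e for e in ext]
        g = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
        levels.append(corner + g)
    return levels

# A 3-level pyramid matching the ratios reported in Table 6, row (a)
levels = pyramid_grid_points(center=(0, 0, 0), size=(4, 2, 1),
                             ratios=[1, 1, 2], grid_sizes=[6, 4, 4])
```

With this convention, every level-0 point stays inside the RoI, while the top level (ratio 2) spills grid points well outside it.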
3.3 RoIgrid Attention
In this section, we introduce RoI-grid Attention, a novel RoI feature extraction operation that combines the state-of-the-art graph-based and attention-based point operators [37, 35, 44] into a unified framework; RoI-grid Attention can serve as a better substitute for the conventional pooling-based operations [28, 5, 29] in 3D detection models. We first discuss the formulas of the pooling-based, graph-based and attention-based point operators, and then derive the formulation of RoI-grid Attention.
Preliminary. Let $g$ be the coordinate of an RoI-grid point, and $p_i$, $f_i$ be the coordinate and the corresponding feature vector of a Point of Interest near $g$. An RoI feature extraction operation aims to obtain the feature vector $f^g$ of the RoI-grid point $g$, using the information of the neighboring $p_i$ and $f_i$.
Pooling-based Operators. The pooling-based operators are extensively applied for RoI feature extraction in most two-stage 3D detection models [28, 5, 29]. The neighboring feature $f_i$ and the relative location $p_i - g$ first go through an MLP layer to obtain the transformed feature vector $v_i = \mathrm{MLP}([f_i;\, p_i - g])$, where $[\cdot\,;\cdot]$ is the concatenation function, and then a max-pooling operation is applied upon all the transformed features to obtain the RoI-grid feature $f^g$:

$f^g = \max_{p_i \in \Omega_g(r)} v_i$  (3)

where $\Omega_g(r) = \{ p_i : \|p_i - g\| \le r \}$ means the Points of Interest within the fixed radius $r$ of the RoI-grid point $g$, and the max is taken channel-wise. The pooling-based operators only focus on the maximum channel response, which results in a loss of much semantic and geometric information.
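A minimal sketch of the pooling-based operator of Eq. 3, with a single linear layer plus ReLU standing in for the MLP (the function and parameter names are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def roi_grid_pooling(g, points, feats, weight, radius):
    """Pooling-based RoI feature extraction (Eq. 3 sketch).

    g: (3,) grid-point coordinate; points: (N, 3); feats: (N, C).
    weight: (C + 3, D) stands in for the MLP (single linear layer + ReLU).
    Returns the (D,) channel-wise max over neighbors within `radius`,
    or zeros if the spherical neighborhood is empty."""
    d = np.linalg.norm(points - g, axis=1)
    mask = d <= radius
    if not mask.any():
        return np.zeros(weight.shape[1])
    x = np.concatenate([feats[mask], points[mask] - g], axis=1)  # [f_i ; p_i - g]
    return np.maximum(x @ weight, 0.0).max(axis=0)  # channel-wise max over neighbors

points = rng.normal(size=(50, 3))
feats = rng.normal(size=(50, 8))
w = rng.normal(size=(11, 16))
f_g = roi_grid_pooling(np.zeros(3), points, feats, w, radius=1.0)
```

Note that only one neighbor "wins" each output channel, which is exactly the information loss the paragraph above describes.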
Graph-based Operators. Graph-based operators model the grid points and Points of Interest as a graph. The graph node represents the transformed feature of $p_i$: $V_i = \mathrm{MLP}(f_i)$, and the edge can be formulated as a linear projection of the location difference between two nodes: $E_i = W_e (g - p_i)$. For the graph node of a grid point $g$, the feature is collected from adjacent nodes by a weighted combination operation. Following the same notations as Eq. 3, the general formula can be represented as

$f^g = \sum_{p_i \in \Omega_g(r)} G(E_i) \otimes V_i$  (4)

where the function $G$ projects the graph edge embedding into the scalar or vector weight space, and $\otimes$ denotes either the Hadamard product, dot product or scalar-vector product between the learned weights and the graph nodes.
Attention-based Operators. Attention-based operators can also be applied upon the grid points and Points of Interest. The edge embedding $E_i$ in Eq. 4 can be viewed as the query embedding $Q_i = W_q (g - p_i)$ from the grid point $g$ to the point $p_i$. $V_i = W_v f_i$ is the value embedding obtained from the feature $f_i$ as in Eq. 4. The key embedding can be formulated as $K_i = W_k f_i$. Thus standard attention can be formulated as

$f^g = \sum_{p_i \in \Omega_g(r)} \rho\!\left( Q_i^{\top} K_i \right) \cdot V_i$  (5)
An additional normalization function $\rho$, i.e., softmax over the neighborhood, is applied to the attention weights. The recently proposed Point Transformer [44] extends the idea of standard attention, and its formula can be represented as

$f^g = \sum_{p_i \in \Omega_g(r)} \rho\!\left( \gamma(Q_i - K_i) \right) \odot V_i$  (6)

where $\gamma$ is an MLP that produces vector attention weights.
RoI-grid Attention. In our approach, we analyze the structural similarity of Eq. 4, Eq. 5 and Eq. 6, and find that those formulas share common basic elements and operators. Thus it is natural to merge them into a unified framework with gated functions. We name this new formula RoI-grid Attention:

$f^g = \sum_{p_i \in \Omega_g(r)} \rho\!\left( \sigma_1 \cdot G(Q_i) + \sigma_2 \cdot Q_i^{\top} K_i + \sigma_3 \cdot \gamma(Q_i - K_i) \right) \otimes V_i$  (7)

where each $\sigma$ is a learnable gated function, which can be implemented as a linear projection of the respective embedding with a sigmoid activation output. RoI-grid Attention is a generalized formulation combining graph-based and attention-based operations: Eq. 7 reduces to the graph operator of Eq. 4, the standard attention of Eq. 5, or the Point Transformer of Eq. 6 when the gates $(\sigma_1, \sigma_2, \sigma_3)$ are fixed to keep only the corresponding term.

RoI-grid Attention is a flexible and effective operation for RoI feature extraction. With the learnable gated functions, RoI-grid Attention is able to learn which point is significant to the RoI-grid points, from both the geometric information $Q_i$ and the semantic information $K_i$, as well as their combinations, adaptively. With the gates, RoI-grid Attention can also learn to balance the ratio of geometric and semantic features used in feature aggregation. Compared with the pooling-based methods, only a few linear projection layers are added in RoI-grid Attention, which maintains computational efficiency. Replacing the pooling-based operators with RoI-grid Attention consistently boosts the detection performance.
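The gated combination can be sketched as follows. For clarity the gates are shown as fixed scalars rather than learnable sigmoid-activated functions, the Point-Transformer term is reduced to a scalar, and all projection matrices are random placeholders; this is an illustrative sketch, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def roi_grid_attention(g, points, feats, params, radius, gates):
    """Gated unification of graph- and attention-style operators (sketch).

    Q_i encodes geometry (relative position); K_i / V_i encode semantics.
    `gates` = (s1, s2, s3) blends a graph-style edge term, a dot-product
    attention term, and a subtraction (Point-Transformer-style) term."""
    Wq, Wk, Wv, We = params
    mask = np.linalg.norm(points - g, axis=1) <= radius
    q = (points[mask] - g) @ Wq                      # query from relative location
    k, v = feats[mask] @ Wk, feats[mask] @ Wv        # key / value from features
    s1, s2, s3 = gates
    logits = (s1 * (q @ We) +                        # graph-style edge weight
              s2 * (q * k).sum(1, keepdims=True) +   # standard dot-product attention
              s3 * (q - k).sum(1, keepdims=True))    # subtraction-relation term
    w = np.exp(logits - logits.max())
    w = w / w.sum()                                  # softmax normalization rho
    return (w * v).sum(axis=0)

D = 8
params = [rng.normal(size=s) for s in [(3, D), (D, D), (D, D), (D, 1)]]
points, feats = rng.normal(size=(40, 3)), rng.normal(size=(40, D))
# gates (0, 1, 0) recover plain dot-product attention over the neighborhood
f_g = roi_grid_attention(np.zeros(3), points, feats, params, 1.5, gates=(0.0, 1.0, 0.0))
```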
3.4 DensityAware Radius Prediction
In this section, we investigate the learning problem of the radius $r$, which determines the range of neighboring Points of Interest that participate in the feature extraction process. The radius $r$ is a hyperparameter used in all the point operators in Section 3.3, and had to be determined by researchers in previous approaches. A fixed and predefined $r$ cannot adapt to the density changes of point clouds, and may lead to empty spherical ranges if not set properly. In this paper, we make the prediction of $r$ a fully differentiable process and further propose the Density-Aware Radius Prediction (DARP) module, aiming at learning an adaptive neighborhood for RoI feature extraction. We first introduce the general formulation of RoI-grid Attention from a probabilistic perspective. Next, we propose a novel method to differentiate the learning of $r$. Finally, we introduce the design of the DARP module.
RoI-grid Attention is composed of two steps: it first selects the Points of Interest within the radius $r$, and then performs a weighted combination on those points. With the same notations as in Section 3.3, we can reformulate the first step as sampling from a conditional distribution $P(p_i \mid r)$:

$P(p_i \mid r) = \dfrac{\mathbb{1}(\|p_i - g\| \le r)}{\sum_j \mathbb{1}(\|p_j - g\| \le r)}$  (8)

Then the second step can be represented as calculating the probabilistic expectation:

$f^g = \mathbb{E}_{p_i \sim P(\cdot \mid r)} \left[ W_i \otimes V_i \right]$  (9)

where, with a slight abuse of notation, $W_i$ denotes the combination weight of Eq. 7 and $V_i$ denotes the value embedding.
We propose a new probability distribution $Q(p_i \mid r)$ as a substitute for $P(p_i \mid r)$. $Q$ should satisfy two requirements: i) $Q$ should have similar characteristics as $P$, which means that most points sampled from $Q$ should be inside $\Omega_g(r)$; ii) $Q$ should also leave a few points outside, mainly for the exploration of the surrounding environment. Thus we formulate the probability $Q(p_i \mid r)$ as:

$Q(p_i \mid r) \propto \mathrm{sigmoid}\!\left( \dfrac{r - \|p_i - g\|}{\tau} \right)$  (10)

where $\tau$ is the temperature, which controls the decay rate of the probability. With a small $\tau$, $Q(p_i \mid r)$ is close to $P(p_i \mid r)$ when $p_i$ is inside $\Omega_g(r)$ and close to 0 if outside, while near the spherical boundary the sampling probability lies in between. With $Q$ as a smooth approximation to $P$, we want to compute the gradient of $r$ from the approximated RoI-grid Attention:

$\dfrac{\partial f^g}{\partial r} = \dfrac{\partial}{\partial r}\, \mathbb{E}_{p_i \sim Q(\cdot \mid r)} \left[ W_i \otimes V_i \right]$  (11)
However, taking the derivative directly is still infeasible, since we cannot compute the gradient of a parameterized sampling distribution. The reparameterization trick [12] offers a possible solution to this problem. The key insight is to sample from a basic distribution and then move the original distribution parameters inside the expectation function as coefficients. The gradient of $r$ can be computed as:

$\dfrac{\partial f^g}{\partial r} = \mathbb{E}_{p_i \sim U} \left[ \dfrac{\partial Q(p_i \mid r)}{\partial r}\, W_i \otimes V_i \right]$  (12)

where $Q(p_i \mid r)$ is the same as in Eq. 10, and the theoretical basic distribution $U$ means that the sampling probability is uniform over the whole 3D space. In practice, considering the fact that $Q(p_i \mid r)$ is close to 0 when $\|p_i - g\| \gg r$, we apply an approximation and restrict the sampling range within a sphere whose radius $\hat{r}$ is slightly larger than $r$ in our experiments. This approximation reduces the computational overhead to the same level as vanilla RoI-grid Attention. Since $Q(p_i \mid r)$ is a differentiable function of $r$, we are able to compute the gradient of $r$ in a differentiable manner using Eq. 12. The new formulation of RoI-grid Attention can be represented as
$f^g = \sum_{p_i \in \Omega_g(\hat{r})} Q(p_i \mid r)\, W_i \otimes V_i$  (13)

Compared with vanilla RoI-grid Attention in Eq. 7, a slightly larger sampling range $\Omega_g(\hat{r})$ is used and a $Q(p_i \mid r)$ coefficient is added into the original formula, which costs little additional resources. Although several approximations are applied, we found that they did not hamper the training but instead boosted the performance in our experiments.
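The smooth radius weighting of Eq. 10 and its gradient with respect to $r$ are straightforward to write down; the temperature value below is illustrative:

```python
import numpy as np

def soft_radius_weights(dists, r, tau=0.1):
    """Eq. 10 sketch: Q(p_i | r) = sigmoid((r - ||p_i - g||) / tau).
    Close to 1 inside the sphere, close to 0 outside, smooth at the
    boundary, and differentiable with respect to the radius r."""
    return 1.0 / (1.0 + np.exp(-(r - dists) / tau))

def dweights_dr(dists, r, tau=0.1):
    """Analytic gradient of Q w.r.t. r, using sigmoid'(x) = s * (1 - s)."""
    s = soft_radius_weights(dists, r, tau)
    return s * (1.0 - s) / tau

d = np.array([0.2, 0.9, 1.0, 1.8])   # distances of four points to a grid point
q = soft_radius_weights(d, r=1.0)
# points well inside get weight near 1, well outside near 0, boundary 0.5
grad = dweights_dr(d, r=1.0)
# the gradient concentrates on points near the spherical boundary
```

This is why the radius receives a useful training signal: moving the boundary mostly reweights points near it, while deep-interior and far-exterior points contribute almost nothing.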
We further propose the DARP module based on Eq. 13. For each pyramid level, a context embedding is obtained by summarizing the information of the Points of Interest near this RoI, and the embedding is then utilized to predict the radius $r$ for all grid points in this level. $r$ is further transformed into a $Q(p_i \mid r)$ coefficient by Eq. 10 and participates in the computation of RoI-grid Attention. Since the context embedding captures point cloud information, e.g., density and shape, the predicted $r$ is able to adapt to environmental changes, and is more robust than its human-defined counterpart.
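The radius-prediction step can be sketched as follows, assuming mean pooling for the context embedding and a single linear layer with a sigmoid squashing the output into (0, r_max); the names and the r_max bound are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_radius(points, feats, roi_center, W, b, r_max=4.0):
    """DARP sketch: summarize the Points of Interest near an RoI into a
    context embedding (mean-pooled features and relative positions here),
    then predict a positive focusing radius r shared by all grid points
    of a pyramid level. Pooling choice, single linear layer and r_max
    are illustrative stand-ins for the module described in the paper."""
    ctx = np.concatenate([feats, points - roi_center], axis=1).mean(axis=0)
    return r_max / (1.0 + np.exp(-(ctx @ W + b)))   # sigmoid keeps 0 < r < r_max

points, feats = rng.normal(size=(30, 3)), rng.normal(size=(30, 8))
W, b = rng.normal(size=11), 0.0
r = predict_radius(points, feats, np.zeros(3), W, b)
```

Because the predicted r enters the loss only through the smooth Q(p_i | r) weights of Eq. 13, gradients flow back into W and b end to end.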
4 Experiments
In this section, we evaluate our Pyramid R-CNN on the commonly used Waymo Open Dataset [33] and the KITTI dataset [7]. We first introduce the experimental settings in Section 4.1, and then compare our approach with previous state-of-the-art methods on the Waymo Open Dataset in Section 4.2 and the KITTI dataset in Section 4.3. Finally, we conduct ablation studies to evaluate the efficacy of each component in Section 4.4.
4.1 Experimental Setup
Table 1: Vehicle detection results on the Waymo Open Dataset validation set. The last three columns report LEVEL_1 3D mAP/mAPH by distance to the sensor.

Methods | LEVEL_1 3D mAP/mAPH | LEVEL_2 3D mAP/mAPH | 0-30m | 30-50m | 50m-Inf
PointPillars [14] | 63.3/62.7 | 55.2/54.7 | 84.9/84.4 | 59.2/58.6 | 35.8/35.2
MVF [45] | 62.93/- | - | 86.30/- | 60.02/- | 36.02/-
PillarOD [36] | 69.8/- | - | 88.5/- | 66.5/- | 42.9/-
AFDet [6] | 63.69/- | - | 87.38/- | 62.19/- | 29.27/-
LaserNet [22] | 52.1/50.1 | - | 70.9/68.7 | 52.9/51.4 | 29.6/28.6
CVCNet [3] | 65.2/- | - | 86.80/- | 62.19/- | 29.27/-
StarNet [23] | 64.7/56.3 | 45.5/39.6 | 83.3/82.4 | 58.8/53.2 | 34.3/25.7
RCD [1] | 69.0/68.5 | - | 87.2/86.8 | 66.5/66.1 | 44.5/44.0
Voxel R-CNN [5] | 75.59/- | 66.59/- | 92.49/- | 74.09/- | 53.15/-
PointRCNN [30] | 45.05/44.25 | 37.41/36.74 | 72.24/71.31 | 31.21/30.41 | 23.77/23.15
Pyramid-P (ours) | 47.02/46.58 | 39.10/38.76 | 74.24/73.78 | 32.49/31.96 | 25.68/25.24
Part-A2 Net [31] | 71.69/71.16 | 64.21/63.70 | 91.83/91.37 | 69.99/69.37 | 46.26/45.41
Pyramid-V (ours) | 75.83/75.29 | 66.77/66.28 | 92.63/92.20 | 74.46/73.84 | 53.40/52.44
PV-RCNN [28] | 70.3/69.7 | 65.4/64.8 | 91.9/91.3 | 69.2/68.5 | 42.2/41.3
Pyramid-PV (ours) | 76.30/75.68 | 67.23/66.68 | 92.67/92.20 | 74.91/74.21 | 54.54/53.45

Table 2: Comparison with top-performing methods for vehicle detection on the Waymo Open Dataset (see Section 4.2).

Methods | LEVEL_1 3D mAP/mAPH | LEVEL_2 3D mAP/mAPH | 0-30m | 30-50m | 50m-Inf
CenterPoint [42] | 81.05/80.59 | 73.42/72.99 | 92.52/92.13 | 79.94/79.43 | 61.06/60.42
PV-RCNN [28] | 81.06/80.57 | 73.69/73.23 | 93.40/92.98 | 80.12/79.57 | 61.22/60.47
Pyramid-PV (ours) | 81.77/81.32 | 74.87/74.43 | 93.19/92.80 | 80.53/80.04 | 64.55/63.84
Waymo Open Dataset. The Waymo Open Dataset contains 1000 sequences in total, including 798 sequences (around 158k point cloud samples) in the training set and 202 sequences (around 40k point cloud samples) in the validation set. The official evaluation metrics are the standard 3D mean Average Precision (mAP) and mAP weighted by heading accuracy (mAPH). Both metrics are based on an IoU threshold of 0.7 for vehicles and 0.5 for other categories. The testing samples are split in two ways. The first way is based on the distance of objects to the sensor: 0-30m, 30-50m, and 50m-Inf. The second way is according to the difficulty levels: LEVEL_1 for boxes with more than five LiDAR points and LEVEL_2 for boxes with at least one LiDAR point.

KITTI Dataset. The KITTI dataset contains 7481 training samples and 7518 test samples, and the training samples are further divided into the train split (3712 samples) and the val split (3769 samples). The official evaluation metric is mean Average Precision (mAP) with a rotated IoU threshold of 0.7 for cars. On the test set, mAP is calculated with 40 recall positions by the official server. The results on the val set are calculated with 11 recall positions for a fair comparison with other approaches.
We provide three architectures of Pyramid R-CNN, compatible with the point-based, the voxel-based and the point-voxel-based backbone, respectively. We refer readers to [34] for the detailed design of those backbones.
Pyramid-P. Pyramid R-CNN for Points is built upon the point-based method PointRCNN [30]. In particular, we replace the Canonical 3D Box Refinement module of PointRCNN with our proposed pyramid RoI head, and we still use the sampled points in [30] as Points of Interest. The point cloud backbone and other configurations are kept the same for a fair comparison.
Pyramid-V. Pyramid R-CNN for Voxels is built upon the voxel-based method Part-A2 Net [31]. Specifically, we replace the 3D sparse convolutional head of Part-A2 Net with our proposed pyramid RoI head, and we still use the upsampled voxels as Points of Interest. The voxel-based backbone and other configurations are kept the same for a fair comparison.
Pyramid-PV. Pyramid R-CNN for Point-Voxels is designed upon the point-voxel-based method PV-RCNN [28]. In particular, we replace the RoI-grid Pooling module of PV-RCNN with our proposed pyramid RoI head, and we still use the keypoints as Points of Interest. The keypoint encoding process, the 3D sparse convolutional networks and other configurations are kept the same for a fair comparison.
Implementation Details. Here we only introduce the architecture of Pyramid-PV on the Waymo Open dataset. The implementations of the other models are similar and can be found in the supplementary materials. RoI-grid Attention uses multi-head attention. In the DARP module, the context embedding is extracted from the neighboring Points of Interest within two spheres of different radii, and the temperature starts from a large value and exponentially decays to a small one by the end of training. The RoI-grid Pyramid consists of 5 levels; following the best configuration in Table 6, the grid sizes are set to [6, 4, 4, 4, 1] and the enlarging ratios to [1, 1, 1.5, 2, 4] for the respective levels. For each pyramid level, a focusing radius is predicted and shared across all the grid points in that level, and the number of points that participate in RoI-grid Attention for each grid point is capped per level.
Training and Inference Details. Our Pyramid R-CNN is trained from scratch with the ADAM optimizer. On the KITTI dataset, Pyramid-P, Pyramid-V and Pyramid-PV are trained with the same batch size and learning rate schedule on V100 GPUs. On the Waymo Open dataset, we uniformly sample frames for training and use the full validation set for evaluation, following [28]; Pyramid-P, Pyramid-V and Pyramid-PV are again trained with the same batch size and learning rate. The cosine annealing strategy is adopted for the learning rate decay. Other configurations are kept the same as the corresponding baselines [30, 31, 28] for a fair comparison.
4.2 Comparisons on the Waymo Open Dataset
We evaluate the performance of Pyramid R-CNN on the Waymo Open dataset. The validation results in Table 1 show that our Pyramid-P, Pyramid-V and Pyramid-PV significantly outperform their baseline methods, with 1.97, 4.14 and 6.00 LEVEL_1 mAP gains respectively, and achieve superior mAP on all difficulty levels and all distance ranges, which demonstrates the effectiveness and generalizability of our approach. It is worth noting that Pyramid-V surpasses PV-RCNN by 11.2 mAP in detecting objects that are more than 50m away, which indicates the adaptability of our approach to extremely sparse conditions. Our Pyramid-PV outperforms all previous approaches by a remarkable margin and achieves a new state-of-the-art performance of 76.30 mAP and 67.23 mAP for the LEVEL_1 and LEVEL_2 difficulty. In Table 2, our Pyramid-PV achieves 81.77 LEVEL_1 mAP and surpasses all the LiDAR-only approaches on the Waymo vehicle detection leaderboard as of March 10th, 2021.
4.3 Comparisons on the KITTI Dataset
We evaluate our Pyramid R-CNN on the KITTI dataset. The test results in Table 3 show that our Pyramid-P, Pyramid-V and Pyramid-PV consistently outperform the baseline methods, with 4.66, 2.79 and 0.65 mAP gains respectively on the moderate car class, and Pyramid-PV achieves 82.08 mAP, becoming the new state-of-the-art. The validation results in Table 4 show that Pyramid-P, Pyramid-V and Pyramid-PV improve the baselines by 4.47, 3.67 and 0.69 mAP on the moderate car class, and by 1.06, 0.07 and 0.14 mAP on the hard car class, respectively. We note that the performance gains mainly come from the hard cases, which indicates the adaptability of our approach, and the observations on the KITTI dataset are consistent with those on the Waymo Open dataset.
Table 3: Car detection results (3D mAP, %) on the KITTI test set. Modality R+L denotes RGB + LiDAR fusion; L denotes LiDAR-only.

Methods | Modality | Easy | Mod. | Hard
MV3D [4] | R+L | 74.97 | 63.63 | 54.00
AVOD-FPN [13] | R+L | 83.07 | 71.76 | 65.73
F-PointNet [25] | R+L | 82.19 | 69.79 | 60.59
MMF [16] | R+L | 88.40 | 77.43 | 70.22
3D-CVF [43] | R+L | 89.20 | 80.05 | 73.11
CLOCs [24] | R+L | 88.94 | 80.67 | 77.15
ContFuse [17] | R+L | 83.68 | 68.78 | 61.67
VoxelNet [46] | L | 77.47 | 65.11 | 57.73
PointPillars [14] | L | 82.58 | 74.31 | 68.99
SECOND [38] | L | 84.65 | 75.96 | 68.71
STD [40] | L | 87.95 | 79.71 | 75.09
Patches [15] | L | 88.67 | 77.20 | 71.82
3DSSD [39] | L | 88.36 | 79.57 | 74.55
SA-SSD [10] | L | 88.75 | 79.79 | 74.16
TANet [19] | L | 85.94 | 75.76 | 68.32
Voxel R-CNN [5] | L | 90.90 | 81.62 | 77.06
HVNet [41] | L | 87.21 | 77.58 | 71.79
Point-GNN [32] | L | 88.33 | 79.47 | 72.29
PointRCNN [30] | L | 86.96 | 75.64 | 70.70
Pyramid-P (ours) | L | 87.03 | 80.30 | 76.48
Part-A2 Net [31] | L | 87.81 | 78.49 | 73.51
Pyramid-V (ours) | L | 87.06 | 81.28 | 76.85
PV-RCNN [28] | L | 90.25 | 81.43 | 76.82
Pyramid-PV (ours) | L | 88.39 | 82.08 | 77.49
Table 4: Car detection results (3D mAP, %) on the KITTI val set.

Methods | Easy | Mod. | Hard
PointRCNN [30] | 88.88 | 78.63 | 77.38
Pyramid-P (ours) | 88.47 | 83.10 | 78.44
Part-A2 Net [31] | 89.47 | 79.47 | 78.54
Pyramid-V (ours) | 88.44 | 83.14 | 78.61
PV-RCNN [28] | 89.35 | 83.69 | 78.70
Pyramid-PV (ours) | 89.37 | 84.38 | 78.84
4.4 Ablation Studies
The effects of different components. As shown in Table 5, on the Waymo validation set, the RoI-grid Pyramid improves the Pyramid-PV model over the baseline by 3.76 mAP (70.30 to 74.06), mainly because the RoI-grid Pyramid is able to capture large context information, which benefits the detection of hard cases. Based on the RoI-grid Pyramid, replacing RoI-grid Pooling with RoI-grid Attention further boosts the performance, which indicates that RoI-grid Attention is a more effective operation than RoI-grid Pooling. Using the adaptive radius instead of a fixed radius yields an additional gain, reaching 76.30 mAP in total, which demonstrates the efficacy of the DARP module.
The effects of different pyramid configurations. As shown in Table 6, we found that an RoI-grid Pyramid with enlarging ratios greater than 1 at the higher levels enhances the performance compared with the standard RoI-grid that only uses a ratio of 1, mainly because placing some grid points outside RoIs encodes richer contexts. The total number of grid points used remains comparable to the 216 grid points used in [28].
Inference speed analysis. We test the inference speed of different frameworks on a single V100 GPU, and report the average running speed over all samples in the KITTI val split. Table 7 shows that our models maintain computational efficiency compared to the baselines, and the pyramid RoI head adds only a little latency per frame.
Table 5: The effects of different components of the pyramid RoI head on the Waymo validation set. R.P.: RoI-grid Pyramid; D.A.R.P.: Density-Aware Radius Prediction; R.A.: RoI-grid Attention.

Methods | R.P. | D.A.R.P. | R.A. | LEVEL_1 mAP
PV-RCNN | | | | 70.30
PV-RCNN | | | | 74.06
(a) | | | | 75.26
(b) | | | | 75.63
(c) | | | | 75.77
(d) | | | | 76.30
Table 6: The effects of different pyramid configurations on the Waymo validation set.

Methods | grid sizes | enlarging ratios | LEVEL_1 mAP
PV-RCNN | [6, 6] | [1, 1] | 74.06
(a) | [6, 4, 4] | [1, 1, 2] | 74.55
(b) | [6, 4, 4, 4] | [1, 1, 2, 4] | 74.71
(c) | [6, 4, 4, 4, 1] | [1, 1, 1.5, 2, 4] | 75.26
5 Conclusion
We present Pyramid R-CNN, a general two-stage framework that can be applied upon diverse backbones. Our framework handles the sparsity and non-uniform distribution problems of point clouds by introducing the pyramid RoI head. For future work, we plan to optimize Pyramid R-CNN for efficient inference.
References
 [1] (2020) Range conditioned dilated convolutions for scale invariant 3d object detection. arXiv preprint arXiv:2005.09927. Cited by: Table 1.

[2]
(2018)
Cascade rcnn: delving into high quality object detection.
In
Proceedings of the IEEE conference on computer vision and pattern recognition
, pp. 6154–6162. Cited by: §1.  [3] (2020) Every view counts: crossview consistency in 3d object detection with hybridcylindricalspherical voxelization. Advances in Neural Information Processing Systems 33. Cited by: Table 1.
 [4] (2017) Multiview 3d object detection network for autonomous driving. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1907–1915. Cited by: Table 3.
 [5] (2020) Voxel rcnn: towards high performance voxelbased 3d object detection. arXiv preprint arXiv:2012.15712. Cited by: §1, §2, §3.3, §3.3, Table 1, Table 3.
 [6] (2020) Afdet: anchor free one stage 3d object detection. arXiv preprint arXiv:2006.12671. Cited by: Table 1.
 [7] (2013) Vision meets robotics: the kitti dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237. Cited by: §1, §4.
 [8] (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §1.
 [9] (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §1.
 [10] (2020) Structure aware single-stage 3d object detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11873–11882. Cited by: Table 3.
 [11] (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §1.
 [12] (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §3.4.
 [13] (2018) Joint 3d proposal generation and object detection from view aggregation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–8. Cited by: Table 3.
 [14] (2019) Pointpillars: fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12697–12705. Cited by: §1, §2, Table 1, Table 3.
 [15] (2019) Patch refinement–localized 3d object detection. arXiv preprint arXiv:1910.04093. Cited by: Table 3.
 [16] (2019) Multi-task multi-sensor fusion for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7345–7353. Cited by: Table 3.
 [17] (2018) Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 641–656. Cited by: Table 3.
 [18] (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §3.2.
 [19] (2020) Tanet: robust 3d object detection from point clouds with triple attention. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 11677–11684. Cited by: Table 3.
 [20] (2021) One million scenes for autonomous driving: once dataset. arXiv preprint arXiv:2106.11037. Cited by: §1.
 [21] (2019) Interpolated convolutional networks for 3d point cloud understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1578–1587. Cited by: §2.
 [22] (2019) Lasernet: an efficient probabilistic 3d object detector for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12677–12686. Cited by: Table 1.
 [23] (2019) Starnet: targeted computation for object detection in point clouds. arXiv preprint arXiv:1908.11069. Cited by: Table 1.
 [24] (2020) CLOCs: camera-lidar object candidates fusion for 3d object detection. arXiv preprint arXiv:2009.00784. Cited by: Table 3.
 [25] (2018) Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 918–927. Cited by: Table 3.
 [26] (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413. Cited by: §2.
 [27] (2016) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence 39 (6), pp. 1137–1149. Cited by: §1.
 [28] (2020) Pv-rcnn: point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10529–10538. Cited by: §1, §1, §1, §2, §3.1, §3.3, §3.3, §4.1, §4.1, §4.4, Table 1, Table 2, Table 3, Table 4, Table 7.
 [29] (2021) Pv-rcnn++: point-voxel feature set abstraction with local vector representation for 3d object detection. arXiv preprint arXiv:2102.00463. Cited by: §1, §2, §3.3, §3.3.
 [30] (2019) Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 770–779. Cited by: §1, §1, §2, §3.1, §4.1, §4.1, Table 1, Table 3, Table 4, Table 7.
 [31] (2020) From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1, §1, §2, §3.1, §4.1, §4.1, Table 1, Table 3, Table 4, Table 7.
 [32] (2020) Point-gnn: graph neural network for 3d object detection in a point cloud. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1711–1719. Cited by: §2, Table 3.
 [33] (2020) Scalability in perception for autonomous driving: waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2446–2454. Cited by: §4.
 [34] (2020) OpenPCDet: an open-source toolbox for 3d object detection from point clouds. Note: https://github.com/openmmlab/OpenPCDet Cited by: §4.1.
 [35] (2017) Attention is all you need. arXiv preprint arXiv:1706.03762. Cited by: §3.3.
 [36] (2020) Pillar-based object detection for autonomous driving. arXiv preprint arXiv:2007.10323. Cited by: §2, Table 1.
 [37] (2019) Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog) 38 (5), pp. 1–12. Cited by: §3.3.
 [38] (2018) Second: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337. Cited by: §1, §2, Table 3.
 [39] (2020) 3dssd: point-based 3d single stage object detector. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11040–11048. Cited by: §2, Table 3.
 [40] (2019) Std: sparse-to-dense 3d object detector for point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1951–1960. Cited by: §1, §2, Table 3.
 [41] (2020) Hvnet: hybrid voxel network for lidar based 3d object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1631–1640. Cited by: Table 3.
 [42] (2020) Center-based 3d object detection and tracking. arXiv preprint arXiv:2006.11275. Cited by: §2, Table 2.
 [43] (2020) 3d-cvf: generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. arXiv preprint arXiv:2004.12636 3. Cited by: Table 3.
 [44] (2020) Point transformer. arXiv preprint arXiv:2012.09164. Cited by: §3.3, §3.3.
 [45] (2020) End-to-end multi-view fusion for 3d object detection in lidar point clouds. In Conference on Robot Learning, pp. 923–932. Cited by: Table 1.
 [46] (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499. Cited by: §1, §2, Table 3.
Appendix A Approximation in Radius Prediction
In this section, we explain why is used to approximate . As shown in Figure 6, can be viewed as a soft approximation to , and the sharpness of the curve is controlled by the temperature . As approaches , becomes closer to . In this paper, we set the initial to for exploration, and gradually decrease it to to obtain a better approximation.
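Although the exact symbols are omitted above, the general idea of a temperature-controlled sigmoid softly approximating a hard indicator can be illustrated as follows; the function name and the specific annealing schedule here are our own illustration, not the paper's exact formulation.

```python
import numpy as np

def soft_indicator(d, r, tau):
    """Sigmoid relaxation of the hard membership test 1[d < r].

    d: distance(s) to the sphere center; r: radius; tau: temperature.
    As tau -> 0, the sigmoid sharpens toward the hard step function.
    """
    z = np.clip((d - r) / tau, -60.0, 60.0)  # clip to avoid overflow in exp
    return 1.0 / (1.0 + np.exp(z))

d = np.linspace(0.0, 2.0, 200)      # sample distances around r = 1
hard = (d < 1.0).astype(float)      # hard indicator 1[d < r]
for tau in (1.0, 0.1, 0.01):        # annealing the temperature
    err = np.mean(np.abs(soft_indicator(d, 1.0, tau) - hard))
    print(f"tau={tau}: mean |soft - hard| = {err:.3f}")
```

Starting with a large temperature keeps the approximation smooth (and gradients useful for exploration), while annealing it tightens the fit to the hard indicator.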
Appendix B Implementation of the DARP Module
In this section, we provide the detailed implementation of the Density-Aware Radius Prediction (DARP) module. Inspired by the design of Deformable Convolutions, which utilize standard convolutions to predict the deformable offsets, we first use a fixed sphere to aggregate a context embedding and then use this embedding to predict the dynamic radius offset shared by all grid points in a pyramid level. In particular, for each pyramid level in an RoI-grid Pyramid, we use two spheres centered at the RoI with radius and for context aggregation. The aggregated context embedding is then fed into an MLP to predict the dynamic radius offset . The predefined radius plus the dynamic offset , , is utilized to obtain the coefficient in Eq.10, and with , the Points of Interest within are selected as for the computation of RoI-grid Attention in Eq.13, where is the temperature. The predefined in this paper is set to , , , , for the respective pyramid levels. We note that all grid points in a pyramid level share the same , and predicting adds little computational overhead. This idea can easily be extended to settings where each grid point has its individual predicted radius, or where centers of the predicted spheres are additionally predicted.
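A minimal NumPy sketch of this two-step scheme, aggregating context within fixed spheres and then predicting a shared radius offset with a small MLP, is given below. All names, feature dimensions, and the (random) MLP weights are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def darp_radius(points, feats, roi_center, fixed_radii, r_hat, w1, b1, w2, b2):
    """Predict the dynamic radius r = r_hat + delta_r for one pyramid level.

    points: (N, 3) Points of Interest; feats: (N, C) their features.
    fixed_radii: two fixed context-aggregation radii around the RoI center.
    r_hat: predefined radius for this level; (w1, b1, w2, b2): MLP weights.
    """
    ctx = []
    for r in fixed_radii:
        # Average-pool features of the points inside each fixed sphere.
        mask = np.linalg.norm(points - roi_center, axis=1) < r
        ctx.append(feats[mask].mean(axis=0) if mask.any()
                   else np.zeros(feats.shape[1]))
    x = np.concatenate(ctx)                   # context embedding (2C,)
    h = np.maximum(x @ w1 + b1, 0.0)          # hidden layer with ReLU
    delta_r = np.tanh(h @ w2 + b2).item()     # bounded dynamic radius offset
    return r_hat + delta_r

# Toy usage: 100 points with 16-dim features, a 2-layer MLP.
pts = rng.normal(size=(100, 3))
fts = rng.normal(size=(100, 16))
w1, b1 = rng.normal(size=(32, 8)) * 0.1, np.zeros(8)
w2, b2 = rng.normal(size=(8, 1)) * 0.1, np.zeros(1)
r = darp_radius(pts, fts, np.zeros(3), fixed_radii=(0.5, 1.0),
                r_hat=1.5, w1=w1, b1=b1, w2=w2, b2=b2)
print(round(r, 3))
```

Because the offset is predicted once per pyramid level rather than per grid point, the extra cost is a single masked pooling and one small MLP forward pass per level, which is consistent with the low overhead noted above.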
Appendix C Backbones of Pyramid RCNN
In this section, we provide additional information on some backbones of Pyramid RCNN. Backbones not mentioned here are taken directly from the official source-code repositories. The pyramid RoI head is kept identical across all backbones in this paper.
Pyramid-P. We re-implement the backbone of PointRCNN on the Waymo Open dataset. Different from the original version on the KITTI dataset, we set the number of sampled input points to , and the numbers of downsampled points to , , , for the respective layers. We note that this modification enlarges the number of kept points, since the number of input points is larger than on the KITTI dataset. Pyramid-P and our re-implemented PointRCNN share the same backbone configurations on the Waymo Open dataset.
Pyramid-PV. In Pyramid-PV we implement a larger backbone with an input voxel size of . The backbones of Pyramid-PV and vanilla Pyramid-PV (PV-RCNN) are shown in Figure 7. Our Pyramid RCNN is compatible with a small backbone for a fair comparison with baseline methods, or with a large backbone to further enhance detection performance.
Appendix D Qualitative Results
In this section, we provide the qualitative results on the KITTI dataset in Figure 8, and the Waymo Open dataset in Figure 9. The figures show that our proposed Pyramid RCNN can accurately detect 3D objects which are far away and have only a few points.