3D object detection is critical to many real-world applications, such as autonomous driving, intelligent traffic systems and robotics. The most commonly used data representation for 3D object detection is the point cloud, which can be generated by depth sensors (e.g., LiDAR sensors and RGB-D cameras) that capture 3D scene information. The sparsity and irregular data format of point clouds make it challenging to extend 2D detection methods [15, 57, 39, 55, 37, 19, 38] to 3D object detection from point clouds.
To learn discriminative features from these sparse and spatially irregular points, several methods transform them into regular grid representations by voxelization [61, 8, 91, 27, 78, 35, 77, 28, 76, 36], which can be efficiently processed by conventional Convolutional Neural Networks (CNNs) and well-studied 2D detection heads [57, 39] for 3D object detection. However, the inevitable information loss from quantization degrades the fine-grained localization accuracy of these voxel-based methods. In contrast, building on the pioneering PointNet works [51, 53], other methods directly learn effective features from raw point clouds and predict 3D boxes around the foreground points [52, 58, 73, 81]. These point-based methods naturally preserve accurate point locations and have flexible receptive fields thanks to radius-based local feature aggregation (such as set abstraction), but are generally computationally intensive.
For high-quality 3D box prediction, we observe that the voxel-based representation with densely placed predefined anchors can generate better 3D box proposals with a higher recall rate, while performing the proposal refinement on the 3D point-wise features naturally benefits from more fine-grained point locations. Motivated by these observations, we argue that the performance of 3D object detection can be boosted by intertwining diverse local feature learning strategies from both voxel-based and point-based representations. Thus, we propose the Point-Voxel Region based Convolutional Neural Networks (PV-RCNNs) to incorporate the best of both point-based and voxel-based strategies for 3D object detection from point clouds.
First, we propose a novel two-stage detection network, PV-RCNN-v1, for accurate 3D object detection through a two-step strategy of point-voxel feature aggregation. In step I, voxel-to-keypoint scene encoding, a voxel CNN with sparse convolution is adopted for voxel-wise feature learning and accurate proposal generation. The voxel-wise features of the whole scene are then summarized into a small set of keypoints by PointNet-based set abstraction, where the keypoints with accurate point locations are sampled from the raw point clouds by furthest point sampling (FPS). In step II, keypoint-to-grid RoI feature abstraction, we propose the RoI-grid pooling module to aggregate the above keypoint features to the RoI-grid points of each proposal. It encodes multi-scale context information and forms regular grid features for the subsequent accurate proposal refinement.
Second, we propose an advanced two-stage detection network, PV-RCNN-v2, built on top of PV-RCNN-v1, for more accurate, efficient and practical 3D object detection. Compared with PV-RCNN-v1, the improvements of PV-RCNN-v2 mainly lie in two aspects.
We first propose a sectorized proposal-centric keypoint sampling strategy that concentrates the limited keypoints around the 3D proposals to encode more effective features for scene encoding and proposal refinement. Meanwhile, sectorized FPS samples keypoints in different sectors in parallel, which keeps the keypoints uniformly distributed while accelerating the sampling process. Our proposed keypoint sampling strategy is much faster and more effective than naive FPS (which has quadratic complexity); this makes the whole network much more efficient and is particularly important for large-scale 3D scenes with millions of points. Moreover, we propose a novel local feature aggregation operation, VectorPool aggregation, for more effective and efficient local feature encoding from local neighborhoods. We argue that the relative point locations of local point clouds are robust, effective and discriminative information for describing local geometry. We propose to split the 3D local space into regular, compact and dense voxels, whose features are sequentially arranged to form a hyper feature vector that encodes local geometric information, while the voxel-wise features at different locations are encoded with different kernel weights to generate position-sensitive local features. Hence, different resulting feature channels encode the local features of specific local locations.
Compared with the previous set abstraction operation for local feature aggregation, our proposed VectorPool aggregation can efficiently handle very large numbers of centric points for local feature aggregation due to our compact local feature representation. Equipped with the VectorPool aggregation in both the voxel-based backbone and RoI-grid pooling module, our proposed PV-RCNN-v2 framework could be much more memory-efficient and faster than previous counterparts while keeping comparable or even better performance to establish a more practical 3D detector for resource-limited devices.
In summary, our proposed Point-Voxel Region based Convolutional Neural Networks (PV-RCNNs) have three major contributions. (1) Our proposed PV-RCNN-v1 adopts two novel operations, voxel-to-keypoint scene encoding and keypoint-to-grid RoI feature abstraction, to deeply incorporate the advantages of both point-based and voxel-based feature learning strategies into a unified 3D detection framework. (2) Our proposed PV-RCNN-v2 takes a step further toward more practical 3D detection with better performance, less resource consumption and faster running speed. This is enabled by our proposed sectorized proposal-centric keypoint sampling strategy, which obtains more representative keypoints at a faster speed, and by our novel VectorPool aggregation, which performs local aggregation on a very large number of central points with less resource consumption and a more effective representation. (3) Our proposed 3D detectors surpass all published methods by remarkable margins on multiple challenging 3D detection benchmarks, and our PV-RCNN-v2 achieves new state-of-the-art results on the Waymo Open dataset with 10 FPS inference speed for the detection region. The source code will be available at https://github.com/open-mmlab/OpenPCDet .
2 Related Work
2.1 2D/3D Object Detection with RGB images
2D Object Detectors. We summarize the 2D object detectors into anchor-based and anchor-free directions. The approaches following the anchor-based paradigm rely on empirically pre-defined anchor boxes to perform detection. In this direction, object detectors are further divided into two-stage [57, 12, 2, 5, 48] and one-stage [39, 56, 38, 30] categories. Two-stage approaches are characterized by suggesting a series of region proposals as candidates for per-region refinement, while one-stage ones directly perform detection on feature maps in a fully convolutional manner. On the other hand, studies of the anchor-free direction mainly fall into keypoints-based [29, 89, 79, 13] and center-based [20, 67, 88, 26] paradigms. The keypoints-based methods represent bounding boxes as keypoints, i.e., corner/extreme points, grid points and a set of bounded points, while the center-based approaches predict the bounding box from foreground points inside objects. Besides, the recently proposed DETR leverages widely adopted transformers from natural language processing to detect objects with an attention mechanism and self-learnt object queries, which also gets rid of anchor boxes.
3D Object Detectors with RGB images.
Image-based 3D object detection aims to estimate 3D bounding boxes from a monocular image or stereo images. The early work Mono3D first generates 3D region proposals with a ground-plane assumption, and then exploits semantic knowledge from monocular images for box scoring. The following works [45, 31] incorporate the relations between 2D and 3D boxes as geometric constraints. M3D-RPN introduces an end-to-end 3D region proposal network with depth-aware convolutions. [85, 44, 4, 46, 42] develop monocular 3D object detectors based on a wire-frame template obtained from CAD models. RTM3D extends the CAD-based methods and performs coarse keypoint detection to localize 3D objects in real time. On the stereo side, Stereo R-CNN [32, 54] capitalizes on a Stereo RPN to associate proposals from left and right images and further refines the 3D region of interest by a region-based photometric alignment. DSGN introduces a differentiable 3D geometric volume to simultaneously learn depth information and semantic cues in an end-to-end optimized pipeline. Another entry point for detecting 3D objects from RGB images is to convert the image pixels with estimated depth maps into artificial point clouds, i.e., pseudo-LiDAR [70, 54, 84], on which LiDAR-based detection models can operate for 3D box estimation. These image-based 3D detection methods suffer from inaccurate depth estimation and can only generate coarse 3D bounding boxes.
2.2 3D Object Detection with Point Clouds
Grid-based 3D Object Detectors. To tackle the irregular data format of point clouds, most existing works project the point clouds to regular grids to be processed by 2D or 3D CNNs. The pioneering work MV3D projects the point clouds to 2D bird-view grids and places many predefined 3D anchors for generating 3D bounding boxes, and the following works [27, 35, 36, 68, 83, 22] develop better strategies for multi-sensor fusion. [78, 77, 28] introduce more efficient frameworks with a bird-eye view representation, and fusing grid features of multiple scales has also been proposed. MVF integrates 2D features from the bird-eye view and the perspective view before projecting points into pillar representations. Some other works [61, 91] divide the point clouds into 3D voxels to be processed by 3D CNNs, and 3D sparse convolution is introduced for efficient 3D voxel processing. [69, 92] utilize multiple detection heads, while object part locations have also been explored for improving the performance. In addition, [71, 6] predict bounding box parameters following the anchor-free paradigm. These grid-based methods are generally efficient for accurate 3D proposal generation, but their receptive fields are constrained by the kernel size of the 2D/3D convolutions.
Point-based 3D Object Detectors. F-PointNet first proposes to apply PointNet [51, 53] for 3D detection from the cropped point clouds based on the 2D image bounding boxes. PointRCNN generates 3D proposals directly from the whole point clouds instead of 2D images for 3D detection with point clouds only, and the following work STD proposes a sparse-to-dense strategy for better proposal refinement. A Hough voting strategy has also been proposed for better object feature grouping. 3DSSD introduces F-FPS as a complement of D-FPS and develops the first one-stage object detector operating on raw point clouds. These point-based methods are mostly based on the PointNet series, especially the set abstraction operation, which enables flexible receptive fields for point cloud feature learning.
Representation Learning on Point Clouds. Recently, representation learning on point clouds has drawn much attention for improving the performance of point cloud classification and segmentation [51, 53, 91, 72, 21, 86, 34, 62, 74, 23, 24, 66, 11, 40, 75]. In terms of 3D detection, previous methods generally project the point clouds to regular bird-view grids [8, 78] or 3D voxels [91, 9] for processing point clouds with 2D/3D CNNs. 3D sparse convolution [17, 16] is adopted in [76, 59] to effectively learn sparse voxel-wise features from the point clouds. Qi et al. [51, 53] propose PointNet to directly learn point-wise features from the raw point clouds, where the set abstraction operation enables flexible receptive fields by setting different search radii. Combining a voxel-based CNN with a point-based shared-parameter multi-layer perceptron (MLP) network has also been explored for efficient point cloud feature learning. In comparison, our PV-RCNN-v1 takes advantage of both voxel-based feature learning (i.e., 3D sparse convolution) and PointNet-based feature learning (i.e., the set abstraction operation) to enable both high-quality 3D proposal generation and flexible receptive fields for improving the 3D detection performance. In this paper, we propose the VectorPool aggregation operation to aggregate more effective local features from point clouds with much less resource consumption and faster running speed than the commonly used set abstraction, and it is adopted in both the VSA module and the RoI-grid pooling module of our newly proposed PV-RCNN-v2 framework for much more efficient and accurate 3D object detection.
State-of-the-art 3D object detectors mostly adopt two-stage frameworks, which generally achieve higher performance by splitting the complex detection problem into two stages: region proposal generation and per-proposal refinement. In this section, we briefly introduce our chosen strategy for the fundamental feature extraction and proposal generation stage, and then discuss the challenges of straightforward methods for the second stage, 3D proposal refinement.
3D voxel CNN and proposal generation. A voxel CNN with 3D sparse convolution [17, 16] is a popular choice in state-of-the-art 3D detectors [76, 59, 18] for its efficiency in converting irregular point clouds into 3D sparse feature volumes. The input points are first divided into small voxels with a spatial resolution of $L \times W \times H$, where the feature of each non-empty voxel is directly calculated by averaging the features of all points inside it (the commonly used features are the 3D coordinates and reflectance intensities). The network utilizes a series of 3D sparse convolutions to gradually convert the point clouds into progressively downsampled feature volumes. The sparse feature volume at each level can be viewed as a set of sparse voxel-wise feature vectors. The voxel CNN backbone can be naturally combined with the detection heads of 2D detectors [39, 57] by converting the encoded downsampled 3D feature volumes into 2D bird-view feature maps. Specifically, following prior work, we stack the 3D feature volumes along the $Z$ axis to obtain the bird-view feature maps. The anchor-based SSD head is appended to these bird-view feature maps for high-quality 3D proposal generation. As shown in Table I, our adopted 3D voxel CNN backbone with the anchor-based scheme achieves higher recall than the PointNet-based approaches, which establishes a strong backbone network and generates robust proposal boxes for the following proposal refinement stage.
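The initial voxelization step described above (averaging the features of all points inside each non-empty voxel) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `voxelize_mean` and the parameters `voxel_size` and `range_min` are assumptions introduced here.

```python
import numpy as np

def voxelize_mean(points, voxel_size, range_min):
    """Average the features of all points that fall into the same voxel.

    points:     (N, 4) array of x, y, z, reflectance intensity
    voxel_size: (3,) edge lengths of one voxel (hypothetical parameter)
    range_min:  (3,) minimum corner of the detection range (hypothetical parameter)
    Returns the integer coordinates of the non-empty voxels and their mean features.
    """
    coords = np.floor((points[:, :3] - range_min) / voxel_size).astype(np.int64)
    # Group points by voxel: `inverse` maps each point to the row of its voxel.
    voxel_coords, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # guard against shape differences across NumPy versions
    sums = np.zeros((len(voxel_coords), points.shape[1]))
    np.add.at(sums, inverse, points)  # unbuffered scatter-add of point features
    counts = np.bincount(inverse, minlength=len(voxel_coords))
    return voxel_coords, sums / counts[:, None]
```

Only non-empty voxels are returned, which matches the sparse representation consumed by sparse convolution layers.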
Discussions on 3D proposal refinement.
In the proposal refinement stage, proposal-specific features need to be extracted from the resulting 3D feature volumes or 2D maps. Intuitively, the proposal feature extraction should be conducted in 3D space instead of on the 2D feature maps, in order to learn more fine-grained features for 3D proposal refinement. However, the 3D feature volumes from the 3D voxel CNN have two major limitations. (i) These feature volumes generally have low spatial resolution, as they are downsampled by up to 8 times, which hinders accurate localization of objects in the input scene. (ii) Even if one upsamples them to obtain feature volumes/maps with larger spatial sizes, they are generally still quite sparse. The commonly used trilinear or bilinear interpolation in the RoIPool/RoIAlign operations can only extract features from very small neighborhoods (i.e., the 4 and 8 nearest neighbors for bilinear and trilinear interpolation, respectively), and would therefore obtain features that are mostly zeros, wasting much computation and memory in proposal refinement.
On the other hand, the point-based local feature aggregation methods  have shown strong capability of encoding sparse features from local neighborhoods with arbitrary scales. We therefore propose to incorporate a 3D voxel CNN with the point-based local feature aggregation operation for conducting accurate and robust proposal refinement.
4 PV-RCNN-v1: Point-Voxel Feature Set Abstraction for 3D Object Detection
To learn effective features from sparse point clouds, state-of-the-art 3D detection approaches are based on either 3D voxel CNNs with sparse convolution or PointNet-based operators. Generally, the 3D voxel CNNs with sparse convolutional layers are more efficient and are able to generate high-quality 3D proposals, while the PointNet-based operators naturally preserve accurate point locations and can capture rich context information with flexible receptive fields.
We propose a novel two-stage 3D detection framework, PV-RCNN-v1, to deeply integrate the advantages of two types of operators for more accurate 3D object detection from point clouds. As shown in Fig. 1, PV-RCNN-v1 consists of a 3D voxel CNN with sparse convolution as the backbone for efficient feature encoding and proposal generation. Given each 3D proposal, we propose to encode proposal specific features in two novel steps: the voxel-to-keypoint scene encoding, which summarizes all the voxel feature volumes of the overall scene into a small number of feature keypoints, and the keypoint-to-grid RoI feature abstraction, which aggregates the scene keypoint features to RoI grids to generate proposal-aligned features for confidence estimation and location refinement.
4.1 Voxel-to-keypoint scene encoding via voxel set abstraction
Our proposed PV-RCNN-v1 first aggregates the voxel-wise scene features at multiple neural layers of 3D voxel CNN into a small number of keypoints, which bridge the 3D voxel CNN feature encoder and the proposal refinement network.
Keypoints Sampling. In PV-RCNN-v1, we simply adopt the Furthest Point Sampling (FPS) algorithm to sample a small number $n$ of keypoints $\mathcal{K} = \{p_1, \dots, p_n\}$ from the point cloud $\mathcal{P}$, where $n$ is a hyper-parameter chosen as a small fraction of the number of points in $\mathcal{P}$. Such a strategy encourages the keypoints to be uniformly distributed around non-empty voxels and to be representative of the overall scene.
Voxel Set Abstraction Module. We propose the Voxel Set Abstraction (VSA) module to encode multi-scale semantic features from the 3D feature volumes to the keypoints. Set abstraction is adopted for aggregating voxel-wise feature volumes. Unlike the original set abstraction, the surrounding local points are now regular voxel-wise semantic features from the 3D voxel CNN, instead of neighboring raw points with features learned by PointNet.
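For reference, the greedy FPS procedure can be sketched as below. This is a plain NumPy illustration written for this text (the function name and the `start` parameter are assumptions); production code typically runs an equivalent loop on the GPU.

```python
import numpy as np

def furthest_point_sampling(points, n_samples, start=0):
    """Greedy FPS: repeatedly pick the point furthest from the already chosen set.

    points: (N, 3) array; returns the indices of the n_samples chosen points.
    The cost is O(N * n_samples), i.e. quadratic when n_samples grows with N.
    """
    n = points.shape[0]
    chosen = np.empty(n_samples, dtype=np.int64)
    # min_dist[i] = squared distance from point i to its nearest chosen point
    min_dist = np.full(n, np.inf)
    chosen[0] = start
    for k in range(1, n_samples):
        diff = points - points[chosen[k - 1]]
        min_dist = np.minimum(min_dist, np.einsum("ij,ij->i", diff, diff))
        chosen[k] = int(np.argmax(min_dist))
    return chosen
```

Each iteration scans all remaining points, which is why the sampling cost becomes significant for scenes with very many points.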
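Specifically, denote $\mathcal{F}^{(l_k)} = \{f_1^{(l_k)}, \dots, f_{N_k}^{(l_k)}\}$ as the set of voxel-wise feature vectors in the $k$-th level of the 3D voxel CNN, and $\mathcal{V}^{(l_k)} = \{v_1^{(l_k)}, \dots, v_{N_k}^{(l_k)}\}$ as their corresponding 3D coordinates in the uniform 3D metric space, where $N_k$ is the number of non-empty voxels in the $k$-th level. For each keypoint $p_i$, we first identify its neighboring non-empty voxels at the $k$-th level within a radius $r_k$ to retrieve the set of neighboring voxel-wise feature vectors as

$$\mathcal{S}_i^{(l_k)} = \left\{ \left[ f_j^{(l_k)};\ v_j^{(l_k)} - p_i \right]^T \;\middle|\; \left\| v_j^{(l_k)} - p_i \right\| < r_k,\ \forall v_j^{(l_k)} \in \mathcal{V}^{(l_k)},\ \forall f_j^{(l_k)} \in \mathcal{F}^{(l_k)} \right\}, \tag{1}$$

where we concatenate the local relative position $v_j^{(l_k)} - p_i$ to indicate the relative location of the voxel feature $f_j^{(l_k)}$. The features within the neighboring set $\mathcal{S}_i^{(l_k)}$ are then aggregated by a PointNet-block to generate the keypoint feature as

$$f_i^{(pv_k)} = \max\left\{ G\left( \mathcal{M}\left( \mathcal{S}_i^{(l_k)} \right) \right) \right\}, \tag{2}$$

where $\mathcal{M}(\cdot)$ denotes randomly sampling at most $T_k$ voxels from the neighboring set $\mathcal{S}_i^{(l_k)}$ to save computation, and $G(\cdot)$ denotes a multi-layer perceptron network that encodes the voxel-wise features and relative locations. The operation $\max(\cdot)$ maps a varying number of neighboring voxel features to a single keypoint feature $f_i^{(pv_k)}$. Here multiple radii are utilized at each layer to capture richer contextual information.
The above voxel feature aggregation is performed at different scales of the 3D voxel CNN, and the aggregated features from different scales are concatenated to obtain the multi-scale semantic feature for keypoint $p_i$ as

$$f_i^{(pv)} = \left[ f_i^{(pv_1)}, f_i^{(pv_2)}, f_i^{(pv_3)}, f_i^{(pv_4)} \right], \quad i = 1, \dots, n, \tag{3}$$

where the generated feature $f_i^{(pv)}$ incorporates both the 3D CNN-based voxel-wise features $f_j^{(l_k)}$ and the PointNet-based features as in Eq. (2). Moreover, the 3D coordinates of $p_i$ also naturally preserve accurate location information.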
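The aggregation described above (radius query, concatenation of relative positions, random subsampling, shared encoding, max-pooling) can be sketched for one level and one radius as follows. This is a simplified NumPy illustration under stated assumptions: a single linear layer with ReLU stands in for the MLP, the radius test uses the plain Euclidean distance, and all names (`voxel_set_abstraction`, `max_samples`, `weights`) are introduced here for illustration.

```python
import numpy as np

def voxel_set_abstraction(keypoints, voxel_xyz, voxel_feats, radius,
                          max_samples, weights, rng):
    """Aggregate voxel features around each keypoint (one level, one radius).

    keypoints:   (n, 3) keypoint coordinates
    voxel_xyz:   (Nk, 3) centers of the non-empty voxels at this level
    voxel_feats: (Nk, C) voxel-wise features at this level
    weights:     (C + 3, C_out) single linear layer standing in for the MLP
    """
    out = np.zeros((len(keypoints), weights.shape[1]))
    for i, p in enumerate(keypoints):
        rel = voxel_xyz - p                                # local relative positions
        idx = np.flatnonzero(np.einsum("ij,ij->i", rel, rel) < radius ** 2)
        if idx.size == 0:
            continue                                       # empty neighborhood: zero feature
        if idx.size > max_samples:                         # random subsampling
            idx = rng.choice(idx, size=max_samples, replace=False)
        grouped = np.concatenate([voxel_feats[idx], rel[idx]], axis=1)
        encoded = np.maximum(grouped @ weights, 0.0)       # linear layer + ReLU
        out[i] = encoded.max(axis=0)                       # max-pool over the neighbors
    return out
```

Running this once per level (and per radius) and concatenating the outputs along the channel axis gives the multi-scale keypoint feature.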
Further Enriching Keypoint Features. We further enrich the keypoint features with the raw point cloud and with the downsampled 2D bird-view feature maps, where the raw point cloud can partially make up for the quantization loss of the initial point-cloud voxelization, while the 2D bird-view maps have larger receptive fields along the $Z$ axis. Specifically, the raw point-cloud feature $f_i^{(raw)}$ is also aggregated as in Eq. (2), while the bird-view feature $f_i^{(bev)}$ is obtained by bilinear interpolation on the 2D feature maps at the projected 2D keypoint location. Hence, the feature of keypoint $p_i$ is further enriched by concatenating all its associated features as

$$f_i^{(p)} = \left[ f_i^{(pv)}, f_i^{(raw)}, f_i^{(bev)} \right], \quad i = 1, \dots, n, \tag{4}$$

which has a strong capacity for preserving the 3D structural information of the entire scene for the following fine-grained proposal refinement step.
Predicted Keypoint Weighting. As mentioned before, the keypoints are chosen by the FPS algorithm and some of them might only represent background regions. Intuitively, keypoints belonging to foreground objects should contribute more to the accurate refinement of the proposals, while the ones from background regions should contribute less. Hence, we propose a Predicted Keypoint Weighting (PKW) module to re-weight the keypoint features with extra supervision from point-cloud segmentation. The segmentation labels come at no extra annotation cost and can be directly generated from the 3D box annotations, i.e., by checking whether each keypoint is inside or outside a ground-truth 3D box, since the 3D objects in autonomous driving scenes are naturally separated in 3D space. The predicted keypoint feature weighting can be formulated as

$$\tilde{f}_i^{(p)} = \mathcal{A}\left( f_i^{(p)} \right) \cdot f_i^{(p)}, \tag{5}$$

where $\mathcal{A}(\cdot)$ is a three-layer multi-layer perceptron (MLP) network with a sigmoid function to predict the foreground confidence. The PKW module is trained with the focal loss with its default hyper-parameters to handle the imbalanced foreground/background points of the training set.
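The label generation and re-weighting described above can be sketched as follows. The box layout `(cx, cy, cz, dx, dy, dz, heading)` with the heading as a rotation around the Z axis is an assumption made for this illustration, as are the function names; the foreground confidence passed to `reweight` would come from the MLP, which is not modeled here.

```python
import numpy as np

def points_in_box(points, box):
    """Boolean mask of the points inside one 3D box.

    box: (cx, cy, cz, dx, dy, dz, heading), heading = rotation around the Z axis
    (assumed layout for this sketch).
    """
    cx, cy, cz, dx, dy, dz, heading = box
    shifted = points[:, :3] - np.array([cx, cy, cz])
    # Rotate by -heading to express the points in the box's own frame.
    c, s = np.cos(-heading), np.sin(-heading)
    local_x = c * shifted[:, 0] - s * shifted[:, 1]
    local_y = s * shifted[:, 0] + c * shifted[:, 1]
    return ((np.abs(local_x) <= dx / 2) & (np.abs(local_y) <= dy / 2)
            & (np.abs(shifted[:, 2]) <= dz / 2))

def foreground_labels(keypoints, gt_boxes):
    """Segmentation label per keypoint: 1 inside any ground-truth box, else 0."""
    labels = np.zeros(len(keypoints), dtype=bool)
    for box in gt_boxes:
        labels |= points_in_box(keypoints, box)
    return labels.astype(np.float64)

def reweight(features, fg_confidence):
    """Scale each keypoint feature by its predicted foreground confidence."""
    return features * fg_confidence[:, None]
```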
4.2 Keypoint-to-grid RoI feature abstraction for proposal refinement
With the above voxel-to-keypoint scene encoding step, we can summarize the multi-scale semantic features into a small number of keypoints. In this step, we propose keypoint-to-grid RoI feature abstraction to generate accurate proposal-aligned features from the keypoint features for fine-grained proposal refinement.
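RoI-grid Pooling via Set Abstraction. Given each 3D proposal, as shown in Fig. 2, we propose an RoI-grid pooling operation to aggregate the keypoint features to the RoI-grid points with multiple receptive fields. We uniformly sample $6 \times 6 \times 6$ grid points within each 3D proposal, which are then flattened and denoted as $\mathcal{G} = \{g_1, \dots, g_{216}\}$. We utilize set abstraction to obtain the features of the grid points by aggregating the keypoint features. Specifically, with $\mathcal{K}$ denoting the keypoints and $\tilde{\mathcal{F}}$ their weighted features, we first identify the neighboring keypoints of grid point $g_i$ as

$$\tilde{\Psi}_i = \left\{ \left[ \tilde{f}_j^{(p)};\ p_j - g_i \right]^T \;\middle|\; \left\| p_j - g_i \right\| < \tilde{r},\ \forall p_j \in \mathcal{K},\ \forall \tilde{f}_j^{(p)} \in \tilde{\mathcal{F}} \right\}, \tag{6}$$

where $p_j - g_i$ is appended to indicate the local relative location within the ball of radius $\tilde{r}$. A PointNet-block is then adopted to aggregate the neighboring keypoint feature set $\tilde{\Psi}_i$ to obtain the feature for grid point $g_i$ as

$$\tilde{f}_i^{(g)} = \max\left\{ G\left( \mathcal{M}\left( \tilde{\Psi}_i \right) \right) \right\}, \tag{7}$$

where $\mathcal{M}(\cdot)$ and $G(\cdot)$ are defined in the same way as in Eq. (2). We set multiple radii $\tilde{r}$ and aggregate keypoint features with different receptive fields, which are concatenated together to capture richer multi-scale contextual information.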
After obtaining each grid’s aggregated features from its surrounding keypoints, all RoI-grid features of the same RoI can be vectorized and transformed by a two-layer MLP with feature dimensions to represent the overall proposal.
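To make the RoI-grid construction concrete, the sketch below places grid points uniformly inside a rotated 3D box. Two choices here are assumptions for illustration rather than details stated in the text: grid points are placed at cell centers, and the box is given as `(cx, cy, cz, dx, dy, dz, heading)` with the heading as a rotation around the Z axis.

```python
import numpy as np

def roi_grid_points(box, grid_size=6):
    """Uniformly sample grid_size**3 points inside a 3D box.

    box: (cx, cy, cz, dx, dy, dz, heading); returns (grid_size**3, 3) world coordinates.
    """
    cx, cy, cz, dx, dy, dz, heading = box
    # Cell centers of a unit cube centered at the origin: (i + 0.5) / grid_size - 0.5
    ticks = (np.arange(grid_size) + 0.5) / grid_size - 0.5
    gx, gy, gz = np.meshgrid(ticks, ticks, ticks, indexing="ij")
    local = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3) * np.array([dx, dy, dz])
    # Rotate around the Z axis by the box heading, then translate to the box center.
    c, s = np.cos(heading), np.sin(heading)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return local @ rot.T + np.array([cx, cy, cz])
```

Each returned grid point then acts as the center of a ball query over the keypoints, in the same way as a keypoint queries voxels in the scene encoding step.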
Our proposed RoI-grid pooling operation could aggregate much richer contextual information than the previous RoI-pooling/RoI-align operation [58, 81, 59]. This is because a single keypoint could contribute to multiple RoI-grid points due to the overlapped neighboring balls of RoI-grid points, and their receptive fields are even beyond the RoI boundaries by capturing the contextual keypoint features outside the 3D RoI. In contrast, the previous state-of-the-art methods either simply average all point-wise features within the proposal as the RoI feature , or pool many uninformative zeros as the RoI features because of the very sparse point-wise features [59, 81].
Proposal Refinement and Confidence Prediction. Given the RoI feature extracted by the above RoI-grid pooling module, the refinement network learns to predict the box residuals (i.e., center, size and orientation) relative to the 3D proposal box. Separate two-layer MLP sub-networks are employed for confidence prediction and proposal refinement, respectively. Following prior work, we conduct IoU-based confidence prediction, where the confidence training target $y_k$ is normalized to lie in $[0, 1]$ as

$$y_k = \min\left( 1,\ \max\left( 0,\ 2\,\mathrm{IoU}_k - 0.5 \right) \right), \tag{8}$$

where $\mathrm{IoU}_k$ is the Intersection-over-Union (IoU) of the $k$-th RoI w.r.t. its ground-truth box. The binary cross-entropy loss is adopted to optimize the IoU branch, while the box residuals are optimized with the smooth-L1 loss.
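The two refinement losses can be sketched as below. The clipped linear mapping from IoU to the confidence target (`2 * IoU - 0.5`, clipped to `[0, 1]`) is the form used in the PV-RCNN formulation and should be treated as an assumption here; the function names, the mean/sum reductions and the smooth-L1 transition point `beta` are illustrative choices.

```python
import numpy as np

def iou_confidence_target(iou):
    """Map RoI IoU to a soft confidence target in [0, 1] (assumed clipped linear form)."""
    return np.clip(2.0 * iou - 0.5, 0.0, 1.0)

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean BCE between predicted confidences and soft targets."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)).mean()

def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 loss summed over all residual components."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta).sum()
```

With this mapping, proposals with IoU below 0.25 are treated as fully negative and those above 0.75 as fully positive, with a linear ramp in between.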
4.3 Training losses
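The proposed PV-RCNN framework is trained end-to-end with the region proposal loss $L_{rpn}$, the keypoint segmentation loss $L_{seg}$ and the proposal refinement loss $L_{rcnn}$. (i) We adopt the same region proposal loss as in prior work,

$$L_{rpn} = L_{cls} + \beta \sum_{r \in \mathcal{O}} \mathcal{L}_{smooth\text{-}L1}\left( \widehat{\Delta r^a}, \Delta r^a \right) + \alpha L_{dir}, \tag{9}$$

where $\mathcal{O} = \{x, y, z, l, h, w, \theta\}$, the anchor classification loss $L_{cls}$ is calculated with the focal loss (default hyper-parameters), $L_{dir}$ is a binary classification loss for orientation that eliminates the ambiguity of $\widehat{\Delta \theta^a}$, and the smooth-L1 loss is for anchor box regression with the predicted residual $\widehat{\Delta r^a}$ and regression target $\Delta r^a$. The loss weights $\beta$ and $\alpha$ are set to fixed values in the training process. (ii) The keypoint segmentation loss $L_{seg}$ is also formulated with the focal loss, as mentioned in Sec. 4.1. (iii) The proposal refinement loss $L_{rcnn}$ includes the IoU-guided confidence prediction loss $L_{iou}$ and the box refinement loss as

$$L_{rcnn} = L_{iou} + \sum_{r \in \mathcal{O}} \mathcal{L}_{smooth\text{-}L1}\left( \widehat{\Delta r^p}, \Delta r^p \right), \tag{10}$$

where $\widehat{\Delta r^p}$ is the predicted box residual and $\Delta r^p$ is the proposal regression target, which is encoded in the same way as $\Delta r^a$. The overall training loss is then the sum of these three losses with equal weights.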
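A minimal sketch of the binary focal loss used for anchor classification and keypoint segmentation is given below. It assumes the commonly used defaults `alpha = 0.25` and `gamma = 2.0` from the original focal loss formulation; the function names and the sum reduction are illustrative.

```python
import numpy as np

def focal_loss(prob, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss: down-weights easy examples by (1 - p_t) ** gamma.

    prob:   predicted foreground probabilities in (0, 1)
    target: binary labels (1 = foreground, 0 = background)
    """
    prob = np.clip(prob, eps, 1.0 - eps)
    p_t = np.where(target == 1, prob, 1.0 - prob)        # probability of the true class
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)  # class-balancing weight
    return -(alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)).sum()

def total_loss(l_rpn, l_seg, l_rcnn):
    """Overall objective: the three losses summed with equal weights."""
    return l_rpn + l_seg + l_rcnn
```

Because well-classified background points receive a very small weight, the many easy negatives do not dominate the few foreground keypoints.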
4.4 Implementation details
As shown in Fig. 1, the 3D voxel CNN has four levels with feature dimensions , , , , respectively. Their two neighboring radii of each level in the VSA module are set as , , , , and the neighborhood radii of set abstraction for raw points are . For the proposed RoI-grid pooling operation, we uniformly sample grid points in each 3D proposal and the two neighboring radii of each grid point are .
5 PV-RCNN-v2: More accurate and efficient 3D detection with VectorPool aggregation
As discussed above, our PV-RCNN-v1 deeply incorporates point-based and voxel-based feature encoding operations in the voxel-to-keypoint and keypoint-to-grid feature set abstraction, which significantly boosts the performance of 3D object detection from point clouds. To make the framework more practical for real-world applications, we propose a new detection framework, PV-RCNN-v2, for more accurate and efficient 3D object detection with fewer resources. As shown in Fig. 3, Sectorized Proposal-Centric keypoint sampling and VectorPool aggregation are presented to replace their counterparts in the v1 framework. The Sectorized Proposal-Centric keypoint sampling strategy is much faster than the Furthest Point Sampling (FPS) used in PV-RCNN-v1 and achieves better performance with a more representative keypoint distribution. Then, to handle large-scale local feature aggregation from point clouds, we propose a more effective and efficient local feature aggregation operation, VectorPool aggregation, to explicitly encode the relative point positions with a spatially vectorized representation. It is integrated into our PV-RCNN-v2 framework in both the VSA module and the RoI-grid pooling module to significantly reduce the memory/computation consumption while achieving comparable or even better detection performance. These two novel operations are introduced in Sec. 5.1 and Sec. 5.2, respectively.
5.1 Sectorized Proposal-Centric Sampling for Efficient and Representative Keypoint Sampling
The keypoint sampling is critical for our proposed detection framework since keypoints bridge the point-voxel representations and influence the performance of proposal refinement network. As mentioned before, in our PV-RCNN-v1 framework, we subsample a small number of keypoints from raw point clouds with the Furthest Point Sampling (FPS) algorithm, which mainly has two drawbacks. (i) The FPS algorithm is time-consuming due to its quadratic complexity, which hinders the training and inference speed of our framework, especially for keypoint sampling of large-scale point clouds. (ii) The FPS keypoint sampling algorithm would generate a large number of background keypoints that do not contribute to the proposal refinement step since only the keypoints around the proposals could be retrieved by the RoI-grid pooling operation. To mitigate these drawbacks, we propose a more efficient and effective keypoint sampling algorithm for 3D object detection.
Sectorized Proposal-Centric (SPC) keypoint sampling. As discussed above, the number of keypoints is limited, and the FPS keypoint sampling algorithm generates wasteful keypoints in background regions, which decreases the capability of the keypoints to represent objects well for proposal refinement. Hence, as shown in Fig. 4, we propose the Sectorized Proposal-Centric (SPC) keypoint sampling operation to uniformly sample keypoints from more concentrated neighboring regions of the proposals while also being much faster than the FPS sampling algorithm.
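Specifically, denote the raw point cloud as $\mathcal{P}$, and denote the centers and sizes of the 3D proposals as $\mathcal{C}$ and $\mathcal{D}$, respectively. To better generate the set of restricted keypoints, we first restrict the keypoint candidates $\mathcal{P}'$ to the neighboring point sets of all proposals as

$$\mathcal{P}' = \left\{ p_i \;\middle|\; \left\| p_i - c_j \right\| < \tfrac{1}{2} \max\left( dx_j, dy_j, dz_j \right) + r^{(s)},\ (dx_j, dy_j, dz_j) \in \mathcal{D},\ p_i \in \mathcal{P},\ c_j \in \mathcal{C} \right\}, \tag{11}$$

where $r^{(s)}$ is a hyperparameter indicating the maximum extended radius of the proposals, and $dx_j, dy_j, dz_j$ are the sizes of the $j$-th 3D proposal. Through this proposal-centric filtering, the number of candidates for keypoint sampling is greatly reduced from $|\mathcal{P}|$ to $|\mathcal{P}'|$, which not only reduces the complexity of the follow-up keypoint sampling, but also concentrates the limited number of keypoints on encoding the neighboring regions of the proposals.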
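The proposal-centric filtering can be sketched as follows. The reach of each proposal is taken here as half of its largest side plus the extended radius; this specific form, along with the names `proposal_centric_filter` and `extra_radius`, is an assumption made for illustration.

```python
import numpy as np

def proposal_centric_filter(points, centers, sizes, extra_radius):
    """Mask of points lying near at least one proposal.

    points:  (N, 3+) raw points, centers: (M, 3) proposal centers,
    sizes:   (M, 3) proposal sizes (dx, dy, dz),
    extra_radius: maximum extended radius around each proposal (hypothetical name).
    """
    # (N, M) distances between every point and every proposal center
    dist = np.linalg.norm(points[:, None, :3] - centers[None, :, :], axis=-1)
    reach = 0.5 * sizes.max(axis=1) + extra_radius           # (M,)
    return (dist < reach[None, :]).any(axis=1)
```

The dense N-by-M distance matrix keeps the sketch short; for large scenes a chunked or GPU implementation would be preferable.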
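To further parallelize the keypoint sampling process for acceleration, as shown in Fig. 4, we distribute the proposal-centric point set $\mathcal{P}'$ into $s$ sectors centered at the scene center, and the point set of the $k$-th sector can be represented as

$$\mathcal{S}'_k = \left\{ p_i \;\middle|\; \left\lfloor \left( \arctan\left( p_{y_i}, p_{x_i} \right) + \pi \right) \cdot \frac{s}{2\pi} \right\rfloor = k - 1,\ p_i = (p_{x_i}, p_{y_i}, p_{z_i}) \in \mathcal{P}' \right\}, \tag{12}$$

where $k \in \{1, \dots, s\}$, and $\arctan(p_{y_i}, p_{x_i}) \in (-\pi, \pi]$ indicates the angle between the positive $X$ axis and the ray ending at $(p_{x_i}, p_{y_i})$. Through this process, we divide the task of sampling $n$ keypoints into $s$ subtasks of local keypoint sampling, where the $k$-th sector samples $\left\lfloor \frac{|\mathcal{S}'_k|}{|\mathcal{P}'|} \times n \right\rfloor$ keypoints from $\mathcal{S}'_k$. These subtasks can be executed in parallel on GPUs, while the scale of keypoint sampling is further reduced from $|\mathcal{P}'|$ to $\max_k |\mathcal{S}'_k|$.
Hence, our proposed SPC keypoint sampling greatly reduces the scale of keypoint sampling from $|\mathcal{P}|$ to the much smaller $\max_k |\mathcal{S}'_k|$, which not only effectively accelerates the keypoint sampling process, but also increases the capability of the keypoint feature representation by concentrating the keypoints in the neighborhoods of the 3D proposals.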
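The sector assignment and the per-sector keypoint budget can be sketched as below. The proportional budget (each sector receives keypoints in proportion to its share of the candidate points, rounded down) is an assumption of this sketch, as are the function names; any local sampler such as FPS can then be run independently within each sector.

```python
import numpy as np

def sector_index(points, num_sectors):
    """Assign each point to one of num_sectors angular sectors around the origin."""
    angle = np.arctan2(points[:, 1], points[:, 0])            # in [-pi, pi]
    idx = np.floor((angle + np.pi) * num_sectors / (2.0 * np.pi)).astype(np.int64)
    return np.minimum(idx, num_sectors - 1)                   # angle == pi goes to the last sector

def keypoints_per_sector(sector_ids, num_sectors, num_keypoints):
    """Split the keypoint budget across sectors in proportion to their point counts."""
    counts = np.bincount(sector_ids, minlength=num_sectors)
    return np.floor(counts / counts.sum() * num_keypoints).astype(np.int64)
```

Because every sector is a contiguous angular wedge, running a uniform sampler inside each wedge keeps the overall keypoint distribution close to uniform, unlike a random partition of the points.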
Note that, as shown in Fig. 6, our proposed keypoint sampling with sector-based parallelization is different from FPS point sampling with random division, since our strategy preserves the uniform distribution of the sampled keypoints, while random division may destroy the uniform distribution property of the FPS algorithm.
Comparison of local keypoint sampling strategies. To sample a specific number of keypoints from each sector, there are several alternative options, including FPS, random parallel FPS, random sampling, voxelized sampling and coverage sampling, which result in very different keypoint distributions (see Fig. 6). We observe that the FPS algorithm generates uniformly distributed keypoints covering the whole region, while the other algorithms cannot generate similar keypoint distribution patterns. Table VII shows that the uniform distribution of FPS achieves clearly better performance than the other keypoint sampling algorithms and is critical for the final detection performance. Note that even though we utilize FPS for local keypoint sampling, it is still much faster than FPS-based keypoint sampling on the whole point cloud, since our SPC operation significantly reduces the number of candidates for keypoint sampling.
5.2 Local vector representation for structure-preserved local feature learning from point clouds
How to aggregate informative features from local point clouds is critical in our proposed point-voxel-based object detection system. As discussed in Sec. 4, PV-RCNN-v1 adopts the set abstraction for local feature aggregation in both VSA and RoI grid-pooling. However, we observe that the set abstraction operation can be extremely time- and resource-consuming for large-scale local feature aggregation of point clouds. Hence, in this section, we first analyze the limitations of set abstraction for local feature aggregation, and then propose the VectorPool aggregation operation for local feature aggregation from point clouds, which is integrated into our PV-RCNN-v2 framework for more accurate and efficient 3D object detection.
Limitations of set abstraction for RoI grid pooling. Specifically, as shown in Eqs. (2) and (7), the set abstraction operation samples point-wise features from each local neighborhood, which are encoded separately by a shared-parameter MLP for local feature encoding. Suppose that there are a total of local point-cloud neighborhoods and the feature dimension of each point is , then point-wise features with channels should be encoded by the shared-parameter MLP to generate point-wise features of size . Both the space complexity and the computation would be significant when the number of local neighborhoods is quite large.
For instance, in our RoI-grid pooling module, the number of RoI-grid points can be very large (100 × 6 × 6 × 6 = 21,600) with 100 proposals and grid size 6. This module is therefore slow and also consumes much GPU memory in our PV-RCNN-v1, which restricts its ability to run on lightweight devices with limited computation and memory resources. Moreover, the max-pooling operation in set abstraction discards the spatial distribution information of local point clouds and harms the representation capability of locally aggregated features from point clouds.
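The scale of the problem is easy to check with a back-of-the-envelope calculation. In the sketch below, the number of RoI-grid points follows from the 100 proposals and grid size 6 stated above, while the number of sampled points per neighborhood (16) and the channel count (32) in the usage example are purely hypothetical values chosen for illustration.

```python
def roi_grid_points(num_proposals, grid_size):
    """Number of RoI-grid points: one grid_size^3 lattice per proposal."""
    return num_proposals * grid_size ** 3

def set_abstraction_mlp_inputs(num_neighborhoods, samples_per_neighborhood, channels):
    """Number of scalar values the shared MLP has to process before max-pooling."""
    return num_neighborhoods * samples_per_neighborhood * channels

grid_points = roi_grid_points(100, 6)                           # 21600
mlp_inputs = set_abstraction_mlp_inputs(grid_points, 16, 32)    # hypothetical K=16, C=32
```

Even with these modest assumed values, the shared MLP would have to process about eleven million scalars per scene for a single layer.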
Local vector representation for structure-preserved local feature learning. To extract more informative features from local point-cloud neighborhoods, we propose a novel local feature aggregation operation, VectorPool aggregation, which preserves spatial point distributions of local neighborhoods and also costs less memory/computation resources than the commonly used set abstraction. We propose to generate position-sensitive features in different local neighborhoods by encoding them with separate kernel weights and separate feature channels, which are then concatenated to explicitly represent the spatial structures of local point features.
Specifically, denote the input point coordinates and features as , and the centers of local neighborhoods as . We are going to extract local point-wise features with channels for each point of .
To reduce the parameter size, computational and memory consumption of our VectorPool aggregation, motivated by , we first summarize the point-wise features to more compact representations with a parameter-free scheme as
where is the number of output feature channels and . Eq. (13) sums up every input feature channels into one output feature channel to reduce the feature channels by times, which could effectively reduce the resource consumptions of the follow-up processing.
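The parameter-free channel summation of Eq. (13) amounts to folding groups of input channels into single output channels. The sketch below assumes that consecutive channels are grouped together; the exact grouping order used in Eq. (13) cannot be recovered from the text here, so this ordering is an assumption.

```python
def reduce_channels(feature, num_out):
    """Sum groups of input channels into num_out output channels (no learnable parameters)."""
    if len(feature) % num_out != 0:
        raise ValueError("input channels must be divisible by output channels")
    group = len(feature) // num_out          # the channel reduction factor
    return [sum(feature[k * group:(k + 1) * group]) for k in range(num_out)]
```

For example, a 6-channel feature `[1, 2, 3, 4, 5, 6]` reduced to 3 channels becomes `[3, 7, 11]`, halving the channel count without any weights.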
To generate position-sensitive features for a local cubic neighborhood centered at , we split its spatial neighboring space into dense voxels, and the above point-wise features are grouped to different voxels as follows:
where indicates the relative positions of point in this local cubic neighborhood, is the side length of the neighboring cubic space centered at , and are the voxel indices along the , , axes respectively, to indicate a specific voxel of this local cubic space. Then the relative coordinates and point-wise features of the points within each voxel are averaged as the position-specific features of this local voxel as follows:
where , , and indicates the number of inside points in position of this local neighborhood. The resulting features encode the relative coordinates and local features of the specific voxel .
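The grouping and averaging steps above can be sketched as follows. This is an illustrative example, not the released implementation: the function name, the half-open bounds of the cube, and the choice to leave empty voxels out of the result are assumptions.

```python
import math

def pool_local_voxels(center, points, features, side_length, num_voxels):
    """Average relative coordinates and features of the points falling in each local voxel.

    Returns a dict mapping (ix, iy, iz) -> (mean_relative_xyz, mean_feature).
    Empty voxels are simply absent from the result (an assumption of this sketch).
    """
    half = side_length / 2.0
    cell = side_length / num_voxels
    sums = {}
    for p, f in zip(points, features):
        rel = [p[d] - center[d] for d in range(3)]
        if any(r < -half or r >= half for r in rel):
            continue                                   # outside the cubic neighborhood
        key = tuple(min(num_voxels - 1, math.floor((r + half) / cell)) for r in rel)
        acc = sums.setdefault(key, [0, [0.0] * 3, [0.0] * len(f)])
        acc[0] += 1
        acc[1] = [a + r for a, r in zip(acc[1], rel)]
        acc[2] = [a + v for a, v in zip(acc[2], f)]
    return {k: ([x / n for x in xyz], [v / n for v in feat])
            for k, (n, xyz, feat) in sums.items()}
```

Each key identifies one voxel of the local cube, and each value holds the averaged relative position and averaged feature of the points inside that voxel.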
Those features in different positions may represent very different local features due to the different point distributions of different local voxels. Hence, instead of encoding the local features with a shared-parameter MLP as in set abstraction , we propose to encode different local voxel features with separate local kernel weights for capturing position-sensitive features as
where , , and is an operation combining the relative position and features , for which there are several choices to be explored (as listed in Table X), including concatenation (our default setting) and PosPool (xyz or cos/sin). denotes the learnable kernel weights for encoding the specific features of local voxel with channel , and different positions have different kernel weights for encoding position-sensitive local features.
Finally, we directly sort the local voxel features according to their spatial order , and their features are sequentially concatenated to generate the final local vector representation as
where encodes the structure-preserved local features by simply assigning the features of different locations to their corresponding feature channels, which naturally preserves the spatial structures of local features in the neighboring space centered at . This local vector representation is finally processed by another MLP to encode the local features to feature channels for follow-up processing.
Note that the feature channel summation and local volume feature encoding of our VectorPool aggregation reduce the feature dimensions from to , which greatly saves the computation and memory required by our VectorPool operation. Moreover, instead of conducting max-pooling on local point-wise features as in set abstraction, our proposed spatial-structure-preserved local vector representation encodes the position-sensitive local features with different feature channels.
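The position-specific encoding and the ordered concatenation can be sketched together. In this hypothetical example, each voxel owns its own weight matrix (a list of rows, one per output channel), the input vector of a voxel is assumed to be its relative position concatenated with its features, and empty voxels are assumed to contribute zeros; no nonlinearity or final MLP is included.

```python
def vector_pool_encode(voxel_feats, kernels, num_voxels):
    """Encode each local voxel with its own weight matrix, then concatenate in spatial order.

    voxel_feats: dict (ix, iy, iz) -> input vector for that voxel
    kernels:     dict (ix, iy, iz) -> weight matrix as a list of rows (one row per output channel)
    Empty voxels contribute zeros (an assumption of this sketch).
    """
    out = []
    for ix in range(num_voxels):
        for iy in range(num_voxels):
            for iz in range(num_voxels):
                key = (ix, iy, iz)
                weight = kernels[key]
                vec = voxel_feats.get(key)
                if vec is None:
                    out.extend([0.0] * len(weight))
                else:
                    out.extend(sum(w * v for w, v in zip(row, vec)) for row in weight)
    return out
```

Because every voxel writes to a fixed slice of the output vector, the position of a feature inside the vector tells the following layers where in the neighborhood it came from, which is the property that max-pooling discards.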
PV-RCNN-v2 with local vector representation for local feature aggregation. Our proposed VectorPool aggregation is integrated in our PV-RCNN-v2 detection framework, to replace the set abstraction operation in both the VSA layer and the RoI-grid pooling module. Thanks to our VectorPool local feature aggregation operation, compared with PV-RCNN-v1 framework, our PV-RCNN-v2 not only consumes much less memory and computation resources, but also achieves better 3D detection performance with faster running speed.
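|Method||Keypoint sampling||Point feature extraction||RoI feature aggregation||LEVEL 1 (3D) mAP||LEVEL 1 (3D) mAPH||LEVEL 2 (3D) mAP||LEVEL 2 (3D) mAPH|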
|Baseline (RPN only)||-||-||-||68.03||67.44||59.57||59.04|
|PV-RCNN-v1 (-)||-||UNet-decoder||RoI-grid pool (SA)||73.84||73.18||64.76||64.15|
|PV-RCNN-v1||FPS||VSA (SA)||RoI-grid pool (SA)||74.06||73.38||64.99||64.38|
|PV-RCNN-v1 (+)||SPC-FPS||VSA (SA)||RoI-grid pool (SA)||74.94||74.27||65.81||65.21|
|PV-RCNN-v1 (++)||SPC-FPS||VSA (SA)||RoI-grid pool (VP)||75.21||74.56||66.13||65.54|
|PV-RCNN-v2||SPC-FPS||VSA (VP)||RoI-grid pool (VP)||75.37||74.73||66.29||65.72|
5.3 Implementation details
Training losses. The overall training loss of PV-RCNN-v2 is exactly the same as that of PV-RCNN-v1, which has already been discussed in Sec. 4.3.
Network Architecture. For the SPC keypoint sampling of PV-RCNN-v2, we set the maximum extended radius , and each scene is split into 6 sectors for parallel keypoint sampling. Two VectorPool aggregation operations are applied to the and feature volumes of the VSA module with the side lengths and respectively, and both of them have local voxels and channel reduction factor . One VectorPool aggregation operation is applied to the raw points with local voxels and without channel reduction, and the side length . All VectorPool aggregation operations use concatenation as for encoding relative positions and point-wise features.
For RoI-grid pooling, we adopt a single VectorPool aggregation with local voxels , channel reduction factor and side length . We increase the number of sampled RoI-grid points to for each proposal. Since our VectorPool aggregation consumes much less computation and memory, we can afford such a large number of RoI-grid points while still being faster than the previous set abstraction (see Table VIII).
In this section, we evaluate our proposed framework in the large-scale Waymo Open Dataset  and the highly-competitive KITTI dataset . In Sec. 6.1, we first introduce our experimental setup and implementation details. In Sec. 6.2, we conduct extensive ablation experiments and analysis to investigate the individual components of both our PV-RCNN-v1 and PV-RCNN-v2 frameworks. In Sec. 6.3, we present the main results of our PV-RCNN framework and compare our performance with previous state-of-the-art methods on both the Waymo dataset and the KITTI dataset.
6.1 Experimental Setup
Datasets and evaluation metrics. Our PV-RCNN frameworks are evaluated on the following two datasets.
Waymo Open Dataset  is currently the largest dataset with LiDAR point clouds for autonomous driving. There are in total training sequences with around k LiDAR samples, and validation sequences with k LiDAR samples. The objects are annotated in the full field instead of as in the KITTI dataset. The evaluation metrics are calculated by the official evaluation tools, where the mean average precision (mAP) and the mean average precision weighted by heading (mAPH) are used for evaluation. The 3D IoU threshold is set as for vehicle detection and for pedestrian/cyclist detection. We present the comparison in two ways. The first is based on the objects’ distances to the sensor: , and . The second splits the data into two difficulty levels, where LEVEL 1 denotes the ground-truth objects with at least 5 inside points while LEVEL 2 denotes the ground-truth objects with at least 1 inside point. As utilized by the official Waymo evaluation server, the mAPH of LEVEL 2 difficulty is the most important evaluation metric for all experiments.
KITTI Dataset  is one of the most popular datasets of 3D detection for autonomous driving. There are training samples and test samples, where the training samples are generally divided into the train split ( samples) and the val split ( samples). We compare PV-RCNNs with state-of-the-art methods on this highly competitive 3D detection leaderboard . The evaluation metrics are calculated by the official evaluation tools, where the mean average precision (mAP) is calculated with recall positions on three difficulty levels. As utilized by the KITTI evaluation server, the 3D mAP of the moderate difficulty level is the most important metric for all experiments.
Training and inference details. Both the PV-RCNN-v1 and PV-RCNN-v2 frameworks are trained from scratch in an end-to-end manner with the Adam optimizer, a learning rate of 0.01 and a cosine annealing learning rate decay strategy. To train the proposal refinement stage, we randomly sample 128 proposals with a 1:1 ratio of positive to negative proposals, where a proposal is considered a positive sample if it has at least 0.55 3D IoU with the ground-truth boxes, and is otherwise treated as a negative sample.
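The proposal sampling rule for the refinement stage can be sketched as below. The 128 proposals, the 1:1 ratio and the 0.55 IoU threshold come from the text above; the function name and the behavior when too few positives are available (filling the remaining slots with negatives) are assumptions of this example.

```python
import random

def sample_proposals(ious, num_samples=128, pos_iou=0.55, pos_fraction=0.5, rng=random):
    """Pick proposal indices for refinement training with a target positive/negative ratio.

    ious: for each proposal, its maximum 3D IoU with any ground-truth box.
    Returns (positive_indices, negative_indices).
    """
    pos = [i for i, v in enumerate(ious) if v >= pos_iou]
    neg = [i for i, v in enumerate(ious) if v < pos_iou]
    num_pos = min(len(pos), int(num_samples * pos_fraction))
    # Assumed fallback: if positives are scarce, negatives fill the remaining slots.
    num_neg = min(len(neg), num_samples - num_pos)
    return rng.sample(pos, num_pos), rng.sample(neg, num_neg)
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible.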
During training, we adopt the widely used data augmentation strategies for 3D object detection, including random scene flipping, global scaling with a random scaling factor sampled from , global rotation around the axis with a random angle sampled from , and the ground-truth sampling augmentation, which randomly "pastes" some new objects from other scenes into the current training scene to simulate objects in various environments.
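The three global augmentations can be sketched as a single transform applied jointly to points and boxes. This is an illustrative example only: the function name, the box layout `(cx, cy, cz, dx, dy, dz, heading)`, the flip axis, the flip probability of 0.5 and the order of operations are assumptions, and the sampling ranges are left as arguments rather than fixed. Ground-truth sampling is not covered.

```python
import math
import random

def augment_scene(points, boxes, scale_range, angle_range, rng=random):
    """Apply random flip, global rotation about the vertical axis, and global scaling.

    points: list of (x, y, z); boxes: list of (cx, cy, cz, dx, dy, dz, heading).
    """
    flip = rng.random() < 0.5
    scale = rng.uniform(*scale_range)
    angle = rng.uniform(*angle_range)
    cos_a, sin_a = math.cos(angle), math.sin(angle)

    def transform_xyz(x, y, z):
        if flip:
            y = -y                                   # mirror across the x axis (assumed flip axis)
        x, y = cos_a * x - sin_a * y, sin_a * x + cos_a * y
        return x * scale, y * scale, z * scale

    new_points = [transform_xyz(*p) for p in points]
    new_boxes = []
    for cx, cy, cz, dx, dy, dz, heading in boxes:
        if flip:
            heading = -heading                       # mirroring negates the heading angle
        cx, cy, cz = transform_xyz(cx, cy, cz)
        new_boxes.append((cx, cy, cz, dx * scale, dy * scale, dz * scale, heading + angle))
    return new_points, new_boxes
```

Applying the same transform to the boxes and the points keeps the annotations consistent with the augmented scene.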
For the Waymo Open Dataset, the detection range is set as for both the and axes, and for the axis, while the voxel size is set as . We train PV-RCNN-v1 with batch size 64 for 50 epochs on 32 GPUs, and train PV-RCNN-v2 with batch size 64 for 50 epochs on 16 GPUs, since our v2 version consumes much less GPU memory.
For the KITTI dataset, the detection range is set as for axis, for axis and for the axis, which is voxelized with voxel size in each axis. We train PV-RCNN-v1 with batch size 16 for 80 epochs, and train PV-RCNN-v2 with batch size 32 for 80 epochs on 8 GPUs.
For the inference speed, our final PV-RCNN-v2 framework achieves state-of-the-art performance at 10 FPS for detection region on the Waymo Open Dataset, and at 16 FPS for detection region on the KITTI dataset. Both are profiled on a single TITAN RTX GPU card.
6.2 Ablation studies for PV-RCNN framework
In this section, we investigate the individual components of our proposed PV-RCNN 3D detection frameworks with extensive ablation experiments. Unless mentioned otherwise, we conduct most experiments on the Waymo Open Dataset with detection range for more comprehensive evaluation, and the input point clouds are generated by the first return from the Waymo LiDAR sensor. To conduct the ablation experiments efficiently on the Waymo Open Dataset, we generate a small representative training set by uniformly sampling frames (about frames) from the training set, and all results are evaluated on the full validation set (about frames) with the official evaluation tool of the Waymo dataset. All models are trained for 30 epochs on a single GPU.
6.2.1 The component analysis of PV-RCNN-v1
|PKW||RoI pooling||Confidence prediction||Easy||Moderate||Hard|
|✗||RoI-grid Pooling||IoU-guided scoring||92.09||82.95||81.93|
|✓||RoI-aware Pooling||IoU-guided scoring||92.54||82.97||80.30|
|✓||RoI-grid Pooling||IoU-guided scoring||92.57||84.83||82.69|
Effects of voxel-to-keypoint scene encoding. In Sec. 4.1, we propose the voxel-to-keypoint scene encoding strategy to encode the global scene features to a small set of keypoints, which serves as a bridge between the backbone network and the proposal refinement network. As shown in the and rows of Table II, our proposed voxel-to-keypoint encoding strategy achieves slightly better performance than the UNet-based decoder network. This might benefit from the fact that the keypoint features are aggregated from multi-scale feature volumes and raw point clouds with large receptive fields while also keeping the accurate point-wise location information. Note that our method achieves this performance with far fewer point-wise features than the UNet-based decoder network. For instance, in the setting for the Waymo dataset, our VSA module encodes the whole scene to around keypoints for feeding into the RoI-grid pooling module, while the UNet-based decoder network summarizes the scene features to around point-wise features, which further validates the effectiveness of our proposed voxel-to-keypoint scene encoding strategy.
Effects of different features for VSA module. As mentioned in Sec. 4.1, our proposed VSA module incorporates multiple feature components, and their effects are explored in Table IV. We summarize the observations as follows: The performance drops a lot if we only aggregate features from high-level bird-view semantic features () or accurate point locations (), since neither 2D-semantic-only nor point-only features are enough for the proposal refinement. As shown in the 5th row of Table IV, and contain both 3D structure information and high-level semantic features, which improve the performance a lot when combined with the bird-view semantic features and the raw point locations . The shallow semantic features and slightly improve the performance, and the best performance is achieved with all the feature components as the keypoint features.
|VSA Input Feature||LEVEL 1 (3D)||LEVEL 2 (3D)|
|Use PKW||LEVEL 1 (3D)||LEVEL 2 (3D)|
Effects of PKW module. We propose the predicted keypoint weighting (PKW) module in Sec. 4.1 to re-weight the point-wise features of keypoints with extra keypoint segmentation supervision. The experiments on both the KITTI dataset ( and rows of Table III) and the Waymo dataset (Table V) show that the performance drops a bit after removing the PKW module, which demonstrates that the PKW module enables better multi-scale feature aggregation by focusing more on the foreground keypoints, since they are more important for the succeeding proposal refinement network.
Effects of RoI-grid pooling module. RoI-grid pooling module is proposed in Sec. 4.2 for aggregating RoI features from very sparse keypoints. Here we investigate the effects of RoI-grid pooling module by replacing it with the RoI-aware pooling  and keeping other modules consistent. The experiments on both KITTI dataset ( and rows of Table III) and Waymo dataset (Table VI) show that the performance drops significantly when replacing RoI-grid pooling module. It validates that our proposed RoI-grid pooling module could aggregate much richer contextual information to generate more discriminative RoI features, which benefits from the large receptive field of our set abstraction based RoI-grid pooling module.
Compared with the previous RoI-aware pooling module, our proposed RoI-grid pooling module generates a denser grid-wise feature representation by allowing the ball areas of different grid points to overlap, while the RoI-aware pooling module may generate many zeros due to the sparse points inside RoIs. This means that our proposed RoI-grid pooling module is especially effective for aggregating local features from very sparse point-wise features, as in our PV-RCNN framework, which aggregates features from a very small number of keypoints.
|LEVEL 1 (3D)||LEVEL 2 (3D)|
|Keypoint Sampling Algorithm||Running Time||LEVEL 1 (3D) mAP||LEVEL 1 (3D) mAPH||LEVEL 2 (3D) mAP||LEVEL 2 (3D) mAPH|
|PC-Filter + FPS||27ms||75.05||74.41||65.98||65.40|
|PC-Filter + Random Sampling||1ms||69.85||69.23||60.97||60.42|
|PC-Filter + Coverage Sampling||21ms||73.97||73.30||64.94||64.34|
|PC-Filter + Voxelized-FPS||17ms||74.12||73.46||65.06||64.47|
|PC-Filter + RandomParallel-FPS||2ms||73.94||73.28||64.90||64.31|
|PC-Filter + Sectorized-FPS||9ms||74.94||74.27||65.81||65.21|
6.2.2 The component analysis of PV-RCNN-v2
Analysis of keypoint sampling strategies. In Sec. 5.1, we propose the SPC keypoint sampling strategy that is composed of proposal-centric keypoint filtering and sectorized FPS keypoint filtering. In the and rows of Table VII, we first investigate the effectiveness of our proposed proposal-centric keypoint filtering, where we can see that, compared with the strong baseline PV-RCNN-v1, our proposal-centric keypoint filtering further improves the detection performance by about mAP/mAPH in both LEVEL 1 and LEVEL 2 difficulties. It validates our argument that our proposed SPC keypoint sampling strategy generates more representative keypoints by concentrating the small number of keypoints in the more informative neighboring regions of proposals. Moreover, with our proposal-centric keypoint filtering, the keypoint sampling algorithm is about twice as fast as the baseline FPS algorithm, since the number of potential keypoints is reduced.
Besides the original FPS strategy and our proposed Sectorized-FPS algorithm, we further explore four alternative strategies for accelerating keypoint sampling, which are as follows: Random sampling randomly chooses the keypoints from the raw points. Proposal coverage sampling is inspired by , and first randomly chooses the required number of keypoints, after which the chosen keypoints are randomly replaced with other raw points based on the coverage cost to cover more of the neighboring space of proposals. Voxelized-FPS first voxelizes the raw points to reduce their number and then applies FPS for keypoint sampling. RandomParallel-FPS first randomly splits the raw points into several groups, and then applies the FPS algorithm to these groups in parallel for faster keypoint sampling. As shown in Table VII, compared with the FPS algorithm ( row) for keypoint sampling, the detection performance of all four alternative strategies drops a lot. In contrast, the performance of our proposed Sectorized-FPS is on par with the FPS algorithm while being more than 10 times faster than the FPS algorithm.
We argue that uniformly distributed keypoints are important for the following proposal refinement. As shown in Fig. 6, only our proposed Sectorized-FPS samples uniformly distributed keypoints like FPS, while all other strategies show their own problems. The random sampling and proposal coverage sampling strategies generate messy keypoint distributions, even though the coverage-cost-based replacement is adopted. Voxelized-FPS generates regularly scattered keypoints that lose accurate point locations due to the quantization. RandomParallel-FPS generates small clusters of keypoints, since nearby raw points can be divided into different groups and all of them can be sampled as keypoints from their respective groups. In contrast, our proposed Sectorized-FPS generates uniformly distributed keypoints by splitting raw points into different groups based on their structural relationship. There may still exist a very small number of clustered keypoints at the margins between groups, but the experiments show that they have a negligible effect on the performance.
Hence, as shown in Table VII, our proposed SPC keypoint sampling strategy outperforms previous FPS method significantly while still being more than 10 times faster than it.
Effects of VectorPool aggregation. In Sec. 5.2, we propose the VectorPool aggregation operation to effectively and efficiently summarize the structure-preserved local features from point clouds. As shown in Table II, our proposed VectorPool aggregation operation is utilized for local feature aggregation in both the VSA module and the RoI-grid pooling module, where the performance is improved consistently by adopting our VectorPool aggregation. The final PV-RCNN-v2 framework benefits from the structure-preserved spatial features from our VectorPool aggregation, which is critical for the following fine-grained proposal refinement.
Moreover, our VectorPool aggregation consumes much less computation and GPU memory than the original set abstraction operation. As shown in Table VIII, we record the actual computations of our PV-RCNN-v1 and PV-RCNN-v2 frameworks in both the VSA module and the RoI-grid pooling module. It shows that with our proposed VectorPool aggregation, the VSA module only consumes as low as 37.5% of the computations of the original set abstraction, which could greatly speed up the local feature aggregation, especially for large-scale local feature aggregation. Meanwhile, the computations in the RoI-grid pooling module decrease as much as 56.6% of the counterpart of PV-RCNN-v1, even though our PV-RCNN-v2 utilizes a large RoI-grid size . The lower computational consumption also indicates lower GPU memory consumption, which facilitates more practical usage of our proposed PV-RCNN-v2 framework.
Effects of separate local kernel weights in VectorPool aggregation. We have demonstrated the fact in Eq. (16) that our proposed VectorPool aggregation generates position-sensitive features by encoding relative position features with separate local kernel weights. The and rows of Table IX show that the performance drops a lot if we remove the separate local kernel weights and adopt shared kernel weights for relative position encoding. It validates that the separate local kernel weights are better than previous shared-parameter MLP based local feature encoding, and it is important in our VectorPool aggregation operation.
Effects of dense voxel numbers in VectorPool aggregation. We investigate the number of dense voxels (see Eq. (5.2)) in VectorPool aggregation for the VSA module and RoI-grid pooling module, where we can see that the setting of achieves the best performance, and both larger and smaller numbers of voxels lead to slightly worse performance. We consider that the required number of dense voxels depends on the sparsity of the point clouds, since a smaller number of voxels cannot preserve the spatial local features well, while a larger number of voxels may introduce too many empty voxels into the resulting local vector representation. We empirically choose the setting of dense voxel representation in both the VSA module and the RoI-grid pooling module of our PV-RCNN-v2 framework.
|LEVEL 1 (3D)||LEVEL 2 (3D)|
|LEVEL 1 (3D)||LEVEL 2 (3D)|
|PosPool (xyz) ||75.15||74.68||66.21||65.56|
|PosPool (cos/sin) ||75.13||74.65||66.19||65.53|
|Number of Keypoints||LEVEL 1 (3D)||LEVEL 2 (3D)|
|RoI-grid Size||LEVEL 1 (3D)||LEVEL 2 (3D)|
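|Method||Reference||Vehicle (LEVEL 1) mAP||Vehicle (LEVEL 1) mAPH||Vehicle (LEVEL 2) mAP||Vehicle (LEVEL 2) mAPH||Ped. (LEVEL 1) mAP||Ped. (LEVEL 1) mAPH||Ped. (LEVEL 2) mAP||Ped. (LEVEL 2) mAPH||Cyc. (LEVEL 1) mAP||Cyc. (LEVEL 1) mAPH||Cyc. (LEVEL 2) mAP||Cyc. (LEVEL 2) mAPH|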
|SECOND ||Sensors 2018||72.27||71.69||63.85||63.33||68.70||58.18||60.72||51.31||60.62||59.28||58.34||57.05|
|StarNet ||NeurIPSw 2019||53.70||-||-||-||66.80||-||-||-||-||-||-||-|
|PointPillar ||CVPR 2019||56.62||-||-||-||59.25||-||-||-||-||-||-||-|
|MVF ||CoRL 2019||62.93||-||-||-||65.33||-||-||-||-||-||-||-|
|Pillar-based ||ECCV 2020||69.80||-||-||-||72.51||-||-||-||-||-||-||-|
|Part-A2-Net ||TPAMI 2020||77.05||76.51||68.47||67.97||75.24||66.87||66.18||58.62||68.60||67.36||66.13||64.93|
|SECOND ||Sensors 2018||70.27||69.72||62.51||62.00||61.74||52.16||53.60||45.21||59.52||58.17||57.16||55.87|
|Part-A2-Net ||TPAMI 2020||74.82||74.32||65.88||65.42||71.76||63.64||62.53||55.30||67.35||66.15||65.05||63.89|
: re-implemented by ourselves with their open source code.: only the first return of the Waymo LiDAR sensor is used for training and testing.
|Method||Vehicle 3D mAP (IoU=0.7)||Vehicle mAP (IoU=0.7)|
Effects of different strategies for encoding relative positions. As shown in Eq. (16) of VectorPool aggregation, we utilize the operator to encode the relative position and features to generate the position-sensitive features. As shown in Table X, we compare the simple concatenation operation with two variants of PosPool operations proposed by . Table X shows that all three operations achieve similar performance while concatenation method is slightly better than the PosPool operations. Thus, we finally utilize the simple concatenation operation in our VectorPool aggregation for generating position-sensitive features.
Effects of the number of keypoints. In Table XII, we investigate the effects of the number of keypoints for encoding the scene features. Table XII shows that our proposed method achieves similar performance when using more than 4,096 keypoints, while the performance drops significantly with fewer keypoints (2,048 keypoints and 1,024 keypoints), especially on the LEVEL 1 difficulty level. Hence, to balance the performance and computation cost, we empirically choose to encode the whole scene to 4,096 keypoints for the Waymo dataset (2,048 keypoints for the KITTI dataset since it only needs to detect the frontal-view areas). The above experiments show that our method can effectively encode the whole scene to a small number of keypoints while keeping performance similar to that obtained with a large number of keypoints, which demonstrates the effectiveness of the keypoint feature encoding strategy of our proposed PV-RCNN detection framework.
Effects of the grid size in RoI-grid pooling module. Table XII shows the performance of adopting different RoI-grid sizes for the RoI-grid pooling module. We can see that the performance increases along with the RoI-grid sizes from to , but larger RoI-grid sizes ( and ) harm the performance slightly. We consider the reason may be that models with larger RoI-grid sizes contain many more learnable parameters, which may result in over-fitting on the training set. Hence we finally adopt RoI-grid size for the RoI-grid pooling module.
|Method||Pedestrian 3D mAP (IoU=0.7)||Pedestrian BEV mAP (IoU=0.7)|
|Method||Cyclist 3D mAP (IoU=0.7)||Cyclist BEV mAP (IoU=0.7)|
6.3 Main results of PV-RCNN framework and comparison with state-of-the-art methods
In this section, we demonstrate the main results of our proposed PV-RCNN frameworks, and make the comparison with state-of-the-art methods on both the large-scale Waymo Open Dataset and the highly-competitive KITTI dataset.
|Method||Reference||Modality||Car 3D Easy||Car 3D Mod.||Car 3D Hard||Car BEV Easy||Car BEV Mod.||Car BEV Hard||Cyclist 3D Easy||Cyclist 3D Mod.||Cyclist 3D Hard||Cyclist BEV Easy||Cyclist BEV Mod.||Cyclist BEV Hard|
|MV3D ||CVPR 2017||RGB + LiDAR||74.97||63.63||54.00||86.62||78.93||69.80||-||-||-||-||-||-|
|ContFuse ||ECCV 2018||RGB + LiDAR||83.68||68.78||61.67||94.07||85.35||75.88||-||-||-||-||-||-|
|AVOD-FPN ||IROS 2018||RGB + LiDAR||83.07||71.76||65.73||90.99||84.82||79.62||63.76||50.55||44.93||69.39||57.12||51.09|
|F-PointNet ||CVPR 2018||RGB + LiDAR||82.19||69.79||60.59||91.17||84.67||74.77||72.27||56.12||49.01||77.26||61.37||53.78|
|UberATG-MMF ||CVPR 2019||RGB + LiDAR||88.40||77.43||70.22||93.67||88.21||81.99||-||-||-||-||-||-|
|3D-CVF at SPA ||ECCV 2020||RGB + LiDAR||89.20||80.05||73.11||93.52||89.56||82.45||-||-||-||-||-||-|
|CLOCs ||IROS 2020||RGB + LiDAR||88.94||80.67||77.15||93.05||89.80||86.57||-||-||-||-||-||-|
|SECOND ||Sensors 2018||LiDAR only||83.34||72.55||65.82||89.39||83.77||78.59||71.33||52.08||45.83||76.50||56.05||49.45|
|PointPillars ||CVPR 2019||LiDAR only||82.58||74.31||68.99||90.07||86.56||82.81||77.10||58.65||51.92||79.90||62.73||55.58|
|PointRCNN ||CVPR 2019||LiDAR only||86.96||75.64||70.70||92.13||87.39||82.72||74.96||58.82||52.53||82.56||67.24||60.28|
|3D IoU Loss ||3DV 2019||LiDAR only||86.16||76.50||71.39||91.36||86.22||81.20||-||-||-||-||-||-|
|STD ||ICCV 2019||LiDAR only||87.95||79.71||75.09||94.74||89.19||86.42||78.69||61.59||55.30||81.36||67.23||59.35|
|Part-A2-Net ||TPAMI 2020||LiDAR only||87.81||78.49||73.51||91.70||87.79||84.61||-||-||-||-||-||-|
|3DSSD ||CVPR 2020||LiDAR only||88.36||79.57||74.55||92.66||89.02||85.86||82.48||64.10||56.90||85.04||67.62||61.14|
|Point-GNN ||CVPR 2020||LiDAR only||88.33||79.47||72.29||93.11||89.17||83.90||78.60||63.48||57.08||81.17||67.28||59.67|
|PV-RCNN-v1 (Ours)||-||LiDAR only||90.25||81.43||76.82||94.98||90.65||86.14||78.60||63.71||57.65||82.49||68.89||62.41|
|PV-RCNN-v2 (Ours)||-||LiDAR only||90.14||81.88||77.15||92.66||88.74||85.97||82.22||67.33||60.04||84.60||71.86||63.84|
6.3.1 3D detection on the Waymo Open Dataset
To validate the effectiveness of our proposed PV-RCNN frameworks, we compare our proposed PV-RCNN-v1 and PV-RCNN-v2 frameworks with state-of-the-art methods on the Waymo Open Dataset, which currently is the largest 3D detection benchmark of autonomous driving.
Comparison with state-of-the-art methods. As shown in Table XIII, our proposed PV-RCNN-v2 framework outperforms the previous state-of-the-art method  significantly, with a +1.74% LEVEL 1 mAP gain and a +1.79% LEVEL 2 mAP gain for the most important vehicle detection. Table XIII also demonstrates that our proposed PV-RCNN-v2 framework consistently outperforms all previous methods in terms of pedestrian and cyclist detection, including the very recent Pillar-based method  and Part-A2-Net . Compared with our preliminary work PV-RCNN-v1, our latest PV-RCNN-v2 framework achieves remarkably better mAP/mAPH on all difficulty levels for the detection of all three categories, while also increasing the processing speed from 3.3 FPS to 10 FPS for the 3D detection of such a large area, which validates the effectiveness and efficiency of our proposed PV-RCNN-v2.
To better demonstrate the performance at different distance ranges, we also present the distance-based detection performance in Table XIV, Table XV and Table XVI for the vehicle, pedestrian and cyclist categories, respectively. We follow StarNet  and MVF  to evaluate the models on the LEVEL 1 difficulty for comparison with previous methods. We can see that our PV-RCNN-v2 achieves the best performance on all distance ranges of vehicle detection and pedestrian detection, and on most distance ranges for cyclist 3D detection. It is worth noting that Table XIV, Table XV and Table XVI show that our proposed PV-RCNN-v2 framework achieves much better performance than previous methods in the furthest area (), where PV-RCNN-v2 outperforms the previous state-of-the-art method with a +3.32% 3D mAP gain for vehicle detection, a +3.67% 3D mAP gain for pedestrian detection and a +2.39% gain for cyclist detection. This may benefit from several aspects of our PV-RCNN-v2 framework, such as the design of our two-step point-voxel interaction to extract richer contextual information in the furthest area, our SPC keypoint sampling strategy to produce enough keypoints for these furthest proposals, and our VectorPool aggregation to preserve accurate spatial structures for the fine-grained proposal refinement.
6.3.2 3D detection on the KITTI dataset
To evaluate our PV-RCNN frameworks on the highly competitive 3D detection leaderboard of the KITTI dataset, we train our models with of train + val data, and the remaining data is used for validation. All results are obtained by submitting to the official evaluation server.
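The split of the combined train + val data described above could be sketched as follows. This is a hedged example: the helper name `split_samples`, the random shuffling, the seed, and the `train_fraction` value in the usage lines are all assumptions made for illustration, and the fraction actually used for the submitted models is not restated here.

```python
import random


def split_samples(sample_ids, train_fraction, seed=0):
    """Split sample ids into a training part and a held-out validation part.

    A seeded shuffle (an assumption of this sketch) keeps the split
    reproducible across runs.
    """
    ids = sorted(sample_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(round(train_fraction * len(ids)))
    return ids[:n_train], ids[n_train:]


# Usage with a placeholder fraction (hypothetical value).
train_ids, val_ids = split_samples(range(10), train_fraction=0.7)
```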
Comparison with state-of-the-art methods. As shown in Table XVII, both PV-RCNN-v1 and PV-RCNN-v2 outperform all published methods by remarkable margins on the most important moderate difficulty level. Specifically, compared with previous LiDAR-only state-of-the-art methods on the 3D detection benchmark of the car category, our PV-RCNN-v2 increases the mAP by +, +, + on the easy, moderate and hard difficulty levels, respectively. For the 3D detection and bird-view detection of cyclists, our methods outperform all previous methods by large margins on the moderate and hard difficulty levels, where the maximum gain is + mAP on the moderate difficulty level of bird-view detection for cyclists.
Compared with our preliminary work PV-RCNN-v1, our PV-RCNN-v2 achieves better performance on the moderate and hard levels of 3D detection for the car and cyclist categories, while also greatly reducing the GPU memory consumption and increasing the running speed from 10 FPS to 16 FPS on the KITTI dataset. These significant improvements in both performance and efficiency demonstrate the effectiveness of our PV-RCNN-v2 framework.
In this paper, we present two novel frameworks, named PV-RCNN-v1 and PV-RCNN-v2, for accurate 3D object detection from point clouds. Our PV-RCNN-v1 framework adopts a novel Voxel Set Abstraction (VSA) module to deeply integrate both the multi-scale 3D voxel CNN features and the PointNet-based features into a small set of keypoints. The learned discriminative keypoint features are then aggregated to the RoI-grid points through our proposed RoI-grid pooling module to capture much richer contextual information for proposal refinement. Our PV-RCNN-v2 further improves the PV-RCNN-v1 framework by efficiently generating more representative keypoints with our novel SPC keypoint sampling strategy, and by adopting our proposed VectorPool aggregation operation to learn structure-preserved local features in both the VSA module and the RoI-grid pooling module. As a result, our PV-RCNN-v2 achieves better performance with much faster running speed than the v1 version.
Both of our proposed PV-RCNN frameworks significantly outperform previous 3D detection methods and achieve new state-of-the-art performance on both the Waymo Open Dataset and the KITTI 3D detection benchmark. Extensive experiments are designed and conducted to investigate the individual components of our proposed frameworks in depth.
References
-  M3d-rpn: monocular 3d region proposal network for object detection. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §2.1.
-  Cascade r-cnn: delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §2.1.
-  (2020) End-to-end object detection with transformers. In ECCV, Cited by: §2.1.
-  (2017) Deep manta: a coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular image. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §2.1.
-  (2019) Hybrid task cascade for instance segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §2.1.
-  (2019) Object as hotspots: an anchor-free 3d object detection approach via firing of hotspots. Cited by: §2.2.
-  (2016) Monocular 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
-  (2017-07) Multi-view 3d object detection network for autonomous driving. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.2, §2.2, TABLE XVII.
-  (2019) Fast point r-cnn. In Proceedings of the IEEE international conference on computer vision (ICCV), Cited by: §2.2.
-  (2020) Dsgn: deep stereo geometry network for 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
-  (2019) 4D spatio-temporal convnets: minkowski convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3075–3084. Cited by: §2.2.
-  (2016) R-fcn: object detection via region-based fully convolutional networks. In Advances in neural information processing systems, Cited by: §2.1.
-  (2019) Centernet: keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §2.1.
-  (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §6.1, §6.
-  (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §1.
-  (2018) 3D semantic segmentation with submanifold sparse convolutional networks. CVPR. Cited by: §2.2, §2.2, §3.
-  (2017) Submanifold sparse convolutional networks. CoRR abs/1706.01307. External Links: Cited by: §2.2, §3.
-  (2020) Structure aware single-stage 3d object detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.
-  (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, Cited by: §1.
-  (2015) Densebox: unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874. Cited by: §2.1.
-  (2018) Recurrent slice networks for 3d segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.2.
-  EPNet: enhancing point features with image semantics for 3d object detection. In ECCV, Cited by: §2.2.
-  Multi-view pointnet for 3d scene understanding. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §2.2.
-  (2019) Hierarchical point-edge interaction network for point cloud semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 10433–10441. Cited by: §2.2.
-  . Note: http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d, Accessed on 2019-11-15 Cited by: §6.1.
-  (2020) FoveaBox: beyound anchor-based object detection. IEEE Transactions on Image Processing. Cited by: §2.1.
-  (2018) Joint 3d proposal generation and object detection from view aggregation. IROS. Cited by: §1, §2.2, TABLE XVII.
-  (2019) PointPillars: fast encoders for object detection from point clouds. CVPR. Cited by: §1, §2.2, TABLE XIII, TABLE XIV, TABLE XV, TABLE XVII.
-  (2018) Cornernet: detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §2.1.
-  (2019) Gradient harmonized single-stage detector. In AAAI, Cited by: §2.1.
-  (2019) Gs3d: an efficient 3d object detection framework for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
-  (2019) Stereo r-cnn based 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
-  (2020) RTM3D: real-time monocular 3d detection from object keypoints for autonomous driving. In ECCV, Cited by: §2.1.
-  (2018) Pointcnn: convolution on x-transformed points. In Advances in Neural Information Processing Systems, pp. 820–830. Cited by: §2.2.
-  (2018) Deep continuous fusion for multi-sensor 3d object detection. In ECCV, Cited by: §1, §2.2, TABLE XVII.
-  (2019) Multi-task multi-sensor fusion for 3d object detection. In CVPR, Cited by: §1, §2.2, TABLE XVII.
-  (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125. Cited by: §1.
-  (2018) Focal loss for dense object detection. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1, §2.1, §4.1, §4.3.
-  (2016) SSD: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §1, §1, §2.1, §3.
-  (2020) A closer look at local aggregation operators in point cloud analysis. arXiv preprint arXiv:2007.01294. Cited by: §2.2, §5.2, §6.2.2, TABLE X.
-  Point-voxel cnn for efficient 3d deep learning. In Advances in Neural Information Processing Systems, Cited by: §2.2.
-  (2019) Roi-10d: monocular lifting of 2d detection to 6d pose and metric shape. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
-  (2019) Lasernet: an efficient probabilistic 3d object detector for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12677–12686. Cited by: TABLE XIV, TABLE XV.
-  A coarse-to-fine model for 3d pose estimation and sub-category recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
-  (2017) 3d bounding box estimation using deep learning and geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
-  (2017) Reconstructing vehicles from a single image: shape priors for road scene understanding. In International Conference on Robotics and Automation, Cited by: §2.1.
-  (2019) Starnet: targeted computation for object detection in point clouds. arXiv preprint arXiv:1908.11069. Cited by: §6.3.1, TABLE XIII.
-  (2019) Libra r-cnn: towards balanced learning for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §2.1.
-  (2020) CLOCs: camera-lidar object candidates fusion for 3d object detection. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Cited by: TABLE XVII.
-  (2019) Deep hough voting for 3d object detection in point clouds. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §2.2.
-  (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660. Cited by: §1, §2.2, §2.2, §4.1, §4.2.
-  (2018-06) Frustum pointnets for 3d object detection from rgb-d data. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.2, TABLE XVII.
-  (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108. Cited by: §1, §1, §2.2, §2.2, §3, §4.1, §5.2.
-  (2020) End-to-end pseudo-lidar for image-based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
-  (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §1.
-  (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §2.1.
-  (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1, §1, §2.1, §3.
-  (2019) Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–779. Cited by: §1, §2.2, TABLE I, §4.2, TABLE XVII.
-  (2020) From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1, §2.2, §2.2, §3, §4.2, §4.2, §6.2.1, §6.3.1, TABLE XIII, TABLE XIV, TABLE XV, TABLE XVI, TABLE XVII.
-  (2020) Point-gnn: graph neural network for 3d object detection in a point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1711–1719. Cited by: TABLE XVII.
-  (2016) Deep sliding shapes for amodal 3d object detection in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 808–816. Cited by: §1, §2.2.
-  (2018) Splatnet: sparse lattice networks for point cloud processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2530–2539. Cited by: §2.2.
-  (2020-06) Scalability in perception for autonomous driving: waymo open dataset. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §6.1, §6.
-  (2018) Fishnet: a versatile backbone for image, region, and pixel level prediction. In Advances in neural information processing systems, pp. 754–764. Cited by: §5.2.
-  (2020) OpenPCDet: an open-source toolbox for 3d object detection from point clouds. Note: https://github.com/open-mmlab/OpenPCDet Cited by: §1.
-  (2019-10) KPConv: flexible and deformable convolution for point clouds. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.2.
-  (2019) Fcos: fully convolutional one-stage object detection. In Proceedings of the IEEE international conference on computer vision, Cited by: §2.1.
-  (2020) Pointpainting: sequential fusion for 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.2.
-  (2019) Voxel-fpn: multi-scale voxel feature aggregation in 3d object detection from point clouds. CoRR abs/1907.05286. External Links: Cited by: §2.2.
-  (2019) Pseudo-lidar from visual depth estimation: bridging the gap in 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
-  (2020) Pillar-based object detection for autonomous driving. arXiv preprint arXiv:2007.10323. Cited by: §2.2, §6.3.1, TABLE XIII, TABLE XIV, TABLE XV.
-  (2019) Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG) 38 (5), pp. 146. Cited by: §2.2.
-  (2019) Frustum convnet: sliding frustums to aggregate local point-wise features for amodal 3d object detection. In IROS, Cited by: §1.
-  (2019) Pointconv: deep convolutional networks on 3d point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9621–9630. Cited by: §2.2.
-  (2020) Grid-gcn for fast and scalable point cloud learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5661–5670. Cited by: §2.2, §5.1, §6.2.2.
-  (2018) SECOND: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337. Cited by: §1, §2.2, §2.2, §3, §4.3, TABLE XIII, TABLE XIV, TABLE XV, TABLE XVI, TABLE XVII.
-  (2018) HDNET: exploiting hd maps for 3d object detection. In 2nd Conference on Robot Learning, Cited by: §1, §2.2.
-  (2018) Pixor: real-time 3d object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7652–7660. Cited by: §1, §2.2, §2.2.
-  (2019) Reppoints: point set representation for object detection. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §2.1.
-  (2020) 3dssd: point-based 3d single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §2.2, TABLE XVII.
-  (2019) STD: sparse-to-dense 3d object detector for point cloud. ICCV. Cited by: §1, §2.2, TABLE I, §4.2, §5.1, TABLE XVII.
-  (2020) HVNet: hybrid voxel network for lidar based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.2.
-  (2020) 3D-cvf: generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. ECCV. Cited by: §2.2, TABLE XVII.
-  (2019) Pseudo-lidar++: accurate depth for 3d object detection in autonomous driving. arXiv preprint arXiv:1906.06310. Cited by: §2.1.
-  (2014) Are cars just 3d boxes?-jointly estimating the 3d shape of multiple objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
-  (2019) PointWeb: enhancing local neighborhood features for point cloud processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5565–5573. Cited by: §2.2.
-  (2019) IoU loss for 2d/3d object detection. In International Conference on 3D Vision (3DV), Cited by: TABLE XVII.
-  (2019) Objects as points. arXiv preprint arXiv:1904.07850. Cited by: §2.1.
-  (2019) Bottom-up object detection by grouping extreme and center points. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
-  (2020) End-to-end multi-view fusion for 3d object detection in lidar point clouds. In Conference on Robot Learning, pp. 923–932. Cited by: §2.2, §6.3.1, TABLE XIII, TABLE XIV, TABLE XV.
-  (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.2, §2.2.
-  (2019) Class-balanced grouping and sampling for point cloud 3d object detection. CoRR abs/1908.09492. External Links: Cited by: §2.2.