Improving 3D Object Detection with Channel-wise Transformer

by   Hualian Sheng, et al.
Zhejiang University

Though 3D object detection from point clouds has achieved rapid progress in recent years, the lack of flexible and high-performance proposal refinement remains a great hurdle for existing state-of-the-art two-stage detectors. Previous works on refining 3D proposals have relied on human-designed components such as keypoints sampling, set abstraction and multi-scale feature fusion to produce powerful 3D object representations. Such methods, however, have limited ability to capture rich contextual dependencies among points. In this paper, we leverage a high-quality region proposal network and a Channel-wise Transformer architecture to constitute our two-stage 3D object detection framework (CT3D) with minimal hand-crafted design. The proposed CT3D simultaneously performs proposal-aware embedding and channel-wise context aggregation for the point features within each proposal. Specifically, CT3D uses the proposal's keypoints for spatial contextual modelling and learns attention propagation in the encoding module, mapping the proposal to point embeddings. Next, a new channel-wise decoding module enriches the query-key interaction via channel-wise re-weighting to effectively merge multi-level contexts, which contributes to more accurate object predictions. Extensive experiments demonstrate that our CT3D method has superior performance and excellent scalability. Remarkably, CT3D achieves an AP of 81.77% for the moderate car category on the KITTI test 3D detection benchmark, outperforming state-of-the-art 3D detectors.




1 Introduction

3D object detection from point clouds is envisioned as an indispensable part of future Autonomous Vehicles (AVs). Unlike mature 2D detection algorithms, whose success is mainly due to the regular structure of image pixels, LiDAR point clouds are usually sparse, unordered and unevenly distributed. This makes CNN-like operations ill-suited to processing unstructured point clouds directly. To tackle these challenges, many approaches employ voxelization or custom discretization of point clouds. Several methods [28, 15] project point clouds to a bird's-eye view (BEV) representation and apply standard 2D convolutions; however, this inevitably sacrifices certain geometric details which are vital for accurate localization. Other methods [3, 33] rasterize point clouds into a 3D voxel grid and use regular 3D CNNs to perform computation in grid space, but this category of methods suffers from a computational bottleneck when the grid is made finer. A major breakthrough in detection on point clouds is due to effective deep architectures for point cloud representation, such as volumetric convolution [33] and permutation-invariant convolution [22].

Recently, most state-of-the-art methods for 3D object detection adopt a two-stage framework consisting of 3D region proposal generation and proposal feature refinement. Notice that the most popular region proposal network (RPN) backbone [33] has achieved over a 95% recall rate on the KITTI 3D Detection Benchmark, whereas this method only achieves 78% Average Precision (AP). The reason for such a gap stems from the difficulty of encoding an object and extracting a robust feature from 3D proposals in cases of occlusion or long-range distance. Therefore, how to effectively model geometric relationships among points and exploit accurate position information during the proposal feature refinement stage is crucial for good performance. An important family of models is PointNet [22] and its variants [23, 19, 25], which use a flexible receptive field to aggregate features over local regions with a permutation-invariant network. However, these methods have the drawback of involving plenty of hand-crafted designs, such as the neighbor ball radii and the grid size. Another family of models is the voxel-based methods [33, 27, 39], which use 3D convolutional kernels to gather information from neighboring voxels. But the performance of such methods is suboptimal due to voxel quantization and is sensitive to hyper-parameters. Later studies [43, 24, 4, 10] further apply a point-voxel mixed strategy to capture multi-scale features while retaining fine-grained localization, but they are strongly tied to specific RPN architectures.

In this paper, we make two major contributions. First, we propose a novel end-to-end two-stage 3D object detection framework called CT3D. Motivated by the recent Transformer-based 2D detection method DETR [1], which uses a CNN backbone to extract features and an encoder-decoder Transformer to enhance the RoI region features, we design CT3D to generate 3D bounding boxes in the first stage, then learn a per-proposal representation by incorporating a novel Transformer architecture with a channel-wise re-weighting mechanism in the decoder. The proposed framework exhibits very strong performance in terms of accuracy and efficiency, and thus can be conveniently combined with any high-quality RPN backbone.

The second contribution is the custom Transformer, which offers several benefits over traditional point/voxel-based feature aggregation mechanisms. Although point-wise and voxel convolutions are capable of local and global context modelling, they still face limitations in growing the receptive field and in parameter optimization. In addition, point-cloud-based 3D object detectors also have to deal with challenging missing or noisy detections, such as occluded or distant objects covered by only a few points. Self-attention in Transformers has recently emerged as a basic building block for capturing long-range interactions, and is thus a natural choice for acquiring context information to enrich faraway objects or increase the confidence of false negatives. Inspired by this idea, we first introduce a proposal-to-point embedding to effectively encode the RPN proposal information in the encoder module. Furthermore, we exploit a channel-wise re-weighting approach that augments the standard Transformer decoder by considering both global and local channel-wise features of the encoded points. The purpose is to scale the feature decoding space so that we can compute an attention distribution over each channel dimension of the key embeddings, thus enhancing the expressiveness of query-key interactions. Extensive experiments show that our proposed CT3D outperforms state-of-the-art published methods on both the KITTI dataset and the large-scale Waymo dataset.

Figure 1: Overview of CT3D. The raw points are first fed into the RPN for generating 3D proposals. Then the raw points along with the corresponding proposals are processed by the channel-wise Transformer composed of the proposal-to-point encoding module and the channel-wise decoding module. Specifically, the proposal-to-point encoding module is to modulate each point feature with global proposal-aware context information. After that, the encoded point features are transformed into an effective proposal feature representation by the channel-wise decoding module for confidence prediction and box regression.

2 Related Work

Point Cloud Representations for 3D Object Detection. Recently, there has been a lot of progress on learning effective representations for raw LiDAR point clouds. A noticeable portion of efforts are the PointNet series [22], which employed permutation-invariant operations to aggregate point features. F-PointNet [21] generated region-level features for the point clouds within each 3D frustum. PointRCNN [25] used PointNet++ [23] to segment foreground 3D points and refine the proposals with the segmentation features. STD [37] further extended proposal refinement by transferring sparse point features into a dense voxel representation. Moreover, 3DSSD [36] improved the point-based approach with a new sampling strategy based on feature distance. However, PointNet-like architectures still have limited ability to capture local structures in LiDAR data. Another category of methods [3, 13, 34, 35, 28, 15, 12, 16, 17] aimed to voxelize the unstructured point clouds into a regular 2D/3D grid over which conventional CNNs can be easily applied. The pioneering work [3] encoded point clouds as 2D bird's-eye-view feature maps to generate highly accurate 3D candidate boxes, motivating many efficient bird's-eye-view representation-based methods. VoxelNet [43] transformed the points to form a compact feature representation. SECOND [33] introduced 3D sparse convolution for efficient 3D voxel processing. These voxel-based methods are still focused on the subdivision of a volume rather than adaptively modelling local geometric structure. Furthermore, various point-voxel-based methods have been proposed for multi-scale feature aggregation. SA-SSD [10] presented an auxiliary network on top of a 3D voxel CNN. PV-RCNN [24] and its variant Voxel-RCNN [4] adopted a 3D voxel CNN as the RPN to generate high-quality proposals and then utilized PointNet to aggregate the voxel features around the grids. Nevertheless, these hybrid methods require plenty of hand-crafted feature designs.

Transformers for object detection.

A new paradigm for object detection has recently evolved due to the success of Transformers in many computer vision fields [1, 44, 5, 9, 6]. Since Transformer models are very effective at learning context-aware representations, DETR [1] viewed detection as a set prediction problem and employed a Transformer with parallel decoding to detect objects in 2D images. A variant of DETR [44] further developed a deformable attention module for cross-scale aggregation. For point clouds, recent methods [9, 6] have also explored self-attention for classification and segmentation tasks.

3 CT3D for 3D Object Detection

Given proposals generated by widely used RPN backbones such as the 3D voxel CNN [33], current state-of-the-art proposal refinement approaches [24, 4] focus on refining the intermediate multi-stage voxel features extracted by the convolution layers, suffering from the difficulties of extra hyper-parameter optimization and of designing generalized models. We believe that the raw points, with their precise position information, are sufficient for refining the detection proposals. Bearing this view in mind, we construct our CT3D framework by deploying a well-designed Transformer on top of an RPN to directly utilize the raw point clouds. Specifically, the whole CT3D detection framework is composed of three parts, i.e., an RPN backbone for proposal generation, a channel-wise Transformer for proposal feature refinement, and a detect head for object predictions. Figure 1 illustrates an overview of our CT3D framework.

3.1 RPN for 3D Proposal Generation

Starting from the point clouds P, where each point carries a 3D coordinate and additional point features (e.g., reflection intensity), the predicted 3D bounding box generated by the RPN consists of a center coordinate (x^c, y^c, z^c), length l, width w, height h, and orientation θ. In this paper, we adopt the 3D voxel CNN SECOND [33] as our default RPN due to its high efficiency and accuracy. Note that any high-quality RPN can readily replace it in our framework and is amenable to end-to-end training.

3.2 Proposal-to-point Encoding Module

To refine the generated RPN proposals, we adopt a two-step strategy. Specifically, the first proposal-to-point embedding step maps the proposal to point features, then the second self-attention encoding step is to refine point features via modelling the relative relationships among points within the corresponding proposal.

Proposal-to-point Embedding. Given the proposals generated by the RPN, we delimit a scaled RoI area in the point clouds according to each proposal. This aims to compensate for the deviation between the proposal and the corresponding ground-truth box by wrapping in all object points as much as possible. Specifically, the scaled RoI area is a cylinder with unlimited height and radius r = α·√(l² + w²)/2, where α is a scaling hyper-parameter, and l, w denote the length and width of the proposal, respectively. Hereinafter, N points randomly sampled within the scaled RoI (N = 256 in our experiments) are taken out for further processing.
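As a concrete sketch of this cropping-and-sampling step, the following NumPy routine crops a cylindrical RoI with unlimited height and draws a fixed number of points. The radius formula, the `alpha` default, and the zero-padding fallback are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def crop_and_sample_roi(points, center, length, width, alpha=1.2, num_sample=256):
    """Crop a cylindrical scaled RoI around a proposal and randomly sample
    a fixed number of points.  points: (M, 3+C) array, columns 0..2 = xyz."""
    # Assumed radius: alpha scales the half-diagonal of the proposal's base.
    radius = alpha * np.sqrt(length ** 2 + width ** 2) / 2.0
    # Unlimited height: only the xy-distance to the proposal center matters.
    d_xy = np.linalg.norm(points[:, :2] - np.asarray(center[:2]), axis=1)
    roi_points = points[d_xy <= radius]
    if len(roi_points) == 0:
        # Illustrative fallback: an empty RoI yields zero-padded samples.
        return np.zeros((num_sample, points.shape[1]))
    # Sample with replacement when the RoI holds fewer than num_sample points.
    idx = np.random.choice(len(roi_points), num_sample,
                           replace=len(roi_points) < num_sample)
    return roi_points[idx]
```

Sampling with replacement keeps the output a fixed (num_sample, 3+C) tensor regardless of how many raw points fall inside the RoI.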

At first, we calculate the relative coordinates between each sampled point p_i and the center point p^c of the proposal to unify the input distance feature, denoted as Δp_i^c = p_i − p^c. A straightforward thought is to directly concatenate the proposal information into each point feature, i.e., [Δp_i^c, l, w, h, θ, f_i^r], where f_i^r is the raw point feature such as reflection intensity. However, this size-orientation representation of the proposal yields only modest performance, as the Transformer encoder might be less effective at reorienting itself in accordance with the above-mentioned geometric information.

Since keypoints usually offer more explicit geometric properties in detection tasks [41, 14], we propose a novel keypoints subtraction strategy that computes the relative coordinates between each point and the eight corner points of the corresponding proposal, i.e., Δp_i^j = p_i − p^j for j = 1, …, 8, where p^j is the coordinate of the j-th corner point. Note that the size (l, w, h) and orientation θ no longer appear explicitly, but are implicitly contained in the different dimensions of the distance information. In this way, the newly generated relative coordinates can be viewed as a better representation of the proposal information. As shown in the left part of Figure 2, for each point p_i, the proposal-guided point feature can be expressed as:

f_i = A([Δp_i^c, Δp_i^1, …, Δp_i^8, f_i^r]),

where A is a linear projection layer that maps the point feature into a high-dimensional embedding.
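A minimal sketch of this keypoints-subtraction embedding, with a plain weight matrix `proj` standing in for the learned linear layer A; the corner-enumeration order and the projection are illustrative assumptions:

```python
import numpy as np

def box_corners(center, l, w, h, theta):
    """Eight corners of an axis-aligned box, rotated around the z-axis."""
    x = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * l / 2
    y = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * w / 2
    z = np.array([1, -1, 1, -1, 1, -1, 1, -1]) * h / 2
    rot = np.array([[np.cos(theta), -np.sin(theta), 0],
                    [np.sin(theta),  np.cos(theta), 0],
                    [0, 0, 1]])
    return (rot @ np.stack([x, y, z])).T + center          # (8, 3)

def proposal_to_point_embedding(points, feats, center, l, w, h, theta, proj):
    """points: (N,3) xyz; feats: (N,Cr) raw features (e.g. reflection);
    proj: (3 + 8*3 + Cr, D) matrix standing in for the learned layer A."""
    delta_c = points - center                              # offsets to the center
    corners = box_corners(center, l, w, h, theta)          # proposal keypoints
    delta_k = (points[:, None, :] - corners[None]).reshape(len(points), -1)
    return np.concatenate([delta_c, delta_k, feats], axis=1) @ proj  # (N, D)
```

Note that l, w, h and θ enter only through the corner coordinates, so the size and orientation are carried implicitly by the 8 per-corner offsets, as the text describes.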

Figure 2: Proposal-to-point encoding. The location features of raw point clouds are first modulated by the proposal information (center and corners) via subtraction operator. Then, the resulting point features are refined by the proposal-aware encoding module with multi-head self-attention mechanism.

Self-attention Encoding. The embedded point features are then fed into a multi-head self-attention layer, followed by a feed-forward network (FFN) with a residual structure, to encode rich contextual relationships and point dependencies within the proposal for refining the point features. As shown in the right part of Figure 2, this self-attention encoding scheme shares almost the same structure as the original NLP Transformer encoder, except for the position embedding, which is already included in the point features. Readers can refer to [31] for more details. Denote F = [f_1, …, f_N]^T as the embedded point features with channel dimension D; we have

Q = F·W_q,  K = F·W_k,  V = F·W_v,

where W_q, W_k and W_v are linear projections, and Q, K and V are the so-called query, key and value embeddings. These three embeddings are then processed by the multi-head self-attention mechanism. In an H-head attention situation, Q, K and V are further divided along the channel dimension into Q^(h), K^(h) and V^(h) for h = 1, …, H, each with D/H channels. The output after multi-head self-attention is given by:

F_att = Concat( σ(Q^(1)(K^(1))^T / √(D/H)) V^(1), …, σ(Q^(H)(K^(H))^T / √(D/H)) V^(H) ),

where σ is the softmax function. Hereinafter, applying a simple FFN and the residual operator, the result is as follows:

F' = Φ( F̃ + FFN(F̃) ),  with  F̃ = Φ( F + F_att ),

where Φ denotes the add-and-normalization operator and FFN denotes a feed-forward network with two linear layers and one ReLU activation. We observe that a stack of 3 identical self-attention encoding modules is ideal for our CT3D framework.
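One such encoding layer can be sketched in plain NumPy as below. The weight shapes and single-layer structure are illustrative (the actual model stacks three learned modules); `W1`/`W2` stand in for the two FFN layers:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention_encoder(F, Wq, Wk, Wv, W1, W2, heads=4):
    """One encoder layer: multi-head self-attention and an FFN, each
    followed by add-and-norm.  F: (N, D); Wq/Wk/Wv: (D, D);
    W1: (D, Dff); W2: (Dff, D)."""
    N, D = F.shape
    d = D // heads
    Q, K, V = F @ Wq, F @ Wk, F @ Wv
    # Split channels into heads, attend per head, then concatenate.
    out = np.concatenate([
        softmax(Q[:, h*d:(h+1)*d] @ K[:, h*d:(h+1)*d].T / np.sqrt(d))
        @ V[:, h*d:(h+1)*d]
        for h in range(heads)], axis=1)
    F = layer_norm(F + out)                       # add & norm
    ffn = np.maximum(F @ W1, 0) @ W2              # two linear layers + ReLU
    return layer_norm(F + ffn)                    # add & norm
```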

3.3 Channel-wise Decoding Module

In this subsection, we decode all point features from the encoder module (i.e., F') into a global representation, which is further processed by FFNs for the final detection predictions. Different from the standard Transformer decoder, which transforms multiple query embeddings using self- and encoder-decoder attention mechanisms, our decoder manipulates only one query embedding, based on the following two facts:

  • Multiple query embeddings incur high memory latency, especially when processing large numbers of proposals.

  • Multiple query embeddings are usually independently transformed into words or objects, while our proposal refinement model needs only one prediction.

Generally, the final proposal representation after the decoder can be regarded as a weighted sum of all point features; our key motivation is to determine decoding weights that are dedicated to each point. Below, we first analyze the standard decoding scheme, and then develop an improved scheme that acquires more effective decoding weights.

Standard Decoding. The standard decoding scheme utilizes a learnable vector (i.e., a query embedding) of dimension D to aggregate the point features across all channels. As shown in Figure 3(a), the final decoding weight vector for all point features in each attention head is:

s^(h) = σ( q^(h) (K^(h))^T / √(D/H) ),

where K^(h) is the key embedding of the h-th head computed by projecting the encoder output, and q^(h) is the corresponding query embedding. Note that each value of the vector q^(h)(K^(h))^T can be viewed as a global aggregation for an individual point (i.e., each key embedding), and the subsequent softmax function assigns the decoding value for each point according to its probability in the normalized vector. Consequently, the values in the decoding weight vector are derived from simple global aggregation and lack local channel-wise modelling, which is essential for learning the 3D surface structures of point clouds, because different channels usually exhibit strong geometric relationships.
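A single-head sketch of this standard decoding makes the "one scalar per point" structure explicit (shapes and scaling follow the reconstruction above; the query would be a learned parameter in the real model):

```python
import numpy as np

def standard_decoding_weights(q, K):
    """q: (d,) learnable query for one head; K: (N, d) key embeddings.
    Returns an N-vector of decoding weights: a softmax over one global
    aggregation scalar per point."""
    s = K @ q / np.sqrt(K.shape[1])   # one scalar per point
    e = np.exp(s - s.max())
    return e / e.sum()                # sums to 1 over the N points
```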

Channel-wise Re-weighting. In order to emphasize the channel-wise information of the key embeddings K^(h), a straightforward solution is to compute the decoding weight vector for the points based on all the channels of K^(h). That is, we generate a different decoding weight vector for each channel, obtaining D/H decoding values per point. Further, a linear projection is introduced over these decoding values to form a united channel-wise decoding vector. As shown in Figure 3(b), this new channel-wise re-weighting for the decoding weight vector can be summarized as:

s^(h) = ϱ( σ̂( (K^(h))^T ) ),

where ϱ is a linear projection that compresses the D/H decoding values of each point into a re-weighting scalar, and σ̂ computes the softmax along the N (point) dimension. However, the decoding weights computed by σ̂ are associated with each channel individually, and thus ignore the global aggregation of each point. Therefore, we can conclude that the standard decoding scheme focuses on global aggregation while the channel-wise re-weighting scheme concentrates on channel-wise local aggregation. To combine their strengths, we propose an extended channel-wise re-weighting scheme as below.
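A single-head sketch of this per-channel variant, with a plain vector `v_proj` standing in for the learned compression ϱ (both the projection and the softmax axis follow the reconstruction above and are illustrative):

```python
import numpy as np

def channelwise_weights(K, v_proj):
    """K: (N, d) key embeddings; v_proj: (d,) stand-in for the learned
    projection that compresses d per-channel decoding values into one
    scalar per point.  Softmax runs along the N (point) dimension,
    independently for every channel."""
    s = K / np.sqrt(K.shape[1])
    e = np.exp(s - s.max(axis=0, keepdims=True))
    per_channel = e / e.sum(axis=0, keepdims=True)   # (N, d): one weight vector per channel
    return per_channel @ v_proj                      # (N,): united decoding weights
```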

Figure 3: Illustration of the different decoding schemes: (a) Standard decoding; (b) Channel-wise re-weighting; (c) Extended channel-wise re-weighting.

Extended Channel-wise Re-weighting. Specifically, we first repeat the matrix product of the query embedding and the key embeddings to spread the spatial information into each channel; the output is then multiplied element-wise with the key embeddings to keep the channel differences. As illustrated in Figure 3(c), this novel extended channel-wise re-weighting scheme generates the following decoding weight vector for all the points:

s^(h) = ϱ( σ̂( ρ( q^(h)(K^(h))^T ) ⊙ (K^(h))^T ) ),

where ρ is a repeat operator that expands q^(h)(K^(h))^T from one row to D/H rows. In this way, we not only maintain the global information compared to the channel-wise re-weighting scheme, but also enrich the local, detailed channel interactions compared to the standard decoding scheme. Besides, this extended channel-wise re-weighting brings only an extra ~1 KB of parameters compared to the other two schemes. As a result, the final decoded proposal representation can be described as follows:

ŷ = Concat( s^(1) V^(1), …, s^(H) V^(H) ),

where the value embeddings V^(h) are obtained by a linear projection of the encoder output F'.
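A single-head sketch of the extended scheme and the final value aggregation; `v_proj` again stands in for the learned compression ϱ, and the repeat is realized by broadcasting (both assumptions for illustration):

```python
import numpy as np

def extended_channelwise_decode(q, K, V, v_proj):
    """q: (d,) query; K, V: (N, d) key/value embeddings; v_proj: (d,).
    Repeat qK^T across channels (via broadcasting), modulate the keys
    element-wise, softmax along the point dimension, compress channels,
    then aggregate the values."""
    s = (K @ q)[:, None] * K / np.sqrt(K.shape[1])   # repeat + element-wise product, (N, d)
    e = np.exp(s - s.max(axis=0, keepdims=True))
    w = (e / e.sum(axis=0, keepdims=True)) @ v_proj  # (N,) decoding weights
    return w @ V                                     # (d,) decoded head output
```

The broadcast term `(K @ q)[:, None]` carries the global per-point aggregation of the standard scheme, while the element-wise product with `K` retains the per-channel differences of the channel-wise scheme, matching the combination argued for in the text.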

3.4 Detect head and Training Targets

In the previous steps, the input point features are summarized into a D-dimensional vector ŷ, which is then fed into two FFNs for predicting the confidence and the box residuals relative to the input 3D proposal, respectively.

To output the confidence, the training targets are set as the 3D IoU between the 3D proposals and their corresponding ground-truth boxes. Given the IoU between a 3D proposal and its ground-truth box, we follow [11, 25, 24] to assign the confidence prediction target:

c^t = min(1, max(0, (IoU − θ_b) / (θ_f − θ_b))),

where θ_f and θ_b are the foreground and background IoU thresholds, respectively. Besides, the regression targets (superscript t) are encoded from the proposals and their corresponding ground-truth boxes (superscript g) as:

x^t = (x^g − x^c)/d,  y^t = (y^g − y^c)/d,  z^t = (z^g − z^c)/h,
l^t = log(l^g/l),  w^t = log(w^g/w),  h^t = log(h^g/h),  θ^t = θ^g − θ,

where d = √(l² + w²) is the diagonal of the base of the proposal box.
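The two target assignments can be sketched directly from the reconstructions above; the threshold values 0.75/0.25 are illustrative assumptions, and the residual encoding is the standard diagonal-normalized scheme the text describes:

```python
import numpy as np

def confidence_target(iou, theta_f=0.75, theta_b=0.25):
    """IoU-guided confidence target, clipped to [0, 1].
    The threshold values are assumed defaults for illustration."""
    return float(np.clip((iou - theta_b) / (theta_f - theta_b), 0.0, 1.0))

def encode_box_residuals(proposal, gt):
    """proposal/gt: (x, y, z, l, w, h, theta).  Center offsets are
    normalized by the base diagonal (x, y) or the height (z);
    sizes use log ratios; orientation uses the raw difference."""
    x, y, z, l, w, h, t = proposal
    xg, yg, zg, lg, wg, hg, tg = gt
    d = np.sqrt(l ** 2 + w ** 2)          # diagonal of the proposal's base
    return np.array([(xg - x) / d, (yg - y) / d, (zg - z) / h,
                     np.log(lg / l), np.log(wg / w), np.log(hg / h), tg - t])
```

A proposal exactly matching its ground truth encodes to an all-zero residual vector, which is the fixed point the regression head is trained toward.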

3.5 Training Losses

We adopt an end-to-end strategy to train CT3D. Hence, the overall training loss is the summation of the RPN loss, the confidence prediction loss, and the box regression loss:

L = L_rpn + L_conf + L_reg.
Here, the binary cross-entropy loss [11, 35] is exploited for the predicted confidence c to compute the IoU-guided confidence loss:

L_conf = −c^t log(c) − (1 − c^t) log(1 − c).
Moreover, the box regression loss [35, 33] adopts the Smooth-L1 loss over the encoded residuals:

L_reg = 1(IoU ≥ θ_reg) · Σ_{u ∈ {x, y, z, l, w, h, θ}} SmoothL1(û^t, u^t),

where the indicator 1(IoU ≥ θ_reg) means that only proposals with IoU ≥ θ_reg contribute to the regression loss.
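The second-stage losses above can be sketched as follows; the regression threshold value `theta_reg=0.55` and the mean-over-foreground normalization are illustrative assumptions:

```python
import numpy as np

def bce_confidence_loss(pred, target, eps=1e-7):
    """Binary cross entropy between predicted confidences and the
    IoU-guided targets, averaged over proposals."""
    pred = np.clip(pred, eps, 1 - eps)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean()

def smooth_l1(x, beta=1.0):
    x = np.abs(x)
    return np.where(x < beta, 0.5 * x ** 2 / beta, x - 0.5 * beta)

def regression_loss(pred_res, target_res, iou, theta_reg=0.55):
    """pred_res/target_res: (B, 7) encoded residuals; iou: (B,).
    Only proposals whose IoU exceeds theta_reg (an assumed threshold)
    contribute to the loss."""
    mask = (iou >= theta_reg).astype(float)                  # (B,)
    per_box = smooth_l1(pred_res - target_res).sum(axis=1)   # (B,)
    return (mask * per_box).sum() / max(mask.sum(), 1.0)
```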

4 Experiments

In this section, we evaluate our CT3D on two public datasets, KITTI [7] and Waymo [18, 42]. Furthermore, we conduct comprehensive ablation studies to verify the effectiveness of each module in CT3D.

4.1 Dataset

KITTI Dataset. The KITTI dataset officially contains 7,481 training LiDAR samples and 7,518 testing LiDAR samples. Following the previous work [2], we split the original training data into 3,712 training samples and 3,769 validation samples for experimental studies.

Waymo Dataset. The Waymo dataset consists of 798 training sequences with around 158,361 LiDAR samples, and 202 validation sequences with 40,077 LiDAR samples. Detection on this large-scale dataset is more challenging due to its varied autonomous driving scenarios [42].

4.2 Implementation Details

RPN. We adopt SECOND [33] as our RPN due to its high-quality proposals and fast inference speed. For the KITTI dataset, the detection ranges are set as [0, 70.4]m for the X axis, [−40, 40]m for the Y axis, and [−3, 1]m for the Z axis, with a voxel size of (0.05, 0.05, 0.1)m. For the Waymo dataset, the corresponding ranges are [−75.2, 75.2]m for the X and Y axes and [−2, 4]m for the Z axis, with a voxel size of (0.1, 0.1, 0.15)m. The RPN loss L_rpn consists of a Focal-Loss classification branch and a Smooth-L1-Loss regression branch. Please refer to OpenPCDet [30] for more details, since we conduct our experiments with this toolbox.

Method Par. 3D Detection - Car
(M) Easy Mod. Hard
LiDAR & RGB
MV3D, CVPR 2017 [3] - 74.97 63.63 54.00
ContFuse, ECCV 2018 [17] - 83.68 68.78 61.67
AVOD-FPN, IROS 2018 [12] - 83.07 71.76 65.73
F-PointNet, CVPR 2018 [21] 40 82.19 69.79 60.59
UberATG-MMF, CVPR 2019 [16] - 88.40 77.43 70.22
3D-CVF at SPA, ECCV 2020 [38] - 89.20 80.05 73.11
CLOCs, IROS 2020 [20] - 88.94 80.67 77.15
LiDAR only
SECOND, Sensors 2018 [33] 20 83.34 72.55 65.82
PointPillars, CVPR 2019 [13] 18 82.58 74.31 68.99
STD, ICCV 2019 [37] - 87.95 79.71 75.09
PointRCNN, CVPR 2019 [25] 16 86.96 75.64 70.70
3D IoU Loss, 3DV 2019 [40] - 86.16 76.50 71.39
Part-A², PAMI 2020 [26] 226 87.81 78.49 73.51
SA-SSD, CVPR 2020 [10] 40.8 88.75 79.79 74.16
3DSSD, CVPR 2020 [36] - 88.36 79.57 74.55
PV-RCNN, CVPR 2020 [24] 50 90.25 81.43 76.82
Voxel-RCNN, AAAI 2021 [4] 28 90.90 81.62 77.06
CT3D (Ours) 30 87.83 81.77 77.16
Table 1: Performance comparisons with state-of-the-art methods on the KITTI test set. All results are reported by the average precision with 0.7 IoU threshold and 40 recall positions.
Method 3D Detection - Car
Easy Mod. Hard
LiDAR & RGB
MV3D, CVPR 2017 [3] 71.29 62.68 56.56
ContFuse, ECCV 2018 [17] - 73.25 -
AVOD-FPN, IROS 2018 [12] - 74.44 -
F-PointNet, CVPR 2018 [21] 83.76 70.92 63.65
3D-CVF at SPA, ECCV 2020 [38] 89.67 79.88 78.47
LiDAR only
SECOND, Sensors 2018 [33] 88.61 78.62 77.22
PointPillars, CVPR 2019 [13] 86.62 76.06 68.91
STD, ICCV 2019 [37] 89.70 79.80 79.30
PointRCNN, CVPR 2019 [25] 88.88 78.63 77.38
SA-SSD, CVPR 2020 [10] 90.15 79.91 78.78
3DSSD, CVPR 2020 [36] 89.71 79.45 78.67
PV-RCNN, CVPR 2020 [24] 89.35 83.69 78.70
Voxel-RCNN, AAAI 2021 [4] 89.41 84.52 78.93
CT3D (Ours) 89.54 86.06 78.99
Table 2: Performance comparisons with state-of-the-art methods on the KITTI val set. All results are reported by the average precision with 0.7 IoU threshold and 11 recall positions.
IoU BEV Detection 3D Detection
Thr. Easy Mod. Hard Easy Mod. Hard
0.7 96.14 91.88 89.63 92.85 85.82 83.46
Table 3: Performance of our CT3D on the KITTI val set with AP calculated by 40 recall positions for car category.
IoU Pedestrian Cyclist
Thr. Easy Mod. Hard Easy Mod. Hard
0.5 65.73 58.56 53.04 91.99 71.60 67.34
Table 4: Performance for pedestrian and cyclist on the KITTI.
Difficulty Method 3D Detection - Vehicle BEV Detection - Vehicle
Overall 0-30m 30-50m 50m-Inf Overall 0-30m 30-50m 50m-Inf
LEVEL_1 PointPillar, CVPR 2019 [13] 56.62 81.01 51.75 27.94 75.57 92.10 74.06 55.47
MVF, CoRL 2020 [42] 62.93 86.30 60.02 36.02 80.40 93.59 79.21 63.09
Pillar-OD, arXiv 2020 [32] 69.80 88.53 66.50 42.93 87.11 95.78 84.87 72.12
PV-RCNN, CVPR 2020 [24] 70.30 91.92 69.21 42.17 82.96 97.35 82.99 64.97
Voxel-RCNN, AAAI 2021 [4] 75.59 92.49 74.09 53.15 88.19 97.62 87.34 77.70
CT3D (Ours) 76.30 92.51 75.07 55.36 90.50 97.64 88.06 78.89
LEVEL_2 PV-RCNN, CVPR 2020 [24] 65.36 91.58 65.13 36.46 77.45 94.64 80.39 55.39
Voxel-RCNN, AAAI 2021 [4] 66.59 91.74 67.89 40.80 81.07 96.99 81.37 63.26
CT3D (Ours) 69.04 91.76 68.93 42.60 81.74 97.05 82.22 64.34
Table 5: Performance comparisons with state-of-the-art methods on the Waymo dataset with 202 validation sequences (~40k samples) for vehicle detection.

Training Details. We use 8 GPUs to train the entire network on both the KITTI and Waymo datasets. For the encoder and decoder modules of the channel-wise Transformer, we set the channel dimension D and the number of attention heads H; for the training targets, we set the foreground and background IoU thresholds θ_f and θ_b. The whole CT3D framework is trained end-to-end from scratch with the ADAM optimizer for 100 epochs. We adopt a cosine annealing strategy for the learning rate decay, with a maximum learning rate of 0.001. In the training stage, only 128 proposals are randomly selected to calculate the confidence loss, while 64 of them are selected to calculate the regression loss. In the inference stage, the top-100 proposals are selected for the final prediction.

4.3 Detection Results on the KITTI Dataset

We compare our CT3D with state-of-the-art methods on both the KITTI test and val sets with a 0.7 IoU threshold. For our test submission, all the released training data is used to train the model. Following [25, 24, 4, 10], the average precision (AP) for the test set is calculated with 40 recall positions, while the AP for the val set is calculated with 11 recall positions when comparing to previous methods. (The official AP calculation was changed from 11 to 40 recall positions on 08.10.2019; for fair comparison with earlier methods, we adopt the 11-recall setting on the val set.)
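The difference between the two AP conventions can be sketched with a generic interpolated-AP routine (the KITTI detection-matching pipeline is omitted; only the recall-sampling step differs, with the 40-position variant starting at 1/40 rather than 0):

```python
import numpy as np

def interpolated_ap(recall, precision, num_positions):
    """AP by sampling the interpolated precision (max precision at any
    recall >= r) at evenly spaced recall positions.
    recall/precision: 1-D arrays from a precision-recall curve."""
    if num_positions == 11:
        sample = np.linspace(0.0, 1.0, 11)                 # classic 11-point rule
    else:
        sample = np.linspace(1.0 / num_positions, 1.0, num_positions)  # e.g. R40
    ap = 0.0
    for r in sample:
        above = precision[recall >= r]
        ap += above.max() if above.size else 0.0
    return ap / num_positions
```

Because the 11-point rule includes the trivial recall position 0, the two conventions generally yield different values on the same curve, which is why results under R11 and R40 are not directly comparable.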

Performance Comparisons. Table 1 shows the performance comparisons between our method and state-of-the-art methods on the official KITTI test server. CT3D achieves the best performance on the moderate and hard levels for car detection across both LiDAR-only and LiDAR&RGB modalities, especially on the most important moderate level [8]. Compared with the recently released PV-RCNN, which shares the same RPN (i.e., SECOND) as ours, CT3D achieves better performance while requiring about one third of the parameters for refinement. Besides, as shown in Figure 4, CT3D presents much better qualitative results than PV-RCNN. This significant improvement mainly comes from the fact that CT3D processes the raw points in the refinement stage rather than relying on human-specified designs and sub-optimal intermediate features. Note that the AP of our CT3D on the easy level is comparatively lower; there might be two reasons. First, we sample only 256 raw points within each proposal for all levels, even though proposals on the easy level usually contain a much larger number of points. Second, we observe that KITTI exhibits large distribution differences between the trainval and test sets.

For further validation, we conduct comparisons with previous methods on the KITTI val set. Table 2 shows that our CT3D outperforms all the other methods by a large margin, leading the state-of-the-art method Voxel-RCNN by 1.54% on the moderate level, and achieving competitive results on the easy level. This improvement further verifies the effectiveness of our method, indicating that CT3D can better model context information and dependencies than methods based on multi-scale feature fusion. Our model also achieves strong performance on pedestrian and cyclist detection. The car-BEV, pedestrian-3D and cyclist-3D results are presented in Table 3 and Table 4 for reference.

4.4 Detection Results on the Waymo Dataset

As for the Waymo dataset, we train our model on the training set and evaluate it on the validation set. Likewise, the mAP is calculated with a 0.7 IoU threshold for vehicle detection. The data is split into two difficulty levels: LEVEL_1 denotes objects containing more than 5 points, and LEVEL_2 denotes objects containing at least 1 point.

Performance Comparisons. In Table 5, we compare our CT3D with state-of-the-art methods using the officially released evaluation tools [29]. It can be seen that our method outperforms all previous methods by remarkable margins on all distance ranges of interest in both LEVEL_1 and LEVEL_2. CT3D achieves 76.30% on the commonly used LEVEL_1 3D mAP metric, surpassing the previous state-of-the-art method Voxel-RCNN by 0.71% on 3D detection and by 2.31% on bird's-eye-view detection. This significant improvement also verifies the effectiveness of our CT3D approach for large-scale point cloud feature representation. For the LEVEL_2 difficulty in Table 5, our method outperforms Voxel-RCNN significantly, by 2.45% on 3D detection. A contributing factor is that Voxel-RCNN limits the feature interactions by dividing the RoI space into grids, while our proposed CT3D has the obvious advantage of capturing long-range interactions among sparse points.

Figure 4: Qualitative comparison results of 3D object detection on the KITTI test set. Our CT3D enables more reasonable and accurate detection as compared to the PV-RCNN.
Figure 5: Attention maps generated by the self-attention layer. We visualize the weights of at most 256 sampled points within 3 RoIs (red dotted lines) at different training epochs.

4.5 Ablation Studies

In this section, we conduct comprehensive ablation studies for the CT3D to verify the effectiveness of each individual component. We report the 3D detection AP metric with 40 recall positions on the KITTI val set.

Different RPN Backbones. In Table 6, we validate the effect of our refinement network with the SECOND RPN [33] and the PointPillar RPN [13], respectively. The detection performance improves by +5.47% and +4.82% compared to the respective RPN baselines. This shows that our two-stage CT3D framework can be integrated on top of any RPN to provide a strong proposal refinement ability. We also provide the number of parameters in Table 6 for reference.

Proposal-to-point Embedding. We investigate the importance of the keypoints subtraction strategy by comparing it with the baseline size-orientation strategy adopted in the proposal-to-point embedding of Sec. 3.2. Table 7 shows that the keypoints subtraction approach significantly improves the performance on all three difficulty levels. The rationale behind this strategy is that the relative coordinates between each point and the proposal keypoints provide more effective geometric information, forming high-quality point location embeddings.

Self-attention Encoding. Table 7 also shows that removing the self-attention encoding degrades the performance considerably, which demonstrates that self-attention enables a better feature representation for each point by aggregating global context information and dependencies. Moreover, we visualize the attention maps of the last self-attention layer of a trained model from different epoch checkpoints. As shown in Figure 5, the points on cars receive more attention at epoch 80, even in the extremely sparse case of Figure 5(c). In contrast, the background points receive less attention as training progresses. Therefore, CT3D pays more attention to foreground points, and thus achieves considerable performance.

Channel-wise Decoding. As shown in the last three rows of Table 7, the extended channel-wise re-weighting outperforms both the standard decoding and the channel-wise re-weighting by a large margin. This benefits from integrating the standard decoding and the channel-wise re-weighting for both global and channel-wise local aggregation, generating more effective decoding weights.

5 Conclusion

In this paper, we present a two-stage 3D object detection framework, CT3D, with a novel channel-wise Transformer architecture. Our method first encodes the proposal information into each raw point via an efficient proposal-to-point embedding, followed by self-attention to capture long-range interactions among points. Subsequently, we transform the encoded point features into a global proposal-aware representation with an extended channel-wise re-weighting scheme, which obtains effective decoding weights for all points. CT3D provides a flexible and highly effective framework that is particularly helpful for point cloud detection tasks. Experimental results on both the KITTI dataset and the large-scale Waymo dataset verify that CT3D achieves significant improvements over state-of-the-art methods.

PointPillar RPN SECOND RPN Two-stage refinement Par. (M) Moderate AP (%)
✓ – – 18 79.26
✓ – ✓ 28 84.08
– ✓ – 20 80.35
– ✓ ✓ 30 85.82
Table 6: Ablation studies for different RPNs on the KITTI val set in terms of 3D detection AP metric with 40 recall positions.
K. S. S. E. S. D. C. R. E. C. R. Easy Mod. Hard
– ✓ – – ✓ 90.29 79.20 74.59
✓ – – – ✓ 91.92 83.41 81.79
✓ ✓ ✓ – – 92.09 85.10 82.98
✓ ✓ – ✓ – 92.56 85.34 83.23
✓ ✓ – – ✓ 92.85 85.82 83.46
Table 7: Ablation studies for proposal-to-point embedding, self-attention encoding and channel-wise decoding on the KITTI val set. “K. S.” stands for the keypoints subtraction strategy, “S. E.” stands for the self-attention encoding, “S. D.”, “C. R.” and “E. C. R.” represent the standard decoding, channel-wise re-weighting, and our extended channel-wise re-weighting, respectively.


Acknowledgements. This work was supported by the Alibaba Innovative Research (AIR) program and the Major Scientific Research Project of Zhejiang Lab (No. 2019DB0ZX01).