Investigating Attention Mechanism in 3D Point Cloud Object Detection

Object detection in three-dimensional (3D) space attracts much interest from academia and industry since it is an essential task in AI-driven applications such as robotics, autonomous driving, and augmented reality. As the basic format of 3D data, the point cloud can provide detailed geometric information about objects in the original 3D space. However, due to the sparsity and unorderedness of 3D data, specially designed networks and modules are needed to process this type of data. The attention mechanism has achieved impressive performance in diverse computer vision tasks; however, it is unclear how attention modules affect the performance of 3D point cloud object detection, and what sort of attention modules fit the inherent properties of 3D data. This work investigates the role of the attention mechanism in 3D point cloud object detection and provides insights into the potential of different attention modules. To achieve that, we comprehensively investigate classical 2D attentions and novel 3D attentions, including the latest point cloud transformers, on the SUN RGB-D and ScanNetV2 datasets. Based on detailed experiments and analysis, we summarize the effects of different attention modules. This paper is expected to serve as a reference source for attention-embedded 3D point cloud object detection. The code and trained models are available at: https://github.com/ShiQiu0419/attentions_in_3D_detection.


1 Introduction

3D data such as point clouds provide detailed geometric and structural information for scene understanding compared to images. As a result, applications of 3D data have attracted more and more attention in recent years. Among these applications, 3D point cloud object detection is an essential task that is highly desired in robotics, autonomous driving, and augmented reality. Due to the irregular format of 3D point cloud data, the pipelines of 3D object detection [24, 23, 22] differ from conventional 2D object detection methods [9, 30, 12], raising new challenges.

(a) Overall results (%) on SUN RGB-D [33] dataset.
(b) Overall results (%) on ScanNetV2 [5] dataset.
Figure 1: The performances (IoU threshold=0.25) of using different attention modules in VoteNet [23] backbone for 3D point cloud object detection.

Many frameworks [25, 34, 4, 35] have been proposed to process irregular point cloud data for object detection. For example, VoxNet [21] voxelizes the point cloud data and uses 3D CNNs to detect objects, but this usually incurs an expensive computational cost. To leverage the solid performance of 2D detectors, some methods [3, 17] project the point cloud data to a front view or bird's-eye view (BEV) and lift existing 2D detectors for 3D region proposals. However, since the occlusion problem remains challenging in 2D detection, this issue can also be problematic in 3D space.

PointNet [25] and PointNet++ [26] were proposed to process point cloud data directly, without voxelization or projection into BEVs. This line of methods achieves excellent performance in classification [27] and semantic segmentation [29] tasks. Although these methods can effectively process unordered point cloud data, they cannot be directly used for 3D object detection due to the point cloud's inherent sparsity. In 2D images, the center of an object is likely to lie inside the object's pixels, while the center of an object in a 3D point cloud usually lies in empty space, since 3D scanners [2, 15] can only capture the surfaces of objects. Thus, it is hard to aggregate context around an object center, which leads to difficulties in detecting 3D objects in point clouds.

To address this problem, VoteNet [23] generates seed points that are close to object centers via a PointNet++ backbone, and then produces region proposals after voting and clustering the seed points. Moreover, VoteNet is an end-to-end trainable network that significantly simplifies the 3D object detection pipeline by omitting 2D-to-3D conversions and the associated pre/post-processing steps. Although VoteNet is efficient in 3D object detection, its effectiveness relies heavily on the point features learned by the backbone network. In detail, the learned point features not only affect the selection of seed points but are also involved in the subsequent voting process, since the offsets from the object centers are estimated entirely from them. Therefore, extracting high-quality point features is a critical factor for the success of such a 3D object detection pipeline.

Feature learning is a fundamental problem in computer vision, especially when using Convolutional Neural Networks (CNNs). Following the success of the attention mechanism [36] in language-related tasks, attention, as a basic function for extracting and refining features, can significantly improve the performance of many computer vision tasks. Regular attention modules used in image feature learning can be categorized into three main types based on their operating domains: spatial attention [38, 14], channel attention [13], and mixed attention [40, 8]. More recently, an increasing number of attention modules have been proposed for 3D point cloud analysis, including self-attention [41, 7, 28] and transformer [10, 49] methods. Although the attention mechanism has shown its effectiveness in point cloud classification [28] and semantic segmentation [43], few attention modules have been utilized in point cloud object detection, since it is unclear whether the attention mechanism also fits this task. As explained above, the learned point features in VoteNet [23] are critical for 3D object detection. Thus, taking the recent VoteNet [23] as a basic pipeline, we investigate the advantages and disadvantages of different attention modules in 3D point cloud object detection.

To thoroughly investigate the attention mechanism's effects on 3D point cloud object detection, we analyze five classical 2D attention modules [13, 38, 14, 40, 8] and five novel 3D attention modules [41, 7, 28, 10, 49] (more details are provided in Section 3). Furthermore, we conduct experiments on two widely used 3D object detection benchmarks, the SUN RGB-D [33] and ScanNetV2 [5] datasets, under different metrics, such as the ones shown in Figure 1. In general, our main contributions can be summarized as follows:

  • We push the VoteNet pipeline towards better performance for 3D point cloud object detection by integrating attention mechanisms into it.

  • We are the first to comprehensively evaluate the performances of ten recent attention modules for 3D point cloud object detection on SUN RGB-D and ScanNetV2 datasets.

  • We concretely summarize the effects and characteristics of different types of attention modules and provide novel insights and inspiration to facilitate the understanding of the attention mechanism for 3D point cloud object detection.

2 Related Work

Point Cloud Networks. Due to the rapid development of 3D sensors, point cloud data can be easily collected using LiDAR scanners and RGB-D cameras. To better understand the information contained in point cloud data, different CNN-based point cloud networks have been invented for machine perception. To be specific, early methods [34, 48, 21] attempt to convert raw point clouds into a particular intermediate representation (e.g., images or voxels) according to projective relations, then apply regular 2D/3D CNN operations to learn high-dimensional features for the subsequent analysis. However, the intermediate representations of point cloud data usually suffer from geometric information loss, resulting in inaccurate predictions.

To avoid this problem, current methods tend to exploit Multi-Layer Perceptrons (MLPs), originally proposed in PointNet [25]. In practice, an MLP is implemented as shared 1×1 convolutions followed by a Batch Normalization (BN) layer and an activation function. In particular, the MLP can cope with the sparsity and unorderedness of raw point cloud data because all points share the same learnable weights. Moreover, PointNet++ [26] extends the basic MLP operation to aggregate local features from pre-defined point neighborhoods via a symmetric function such as max-pooling. Since VoteNet [23] takes PointNet++ as the backbone for feature learning, we mainly use the MLP as the essential convolutional operation in the attention mechanism for 3D point cloud object detection.
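For intuition, the shared-MLP computation can be sketched numerically as follows (a simplified NumPy sketch with random weights, not our actual PyTorch implementation). Because every point passes through the same weights, the per-point outputs simply follow any permutation of the input points:

```python
import numpy as np

def shared_mlp(points, weight, bias, eps=1e-5):
    """One shared-MLP layer applied to a point cloud.

    points: (N, C_in) per-point features.
    weight: (C_in, C_out), bias: (C_out,) -- shared by all points, which is
    what makes the layer independent of point ordering. Equivalent to a 1x1
    convolution followed by normalization and a ReLU activation.
    """
    x = points @ weight + bias                     # 1x1 conv == per-point linear
    x = (x - x.mean(0)) / np.sqrt(x.var(0) + eps)  # normalize over the point set
    return np.maximum(x, 0.0)                      # ReLU

rng = np.random.default_rng(0)
pts = rng.normal(size=(1024, 3))                   # N=1024 points with xyz coords
w, b = rng.normal(size=(3, 64)), np.zeros(64)

feat = shared_mlp(pts, w, b)
perm = rng.permutation(1024)
feat_perm = shared_mlp(pts[perm], w, b)

print(feat.shape)                          # (1024, 64)
print(np.allclose(feat[perm], feat_perm))  # True: outputs follow the permutation
```

The permutation check illustrates why shared weights sidestep the unorderedness of raw point clouds: reordering the input only reorders the output features.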

3D Object Detection. The aim of object detection in 3D space is to predict the class label and 3D bounding box for each object. In general, the standard approaches for 3D object detection can be categorized into two streams [11]: the region proposal-based and single-shot methods.

Region proposal-based methods are usually two-stage, where the first stage generates region proposals and the second stage decides the class label of each proposal. More concretely, region proposal-based methods follow three tracks: multi-view based methods [3, 17, 19], segmentation-based methods [31, 37, 47], and frustum-based methods [24, 42, 39]. In addition, single-shot methods can be more efficient than region proposal-based methods since they directly estimate the class probabilities and regress the bounding boxes. According to the way they process the raw 3D data, single-shot methods can be further divided into BEV-based [45, 44], discretization-based [50, 32, 18], and point-based [46] methods. Theoretically, VoteNet can be recognized as a region proposal-based method [11]; meanwhile, it also shares the efficiency of point-based methods, as it can directly take point cloud data as input.

Attention Mechanism. Initially, the attention mechanism was introduced to imitate the human vision system by focusing on the features most relevant to a target rather than on the whole scene, which may contain irrelevant context. Many methods have been introduced to estimate attention (weight) maps in order to re-weight the original feature maps learned by CNNs. For image-related tasks, the attention map can be generated according to spatial [38, 14] or channel-related [13, 1, 6] information, while some methods [8, 40] incorporate both for better information integration. In addition, point cloud networks tend to utilize a self-attention [36] structure, which can estimate long-range dependencies regardless of any specific order between the elements. In practice, we can leverage the basic form of self-attention to calculate either point-wise relations [41, 7] or channel-wise affinities [28] in a wide range of point cloud analysis problems.

In this work, we adopt ten standard attention modules covering the main types of existing designs used in both 2D images and 3D point clouds. By applying them to the backbone of VoteNet, we can comprehensively study the role of the attention mechanism in 3D point cloud object detection.

Figure 2: An overview of VoteNet [23] pipeline including the structure of our attentional backbone.

3 Approach

As shown in Figure 2, the VoteNet pipeline consists of a backbone that learns features, a voting module estimating the object centers, as well as a predicting module regressing the bounding boxes and class labels.

More concretely, the last two rows of Figure 2 compare the detailed structures of VoteNet’s original backbone and our attentional backbone. In general, the Set Abstraction (SA) [26] layer and the Feature Propagation (FP) [26] layer act as the encoder and decoder of the backbone, respectively. Following the common usage in image-related CNNs, an attention module is placed after each encoder (SA) and decoder (FP) stage of the backbone. Moreover, to generate only a few seed points (e.g., 1,024) from all input points (e.g., 20,000), we adopt the official implementation (https://github.com/facebookresearch/votenet) of VoteNet, which leverages four SA layers for down-sampling but only two FP layers for up-sampling.
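The wiring described above can be sketched as follows (a toy Python sketch with stand-in stage functions, not the official VoteNet code): every SA and FP stage is followed by its own attention module, and the encoder outputs are kept as skip features for the decoder.

```python
# Hypothetical sketch of the attentional backbone: each Set Abstraction (SA)
# and Feature Propagation (FP) stage is followed by one attention module.
def attentional_backbone(points, sa_layers, fp_layers, attentions):
    attn = iter(attentions)                # one attention module per SA/FP stage
    feats, skips = points, []
    for sa in sa_layers:
        skips.append(feats)                # keep coarser features for the decoder
        feats = next(attn)(sa(feats))      # encode, then refine with attention
    for fp in fp_layers:
        feats = next(attn)(fp(feats, skips.pop()))  # decode with skip features
    return feats                           # seed features fed to the voting module

# Toy demo with scalar "features": four SA stages, two FP stages, identity attention.
sa = [lambda x: x + 1] * 4
fp = [lambda x, s: x + s] * 2
out = attentional_backbone(0, sa, fp, [lambda x: x] * 6)
print(out)  # 9
```

With four SA layers and two FP layers as in the official configuration, six attention modules are inserted in total.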

Figure 3: The detailed structures of different attention modules that are investigated in this work. (2D attentions: Non-Local [38], Criss-cross [14], SE [13], CBAM [40], Dual-Attention [8]; 3D attentions: A-SCN [41], Point-Attention [7], CAA [28], Offset-Attention [10], Point-Transformer [49].)

3.1 2D Attentions

The Non-local [38] block is a spatial attention module that represents a pixel using a weighted sum of features. Particularly, the weights are estimated as long-range dependencies (i.e., inner products) between pixels, by which the learned feature maps can be enhanced with both local and global information.
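For illustration, the core non-local computation can be sketched as follows (a simplified NumPy sketch with random projection weights; the learned output projection and batching of the original module are omitted):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_local(x, wq, wk, wv):
    """Minimal non-local block on N flattened positions (pixels or points).

    x: (N, C). Pairwise inner products between projected features give the
    long-range dependencies used to re-weight the value features; a residual
    connection keeps the original signal.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q @ k.T, axis=-1)  # (N, N) affinities; each row sums to 1
    return x + attn @ v               # weighted global aggregation + skip

rng = np.random.default_rng(1)
x = rng.normal(size=(256, 32))
wq, wk, wv = (rng.normal(size=(32, 32)) * 0.1 for _ in range(3))
out = non_local(x, wq, wk, wv)
print(out.shape)  # (256, 32)
```

The (N, N) attention map is what makes this block memory-hungry for large point sets, which motivates cheaper variants such as criss-cross attention below.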

Criss-cross attention [14] is another spatial attention module, which exploits the pixels on a criss-cross path to efficiently obtain the contextual information of a given position. Compared to the Non-local block, criss-cross attention saves memory while achieving comparable performance.

The Squeeze-and-Excitation (SE) block [13] is a channel attention module that refines features by exploiting the inter-dependencies between channels. At first, the SE block squeezes the spatial information into channel-wise descriptors; then, it learns the weights for different channels by exciting the descriptors with convolutions and activations.
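The squeeze-and-excitation computation can be sketched as follows (a simplified NumPy sketch on point features with random bottleneck weights, not the original 2D implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, w1, w2):
    """Squeeze-and-Excitation sketch for point features.

    x: (N, C). Squeeze: average-pool the N points into one C-dim descriptor.
    Excite: a two-layer bottleneck (reduction factor r = C // w1.shape[1])
    produces per-channel weights in (0, 1) that rescale the feature map.
    """
    s = x.mean(axis=0)                         # squeeze: (C,) descriptor
    e = sigmoid(np.maximum(s @ w1, 0.0) @ w2)  # excite: (C,) channel weights
    return x * e                               # re-weight every channel

rng = np.random.default_rng(2)
C, r = 64, 8                                   # reduction factor 8, as in our experiments
x = rng.normal(size=(1024, C))
w1, w2 = rng.normal(size=(C, C // r)), rng.normal(size=(C // r, C))
out = se_block(x, w1, w2)
print(out.shape)  # (1024, 64)
```

Note that the cost is dominated by the tiny C × C/r bottleneck, which is why SE adds almost no overhead compared to point-wise self-attention.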

The Convolutional Block Attention Module (CBAM) [40] is a mixed attention module that consists of two sequential attention blocks. To be specific, a channel attention block is leveraged to capture channel-wise information based on the inter-channel relationship of features. In practice, it utilizes the sum of spatially average-pooled and max-pooled descriptors to encode a channel attention map. In addition, a spatial attention block exploits the inter-spatial relationship of features for complementary context. In contrast to the channel block, the spatial attention block applies average-pooling and max-pooling along the channels, then concatenates the pooled features to generate a spatial attention map.
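For illustration, the two sequential CBAM blocks can be sketched on point features as follows (a simplified NumPy sketch with random weights; the 7×7 spatial convolution of the original 2D module is replaced here by a per-point projection as a stand-in):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbam(x, w1, w2, w_sp):
    """CBAM-style sequential attention sketch on point features x: (N, C).

    Channel block: average- and max-pooled descriptors pass through a shared
    bottleneck (w1, w2) and are summed, giving C channel weights.
    Spatial block: average- and max-pooling along channels are concatenated
    and projected by w_sp, giving one weight per point (the point-space
    analogue of CBAM's spatial attention map).
    """
    shared = lambda d: np.maximum(d @ w1, 0.0) @ w2
    ch = sigmoid(shared(x.mean(axis=0)) + shared(x.max(axis=0)))  # (C,)
    x = x * ch                                                    # channel re-weighting
    pooled = np.stack([x.mean(axis=1), x.max(axis=1)], axis=1)    # (N, 2)
    sp = sigmoid(pooled @ w_sp)                                   # (N, 1) point weights
    return x * sp                                                 # spatial re-weighting

rng = np.random.default_rng(3)
N, C, r = 512, 64, 8                       # reduction factor 8, as in our experiments
x = rng.normal(size=(N, C))
w1, w2 = rng.normal(size=(C, C // r)), rng.normal(size=(C // r, C))
w_sp = rng.normal(size=(2, 1))
out = cbam(x, w1, w2, w_sp)
print(out.shape)  # (512, 64)
```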

Moreover, Dual-Attention [8] is another mixed attention structure, exploiting both channel-wise and spatial-wise long-range dependencies of features to enhance their discriminative ability. Notably, this structure utilizes a position attention module and a channel attention module in parallel. To be more specific, the position attention module models the dependencies between positions in a feature map in a manner similar to the Non-local [38] block. In contrast, the channel attention module directly estimates the dependencies between channels without any convolution. Finally, the outputs of the two modules are aggregated by simple element-wise summation to generate the final feature map.

 

Method bed table sofa chair toilet desk dresser nightstand bookshelf bathtub mAP
baseline [23] 83.3 49.8 64.1 74.1 89.3 23.8 26.4 60.7 30.9 72.8 57.5

2D Attentions

Non-local [38] 84.7 51.4 62.9 75.1 89.4 24.3 28.8 61.8 28.0 76.6 58.3
Criss-cross [14] 82.9 49.8 62.1 74.1 85.9 24.2 27.3 60.2 28.1 67.2 56.2
SE [13] 84.2 50.7 65.0 75.3 90.6 26.8 32.3 63.4 31.6 76.5 59.6
CBAM [40] 84.8 50.7 64.1 74.5 90.4 25.8 33.7 65.9 28.8 72.0 59.1
Dual-attn [8] 79.7 44.5 54.3 67.4 86.5 18.6 23.8 45.8 18.1 67.1 50.6

3D Attentions

A-SCN [41] 81.8 48.9 63.8 74.0 88.3 24.5 26.7 57.5 24.9 65.4 55.6
Point-attn [7] 84.4 49.0 61.9 73.8 87.4 25.7 24.6 56.0 28.2 73.1 56.4
CAA [28] 83.7 50.2 63.4 74.9 89.7 25.7 30.6 64.7 27.5 77.6 58.8
Point-trans [49] 84.6 50.4 63.2 74.3 86.7 25.2 27.6 64.5 31.5 73.1 58.1
Offset-attn [10] 82.8 49.8 60.5 73.0 86.5 23.6 27.1 56.5 25.6 71.2 55.7

 

Table 1: The results of Average Precision on SUN RGB-D [33] dataset. (IoU threshold = 0.25)

 

Method bed table sofa chair toilet desk dresser nightstand bookshelf bathtub AR
baseline [23] 95.2 85.5 89.5 86.7 97.4 78.8 81.0 87.8 68.6 90.4 86.1

2D Attentions

Non-local [38] 95.4 84.6 89.2 87.2 96.0 79.7 81.9 90.6 63.2 92.3 86.0
Criss-cross [14] 93.4 84.2 89.0 86.7 94.7 78.3 82.4 89.8 66.2 84.6 84.9
SE [13] 94.5 85.6 89.2 86.9 99.3 80.4 82.4 89.8 69.2 90.4 86.8
CBAM [40] 95.9 84.7 90.1 86.7 97.4 79.1 83.8 90.2 68.6 86.5 86.3
Dual-attn [8] 92.1 80.9 86.1 84.1 95.4 77.2 79.2 83.1 66.2 84.6 82.9

3D Attentions

A-SCN [41] 94.1 83.3 88.4 87.3 96.7 78.8 77.3 85.4 67.6 80.8 84.0
Point-attn [7] 94.8 83.6 88.9 86.3 95.4 78.7 78.2 88.2 62.5 86.5 84.3
CAA [28] 94.1 84.7 89.7 86.8 97.4 79.3 80.6 89.8 65.9 90.4 85.9
Point-trans [49] 93.9 84.3 87.9 85.5 96.7 77.2 76.9 88.6 66.6 84.6 84.2
Offset-attn [10] 94.1 83.5 87.8 86.1 97.4 78.9 78.2 88.2 64.9 86.5 84.6

 

Table 2: The results of Recall on SUN RGB-D [33] dataset. (IoU threshold = 0.25)

3.2 3D Attentions

Attentional ShapeContextNet (A-SCN) [41] introduces a self-attention-based module to exploit shape context-driven features in 3D point clouds. By comparing the query and key matrices, the attention map is estimated as point-wise similarities. The output is then calculated as a matrix product between the attention map and the value matrix, together with an additional skip connection from the value matrix.

The Point-Attention [7] module also follows the basic structure of self-attention to capture more shape-related features and long-range correlations from the point space of local point graphs. Additionally, it applies a skip connection to strengthen the relationship between the input and output.

Channel Affinity Attention (CAA) [28] estimates the attention map between channels by calculating channel-wise affinities in a self-attention structure. Specifically, it utilizes a compact channel-wise comparator block and a channel affinity estimator block to compute the similarity and affinity matrices.
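The channel-wise attention computation can be sketched as follows (a simplified NumPy sketch; CAA's comparator and affinity-estimator blocks are reduced to plain linear projections here):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(x, wq, wk):
    """Channel-wise self-attention sketch in the spirit of CAA [28].

    Affinities are computed between the C channels instead of the N points,
    so the attention map is only (C, C): far cheaper than the (N, N) map of
    point-wise self-attention for large point clouds.
    """
    q, k = x @ wq, x @ wk                  # (N, C) projected features
    affinity = softmax(q.T @ k, axis=-1)   # (C, C) channel affinities
    return x + x @ affinity                # re-mix channels + residual

rng = np.random.default_rng(4)
x = rng.normal(size=(2048, 32))
wq = rng.normal(size=(32, 32)) * 0.1
wk = rng.normal(size=(32, 32)) * 0.1
out = channel_attention(x, wq, wk)
print(out.shape)  # (2048, 32)
```

The (C, C) map is independent of the number of points, which matches our later observation that channel-related attention scales gracefully to point clouds.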

Recently, inspired by the success of transformers [16] on 2D images, researchers have also developed transformer-based networks for point cloud analysis, in order to fully exploit the attention mechanism as the basic point feature learning module. For example, Offset-Attention [10] is proposed to estimate the offsets between the input features and the attention features calculated by a self-attention structure. Mainly, Offset-Attention leverages the robustness of relative coordinates under transformations and the effectiveness of the Laplacian matrix in graph convolution.

Moreover, the Point Transformer [49] is designed to take advantage of the local geometric relations between a center point and its neighbors. Using basic MLP operations, the Point Transformer block can effectively aggregate a local feature for each point based on the attention weights learned for its neighbors. With the help of rich local geometric context, this method achieves outstanding performance in both point cloud classification and segmentation tasks.
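The local-attention idea can be sketched as follows (a heavily simplified NumPy sketch: the learned projections, MLP-based position encoding, and vector attention of the actual block are replaced by toy scalar terms):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def point_transformer_layer(xyz, feats, k=8):
    """Local-attention sketch in the spirit of Point Transformer [49].

    For each point, attention weights over its k nearest neighbours are
    derived from feature differences plus a positional term computed from
    relative coordinates; the output aggregates neighbour features with
    those weights.
    """
    d2 = ((xyz[:, None, :] - xyz[None, :, :]) ** 2).sum(-1)  # (N, N) sq. dists
    knn = np.argsort(d2, axis=1)[:, :k]                      # (N, k) neighbour ids
    rel = xyz[:, None, :] - xyz[knn]                         # (N, k, 3) relative coords
    pos = rel.sum(-1)                                        # toy positional encoding
    diff = (feats[:, None, :] - feats[knn]).sum(-1)          # toy feature relation term
    w = softmax(diff + pos, axis=1)                          # (N, k) attention weights
    return (w[..., None] * feats[knn]).sum(axis=1)           # (N, C) aggregated feature

rng = np.random.default_rng(5)
xyz = rng.normal(size=(128, 3))
feats = rng.normal(size=(128, 16))
out = point_transformer_layer(xyz, feats)
print(out.shape)  # (128, 16)
```

The key contrast with the global self-attention modules above is that attention is restricted to a k-nearest-neighbour set and conditioned on 3D relative coordinates.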

 

Method cabinet bed chair sofa table door window bookshelf picture counter desk curtain refrigerator showercurtain toilet sink bathtub garbagebin mAP
baseline [23] 38.5 88.7 87.8 90.4 58.9 45.0 36.7 44.8 4.9 50.0 61.4 39.1 50.4 59.1 97.3 48.7 91.3 39.0 57.3

2D Attentions

Non-local [38] 33.3 86.6 87.5 85.4 58.7 42.4 33.4 47.9 3.6 50.9 66.5 40.3 51.4 65.9 96.9 57.2 92.9 34.4 57.5
Criss-cross [14] 37.7 86.7 86.3 85.2 60.3 40.8 34.6 45.1 5.0 57.8 71.5 40.2 43.9 61.1 94.0 47.4 88.9 34.8 56.7
SE [13] 35.3 88.6 87.5 86.7 59.4 44.7 35.3 57.5 5.6 49.6 70.8 47.1 49.2 61.9 95.7 50.4 92.9 36.3 58.6
CBAM [40] 39.2 88.7 87.9 89.5 60.1 48.2 38.5 49.4 5.3 51.9 69.6 42.5 54.3 61.7 93.3 49.0 88.4 38.6 58.7
Dual-attn [8] 34.7 88.2 86.5 84.4 56.4 42.2 27.0 41.3 2.6 51.7 66.1 37.2 46.3 56.6 98.3 46.2 85.4 33.4 54.7

3D Attentions

A-SCN [41] 37.3 85.7 88.2 87.9 58.2 41.3 31.8 46.8 3.5 50.9 67.9 35.8 49.6 61.3 96.6 53.2 83.9 37.8 56.5
Point-attn [7] 31.8 87.4 84.0 88.4 58.5 38.2 31.5 41.2 2.2 61.2 69.1 29.6 50.7 49.5 97.3 46.6 83.9 33.4 54.7
CAA [28] 36.4 88.5 88.7 89.7 60.0 44.5 38.6 48.4 4.4 49.3 69.8 39.0 43.1 60.4 94.3 53.0 91.3 37.2 57.6
Point-trans [49] 38.8 85.1 87.4 89.3 61.7 47.5 33.8 46.9 9.5 65.3 72.3 47.1 45.6 62.4 98.5 46.1 86.0 41.1 59.1
Offset-attn [10] 38.0 88.1 87.2 89.9 58.5 43.2 27.5 50.2 6.8 59.6 69.9 39.5 50.6 61.5 95.8 51.1 87.2 38.9 58.0

 

Table 3: The results of Average Precision on ScanNetV2 [5] dataset. (IoU threshold = 0.25)

 

Method cabinet bed chair sofa table door window bookshelf picture counter desk curtain refrigerator showercurtain toilet sink bathtub garbagebin AR
baseline [23] 76.3 95.1 91.9 99.0 82.0 72.4 63.8 84.4 23.0 84.6 93.7 71.6 93.0 78.6 98.3 64.3 96.8 70.6 80.0

2D Attentions

Non-local [38] 74.2 93.8 91.8 97.9 84.0 71.3 60.6 81.8 19.4 82.7 94.5 79.1 93.0 96.4 98.3 74.5 96.8 65.7 80.9
Criss-cross [14] 73.9 95.1 91.3 97.9 82.9 68.5 61.0 85.7 18.0 86.5 93.7 73.1 94.7 78.6 94.8 67.3 93.5 67.4 79.1
SE [13] 72.6 95.1 92.0 96.9 82.9 71.3 64.5 85.7 21.2 80.8 96.1 77.6 91.2 92.9 96.6 63.3 96.8 67.7 80.3
CBAM [40] 77.2 95.1 92.1 97.9 84.3 70.7 69.1 87.0 19.8 82.7 94.5 76.1 96.5 96.4 96.6 65.3 90.3 65.8 81.0
Dual-attn [8] 73.4 95.1 92.3 97.9 81.7 71.3 56.0 85.7 18.0 86.5 93.7 74.6 98.2 92.9 100.0 65.3 93.5 67.0 80.2

3D Attentions

A-SCN [41] 75.8 95.1 93.0 97.9 83.7 71.5 62.8 84.4 19.4 82.7 93.7 71.6 98.2 92.9 100.0 71.4 90.3 66.8 80.6
Point-attn [7] 71.2 93.8 90.4 97.9 82.6 70.0 61.7 84.4 17.6 84.6 95.3 74.6 96.5 85.7 100.0 66.3 90.3 66.6 79.4
CAA [28] 74.2 93.8 92.1 99.0 84.0 71.5 65.6 83.1 21.2 84.6 95.3 73.1 94.7 92.9 94.8 67.3 96.8 66.2 80.6
Point-trans [49] 74.2 95.1 91.7 95.9 83.7 69.6 59.6 83.1 23.0 90.4 92.9 76.1 93.0 85.7 100.0 64.3 93.5 68.3 80.0
Offset-attn [10] 73.9 95.1 92.1 97.9 81.4 71.7 56.7 80.5 20.7 86.5 96.1 73.1 98.2 92.9 98.3 67.3 90.3 65.5 79.9

 

Table 4: The results of Recall on ScanNetV2 [5] dataset. (IoU threshold = 0.25)

4 Experiments

4.1 Datasets

We evaluate the performances of different attention modules on two datasets, SUN RGB-D [33] and ScanNetV2 [5], both captured from real-world indoor scenes using RGB-D cameras. To be more specific:

  • SUN RGB-D: There are 5,285 training and 5,050 testing RGB-D images in the dataset, where each object is precisely annotated with a bounding box and one of 35 semantic classes. According to the provided camera parameters, the original data and annotated bounding boxes can be projected as 3D point clouds. Following a widely used experimental setting [23], we only use the 3D coordinates as input and report the average precision (AP) and recall of the ten most common classes, together with the overall metrics of mean average precision (mAP) and average recall (AR).

  • ScanNetV2: The original dataset contains reconstructed meshes covering 18 object categories, with 1,201 samples for training and 312 samples for validation. The point cloud data is sampled from the vertices of the reconstructed meshes. Beyond its usage in segmentation, we input the 3D coordinates and predict the bounding box and category of each object under the same evaluation protocol as [23].

4.2 Implementation

In general, all attention modules in our work are adopted and slightly modified from the available official implementations. For 2D attentions, we regard the spatial space of a 2D image as the point space of a point cloud, following the relation H × W = N, where H and W are the height and width of an image and N is the number of points. Moreover, to achieve a stable training process in 3D cases, we replace the original convolutions in 2D attentions with MLP operations where necessary. For fair comparisons, the reduction factor in all attention modules is empirically set to 8. As for the recent Point Transformer [49], whose code has not been released yet, we reproduce the structure according to the descriptions in the paper.
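The point-space adaptation described above can be sketched as follows (a minimal sketch; the actual per-module modifications in our code release differ in detail):

```python
import numpy as np

def adapt_2d_attention(points, attention_2d):
    """Feed point features through a 2D attention module by treating the
    point dimension N as the flattened spatial dimension, i.e., H * W = N.

    points: (C, N) features. We view them as a (C, H, W) map with H = 1 and
    W = N, apply any 2D attention callable mapping (C, H, W) -> (C, H, W),
    and flatten back.
    """
    c, n = points.shape
    out = attention_2d(points.reshape(c, 1, n))  # pretend H = 1, W = N
    return out.reshape(c, n)

# Demo with a trivial "attention" that just scales the feature map.
x = np.arange(6, dtype=float).reshape(2, 3)      # C = 2, N = 3
out = adapt_2d_attention(x, lambda m: 2.0 * m)
print(out)
```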

The implementations are realized in PyTorch on a single Tesla P100 GPU, using CUDA under the Linux operating system. All experiments adopt similar training settings: a learning rate of 0.001, a batch size of 8, and a total of 180 training epochs. Following the default configurations in [23], the number of input points is 20,000 for the SUN RGB-D [33] dataset and 40,000 for the ScanNetV2 [5] dataset. Together with this paper, we will release the source code of all deployed attention modules.

4.3 Experimental Results

By applying different attention modules in our attentional backbone, we conduct the experiments of 3D point cloud object detection on SUN RGB-D [33] dataset and ScanNetV2 [5] dataset, respectively.

Tables 1 and 2 present the detailed detection results on the SUN RGB-D [33] dataset under the metrics of average precision (AP), mean average precision (mAP), recall, and average recall (AR). Although different attention modules produce noticeably different effects, as shown in Table 1, the SE [13] method achieves the best overall result (59.6% mAP) among all tested attention modules and exceeds the baseline's result (57.5% mAP) by a significant 2.1%. In terms of per-category AP, the SE [13] method achieves the highest values in five out of ten object categories. Meanwhile, the results in Table 2 verify the outstanding performance of SE [13] under the recall metrics. In comparison to the complicated self-attention methods, a compact attention structure such as SE [13] or CBAM [40] can benefit the point cloud detection task both effectively and efficiently. Moreover, we observe that channel-related information plays a crucial role in the attention mechanism for point clouds, since the effectiveness of SE [13], CBAM [40], and CAA [28] is more prominent than that of the spatial attention modules.

 

Method mAP@0.25 AR@0.25 mAP@0.5 AR@0.5
baseline [23] 57.5 86.1 33.1 51.1

2D Attentions

Non-local [38] 58.3 86.0 31.4 49.7
Criss-cross [14] 56.2 84.9 33.1 50.0
SE [13] 59.6 86.8 34.5 52.1
CBAM [40] 59.1 86.3 34.9 53.1
Dual-attn [8] 50.6 82.9 24.4 42.1

3D Attentions

A-SCN [41] 55.6 84.0 30.1 48.2
Point-attn [7] 56.4 84.3 32.2 49.7
CAA [28] 58.8 85.9 33.3 51.4
Point-trans [49] 58.1 84.2 34.8 52.3
Offset-attn [10] 55.7 84.6 30.6 48.2

 

Table 5: Overall evaluations of 3D point cloud object detection results using SUN RGB-D [33] dataset. (mAP: mean average precision; AR: average recall; the value behind “@” denotes the corresponding IoU threshold.)

In Tables 3 and 4, we further compare the performances of different attention modules under a more challenging condition, with more input points and more object categories to detect, using the ScanNetV2 [5] dataset. Even though SE [13] and CBAM [40] still provide relatively good performance under both precision- and recall-related metrics, the Point Transformer [49] achieves the best overall result (59.1% mAP) among all ten tested attention modules. The advantages of the Point Transformer [49] can be attributed to two aspects: on the one hand, it incorporates more local context for each point rather than a single feature representation learned from a shared MLP; on the other hand, its attention map is estimated from geometric relations in 3D space, while most of the other methods only calculate dependencies in the feature space.

4.4 Overall Evaluations

In addition to the detailed AP and mAP metrics discussed in Section 4.3, we also provide more overall evaluations regarding both mAP and average recall (AR) under an IoU threshold of 0.25 and 0.5, respectively.

In general, Table 5 shows results similar to Table 1, where the SE [13] and CBAM [40] methods achieve better mAP and AR scores under both the 0.25 and 0.5 IoU thresholds on the SUN RGB-D dataset. Moreover, it is worth noting that Table 6 indicates the effectiveness of the CBAM [40] method under more metrics: when the IoU threshold is set to 0.5, CBAM [40] improves over the baseline by 3.4% mAP and 2.6% AR, while the Point Transformer improves by only 2.7% mAP and 2.5% AR. Since CBAM [40] integrates more global perception in both the spatial and channel domains, this attention module better benefits object detection in the dense point cloud scenes of the ScanNetV2 [5] dataset.

 

Method mAP@0.25 AR@0.25 mAP@0.5 AR@0.5
baseline [23] 57.3 80.0 33.7 49.9

2D Attentions

Non-local [38] 57.5 80.9 34.6 49.5
Criss-cross [14] 56.7 79.1 33.8 49.2
SE [13] 58.6 80.3 35.8 51.4
CBAM [40] 58.7 81.0 37.1 52.5
Dual-attn [8] 54.7 80.2 30.2 47.2

3D Attentions

A-SCN [41] 56.5 80.6 33.1 48.7
Point-attn [7] 54.7 79.4 30.8 46.7
CAA [28] 57.6 80.6 35.1 50.4
Point-trans [49] 59.1 80.0 36.4 52.4
Offset-attn [10] 58.0 79.9 36.0 50.4

 

Table 6: Overall evaluations of 3D point cloud object detection results using ScanNetV2 [5] dataset. (mAP: mean average precision; AR: average recall; the value behind “@” denotes the corresponding IoU threshold.)
(a) Input [33]
(b) Baseline [23]
(c) Non-local [38]
(d) Criss-cross [14]
(e) SE [13]
(f) CBAM [40]
(g) Dual-Attention [8]
(h) A-SCN [41]
(i) Point-Attention [7]
(j) CAA [28]
(k) Offset-Attention [10]
(l) Point-Transformer [49]
Figure 4: Visualizations of the votes generated from VoteNet [23] backbone by leveraging different attention modules. The input point cloud scene (from SUN RGB-D [33] dataset) contains 3 objects for detection, where the bounding boxes (ground-truths) are drawn in white frames. Theoretically, the generated votes (yellow points) are expected to be around the centroids (red points) of detected objects as many as possible.

4.5 Visualization

Recall that in the VoteNet pipeline, the most critical usage of its backbone is to generate the votes (yellow points) that are expected to approach the centroids (red points) of detected objects. Therefore, the generated votes can intuitively indicate the quality of the backbone’s output.

To this end, Figure 4 compares the votes generated from different attentional backbones. In the sub-figure of the baseline, we can see that the votes easily attach to the centroid of the most prominent object (the middle one), while there are fewer votes around the centroids of the two smaller objects. As for the sub-figure of the SE method, it can be clearly observed that more votes are centralized at the centroids of the two smaller objects (especially the left one), providing more confident estimations of the detected objects' bounding boxes.

5 Insights

From the experimental results and our analysis, we obtain several interesting observations and insights of attention mechanism in 3D point cloud object detection:

  • Self-attention modules are not preferable for processing 3D point cloud data. On the one hand, self-attention requires substantial computational resources. On the other hand, the effectiveness of the point-wise long-range dependencies used in self-attention modules is relatively limited, as such an operation may introduce redundancy when representing large-scale 3D point cloud data.

  • Compact attention structures such as SE [13] and CBAM [40] refine 3D point cloud features both effectively and efficiently. This is achieved by capturing a global perception of the feature space from a broad perspective.

  • Comparing spatial attention modules with channel attention modules, we find that channel-related information matters more than spatial information when embedded into attention modules for point cloud feature representation.

  • As reflected in the Point Transformer [49] results, incorporating more local context can better represent complex point cloud scenes, leading to better 3D point cloud object detection performance.
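The second and third insights above can be illustrated with a Squeeze-and-Excitation-style channel attention adapted to point features. The sketch below is a hedged NumPy illustration, not the paper's implementation: it squeezes N point features into one global channel descriptor, passes it through a bottleneck MLP (weights `W1`, `W2` are illustrative), and rescales each channel. Its cost is roughly O(N·C + C²/r), in contrast to the O(N²·C) pairwise cost of point-wise self-attention, which is why such compact modules scale well to large point clouds.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_channel_attention(feats, W1, W2):
    """SE-style channel attention for point features (N points x C channels):
    squeeze by global average pooling over points, excite with a bottleneck
    MLP, then rescale each channel by its learned gate in (0, 1)."""
    squeeze = feats.mean(axis=0)                        # (C,) channel descriptor
    excite = sigmoid(np.maximum(squeeze @ W1, 0) @ W2)  # (C,) per-channel gates
    return feats * excite                               # channel-wise reweighting

N, C, r = 2048, 256, 4                                  # points, channels, reduction
rng = np.random.default_rng(1)
feats = rng.standard_normal((N, C))
W1 = rng.standard_normal((C, C // r)) * 0.1
W2 = rng.standard_normal((C // r, C)) * 0.1

out = se_channel_attention(feats, W1, W2)
print(out.shape)  # (2048, 256)
```

Note that the module operates purely along the channel axis and is invariant to the ordering of the N points, which fits the unordered nature of point cloud data.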

6 Conclusion

This paper proposes an attentional backbone for the VoteNet pipeline for 3D point cloud object detection. By integrating different classical 2D and novel 3D attention modules, we compare their effects across various metrics and datasets. Based on the experiments and visualizations, we summarize the effects of the attention mechanism in 3D point cloud object detection, and we provide insights on how to effectively leverage attention for point cloud feature representation. In addition to presenting a benchmark evaluating the performance of different attention modules, we expect our preliminary findings to help future research in designing reliable and transparent attention structures for further point cloud analysis tasks.

References

  • [1] S. Anwar and N. Barnes (2019) Real image denoising with feature attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3155–3164. Cited by: §2.
  • [2] F. Blais et al. (2004) Review of 20 years of range sensor development. Journal of electronic imaging 13 (1), pp. 231–243. Cited by: §1.
  • [3] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia (2017) Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1907–1915. Cited by: §1, §2.
  • [4] C. Choy, J. Gwak, and S. Savarese (2019) 4d spatio-temporal convnets: minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3075–3084. Cited by: §1.
  • [5] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner (2017) Scannet: richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5828–5839. Cited by: 0(b), §1, Table 3, Table 4, §4.1, §4.2, §4.3, §4.3, §4.4, Table 6.
  • [6] P. Fang, J. Zhou, S. K. Roy, P. Ji, L. Petersson, and M. T. Harandi (2021) Attention in attention networks for person retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.
  • [7] M. Feng, L. Zhang, X. Lin, S. Z. Gilani, and A. Mian (2020) Point attention network for semantic segmentation of 3d point clouds. Pattern Recognition 107, pp. 107446. Cited by: §1, §1, §2, Figure 3, §3.2, Table 1, Table 2, Table 3, Table 4, 3(i), Table 5, Table 6.
  • [8] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu (2019) Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3146–3154. Cited by: §1, §1, §2, Figure 3, §3.1, Table 1, Table 2, Table 3, Table 4, 3(g), Table 5, Table 6.
  • [9] R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §1.
  • [10] M. Guo, J. Cai, Z. Liu, T. Mu, R. R. Martin, and S. Hu (2021) PCT: point cloud transformer. Computational Visual Media 7, pp. 187–199. Cited by: §1, §1, Figure 3, §3.2, Table 1, Table 2, Table 3, Table 4, 3(k), Table 5, Table 6.
  • [11] Y. Guo, H. Wang, Q. Hu, H. Liu, L. Liu, and M. Bennamoun (2020) Deep learning for 3d point clouds: a survey. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2, §2.
  • [12] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §1.
  • [13] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §1, §1, §2, Figure 3, §3.1, Table 1, Table 2, Table 3, Table 4, 3(e), §4.3, §4.3, §4.4, Table 5, Table 6, item 2).
  • [14] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu (2019) Ccnet: criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 603–612. Cited by: §1, §1, §2, Figure 3, §3.1, Table 1, Table 2, Table 3, Table 4, 3(d), Table 5, Table 6.
  • [15] M. Jaboyedoff, T. Oppikofer, A. Abellán, M. Derron, A. Loye, R. Metzger, and A. Pedrazzini (2012) Use of lidar in landslide investigations: a review. Natural hazards 61 (1), pp. 5–28. Cited by: §1.
  • [16] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah (2021) Transformers in vision: a survey. arXiv preprint arXiv:2101.01169. Cited by: §3.2.
  • [17] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander (2018) Joint 3d proposal generation and object detection from view aggregation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–8. Cited by: §1, §2.
  • [18] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom (2019) Pointpillars: fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12697–12705. Cited by: §2.
  • [19] M. Liang, B. Yang, S. Wang, and R. Urtasun (2018) Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 641–656. Cited by: §2.
  • [20] X. Liu, Z. Han, Y. Liu, and M. Zwicker (2019) Point2Sequence: learning the shape representation of 3d point clouds with an attention-based sequence to sequence network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8778–8785. Cited by: §4.3.
  • [21] D. Maturana and S. Scherer (2015) Voxnet: a 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 922–928. Cited by: §1, §2.
  • [22] C. R. Qi, X. Chen, O. Litany, and L. J. Guibas (2020) Imvotenet: boosting 3d object detection in point clouds with image votes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4404–4413. Cited by: §1.
  • [23] C. R. Qi, O. Litany, K. He, and L. J. Guibas (2019) Deep hough voting for 3d object detection in point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9277–9286. Cited by: Figure 1, §1, §1, §1, Figure 2, §2, Table 1, Table 2, Table 3, Table 4, 3(b), Figure 4, 1st item, 2nd item, §4.2, Table 5, Table 6.
  • [24] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2018) Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 918–927. Cited by: §1, §2.
  • [25] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660. Cited by: §1, §1, §2.
  • [26] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pp. 5099–5108. Cited by: §1, §2, §3.
  • [27] S. Qiu, S. Anwar, and N. Barnes (2021) Dense-resolution network for point cloud classification and segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 3813–3822. Cited by: §1.
  • [28] S. Qiu, S. Anwar, and N. Barnes (2021) Geometric back-projection network for point cloud classification. IEEE Transactions on Multimedia. External Links: Document Cited by: §1, §1, §2, Figure 3, §3.2, Table 1, Table 2, Table 3, Table 4, 3(j), §4.3, Table 5, Table 6.
  • [29] S. Qiu, S. Anwar, and N. Barnes (2021) Semantic segmentation for real point cloud scenes via bilateral augmentation and adaptive fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1757–1767. Cited by: §1.
  • [30] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. Advances in neural information processing systems 28, pp. 91–99. Cited by: §1.
  • [31] S. Shi, X. Wang, and H. Li (2019) Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 770–779. Cited by: §2.
  • [32] V. A. Sindagi, Y. Zhou, and O. Tuzel (2019) Mvx-net: multimodal voxelnet for 3d object detection. In 2019 International Conference on Robotics and Automation (ICRA), pp. 7276–7282. Cited by: §2.
  • [33] S. Song, S. P. Lichtenberg, and J. Xiao (2015) Sun rgb-d: a rgb-d scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 567–576. Cited by: 0(a), §1, Table 1, Table 2, 3(a), Figure 4, §4.1, §4.2, §4.3, §4.3, Table 5.
  • [34] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller (2015) Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, pp. 945–953. Cited by: §1, §2.
  • [35] H. Thomas, C. R. Qi, J. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas (2019) Kpconv: flexible and deformable convolution for point clouds. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6411–6420. Cited by: §1.
  • [36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §2.
  • [37] S. Vora, A. H. Lang, B. Helou, and O. Beijbom (2020) Pointpainting: sequential fusion for 3d object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4604–4612. Cited by: §2.
  • [38] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803. Cited by: §1, §1, §2, Figure 3, §3.1, §3.1, Table 1, Table 2, Table 3, Table 4, 3(c), Table 5, Table 6.
  • [39] Z. Wang and K. Jia (2019) Frustum convnet: sliding frustums to aggregate local point-wise features for amodal 3d object detection. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1742–1749. Cited by: §2.
  • [40] S. Woo, J. Park, J. Lee, and I. So Kweon (2018) Cbam: convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: §1, §1, §2, Figure 3, §3.1, Table 1, Table 2, Table 3, Table 4, 3(f), §4.3, §4.3, §4.4, Table 5, Table 6, item 2).
  • [41] S. Xie, S. Liu, Z. Chen, and Z. Tu (2018) Attentional shapecontextnet for point cloud recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4606–4615. Cited by: §1, §1, §2, Figure 3, §3.2, Table 1, Table 2, Table 3, Table 4, 3(h), Table 5, Table 6.
  • [42] D. Xu, D. Anguelov, and A. Jain (2018) Pointfusion: deep sensor fusion for 3d bounding box estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 244–253. Cited by: §2.
  • [43] X. Yan, C. Zheng, Z. Li, S. Wang, and S. Cui (2020) PointASNL: robust point clouds processing using nonlocal neural networks with adaptive sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5589–5598. Cited by: §1.
  • [44] B. Yang, M. Liang, and R. Urtasun (2018) Hdnet: exploiting hd maps for 3d object detection. In Conference on Robot Learning, pp. 146–155. Cited by: §2.
  • [45] B. Yang, W. Luo, and R. Urtasun (2018) Pixor: real-time 3d object detection from point clouds. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 7652–7660. Cited by: §2.
  • [46] Z. Yang, Y. Sun, S. Liu, and J. Jia (2020) 3dssd: point-based 3d single stage object detector. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11040–11048. Cited by: §2.
  • [47] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia (2019) Std: sparse-to-dense 3d object detector for point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1951–1960. Cited by: §2.
  • [48] T. Yu, J. Meng, and J. Yuan (2018) Multi-view harmonized bilinear network for 3d object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 186–194. Cited by: §2.
  • [49] H. Zhao, L. Jiang, J. Jia, P. Torr, and V. Koltun (2020) Point transformer. arXiv preprint arXiv:2012.09164. Cited by: §1, §1, Figure 3, §3.2, Table 1, Table 2, Table 3, Table 4, 3(l), §4.2, §4.3, Table 5, Table 6, item 4).
  • [50] Y. Zhou and O. Tuzel (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499. Cited by: §2.