With the widespread availability of 3D scanning devices and depth sensors LF_Feng, 3D geometric data is increasingly used in many application domains such as robotics, autonomous driving, 3D scene understanding, city planning and infrastructure maintenance PR_1; PR_2; PR_3; PR_4. Several representations of 3D shape have been investigated, such as depth maps, voxels, multi-views, meshes and point clouds Voxsegnet. Among these, the point cloud is arguably the simplest format for 3D data representation and has hence attracted increasing research interest. Similar to the pixels in a 2D image, points in the three-dimensional coordinate system are the basic building units of point clouds, which naturally encode the geometric features of a real 3D scene and their spatial distribution.
The extraction of meaningful information from 3D point clouds requires semantic segmentation. Point cloud semantic segmentation has been a challenging and active research topic for the last few years. Unlike the pixels of 2D images, which lie on a rectangular grid with no missing entries, 3D point clouds are sparse, irregular and unordered, and contain missing regions due to occlusions and the limited range of scanners. While deep learning has been very successful in semantic segmentation of 2D images, its use for 3D point clouds has not been fully exploited yet. Qi et al. Pointnet first proposed PointNet, which learns point features directly from unordered point sets. In PointNet, all 3D points are independently passed through a set of multi-layer perceptrons (MLPs) and then aggregated into a global feature using max-pooling. Recent research directions focus on extending the basic idea of PointNet to incorporate local geometric features for abstracting more discriminative high-level features Pointnet++; DynamicEdge; ECC. Among these methods, PointNet++ Pointnet++ exploits neighborhood points within a ball query radius, where each local point is processed separately by a PointNet-based hierarchical network. However, the relationships between local points are neglected. Recently, dynamic graph CNN DynamicEdge was proposed, which considers neighborhood points as a local graph and uses a filter generating network to assign edge labels. Since the edge-conditioned network does not consider the order of local points, it does not have transformation invariance. Similar to dynamic graph CNN DynamicEdge, dynamic edge-conditioned filters ECC were introduced as an edge function to encode local information by combining the relative coordinates (raw features) between the center point and its K-nearest neighbors (KNN). Although dynamic edge-conditioned filters ECC attempt to use a function designed to handle local points, they do not fully exploit the geometrical correlations of the local neighborhood points.
To address the above shortcomings, we propose a local attention-edge convolution (LAE-Conv) layer that extends the ideas of Pointnet++; DynamicEdge and ECC. The LAE-Conv layer constructs a local graph based on the neighborhood points searched along multiple directions. Unlike KNN and ball query methods, we propose a multi-directional search strategy that finds neighborhood points from 16 directions spread systematically within a ball query, making the local geometric shape more generalizable across space. After the search operation, the LAE-Conv layer assigns attention coefficients to each edge and then aggregates the central point features as a weighted sum of its neighbors. Aggregating features from a group of points with their contribution coefficients, rather than with a single max-pooling operation, better exploits the correlations between points and yields accurate and robust local geometric details. Moreover, the LAE-Conv layer is invariant to the ordering of points and can implicitly infer how the points contribute to the overall 3D shape.
Equipped with the LAE-Conv layer, we are able to design hierarchical deep learning architectures on point clouds for semantic segmentation. Since each LAE-Conv layer has a limited local receptive field, each unit of the output features (at the initial layers) exploits correlations within its local scale only. However, later LAE-Conv layers have progressively larger receptive fields, enabling the network to learn hierarchical features. While existing networks Pointnet++; PointCNN; DynamicEdge capture multi-scale shapes for high-level point feature learning, they do not leverage the long-range contextual relationships among points belonging to the same category, which are important for semantic segmentation. Superpoint graphs Large_superCVPR18 employed a recurrent neural network to exploit long-range dependencies based on an unsupervised geometric partitioning. However, that method relies heavily on the partitioning results. To address the above problems, we propose a point-wise spatial attention module, which captures long-range contextual information in the spatial dimension. Features obtained from the LAE-Conv layer are fed into the point-wise spatial attention module to generate a global dependency matrix which models the correlations between any two points of the feature maps. By multiplying the dependency matrix with the original features, the differences between point features of the same category are reduced. Hence, any two points with similar features can improve each other's representation regardless of their spatial distance.
Using the proposed LAE-Conv layer and point-wise spatial attention module as the main building blocks, we design a U-shaped network to predict dense labels for semantic segmentation of 3D point clouds. The unorganized 3D points (raw data) are input directly to our point attention network, which comprises an encoder and a decoder. This differs from other approaches Pointnet++; PointCNN; DynamicEdge in that our method stacks the point-wise attention module after the LAE-Conv layer at different stages of the network, enabling it to learn more accurate local geometric features and long-range relationships.
To summarize, our contributions include: (1) A novel local attention-edge convolution (LAE-Conv) layer that encodes point features using a weighted sum of neighborhood points with edge attention coefficients. The proposed multi-directional search strategy makes the local geometric shape more generalizable across space. (2) A novel point-wise spatial attention module that learns long-range contextual information and significantly improves the segmentation results by boosting the representation power of the local features obtained from the LAE-Conv layers. (3) An extension of the U-shaped network that incorporates the proposed LAE-Conv layer and point-wise spatial attention module. Experimental results show that our method obtains on par or better performance than existing state-of-the-art methods, quantitatively and qualitatively, on challenging benchmark datasets. Finally, we show that our proposed point attention block can generalize to other networks and improve their performance.
2 Related Work
A number of deep learning architectures have been recently proposed to learn directly from 3D point cloud data or its derived representations for applications such as semantic segmentation, object part segmentation and object categorization. We provide a brief survey of these methods and divide them into three categories based on the underlying data representations they use.
2.1 Indirect methods
This category includes methods that transform the irregular 3D point cloud data to a canonical form so that traditional convolutions can be applied PointGrid; SpiderCNN. Volumetric representations Voxel_1; Voxsegnet; Voxel_3; Voxel_4; VoxelNet are the most common canonical form used by these methods due to their simplicity. However, voxel representations have cubic complexity leading to dramatic increase in the memory consumption and computing resources required to process even medium size point clouds. To alleviate this problem, Octree-Net O-cnn; Octnet_1 and Kd-Net KdNet have been proposed which skip representation and computations at empty spaces to save memory and processing resources respectively Spherical. Moreover, sparse convolutional operations, where the activations are kept sparse in the convolution layers Submanifold; SBnet, have been introduced to process spatially-sparse 3D point clouds. Nevertheless, the kernels are still dense and inefficient in their implementation. Multi-view convolutional neural networks and their variants Point_depth; 3DMV_point; PVNet_point; Muti_point have also been proposed. These methods render the 3D shape from multiple pre-defined views, which are then processed by conventional image-based convolution networks. The main drawback of the multi-view frameworks is that the 3D geometric information is not always fully retained in the 2D projections.
The sparse lattice networks proposed by Hang et al. Splatnet project the input 3D points onto a high dimensional lattice, perform standard spatial convolution on it and then filter the features back to the input points. Matan et al. Extension_point extended the function over point cloud to a volumetric function, where volumetric convolution is applied and then a restriction operator is used to do the inverse action. Qiangui et al. Recurrent_Slice used a slice pooling layer to project unordered point clouds into an ordered format, making it feasible to apply traditional deep learning algorithms. Fully convolutional networks FC_point have been proposed that sample the input point cloud uniformly and use PointNet as a low-level feature learner, followed by 3D convolutions to learn features at multiple scales. Finally, tangent convolutions Tangent_3D have also been proposed that operate directly on surface geometry in the tangent space. Although the above methods have used deep learning techniques to realize the 3D data analysis tasks, they have not used the 3D point clouds directly. We believe that learning directly from raw 3D point cloud data can achieve higher accuracy and efficiency as learning from raw data is the major strength of deep learning.
2.2 Graph convolution methods
Graph convolutional methods combine the power of the convolution operation with graph representations of irregular data. Graph convolutional networks have been designed to perform convolutions either in the spectral or in the spatial domain. Joan et al. Spectral_graph proposed a generalization of convolution to graphs via the Laplacian operator. In that method, the spectral network can learn convolutional layers with a number of parameters for low dimensional graphs. Wang et al. Local_Spectral proposed a local spectral graph convolution that constructs a local graph from a point's neighborhood and aggregates information from nodes using their spectral coordinates. The PointNet++ architecture is then applied along with the local spectral graph convolution layers and graph pooling layers. The regularized graph convolution network proposed by Gusi et al. RGCNN_graph treats the point cloud as a graph and defines a convolution operation over it. Moreover, a graph smoothness prior is used in the loss function to regularize the learning process. Graph Laplacian based methods have a number of drawbacks, including the computational complexity of the Laplacian eigen-decomposition, the large number of parameters needed to express the convolutional filters, and the lack of spatial localization. Different from these methods, Martin et al. ECC proposed a convolution-like operation on graph signals in the spatial domain and used an asymmetric edge function to describe the relationships between local points. However, the edge labels are dynamically generated and hence the irregular distribution of local points is not taken into account. This method was improved by Wang et al. DynamicEdge through a max pooling operation on local features. However, the max pooling operation is still unable to fully utilize the correlations of local points. Our proposed method exploits local feature learning using a completely different approach. We propose a local attention-edge convolution layer that learns local relationships between points.
2.3 Point cloud methods
Many researchers have proposed deep learning architectures that learn directly from point clouds. One of the earliest methods in this category is PointNet Pointnet, which operates on point clouds using multi-layer perceptrons (MLPs). PointNet is robust to global transformations of the 3D shape because a spatial transformer network (T-Net) is used to learn the 3D alignment. The main limitation of PointNet is that it relies only on the max-pooling layer to learn global features. Since PointNet does not consider local relationships, Qi et al. Pointnet++ introduced an improved network named PointNet++, which exploits local geometric features in point sets and aggregates them for hierarchical inference. However, PointNet++ still treats points within local regions individually and does not consider relationships between the neighborhood points.
Later, Francis et al. Multi_context_ICCV17 designed a multi-scale architecture to enlarge the receptive field over the 3D scene by incorporating larger-scale spatial grid blocks into PointNet. Loic et al. Large_superCVPR18 used an unsupervised method to cluster input points into superpoint graphs, then fed the graphs to a PointNet-based gated recurrent unit. Li et al. PointCNN proposed an X-Conv layer instead of an MLP to permute unordered local points into a latent, potentially canonical order. A similar approach was proposed in Mining_Kernel, where kernel correlation was introduced to incorporate local information extracted from the point cloud by PointNet. Wang et al. SGPN_instance introduced a similarity group proposal network for point cloud instance segmentation, which uses a similarity matrix to produce grouping proposals based on features extracted by PointNet. Different from these PointNet-based frameworks, Hua et al. PointWise_CVPR18 presented a point-wise convolution operator that can be applied to each point of the point set. Recently, Zhao et al. PointWeb proposed PointWeb for point cloud processing, which densely connects all points in a local neighborhood to better encode local geometric features. Wu et al. PointConv introduced PointConv, a nonlinear function kernel for point clouds, which is used to learn translation-invariant and permutation-invariant features in 3D space. Wang et al. Graph_AC designed a graph attention kernel that adapts to the local geometry, which is useful for fine-grained segmentation.
A common limitation of all the aforementioned methods is that they are unable to simultaneously exploit fine local details and long-range contextual information. We fill this gap and propose a network that learns local geometrical features using their edge attention coefficients and allows deep learning architectures to exploit fine details as well as interactions over longer distances.
3 Proposed Approach
We first give details of the LAE-Conv layer that captures accurate local geometric details. Next, we explain the point-wise spatial attention module that aggregates long-range contextual information based on the output of the LAE-Conv layers. Finally, we present the general framework of our network.
3.1 Local Attention-Edge Convolution (LAE-Conv)
The Local Attention-Edge Convolution (LAE-Conv) layer forms the basic component of our point attention network architecture for 3D point cloud semantic segmentation. Inspired by DGCNN DynamicEdge, ECC ECC, GATs GAT_ICLR2018 and Non-local network Non_local, we construct a multi-directional neighborhood graph and apply graph attention mechanism to compute local edge features. Similar to traditional convolution in images, LAE-Conv explores local regions to leverage correlations between unordered points and exploits the local geometric structure of the points. We summarize the LAE-Conv operator in Algorithm 1.
3.1.1 Multi-directional Search
In the image convolution operation, the local region of a pixel can be represented in a grid-like structure given a convolution kernel size. However, the neighborhood of a center point in a point cloud is defined by metric distance in a 3D coordinate system where neighboring points are irregularly distributed. To robustly leverage local point correlations, we endeavour to explicitly capture geometric information in different orientations. Consider an unordered point cloud $X = \{x_1, x_2, \dots, x_N\}$ with $x_i \in \mathbb{R}^d$, where $N$ is the number of points and $d$ is the feature dimension at each point. When each point is represented only by its 3D coordinates $(x, y, z)$, then $d = 3$. We denote a central point in $X$ as $x_i$, and its neighbors in $X$ as $x_{ij}$, $j = 1, \dots, K$. As shown in Figure 1(a), the space around the reference point $x_i$ within a radius of $R$ is split into 16 bins, where each bin indicates a direction and covers a fixed range of azimuth angles. Within the spatial range represented by each bin, we select the $m$ nearest points of $x_i$ from all the points that fall in that bin and use their features to represent the bin, i.e. when $m = 1$, $K = 16$. Since points far away from $x_i$ are not very useful to represent $x_i$, we set the radius $R$ empirically as a hyper-parameter for each layer. In case there are insufficient points inside a bin, the point $x_i$ itself is repeated, which is similar to self convolution.
Two common ways for range query are K-nearest neighbor (KNN) search and ball query. KNN returns a fixed number of neighboring points while ball query returns all points that are within a radius. The local shape will not be well represented if all selected points, using either of the methods, are from a small region or one direction. Different from KNN and ball query, our search method guarantees that neighborhood points are from different directions to ensure sufficient expressive power of encoding the local geometric information. We compare the effectiveness of our search method over ball query and KNN in the experiments section.
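The search described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the text fixes 16 direction bins, a radius, and the m nearest points per bin, but not the exact bin geometry, so the layout used here (8 azimuth sectors times an upper and a lower half-space) is an assumption, as is the function name `multi_directional_search`.

```python
import math

def multi_directional_search(points, center_idx, radius, m=1, n_azimuth=8):
    """Return K = 2 * n_azimuth * m neighbour indices for one central point.

    Assumed bin layout (not specified in the text): n_azimuth sectors in the
    xy plane, each split into an upper (dz >= 0) and a lower half-space.
    """
    cx, cy, cz = points[center_idx]
    bins = [[] for _ in range(2 * n_azimuth)]
    sector_width = 2.0 * math.pi / n_azimuth
    for idx, (x, y, z) in enumerate(points):
        if idx == center_idx:
            continue
        dx, dy, dz = x - cx, y - cy, z - cz
        dist = math.sqrt(dx * dx + dy * dy + dz * dz)
        if dist > radius:
            continue  # outside the ball around the central point
        azimuth = math.atan2(dy, dx) % (2.0 * math.pi)
        sector = min(int(azimuth / sector_width), n_azimuth - 1)
        half = 0 if dz >= 0 else 1
        bins[half * n_azimuth + sector].append((dist, idx))
    neighbours = []
    for candidates in bins:
        candidates.sort()  # nearest first
        chosen = [idx for _, idx in candidates[:m]]
        # Repeat the central point when a bin holds fewer than m points.
        chosen += [center_idx] * (m - len(chosen))
        neighbours.extend(chosen)
    return neighbours
```

With `m=1` the function returns 16 indices, one per direction, so neighbours cannot all come from the same side of the central point, which is the property that distinguishes this search from plain KNN or ball query.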
For a set of local points $X_i = \{x_i, x_{i1}, \dots, x_{iK}\}$, where $x_i$ is the central point and the others are its neighbors, we consider a graph $G = (V, E)$, where $V$ is a finite set of points with $|V| = K + 1$ and $E \subseteq V \times V$ is a set of directed edges. We define the attention edge coefficients as $e_{ij}$, which represent the importance of the neighbors $x_{ij}$ to the central point $x_i$, computed by an attention mechanism $a$:

$$e_{ij} = a(W x_i, W x_{ij}),$$

where $W$ is a learnable weight matrix that transforms the input point set to higher-level features, $x_i$ and $x_{ij}$ represent the central point and its neighbors respectively, and the mechanism $a$ is a single-layer MLP parametrized by a weight vector. To make the edge coefficients easily comparable across different points, we use the softmax function to normalize them across all neighbors of the reference point $x_i$:

$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k=1}^{K} \exp(e_{ik})}.$$

Expressing each neighbor in the local coordinate system of the central point, the final edge coefficients computed by the attention mechanism may then be expressed as:

$$\alpha_{ij} = \frac{\exp\big(a(W x_i, W \Delta x_{ij})\big)}{\sum_{k=1}^{K} \exp\big(a(W x_i, W \Delta x_{ik})\big)},$$

where the neighbor points of the central point are transformed to local coordinates by $\Delta x_{ij} = x_{ij} - x_i$, and the local coordinates of each point are then lifted to higher-order features by $W$.

Once obtained, the normalized edge coefficients are used to assign attributes to each edge. Our approach computes the filtered feature at point $x_i$ as a weighted sum of the points in its neighborhood. The proposed commutative aggregation method not only solves the problem of undefined point ordering, but also smooths out the structural information. The local graph attention aggregator is defined as

$$x_i' = \sum_{j=1}^{K} \alpha_{ij} \, W \Delta x_{ij},$$

where $x_i'$ is the updated feature of the central point $x_i$.

Now we have an aggregated representation $x_i'$ for the central point $x_i$. It is natural to add a feature transformation function $h$ to incorporate additional non-linearity and increase the learning capacity of the model. The transformation can be realized by an MLP with a non-linear activation function. The output of the transformation function is $h(x_i')$. The proposed LAE-Conv layer is described in Algorithm 1.
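To make the aggregation step concrete, the following plain-Python sketch computes attention edge coefficients for one central point and returns the weighted sum of its lifted neighbor features. It is a sketch under assumptions, not the authors' implementation: the attention mechanism is modeled GAT-style as a LeakyReLU applied to a weight vector dotted with the concatenation of the lifted central feature and the lifted local coordinates, neighbors are expressed relative to the central point, and the names `lae_conv_point`, `W` and `a_vec` are hypothetical. The final feature transformation (an MLP) is omitted.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def lae_conv_point(center, neighbours, W, a_vec, slope=0.2):
    """Attention-weighted aggregation for a single central point.

    Assumed scoring function: LeakyReLU(a_vec . [W center || W delta]),
    where delta is the neighbour in the centre's local coordinates.
    """
    lifted_center = matvec(W, center)
    lifted, scores = [], []
    for nb in neighbours:
        delta = [n - c for n, c in zip(nb, center)]  # local coordinates
        feat = matvec(W, delta)                      # lift with W
        raw = sum(a * h for a, h in zip(a_vec, lifted_center + feat))
        scores.append(raw if raw > 0 else slope * raw)
        lifted.append(feat)
    alpha = softmax(scores)  # edge coefficients, sum to 1 over neighbours
    out = [sum(al * f[c] for al, f in zip(alpha, lifted))
           for c in range(len(W))]
    return out, alpha
```

Because the coefficients are normalized over the neighbor set and the output is a sum, reordering `neighbours` leaves the result unchanged up to floating-point rounding, which illustrates the permutation invariance claimed for the layer.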
3.2 Point-wise Spatial Attention Block
The output $X' = \{x_1', \dots, x_N'\}$ of the LAE-Conv layer has rich representation power for local geometric features. However, since each LAE-Conv layer has a local receptive field, individual units of the filtered features are unable to exploit contextual information outside of their local regions. In $X'$, features corresponding to points with the same label can be significantly different when the points are far apart. These differences affect the point-wise segmentation accuracy of the scene as a whole. To address this issue, we focus on the global spatial relationships to boost the representation power of the LAE-Conv layer. We design a point-wise spatial attention module that captures global dependencies by building associations among features within the point set. By stacking these blocks after LAE-Conv layers, we can construct local-global architectures that adaptively encode long-range contextual information, thus improving the semantic segmentation accuracy of 3D point clouds that cover large areas. Next, we introduce a process to adaptively aggregate point-wise spatial contexts.
Inspired by the position attention operation Dual_attention, we define a point-wise spatial attention module for 3D point clouds. As illustrated in Figure 2, two MLP layers are used to transform the local feature $X' \in \mathbb{R}^{N \times C}$ into two new representations $A$ and $B$ respectively, where $A, B \in \mathbb{R}^{N \times C}$. We compute the relationships between different points based on $A$ and the transpose of $B$. Unlike Dual_attention, we calculate the spatial correlations of all points directly from $A$ and the transpose of $B$ without reshaping the matrices, hence maintaining the original spatial distribution. Softmax is then applied to normalize the relationship map and obtain the point-wise spatial attention map $S$ with size $N \times N$:

$$s_{ji} = \frac{\exp(A_i \cdot B_j^{\top})}{\sum_{i=1}^{N} \exp(A_i \cdot B_j^{\top})},$$

where $i$ and $j$ denote the point positions in $A$ and $B$ respectively, $s_{ji}$ is the $i$-th point's impact on the $j$-th point, and $\cdot$ denotes matrix multiplication. Two points have a strong correlation when their features carry similar semantic information.
At the same time, the local feature $X'$ is transformed to a new feature $D$ by an MLP layer. This is followed by a matrix multiplication between $S$ and $D$. Finally, the output is multiplied by a scale parameter $\beta$ and an element-wise summation is performed with the features $X'$ to obtain the final output $Y$ as follows:

$$Y_j = \beta \sum_{i=1}^{N} (s_{ji} \cdot D_i) + x_j',$$

where $\cdot$ denotes matrix multiplication. The resulting feature $Y$ contains long-range contextual information and selectively aggregates contexts according to the point-wise spatial attention map $S$. This improves the feature representation power and, in turn, the accuracy of 3D point cloud semantic segmentation.
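The attention block above can be sketched as follows. This is a minimal illustration, assuming that each MLP layer is a single linear map without bias or activation and that the value projection keeps the feature dimension unchanged so the residual sum is well defined; the names `spatial_attention`, `Wa`, `Wb`, `Wd` and `beta` are hypothetical.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def row_softmax(M):
    out = []
    for row in M:
        peak = max(row)  # subtract the maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def spatial_attention(F, Wa, Wb, Wd, beta):
    """Point-wise spatial attention over N points with C features each.

    F is N x C; Wa, Wb, Wd are C x C linear maps standing in for the MLPs.
    Returns the refined N x C features and the N x N attention map.
    """
    A = matmul(F, Wa)
    B = matmul(F, Wb)
    D = matmul(F, Wd)
    # energy[j][i] = B_j . A_i, the raw affinity of point i for point j
    energy = matmul(B, list(zip(*A)))
    S = row_softmax(energy)   # each row sums to 1 over all N points
    context = matmul(S, D)    # attention-weighted sum of value features
    out = [[beta * c + f for c, f in zip(crow, frow)]
           for crow, frow in zip(context, F)]
    return out, S
```

With `beta = 0` the block reduces to the identity, so in this sketch the scale parameter controls how much long-range context is mixed into the local features. The attention map grows quadratically with the number of points, which is one reason to apply such a block only at down-sampled stages.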
3.3 Network Architecture
For dense point label prediction, the output resolution is high. Moreover, there are multiple objects with different scales in one scene. Selecting the most representative scale for each kind of object is important for semantic segmentation. Following the hierarchical structure of PointNet++ Pointnet++, our network consists of encoder and decoder parts. As shown in Figure 3, our point attention network comprises the LAE-Conv layers and point-wise spatial attention modules. At the encoder part, the input point set is processed by three LAE-Conv layers, which transform it into fewer representation points but with richer features. The input point cloud is represented by its 3D coordinates and sometimes with the RGB color values as well. The point-wise spatial attention modules are stacked after the third and fourth LAE-Conv layers to aggregate long range point-wise contextual information from output of the previous LAE-Conv layer. The long-range contextual features along with the local features from LAE-Conv layers together achieve robust and accurate 3D point cloud semantic segmentation.
At the decoder part, three skip connections are used to combine features from the encoder. The point-wise spatial attention module is also inserted after the fifth LAE-Conv layer at the decoder part. In our hierarchical architecture, we use three steps of down-sampling operations and three steps of up-sampling operations, which are followed by set abstraction and feature propagation modules as in PointNet++ Pointnet++. Finally, all the features in the last decoder layer pass through a fully connected layer and are converted to class probabilities.
4 Comparison with Existing Methods
Our point attention network is a more generalized form of the classic approach PointNet++ Pointnet++. We explain how PointNet++ is a special case of our network. PointNet++ is an extension of PointNet which considers local point structure. Given a reference point $x_i$, ball query searches for its local points; PointNet then processes the points of the local region individually and max-pools them to obtain the most representative point feature as the output of the local region. Different from PointNet++, the LAE-Conv layer constructs a local graph from the central point $x_i$ and its neighbors. We compute attention edge coefficients to indicate the different contributions of each neighbor to the central point. For a particular setting of these coefficients, only the most representative features are selected to represent the local region. We can therefore observe that the basic convolution layer of PointNet++ is an instance of our LAE-Conv layer.
DGCNN DynamicEdge uses KNN to establish the local point shape and proposes its own aggregation operation. In that operation, the neighbor points are first moved to the local coordinate system and then stacked with the central point. All the neighbors have equal contribution to the central point, which is equivalent to our operator when all edge coefficients are equal. Since DGCNN is based on PointNet, the receptive field remains constant at different layers, which is a disadvantage when encoding point clouds with different spatial distribution densities.
Similar to PointNet++, PointCNN PointCNN follows the encoder-decoder architecture and learns a transformation to lift the input irregular points into an unknown canonical format, then applies a typical convolution on the transformed point cloud. In PointCNN, the dilated convolution from image convolution networks is employed to expand the local receptive field of different layers. The local receptive field is changed by adjusting the dilation ratio, which controls the number of neighborhood points. Different from the grid structure of local pixels, points are disordered in a three-dimensional coordinate system and their density distribution is not uniform. Although the KNN search for neighborhood points is controlled proportionally by the dilation ratio, the global geometric features that can be learned by changing the receptive field are limited. To address this issue, our point attention network inserts a point-wise attention module in the high-level feature layers. A crucial difference between dilation and our attention module is that the latter models long-range dependencies, which reduces the gap between features corresponding to points with the same label and encodes more accurate global information. The more similar the feature representations of two points are, the greater the correlation between them.
5 Experiments and Discussion
We evaluate the performance of the proposed network on the ShapeNet ShapeNet_2 3D part segmentation dataset and the two largest point cloud segmentation benchmarks, ScanNet Scannet and Stanford Large-Scale 3D Indoor Spaces (S3DIS) S3DIS. While ShapeNet is synthetic data, ScanNet and S3DIS are real point clouds obtained with a scanner. We perform ablation studies of different design choices and network variations as well as compare the performance of our network with existing state of the art.
ScanNet Scannet contains 1513 scans annotated with semantic voxel labels from 21 categories (bed, refrigerator, floor, table etc., plus other furniture). ScanNet is divided into 1201 training and 312 test samples. Similar to Pointnet++, we split the ScanNet training scenes into 2m by 2m by 3m blocks, with 0.5m padding in each direction ($x$, $y$, $z$), and sample 8192 points randomly from each block on the fly. To predict the semantic label of every point of a test scene, we split it into similar blocks using a sliding window strategy along the $xy$ plane with different stride sizes. If the same point gets different predictions in the overlapping regions, we choose the one with the highest confidence.
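The rule for resolving overlapping predictions can be illustrated with a short sketch. It assumes that each block yields, for every point it contains, the global index of that point and a list of class probabilities, and that confidence means the largest class probability; the function name `merge_block_predictions` and this data layout are hypothetical.

```python
def merge_block_predictions(block_outputs, num_points):
    """Keep, for every scene point, the label from its most confident block.

    block_outputs: iterable of (point_indices, class_probabilities) pairs,
    one pair per sliding-window block. Points never covered keep label -1.
    """
    best_conf = [-1.0] * num_points
    labels = [-1] * num_points
    for indices, probs in block_outputs:
        for idx, p in zip(indices, probs):
            conf = max(p)  # confidence of this block's prediction
            if conf > best_conf[idx]:
                best_conf[idx] = conf
                labels[idx] = p.index(conf)
    return labels
```

For example, a point predicted as class 0 with probability 0.6 in one block and as class 1 with probability 0.8 in an overlapping block ends up with label 1.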
Although ScanNet also contains RGB values for each point, we only use the coordinates as point features for a fair comparison with other methods. Hence, the input data size for the network is $8192 \times 3$. As shown in Figure 3, we use the downsampling and upsampling operations from PointNet++ Pointnet++ for both the encoder and decoder parts. The output point numbers and feature dimensions of different LAE-Conv layers are , , , , , and respectively. The fully connected layer with size converts the final features into class probabilities. We set for the neighborhood search. For the three point-wise attention blocks, the output point numbers and feature dimensions are , and respectively. The initial learning rate is 0.001, the batch size is 22 and the momentum is 0.9. We set the decay rate to 0.7 and stop training after 1000 epochs.
|Method||mean IoU||Overall Accuracy (OA)|
|Methods||Size (M)||Time (s)|
Table 2: Model size and inference time comparison, where "M" means million and "s" denotes seconds. We use the size of the model file (.ckpt) produced by training with TensorFlow to represent the complexity of the different methods. The entire scenes were tested 5 times and the average time was recorded.
Table 1 shows a quantitative comparison of our proposed point attention network with PointNet++ Pointnet++ and PointCNN PointCNN on the ScanNet dataset. This comparison uses two metrics, namely the mean per-class IoU (mIoU, %) and the per-voxel overall accuracy (OA, %). For a fair comparison, Table 1 shows the results of the baseline methods as reported in the original papers, since the trained models are not available for testing. Compared to the baseline methods, our network achieves the highest accuracy on both metrics. Table 2 reports the model size and average inference time of a few representative methods Pointnet, Pointnet++, DynamicEdge, SpiderCNN whose released source code is easy to use. Experiments are conducted on a single NVIDIA GTX TitanX GPU with TensorFlow and an 8-core Intel i7-9700K CPU at 3.6 GHz. Compared with these methods, our proposed architecture improves the segmentation results with only marginal extra computation cost.
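For reference, the two metrics can be computed from flat lists of predicted and ground-truth labels as in the sketch below. It assumes integer class labels in the range 0 to `num_classes - 1` and skips classes that appear in neither list when averaging; the function name `segmentation_metrics` is hypothetical, and the elements may be points or voxels depending on the evaluation protocol.

```python
def segmentation_metrics(pred, truth, num_classes):
    """Return (mean IoU, overall accuracy) as fractions in [0, 1]."""
    pairs = list(zip(pred, truth))
    overall_accuracy = sum(1 for p, t in pairs if p == t) / len(pairs)
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in pairs if p == c and t == c)
        union = sum(1 for p, t in pairs if p == c or t == c)
        if union > 0:  # ignore classes absent from both lists
            ious.append(inter / union)
    mean_iou = sum(ious) / len(ious)
    return mean_iou, overall_accuracy
```

Overall accuracy is dominated by frequent classes such as floor and wall, whereas the mean IoU weights every class equally, which is why both numbers are usually reported together.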
Figure 4 qualitatively compares the semantic segmentation obtained by PointNet++, PointCNN and our method. We use boxes to highlight some examples where our method performed significantly better than the competitors. In the first scene, the window and the door are embedded in the wall whereas the picture is hung on the wall making the semantic segmentation a real challenge. Our method’s output is more regular than that of PointNet++ and PointCNN. The table in the lower left corner is incomplete with a mere skeleton. Hence, segmentation methods like PointNet++ and PointCNN get worse results compared to our method. In the second scene, all the methods get incorrect predictions on the chair that is close to the floor as well as the irregularly shaped desks. This is because the per class samples in ScanNet dataset are unbalanced Scannet making existing segmentation methods fail on the rare categories. In the third scene, our method performs better on bookshelves than others. In addition, the un-annotated object in the center of scene is misidentified as table by PointCNN and our method because its shape is more like a table than an ordinary chair.
To better understand the influence of the various design choices made in our network, we analyze them on ScanNet.
5.1.1 Ablation study on parameters of LAE-Conv layer
As mentioned in Sec 3.1, there are three options (KNN, ball query and our multi-directional search method) for searching the neighbors of the central point. We use ScanNet as a test benchmark to compare these options. We also vary the number of points selected in each direction bin for our proposed search method. In Table 3, we can see that our method is more effective at selecting local point shapes. When $m = 2$ and $m = 3$, the overall accuracy drops from 86.7% to 85.9% and 84.4% respectively. This is because the number of parameters of the LAE-Conv layer increases with the number of neighbors. Too many neighbors introduce information redundancy, which reduces the efficiency and accuracy of the LAE-Conv layer.
|Neighborhood Search Method|Overall Accuracy (OA %)|
|---|---|
|Ball query (K=16)|85.3|
|Proposed Multi-direction (m=1,K=16)|86.7|
|Proposed Multi-direction (m=2,K=32)|85.9|
|Proposed Multi-direction (m=3,K=48)|84.4|
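To make the difference between the three neighbor search options concrete, here is a rough, brute-force sketch. The KNN and ball query functions follow their usual definitions. The direction binning in `direction_bin` (8 azimuth sectors times 2 halves along z, giving 16 bins so that K = 16m as in the table) is only an assumption for illustration and is not taken from Sec 3.1; all function names are hypothetical.

```python
import math


def knn(center, points, k):
    """Return the k points closest to `center` (brute force)."""
    return sorted(points, key=lambda p: math.dist(center, p))[:k]


def ball_query(center, points, radius, k):
    """Return up to k of the nearest points that lie within `radius`."""
    inside = [p for p in points if math.dist(center, p) <= radius]
    return sorted(inside, key=lambda p: math.dist(center, p))[:k]


def direction_bin(center, p, n_azimuth=8):
    """Hypothetical binning: azimuth sector in the xy-plane plus sign of dz."""
    dx, dy, dz = (p[i] - center[i] for i in range(3))
    azimuth = math.atan2(dy, dx) % (2 * math.pi)
    sector = min(int(azimuth / (2 * math.pi / n_azimuth)), n_azimuth - 1)
    return sector + (n_azimuth if dz >= 0 else 0)


def multi_direction_search(center, points, m, n_azimuth=8):
    """Keep the m nearest points in each direction bin (at most 2*n_azimuth*m)."""
    bins = {}
    for p in points:
        if p == center:
            continue
        bins.setdefault(direction_bin(center, p, n_azimuth), []).append(p)
    neighbors = []
    for b in sorted(bins):
        nearest = sorted(bins[b], key=lambda q: math.dist(center, q))[:m]
        neighbors.extend(nearest)
    return neighbors


# Three points clustered on one side and one isolated point on the other.
center = (0.0, 0.0, 0.0)
cloud = [(0.1, 0.01, 0.01), (0.2, 0.01, 0.01), (0.3, 0.01, 0.01),
         (-1.0, -0.01, -0.01)]
print(knn(center, cloud, k=2))
print(multi_direction_search(center, cloud, m=1))
```

In this toy cloud, KNN with k=2 returns only points from the dense side, while the direction-based search keeps one point per occupied direction and therefore also retains the isolated point. This is the kind of coverage a direction-aware search is meant to provide.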
5.1.2 Ablation study on point-wise spatial attention block
To take full advantage of the point-wise spatial attention block, we evaluate the segmentation results with more attention blocks in the network architecture. We add 7 attention blocks (after LAE-Conv layers 1-7), 5 attention blocks (after LAE-Conv layers 2,3,4,5,6) and 3 attention blocks (after LAE-Conv layers 3,4,5). As shown in the first part of Table 4, more point-wise spatial attention blocks do not improve performance. One explanation is that more attention blocks massively increase the number of parameters, and the network cannot find a locally optimal solution within the specified training steps on ScanNet. The second part of Table 4 compares the same number of attention blocks added at different stages of the network, namely right after LAE-Conv layers (2,4,6) and (1,4,7) respectively. The results deteriorate when the attention blocks are added to layers with lower feature dimensions. A possible explanation is that point features do not contain enough representative semantic information when their dimensions are low, so the features of points with the same label are significantly different; the number of parameters of the attention blocks also increases from (2,4,6) to (1,4,7). Under this condition, the effectiveness of the attention block is limited. Finally, we choose to add three attention blocks right after LAE-Conv layers (3,4,5) in Figure 3. We also tested adding three attention blocks to vanilla PointNet++ (without MSG and DP Pointnet++) at the corresponding stages as in our network. As shown in the third part of Table 4, the overall accuracy of the baseline network (vanilla PointNet++ Pointnet++) improves by 1.4 percentage points, from 83.3% to 84.7%. This suggests that our proposed point attention block is generic and can improve the performance of other network architectures.
|Block Position|Overall Accuracy (OA %)|
|---|---|
|LAE-Conv layer (1-7)|84.9|
|LAE-Conv layer (2,3,4,5,6)|85.7|
|LAE-Conv layer (3,4,5)|86.7|
|LAE-Conv layer (2,4,6)|86.0|
|LAE-Conv layer (1,4,7)|85.5|
|PointNet++ Pointnet++ (vanilla) baseline|83.3|
|PointNet++ Pointnet++ (vanilla, 3-5)|84.7|
The Stanford Large-Scale 3D Indoor Spaces (S3DIS) dataset S3DIS contains 3D scans obtained with Matterport scanners in 6 areas from three different buildings, divided into 271 individual rooms. Each point in the scene is annotated with one label from categories (ceiling, wall, beam, chair, column etc. and clutter), and is represented by its 3D coordinates, RGB features and normalized location. S3DIS is a highly unbalanced dataset S3DIS: floor, wall, chair and other common furniture items are the dominant classes, while bookcase, window, beam etc. are the rare classes. To prepare the training data, rooms in S3DIS are split into blocks of , with padding on each direction (,). We randomly sample 4096 points from each block during training, while all points are used at test time. Following PointNet Pointnet, we use the same 6-fold cross validation strategy across the 6 areas. To obtain the overall segmentation accuracy, we evaluate the 6 models on their corresponding test areas and report the average results.
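The sampling and cross-validation protocol described above can be sketched as follows. This is a minimal illustration with hypothetical function names; sampling with replacement when a block holds fewer than 4096 points is an assumption, since the text does not say how such blocks are handled.

```python
import random


def sample_block(points, num_samples=4096, rng=random):
    """Randomly pick `num_samples` points from one block for training.

    Assumption: blocks with fewer points are sampled with replacement.
    """
    if len(points) >= num_samples:
        return rng.sample(points, num_samples)
    return [rng.choice(points) for _ in range(num_samples)]


def six_fold_splits(areas=(1, 2, 3, 4, 5, 6)):
    """Yield (train_areas, test_area) so each area is held out exactly once."""
    for test_area in areas:
        train_areas = [a for a in areas if a != test_area]
        yield train_areas, test_area


def average_accuracy(per_fold_accuracy):
    """Average the accuracies of the six models on their held-out areas."""
    return sum(per_fold_accuracy) / len(per_fold_accuracy)


for train_areas, test_area in six_fold_splits():
    print("train on", train_areas, "test on", test_area)
```

Each of the six models is trained on five areas and evaluated on the remaining one, so no test room is seen during training of the model that scores it.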
For comparison, we use coordinates and RGB information as the point features. Therefore, the input data size to the network is . As shown in Figure 3, we use the downsampling and upsampling operations from PointNet++ Pointnet++ for the encoder and decoder parts. The output point numbers and feature dimensions of different LAE-Conv layers are , , , , , and respectively. The fully connected layer with size converts the final features into per-class probabilities. We set during the neighbor search process. For the three point-wise attention modules, the output point numbers and feature dimensions are , and respectively. We set the initial learning rate to 0.001, batch size to 32 and momentum to . We set the decay rate to 0.7 and stop the training process after 1000 epochs.
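One common reading of an initial learning rate of 0.001 with a decay rate of 0.7 is a step-wise exponential schedule, sketched below. This is an assumption: the text does not state that the decay rate applies to the learning rate, and `decay_steps` and `staircase` are hypothetical parameters chosen only for illustration.

```python
def decayed_learning_rate(step, initial_lr=0.001, decay_rate=0.7,
                          decay_steps=200_000, staircase=True):
    """Exponential decay: initial_lr * decay_rate ** (step / decay_steps).

    decay_steps and staircase are hypothetical; the text gives only the
    initial learning rate (0.001) and the decay rate (0.7).
    """
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_lr * decay_rate ** exponent


for step in (0, 200_000, 400_000):
    print(step, decayed_learning_rate(step))
```

With the staircase variant the rate stays constant within each interval of `decay_steps` and is multiplied by 0.7 at each boundary.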
Table 5 summarizes the quantitative results, where our proposed method outperforms the baseline methods PointNet Pointnet, SPGraph Large_superCVPR18, RSNet Recurrent_Slice, 3DRCNN 3DRCNNet and PointCNN PointCNN. It is worth noting that our method achieves higher accuracy for some rare classes, such as beam, column, window, board and clutter, which we attribute to its ability to capture more global information from points that are far apart.
In Figure 5, we compare our method with PointNet Pointnet and PointCNN PointCNN qualitatively. It is not surprising that chairs are correctly segmented more often by all three methods, because their shapes are more consistent and they are small and not easily confused with other objects. We can see that objects such as the whiteboard hung on the wall, the column and window embedded in the wall, clutter next to the table and irregular bookcases are quite difficult to segment. We also use boxes to mark some examples where our method outperforms the baseline methods. In the first scene, our method obtains a more regular segmentation of the painting on the wall than PointNet Pointnet and PointCNN PointCNN. Our final segmentation result preserves the full shapes of the column and the bookcase next to the wall, while the other methods mistake them for wall or clutter. We also obtain a smoother prediction for the chair in the front row than the other methods. In the second scene, the board on the wall is estimated more accurately by our method than by PointNet and PointCNN. Our method also makes fewer mistakes than the other approaches in predicting the bookshelves, beam and clutter above and below the table. Notably, our method can predict the clutter on the left wall, even though it is not marked in the ground truth. In the third scene, our method also outputs fewer incorrect predictions for bookcase, table, chair and beam than the other approaches.
We also extend our network architecture to perform part segmentation on the ShapeNet dataset ShapeNet, which consists of shape models from 16 object categories. Each object in ShapeNet is annotated with to parts. We follow the settings from ShapeNet_2 to divide the ShapeNet dataset for training, validation and testing. During training, we randomly sample 2048 points from each 3D shape while all points from each 3D shape are used during the test stage.
For a fair comparison, we only use the coordinates as the point features. The size of input data for the network is . The network architecture is illustrated in Figure 3; we adjust the network parameters to suit ShapeNet. The output point numbers and feature dimensions of different LAE-Conv layers are , , , , , and respectively. A fully connected layer with size is used at the end to convert the point features into part predictions. We set for the neighborhood search. For the three point-wise attention blocks, the output point numbers and feature dimensions are , and respectively. We set the initial learning rate to 0.003, batch size to 16, momentum to 0.9 and decay rate to 0.7, and stop the training after 500 epochs.
We use the same evaluation metric (mean IoU) on points as PointNet Pointnet to compare our method with other methods Pointnet; Pointnet++; DynamicEdge; Recurrent_Slice; SGPN_instance; Attentional_context; Extension_point; PointGrid; Splatnet; Mining_Kernel; SpiderCNN; SONet; Submanifold; PointCNN. We report the part-averaged IoU (pIoU), mean per-category pIoU (mpIoU) and per-category IoU scores in Table 6. Our method performs on par with most methods on the pIoU and mpIoU metrics. In individual categories, we rank best in ear phone, lamp, motor and rocket. Our method thus performs better on categories with relatively few samples, as in the case of ear phone, motor and rocket.
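The three scores can be illustrated with the short sketch below. It reflects one common interpretation (pIoU averages the per-shape part IoU over all shapes, mpIoU averages the per-category means), which is an assumption rather than a definition given in the text. The function names are hypothetical, and counting a part that is absent from both prediction and ground truth as IoU = 1 follows the PointNet-style convention, again as an assumption.

```python
def shape_part_iou(pred, truth, part_ids):
    """Average IoU over the parts that belong to one shape's category.

    Assumption: a part absent from both prediction and ground truth
    contributes IoU = 1.
    """
    ious = []
    for part in part_ids:
        inter = sum(1 for p, t in zip(pred, truth) if p == part and t == part)
        union = sum(1 for p, t in zip(pred, truth) if p == part or t == part)
        ious.append(1.0 if union == 0 else inter / union)
    return sum(ious) / len(ious)


def summarize(shape_ious_by_category):
    """Return (pIoU over all shapes, mpIoU over categories, per-category IoU)."""
    per_category = {c: sum(v) / len(v)
                    for c, v in shape_ious_by_category.items()}
    all_shapes = [x for v in shape_ious_by_category.values() for x in v]
    piou = sum(all_shapes) / len(all_shapes)
    mpiou = sum(per_category.values()) / len(per_category)
    return piou, mpiou, per_category


# One shape with four points and two candidate parts (0 and 1).
print(shape_part_iou([0, 0, 1, 1], [0, 1, 1, 1], part_ids=[0, 1]))
```

Because pIoU weights every shape equally, it is dominated by large categories, while mpIoU gives small categories such as rocket the same weight as large ones.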
We proposed a point attention network for 3D point cloud semantic segmentation. Our network adaptively integrates local point features and long-range contextual information. We introduced a novel local attention-edge convolution (LAE-Conv) layer, which applies an attention mechanism to a local graph constructed from the central point and its neighborhood to capture accurate and robust geometric details. To refine the local features output by the LAE-Conv layer, we proposed a point-wise spatial attention module and showed that this module can generalize to other networks to improve their accuracy. Finally, we adapted the U-shaped network to combine the LAE-Conv layers and point-wise spatial attention modules. Experiments on challenging benchmark datasets show that our method obtains, quantitatively and qualitatively, performance on par with or better than the existing state of the art in 3D point cloud semantic segmentation.
This work was supported in part by the National Natural Science Foundation of China under Grants 61573134 and 61973106, and in part by the Australian Research Council (ARC) grant DP190102443. We thank Yifeng Zhang and Tingting Yang from Hunan University for helping with the baseline experiment setup.