DAPnet: A double self-attention convolutional network for segmentation of point clouds

04/18/2020 ∙ by Li Chen, et al. ∙ Central South University

LiDAR point clouds have a complex structure, and their 3D semantic labeling is a challenging task. Existing methods adopt data transformations without fully exploring contextual features, which makes them less efficient and less accurate. In this study, we propose a double self-attention convolutional network, called DAPnet, which combines geometric and contextual features to generate better segmentation results. The double self-attention module, which includes a point attention module and a group attention module, originates from the self-attention mechanism and extracts contextual features of terrestrial objects with various shapes and scales. The contextual features extracted by these modules represent the long-range dependencies between the data and help reduce the scale diversity of point cloud objects. The point attention module selectively enhances the features by modeling the interdependencies of neighboring points. Meanwhile, the group attention module emphasizes interdependent groups of points. We evaluate our method on the ISPRS 3D Semantic Labeling Contest dataset and find that our model outperforms the benchmark (85.2%) with an overall accuracy of 90.7%. The improvements over powerline and car are 7.5% and 13.0%, respectively. By ablation comparison, we find that the point attention module is more effective for the overall improvement of the model than the group attention module, and that incorporating the double self-attention module yields an average improvement of 7% in per-class accuracy. Moreover, the model with the double self-attention module needs a similar training time to converge as the model without it. The experimental results show the effectiveness and efficiency of the DAPnet for the segmentation of LiDAR point clouds. The source codes are available at https://github.com/RayleighChen/point-attention.




1 Introduction

Airborne Laser Scanning (ALS) is one of the most important remote sensing technologies and has developed rapidly in recent years [wallace2016assessment, yu2017single, barnes2017individual]. The LiDAR point cloud, which is a major format of ALS data, is advantageous over optical data because it is less affected by varying lighting conditions and shadows. Thus, it has become the most important dataset for providing a full 3D profile of the landscape at large spatial scales [vosselman2004recognising, brostow2008segmentation, douillard2011segmentation]. However, the LiDAR point cloud contains irregularly distributed points with a series of attributes and has a complex data structure, which makes object segmentation and classification tasks challenging [weinmann2013feature, wahabzada2015automated, grilli2017review].

Figure 1: The location of the study area in Vaihingen, Baden-Wurttemberg, Germany. The left side is a map of the borders of the cities in Baden-Wurttemberg state. The right side is a partial orthophoto image of the Vaihingen. Our study data comes from the area covered by blue. Area 1 is the training set, and areas 2 and 3 are the test set.

Extensive research has been done on LiDAR object segmentation tasks at large spatial scales [nguyen20133d, luo2015patch]. Traditional segmentation algorithms rely heavily on hand-crafted features [vosselman2004recognising, hough1962method, fischler1981random]. They usually divide the large-scale LiDAR data into smaller units for classification or segmentation based on point clusters, voxels or collections of images [rabbani2006segmentation, papon2013voxel, yang2015hierarchical]. Then, unique features are extracted from the standardized data and fed to a classifier such as maximum likelihood algorithms, support vector machines (SVMs), random forest (RF) [guo2011relevance], object-oriented modeling [zhou2013object], etc. [chen2017multispectral, huang2008knowledge]. Previous research can be summarized into three discrete processes: data transformation, feature extraction, and classification. These processes need separate algorithms, which makes the optimization difficult. Moreover, data transformation distorts the relationships between points and causes information loss that depends on the hand-crafted process used, which leads to poor model generalization [sun2018classification, luo2018semantic]. Therefore, an end-to-end learning mechanism is necessary to overcome these limitations [caltagirone2017fast, caltagirone2019lidar, asvadi2018multimodal].

In recent years, deep learning, especially Convolutional Neural Networks (CNNs), has proved effective in automatic feature extraction and computer vision tasks in an end-to-end fashion [litjens2017survey, deng2014tutorial, zhang2018survey]. During the training process, CNNs learn both local and global features at different layers [wang2019deep]. CNNs have achieved unprecedented success in many classification, detection and segmentation tasks [simonyan2014very, szegedy2017inception, ren2015faster]. These novel CNNs have also inspired researchers to tackle challenging 3D classification tasks [xu20183d]. However, traditional CNNs normally consist of 2D layers, which cannot directly adapt to the structure of 3D point clouds. Hence, 3D CNNs are applied to 2D contexts transformed from LiDAR [li2016vehicle, maturana20153d], such as VoxNet [maturana2015voxnet] and ShapeNets [wu20153d]. Other research utilizes multi-view CNNs to extract geometric features from multiple 2D rendering views of point clouds [su2015multi, boulch2017unstructured, caltagirone2017fast]. These volumetric CNNs and multi-view CNNs can only be applied to data transformed from 3D point clouds, which causes considerable information loss. To address this issue, the ideal approach is to build a 3D model that can be applied directly to the unique structure of point clouds. Qi et al. [qi2017pointnet] proposed a unique deep learning structure called PointNet, which is the first architecture that utilizes a set of functions to obtain global features from point clouds. To capture local features and improve the generalizability of the model for pattern recognition tasks, Qi et al. [qi2017pointnet++] further proposed a novel model called PointNet++, which hierarchically concatenates the grouping process at different scales. These frameworks integrate the preprocessing, processing, and classification of LiDAR data and have proved successful in various applications.

Compared to objects in indoor scenes, objects in terrestrial landscapes exhibit great variation even within the same class [hsiao2004change]. For example, powerlines come in a variety of sizes and scales. Previous methods that classify powerlines from extracted geometric features alone are therefore prone to misclassifications caused by the large variations in object shapes and scales [yousefhussien2018multi, blomley20163d]. Objects in this type of LiDAR data contain complex spatial features and interdependencies. By extracting the contextual features among points, a model can classify terrestrial LiDAR data with higher accuracy [im2008object, loog2006segmentation]. For example, by quantifying the scale dependency of the soil properties and terrain attributes in LiDAR [maynard2014scale], the optimal spatial scale can be identified for deriving terrain attributes. Ebadat [parmehr2014automatic] uses statistical dependence to develop a novel mutual information method, which can improve the automated registration of multi-sensor images. However, these methods are difficult to formulate as end-to-end models because their processes have different objective functions. The interdependency in terrestrial LiDAR data therefore urgently needs explicit modeling. Meanwhile, the points that constitute an object are also assigned different weights based on their variations among objects [qi2017pointnet, qi2017pointnet] when generating segmentation results. Explicitly, the model should pay more attention to the feature extraction process of those critical points [nam2017dual]. Modeling the long-range dependencies of point clouds among objects and leveraging important features to enhance the segmentation results are still needed.

Figure 2: 3D point cloud acquired via airborne laser scanning.
Class         power  low_veg  imp_surf  car    fence_hedge  roof     fac     shrub   tree     Total
Training Set  546    180,850  193,723   4,614  12,070       152,045  27,250  47,605  135,173  753,876
Test Set      600    98,690   101,986   3,708  7,422        109,048  11,224  24,818  54,226   411,722
Table 1: Number of 3D points per class.

In this research, we propose a novel double self-attention convolutional network called DAPnet to address these challenges. The model is inspired by the self-attention mechanism, which can enhance the quality of spatial encodings and combine the geometric and semantic features of LiDAR. The self-attention module increases the weights of the important features it finds. These weight-optimized features, combined with interdependencies, improve the segmentation performance on point clouds at various scales. Specifically, we create a double self-attention module, consisting of a point attention module and a group attention module, as the top geometric feature extraction layer of the model. The point attention module obtains the contextual features of the input data through geometric features and feature attention matrices. Its outputs are computed by adding a weighted summation of the features of points, which strengthens the long-range dependencies among points regardless of their locations. Similarly, the group attention module introduces a self-attention mechanism into the model through group sampling: it reweights the contributions of the different groups abstracted from neighboring clusters of points to improve the quality of the representations. These different groups represent different types of local features, and the group attention module adjusts the weights of the local features according to their importance to the model. The two attention modules process the extracted features in parallel. Then the outputs of the modules are fused back into the model to propagate the segmentation process. The DAPnet can be applied directly to raw point clouds and strengthens the long-range and multi-level feature dependencies among individual points and groups of points, respectively.
The experimental results show that our method can effectively and efficiently segment point clouds, reaching a state-of-the-art overall accuracy of 90.7%, which is 5.5% above the benchmark. Compared to the benchmark methods, the DAPnet also significantly improves the average F1 score to 82.3%, up 12.6%. The ablation experiment shows that the double self-attention module enhances the average per-class accuracy by 7% at the same rate of model convergence.

The major contribution of this research can be summarized as follows:

  • We propose a novel deep learning architecture named DAPnet to handle long-range dependencies of LiDAR point clouds by combining geometric and contextual features for terrestrial LiDAR.

  • We incorporate the point and group attention modules to enhance feature learning based on spatial interdependencies. These two modules are flexible and can be easily applied to other architectures.

  • The proposed method obtains the highest overall accuracy (90.7%) on the ISPRS 3D Semantic Labeling Contest dataset compared to the benchmarks. The largest improvements are for powerline (+7.5%) and car (+13.0%).

The remainder of this paper is organized into four additional sections. Section 2 describes the LiDAR data. We present our method in Section 3. The experimental results are presented in Section 4. We draw conclusions and discuss further work in Section 5.

2 Study area and data

2.1 Study area

The study area is located in Vaihingen, Baden-Wurttemberg, Germany (Figure 1). It is 25 km northwest of Stuttgart and situated on the river Enz. The total area is about 73.42 km² with a population of 30,000. It has a temperate continental climate and is often dry for a long time. Winters are long, mild and cold, and summers are hot. Sometimes the maximum temperature can exceed 30 °C for many days due to hot wind. The vegetation in this area is closely interwoven with the urban environment, and the two overlap. Multiple types of terrestrial objects, such as fences, trees, shrubs, and facades, have complex and irregular shapes, and the buildings are dense and complex.

2.2 Data

In this study, the data were obtained by the Leica ALS50 system in August 2008. Its average flying height is 500 m above ground and its field of view is 45°. The average overlap of the strips is 30%, and the point density in the test area is about 8 points/m². Multiple intensities and echoes were recorded. The acquired data mostly cover small buildings with multi-layered structures and many areas of detached buildings surrounded by trees. Because the acquisition was made in the summer, only a few points (2.3%) received multiple returns. Therefore, the vertical distribution of points in most trees only describes the canopy. The Vaihingen dataset was proposed within the scope of the ISPRS test project on urban classification [cramer2010dgpf, niemeyer2014contextual], and it is the benchmark dataset for the ISPRS 3D semantic labeling benchmarks. More details about this dataset are described on the ISPRS website (https://bit.ly/2wCROQ6).

Figure 3: Demonstration of data processing. From left to right is the process of 3D LiDAR data in the area to a regular input data.

Specifically, the LiDAR point cloud consists of 1,165,598 points, which are divided into two areas for training and testing. There are a total of 753,876 training points and 411,722 test points. The training area is mainly residential, with detached houses and high-rise buildings. It has an area of 399 m × 421 m. The test area is located in the center of Vaihingen and has dense and complex buildings. It covers an area of 389 m × 419 m. We discern the following 9 object classes: Powerline (power), Low vegetation (low_veg), Impervious surfaces (imp_surf), Car, Fence/Hedge (fence_hedge), Roof, Facade (fac), Shrub and Tree. Each point contains LiDAR-derived (x, y, z) coordinates, backscattered intensity, return number, number of returns, and a reference label. The training and test areas are shown in Figure 2, and the class distributions of the training and test sets are shown in Table 1.

3 Methods

3.1 Data preprocessing

Figure 4: An overview of the double self-attention convolutional network, DAPnet

Figure 3 shows the data preprocessing and normalization that turn point clouds into regular batch inputs. The input is an unordered set of 3D LiDAR points. Each point is a vector of its LiDAR-derived coordinates plus extra channels such as intensity, return number and number of returns, and corresponds to a reference label. For the training set, we obtain the length and width of the area from the maximum and minimum coordinates. Then, a fixed-size block slides over the entire area in strides, and each block can overlap the previous one. The entire area of the dataset is thereby divided into multiple small blocks of the same size, each containing a different number of points. The processed blocks form a new dataset in which every block is stored together with its number of points. When the number of points in a block is below a threshold, we remove the block. The test set uses the same processing method without overlap.

During the model training process, we randomly select blocks and use the min-max normalization method [jain2011min] to process the coordinates and intensity based on the points in the block. A fixed number of points is then randomly sampled from each block. When a block contains fewer points than the fixed number, we adopt the Bootstrap sampling method [shao1994bootstrap] to draw enough points. As a result, every block in the final dataset contains the same number of sampled points.
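The preprocessing described in this subsection can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `min_points` threshold, and the choice to normalize every channel that is passed in are hypothetical, while the 30 m block and 10 m stride defaults are taken from Section 4.1.

```python
import random

def slide_blocks(points, block=30.0, stride=10.0, min_points=1):
    """Split points (tuples starting with x, y) into overlapping square blocks."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, x_max, y_min, y_max = min(xs), max(xs), min(ys), max(ys)
    blocks = []
    x0 = x_min
    while x0 <= x_max:
        y0 = y_min
        while y0 <= y_max:
            inside = [p for p in points
                      if x0 <= p[0] < x0 + block and y0 <= p[1] < y0 + block]
            if len(inside) >= min_points:  # drop blocks below the (assumed) threshold
                blocks.append(inside)
            y0 += stride
        x0 += stride
    return blocks

def min_max_normalize(block_points):
    """Rescale each channel to [0, 1] using the block's own minimum and maximum."""
    dims = len(block_points[0])
    lo = [min(p[d] for p in block_points) for d in range(dims)]
    hi = [max(p[d] for p in block_points) for d in range(dims)]
    return [tuple((p[d] - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0
                  for d in range(dims)) for p in block_points]

def sample_fixed(block_points, n, rng):
    """Draw exactly n points: without replacement if possible, else bootstrap."""
    if len(block_points) >= n:
        return rng.sample(block_points, n)
    return [rng.choice(block_points) for _ in range(n)]  # sampling with replacement
```

Sampling with replacement in the sparse case is what makes every block the same length, so blocks can be stacked into a regular batch.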

3.2 DAPnet

3.2.1 Method overview

The proposed DAPnet, a 3D point cloud semantic segmentation model, consists of feature abstraction layers, point attention module, group attention module, and feature propagation layers as shown in Figure 4.

The feature abstraction layer is a feature extractor adapted to raw point clouds. It contains group sampling and convolutional layers, which can effectively extract hierarchical features. Group sampling divides all data into different groups in order to learn multiple local features, and convolutional layers extract features from the data via multi-size kernels. After the input data is processed from multiple feature abstraction layers, we can obtain output results with multiple features of points and groups.

The point and group attention modules are two types of self-attention modules and are the key parts of the DAPnet. They enhance the interdependency between the features of points and groups and improve the segmentation result. We feed the data processed by the feature abstraction layer into the two modules in parallel. The point attention module strengthens the valuable features of each point while suppressing the meaningless ones. Similarly, the group attention module treats each group as a unit to capture the correlation between groups. Finally, we sum the results of the two modules to get the final feature-enhanced outputs.

Feature propagation is an upsampling operation that uses the learned features to retrieve the features of all input points. Because the final feature-enhanced outputs cannot be used directly to classify each point, we need to recover the features of each point. During the upsampling process, the layer also concatenates the features of the corresponding feature abstraction layer through the skip-connection method. As a result, neighboring points have similar features. The final results of feature propagation are fed to the classifier.

The final classifier implements the classification of each point class, thereby achieving the semantic segmentation of the LiDAR point clouds.

3.2.2 Feature abstraction layer

The feature abstraction layer, which is the first part of the DAPnet, organizes points into hierarchical groups and progressively extracts features at different hierarchies.

A given set of points forms a matrix whose rows correspond to the points and whose columns correspond to the dimensions of a point feature, such as coordinate, RGB, intensity, etc. For group sampling, in order to measure spatial distance continuously, we need the spatial coordinates of the LiDAR data. Therefore, we transform the LiDAR data into a matrix that carries the coordinate dimensions in addition to the feature dimensions. The additional coordinate dimensions are not involved in the process of feature extraction. In our study data, each point provides LiDAR-derived (x, y, z) coordinates together with intensity, return number and number of returns. Then, so that the sampled groups can cover the entire object, we use iterative farthest point sampling (FPS) [eldar1997farthest] to find several major centroids, which means that during the iteration, each selected point is the farthest one from the rest. The spatial distance between two points is used as the distance measure. The sampled centroids form a subset of the input points, each identified by its index in the input set. Based on these centroids, we construct groups with the ball query method [qi2017pointnet++], which finds all points within a fixed radius. Moreover, we can also use multiple-scale radii for grouping, which takes the problem of sparse and dense point sampling into account. After group sampling, the input data are organized into groups.
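The group sampling step can be illustrated with a small standard-library sketch of farthest point sampling followed by a ball query. The Euclidean distance, the fixed starting index, and the function names are assumptions made for this illustration; real implementations usually also cap the number of points per ball.

```python
import math

def farthest_point_sampling(coords, k, start=0):
    """Return indices of k centroids; each new one is farthest from those already chosen."""
    chosen = [start]
    # Distance from every point to its nearest chosen centroid so far.
    nearest = [math.dist(c, coords[start]) for c in coords]
    while len(chosen) < k:
        idx = max(range(len(coords)), key=lambda i: nearest[i])
        chosen.append(idx)
        for i, c in enumerate(coords):
            nearest[i] = min(nearest[i], math.dist(c, coords[idx]))
    return chosen

def ball_query(coords, centroid_idx, radius):
    """For each centroid, list the indices of all points within `radius` of it."""
    return [[i for i, c in enumerate(coords) if math.dist(c, coords[ci]) <= radius]
            for ci in centroid_idx]

coords = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (10, 1, 0), (5, 0, 0)]
centroids = farthest_point_sampling(coords, 3)
groups = ball_query(coords, centroids, radius=1.5)
```

Calling `ball_query` once per radius and concatenating the results per centroid would give a multi-scale grouping in the spirit of the multiple-scale radii mentioned above.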

For the feature extraction, we apply a 1-D convolutional operation on each group, which is treated as a new object. The results after convolution are normalized with Batch Normalization (BN) and passed through the ReLU activation function [krizhevsky2012imagenet], a non-linear operation. In other words, the feature of a group of points is obtained by applying the 1-D convolutional operation, then BN, then ReLU. For the outputs of the feature extraction, we use max pooling [murray2014generalized] to extract global features. The global features of the groups are the different local features of the object. After the whole feature abstraction layer, each group in the output data is described by a feature vector of the extracted feature dimension.
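A minimal sketch of the per-group feature extraction is given below. It assumes a convolution with kernel size 1, so the same linear map is applied to every point, and it omits Batch Normalization for brevity; both choices and all names are assumptions for illustration rather than details confirmed by the text.

```python
def pointwise_conv_relu(group, weights, bias):
    """Apply the same linear map followed by ReLU to every point in a group."""
    out = []
    for point in group:
        row = [max(0.0, sum(w * x for w, x in zip(w_row, point)) + b)
               for w_row, b in zip(weights, bias)]
        out.append(row)
    return out

def max_pool(features):
    """Channel-wise maximum over all points of a group."""
    return [max(col) for col in zip(*features)]

group = [[1.0, 2.0], [3.0, -1.0]]                   # two points, two input channels
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]    # three output channels
bias = [0.0, 0.0, 0.0]
group_feature = max_pool(pointwise_conv_relu(group, weights, bias))
```

Because the maximum is taken over points, the resulting group feature does not depend on the order in which the points are listed, which suits unordered point clouds.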

(a) Point Attention Module
(b) Group Attention Module
Figure 5: The details of the double self-attention module.

3.2.3 Point attention module

The dependency relationship between points can be used to improve the segmentation results. We designed a point attention module, which can extract the contextual features between points to explicitly model this relationship. It is a new type of self-attention module that can be briefly described as mapping of queries to key-value pairs and usually contains queries, keys, values, and output [vaswani2017attention]. Through the query in the key-value pair, the weight of the query under the corresponding key is obtained. Then we add the weights to the corresponding query to get the output. The same query has different outputs under different keys, that is, different attentions. For the input LiDAR data, due to the different scales and shapes of the terrestrial objects, even the same class would have different outputs under the attention of different contextual features. This feature-enhanced output is beneficial to model classification. Therefore, we need to construct queries, key-value pairs, and outputs for point features.

For the output of the feature abstraction, the process of the point attention module is illustrated in Figure 5(a). We feed the output data into two convolutional layers separately and obtain two outputs. The point features of each group are then expanded into vectors, so that the two outputs are reshaped into two matrices, one of which is further transposed. After multiplying the two matrices, we apply a softmax function [bouchard2011clustering] to the result to obtain the point attention matrix, which serves as the key-value pairs.

Each entry of the point attention matrix indicates how strongly one feature impacts another: the stronger the dependency between two features, the higher the value. It also represents the long-range dependency relationship between points.

In addition, we feed the output of the feature abstraction into another convolutional layer to obtain the query. We convert it to a matrix and multiply it by the transpose of the point attention matrix. Then we reshape the result back to the shape of the feature abstraction output, multiply it by a scale parameter, and perform an element-wise sum with the feature abstraction output. The final result is the output of the attention mechanism.

The scale is a learnable parameter initialized as 0, so it can gradually assign more weight to the non-local features beyond the local neighborhood. The final result is a feature-enhanced output with the point attention matrix, which combines the geometrical and contextual features.
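The data flow of the point attention module can be sketched as follows. The three convolutional layers are left out, so `B`, `C` and `D` stand for their outputs and `A` for the feature abstraction output; these names, the index convention of the attention matrix, and the plain-list representation are assumptions for illustration only.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def point_attention(A, B, C, D, alpha):
    """A, B, C, D: lists of N feature vectors (one per point); alpha: scalar scale."""
    n = len(A)
    # S[j][i]: how strongly point i influences point j; each row sums to 1.
    S = [softmax([sum(b * c for b, c in zip(B[i], C[j])) for i in range(n)])
         for j in range(n)]
    out = []
    for j in range(n):
        # Attention-weighted sum of D over all points, regardless of location.
        ctx = [sum(S[j][i] * D[i][k] for i in range(n)) for k in range(len(D[0]))]
        out.append([alpha * c + a for c, a in zip(ctx, A[j])])
    return out
```

With `alpha = 0` the function returns `A` unchanged, which mirrors the zero initialization of the scale: the network starts from the purely geometric features and learns how much contextual information to mix in.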

3.2.4 Group attention module

Each group, or local feature, also contributes differently to the final segmentation result. In order to strengthen the contribution of different groups, we also designed a group attention module to model the interdependencies between groups. Similarly, it also needs to construct queries, key-value pairs, and outputs. Unlike the point attention module, it does not require a convolution operation at the beginning, since a convolution could destroy the relationship between groups.

For the output of the feature abstraction, the process of the group attention module is illustrated in Figure 5(b). We directly reshape it to a matrix. Then we perform a matrix multiplication between the matrix and its transpose and apply a softmax function to the result to obtain the group attention matrix.

The group attention matrix is the key-value pair for the group attention module and represents the interdependency between groups. The reshaped output is then multiplied by V with a scale parameter. Finally, we perform an element-wise sum with the feature abstraction output.

The scale parameter is also initialized as 0. When each group adds the corresponding weights, local features can be boosted based on the long-range dependencies between groups.
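For comparison, the group attention module can be sketched without any convolution: the attention matrix is computed directly from the group features. The names `F` and `beta` and the orientation of the attention matrix are assumptions made for this illustration.

```python
import math

def group_attention(F, beta):
    """F: list of G group feature vectors; beta: scalar scale (0 returns F unchanged)."""
    G = len(F)
    out = []
    for j in range(G):
        # Similarity of group j to every group, taken directly from the features.
        scores = [sum(a * b for a, b in zip(F[i], F[j])) for i in range(G)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]  # row j of the group attention matrix
        ctx = [sum(weights[i] * F[i][k] for i in range(G)) for k in range(len(F[0]))]
        out.append([beta * c + f for c, f in zip(ctx, F[j])])
    return out
```

Since both modules return an output with the same shape as their input, their results can be fused by an element-wise sum, as the method overview describes.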

Layer Name              Input  Output  Operations              # Kernels      Note
Feature Abstraction                    Conv, Conv, Conv, Max   (32,32,64)
                                       Conv, Conv, Conv, Max   (64,64,128)
                                       Conv, Conv, Conv, Max   (128,128,256)
                                       Conv, Conv, Conv, Max   (256,256,512)
Point Attention Module                 Conv                    64
                                       Conv                    64
                                       Multiple, Softmax       -              key-value pairs
                                       Conv                    -              query
                                       Multiple, Add           -              output
Group Attention Module                 Multiple, Softmax                      key-value pairs
                                       Multiple, Sum                          query, output
Feature Propagation                    Conv, Conv, Conv        (128,128,128)
Classifier                             Conv                    (128)
                                       Softmax, Cross entropy
Table 2: The detailed operations of the DAPnet

3.2.5 Feature propagation layer

In order to obtain the features of all points, the feature propagation layer propagates the fused features, which are the element-wise summation of the outputs of the double self-attention module, to each point.

Feature propagation is also a hierarchical process, which progressively generates features corresponding to all points from the fused features. Meanwhile, we concatenate the summed outputs and the features of the previous feature abstraction layer through the skip-connection method to form the input of the first feature propagation layer. Then we use the inverse distance weighted method [setianto2013comparison] based on k nearest neighbors (KNN) [fukunaga1975branch] to interpolate the feature values of the points in each layer (Eq. 6).

From Eq. 6, we can find that points farther from the interpolated point have less weight. After the weight of each point is assigned, we perform a global normalization for all weights. Through the processing of all feature propagation layers, we obtain the score of each point and compare it with the corresponding label under the cross-entropy loss function [bosman2000negative]. Finally, we use the gradient descent algorithm [kingma2014adam] to update all model weights.
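The interpolation step can be illustrated with a short sketch of inverse distance weighting over the k nearest neighbors. The defaults `k=3` and `power=2`, the small `eps` guard against division by zero, and the function name are assumptions for illustration; the values used by the authors are not stated in this section.

```python
import math

def idw_interpolate(query, known_coords, known_feats, k=3, power=2, eps=1e-8):
    """Interpolate a feature vector at `query` from its k nearest known points."""
    dists = sorted((math.dist(query, c), i) for i, c in enumerate(known_coords))[:k]
    weights = [1.0 / (d ** power + eps) for d, _ in dists]  # farther means smaller weight
    total = sum(weights)                                     # normalize weights to sum to 1
    dim = len(known_feats[0])
    return [sum(w * known_feats[i][ch] for w, (_, i) in zip(weights, dists)) / total
            for ch in range(dim)]

known_coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
known_feats = [[0.0], [10.0]]
midpoint = idw_interpolate((1.0, 0.0, 0.0), known_coords, known_feats, k=2)
```

A query point that coincides with a known point receives almost exactly that point's feature, while a query halfway between two known points receives their average.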

3.3 Training parameters

In this section, we introduce the architecture of the DAPnet. The details of the architecture are provided in Table 2, including data flow, the main process of operation and the number of convolutional kernels.

First, the DAPnet has 4 feature abstraction layers. In each layer, the point clouds are divided by group sampling. For these groups, we use 3 convolutional layers for feature extraction, and the number of convolution kernels increases gradually. Deeper feature abstraction layers also use more convolution kernels than the previous layer. After the 3 convolutional layers, we adopt max pooling to obtain the global features of each group. Then, from the final output of the feature abstraction, the point and group attention modules produce the two feature-enhanced results. The next 4 layers are feature propagation layers. Except for the last layer, the input of each is concatenated with the output of the corresponding feature abstraction layer. Finally, the classifier performs classification by comparing the obtained scores of all original points with the corresponding labels.

Method      Transformation  Features                Classifier
IIS_7       Yes             Geometrical
UM          Yes             Geometrical             OvO classifier
HM_1        Yes             Geometrical/Contextual  RF
WhuY3       Yes             Geometrical             Softmax
LUH         Yes             Geometrical/Contextual  RF
RIT_1       No              Geometrical             Softmax
NANJ2       Yes             Geometrical             Softmax
PointNet    No              Geometrical             Softmax
PointNet++  No              Geometrical             Softmax
DAPnet      No              Geometrical/Contextual  Softmax
Table 3: The detail of the benchmark methods

3.4 Benchmark methods

To compare with the benchmark methods of the ISPRS 3D Semantic Labeling Contest, we selected the following algorithms for a brief review based on performance, feature extraction methods, and other factors, and we use their abbreviations as method names.

Classes      power  low_veg  imp_surf  car   fence_hedge  roof  fac   shrub  tree
power        90.2   0.2      0.0       0.0   0.0          4.7   0.3   0.3    4.3
low_veg      0.0    89.4     6.6       0.1   0.1          0.4   0.1   2.9    0.4
imp_surf     0.0    3.0      96.6      0.1   0.0          0.1   0.0   0.1    0.0
car          0.0    2.0      1.3       88.8  1.4          1.5   0.5   4.2    0.2
fence_hedge  0.0    6.5      1.8       1.7   40.1         2.0   2.0   31.6   14.4
roof         0.1    0.4      0.1       0.0   0.1          96.4  0.6   1.0    1.4
fac          0.1    8.2      1.0       1.1   0.3          13.9  60.9  7.3    7.2
shrub        0.0    11.5     0.6       0.6   1.4          2.7   1.4   73.9   8.0
tree         0.0    2.3      0.0       0.1   0.4          1.7   0.7   7.4    87.5
Precision    84.4   90.8     93.3      83.4  77.9         96.4  80.2  61.7   89.1
Recall       90.2   89.4     96.6      88.8  40.1         96.4  60.9  73.9   87.5
F1 Score     87.2   90.1     94.9      86.0  53.0         96.4  69.2  67.3   88.3
Table 4: The detail of the per-class accuracy of DAPnet (%), and the overall accuracy is 90.2%

Classes      power  low_veg  imp_surf  car   fence_hedge  roof  fac   shrub  tree
power        90.3   0.2      0.0       0.0   0.0          4.7   0.3   0.3    4.2
low_veg      0.0    89.5     6.6       0.1   0.1          0.5   0.1   2.8    0.4
imp_surf     0.0    2.2      97.3      0.1   0.0          0.1   0.0   0.1    0.0
car          0.0    1.9      1.1       90.2  0.8          1.6   0.5   3.7    0.3
fence_hedge  0.0    9.2      1.8       1.6   41.1         1.8   1.2   28.4   14.8
roof         0.1    0.3      0.1       0.0   0.0          97.0  0.4   1.0    1.1
fac          0.1    8.1      0.9       1.1   0.3          13.8  62.0  7.0    6.7
shrub        0.0    11.6     0.6       0.5   1.2          2.6   1.1   74.0   8.3
tree         0.0    2.2      0.0       0.1   0.3          1.6   0.7   6.9    88.2
Precision    85.0   91.4     93.4      84.7  80.9         96.5  84.4  63.1   89.5
Recall       90.3   89.5     97.3      90.2  41.1         97.0  62.0  74.0   88.2
F1 Score     87.6   90.4     95.3      87.4  54.5         96.7  71.5  68.1   88.9
Table 5: The detail of the per-class accuracy of DAPnet_MSG (%), and the overall accuracy is 90.7%
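The relation between the Precision, Recall and F1 Score rows of Tables 4 and 5 can be checked with a short sketch. It uses the standard definition of F1 as the harmonic mean of precision and recall; the function names and the convention that rows of the confusion matrix hold the reference class are assumptions, chosen because the diagonal entries of the tables equal the Recall row.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    return 2 * precision * recall / (precision + recall)

def per_class_metrics(confusion):
    """confusion[i][j]: number of points of reference class i predicted as class j."""
    n = len(confusion)
    metrics = []
    for c in range(n):
        tp = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n))  # column sum
        actual = sum(confusion[c])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        metrics.append((p, r, f1_score(p, r) if p + r else 0.0))
    return metrics

power_f1 = round(f1_score(84.4, 90.2), 1)  # power class, Table 4
```

For the power class in Table 4, a precision of 84.4 and a recall of 90.2 give an F1 score of 87.2, matching the table. Recomputing from the rounded table entries can differ from a printed value in the last digit for some classes.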

The differences between these benchmark methods lie mainly in three aspects: data transformation, features, and type of classifier. Table 3 shows the characteristics of the mentioned methods and our proposed method. The IIS_7 method (https://bit.ly/3cHD6I6) over-segments the LiDAR data into supervoxels in terms of various attributes (i.e., shape, colors, intensity, etc.) and applies spectral and geometrical feature extraction. The UM method (https://bit.ly/32Z4FIA) combines various features, including LiDAR point attributes, textural analysis, and geometric attributes, in a one-vs-one classifier. The HM_1 method (https://bit.ly/330b7PN) depends on the geometric features of a selection of neighborhoods. To conduct the contextual classification, it utilizes a Conditional Random Field (CRF) [finkel2008efficient] with an RF classifier. The WhuY3 [yang2017convolutional] method transforms the 3D neighborhood features of point clouds into a 2D image and applies a CNN to extract a high-level representation of features. The LUH method (https://bit.ly/38yMg6u) extends the Voxel Cloud Connectivity Segmentation [papon2013voxel]. It designs a two-layer hierarchical CRF framework to connect contextual relationships, along with the Fast Point Feature Histograms (FPFH) features [rusu2009fast]. The RIT_1 [yousefhussien2018multi] method proposes a 1D fully convolutional network extended from PointNet [qi2017pointnet] in an end-to-end fashion. The NANJ2 method (https://bit.ly/2PWLHgy) applies a multi-scale convolutional neural network to learn the geometric features based on a set of multi-scale contextual images. Furthermore, we also use PointNet and PointNet++ [qi2017pointnet++] as comparison methods. During their training, we used the same data processing methods as in this study. Both PointNet and PointNet++ can be applied directly to the original point cloud data. They use the extracted geometric features to classify points through the softmax classifier.

(a) Ground Truth
(b) DAPnet
(c) DAPnet_MSG
Figure 6: The ground truth of the test set, the classification and error maps. (a) the ground truth data. In (b) and (c), the left image is the classification map and the right side is the corresponding error map.
Figure 7: The comparison of error maps between the new model and benchmark methods. Circles indicate the mixed region of powerline, roof, and tree classes.

4 Experiment

4.1 Implementation details

The method is implemented using PyTorch. We choose 30 m × 30 m as the block size and 10 m as the stride. Each block is evenly sampled to 1024 points. The numbers of groups for the group sampling of the feature abstraction layers are 256, 128, 64, and 32, which correspond to sampling radii of 0.1, 0.2, 0.4, and 0.8, respectively. The multi-scale sampling radii of the MSG operation are [0.05, 0.1], [0.1, 0.2], [0.2, 0.4], and [0.4, 0.8], respectively. A batch size of 16 and a maximum of 200 epochs are chosen for model training. The Adam optimizer [kingma2014adam] with a momentum of 0.9 and a decay rate of 0.0001 is used. The poly learning rate policy is adopted, under which the learning rate is iteratively multiplied by a decay factor, with the initial learning rate set to 0.001 and a lower bound imposed on the learning rate during training. The evaluation metrics of per-class accuracy, precision, recall, and F1-score are calculated using the ISPRS 3D Semantic Labeling Contest dataset.
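The block partitioning described above can be illustrated with a short sketch. It assumes square blocks on the x-y plane and uses uniform random sampling to bring each block to a fixed size. The text only states that blocks are "evenly sampled", so the sampling scheme, the padding rule for sparse blocks, and the function names `tile_origins` and `sample_block` are illustrative assumptions rather than the authors' code.

```python
import random


def tile_origins(min_xy, max_xy, block=30.0, stride=10.0):
    """Yield lower-left (x0, y0) corners of square blocks covering the extent."""
    (xmin, ymin), (xmax, ymax) = min_xy, max_xy
    x = xmin
    while True:
        y = ymin
        while True:
            yield (x, y)
            if y + block >= ymax:   # this block already reaches the top edge
                break
            y += stride
        if x + block >= xmax:       # this column already reaches the right edge
            break
        x += stride


def sample_block(points, origin, block=30.0, n_points=1024, rng=random):
    """Return n_points points whose (x, y) lie inside the block (assumed scheme)."""
    x0, y0 = origin
    inside = [p for p in points
              if x0 <= p[0] < x0 + block and y0 <= p[1] < y0 + block]
    if not inside:
        return []
    if len(inside) >= n_points:
        return rng.sample(inside, n_points)   # subsample without replacement
    # Sparse block: pad by re-drawing existing points with replacement.
    return inside + rng.choices(inside, k=n_points - len(inside))


# Synthetic 50 m x 50 m cloud of (x, y, z) points, purely for illustration.
rng = random.Random(0)
cloud = [(rng.uniform(0, 50), rng.uniform(0, 50), rng.uniform(0, 10))
         for _ in range(5000)]
origins = list(tile_origins((0.0, 0.0), (50.0, 50.0)))
blocks = [sample_block(cloud, o, rng=rng) for o in origins]
print(len(origins), {len(b) for b in blocks})
```

Because the stride (10 m) is smaller than the block size (30 m), neighboring blocks overlap, so most points are seen in several blocks.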

Method power low_veg imp_surf car fence_hedge roof fac shrub tree OA
IIS_7 40.8 49.9 96.5 46.7 39.5 96.2 - 52.0 68.8 76.2
UM 33.3 79.5 90.3 32.5 2.9 90.5 43.7 43.3 85.2 80.8
HM_1 82.8 65.9 94.2 67.1 25.2 91.5 49.0 62.7 82.6 80.5
WhuY3 24.7 81.8 91.9 69.3 14.7 95.4 40.9 38.2 78.5 82.3
LUH 53.2 72.7 90.4 63.3 25.9 91.3 60.9 73.4 79.1 81.6
RIT_1 29.8 69.8 93.6 77.0 10.4 92.9 47.4 73.4 79.3 81.6
NANJ2 61.2 87.7 93.3 55.6 34.0 91.6 38.6 72.7 77.5 85.2
PointNet 63.5 81.3 88.3 50.5 17.2 78.1 26.5 42.2 67.9 75.1
PointNet++ 80.7 86.2 93.6 74.5 33.0 88.4 51.5 62.4 79.2 84.2
DAPnet 90.2 89.4 96.6 88.8 40.1 96.4 60.9 73.9 87.5 90.2
DAPnet_MSG 90.3 89.5 97.3 90.2 41.1 97.0 62.0 74.0 88.2 90.7
Table 6: The overall accuracy (OA) and corresponding per-class accuracy. (%)
Method power low_veg imp_surf car fence_hedge roof fac shrub tree Avg. F1
IIS_7 54.4 65.2 85.0 57.9 28.9 90.9 - 39.5 75.6 55.3
UM 46.1 79.0 89.1 47.7 5.2 92.0 52.7 40.9 77.9 59.0
HM_1 69.8 73.8 91.5 58.2 29.9 91.6 54.7 47.8 80.2 66.4
WhuY3 37.1 81.4 90.1 63.4 23.9 93.4 47.5 39.9 78.0 61.6
LUH 59.6 77.5 91.1 73.1 34.0 94.2 56.3 46.6 83.1 68.4
RIT_1 37.5 77.9 91.5 73.4 18.0 94.0 49.3 45.9 82.5 63.3
NANJ2 62.0 88.8 91.2 66.7 40.7 93.6 42.6 55.9 82.6 69.3
PointNet 43.6 80.4 87.8 47.3 21.6 81.9 28.7 38.5 64.5 54.9
PointNet++ 71.4 84.9 91.6 73.9 36.2 90.4 51.9 50.5 75.9 69.6
DAPnet 87.2 90.1 94.9 86.0 53.0 96.4 69.2 67.3 88.3 81.4
DAPnet_MSG 87.6 90.4 95.3 87.4 54.5 96.7 71.5 68.1 88.9 82.3
Table 7: The per-class F1 score and the average F1 scores (Avg. F1). (%)
Method power low_veg imp_surf car fence_hedge roof fac shrub tree OA
DAPnet-wo-PAM&GAM 80.3 86.1 93.4 75.5 33.2 88.5 51.1 62.2 80.1 84.5
DAPnet-w-GAM 79.5 86.5 94.1 78.6 38.2 91.0 55.0 66.0 83.0 86.1
DAPnet-w-PAM 88.3 86.0 95.1 79.6 38.1 93.9 59.8 63.5 85.4 87.3
DAPnet-w-PAM&GAM 90.2 89.4 96.6 88.8 40.1 96.4 60.9 73.9 87.5 90.2
DAPnet_MSG-wo-PAM&GAM 84.3 86.3 95.0 78.1 37.6 91.1 54.4 64.0 81.6 86.0
DAPnet_MSG-w-GAM 88.5 86.8 94.6 82.1 40.5 93.9 58.8 69.8 83.9 87.6
DAPnet_MSG-w-PAM 90.2 87.9 94.7 84.6 41.0 95.1 61.9 64.4 86.5 88.3
DAPnet_MSG-w-PAM&GAM 90.3 89.5 97.3 90.2 41.1 97.0 62.0 74.0 88.2 90.7
Table 8: The comparison of performances using different attention strategies. (%)

4.2 Classification results

Table 4 shows the confusion matrix of the DAPnet model. The overall accuracy is 90.2%, and the model achieves accuracies of 90% or more on the powerline, impervious surfaces, and roof classes. The model performs worse on the fence/hedge (40.1%), facade (60.9%), and shrub (73.9%) classes. Most of the misclassified fence/hedge points are assigned to shrub (31.6%) and tree (14.4%), because these classes are highly similar in height, topology, and spectral reflectance. The precision of fence/hedge is 77.9%, which indicates that the model is less likely to misclassify shrub and tree points as fence/hedge. Many facade points are misclassified as roof because of the similarity between the two classes, while roof points are rarely misclassified as facade. The result reveals that some classes have dominant features in the DAPnet classification, which leads to a lower recall and a higher precision on the fence/hedge class.
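The relation between recall, precision, and F1 discussed above can be made concrete with a small sketch. The confusion matrix below contains made-up counts for three classes and is not taken from the paper; the function name `per_class_metrics` is likewise hypothetical. The last lines plug in the reported fence/hedge values, under the assumption that the per-class accuracy (40.1%) plays the role of recall.

```python
def per_class_metrics(confusion):
    """confusion[i][j] = number of points of true class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        actual = sum(confusion[k])                          # row sum: true class k
        predicted = sum(confusion[i][k] for i in range(n))  # column sum: predicted k
        recall = tp / actual if actual else 0.0
        precision = tp / predicted if predicted else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return correct / total, metrics


# Hypothetical counts (rows: true fence/hedge, shrub, tree).
toy = [[40, 45, 15],
       [5, 90, 5],
       [5, 10, 85]]
overall_accuracy, metrics = per_class_metrics(toy)
fence = metrics[0]   # low recall (0.4) but high precision (0.8)

# Reported fence/hedge values: precision 77.9%, recall assumed to be 40.1%.
reported_f1 = 2 * 0.779 * 0.401 / (0.779 + 0.401)
print(round(overall_accuracy, 3), fence, round(reported_f1, 3))
```

With these inputs the F1 score comes out at roughly 0.53, in line with the fence/hedge entry for DAPnet in Table 7.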

The confusion matrix of the model using multi-scale group sampling (DAPnet_MSG) is shown in Table 5. Compared to the DAPnet, the DAPnet_MSG increases the accuracy by 1.4%, 1.0%, and 1.1% for car, fence/hedge, and facade, respectively. It also reduces the rate at which fence/hedge points are misclassified as shrub by 2.8%. Overall, the DAPnet_MSG obtains an average per-class accuracy improvement of 0.64% and an increase in overall accuracy of 0.5% compared to the DAPnet. These increases demonstrate the effectiveness of incorporating the multi-scale group sampling strategy into the model.

The classification results and error maps of the DAPnet and the DAPnet_MSG are shown in Figure 6. The distribution of misclassifications is consistent between the two models: accuracies are high in regions where impervious surfaces and roofs are mixed. In contrast, according to the error maps, the mixed regions of shrub and tree have larger classification errors than the non-mixed regions. This indicates that the mixture of classes compromises the classification result of the DAPnet.

4.3 Performance evaluation

Table 6 compares our model to the benchmark methods. The DAPnet achieves an overall accuracy 5.0% higher than the best benchmark, and the DAPnet_MSG obtains the best overall accuracy (90.7%), which is 5.5% higher than the best benchmark. It also outperforms the best benchmark by an average of 3.4% in per-class accuracy. In particular, the accuracies of powerline and car are significantly improved, by 7.5% and 13.2%, respectively. Impervious surfaces and roofs, for which the best benchmark already exceeds 96%, improve by only 0.8%.
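The per-class comparison against the best benchmark can be reproduced from Table 6 with a few lines of code. The sketch below copies the per-class accuracies from the table; the IIS_7 row lists only nine values there, and placing the missing entry in the fac column is an assumption (it does not affect the per-class maxima). Variable names are illustrative.

```python
CLASSES = ["power", "low_veg", "imp_surf", "car", "fence_hedge",
           "roof", "fac", "shrub", "tree"]

# Per-class accuracies (%) from Table 6; None marks the missing IIS_7 entry
# (its column is assumed to be fac).
BENCHMARKS = {
    "IIS_7":      [40.8, 49.9, 96.5, 46.7, 39.5, 96.2, None, 52.0, 68.8],
    "UM":         [33.3, 79.5, 90.3, 32.5, 2.9, 90.5, 43.7, 43.3, 85.2],
    "HM_1":       [82.8, 65.9, 94.2, 67.1, 25.2, 91.5, 49.0, 62.7, 82.6],
    "WhuY3":      [24.7, 81.8, 91.9, 69.3, 14.7, 95.4, 40.9, 38.2, 78.5],
    "LUH":        [53.2, 72.7, 90.4, 63.3, 25.9, 91.3, 60.9, 73.4, 79.1],
    "RIT_1":      [29.8, 69.8, 93.6, 77.0, 10.4, 92.9, 47.4, 73.4, 79.3],
    "NANJ2":      [61.2, 87.7, 93.3, 55.6, 34.0, 91.6, 38.6, 72.7, 77.5],
    "PointNet":   [63.5, 81.3, 88.3, 50.5, 17.2, 78.1, 26.5, 42.2, 67.9],
    "PointNet++": [80.7, 86.2, 93.6, 74.5, 33.0, 88.4, 51.5, 62.4, 79.2],
}
DAPNET_MSG = [90.3, 89.5, 97.3, 90.2, 41.1, 97.0, 62.0, 74.0, 88.2]

# Best benchmark accuracy for each class, skipping missing entries.
best = [max(row[i] for row in BENCHMARKS.values() if row[i] is not None)
        for i in range(len(CLASSES))]

# Gain of DAPnet_MSG over the per-class best benchmark.
gains = {name: ours - ref for name, ours, ref in zip(CLASSES, DAPNET_MSG, best)}
mean_gain = sum(gains.values()) / len(gains)

for name, gain in gains.items():
    print(f"{name:12s} {gain:+.1f}")
print(f"mean gain    {mean_gain:+.1f}")
```

The powerline and car gains come out at 7.5 and 13.2 percentage points, and the mean gain over the nine classes is about 3.4.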

Table 7 shows the F1 scores of our method and the benchmark methods. The DAPnet_MSG produces an average F1 score 12.7% higher than the best benchmark method. In terms of per-class F1 score, our model has an average improvement of 9.4%. For the classes of powerline, car, fence/hedge, facade, and shrub, the improvement is even more significant, averaging 14.17%, with powerline improved the most (16.2%). Together with the class distribution in Table 1, the result shows that our method performs much better on the classes with a limited number of training samples (powerline, car, and fence/hedge). The high F1 scores indicate the effectiveness of the model.

Figure 7 shows the error maps of the best five benchmark methods compared to our method. The DAPnet and DAPnet_MSG generate fewer errors and achieve better results in the mixed regions of powerline, roof, and tree, while the benchmark methods generate many more errors in the regions where the DAPnet makes mistakes. The DAPnet therefore performs better in classifying classes in mixed regions.

4.4 Ablation study

In order to verify the effectiveness of the attention modules, we test different attention strategies: the vanilla model without attention (-wo-PAM&GAM), the model with the group attention module (-w-GAM), the model with the point attention module (-w-PAM), and the model with both attention modules (-w-PAM&GAM). The results are shown in Table 8.

Table 8 shows that, under the "-w-GAM" strategy, which incorporates only the group attention module, both models have a higher overall accuracy than under the "-wo-PAM&GAM" strategy. However, a few per-class accuracies decrease slightly: powerline is 0.8% lower for the DAPnet, and impervious surfaces is 0.4% lower for the DAPnet_MSG. Comparing the DAPnet-w-GAM with the DAPnet-wo-PAM&GAM, the group attention module improves the accuracy of the car, facade, and shrub classes by 3.1%, 3.9%, and 3.8%, respectively. Similarly, the DAPnet_MSG-w-GAM improves the accuracies of the car, facade, and shrub classes by an average of 4.7%. This indicates that the group attention module can effectively capture the long-range dependency among the car, facade, and shrub classes without being affected by their intra-class differences. The incorporation of the group attention module also enhances the per-class accuracy by an average of 2.6% over all classes.

Figure 8: The segmentation results from different models with various block sizes.

With the point attention module alone, the overall accuracy of the model is higher than that of both the "-wo-PAM&GAM" and the "-w-GAM" strategies. Compared to the "-wo-PAM&GAM" strategy, the incorporation of the point attention module significantly improves the accuracies of the powerline (8.0%), car (4.1%), roof (5.4%), and facade (8.7%) classes. For these classes, the DAPnet_MSG-w-PAM achieves improvements of 5.9%, 6.5%, 4.0%, and 7.5%, respectively. In general, the point attention module provides an average increase of 4.4% in per-class accuracy compared to the vanilla DAPnet, and an average increase of 3.8% compared to the vanilla DAPnet_MSG. Meanwhile, compared to the "-w-GAM" strategy, the per-class accuracy gains from the point attention module are higher than those from the group attention module for most classes, with shrub as the main exception. This shows that the point attention module can effectively capture the interdependence between points and improve the classification accuracies. On the other hand, the improvements from the point attention module are not significant for the low vegetation and impervious surface classes, which is consistent with the result from the group attention module.

The incorporation of both the point and group attention modules generates the best accuracy. When the "-w-PAM&GAM" strategy is adopted, the overall accuracy is improved by 5.7% for the DAPnet and 4.7% for the DAPnet_MSG compared to the "-wo-PAM&GAM" strategy. Compared to the vanilla DAPnet, the "-w-PAM&GAM" strategy generates an average per-class improvement of 8.2%, with car and shrub showing the largest improvements of 13.3% and 11.7%, respectively. The overall accuracy with the double self-attention module is 4.1% higher than with the group attention module alone and 2.9% higher than with the point attention module alone. Similarly, the DAPnet_MSG-w-PAM&GAM generates a significant accuracy increase on car (12.1%) and shrub (10.0%), which is also higher than with either single attention module. Its overall accuracy is 3.1% and 2.4% higher than those of the "-w-GAM" and "-w-PAM" strategies, respectively. These results show that adopting the double self-attention module improves the accuracy the most, and that the MSG operation further enhances the accuracy.
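The overall-accuracy gains quoted for each strategy follow directly from the OA column of Table 8. A minimal sketch of that arithmetic is given below; the dictionary layout and the helper name `gains_over_baseline` are illustrative choices.

```python
# Overall accuracy (%) from Table 8 for each backbone and attention strategy.
OA = {
    "DAPnet":     {"wo": 84.5, "GAM": 86.1, "PAM": 87.3, "PAM&GAM": 90.2},
    "DAPnet_MSG": {"wo": 86.0, "GAM": 87.6, "PAM": 88.3, "PAM&GAM": 90.7},
}


def gains_over_baseline(scores):
    """Gain of every attention strategy over the model without attention."""
    return {name: round(value - scores["wo"], 1)
            for name, value in scores.items() if name != "wo"}


ablation = {model: gains_over_baseline(scores) for model, scores in OA.items()}
for model, gains in ablation.items():
    print(model, gains)
```

For both backbones the gain from using the two modules together (5.7 and 4.7 points) is larger than the sum of the gains from each module used alone (4.4 and 3.9 points), which is consistent with the conclusion that the combined strategy performs best.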

4.5 Visualization of results

Figure 9: The training phases of loss and overall accuracy from different models.

Figure 8 shows the classification results of three random blocks in the test set under different strategies. Block 1 is a mixed region consisting mainly of shrub and tree. The overall accuracy of the "-w-PAM&GAM" strategy is 12% higher than that of the "-wo-PAM&GAM" strategy. According to the error maps, the two models perform similarly in some areas: for example, both perform well on the roof but worse in the regions where roof and impervious surface are mixed. However, the "-w-PAM&GAM" strategy performs well in the mixed region of low vegetation, shrub, and tree. In the mixed region of roof, low vegetation, and tree (block 2), the "-w-PAM&GAM" strategy also shows a significantly better result than the "-wo-PAM&GAM" strategy, and the error map indicates that it classifies the mixed area of roof and low vegetation better. In block 3, a mixed region of roof and impervious surface, the overall accuracy of the "-w-PAM&GAM" strategy is also the best (84.9%). According to the error map, the models with the double self-attention module classify the region of tree and impervious surfaces well. Similar to block 1, the models perform worse in the regions with a mix of roof and impervious surface. Although blocks with a mix of roof and impervious surface are harder to classify, the double self-attention module still achieves better classification results, which demonstrates its effectiveness for modeling the interdependence between points.

The double self-attention module improves the classification accuracy without slowing the convergence of the model. The training phases of the models are shown in Figure 9. All of the models converge at around 100 epochs, and the "-w-PAM&GAM" and "-wo-PAM&GAM" models converge at approximately the same epoch. The incorporation of the double self-attention module also reduces the loss values quickly, which suggests that it helps rather than hinders model convergence.

5 Conclusion and Discussion

In this study, we propose a novel model called DAPnet, which uses a double self-attention module to segment point cloud objects of various shapes and scales. The double self-attention module consists of a point attention module and a group attention module, which model the long-range dependencies among points to extract important contextual features. The DAPnet then combines the extracted geometric and contextual features to achieve good segmentation results. In addition, the DAPnet is trained end to end and directly processes the raw point clouds, which avoids information loss during data preprocessing.
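To make the idea of modeling dependencies between all pairs of points (or groups) more tangible, the following is a simplified sketch of generic dot-product self-attention with a residual connection. It is not the authors' implementation: the learned projection layers of the actual modules are omitted, and the function names and the scaling weight `gamma` are hypothetical. Rows of `features` can be read as per-point or per-group feature vectors.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def self_attention(features, gamma=1.0):
    """Return (attention, output) for N feature vectors of length C.

    attention[i][j] weights how much element j contributes to element i;
    output[i] = gamma * sum_j attention[i][j] * features[j] + features[i].
    """
    n, c = len(features), len(features[0])
    # Pairwise similarity between every pair of elements (N x N).
    scores = [[sum(a * b for a, b in zip(features[i], features[j]))
               for j in range(n)] for i in range(n)]
    attention = [softmax(r) for r in scores]
    output = []
    for i in range(n):
        context = [sum(attention[i][j] * features[j][k] for j in range(n))
                   for k in range(c)]
        output.append([gamma * context[k] + features[i][k] for k in range(c)])
    return attention, output


# Three toy feature vectors: the first two are similar, the third is different.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
attention, enhanced = self_attention(feats, gamma=0.5)
print([round(w, 3) for w in attention[0]])
```

Every element attends to every other element regardless of distance, which is what allows such a module to capture long-range dependencies; similar elements receive larger weights than dissimilar ones.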

In the experiment, our method obtains an overall accuracy of 90.7% on the ISPRS 3D Semantic Labeling data, which outperforms the state-of-the-art benchmark (85.2%) by 5.5%. Our model also achieves a better average per-class accuracy (+3.4%) than the benchmark methods. In particular, the accuracies of powerline and car are improved by 7.5% and 13.2%, respectively. Through ablation experiments, we find that the two attention modules perform differently across classes. Compared with the strategy without any attention module, the double self-attention module significantly improves the results of various classes, especially car and shrub, which improve by 10.0% or more. The group attention module enhances the accuracy by about 3% on the classes of car, facade, and shrub, while the point attention module enhances the accuracy by about 5% on the classes of powerline, car, facade, and tree. We find that the positive effect of adding the point attention module is larger than that of adding the group attention module. This might be because the group attention module is a higher-level point attention module and cannot model the dependencies at the level of individual points as the point attention module does. The ablation experiments show that adding the double self-attention module is the best strategy in terms of both overall accuracy and per-class accuracy. Moreover, the incorporation of the double self-attention module does not affect the convergence speed of the model, and the module can be flexibly transferred to other models or applications.

DAPnet has the following limitations: (1) In data preprocessing, the extracted features of point clouds are limited by the block size and the number of sampled points. When the scale or shape of an object is larger than the block, the model fails to extract its features, which leads to wrong classifications. (2) In some regions with mixed classes, such as roof and impervious surfaces, the double self-attention module cannot classify the points well. This might be because the enhancement of features by the attention matrix places too much emphasis on certain classes and biases the classification. We also find that both our method and the benchmark methods have low accuracies on the fence/hedge class. This might be because the spatial relationship of the mixed classes is complicated: long-range dependencies exist not only between points within a class but also between points of different classes. These features are difficult to capture, even when enhanced by the attention module. In future work, we will address the limitation imposed by the sampling scale of point cloud data and model the correlation between mixed classes.


This work was supported by the National Natural Science Foundation of China (grant numbers 41871364, 41871276, 41871302, and 41861048).