The official TensorFlow implementation of LSANet
Directly learning features from the point cloud has become an active research direction in 3D understanding. Existing learning-based methods usually construct local regions from the point cloud and extract the corresponding features using a shared Multi-Layer Perceptron (MLP) and max pooling. However, most of these processes do not adequately take the spatial distribution of the point cloud into account, limiting the ability to perceive fine-grained patterns. We design a novel Local Spatial Attention (LSA) module to adaptively generate attention maps according to the spatial distribution of local regions. The feature learning process that integrates these attention maps can effectively capture the local geometric structure. We further propose the Spatial Feature Extractor (SFE), which constructs a branch architecture, to better aggregate the spatial information with the associated features in each layer of the network. The experiments show that our network, named LSANet, can achieve performance on par with or better than the state-of-the-art methods when evaluated on challenging benchmark datasets. The source code is available at https://github.com/LinZhuoChen/LSANet.
With the rapid growth of various 3D sensors, how to effectively understand the 3D point cloud data captured by those sensors is becoming a fundamental requirement. In the 2D image processing domain, deep convolutional neural network (CNN) based methods have achieved great success in almost all computer vision tasks. Unfortunately, it is still difficult to directly migrate these CNN-based techniques to research on 3D point sets. Point sets are invariant to permutations and cannot be accurately represented by regular lattices, making methods that succeed in the 2D image domain unsuitable for the 3D case. The most common direction is transforming 3D data into voxel grids [Maturana and Scherer2015, Riegler et al.2017, Wang et al.2017, Engelcke et al.2017, Graham et al.2018] or multiple views of 2D images [Su et al.2015] to take advantage of existing operations used in 2D images. However, this leads to negative issues such as quantization artifacts and inefficient computation [Qi et al.2017a, Li et al.2018a].
Recently, some seminal works have attempted to process point cloud data directly by developing specific deep learning methods, e.g., PointNet [Qi et al.2017a] and PointNet++ [Qi et al.2017b]. As a pioneering work, PointNet introduces a simple yet efficient network architecture, but its feature extraction is point-wise and thus cannot exploit local region information. PointNet++ [Qi et al.2017b] obtains local regions by using farthest point sampling (FPS) and ball query algorithms, then extracts the features of each local region, achieving excellent results on different 3D datasets. However, the feature extraction operations in PointNet++, which are implemented by a shared Multi-Layer Perceptron (MLP) and max pooling, are independent of the spatial structure in the local region and thus cannot capture the geometric pattern explicitly, as shown in Fig. 1 (a). To overcome this difficulty, SpiderCNN uses a complicated family of parametrized non-linear functions, where the parameters of the convolution are determined according to the spatial coordinates in the local region. However, these operations only consider the spatial information of a single point instead of the entire spatial distribution of the local region, as shown in Fig. 1 (b), and thus deal with geometric transforms poorly. Moreover, in PointNet++ and its improved versions, the raw spatial coordinates of points, which are relative to their center point in the local region, are concatenated with the point features in each layer of the network to alleviate the limitations of the per-point operation. However, the local coordinates have a different dimension and representation from the associated features. PointCNN [Li et al.2018b] alleviates this problem by lifting them into a higher-dimensional and more abstract representation. In this way, as the network deepens, semantic information is gradually enriched in the associated features but remains fixed in the coordinates.
In this paper, we propose a new network layer, named the Local Spatial Aware (LSA) layer, to model the geometric structure in a local region accurately and robustly. Each feature extraction operation in the LSA layer is related to Spatial Distribution Weights (SDWs), which are learned based on the spatial distribution in the local region, to establish a strong link with the inherent geometric shape. As a result, these processes can take the local spatial distribution into account, as shown in Fig. 1 (c), and thus perceive fine-grained shape patterns. We also solve the problem that the semantic information of the coordinates stays fixed as the network deepens by using a hierarchical Spatial Feature Extractor (SFE). Our new network architecture, named LSANet, which is composed of LSA layers, is shown in Fig. 2. In summary, our contributions are as follows:
A novel Local Spatial Aware (LSA) layer is proposed. It establishes the relationship between each operation and the spatial distribution through SDWs, which can capture geometric structures more accurately and robustly.
Our LSANet, which takes the LSA layer as its basic unit, is designed to better integrate spatial coordinates with intermediate-layer features, and achieves state-of-the-art results on benchmark datasets.
Volumetric and Multi-view approach: Volumetric approach converts the point sets to a regular 3D grid where the 3D convolution can be applied [Maturana and Scherer2015, Qi et al.2016]. However, the 3D convolution usually introduces high computation cost, and the volumetric representations are often inefficient due to the sparse property of the point sets. Some existing works [Riegler et al.2017, Wang et al.2017, Engelcke et al.2017, Graham et al.2018, Klokov and Lempitsky2017] aim at improving computational performance. For instance, some representations for deep learning with sparse 3D data are proposed such as Octree [Wang et al.2017], Kd-Tree [Klokov and Lempitsky2017]. In [Engelcke et al.2017], the authors use a feature-centric voting scheme to implement a fast 3D convolution. While in [Graham et al.2018], a new sparse convolutional operation is introduced to perform efficient 3D convolution on sparse data. Multi-view approaches convert the 3D point sets to a collection of 2D views so that the popular 2D convolutional operations can be applied on the converted data [Su et al.2015, Kalogerakis et al.2017]. As an example, the multi-view CNN [Su et al.2015] constructs the CNN for each view, and a view pooling procedure is used to aggregate the extracted features of each view.
Point-based approach: PointNet [Qi et al.2017a] is the milestone work for directly processing point sets using deep neural networks. It extracts each point's feature with a shared MLP and aggregates them with a symmetric function, such as max pooling, which is independent of the input order. However, PointNet [Qi et al.2017a] cannot combine the information of neighboring points. To address this issue, PointNet++ [Qi et al.2017b] uses FPS and neighborhood query algorithms to sample centroids and their neighboring points, and then extracts their features using a shared MLP and max pooling. The feature extraction operations mentioned above still do not take the local spatial distribution into account, as shown in Fig. 1 (a): in existing methods, the operations on points at different spatial locations use the same weighting factors. In contrast, by combining the SDWs with the subsequent operations, we make such processes spatially variant.
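To make the limitation concrete, the following is a minimal numpy sketch (our own illustrative names, not the authors' code) of the PointNet-style building block described above: a shared MLP applied per point, followed by max pooling. Note that the same weights `W`, `b` are applied to every point regardless of its spatial location, and that the max-pooled output is invariant to point order.

```python
import numpy as np

def shared_mlp_max_pool(points, W, b):
    """PointNet-style local feature extraction (illustrative sketch).

    points: (K, C_in) features of K points in a local region.
    W: (C_in, C_out), b: (C_out,). The SAME weights are applied to
    every point, regardless of where the point sits in the region.
    """
    per_point = np.maximum(points @ W + b, 0.0)  # shared MLP + ReLU
    return per_point.max(axis=0)                 # symmetric max pooling

rng = np.random.default_rng(0)
pts = rng.normal(size=(8, 3))
W, b = rng.normal(size=(3, 16)), np.zeros(16)

out = shared_mlp_max_pool(pts, W, b)
# Max pooling is a symmetric function: permuting the points changes nothing.
assert np.allclose(out, shared_mlp_max_pool(pts[::-1], W, b))
```

The LSA layer keeps this permutation invariance but, unlike the block above, modulates the computation per point according to the region's spatial distribution.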
There are some other concurrent point-based approaches that process point sets using deep learning, such as [Li et al.2018a, Huang et al.2018, Shen et al.2018, Li et al.2018b, Su et al.2018]. In particular, SO-Net [Li et al.2018a] applies a self-organizing network to point set processing. RSNet [Huang et al.2018] uses a Recurrent Neural Network (RNN) to process point sets. KCNet [Shen et al.2018] introduces kernel correlation to combine neighborhood information. PointCNN [Li et al.2018b] learns a transform of the point sets to permute them into a canonical order. In [Su et al.2018], the point features are projected into regular domains so that typical CNNs can be applied. Sparse data can also be represented as meshes [Monti et al.2017] or graphs [Defferrard et al.2016, Yi et al.2017], and some works aim at learning features from these representations. We refer the reader to [Masci et al.2016] for a more comprehensive survey.
First, we introduce the method for extracting the spatial distribution feature of a local region; then the generation of the Spatial Distribution Weights (SDWs), which is based on the spatial distribution feature, is described in depth. Finally, we elaborate on the integration of the SDWs with other operations and introduce our LSANet.
Let the relative coordinate of each point in a local region be $\Delta p_i \in \mathbb{R}^3$, $i = 1, \dots, K$, where $K$ is the number of points in the local region. The spatial distribution feature consists of two parts: one is the spatial feature of the point itself, and the other is the spatial feature of the local region where the point is located.

The spatial feature of the point can be expressed as:

$$f_i = W_p \, \Delta p_i,$$

where $W_p \in \mathbb{R}^{C_s \times 3}$, and $f_i \in \mathbb{R}^{C_s}$, which is the spatial feature of the point itself.
We use the following formula to encode the spatial distribution of the whole local region:

$$g = \sum_{i=1}^{K} W_r \, \Delta p_i,$$

where $W_r \in \mathbb{R}^{C_s \times 3}$. As shown above, $g$ encodes the spatial information of all points in the local region. To preserve permutation invariance, we apply the same weight $W_r$ to all points in the local region.
We concatenate the spatial feature of each point with the spatial distribution of the region and get the final spatial distribution feature:

$$s_i = f_i \oplus g,$$

where $\oplus$ denotes the concatenation operation, and $s_i \in \mathbb{R}^{2C_s}$ is the spatial distribution feature of each point. It is generated by the formula above and associated not only with the point's own spatial location but also with all points of the local region, encoding the spatial information explicitly. Different points in the same local region share the same $g$. We will utilize each point's spatial distribution feature to generate the SDWs next.
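The construction above can be sketched in a few lines of numpy. This is our own minimal sketch (the weight names `W_point`, `W_region` are ours): each point's relative coordinate is lifted on its own, a shared-weight sum produces a permutation-invariant region code, and the two are concatenated.

```python
import numpy as np

def spatial_distribution_feature(rel_coords, W_point, W_region):
    """Sketch of the spatial distribution feature (our notation).

    rel_coords: (K, 3) coordinates relative to the region's centroid.
    W_point:   (3, C) lifts each point's own coordinate.
    W_region:  (3, C) shared across all K points, so the summed region
               code is invariant to the ordering of the points.
    """
    point_feat = rel_coords @ W_point             # per-point spatial feature
    region_feat = (rel_coords @ W_region).sum(0)  # permutation-invariant code
    region_tiled = np.tile(region_feat, (rel_coords.shape[0], 1))
    return np.concatenate([point_feat, region_tiled], axis=1)  # (K, 2C)

rng = np.random.default_rng(1)
rel = rng.normal(size=(16, 3))
Wp, Wr = rng.normal(size=(3, 8)), rng.normal(size=(3, 8))
sdf = spatial_distribution_feature(rel, Wp, Wr)

assert sdf.shape == (16, 16)
# The region half of the feature is identical for every point in the region...
assert np.allclose(sdf[0, 8:], sdf[5, 8:])
# ...and invariant to permuting the region's points.
assert np.allclose(spatial_distribution_feature(rel[::-1], Wp, Wr)[:, 8:], sdf[:, 8:])
```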
Suppose the feature of a local region in the $l$-th sub-layer is $F^l \in \mathbb{R}^{K \times C_l}$, where $C_l$ denotes the number of channels in the $l$-th sub-layer, $K$ is the number of points in the local region, and $l$ is the index of sub-layers in the LSA layer.

We use the SDWs generator to generate SDWs for the subsequent feature extraction operations. The SDWs generator takes the spatial distribution features of the local region as input, expressed as $S = \{s_i\}_{i=1}^{K}$, where $i$ is the index of the neighboring points. Note that $s_i$ is related to the spatial structure of the point and its local region. In order to generate the first SDWs for the corresponding operation, we define the SDWs as $D^1 = \varphi(S; \theta_g)$, where $\varphi$ is a non-linear function determined by the learnable parameters $\theta_g$. In this work, we use a fully connected network as $\varphi$ to get the first SDWs $D^1$, which can be expressed as:

$$D^1_i = \sigma\!\left(W^1_g \, s_i\right),$$

where $\sigma$ denotes the sigmoid function, $W^1_g \in \mathbb{R}^{C_1 \times 2C_s}$, and the subscript $g$ means that the weight belongs to our SDWs generator. The output $D^1_i$ has the same dimension as the point feature $F^1_i$. We can use the following formulation to generate new SDWs for the further feature learning process:

$$D^{l+1}_i = \sigma\!\left(W^{l+1}_g \, D^l_i\right),$$

where $W^{l+1}_g \in \mathbb{R}^{C_{l+1} \times C_l}$. Note that $D^{l+1}_i$ shares the same dimension as $F^{l+1}_i$. We use the activation function $\sigma$ after each $W^l_g$ to introduce nonlinearity. The formulas above therefore generate the expected SDWs, which are related to the spatial distribution in each local region. Note that the process can be easily extended to multiple local regions. Fig. 3 (b) shows the whole process.
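The hierarchical generator above can be sketched as follows. This is our own numpy illustration under the stated recurrence: the first SDWs are produced from the spatial distribution features, and each later sub-layer's SDWs are produced from the previous ones, so every set stays tied to the region's spatial layout.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate_sdws(sdf, generator_weights):
    """Sketch of the SDWs generator (weight names are ours).

    sdf: (K, C_s) spatial distribution features of one local region.
    generator_weights: [ (C_s, C_1), (C_1, C_2), ... ]; sub-layer l+1's
    SDWs are derived from sub-layer l's SDWs.
    """
    sdws, d = [], sdf
    for Wg in generator_weights:
        d = sigmoid(d @ Wg)   # sigmoid keeps each weight in (0, 1)
        sdws.append(d)
    return sdws

rng = np.random.default_rng(2)
sdf = rng.normal(size=(16, 6))
weights = [rng.normal(size=(6, 32)), rng.normal(size=(32, 64))]
d1, d2 = generate_sdws(sdf, weights)

# Each D^l matches the channel count of the corresponding sub-layer feature.
assert d1.shape == (16, 32) and d2.shape == (16, 64)
assert np.all((d2 > 0) & (d2 < 1))
```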
Next, we show how our SDWs participate in other feature extraction operations, which allows the feature extraction processes to take the local spatial distribution into account. For example, combining the SDWs with the shared MLP can be expressed as follows:

$$F^{l+1}_i = W^l \left( D^l_i \odot F^l_i \right),$$

where $\odot$ denotes element-wise multiplication, $W^l \in \mathbb{R}^{C_{l+1} \times C_l}$ is the weight of the shared MLP operation shown in Fig. 3 (a), $D^l_i \in \mathbb{R}^{C_l}$, and $F^l_i \in \mathbb{R}^{C_l}$. As shown in the formula above, the value of $W^l$ is independent of the spatial coordinate of the point and shared across different points in the local region. After being combined with the SDWs $D^l_i$, the effective weight becomes related to the spatial distribution. For each point in the local region, the layer can adaptively learn to assign different weights according to its spatial distribution, with which the local shape pattern can be captured better. The entire process is shown in Fig. 3 (a).
The max pooling operation selects the point with the strongest response in each channel, regardless of the spatial relationship in the local region. However, by combining the SDWs, the pooling operation can be guided to select the optimal point based on its spatial distribution, which can be formulated as:

$$G = \max_{i=1,\dots,K} \left( D^L_i \odot F^L_i \right),$$

where the $\max$ is taken channel-wise, $D^L_i, F^L_i \in \mathbb{R}^{C_L}$, $G \in \mathbb{R}^{C_L}$, and $L$ is the index of the last sub-layer. Our experiments show that combining the SDWs with the feature extraction operations leads to better results, as reported in Tab. 3.
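The two SDW-guided operations above, the modulated shared MLP and the guided max pooling, can be sketched together. This is our own numpy illustration of the element-wise modulation; function and variable names are ours.

```python
import numpy as np

def lsa_feature_extraction(features, sdw, W):
    """Sketch: one SDW-modulated shared MLP step (our notation).

    features: (K, C_in) point features; sdw: (K, C_in) spatial distribution
    weights; W: (C_in, C_out). Multiplying by the SDWs first makes the
    effectively applied weights vary with each point's spatial location.
    """
    return np.maximum((features * sdw) @ W, 0.0)

def sdw_guided_max_pool(features, sdw):
    """Sketch: responses are reweighted by the SDWs before the channel-wise
    max, so the selected point depends on the spatial distribution."""
    return (features * sdw).max(axis=0)

rng = np.random.default_rng(3)
feats = rng.normal(size=(16, 32))
sdw_in = rng.uniform(size=(16, 32))
W = rng.normal(size=(32, 64))

out = lsa_feature_extraction(feats, sdw_in, W)
pooled = sdw_guided_max_pool(out, rng.uniform(size=(16, 64)))
assert out.shape == (16, 64) and pooled.shape == (64,)
```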
To better combine spatial coordinates with the features in each layer, we propose an additional branch architecture, in which a Spatial Feature Extractor (SFE) is mounted to obtain a high-dimensional spatial representation, as shown in Fig. 2.
The input of the SFE is either the raw coordinates of the local regions or the spatial features from the previous SFE. To lift the dimension of the coordinates, we feed the input spatial coordinate information to a shared MLP. We then combine the output of the shared MLP with the input and use the result as the spatial information that flows into the backbone network for abstract representation. Finally, we use another shared MLP to further enhance the spatial representation and pass it on to the next SFE. In this way, we lift the dimension of the raw coordinates and obtain a more abstract representation layer by layer.
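One SFE stage, as described above, can be sketched as follows. This is our own reading of the branch (names and the concatenation choice for "combine" are our assumptions): a shared MLP lifts the spatial input, the lifted code joined with the input flows to the backbone, and a second MLP refines what flows to the next SFE.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sfe(spatial_in, W_lift, W_out):
    """Sketch of one Spatial Feature Extractor stage (our notation).

    spatial_in: (K, C) raw coordinates (first stage) or the previous SFE's
    output. Returns the feature handed to the backbone and the refined
    spatial feature passed to the next SFE.
    """
    lifted = relu(spatial_in @ W_lift)                       # lift dimension
    to_backbone = np.concatenate([spatial_in, lifted], axis=1)
    to_next_sfe = relu(lifted @ W_out)                       # refine further
    return to_backbone, to_next_sfe

rng = np.random.default_rng(4)
coords = rng.normal(size=(16, 3))
tb, tn = sfe(coords, rng.normal(size=(3, 16)), rng.normal(size=(16, 32)))
assert tb.shape == (16, 19) and tn.shape == (16, 32)
```

Stacking such stages gives spatial features whose abstraction grows with depth, matching the semantic level of the backbone's features at each layer.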
The architecture of LSANet is shown in Fig. 2. Note that we use LSA layer as our basic unit, and add the additional branch to enhance the spatial feature representation using the SFE. We use farthest point sampling (FPS) and ball query algorithms to sample and group, which are the same as PointNet++. The output features of the last LSA layer are aggregated by a fully connected network for classification. The segmentation model extends the classification model using the FP module in PointNet++ [Qi et al.2017b] to upsample the reduced points and outputs per-point scores for semantic labels.
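For reference, farthest point sampling, used above to choose the region centroids, admits a simple greedy implementation. This plain-numpy version is our own sketch (real pipelines, including PointNet++, use a GPU implementation): each new centroid maximizes its distance to the points already chosen, giving good coverage of the set.

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Greedy farthest point sampling (illustrative numpy sketch).

    points: (N, 3) point cloud; returns m indices. `dist` tracks each
    point's distance to its nearest already-chosen centroid.
    """
    chosen = [0]  # start from an arbitrary point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(m - 1):
        nxt = int(dist.argmax())  # farthest from all chosen so far
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

rng = np.random.default_rng(5)
cloud = rng.normal(size=(128, 3))
idx = farthest_point_sampling(cloud, 16)
assert idx.shape == (16,) and len(set(idx.tolist())) == 16
```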
We evaluate the performance of the proposed LSA Layer and LSANet with extensive experiments. First, the experimental results of our LSANet and other state-of-the-art point-based approaches on the ModelNet40 [Wu et al.2015], ShapeNet [Yi et al.2016], ScanNet [Dai et al.2017], and S3DIS [Armeni et al.2016] are shown in Sec. 4.1. Second, we perform the ablation study to validate our LSANet design, and then visualize what our LSA layer learns in Sec. 4.2. At last, we analyze the space and time complexity in Sec. 4.3.
| Method | MN40 pre-aligned (class avg) | MN40 pre-aligned (overall) | MN40 unaligned (class avg) | MN40 unaligned (overall) | ShapeNet (class mIoU) | ShapeNet (inst. mIoU) | ScanNet (OA) | S3DIS (OA) | S3DIS (mIoU) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| KCNet [Shen et al.2018] | - | 91.0% | - | - | 82.2% | 84.7% | - | - | - |
| Kd-Net [Klokov and Lempitsky2017] | 88.5% | 91.8% | - | - | 77.4% | 82.3% | - | - | - |
| SO-Net [Li et al.2018a] | - | 90.9% | - | - | 81.0% | 84.9% | - | - | - |
| PCNN [Atzmon et al.2018] | - | 92.3% | - | - | 81.8% | 85.1% | - | - | - |
| SPLATNet [Su et al.2018] | - | - | - | - | 83.7% | 85.4% | - | - | - |
| SpecGCN [Wang et al.2018] | - | - | - | 91.5% | - | 85.4% | 84.8% | - | - |
| SpiderCNN [Xu et al.2018] | - | - | - | 90.5% | 81.7% | 85.3% | 81.7% | - | - |
| SCN [Xie et al.2018] | 87.6% | 90.0% | - | 84.6% | - | - | - | - | - |
| PointNet [Qi et al.2017a] | - | - | 86.2% | 89.2% | 80.4% | 83.7% | - | 78.5% | 47.6% |
| PointNet++ [Qi et al.2017b] | - | - | - | 90.7% | 81.9% | 85.1% | 84.5% | - | - |
| SyncSpecCNN [Yi et al.2017] | - | - | - | - | 82.0% | 84.8% | - | - | - |
| PointCNN [Li et al.2018b] | 88.8% | 92.5% | 88.1% | 92.2% | 84.6% | 86.1% | 85.1% | 85.14% | 65.39% |
| RSNet [Huang et al.2018] | - | - | - | - | 81.4% | 84.9% | - | - | 56.5% |
| SPG [Landrieu and Simonovsky2018] | - | - | - | - | - | - | - | 85.5% | 62.1% |
Dataset: We apply our LSANet on the following datasets:
ModelNet40 [Wu et al.2015]: This dataset includes 12,311 CAD models from 40 categories; we use the official split, with 9,843 models for training and 2,468 for testing. To obtain 3D points, we sample 1,024 points uniformly from each mesh model.
ShapeNet [Yi et al.2016]: The ShapeNet part dataset consists of 16,880 models from 16 shape categories with 50 different parts in total, and each shape is annotated with 2 to 6 parts. Following [Qi et al.2017b], we use 14,006 models for training and 2,874 for testing; 2,048 points are sampled uniformly from each CAD model, and each point is associated with a part label. These points with their surface normals are used as input, assuming that the category labels are known.
ScanNet [Dai et al.2017]: ScanNet is a large-scale semantic segmentation dataset containing 2.5M views in 1,513 scenes. Since ScanNet is constructed from real-world 3D scans of indoor scenes, it is more challenging than the synthesized 3D datasets. In our experiment, we follow the configuration in [Qi et al.2017b] and use 1,201 scenes for training and 312 scenes for testing, with 8,192 points as input. We remove the RGB information in this experiment and only use the spatial coordinates as input.
S3DIS [Armeni et al.2016]: The S3DIS dataset contains 3D scans in 6 areas including 271 rooms. Each point is annotated with the label from 13 categories. We follow the way in [Qi et al.2017a] to prepare training data and split the training and testing set with k-fold strategy. 8192 points are sampled in each block randomly for training. We use XYZ, RGB and normalized location on each point as input.
Network Configuration: The configuration of LSANet is shown in Tab. 2. We use the Adam optimizer with an initial learning rate of 0.002 and exponential decay: the learning rate is multiplied by 0.7 every 40 epochs. We use the ReLU activation function. Batch normalization is applied after each MLP, and the batch size is 32. We train LSANet for 250 epochs on two NVIDIA GTX 1080Ti GPUs.
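The stated schedule, exponential decay of 0.7 applied every 40 epochs starting from 0.002, can be written as a small helper (our own sketch of the arithmetic, not the repository's code):

```python
def learning_rate(epoch, base_lr=0.002, decay=0.7, step=40):
    """Staircase exponential decay: multiply by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)

assert learning_rate(0) == 0.002
assert abs(learning_rate(40) - 0.0014) < 1e-12    # 0.002 * 0.7
assert abs(learning_rate(80) - 0.00098) < 1e-12   # 0.002 * 0.7**2
```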
Results: Tab. 1 compares our results with state-of-the-art works on the datasets mentioned above. For the classification task, we divide the settings into pre-aligned and unaligned, according to whether the models are rotated randomly during the training or testing phase, since a large portion of the 3D models in ModelNet40 are pre-aligned. For a fair comparison, we report our LSANet's performance in both settings, using overall accuracy as the evaluation metric. For an input of 1,024 points without surface normals, under the unaligned setting our method achieves 1.6% higher overall accuracy than the multi-scale grouping (MSG) network of PointNet++, even though we do not use multi-scale grouping in the LSA layer. Our LSANet also outperforms PointNet++'s MSG architecture when the latter uses both 5,000 points and surface normals as input. These results show the effectiveness of our LSA layer, and in general we achieve better accuracy than the other methods in both settings. In the segmentation task, we evaluate our LSANet on ShapeNet, ScanNet, and S3DIS. Our method outperforms all the compared methods, such as PointNet++, which lacks our LSA layer and additional branch. Our LSANet also outperforms the approaches based on [Qi et al.2017a], such as SpecGCN [Wang et al.2018] and SpiderCNN [Xu et al.2018]. Visualizations of segmentation results on ShapeNet are shown in Fig. 4.
We now validate our proposed LSANet design by control experiments with classification task on the ModelNet40 [Wu et al.2015] dataset under unaligned settings, and then we visualize the SDWs generated by our LSA module.
Module validation: We demonstrate the positive effects of our LSA Layer and network architecture by ablation experiment. We also remove the integration of SDWs from max pooling and the region spatial encoder part of LSA Layer in Fig. 3 to verify their effectiveness. The detailed results are shown in Tab. 3.
As shown in these experiments, the LSA Layer and SFE bring 1.1% and 0.8% accuracy improvement respectively, illustrating the effectiveness of our design. We also observe that the region spatial encoder of LSA module improves the results, which shows the validity of the whole region information. The results also show that the max pooling combined with SDWs can select the optimal point based on its spatial distribution and achieve better effects.
| Model | Accuracy |
| --- | --- |
| baseline + SFE | 91.4% |
| baseline + LSA (w/o region spatial encoder) | 91.5% |
| baseline + LSA (w/o max pooling) | 91.4% |
| baseline + LSA | 91.7% |
| baseline + LSA + SFE (ours) | 92.3% |
Sampling density: We test the robustness of our LSANet to sampling density by using 1,024, 512, 256, 128, and 64 points sampled from the ModelNet40 dataset as input, with the network trained on 1,024 points. We use random input dropout during training for a fair comparison. The test results are shown in Fig. 6, where we compare our LSANet with PointNet [Qi et al.2017a] and PointNet++ [Qi et al.2017b]. As the number of points decreases, the shapes become increasingly difficult to recognize even for humans, yet our LSANet performs well across the different numbers of points.
Visualization of the SDWs: In Fig. 5, we randomly pick 512 representative points, together with their neighboring points, from an object in the test set of the ModelNet40 [Wu et al.2015] dataset, and visualize the response of the LSA layer to these local regions before the MLP in each channel (as discussed in Sec. 3.3). We can clearly see that our SDWs develop different directional preferences in each channel. This mechanism allows our LSANet to effectively perceive fine-grained patterns by learning SDWs.
We further compare both space and time complexity with other methods, using the classification network. Tab. 4 shows that our LSANet has a moderate number of parameters and fast inference time. In addition, our segmentation network involves fewer parameters than our classification network (see Tab. 5).
| Method | #Params | Inference time |
| --- | --- | --- |
| PointNet++ (SSG) [Qi et al.2017b] | 1.48M | 0.027s |
| SpecGCN [Wang et al.2018] | 2.05M | 11.254s |
| SpiderCNN [Xu et al.2018] | 5.84M | 0.085s |
In this work, we propose a novel LSA Layer and LSANet. Based on such new design, our LSANet has more powerful spatial information extraction capabilities and provides on par or better results than state-of-the-art approaches on standard benchmarks for different 3D recognition tasks including object classification, part segmentation, and semantic segmentation. We also provide ablation experiments and visualizations to illustrate the effectiveness of our LSANet design.