LSANet: Feature Learning on Point Sets by Local Spatial Attention

05/14/2019 · by Lin-Zhuo Chen et al.

Directly learning features from point clouds has become an active research direction in 3D understanding. Existing learning-based methods usually construct local regions from the point cloud and extract the corresponding features using a shared Multi-Layer Perceptron (MLP) and max pooling. However, most of these processes do not adequately take the spatial distribution of the point cloud into account, limiting their ability to perceive fine-grained patterns. We design a novel Local Spatial Attention (LSA) module to adaptively generate attention maps according to the spatial distribution of local regions. The feature learning process, integrated with these attention maps, can effectively capture the local geometric structure. We further propose the Spatial Feature Extractor (SFE), which constructs a branch architecture, to better aggregate the spatial information with associated features in each layer of the network. The experiments show that our network, named LSANet, achieves on-par or better performance than state-of-the-art methods when evaluated on challenging benchmark datasets. The source code (an official TensorFlow implementation) is publicly available.




1 Introduction

* Equal contribution

With the rapid growth of various 3D sensors, effectively understanding the 3D point cloud data they capture is becoming a fundamental requirement. In the 2D image domain, deep convolutional neural network (CNN) based methods have achieved great success in almost all computer vision tasks. Unfortunately, it is still difficult to directly migrate these CNN-based techniques to research on 3D point sets. Point sets are invariant to permutation and cannot be accurately represented by regular lattices, making methods that succeed in the 2D image domain unsuitable for the 3D case. The most common workaround is transforming the 3D data into voxel grids [Maturana and Scherer2015, Riegler et al.2017, Wang et al.2017, Engelcke et al.2017, Graham et al.2018] or multiple 2D views [Su et al.2015] to take advantage of existing operations for 2D images. However, this leads to issues such as quantization artifacts and inefficient computation [Qi et al.2017a, Li et al.2018a].


Figure 1:

We illustrate the feature learning process for the target point (hollow circle) using its neighbor points (colored circles). (a), (b), and (c) are three identical local regions; different colors of points represent different weights. (a) The feature extraction process of PointNet++ [Qi et al.2017b]: the weight of each point is fixed and independent of the spatial information, limiting its ability to extract geometric patterns. (b) The feature extraction process of SpiderCNN [Xu et al.2018]: the weight of each point depends on the vector to the center point, but it does not fully consider the spatial distribution of the whole region, leading to sensitivity to spatial transformations. (c) An example operation in our LSA layer: we integrate our spatially-oriented SDWs into a shared MLP. In our model, the weight of each point is related to all points in the local region, so it captures the local geometric structure sufficiently and is much more robust to geometric transformations.
Figure 2: The architecture of LSANet for classification and segmentation: the backbone of our network is composed of SFE and LSA layers. The LSA layer, which consists of the SDWs generator and the feature learning process, generates SDWs according to the spatial structure of each local region and integrates them with the feature learning process. The details of the LSA layer are shown in Fig. 3. For SFE, we sample and group the spatial coordinates as input, as shown in A, and lift the dimensions of the coordinates as output, as shown in B. The spatial feature, as shown in C, also flows into the next SFE for a hierarchical feature representation. $N$ represents the number of points, $K$ denotes the number of points in the local region, $C$ indicates the dimension of each point feature, and $l$ is the index of the LSA layer.

Recently, some seminal works attempted to process point cloud data directly by developing specific deep learning methods, e.g. PointNet [Qi et al.2017a] and PointNet++ [Qi et al.2017b]. As a pioneering work, PointNet introduces a simple yet efficient network architecture, but its feature extraction is point-wise and thus cannot exploit local region information. PointNet++ [Qi et al.2017b] obtains local regions using farthest point sampling (FPS) and ball query algorithms, then extracts the features of each local region, achieving excellent results on different 3D datasets. However, the feature extraction operations in PointNet++, which are performed by a shared Multi-Layer Perceptron (MLP) and max pooling, are independent of the spatial structure in the local region and thus cannot capture the geometric pattern explicitly, as shown in Fig. 1 (a). To overcome this difficulty, SpiderCNN uses a complicated family of parametrized non-linear functions, where the parameters of the convolution are determined according to the spatial coordinates in the local region. However, these operations only consider the spatial information of a single point instead of the entire spatial distribution of the local region, as shown in Fig. 1 (b), and thus handle geometric transformations poorly. Moreover, in PointNet++ and its improved versions, the raw spatial coordinates of points, relative to the center point of their local region, are concatenated with the point features in each layer of the network to alleviate the limitations of per-point operations. However, these local coordinates have a different dimension and representation from the associated features. PointCNN [Li et al.2018b] alleviates this problem by lifting them into a higher-dimensional and more abstract representation. In this way, as the network deepens, semantic information is gradually enriched in the associated features but remains fixed in the coordinates.

In this paper, we propose a new network layer, named the Local Spatial Aware (LSA) layer, to model the geometric structure in a local region accurately and robustly. Each feature extraction operation in the LSA layer is related to Spatial Distribution Weights (SDWs), which are learned based on the spatial distribution in the local region, to establish a strong link with the inherent geometric shape. As a result, these processes can consider the local spatial distribution, as shown in Fig. 1 (c), and thus perceive fine-grained shape patterns. We also solve the problem of the semantic information of coordinates remaining fixed as the network deepens by using a hierarchical Spatial Feature Extractor (SFE). Our new network architecture, named LSANet, which is composed of LSA layers, is shown in Fig. 2. In summary, our contributions are as follows:

  • A novel Local Spatial Aware (LSA) layer is proposed; it establishes the relationship between feature extraction operations and the spatial distribution through SDWs, which captures geometric structures more accurately and robustly.

  • Our LSANet, taking the LSA layer as its basic unit, is designed to better integrate spatial coordinates with intermediate-layer features, and achieves state-of-the-art results on benchmark datasets.

Extensive experiments show that the performance of our LSANet is better than that of state-of-the-art methods. We explain the details of the proposed LSA layer and the network structure in Sec. 3. Our results on multiple challenging datasets and an ablation study are presented in Sec. 4.


Figure 3: The process of the LSA layer: (a) shows the architecture of the LSA layer. (b) shows the details of the SDWs generator. The LSA layer is composed of the SDWs generator and the spatially independent feature learning process. $N$ represents the number of points, $K$ denotes the number of points in the local region, $C$ indicates the dimension of each point feature.

2 Related Work

Volumetric and multi-view approaches: Volumetric approaches convert the point sets to a regular 3D grid where 3D convolutions can be applied [Maturana and Scherer2015, Qi et al.2016]. However, 3D convolutions usually incur high computation cost, and volumetric representations are often inefficient due to the sparsity of point sets. Several existing works [Riegler et al.2017, Wang et al.2017, Engelcke et al.2017, Graham et al.2018, Klokov and Lempitsky2017] aim at improving computational performance; for instance, representations for deep learning with sparse 3D data have been proposed, such as octrees [Wang et al.2017] and kd-trees [Klokov and Lempitsky2017]. In [Engelcke et al.2017], the authors use a feature-centric voting scheme to implement a fast 3D convolution, while in [Graham et al.2018] a new sparse convolutional operation is introduced to perform efficient 3D convolution on sparse data. Multi-view approaches convert the 3D point sets to a collection of 2D views so that popular 2D convolutional operations can be applied to the converted data [Su et al.2015, Kalogerakis et al.2017]. For example, the multi-view CNN [Su et al.2015] constructs a CNN for each view and uses a view pooling procedure to aggregate the features extracted from each view.

Point-based approach: PointNet [Qi et al.2017a] is the milestone work for directly processing point sets with deep neural networks. It extracts each point's feature with a shared MLP and aggregates them with a symmetric function, such as max pooling, which is independent of the input order. However, PointNet [Qi et al.2017a] cannot combine the information of neighboring points. To address this issue, PointNet++ [Qi et al.2017b] uses FPS and neighborhood query algorithms to sample centroids and their neighboring points, and then extracts their features using a shared MLP and max pooling. These feature extraction operations still do not take the local spatial distribution into account, as shown in Fig. 1 (a): in existing methods, the operations on points at different spatial locations use the same weighting factors. In contrast, by combining the SDWs with subsequent operations, we make such processes spatially variant.
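The farthest point sampling step used by PointNet++ to pick region centroids can be sketched in a few lines of numpy. This is an illustrative greedy implementation, not the authors' code; the deterministic start at index 0 is an assumption of this sketch (implementations often start from a random point):

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Greedy FPS: repeatedly pick the point farthest from the selected set.

    points: (N, 3) array of coordinates; n_samples: number of centroids.
    Returns the indices of the selected centroids.
    """
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    # dist[j] = distance from point j to its nearest already-selected centroid.
    dist = np.full(n, np.inf)
    selected[0] = 0  # deterministic start (a random start is also common)
    for k in range(1, n_samples):
        d = np.linalg.norm(points - points[selected[k - 1]], axis=1)
        dist = np.minimum(dist, d)
        selected[k] = int(np.argmax(dist))
    return selected

# Four corners of a unit square: FPS picks mutually distant corners first.
pts = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [1., 1., 0.]])
idx = farthest_point_sampling(pts, 2)
```

Starting from the corner at the origin, the next centroid is the diagonally opposite corner, since it maximizes the distance to the selected set.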

There are some other concurrent point-based approaches to processing point sets using deep learning, such as [Li et al.2018a, Huang et al.2018, Shen et al.2018, Li et al.2018b, Su et al.2018]. In particular, SO-Net [Li et al.2018a] applies a self-organizing network to point set processing. RSNet [Huang et al.2018] uses a Recurrent Neural Network (RNN) to process point sets. KCNet [Shen et al.2018] introduces kernel correlation to combine neighborhood information. PointCNN [Li et al.2018b] learns a transform of the point sets to permute them into a canonical order. In [Su et al.2018], the point features are projected into regular domains so that typical CNNs can be applied. Sparse data can also be represented as meshes [Monti et al.2017] or graphs [Defferrard et al.2016, Yi et al.2017], and some works aim at learning features from these representations. We refer the reader to [Masci et al.2016] for a more comprehensive survey.

3 Our Method

First, we introduce the method of extracting the spatial distribution feature of a local region; then the generation of Spatial Distribution Weights (SDWs), which is based on the spatial distribution feature, is described in depth. We then elaborate on the integration of SDWs with other operations and finally introduce our LSANet.

3.1 Extracting the spatial distribution feature

Let the relative coordinate of each point in a local region be $p_i \in \mathbb{R}^3$, $i = 1, \dots, K$, where $K$ is the number of points in the local region. The spatial distribution feature consists of two parts: the spatial feature of the point itself, and the spatial feature of the local region where the point is located.

The spatial feature of the point can be expressed as:

$$s_i = \mathrm{MLP}(p_i)$$

where $i = 1, \dots, K$, and $s_i \in \mathbb{R}^{C}$ is the spatial feature of the point itself.

We use the following formula to encode the spatial distribution of the whole local region:

$$g = \underset{i = 1, \dots, K}{\mathrm{MAX}}\; \mathrm{MLP}(p_i)$$

where $g \in \mathbb{R}^{C}$. As shown above, $g$ encodes the spatial information of all points in the local region. To preserve permutation invariance, we apply the same weights to all points in the local region and aggregate with the symmetric max pooling operation.

We concatenate the spatial feature of each point with the spatial distribution of the region to get the final spatial distribution feature:

$$d_i = s_i \oplus g$$

where $\oplus$ denotes the concatenation operation, and $d_i$ is the spatial distribution feature of each point. It is generated by the above formulas and associated not only with the point's own spatial location but also with all points of the local region, encoding the spatial information explicitly. Different points in the same local region share the same $g$. We will next utilize each point's spatial distribution feature to generate SDWs.
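As a concrete illustration, the following numpy sketch builds the spatial distribution feature for one local region. It assumes a single shared fully connected layer with ReLU as the MLP and max pooling as the symmetric aggregation; both choices and all shapes are illustrative assumptions of this sketch, not claims about the exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_mlp(x, w):
    # One shared fully connected layer with ReLU, applied to each point.
    return np.maximum(x @ w, 0.0)

K, C = 8, 16                       # points per region, feature channels
p = rng.normal(size=(K, 3))        # relative coordinates in one local region
w_point = rng.normal(size=(3, C))  # shared weights for per-point features
w_region = rng.normal(size=(3, C))  # shared weights for the region feature

s = shared_mlp(p, w_point)                       # per-point spatial feature, (K, C)
g = shared_mlp(p, w_region).max(axis=0)          # region feature via max pooling, (C,)
d = np.concatenate([s, np.tile(g, (K, 1))], 1)   # spatial distribution feature, (K, 2C)
```

Because the region feature is a max over all points, permuting the points of the region leaves it unchanged, which is the permutation invariance the text requires.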

3.2 Generation of Spatial Distribution Weights (SDWs)

Suppose the feature of a local region in the $l$-th sub-layer is $F^l \in \mathbb{R}^{K \times C_l}$, where $C_l$ denotes the number of channels of $F^l$ in the $l$-th sub-layer, $K$ is the number of points in the local region, and $l$ is the index of the sub-layers in the LSA layer.

We use the SDWs generator to generate SDWs for the subsequent feature extraction operations. The SDWs generator takes the spatial distribution features of a local region as input, expressed as $d_i$, where $i = 1, \dots, K$ is the index of the neighboring points. Note that $d_i$ is related to the spatial structure of the point and its local region. In order to generate the first SDWs for the corresponding operation, we define the SDWs as $w^1_i = f_{\theta}(d_i)$, where $f$ is a non-linear function determined by the learnable parameters $\theta$. In this work, we use a fully connected network as $f$ to get the first SDWs $w^1_i$, which can be expressed as:

$$w^1_i = \sigma(W^1_g\, d_i)$$

where $\sigma$ denotes the sigmoid function, $W^1_g \in \mathbb{R}^{C_1 \times C_d}$, and the subscript $g$ means that the weight belongs to our SDWs generator. The output $w^1_i$ has the same dimension as the point feature $F^1_i$. We can use the following formulation to generate new SDWs for the further feature learning process:


$$w^l_i = \sigma(W^l_g\, w^{l-1}_i)$$

where $W^l_g \in \mathbb{R}^{C_l \times C_{l-1}}$. Note that $w^l_i$ shares the same dimension as $F^l_i$. We use the sigmoid activation function after each $W^l_g$ to introduce nonlinearity. Therefore, the formulas above generate the expected SDWs, which are related to the spatial distribution in each local region. Note that this process can easily be extended to multiple local regions. Fig. 3 (b) shows the whole process.
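The generator chain can be sketched as stacked fully connected layers with sigmoid activations applied to each point's spatial distribution feature. The layer widths below are arbitrary illustrative choices, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

K, Cd, C1, C2 = 8, 32, 16, 24     # region size and channel widths (illustrative)
d = rng.normal(size=(K, Cd))      # spatial distribution features of one region
Wg1 = rng.normal(size=(Cd, C1))   # generator weights producing the first SDWs
Wg2 = rng.normal(size=(C1, C2))   # generator weights producing the second SDWs

w1 = sigmoid(d @ Wg1)             # first SDWs, width matches the first feature layer
w2 = sigmoid(w1 @ Wg2)            # second SDWs, width matches the next feature layer
```

The sigmoid keeps every weight in (0, 1), so each SDW acts as a per-point, per-channel gating factor for the corresponding feature layer.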

3.3 Combine SDWs with other operations

Next, we show how our SDWs participate in other feature extraction operations, which allows the feature extraction processes to take the local spatial distribution into account. For example, combining the SDWs with the shared MLP can be expressed as follows:

$$F^{l+1}_i = W^l_{mlp}\,\big(w^l_i \odot F^l_i\big)$$

where $\odot$ denotes element-wise multiplication, and the subscript $mlp$ means that the weight $W^l_{mlp}$ belongs to the shared MLP operation shown in Fig. 3 (a). As shown above, the value of $W^l_{mlp}$ is independent of the spatial coordinates and shared across different points in the local region. After being combined with the SDWs $w^l_i$, the updated weight is related to the spatial distribution. For each point in the local region, the network can adaptively learn to assign different weights according to the spatial distribution, with which the local shape pattern can be captured better. The entire process is shown in Fig. 3 (a).

The max pooling operation selects the point with the strongest response in each channel regardless of the spatial relationships in the local region. However, by combining the SDWs, the pooling operation can be guided to select the optimal point based on the spatial distribution, which can be formulated as:

$$F^{out} = \underset{i = 1, \dots, K}{\mathrm{MAX}}\;\big(w^L_i \odot F^L_i\big)$$

where $F^{out} \in \mathbb{R}^{C_L}$ and the maximum is taken per channel. Experiments (Tab. 3) show that combining the SDWs with the feature extraction operations leads to better results.
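A minimal numpy sketch of both combinations, reweighting point features with SDWs before a shared fully connected layer and before max pooling. The single-layer MLP with ReLU and all shapes are simplifying assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
K, Cin, Cout = 8, 16, 32
F = rng.normal(size=(K, Cin))           # point features of one local region
# SDWs: one gating value per point and channel, in (0, 1) via sigmoid.
w_sdw = 1.0 / (1.0 + np.exp(-rng.normal(size=(K, Cin))))
W_mlp = rng.normal(size=(Cin, Cout))    # shared MLP weight, same for all points

# Shared MLP combined with SDWs: reweight each point's features first.
F_next = np.maximum((w_sdw * F) @ W_mlp, 0.0)

# SDW-guided max pooling over the region (per-channel maximum).
w_pool = 1.0 / (1.0 + np.exp(-rng.normal(size=(K, Cout))))
region_feat = (w_pool * F_next).max(axis=0)
```

Without the SDWs, every point would pass through the same weights regardless of where it sits in the region; the element-wise gating is what makes the operation spatially variant.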

3.4 LSANet Architecture

To better combine spatial coordinates with the features in each layer, we propose an additional branch architecture in which a Spatial Feature Extractor (SFE) is mounted to obtain a high-dimensional spatial representation, as shown in Fig. 2.

The input of the SFE is either the raw coordinates of local regions or the spatial feature from the previous SFE. To lift the dimensionality of the coordinates, we feed the input spatial coordinates to a shared MLP. We then combine the output of the shared MLP with the input and use the result as the spatial information that flows into the backbone network for abstract representation. Finally, we use another shared MLP to further enhance the spatial representation and pass it on to the next SFE. In this way, we lift the dimensionality of the raw coordinates and obtain a more abstract representation layer by layer.
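One way to read this data flow, sketched in numpy under the assumptions that "combine" means concatenation and that each stage is a single fully connected layer with ReLU; the layer widths are illustrative, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(3)

def relu(x):
    return np.maximum(x, 0.0)

def sfe(spatial_in, w_lift, w_out):
    """One SFE stage: lift the input coordinates, concatenate with the input
    (the feature sent to the backbone), and refine it for the next SFE."""
    lifted = relu(spatial_in @ w_lift)
    to_backbone = np.concatenate([spatial_in, lifted], axis=-1)
    to_next_sfe = relu(to_backbone @ w_out)
    return to_backbone, to_next_sfe

N = 128
coords = rng.normal(size=(N, 3))                 # raw coordinates of sampled points
w_lift1 = rng.normal(size=(3, 29))               # lifts 3-D coords to 29 channels
w_out1 = rng.normal(size=(32, 32))               # refines the 3+29 concatenation
b1, s1 = sfe(coords, w_lift1, w_out1)            # b1 -> backbone, s1 -> next SFE
```

Stacking such stages gives spatial features whose abstraction deepens layer by layer, instead of concatenating the same fixed raw coordinates at every layer.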

The architecture of LSANet is shown in Fig. 2. We use the LSA layer as our basic unit and add the additional branch to enhance the spatial feature representation using the SFE. We use the farthest point sampling (FPS) and ball query algorithms for sampling and grouping, the same as in PointNet++. The output features of the last LSA layer are aggregated by a fully connected network for classification. The segmentation model extends the classification model using the feature propagation (FP) module of PointNet++ [Qi et al.2017b] to upsample the reduced points, and outputs per-point scores for semantic labels.

4 Experiments

We evaluate the performance of the proposed LSA layer and LSANet with extensive experiments. First, the results of our LSANet and other state-of-the-art point-based approaches on ModelNet40 [Wu et al.2015], ShapeNet [Yi et al.2016], ScanNet [Dai et al.2017], and S3DIS [Armeni et al.2016] are presented in Sec. 4.1. Second, we perform an ablation study to validate our LSANet design and visualize what our LSA layer learns in Sec. 4.2. Finally, we analyze the space and time complexity in Sec. 4.3.

Task       Classification                            Segmentation
Dataset    ModelNet40                                ShapeNet        ScanNet   S3DIS
           Pre-aligned       Unaligned
Metric     mA      OA        mA      OA              mpIoU   pIoU    OA        OA      mIoU
KCNet [Shen et al.2018] - 91.0% - - 82.2% 84.7% - - -
Kd-Net[Klokov and Lempitsky2017] 88.5% 91.8% - - 77.4% 82.3% - - -
SO-Net [Li et al.2018a] - 90.9% - - 81.0% 84.9% - - -
PCNN [Atzmon et al.2018] - 92.3% - - 81.8% 85.1% - - -
SPLATNet[Su et al.2018] - - - - 83.7% 85.4% - - -
SpecGCN [Wang et al.2018] - - - 91.5% - 85.4% 84.8% - -
SpiderCNN [Xu et al.2018] - - - 90.5% 81.7% 85.3% 81.7% - -
SCN [Xie et al.2018] 87.6% 90.0% - 84.6% - - -
PointNet [Qi et al.2017a] - - 86.2% 89.2% 80.4% 83.7% - 78.5% 47.6%
PointNet++ [Qi et al.2017b] - - - 90.7% 81.9% 85.1% 84.5% - -
SyncSpecCNN [Yi et al.2017] - - - - 82.0% 84.8% - - -
PointCNN [Li et al.2018b] 88.8% 92.5% 88.1% 92.2% 84.6% 86.1% 85.1% 85.14% 65.39%
RSNet [Huang et al.2018] - - - - 81.4% 84.9% - - 56.5%
SPG [Landrieu and Simonovsky2018] - - - - - - - 85.5% 62.1%
LSANet (ours) 90.3% 93.2% 89.2% 92.3% 83.2% 85.6% 85.1% 86.8% 62.2%
Table 1: Comparisons with other point-based networks on ModelNet40 [Wu et al.2015] in per-class accuracy (mA) and overall accuracy (OA), ShapeNet [Yi et al.2016] in part-averaged IoU (pIoU) and mean per-class pIoU (mpIoU), ScanNet [Dai et al.2017] in per-voxel overall accuracy (OA), and S3DIS [Armeni et al.2016] in overall accuracy (OA) and mean per-class IoU (mIoU).

4.1 Classification and Segmentation Tasks

Dataset: We apply our LSANet on the following datasets:

  • ModelNet40 [Wu et al.2015]: This dataset includes 12,311 CAD models from 40 categories; we use the official split of 9,843 models for training and 2,468 for testing. To get the 3D points, we sample 1,024 points uniformly from each mesh model.

  • ShapeNet [Yi et al.2016]: ShapeNet contains 16,880 models from 16 shape categories annotated with 50 different parts in total, and each shape is annotated with 2 to 6 parts. Following [Qi et al.2017b], we use 14,006 models for training and 2,874 for testing. 2,048 points are sampled uniformly from each CAD model, and each point is associated with a part label. These points with their surface normals are used as input, assuming that the category labels are known.

  • ScanNet [Dai et al.2017]: ScanNet is a large-scale semantic segmentation dataset containing 2.5M views in 1,513 scenes. Since ScanNet is constructed from real-world 3D scans of indoor scenes, it is more challenging than synthesized 3D datasets. In our experiment, we follow the configuration in [Qi et al.2017b] and use 1,201 scenes for training and 312 scenes for testing, with 8,192 points as input. We discard the RGB information in these experiments and only use the spatial coordinates as input.

  • S3DIS [Armeni et al.2016]: The S3DIS dataset contains 3D scans of 6 areas comprising 271 rooms. Each point is annotated with a label from 13 categories. We follow [Qi et al.2017a] to prepare the training data and split the training and testing sets with a k-fold strategy. 8,192 points are randomly sampled in each block for training. We use the XYZ coordinates, RGB values, and normalized location of each point as input.

Network Configuration: The configuration of LSANet is shown in Tab. 2. We use the Adam optimizer with an initial learning rate of 0.002, which decays exponentially by a factor of 0.7 every 40 epochs. We use the ReLU activation function, and batch normalization is applied after each MLP. The batch size is 32. We train LSANet for 250 epochs on two NVIDIA GTX 1080Ti GPUs.
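The stated schedule corresponds to a staircase exponential decay, which can be written out explicitly (the staircase form is an assumption; a smooth per-step decay is also common):

```python
def lr_at_epoch(epoch, base_lr=0.002, decay=0.7, step=40):
    # Staircase exponential decay: multiply the rate by 0.7 every 40 epochs.
    return base_lr * decay ** (epoch // step)
```

For example, the learning rate stays at 0.002 for the first 40 epochs, drops to 0.0014 at epoch 40, and so on through the 250 training epochs.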

Datasets          L1           L2             L3              L4
ModelNet40  N     512          128            1               -
            K     32           64             128             -
            MLP   (64,64,128)  (128,128,256)  (256,512,1024)  -
ShapeNet    N     512          128            1               -
            K     64           64             128             -
            MLP   (64,64,128)  (128,128,256)  (256,512,1024)  -
ScanNet     N     1024         256            64              16
            K     32           32             32              32
            MLP   (32,32,64)   (64,64,128)    (128,128,256)   (256,256,512)
S3DIS       N     1024         256            64              16
            K     32           32             32              32
            MLP   (32,32,64)   (64,64,128)    (128,128,256)   (256,256,512)
Table 2: The backbone architecture of our LSANet for each dataset. In each LSA layer, $N$ stands for the number of local regions, $K$ represents the number of points in each local region, and MLP stands for the output dimensions of the shared MLP in the LSA layer.



Figure 4: Visualizations of segmentation results on ShapeNet. From left to right: ground truth (GT), PointNet++, and ours.
Figure 5: Visualization of the SDWs: we visualize the channel-wise SDWs of the LSA layer over the spatial coordinates of local regions, before the shared MLP operation. In this figure, we randomly sample 6 of the 64 feature channels. It can be observed that our SDWs are spatially correlated in each feature channel.

Results: Tab. 1 compares our results with state-of-the-art works on the datasets mentioned above. For the classification task, we divide the settings into pre-aligned and unaligned, according to whether models are randomly rotated during the training or testing phase, because a large portion of the 3D models in ModelNet40 are pre-aligned. To compare fairly, we report our LSANet's performance in both settings, using overall accuracy as the evaluation metric. For an input of 1,024 points without surface normals, in terms of overall accuracy under the unaligned setting, our method achieves 1.6% higher accuracy than the multi-scale grouping (MSG) network of PointNet++, even though we do not use multi-scale grouping in the LSA layer. Our LSANet also outperforms PointNet++'s MSG architecture when the latter uses both 5,000 points and surface normals as input. These results show the effectiveness of our LSA layer, and in general we achieve better accuracy than other methods in both settings. In the segmentation task, we evaluate our LSANet on ShapeNet, ScanNet, and S3DIS. Our method outperforms the compared methods such as PointNet++, which lacks our LSA layer and additional branch. Our LSANet also outperforms approaches based on [Qi et al.2017a], such as SpecGCN [Wang et al.2018] and SpiderCNN [Xu et al.2018]. Visualizations of segmentation results on ShapeNet are shown in Fig. 4.

4.2 Analysis and Visualization

We now validate our proposed LSANet design through control experiments on the ModelNet40 [Wu et al.2015] classification task under the unaligned setting, and then visualize the SDWs generated by our LSA module.

Module validation: We demonstrate the positive effects of our LSA layer and network architecture through ablation experiments. We also remove the integration of SDWs from max pooling, as well as the region spatial encoder part of the LSA layer in Fig. 3, to verify their effectiveness. The detailed results are shown in Tab. 3.

As shown in these experiments, the LSA layer and SFE bring 1.1% and 0.8% accuracy improvements respectively, illustrating the effectiveness of our design. We also observe that the region spatial encoder of the LSA module improves the results, which shows the value of whole-region information. The results also show that max pooling combined with SDWs can select the optimal point based on the spatial distribution and achieve better results.

Network Configuration OA
baseline 90.6%
baseline + SFE 91.4%
baseline + LSA (w/o region spatial encoder) 91.5%
baseline + LSA (w/o max-pooling) 91.4%
baseline + LSA 91.7%
baseline + LSA + SFE (ours) 92.3%
Table 3: Ablation study on ModelNet40 classification task under unaligned settings.

Sampling density: We test the robustness of our LSANet to sampling density by using 1024, 512, 256, 128, and 64 points sampled from the ModelNet40 dataset as input. Our LSANet is trained on ModelNet40 using 1024 points, with random input dropout applied during training for a fair comparison. The test results are shown in Fig. 6, where we compare our LSANet with PointNet [Qi et al.2017a] and PointNet++ [Qi et al.2017b]. As the number of points decreases, the shapes become harder to recognize even for humans, but our LSANet performs well across different numbers of points.


Figure 6: Test results of using different numbers of points as input to the same model trained with 1024 points.

Visualization of the SDWs: In Fig. 5, we randomly pick 512 representative points, together with their neighboring points, from an object in the test set of the ModelNet40 [Wu et al.2015] dataset, and visualize the response of the LSA layer to these local regions before the shared MLP in each channel (as discussed in Sec. 3.3). It can be seen that our SDWs exhibit different directional preferences in each channel. This module helps our LSANet effectively perceive fine-grained patterns by learning SDWs.

4.3 Complexity Analysis

We further compare both space and time complexity with other methods, using the classification network. Tab. 4 shows that our LSANet has a moderate number of parameters and fast inference time. In addition, our segmentation networks involve fewer parameters than our classification network (see Tab. 5).

Method Parameters Inference time
PointNet++ (SSG) [Qi et al.2017b] 1.48M 0.027s
SpecGCN [Wang et al.2018] 2.05M 11.254s
SpiderCNN [Xu et al.2018] 5.84M 0.085s
LSANet (ours) 2.30M 0.060s
Table 4: Comparison of different methods in the number of parameters and inference time for the classification task.
Datasets Task Parameters
ModelNet40 [Wu et al.2015] Classification 2.30M
ShapeNet [Yi et al.2016] Segmentation 2.24M
ScanNet [Dai et al.2017] Segmentation 1.36M
S3DIS [Armeni et al.2016] Segmentation 1.41M
Table 5: The number of our LSANet’s parameters on four datasets.

5 Conclusion

In this work, we propose a novel LSA layer and LSANet. Based on this new design, LSANet has more powerful spatial information extraction capabilities and provides on-par or better results than state-of-the-art approaches on standard benchmarks for different 3D recognition tasks, including object classification, part segmentation, and semantic segmentation. We also provide ablation experiments and visualizations to illustrate the effectiveness of our LSANet design.


  • [Armeni et al.2016] Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. In CVPR, pages 1534–1543, 2016.
  • [Atzmon et al.2018] Matan Atzmon, Haggai Maron, and Yaron Lipman. Point convolutional neural networks by extension operators. ACM TOG, 37(4):71, 2018.
  • [Dai et al.2017] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas A Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, volume 2, page 10, 2017.
  • [Defferrard et al.2016] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, pages 3844–3852, 2016.
  • [Engelcke et al.2017] Martin Engelcke, Dushyant Rao, Dominic Zeng Wang, Chi Hay Tong, and Ingmar Posner. Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks. In ICRA, pages 1355–1361, 2017.
  • [Graham et al.2018] Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. In CVPR, June 2018.
  • [Huang et al.2018] Qiangui Huang, Weiyue Wang, and Ulrich Neumann. Recurrent slice networks for 3d segmentation of point clouds. In CVPR, pages 2626–2635, 2018.
  • [Kalogerakis et al.2017] Evangelos Kalogerakis, Melinos Averkiou, Subhransu Maji, and Siddhartha Chaudhuri. 3d shape segmentation with projective convolutional networks. In CVPR, volume 1, page 8, 2017.
  • [Klokov and Lempitsky2017] Roman Klokov and Victor Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In ICCV, Oct 2017.
  • [Landrieu and Simonovsky2018] Loic Landrieu and Martin Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In CVPR, June 2018.
  • [Li et al.2018a] Jiaxin Li, Ben M Chen, and Gim Hee Lee. So-net: Self-organizing network for point cloud analysis. In CVPR, pages 9397–9406, 2018.
  • [Li et al.2018b] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on x-transformed points. In NIPS, pages 828–838, 2018.
  • [Masci et al.2016] Jonathan Masci, Emanuele Rodolà, Davide Boscaini, Michael M Bronstein, and Hao Li. Geometric deep learning. In SIGGRAPH ASIA 2016 Courses, page 1. ACM, 2016.
  • [Maturana and Scherer2015] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In IROS, pages 922–928, 2015.
  • [Monti et al.2017] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In CVPR, volume 1, page 3, 2017.
  • [Qi et al.2016] Charles R Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J Guibas. Volumetric and multi-view cnns for object classification on 3d data. In CVPR, pages 5648–5656, 2016.
  • [Qi et al.2017a] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, July 2017.
  • [Qi et al.2017b] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, pages 5099–5108, 2017.
  • [Riegler et al.2017] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. Octnet: Learning deep 3d representations at high resolutions. In CVPR, volume 3, 2017.
  • [Shen et al.2018] Yiru Shen, Chen Feng, Yaoqing Yang, and Dong Tian. Mining point cloud local structures by kernel correlation and graph pooling. In CVPR, June 2018.
  • [Su et al.2015] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In ICCV, pages 945–953, 2015.
  • [Su et al.2018] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. Splatnet: Sparse lattice networks for point cloud processing. In CVPR, pages 2530–2539, 2018.
  • [Wang et al.2017] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. TOG, 36(4):72, 2017.
  • [Wang et al.2018] Chu Wang, Babak Samari, and Kaleem Siddiqi. Local spectral graph convolution for point set feature learning. In ECCV, 2018.
  • [Wu et al.2015] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR, pages 1912–1920, 2015.
  • [Xie et al.2018] Saining Xie, Sainan Liu, Zeyu Chen, and Zhuowen Tu. Attentional shapecontextnet for point cloud recognition. In CVPR, pages 4606–4615, 2018.
  • [Xu et al.2018] Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao. Spidercnn: Deep learning on point sets with parameterized convolutional filters. In ECCV, 2018.
  • [Yi et al.2016] Li Yi, Vladimir G Kim, Duygu Ceylan, I Shen, Mengyan Yan, Hao Su, Cewu Lu, Qixing Huang, Alla Sheffer, Leonidas Guibas, et al. A scalable active framework for region annotation in 3d shape collections. TOG, 35(6):210, 2016.
  • [Yi et al.2017] Li Yi, Hao Su, Xingwen Guo, and Leonidas J Guibas. Syncspeccnn: Synchronized spectral cnn for 3d shape segmentation. In CVPR, pages 6584–6592, 2017.