1 Introduction
In recent robotics, autonomous driving and virtual/augmented reality applications, sensors that can directly obtain 3D data are increasingly ubiquitous. These include indoor sensors such as laser scanners, time-of-flight sensors such as the Kinect, RealSense or Google Tango, and structured light sensors such as the one on the iPhone X, as well as outdoor sensors such as LIDAR and MEMS sensors. The capability to directly measure 3D data is invaluable in those applications, since depth information can remove many of the segmentation ambiguities in 2D imagery, and surface normals provide important cues of the scene geometry.
In 2D images, convolutional neural networks (CNNs) have fundamentally changed the landscape of computer vision by greatly improving results on almost every vision task. CNNs succeed by exploiting translation invariance, so that the same set of convolutional filters can be applied at all locations in an image, reducing the number of parameters and improving generalization. We would hope such successes to transfer to the analysis of 3D data. However, 3D data often come in the form of point clouds, i.e. sets of unordered 3D points, with or without additional features (e.g. RGB) on each point. Point clouds are unordered and do not conform to the regular lattice grids of 2D images, making it difficult to apply conventional CNNs to such input. An alternative is to treat the 3D space as a volumetric grid, but in this case the volume will be sparse and CNNs become computationally intractable at high resolutions.
In this paper, we propose a novel approach to perform convolutions on 3D point clouds with non-uniform sampling. We note that the conventional 2D convolution operation can be viewed as a discrete approximation of a continuous convolution operator. In 3D space, we can treat the weights of this convolution operator as a (Lipschitz) continuous function of the local 3D point coordinates with respect to a reference 3D point. The continuous function can be approximated by a multi-layer perceptron (MLP), as done in [31] and [14], but these algorithms did not take non-uniform sampling into account. We propose to use an inverse density scale to reweight the continuous function learned by the MLP, which corresponds to the Monte Carlo approximation of the continuous convolution. We call such an operation PointConv. PointConv takes the positions of the points as input and learns an MLP to approximate a weight function, then applies an inverse density scale to the learned weights to compensate for the non-uniform sampling.
The naive implementation of PointConv is memory-inefficient when the channel size of the output features is very large, and hence hard to train and scale up to large networks. In order to reduce the memory consumption of PointConv, we introduce a reformulation that changes the summation order and greatly increases memory efficiency. The new structure is capable of building multi-layer deep convolutional networks on 3D point clouds with capabilities similar to those of 2D CNNs on raster images. We achieve the same translation-invariance as 2D convolutional networks, as well as invariance to permutations of the ordering of points in a point cloud.
In segmentation tasks, the ability to transfer information gradually from coarse layers to finer layers is important. Hence, a deconvolution operation [22] that can fully leverage the features from a coarse layer in a finer layer is vital for performance. Most state-of-the-art algorithms [24, 26] are unable to perform deconvolution, which restricts their performance on segmentation tasks. Since our PointConv is a full approximation of convolution, it is natural to extend it to a PointDeconv, which can fully utilize the information in coarse layers and propagate it to finer layers. By using PointConv and PointDeconv, we achieve improved performance on semantic segmentation tasks.
The key contributions of our work are:
We propose PointConv, a density reweighted convolution, which is able to fully approximate the 3D continuous convolution on any set of 3D points.
We design a memory efficient approach to implement PointConv using a change of summation order technique.
We extend our PointConv to a deconvolution version (PointDeconv) to achieve better segmentation results.
Experiments show that our deep network built on PointConv is highly competitive against other point cloud deep networks and achieves state-of-the-art results on the part segmentation [2] and indoor semantic segmentation benchmarks [5]. In order to demonstrate that our PointConv is indeed a true convolution operation, we also evaluate PointConv on CIFAR-10 by converting all pixels in a 2D image into a point cloud with 2D coordinates and RGB features on each point. Experiments on CIFAR-10 show that the classification accuracy of our PointConv with a 5-layer network is 89.13%, comparable to that of AlexNet [18], which has similar depth, and far outperforming the previous best result achieved by a point cloud network. Being a basic approach to CNNs on 3D data, we believe there could be many potential applications of this approach in the near future.
2 Related Work
Most work on 3D CNNs converts 3D point clouds to 2D images or 3D volumetric grids. [34, 25] proposed to project 3D point clouds or shapes into several 2D images and then apply 2D convolutional networks for classification. Although these approaches have achieved dominating performance on shape classification and retrieval tasks, it is nontrivial to extend them to high-resolution scene segmentation tasks [5]. [40, 21, 25] represent another type of approach that voxelizes point clouds into volumetric grids by quantization and then applies 3D convolutional networks. This type of approach is constrained by its 3D volumetric resolution and the computational cost of 3D convolutions. [29] improves the resolution significantly by using a set of unbalanced octrees where each leaf node stores a pooled feature representation. Kd-networks [16] compute the representations in a feed-forward bottom-up fashion on a kd-tree of a certain size. In a Kd-network, the number of input points needs to be the same during training and testing, which does not hold for many tasks.
Some recent work [28, 24, 26, 33, 35, 11, 8, 37] directly takes raw point clouds as input without converting them to other formats. [24, 28] proposed to use shared multi-layer perceptrons and max-pooling layers to obtain features of point clouds. Because the max-pooling layers are applied across all the points in the point cloud, it is difficult to capture local features. PointNet++ [26] improved the network in PointNet [24] by adding a hierarchical structure. The hierarchical structure is similar to the one used in image CNNs, which extracts features starting from small local regions and gradually extending to larger regions. The key structure used in both PointNet [24] and PointNet++ [26] to aggregate features from different points is max-pooling. However, max-pooling layers keep only the strongest activation across a local or global region, which may lose detailed information that is useful for segmentation tasks. [33] presents a method, called SPLATNet, that projects the input features of the point clouds onto a high-dimensional lattice and then applies bilateral convolution on that lattice to aggregate features. SPLATNet [33] is able to give results comparable to PointNet++ [26]. The tangent convolution [35] projects the local surface geometry onto a tangent plane around every point, which gives a set of tangent images that can be convolved with planar convolutions. The pointwise convolution [11] queries nearest neighbors on the fly and bins the points into kernel cells, then applies kernel weights on the binned cells to convolve on point clouds. Flex-convolution [8] introduced a generalization of the conventional convolution layer along with an efficient GPU implementation, which can be applied to point clouds with millions of points. FeaStNet [37] proposes to generalize the conventional convolution layer to 3D point clouds by adding a soft-assignment matrix. The works [31, 14, 38] and [41] propose to learn continuous filters to perform convolution. [14] proposed that the weight filter in 2D convolution can be treated as a continuous function, which can be approximated by MLPs. [31] first introduced this idea to 3D graph structures. Both [14] and [31] are able to achieve very good results on classification tasks. Dynamic graph CNN [38] proposed a depthwise convolution on graphs along with a method that dynamically updates the graph.
[41] presents a special family of filters to approximate the weight function instead of using MLPs. Our paper differs from these approaches in three aspects. First, we use density to compensate for the non-uniform sampling in point clouds. Second, we propose a memory-efficient version of PointConv, which was not addressed in [31] and [14]. Third, we propose a deconvolution operator based on PointConv to perform semantic segmentation.
PointCNN [19] is another approach to perform convolution on point clouds. The idea in PointCNN [19] is to learn a transformation from the input points and then use it to simultaneously weight and permute the input features associated with the points. For a local region, the transformation is a matrix learned from the coordinates of the input points with multi-layer perceptrons. Compared with our approach, PointCNN is unable to achieve permutation-invariance, which is desirable for point clouds.
3 PointConv
We propose a convolution operation, called PointConv, which extends traditional image convolution to point clouds. PointConv is a Monte Carlo approximation of the 3D continuous convolution operator. For each convolutional filter, it uses an MLP to approximate a weight function, then applies an inverse density scale to reweight the learned weights. Sec. 3.1 introduces the structure of the PointConv layer, and Sec. 3.2 introduces PointDeconv, which uses PointConv layers to deconvolve features.
3.1 Convolution on 3D Point Clouds
Images can be interpreted as 2D discrete functions, which are usually represented as grid-shaped matrices. In a CNN, each filter is restricted to a small local region, such as 3 x 3 or 5 x 5. Within each local region, the relative positions between different pixels are always fixed, as shown in Figure 1 (a), so the filter can be easily discretized to a summation with a real-valued weight for each location within the local region:

$$\mathrm{Conv}(W, F)_{xy} = \sum_{(\delta_x, \delta_y) \in \Omega} W(\delta_x, \delta_y)\, F(x + \delta_x, y + \delta_y) \qquad (1)$$
A point cloud is represented as a set of 3D points $\{p_i \mid i = 1, \ldots, n\}$, where each point contains a position vector $(x, y, z)$ and features such as color, surface normal, etc. Different from images, point clouds have more flexible shapes: the coordinates of a point are not located on a fixed grid but can take arbitrary continuous values. Thus, the relative positions of different points are diverse in each local region, and conventional discretized convolution filters on raster images cannot be applied directly on the point cloud. Fig. 1 shows the difference between a local region in an image and one in a point cloud.
To make convolution compatible with point sets, we propose a permutationinvariant convolution operation, called PointConv. Our idea is to first go back to the continuous version of 3D convolution as:
$$\mathrm{Conv}(W, F)_{xyz} = \iiint_{(\delta_x, \delta_y, \delta_z) \in G} W(\delta_x, \delta_y, \delta_z)\, F(x + \delta_x, y + \delta_y, z + \delta_z)\, d\delta_x\, d\delta_y\, d\delta_z \qquad (2)$$
$F(x + \delta_x, y + \delta_y, z + \delta_z)$ is the feature of a point in the local region $G$ centered around point $p = (x, y, z)$. A point cloud can be viewed as a non-uniform sample from the continuous space, where $(\delta_x, \delta_y, \delta_z)$ can be any possible position in the local region. We define PointConv as a Monte Carlo estimate of the continuous 3D convolution with respect to an importance sampling:
$$\mathrm{PointConv}(S, W, F)_{xyz} = \sum_{(\delta_x, \delta_y, \delta_z) \in G} S(\delta_x, \delta_y, \delta_z)\, W(\delta_x, \delta_y, \delta_z)\, F(x + \delta_x, y + \delta_y, z + \delta_z) \qquad (3)$$
$S(\delta_x, \delta_y, \delta_z)$ is the inverse density at point $(\delta_x, \delta_y, \delta_z)$. $S$ is required because the point cloud can be sampled very non-uniformly. (To see this, note the Monte Carlo estimate of an integral from a biased sample: $\int f(x)\,dx = \int \frac{f(x)}{p(x)} p(x)\,dx \approx \frac{1}{n}\sum_i \frac{f(x_i)}{p(x_i)}$, for $x_i \sim p(x)$.) Point clouds are often biased samples because many sensors have difficulties measuring points near plane boundaries, hence the need for this reweighting. Intuitively, the number of points in a local region varies across the whole point cloud, as in Figure 2 (b) and (c). Besides, in Figure 2 (c), points are very close to one another, hence the contribution of each should be smaller.
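The importance-sampling identity above is easy to check numerically. Below is a minimal sketch (ours, not the paper's code): we integrate f(x) = x^2 over [0, 1] from a deliberately biased Beta(2, 1) sample and recover the true value 1/3 by reweighting each sample with its inverse density, exactly the role the scale S plays in Eq. (3).

```python
import numpy as np

# Monte Carlo estimate of an integral over [0, 1] from a non-uniform
# sample, reweighted by the inverse sampling density (the role of S).
def mc_integral(f, samples, density):
    # E[f(X)/p(X)] equals the integral of f when X ~ p, so the estimator
    # is the mean of f(x_i) / p(x_i) over the biased sample.
    return np.mean(f(samples) / density(samples))

rng = np.random.default_rng(0)
# Biased sample: Beta(2, 1) concentrates points near 1.
x = rng.beta(2.0, 1.0, size=200_000)
p = lambda t: 2.0 * t          # Beta(2, 1) density on [0, 1]
f = lambda t: t ** 2           # integrand; true integral is 1/3

est = mc_integral(f, x, p)
```

Without the 1/p reweighting, the same sample would overweight the region near 1 and overestimate the integral.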
Our main idea is to approximate the weight function $W(\delta_x, \delta_y, \delta_z)$ by multi-layer perceptrons from the 3D coordinates $(\delta_x, \delta_y, \delta_z)$, and the inverse density $S(\delta_x, \delta_y, \delta_z)$ by a kernelized density estimation [36] followed by a nonlinear transform implemented with an MLP. Because the weight function highly depends on the distribution of the input point cloud, we call the entire convolution operation PointConv. Prior work [14, 31] considered the approximation of the weight function but did not consider the approximation of the density scale, and hence is not a full approximation of the continuous convolution operator.
The weights of the MLP in PointConv are shared across all the points to maintain permutation invariance. To compute the inverse density scale $S(\delta_x, \delta_y, \delta_z)$, we first estimate the density of each point in the point cloud offline, then feed the density into an MLP for a 1D nonlinear transform. The reason to use a nonlinear transform is so that the network can decide adaptively whether to use the density estimates. In our experiments, we use kernel density estimation (KDE) to estimate the density of each point in a point cloud.
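As an illustration of the density estimation step, here is a minimal Gaussian KDE sketch in NumPy. The bandwidth value and the O(N^2) pairwise formulation are our choices for clarity, not the paper's implementation:

```python
import numpy as np

def kde_density(points, bandwidth=0.1):
    """Gaussian kernel density estimate at every point of a cloud.

    points: (N, 3) array. Returns an (N,) density; its reciprocal can
    serve as the inverse density scale before the MLP transform.
    """
    diff = points[:, None, :] - points[None, :, :]      # (N, N, 3)
    sq = np.sum(diff ** 2, axis=-1)                     # squared distances
    # Sum of Gaussian kernels centered at every point (self included).
    kernel = np.exp(-sq / (2.0 * bandwidth ** 2))
    return kernel.sum(axis=1) / len(points)

rng = np.random.default_rng(1)
# A cluster of 50 tightly packed points plus 50 spread-out points.
dense = rng.normal(0.0, 0.02, size=(50, 3))
sparse = rng.uniform(-1.0, 1.0, size=(50, 3))
d = kde_density(np.vstack([dense, sparse]))
# Points inside the tight cluster get higher density estimates, so the
# inverse scale 1/d shrinks their individual contributions.
```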
Figure 3 shows the PointConv operation on a $K$-point local region. Let $C_{in}, C_{out}$ be the channel sizes of the input and output features, and let $k, c_{in}, c_{out}$ be the indices of the $k$-th neighbor, the $c_{in}$-th channel of the input feature and the $c_{out}$-th channel of the output feature. The inputs are the 3D local positions of the points $P_{local} \in \mathbb{R}^{K \times 3}$, which can be computed by subtracting the coordinate of the centroid of the local region, and the feature $F_{in} \in \mathbb{R}^{K \times C_{in}}$ of the local region. We use $1 \times 1$ convolution to implement the MLP. The output of the weight function is $W \in \mathbb{R}^{K \times (C_{in} \times C_{out})}$, so $W(k, c_{in}) \in \mathbb{R}^{C_{out}}$ is a vector. The density scale is $S \in \mathbb{R}^{K}$. After convolution, the feature $F_{in}$ from a local region with $K$ neighbour points is encoded into the output feature $F_{out} \in \mathbb{R}^{C_{out}}$, as shown in Eq. (4).
$$F_{out} = \sum_{k=1}^{K} \sum_{c_{in}=1}^{C_{in}} S(k)\, W(k, c_{in})\, F_{in}(k, c_{in}) \qquad (4)$$
Our PointConv learns a network to approximate the continuous weights for convolution. For each input point, the weights can be computed from the MLP using its relative coordinates. Figure 2 (a) shows an example continuous weight function for convolution. With a point cloud input as a discretization of the continuous input, a discrete convolution can be computed as in Figure 2 (b) to extract the local features, and this works (with potentially different approximation accuracy) for different point cloud samples (Figure 2 (b-d)), including a regular grid (Figure 2 (d)). Note that in a raster image, the relative positions in a local region are fixed. Then PointConv, which takes only relative positions as input to the weight function, would output the same weights and density across the whole image, so it reduces to convolution in the traditional sense.
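To make Eq. (4) concrete, here is a naive single-region PointConv in NumPy. The `weight_mlp` below is a hypothetical stand-in (a fixed random linear map; the real network learns this function), and the inverse density scale is set to ones:

```python
import numpy as np

def naive_pointconv(rel_pos, f_in, s, weight_mlp):
    """Naive PointConv on one local region, following Eq. (4).

    rel_pos: (K, 3) relative coordinates of the K neighbors.
    f_in:    (K, C_in) input features.
    s:       (K,) inverse density scale.
    weight_mlp: maps (K, 3) -> (K, C_in, C_out) per-point weights.
    Returns the (C_out,) output feature of the region.
    """
    w = weight_mlp(rel_pos)                     # (K, C_in, C_out)
    # Sum over neighbors k and input channels c_in, as in Eq. (4).
    return np.einsum('k,kio,ki->o', s, w, f_in)

rng = np.random.default_rng(2)
K, C_in, C_out = 8, 4, 6
# Hypothetical "MLP": a fixed random linear map from coordinates to weights.
A = rng.normal(size=(3, C_in * C_out))
mlp = lambda p: (p @ A).reshape(len(p), C_in, C_out)

out = naive_pointconv(rng.normal(size=(K, 3)),
                      rng.normal(size=(K, C_in)),
                      np.ones(K), mlp)
```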
In order to aggregate features over the entire point set, we use a hierarchical structure that combines detailed small-region features into abstract features covering a larger spatial extent. The hierarchical structure is composed of several feature encoding modules, similar to the one used in PointNet++ [26]. Each module is roughly equivalent to one layer in a convolutional network. The key layers in each feature encoding module are the sampling layer, the grouping layer and PointConv. More details on the feature encoding module can be found in the supplementary material.
The drawback of this approach is that each filter needs to be approximated by a network, which is very inefficient. In Sec. 4, we propose an efficient approach to implement PointConv.
3.2 Feature Propagation Using Deconvolution
For the segmentation task, we need point-wise predictions. In order to obtain features for all the input points, we need an approach to propagate features from a subsampled point cloud to a denser one. PointNet++ [26] proposes to use distance-based interpolation to propagate features, which is reasonable due to local correlations inside a local region. However, this does not take full advantage of a deconvolution operation, which captures local correlations of the propagated information from the coarse level. We propose to add a PointDeconv layer, based on PointConv, as a deconvolution operation to address this issue.
As shown in Figure 4, PointDeconv is composed of two parts: interpolation and PointConv. First, we employ interpolation to propagate coarse features from the previous layer. Following [26], the interpolation is conducted by linearly interpolating features from the 3 nearest points. Then, the interpolated features are concatenated with features from the convolutional layers of the same resolution using skip links. After concatenation, we apply PointConv on the concatenated features to obtain the final deconvolution output, similar to the image deconvolution layer [22]. We apply this process until the features of all the input points have been propagated back to the original resolution.
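The interpolation step above can be sketched as inverse-distance weighting over the 3 nearest coarse points, as in PointNet++ [26]. This is a simplified NumPy version; the brute-force neighbor search is our choice for clarity:

```python
import numpy as np

def interpolate_features(coarse_xyz, coarse_feat, dense_xyz, eps=1e-8):
    """Propagate features from a coarse cloud to a denser one by
    inverse-distance weighting over the 3 nearest coarse points.

    coarse_xyz: (M, 3), coarse_feat: (M, C), dense_xyz: (N, 3).
    Returns (N, C) interpolated features.
    """
    d = np.linalg.norm(dense_xyz[:, None, :] - coarse_xyz[None, :, :],
                       axis=-1)                         # (N, M) distances
    idx = np.argsort(d, axis=1)[:, :3]                  # 3 nearest indices
    nd = np.take_along_axis(d, idx, axis=1)             # (N, 3) distances
    w = 1.0 / (nd + eps)                                # inverse-distance
    w /= w.sum(axis=1, keepdims=True)                   # normalize weights
    return np.einsum('nk,nkc->nc', w, coarse_feat[idx])

coarse = np.array([[0.0, 0, 0], [1.0, 0, 0], [0.0, 1, 0], [1.0, 1, 0]])
feat = np.eye(4)                 # one-hot feature per coarse point
dense = np.array([[0.05, 0.0, 0.0], [0.9, 0.95, 0.0]])
out = interpolate_features(coarse, feat, dense)
```

Each dense point's feature is dominated by its closest coarse neighbor, which is what makes the subsequent PointConv on concatenated skip features meaningful.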
4 Efficient PointConv
The naive implementation of PointConv is memory-consuming and inefficient. Different from [31], we propose a novel reformulation that implements PointConv by reducing it to two standard operations: matrix multiplication and 2D convolution. This not only takes advantage of the parallel computing of GPUs, but also can be easily implemented with mainstream deep learning frameworks. Because the inverse density scale does not have such memory issues, the following discussion focuses on the weight function.
Specifically, let $B$ be the mini-batch size in the training stage, $N$ the number of points in a point cloud, $K$ the number of points in each local region, $C_{in}$ the number of input channels, and $C_{out}$ the number of output channels. For a point cloud, each local region shares the same weight function, which can be learned with an MLP; however, the weights computed from the weight function at different points differ. The weight filters generated by the MLP have size $B \times N \times K \times C_{in} \times C_{out}$. Suppose $B = 32$, $N = 512$, $K = 32$, $C_{in} = C_{out} = 64$, and that the filters are stored in single precision; then the filters require 8GB of memory for only one layer. The network would be hard to train with such high memory consumption. [31] used a very small network with few filters, which significantly degraded its performance. To resolve this problem, we propose a memory-efficient version of PointConv based on Lemma 1.
Lemma 1
PointConv is equivalent to the following formula: $F_{out} = \mathrm{Conv}_{1\times1}\left(H, (S \cdot F_{in})^{T} M\right)$, where $M \in \mathbb{R}^{K \times C_{mid}}$ is the input to the last layer in the MLP computing the weight function, $H \in \mathbb{R}^{C_{mid} \times (C_{in} \times C_{out})}$ is the weights of the last layer in the same MLP, and $\mathrm{Conv}_{1\times1}$ is $1 \times 1$ convolution.
Proof: Generally, the last layer of the MLP is a linear layer. In one local region, rewrite the MLP as a $1 \times 1$ convolution so that the output of the weight function is $W = \mathrm{Conv}_{1\times1}(H, M) = MH$. Let $k$ be the index of the points in the local region, and let $c_{in}$, $c_{mid}$ and $c_{out}$ be the indices of the input channels, the middle-layer channels and the filter output channels, respectively. Then $W(k, c_{in}) \in \mathbb{R}^{C_{out}}$ is a vector from $W$, and $H(c_{mid}, c_{in}) \in \mathbb{R}^{C_{out}}$ is a vector from $H$. According to Eq. (4), PointConv can be expressed as Eq. (5).
$$F_{out} = \sum_{k=1}^{K} \sum_{c_{in}=1}^{C_{in}} S(k)\, W(k, c_{in})\, F_{in}(k, c_{in}) \qquad (5)$$
Let us explore Eq. (5) in more detail. The output of the weight function can be expressed as:

$$W(k, c_{in}) = \sum_{c_{mid}=1}^{C_{mid}} M(k, c_{mid})\, H(c_{mid}, c_{in}) \qquad (6)$$

Substituting Eq. (6) into Eq. (5) and exchanging the order of summation:

$$F_{out} = \sum_{c_{in}=1}^{C_{in}} \sum_{c_{mid}=1}^{C_{mid}} H(c_{mid}, c_{in}) \sum_{k=1}^{K} S(k)\, F_{in}(k, c_{in})\, M(k, c_{mid}) = \mathrm{Conv}_{1\times1}\left(H, (S \cdot F_{in})^{T} M\right) \qquad (7)$$
Thus, the original PointConv can be equivalently reduced to a matrix multiplication and a $1 \times 1$ convolution. Figure 5 shows this efficient version of PointConv.
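The summation-order exchange can be verified numerically with small illustrative tensor sizes. This is our own sanity-check sketch, not the paper's code; the `einsum` index names mirror $k$, $c_{in}$, $c_{mid}$, $c_{out}$:

```python
import numpy as np

# Check Lemma 1: summing S(k) * (M H)(k, c_in) * F_in(k, c_in) over k and
# c_in equals contracting H with the small intermediate (S * F_in)^T M.
rng = np.random.default_rng(3)
K, C_in, C_mid, C_out = 8, 4, 5, 6

M = rng.normal(size=(K, C_mid))              # input to the MLP's last layer
H = rng.normal(size=(C_mid, C_in, C_out))    # last-layer (1x1 conv) weights
S = rng.normal(size=(K,))                    # inverse density scale
F_in = rng.normal(size=(K, C_in))            # input features

# Naive form: materialize the full K x C_in x C_out weight tensor W = MH.
W = np.einsum('km,mio->kio', M, H)
naive = np.einsum('k,kio,ki->o', S, W, F_in)

# Efficient form: (S * F_in)^T M first (C_in x C_mid), then contract with H.
inter = (S[:, None] * F_in).T @ M            # C_in x C_mid intermediate
efficient = np.einsum('im,mio->o', inter, H)
```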
In this method, instead of storing the generated filters in memory, we divide the weight filters into two parts: the intermediate result $M$ and the convolution kernel $H$. The filter storage reduces to $C_{mid}/(C_{in} \times C_{out})$ of the original version. With the same input setup as in Figure 3 and $C_{mid} = 32$, the total memory consumption is about 1/64 of the original PointConv.
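A quick back-of-the-envelope computation using the sizes quoted in this section (B = 32, N = 512, K = 32, C_in = C_out = 64, C_mid = 32; these concrete values are illustrative assumptions):

```python
# Memory for naive PointConv filters vs. the reformulated intermediate,
# using illustrative sizes (assumed, not prescribed by the method itself).
B, N, K, C_in, C_out, C_mid = 32, 512, 32, 64, 64, 32
BYTES = 4  # single precision float

# Naive: materialize a B x N x K x C_in x C_out per-point weight tensor.
naive = B * N * K * C_in * C_out * BYTES
# Efficient: store only the B x N x C_in x C_mid intermediate (S*F_in)^T M.
efficient = B * N * C_in * C_mid * BYTES

naive_gib = naive / 2**30        # gibibytes for a single layer
ratio = naive // efficient       # reduction factor
```

With these sizes the naive filters take 8 GiB for one layer, while the intermediate is 64x smaller, consistent with the reduction stated in the text.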
5 Experiments
In order to evaluate our new PointConv network, we conduct experiments on several widely used datasets: ModelNet40 [40], ShapeNet [2] and ScanNet [5]. In order to demonstrate that our PointConv is able to fully approximate conventional convolution, we also report experiment results on the CIFAR-10 dataset [17]. In all experiments, we implement the models with TensorFlow on a GTX 1080 Ti GPU. The Adam optimizer [15] is used. ReLU and batch normalization are applied after each layer except the last fully connected layer. The detailed network structures and more experiment results can be found in the supplementary material.
5.1 Classification on ModelNet40
ModelNet40 contains 12,311 CAD models from 40 man-made object categories. We use the official split with 9,843 shapes for training and 2,468 for testing. Following the configuration in [24], we use the source code of PointNet [24] to sample 1,024 points uniformly and compute the normal vectors from the mesh models. For fair comparison, we employ the same data augmentation strategy as [24], randomly rotating the point cloud about the up axis and jittering each point by Gaussian noise with zero mean and 0.02 standard deviation. As shown in Table 1, our PointConv achieves state-of-the-art performance among methods based on 3D input. ECC [31], which is similar to our approach, cannot scale to a large network, which limits its performance.

Method  Input  Accuracy (%)

Subvolume [25]  voxels  89.2
ECC [31]  graphs  87.4
Kd-Network [16]  1024 points  91.8
PointNet [24]  1024 points  89.2
PointNet++ [26]  1024 points  90.2
PointNet++ [26]  5000 points + normal  91.9
SpiderCNN [41]  1024 points + normal  92.4
PointConv  1024 points + normal  92.5
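The augmentation strategy described above (random rotation about the up axis plus Gaussian jitter with 0.02 standard deviation) can be sketched as follows; treating z as the up axis is our assumption:

```python
import numpy as np

def augment(points, sigma=0.02, rng=None):
    """Random rotation about the up (z) axis plus per-point Gaussian
    jitter, matching the augmentation described for ModelNet40.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])           # rotation about z
    jitter = rng.normal(0.0, sigma, size=points.shape)
    return points @ rot.T + jitter

pts = np.random.default_rng(4).normal(size=(10, 3))
# With sigma=0 the transform is a pure rotation about the up axis.
out = augment(pts, sigma=0.0, rng=np.random.default_rng(5))
```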
5.2 ShapeNet Part Segmentation
Part segmentation is a challenging fine-grained 3D recognition task. The ShapeNet dataset contains 16,881 shapes from 16 classes with 50 parts in total. The input of the task is a shape represented by a point cloud, and the goal is to assign a part category label to each point in the point cloud. The category label of each shape is given. We follow the experiment setup of most related work [26, 33, 41, 16]. It is common to narrow the possible part labels to the ones specific to the given object category by using the known input 3D object category. We also compute the normal direction at each point as an input feature to better describe the underlying shape. Figure 6 visualizes some sample results.
Method  class avg. mIoU (%)  instance avg. mIoU (%)

SSCNN [42]  82.0  84.7
Kd-net [16]  77.4  82.3
PointNet [24]  80.4  83.7
PointNet++ [26]  81.9  85.1
SpiderCNN [41]  82.4  85.3
SPLATNet [33]  82.0  84.6
PointConv  82.8  85.7
We use point intersection-over-union (IoU) to evaluate our PointConv network, the same as PointNet++ [26], SPLATNet [33] and other part segmentation algorithms [42], [16], [41]. The results are shown in Table 2. PointConv obtains a class average mIoU of 82.8% and an instance average mIoU of 85.7%, on par with the state-of-the-art algorithms that take only point clouds as input. According to [33], SPLATNet can also take rendered 2D views as additional input; since our PointConv only takes 3D point clouds as input, for fair comparison we only compare with the 3D-only SPLATNet result in [33].
5.3 Semantic Scene Labeling
Datasets such as ModelNet40 [40] and ShapeNet [2] are man-made synthetic datasets. As we saw in the previous section, most state-of-the-art algorithms are able to obtain relatively good results on such datasets. To evaluate the capability of our approach on realistic point clouds, which contain a lot of noisy data, we evaluate our PointConv on semantic scene segmentation using the ScanNet dataset. The task is to predict a semantic object label for each 3D point, given indoor scenes represented by point clouds. The newest version of ScanNet [5] includes updated annotations for all 1,513 ScanNet scans and 100 new test scans whose semantic labels are not publicly available; we submitted our results to the official evaluation server to compare against other approaches.
We compare our algorithm with Tangent Convolutions [35], SPLATNet [33], PointNet++ [26] and ScanNet [5]. All the algorithms mentioned reported their results on the new ScanNet benchmark, and their inputs use only the 3D coordinates plus RGB. In our experiments, we generate training samples by randomly sampling cubes from the indoor rooms, and evaluate using a sliding window over the entire scan. We report intersection over union (IoU) as our main measure, the same as the benchmark; the mIoU is the mean of IoU across all the categories. We visualize some example semantic segmentation results in Figure 7 and report the mIoU in Table 3. Our PointConv outperforms the other algorithms by a significant margin (Table 3).
5.4 Classification on CIFAR10
In Sec. 3.1, we claimed that PointConv can be equivalent to a 2D CNN. If this is true, then the performance of a network based on PointConv should match that of a raster image CNN. In order to verify this, we use the CIFAR-10 dataset as a comparison benchmark. We treat each pixel in CIFAR-10 as a 2D point with (x, y) coordinates and RGB features. The point clouds are scaled onto the unit ball before training and testing.
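The pixel-to-point conversion described above can be sketched as follows. This is a minimal version; the exact centering convention is our assumption:

```python
import numpy as np

def image_to_points(img):
    """Turn an H x W x 3 RGB image into a 2D point cloud: each pixel
    becomes a point with (x, y) coordinates scaled onto the unit ball
    and its RGB values as the per-point feature.
    """
    h, w, _ = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xy = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(np.float64)
    xy -= xy.mean(axis=0)                      # center the coordinates
    xy /= np.linalg.norm(xy, axis=1).max()     # scale onto the unit ball
    rgb = img.reshape(-1, 3).astype(np.float64) / 255.0
    return xy, rgb

# A synthetic 32 x 32 "CIFAR-sized" image.
img = (np.arange(32 * 32 * 3) % 256).reshape(32, 32, 3).astype(np.uint8)
xy, rgb = image_to_points(img)
```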
Experiments show that PointConv on CIFAR-10 indeed has the same learning capacity as a 2D CNN. Table 4 shows the results of image convolution and PointConv. From the table, we can see that the accuracy of PointCNN [19] on CIFAR-10 is much worse than that of an image CNN, whereas the network using PointConv achieves accuracy similar to the network using image convolution (89.13% with our 5-layer network).
6 Ablation Experiments and Visualizations
In this section, we conduct additional experiments to evaluate the effectiveness of each aspect of PointConv. Besides the ablation study on the structure of PointConv, we also give an in-depth breakdown of the performance of PointConv on the ScanNet dataset. Finally, we visualize some of the learned filters.
6.1 The Structure of MLP
In this section, we design experiments to evaluate the choice of MLP parameters in PointConv. For fast evaluation, we generate a subset of the ScanNet dataset as a classification task. Each example in the subset is randomly sampled from the original scene scans with 1,024 points. There are 20 different scene types in the ScanNet dataset. We empirically sweep over different choices of $C_{mid}$ and different numbers of layers in the MLP of PointConv. Each experiment was run for 3 random trials; the results can be found in the supplementary material. From the results, we find that a larger $C_{mid}$ does not necessarily give better classification results, and that different numbers of layers in the MLP do not make much difference. Since $C_{mid}$ is linearly correlated with the memory consumption of each PointConv layer, this result shows that we can choose a reasonably small $C_{mid}$ for greater memory efficiency.
6.2 Inverse Density Scale
In this section, we study the effectiveness of the inverse density scale $S$. We choose the ScanNet dataset as our evaluation task since the point clouds in ScanNet are generated from real indoor scenes. We follow the standard training/validation split provided by the authors. We train the network using two different kinds of PointConv: one with the inverse density scale as described in Sec. 3.1, and one without it. Table 5 shows the results. As we can see, PointConv with the inverse density scale performs notably better than the one without, which demonstrates its effectiveness. In our experiments, we observe that the inverse density scale tends to be more effective in layers closer to the input. In deep layers, the MLP tends to learn to diminish the effect of the density scale. One possible reason is that, with farthest point sampling as our subsampling algorithm, the point clouds in deeper layers tend to be more uniformly distributed.
6.3 Ablation Studies on ScanNet Semantic Segmentation
As one can see, our PointConv outperforms other approaches by a large margin. Since we are only allowed to submit one final result of our algorithm to the ScanNet benchmark server, we perform more ablation studies for PointConv using the public validation set provided by [5]. For the segmentation task, we train our PointConv with 8,192 points randomly sampled from a cube, and evaluate the model by exhaustively choosing all points in the cube in a sliding-window fashion through the xy-plane with different stride sizes. For robustness, we use a majority vote over the overlapping windows in all of our experiments. From Table 5, we can see that a smaller stride size improves the segmentation results, and that the RGB information on ScanNet does not seem to significantly improve them. Even without these additional improvements, PointConv still outperforms the baselines by a large margin.

Input  Stride Size (m)  mIoU (%)

xyz  0.5  60.4
xyz  1.0  58.5
xyz  1.5  57.9
xyz + no density  1.5  56.9
xyz + RGB  0.5  60.8
xyz + RGB  1.0  58.6
xyz + RGB  1.5  57.5
6.4 Visualization
7 Conclusion
In this work, we proposed PointConv, a novel approach to perform convolution operations on 3D point clouds. PointConv trains multi-layer perceptrons on local point coordinates to approximate continuous weight and density functions in convolutional filters, which makes it naturally permutation-invariant and translation-invariant. This allows deep convolutional networks to be built directly on 3D point clouds. We also proposed an efficient implementation which greatly improves its scalability. We demonstrated its strong performance on multiple challenging benchmarks and its capability of matching the performance of a grid-based convolutional network on 2D images. In future work, we would like to adopt more mainstream image convolution network architectures for point cloud data using PointConv, such as ResNet and DenseNet.
References

[1]
M. M. Bronstein and I. Kokkinos.
Scaleinvariant heat kernel signatures for nonrigid shape
recognition.
In
Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on
, pages 1704–1711. IEEE, 2010.  [2] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An informationrich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.

[3]
D.Y. Chen, X.P. Tian, Y.T. Shen, and M. Ouhyoung.
On visual similarity based 3d model retrieval.
In Computer graphics forum, volume 22, pages 223–232. Wiley Online Library, 2003.  [4] H. Chu, W.C. M. K. Kundu, R. Urtasun, and S. Fidler. Surfconv: Bridging 3d and 2d convolution for rgbd images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3002–3011, 2018.
 [5] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. Scannet: Richlyannotated 3d reconstructions of indoor scenes. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), volume 1, 2017.
 [6] Y. Fang, J. Xie, G. Dai, M. Wang, F. Zhu, T. Xu, and E. Wong. 3d deep shape descriptor. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2319–2328, 2015.
 [7] A. Gressin, C. Mallet, J. Demantké, and N. David. Towards 3d lidar point cloud registration improvement using optimal neighborhood knowledge. ISPRS journal of photogrammetry and remote sensing, 79:240–251, 2013.
 [8] F. Groh, P. Wieschollek, and H. Lensch. Flexconvolution (deep learning beyond gridworlds). arXiv preprint arXiv:1803.07289, 2018.
 [9] K. Guo, D. Zou, and X. Chen. 3d mesh labeling via deep convolutional neural networks. ACM Transactions on Graphics (TOG), 35(1):3, 2015.
 [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [11] B.S. Hua, M.K. Tran, and S.K. Yeung. Pointwise convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 984–993, 2018.
 [12] Q. Huang, W. Wang, and U. Neumann. Recurrent slice networks for 3d segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2626–2635, 2018.
 [13] J.-H. Jacobsen, J. van Gemert, Z. Lou, and A. W. Smeulders. Structured receptive fields in cnns. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2610–2619. IEEE, 2016.
 [14] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool. Dynamic filter networks. In Advances in Neural Information Processing Systems, pages 667–675, 2016.
 [15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [16] R. Klokov and V. Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 863–872. IEEE, 2017.
 [17] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [19] Y. Li, R. Bu, M. Sun, and B. Chen. Pointcnn. arXiv preprint arXiv:1801.07791, 2018.
 [20] H. Ling and D. W. Jacobs. Shape classification using the inner-distance. IEEE transactions on pattern analysis and machine intelligence, 29(2):286–299, 2007.
 [21] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.
 [22] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1520–1528, 2015.
 [23] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas. Frustum pointnets for 3d object detection from RGB-D data. arXiv preprint arXiv:1711.08488, 2017.
 [24] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.
 [25] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view cnns for object classification on 3d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5648–5656, 2016.
 [26] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5105–5114, 2017.
 [27] X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun. 3d graph neural networks for RGB-D semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5199–5208, 2017.
 [28] S. Ravanbakhsh, J. Schneider, and B. Poczos. Deep learning with sets and point clouds. arXiv preprint arXiv:1611.04500, 2016.
 [29] G. Riegler, A. O. Ulusoy, and A. Geiger. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 3, 2017.
 [30] R. B. Rusu, N. Blodow, and M. Beetz. Fast point feature histograms (fpfh) for 3d registration. In Robotics and Automation, 2009. ICRA’09. IEEE International Conference on, pages 3212–3217. IEEE, 2009.
 [31] M. Simonovsky and N. Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proc. CVPR, 2017.
 [32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [33] H. Su, V. Jampani, D. Sun, S. Maji, E. Kalogerakis, M.-H. Yang, and J. Kautz. Splatnet: Sparse lattice networks for point cloud processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2530–2539, 2018.
 [34] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, pages 945–953, 2015.
 [35] M. Tatarchenko, J. Park, V. Koltun, and Q.-Y. Zhou. Tangent convolutions for dense prediction in 3d. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3887–3896, 2018.
 [36] B. A. Turlach. Bandwidth selection in kernel density estimation: A review. In CORE and Institut de Statistique. Citeseer, 1993.
 [37] N. Verma, E. Boyer, and J. Verbeek. Feastnet: Featuresteered graph convolutions for 3d shape analysis. In CVPR 2018IEEE Conference on Computer Vision & Pattern Recognition, 2018.
 [38] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon. Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018.
 [39] Z. Wu, R. Shou, Y. Wang, and X. Liu. Interactive shape cosegmentation via label propagation. Computers & Graphics, 38:248–254, 2014.
 [39] Z. Wu, R. Shou, Y. Wang, and X. Liu. Interactive shape co-segmentation via label propagation. Computers & Graphics, 38:248–254, 2014.
 [41] Y. Xu, T. Fan, M. Xu, L. Zeng, and Y. Qiao. Spidercnn: Deep learning on point sets with parameterized convolutional filters. arXiv preprint arXiv:1803.11527, 2018.
 [42] L. Yi, H. Su, X. Guo, and L. Guibas. Syncspeccnn: Synchronized spectral cnn for 3d shape segmentation. In Computer Vision and Pattern Recognition (CVPR), 2017.
 [43] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and egomotion from video. In CVPR, volume 2, page 7, 2017.
 [44] Y. Zhou and O. Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. arXiv preprint arXiv:1711.06396, 2017.