1 Introduction
Point cloud analysis has become increasingly important in recent years, as 3D vision built on point cloud data is now essential to a wide range of applications such as autonomous driving, robotics, and augmented reality. However, point clouds are not straightforward to process because they are sparse, unordered, and irregular. Such non-grid-structured data cannot be handled directly by the powerful 2D deep learning methods designed for grid-structured data.
One simple approach is to rasterize the point cloud into 3D voxel grids [24, 43, 12, 34, 8]. Standard discrete convolution can then be performed on the voxels by encoding the geometric attributes of the points they contain. However, voxelizing a raw point cloud produces a vast number of voxels [26], and the computational and memory costs of 3D convolution on this representation grow rapidly with resolution. Moreover, voxelization inherently fails to make full use of the available data, since it inevitably produces discretization artifacts that result in information loss [26, 22, 13].
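To make the cost of rasterization concrete, the following is a minimal sketch of occupancy voxelization. The function name `voxelize_occupancy`, the bounding-cube normalization, and the tested resolutions are assumptions chosen for illustration, not the procedure of any cited work.

```python
import numpy as np

def voxelize_occupancy(points, resolution):
    """Rasterize an (N, 3) point cloud into a binary occupancy grid."""
    points = np.asarray(points, dtype=np.float64)
    lo = points.min(axis=0)
    extent = (points.max(axis=0) - lo).max()
    extent = extent if extent > 0 else 1.0
    # Map every coordinate to an integer voxel index in [0, resolution - 1].
    idx = np.floor((points - lo) / extent * resolution).astype(np.int64)
    idx = np.clip(idx, 0, resolution - 1)
    grid = np.zeros((resolution,) * 3, dtype=bool)
    # Points that fall into the same voxel collapse into a single occupied cell.
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

rng = np.random.default_rng(0)
cloud = rng.random((1024, 3))
for r in (16, 32, 64):
    grid = voxelize_occupancy(cloud, r)
    # Total voxel count grows cubically; occupied voxels never exceed the point count.
    print(r, grid.size, int(grid.sum()))
```

Doubling the resolution multiplies the number of voxels by eight, while the number of occupied voxels is bounded by the number of points, so most of the grid stays empty and several points can merge into one cell.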
To overcome these limitations, the raw point cloud needs to be processed without an intermediate data representation. For direct processing, PointNet [26] proposes to learn pointwise features with a shared multilayer perceptron (MLP) and a symmetric aggregation function, making it immune to point order ambiguity. Owing to its simple but powerful descriptive ability, this approach has been widely recognized, adopted, and improved by subsequent works as a basic concept for point cloud analysis [28, 50, 13, 14, 7, 49, 10, 45].
Taking it further, various studies have focused on applying convolution directly to the points. A key challenge for point convolution is that kernel weights for an arbitrary position in the continuous receptive field should be obtainable, to cope with the irregularity of the point cloud. To construct spatially continuous kernels, several works [9, 42, 38, 22, 20] utilize an MLP that takes a relative position in the kernel area as input and outputs the weights for that position. In this way, weights for neighboring points can be obtained by learning from their spatial distributions with respect to the center point. Alternatively, [36, 4, 5] devise kernel points whose locations are learnable. Weights for the kernel points are parameterized similarly to the kernel of image convolution, and a Gaussian or linear correlation function is used to define the continuous kernel from the kernel points. However, it is questionable whether the continuous kernel can be effectively approximated by an MLP or a customized correlation function alone. An MLP trained from a limited number of sample points must be able to handle an arbitrary point in the continuous receptive field, and a customized correlation function relies heavily on human decisions. Thus, continuous convolution alone is not enough to reach optimal learning.
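As a rough illustration of the kernel-point alternative described above, the sketch below derives a continuous weight for any relative position from a few kernel points, using a linear correlation of the form max(0, 1 - d / sigma). The function name, the array shapes, and the influence radius `sigma` are assumptions made for this example and are not taken from [36, 4, 5].

```python
import numpy as np

def kernel_point_weights(rel_pos, kernel_points, kernel_weights, sigma):
    """Continuous kernel evaluated at arbitrary relative positions.

    rel_pos:        (N, 3) neighbor positions relative to the center point
    kernel_points:  (M, 3) kernel point locations (learnable in practice)
    kernel_weights: (M, C) one weight vector per kernel point
    sigma:          influence radius of each kernel point (assumed value)
    """
    # Distance from every neighbor to every kernel point: (N, M).
    dist = np.linalg.norm(rel_pos[:, None, :] - kernel_points[None, :, :], axis=-1)
    # Linear correlation: 1 at the kernel point, falling to 0 at distance sigma.
    corr = np.maximum(0.0, 1.0 - dist / sigma)
    # Blend the kernel-point weights by their correlation: (N, C).
    return corr @ kernel_weights

kernel_points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
kernel_weights = np.array([[2.0, -1.0], [0.5, 4.0]])
rel_pos = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.25, 0.0, 0.0]])
weights = kernel_point_weights(rel_pos, kernel_points, kernel_weights, sigma=0.5)
print(weights)
```

The shape of the correlation function and the value of `sigma` are fixed by hand here, which is the kind of manual design decision the paragraph above refers to.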
Based on these observations, we propose a novel point convolution operator named Cubic Kernel Convolution (CKConv) to address the aforementioned issues. The key idea of CKConv is learning to voxelize the features of local point set by exploiting both continuous and discrete convolutions, which complement each other in terms of remedying their drawbacks. Our point convolution voxelizes the feature without information loss and 3D convolution encodes features explicitly from fixed grids, resolving the discretization and approximation problems, respectively.
Specifically, point convolution in CKConv utilizes a spatially extended form of kernel representation, which we call the cubic kernel. The cubic kernel can be derived by simply extending the spatial dimension of a weight, as shown in Figure 1. Convolution with this kernel representation splits the feature with each weight in the cubic kernel, producing voxelized features as output. Discrete 3D convolutions are then consecutively applied to these voxelized features. As the whole operation is performed in an end-to-end manner, the subsequent discrete convolution forces the preceding point convolution to learn spatial feature mapping, since discrete convolution consistently operates on spatially adjoining voxelized features during training. This process can be interpreted as learning feature voxelization, i.e., extending the spatial dimension of the feature. We believe that this spatially extended feature representation is more suitable for point cloud analysis since it enables spatial geometry in the local point set to be better represented.
Furthermore, we propose Local Set Attention (LSA) that provides additional spatial attention to the voxelized features with comprehensive structure awareness of the local point set. Every set of local points in the point cloud has a different structure due to the inherent irregularity, hence LSA can play a crucial role in capturing the respective structures by adaptively learning from overall features of the local point set. With data-dependent guidance given by LSA, spatially extended features can be more representative.
We evaluate the proposed approach on three tasks including object classification, object part segmentation, and scene semantic segmentation. Experimental results verify that CKConv outperforms previous approaches.
Our main contributions are summarized as follows:

We propose a novel Cubic Kernel Convolution (CKConv) for effective point cloud analysis. It losslessly voxelizes the features with both continuous and discrete convolutions, and local geometry of points can be explicitly encoded with spatially extended features;

We introduce a learnable Local Set Attention (LSA) that provides comprehensive structural information for representative feature learning;

We provide extensive experiments on various point cloud processing tasks with theoretical analysis while achieving state-of-the-art performance.
2 Related Works
Discrete convolution methods To take advantage of regular grid convolution, early works convert the point cloud into a grid representation such as 2D pixels or 3D voxels. For 2D representation, the point cloud can be transformed into multi-view images [17, 3, 2, 33, 46] or a range image [40, 41, 25]. A 2D convolutional neural network (CNN) is then applied to the produced images, making feature learning comparatively fast and uncomplicated. For 3D representation, 3D space is discretized into a set of occupancy voxels [24, 43, 12, 34]. Since volumetric data has empty space where no value is assigned, [8, 6, 32] propose methods that focus on learning from occupied voxels, efficiently reducing memory and computational cost. However, discretizing non-grid-structured data inevitably loses detailed geometric information. In contrast, our approach does not suffer from information loss since it takes the raw point cloud without transforming the data representation. Instead, CKConv learns to voxelize features in embedding space, enabling 3D convolution on voxels.
Pointwise MLP methods As a pioneering work, PointNet [26] proposes to exploit a shared MLP and a symmetric aggregation function to directly process the point cloud. Pointwise features are learned independently by an MLP that is shared over points, and global features are extracted by a max-pooling operation to achieve permutation invariance. PointNet++ [28] designs a hierarchical network to further capture neighborhood information for each point. Local geometric features are learned by applying PointNet on local groups of points. To enrich local region features, PointWeb [50] constructs a locally fully-linked web by connecting points in a neighborhood region. An MLP-based adaptive feature adjustment (AFA) module that learns contextual information from the region is then formulated through the web. ShellNet [49] partitions the point cloud with a set of concentric spherical shells, and features of points in each shell are encoded by an MLP and summarized by max pooling. However, although an MLP is an appropriate solution for handling irregular data, these methods are overly dependent on it. Encoding with an MLP alone makes it hard for a network to converge well while retaining generalization capability. In our case, we apply discrete convolution after the MLP produces voxelized features. Thus, the MLP can focus on feature mapping while discrete convolution learns high-level encoding in fixed grids.
Graph convolution methods Graph-based networks construct a graph from the point cloud according to the spatial neighbors of each point. Feature learning on the graph can be performed in the spatial or spectral domain. In the spatial domain, convolution is generally defined with an MLP, and graph information for each point is aggregated by a pooling operation over the features of its neighbors [30]. DGCNN [39] progressively updates the graph in feature space after the edge convolution layer, which is composed of an MLP and a pooling operation. DPAM [21] takes a point similarity graph as input for a graph convolution network and learns the agglomeration matrix that is multiplied with the point feature matrix. Alternatively, graph-based networks in the spectral domain exploit spectral filtering to define convolution. RGCNN [35] updates the graph Laplacian matrix in each layer based on Chebyshev polynomial approximation. LocalSpecGCN [37] applies spectral graph convolution on a local graph of the nearest neighbors of each point to learn relative layout. Although these graph-based networks have a strong capability to handle irregular data, the graph must be constructed in advance and its connectivity patterns are often complicated.
Point convolution methods Point convolution methods tend to extend the image convolution concept. These methods define a continuous kernel function to obtain pointwise weights. [9, 42] approximate the continuous convolution with Monte-Carlo integration from a finite number of input points, and an MLP is utilized to construct the kernel function. RSCNN [22] also uses an MLP for the continuous kernel function, but the kernel acts as a mapping function. Another MLP then lifts the mapped feature to high-level learning by raising the channel dimension. Our CKConv similarly utilizes a continuous kernel for feature mapping, but can better capture local geometry since features in CKConv have three spatial dimensions whereas features in RSCNN have only one. KPConv [36] defines kernel points that carry kernel weights. The weights for kernel points can be directly parameterized, but the correlation function that approximates the whole continuous kernel from kernel points is manually defined. In contrast, CKConv can lead to more optimized parameters since the entire kernel function is learnable. FPConv [20] learns a weight map to project points onto a 2D grid, and applies 2D convolution on the image form of features. This idea is shared with CKConv in that both discretize the local point set in feature space. However, FPConv has the severe limitation that its 2D output features lack representation ability regarding object curvature. In contrast, CKConv can retain detailed local geometry without loss of dimension, since it learns to map features into 3D voxels. We support this reasoning with experiments on various tasks in Section 4.
3 Method
We first briefly introduce the principle of point convolution operation. Afterwards, we expound on our proposed Cubic Kernel Convolution (CKConv).
3.1 Convolution on Points
Notations For the sake of clarity, we define the notations employed in the paper as follows.
Neighboring points around a point within a predefined radius are denoted as , where are Cartesian coordinates. We call the set of these points a local point set, where single convolution operation is performed. For CKConv, neighboring points are randomly selected in to be robust to point cloud density, thus . The feature derived at point y is denoted as , where is the number of input channels, and can be initialized with additional information such as normal vector or RGB color. is the standard convolution kernel function and is our proposed cubic kernel function. Both convolution kernel functions determine kernel weights in continuous receptive fields.
Point convolution formula As previous works [38, 9, 42, 36] showed, standard convolution operation on an arbitrary point can be formulated as
(1) 
where produces a weight vector for the neighboring point with size identical to feature , and “” denotes dot product. This formulation is essentially identical to image convolution except for the characteristic of kernel function . Since point cloud is non-grid-structured data without fixed positions, should be able to handle any point in continuous space. Thus, the kernel function needs to be designed to obtain point-dependent kernel weight from the relation between points. Generally, MLP is employed for kernel function and weights for neighbor point are learned from its relative position .
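The standard point convolution described above can be sketched as follows. This is only an illustrative NumPy version under stated assumptions: the kernel function is a small randomly initialized MLP, and one weight vector is produced per output channel so that the result has `c_out` channels. All names (`init_mlp`, `mlp`, `point_conv`) and layer widths are hypothetical.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Random weights for a small MLP; sizes lists the layer widths."""
    return [(rng.normal(0.0, 0.5, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(x, params):
    """Apply the MLP row-wise (shared over points); ReLU between layers."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def point_conv(center, neighbors, feats, params, c_out):
    """Standard point convolution at one center point.

    neighbors: (N, 3) coordinates, feats: (N, C_in) input features.
    Returns a (C_out,) feature vector.
    """
    n, c_in = feats.shape
    rel = neighbors - center                      # relative positions, (N, 3)
    w = mlp(rel, params).reshape(n, c_out, c_in)  # one weight vector per output channel
    # Dot product of weight and feature for each neighbor, summed over neighbors.
    return np.einsum("noc,nc->o", w, feats)

rng = np.random.default_rng(0)
c_in, c_out, n = 4, 8, 16
params = init_mlp([3, 32, c_out * c_in], rng)
center = np.zeros(3)
neighbors = rng.uniform(-1.0, 1.0, (n, 3))
feats = rng.normal(size=(n, c_in))
out = point_conv(center, neighbors, feats, params, c_out)
print(out.shape)
```

Because the MLP is shared over neighbors and the results are summed, the output does not depend on the order in which the neighbors are listed.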
3.2 Cubic Kernel Convolution
Standard image convolution represents each pixel as a feature vector with size equal to the number of channels. More concretely, each pixel has a non-dimensional scalar feature per channel, i.e., a scalar feature. The kernel weight applied to each scalar feature is also a scalar, i.e., a scalar weight. Existing point convolution methods likewise adopt this standard convolution concept for feature and weight. In contrast, we propose to use a 3D cubic form of kernel weights instead of scalar weights, to enrich the geometric information of the feature. Figure 1 compares the standard convolution and cubic kernel convolution.
Point convolution with cubic kernel Our proposed cubic kernel is implemented with a shared MLP to be invariant to the input order of points, as in previous works [42, 38, 22, 20]. Extending kernel weights from scalar to 3D can be achieved simply by increasing the output dimension of the MLP. Three-dimensional cubic kernel weights imply that the scalar weight is distributed into multiple weights, one for each voxel of a cube, which is in fact a 3D matrix. The spatial size of cubic kernel weights per channel is then changed from to , where is a predefined number of voxels along each axis. The scalar feature of a point is multiplied equally by each weight of the cubic kernel, producing output features with the same size as the cubic kernel.
While the weight representation is spatially extended, identical cubic weights are applied to all channels of the feature (see Figure 2), contrary to standard convolution (Eq. 1) that applies different weights to each channel. This operation is based on the concept that our point convolution aims to learn feature voxelization in a spatial manner, which is irrelevant to the channel dimension. Thus, our point convolution with cubic kernel can be formulated as
(2) 
where produces cubic weights for the neighboring point whose transposed feature is . Then each output size of standard convolution (Eq. 1) and point convolution in CKConv (Eq. 2) becomes
(3) 
Despite producing the large size of output feature, our point convolution requires fewer parameters since sizes of kernel weights obtained from and are and respectively. The size of cubic kernel weights remains constantly independent of the input feature channel , and the predefined need not be large as discussed in Section 4.4. We set for our final model, based on experimental results shown in Table 4.
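The point convolution with cubic kernel can be sketched as below, assuming a small randomly initialized MLP as the kernel function and a cube with `k` voxels per axis. The names `cubic_point_conv`, `init_mlp`, and `mlp`, as well as the layer widths, are hypothetical; the sketch only mirrors the description that identical cubic weights are applied to all input channels.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Random weights for a small MLP; sizes lists the layer widths."""
    return [(rng.normal(0.0, 0.5, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(x, params):
    """Apply the MLP row-wise (shared over points); ReLU between layers."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def cubic_point_conv(center, neighbors, feats, params, k):
    """Voxelize the features of a local point set with cubic weights.

    Returns an array of shape (k, k, k, C_in).
    """
    rel = neighbors - center                # (N, 3)
    w = mlp(rel, params)                    # (N, k^3), shared by all channels
    # Sum over neighbors of the outer product (cubic weight x feature).
    vox = np.einsum("nv,nc->vc", w, feats)  # (k^3, C_in)
    return vox.reshape(k, k, k, feats.shape[1])

rng = np.random.default_rng(0)
k, c_in, n = 4, 6, 16
# The kernel output size is k^3 regardless of the number of input channels.
params = init_mlp([3, 32, k ** 3], rng)
center = np.zeros(3)
neighbors = rng.uniform(-1.0, 1.0, (n, 3))
feats = rng.normal(size=(n, c_in))
vox = cubic_point_conv(center, neighbors, feats, params, k)
print(vox.shape)
```

In this sketch the last MLP layer has `k ** 3` outputs no matter how many channels the input feature has, whereas the output itself carries `k ** 3` values per input channel.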
Forcing to learn feature voxelization
Our point convolution with cubic kernel produces spatially extended features, but the cubic feature itself does not signify that the feature has taken on a spatial meaning. Only when 3D convolution is applied is the preceding point convolution forced to infuse spatial relations between voxelized features. This interaction between the convolutions arises because they affect each other during gradient backpropagation in training, since CKConv is learned end-to-end, as shown in Figure 2. During training, point convolution comes to produce spatially significant features as 3D convolutions are performed on the voxelized features in a spatial manner.
Since the number of input channels is maintained in point convolution and increased to in 3D convolution, we can interpret the role of each convolution as a spatial feature mapping and high-level encoding, respectively. In other words, point convolution losslessly voxelizes the features of the local point set in embedding space, and discrete convolution extracts the final output feature by operating on voxelized features. In this way, we also alleviate the problem that the continuous kernel is difficult to approximate with MLP [44, 23], by giving the continuous kernel a relatively intelligible task to learn.
Furthermore, the spatially extended features are more capable of encoding detailed geometric information, compared with scalar feature. The structure of local point set can be better captured with cubic feature representation, since it inherently contains spatial attributes whereas scalar feature does not in itself. Thus, our cubic representation would be more desirable for point cloud analysis that focuses on learning spatial distribution of points.
Cubic kernel normalization
Since kernel weights are obtained from continuous kernel function approximated by MLP, scale and variance of the cubic weights can vary substantially from point to point. To prevent this weight imbalance from causing unstable training, we impose restrictions on the distribution of cubic weights by applying normalization schemes.
Let the cubic weights be for a given point. Then L2 normalization on each cubic weight can be formulated as
(4) 
to stabilize scale distributions of cubic weights over points. To further restrict the scale and variance distributions of cubic weights, standardization can be applied as
(5) 
where and . These cubic weight normalization schemes help to improve performance (see Section 4.4), and hence we apply normalization after every weight prediction in CKConv layers.
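The two normalization schemes can be sketched as follows, assuming the cubic weights of each point are flattened into one row of `k ** 3` values and that the statistics are taken over that row. The function names and the small constant `eps` added for numerical safety are assumptions of this sketch.

```python
import numpy as np

def l2_normalize(w, eps=1e-8):
    """Scale each point's cubic weights to (approximately) unit L2 norm."""
    norm = np.linalg.norm(w, axis=-1, keepdims=True)
    return w / (norm + eps)

def standardize(w, eps=1e-8):
    """Shift and scale each point's cubic weights to zero mean, unit variance."""
    mu = w.mean(axis=-1, keepdims=True)
    sigma = w.std(axis=-1, keepdims=True)
    return (w - mu) / (sigma + eps)

rng = np.random.default_rng(0)
k = 4
# Eight points whose raw weights differ in scale by four orders of magnitude.
scales = np.logspace(-2, 2, 8)
raw = rng.normal(size=(8, k ** 3)) * scales[:, None]
w_l2 = l2_normalize(raw)
w_std = standardize(raw)
print(np.linalg.norm(raw, axis=-1).round(2))
print(np.linalg.norm(w_l2, axis=-1).round(2))
```

After either scheme, the weights of every point live on a comparable scale, which is the imbalance the normalization is meant to remove.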
Local set attention Due to the irregularity of point clouds, every local point set has a different structure. Thus, it is important that point convolution captures the unique structures of the respective local point sets. For CKConv, we construct an additional branch for the proposed Local Set Attention (LSA), which provides comprehensive geometric information about the local point set. As shown in Figure 3, LSA shares the front MLP with the cubic kernel function since it also needs to learn feature mapping together with the cubic kernel. The representative feature of the local point set is obtained by applying max pooling on intermediate features in the cubic kernel function, and another MLP is used to extract cubic attention. In this way, cubic attention with identical size to the cubic kernel contains overall structure information, since it is extracted by aggregating features of the local point set. The cubic attention is then elementwise multiplied to all cubic features by reproduction in the channel dimension (see Figure 2). For the formulation, let the input matrix in Figure 3 be
(6) 
Then our point convolution with LSA can be extended from Eq. 2 as
(7) 
where is the attention function and is elementwise product. is a matrix of ones with identical size to , and indicates skip connection for features without LSA (cubic attention in Figure 2 contains the skip connection). With this point convolution formula, our cubic kernel convolution can be written as
(8) 
where is 3D convolution operations on voxelized features with output channel .
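Putting the pieces together, the sketch below runs one CKConv-style operation on a single local point set: a shared front MLP, a cubic kernel head, an LSA branch with max pooling and a skip connection, and a final 3D convolution. Several details are assumptions of this sketch rather than the configuration used in the paper: the input to the front MLP is taken to be the relative coordinates, the attention passes through a sigmoid, weight normalization is omitted, and the 3D convolution is a single unpadded kernel spanning the whole cube.

```python
import numpy as np

def dense(x, w, b, relu=True):
    """One fully connected layer, applied row-wise when x is a matrix."""
    y = x @ w + b
    return np.maximum(y, 0.0) if relu else y

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(hidden, k, c_in, c_out, rng):
    def layer(m, n):
        return rng.normal(0.0, 0.5, (m, n)), np.zeros(n)
    return {
        "front": layer(3, hidden),        # shared by kernel and attention
        "kernel": layer(hidden, k ** 3),  # cubic kernel head
        "attn": layer(hidden, k ** 3),    # cubic attention head
        "conv3d": rng.normal(0.0, 0.1, (k, k, k, c_in, c_out)),
    }

def ckconv_sketch(center, neighbors, feats, p, k):
    rel = neighbors - center                             # (N, 3)
    h = dense(rel, *p["front"])                          # (N, H) intermediate features
    w = dense(h, *p["kernel"], relu=False)               # (N, k^3) cubic weights
    pooled = h.max(axis=0)                               # (H,) summary of the local set
    attn = sigmoid(dense(pooled, *p["attn"], relu=False))  # (k^3,) cubic attention
    vox = np.einsum("nv,nc->vc", w, feats)               # (k^3, C_in) voxelized features
    vox = vox * (attn[:, None] + 1.0)                    # attention plus skip connection
    vox = vox.reshape(k, k, k, feats.shape[1])
    # Unpadded 3D convolution whose kernel covers the whole cube: (C_out,).
    return np.einsum("xyzc,xyzco->o", vox, p["conv3d"])

rng = np.random.default_rng(0)
k, c_in, c_out, n = 4, 6, 12, 16
p = init_params(32, k, c_in, c_out, rng)
center = np.zeros(3)
neighbors = rng.uniform(-1.0, 1.0, (n, 3))
feats = rng.normal(size=(n, c_in))
out = ckconv_sketch(center, neighbors, feats, p, k)
print(out.shape)
```

The attention is broadcast over the channel dimension, and adding one before the multiplication keeps the features without attention available, in the spirit of the skip connection described above.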
4 Experiments
We evaluate our proposed CKConv on three tasks: object classification, object part segmentation, and scene semantic segmentation. Network architectures, configurations, and detailed training parameters are provided in the supplementary material.
Table 2: Object part segmentation results on ShapeNetPart (instance-wise mIoU and class-wise IoU).
Method  Input  mIoU  aero  bag  cap  car  chair  ear.  guit.  knife  lamp  lapt.  moto.  mug  pist.  rock.  skate.  table
KdNet [15]  4k points  
SynSpecCNN [48]  graph  
SONet [18]  1k points  
PointNet++ [28]  2k points  
DGCNN [39]  2k points  85.1  
SpiderCNN [44]  2k points  85.3  
PointCNN [19]  2k points  
RSCNN [22]  2k points  
KPConv [36]  2k points  
CKConv  2k points 
Table 1: Object classification results on ModelNet40 (mAcc and OA).
Methods  Input  mAcc  OA
3DShapeNets [43]  voxels  
VoxNet [24]  voxels    
Vol. CNN [27]  voxels    
SONet [18]  2k points  
PointNet++ [28]  5k points    
SONet [18]  5k points  
KPConv [36]  7k points    
Pointwise CNN [11]  1k points  
PointNet [26]  1k points  
KCNet [29]  1k points    
PointNet++ [28]  1k points    
PointCNN [19]  1k points  
DGCNN [39]  1k points  
SpiderCNN [44]  1k points    
PointConv [42]  1k points    
FPConv [20]  1k points    
ACNN [16]  1k points  
RSCNN [22]  1k points    
InterpCNN [23]  1k points    
ShellNet [49]  1k points    
CKConv  1k points 
4.1 Object Classification
We use the ModelNet40 [43] dataset, which contains 9,843 training and 2,468 test models in 40 classes, for object classification. The point cloud data used to train and test our model is provided by [26], and 1,024 points are uniformly sampled as input. We use the normal as an additional feature. For training, random scaling and translation are used as the augmentation strategy, as in [15, 22]. A dropout layer [31] with 0.5 probability is used in the final fully connected layers to reduce overfitting. For evaluation, we do not apply the voting strategy that repeatedly predicts an object's class with random scaling or sampling.
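A random scaling and translation augmentation of the kind mentioned above might look like the following sketch. The isotropic scale, the scale range, and the maximum shift are placeholder assumptions; the actual ranges are not stated in this section.

```python
import numpy as np

def augment(points, rng, scale_range=(0.8, 1.25), max_shift=0.1):
    """Randomly scale and translate an (N, 3) point cloud.

    scale_range and max_shift are illustrative values, not the paper's.
    """
    scale = rng.uniform(scale_range[0], scale_range[1])    # one factor for all axes
    shift = rng.uniform(-max_shift, max_shift, size=3)     # one offset per axis
    return points * scale + shift

rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, (1024, 3))
augmented = augment(points, rng)
print(augmented.shape)
```

With an isotropic scale and a pure translation, surface normals keep their direction, so a normal feature attached to each point would not need to be modified in this particular sketch.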
We compare the overall accuracy (OA) and mean class-wise accuracy (mAcc) of the proposed CKConv with relevant previous state-of-the-art models in Table 1. CKConv achieves higher accuracy than all other considered methods. In the case of RSCNN [22], we report the performance without the voting strategy as provided in the paper. Among methods that exploit 3D convolution for feature learning, our approach performs much better than those [43, 24, 27] that rasterize the raw point cloud into voxels. The reason for the performance gap is that CKConv learns to voxelize the feature in embedding space, which prevents information loss in the discretization process. CKConv also outperforms methods with points as input, which demonstrates the effectiveness of our unique feature representation.
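For reference, the two classification metrics can be computed as in the sketch below. The function names and the toy labels are illustrative only and are unrelated to the reported results.

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class is correct."""
    return float(np.mean(y_true == y_pred))

def mean_class_accuracy(y_true, y_pred, num_classes):
    """Average of the per-class accuracies, so rare classes count equally."""
    accs = []
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():  # skip classes absent from the ground truth
            accs.append(np.mean(y_pred[mask] == c))
    return float(np.mean(accs))

y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 1, 0, 2, 1])
oa = overall_accuracy(y_true, y_pred)
macc = mean_class_accuracy(y_true, y_pred, num_classes=3)
print(oa, macc)
```

In this toy example OA is 0.75 while mAcc is about 0.667, because the two smaller classes are each only half correct and mAcc weights every class equally.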
Table 3: Scene semantic segmentation results on S3DIS (OA, mAcc, mIoU, and class-wise IoU).
Method  OA  mAcc  mIoU  ceil.  floor  wall  beam  col.  wind.  door  chair  table  book.  sofa  board  clut.
PointNet [26]    
PointCNN [19]  
PointWeb [50]  
FPConv [20]  
KPConv [36]    
CKConv 
4.2 Object Part Segmentation
We evaluate CKConv on the ShapeNetPart [47] dataset for object part segmentation. ShapeNetPart contains 16,881 point clouds from 16 classes. Each shape is annotated with 2 to 6 parts, and there are 50 different parts in total. We follow the data split used in [26], and 2,048 points with the normal are randomly sampled as input. The one-hot encoding of the object label is concatenated to the last layer as in [26]. We adopt the same augmentation strategy used in the object classification task. During testing, we apply a voting strategy with random scaling.
Table 2 summarizes part segmentation results with mean instance-wise intersection over union (mIoU) and class-wise intersection over union. CKConv outperforms the state-of-the-art methods with mIoU of and achieves new best results for multiple classes. Examples of object part segmentation results are visualized in Figure 4, verifying that CKConv performs robustly on diverse objects. More examples are provided in the supplementary material.
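The instance-wise mIoU used for part segmentation can be sketched as follows. The treatment of a part that is absent from both the prediction and the ground truth (counted as IoU 1) is a common convention assumed here, and the function names and toy labels are illustrative.

```python
import numpy as np

def shape_iou(pred, gt, part_ids):
    """Mean IoU over the parts that belong to this shape's object class."""
    ious = []
    for part in part_ids:
        inter = np.sum((pred == part) & (gt == part))
        union = np.sum((pred == part) | (gt == part))
        # Assumed convention: a part missing from both counts as a perfect match.
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))

def instance_miou(preds, gts, parts_per_shape):
    """Average the per-shape IoU over all shapes (instances)."""
    return float(np.mean([shape_iou(p, g, ids)
                          for p, g, ids in zip(preds, gts, parts_per_shape)]))

preds = [np.array([0, 1, 1, 1]), np.array([2, 2, 2, 2])]
gts = [np.array([0, 0, 1, 1]), np.array([2, 2, 2, 2])]
parts_per_shape = [[0, 1], [2, 3]]
miou = instance_miou(preds, gts, parts_per_shape)
print(miou)
```

Here the first shape scores (1/2 + 2/3) / 2 and the second scores 1, so the instance-wise mIoU is their average.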
4.3 Scene Semantic Segmentation
Unlike the synthetic datasets used in classification and part segmentation, datasets for scene segmentation are generally captured from the real world, making the task challenging. We use S3DIS [1], which contains point clouds from 6 large-scale indoor areas. Each point is annotated with one semantic label from 13 classes. We follow the sampling strategy used in [20] to prepare training data. Each point is represented by a 9D vector combining XYZ, RGB, and the normalized location. For evaluation, we report the results tested on Area while trained on the rest. We use three evaluation metrics: overall pointwise accuracy (OA), mean class-wise accuracy (mAcc), and mean class-wise intersection over union (mIoU).
As shown in Table 3, CKConv outperforms all state-of-the-art methods in terms of OA and mAcc, while sharing the best mIoU with KPConv [36]. L2 normalization and local set attention are applied for scene segmentation. The performances with and without normalization and LSA can be found in the supplementary material. The results verify that our approach can better capture the semantic geometry than previous methods, i.e., the spatially extended feature representation of CKConv contains more explicit structure information than a scalar feature. The qualitative results are shown in Figure 5, and more visualizations including failure cases are available in the supplementary material.
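The scene segmentation metrics can all be read off a confusion matrix, as in the sketch below. The helper names and the six toy labels are illustrative and unrelated to the reported S3DIS numbers.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """cm[i, j] counts points with ground-truth class i predicted as class j."""
    idx = gt * num_classes + pred
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def segmentation_metrics(cm):
    tp = np.diag(cm).astype(np.float64)
    gt_count = cm.sum(axis=1)    # points per ground-truth class
    pred_count = cm.sum(axis=0)  # points per predicted class
    oa = tp.sum() / cm.sum()
    seen = gt_count > 0
    macc = np.mean(tp[seen] / gt_count[seen])
    union = gt_count + pred_count - tp
    valid = union > 0
    miou = np.mean(tp[valid] / union[valid])
    return float(oa), float(macc), float(miou)

gt = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 1, 1, 1, 2, 0])
cm = confusion_matrix(gt, pred, num_classes=3)
oa, macc, miou = segmentation_metrics(cm)
print(cm)
print(oa, macc, miou)
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so the class-wise mIoU is 0.5, while OA and mAcc are both 2/3.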
4.4 Ablation Study
To evaluate the influence of various components of CKConv, we further conduct ablation studies on object classification and object part segmentation. All results reported in this section are obtained without the voting scheme.
Cubic kernel unit size We first explore different settings of cubic kernel unit size in Table 4. Note that cubic kernel size determines the resolution of feature voxelization, since output feature size of our point convolution is (Eq. 3). For the experiment, normalization and LSA are employed for . For , the spatial size of output feature becomes 216 or more, which is unnecessarily large to represent the local point set. We apply different 3D convolution kernel sizes for each case of to extract final output features in Eq. 8 (details in the supplementary material). The results show that 4 × 4 × 4 is the most suitable size for the cubic kernel in object classification and object part segmentation. We also adopt for scene segmentation, since a similar number of neighboring points is used to define the local point set.
Weight normalization and local set attention We verify the effectiveness of cubic weight normalization and local set attention in Table 5, with cubic kernel size 4 × 4 × 4. Baseline model A is set to learn without cubic weight normalization and local set attention. It gets an OA of in the classification task and mIoU of in the part segmentation task. When L2 normalization is adopted for cubic weight normalization (model B), the results are slightly improved to and , respectively. Then, proposed local set attention boosts its performance to and (model C), which shows great improvement in the segmentation task. When we alternatively apply standardization for weight normalization without local set attention (model D), the model achieves results of and , which are similar to the results with L2 normalization. When local set attention is applied (model E), the classification accuracy is significantly improved to .
Table 4: Results for different cubic kernel sizes.
Cubic kernel size | ModelNet40 OA | ShapeNetPart mIoU
Table 5: Ablation of cubic weight normalization and local set attention (LSA).
Model | L2 norm. | Standardization | LSA | ModelNet40 OA | ShapeNetPart mIoU
A |  |  |  |  |
B | ✓ |  |  |  |
C | ✓ |  | ✓ |  |
D |  | ✓ |  |  |
E |  | ✓ | ✓ |  |
Although we have proposed two weight normalization methods (L2 normalization and standardization), note that they could be replaced with other normalization schemes. The positive influence of weight normalization during training can be seen in Figure 6 (a). Figure 6 (b) likewise suggests that features extracted with local set attention contain comprehensive structure information within the local point set, thus achieving better performance.
4.5 Feature visualization
In Figure 7, features learned in different layers are visualized. We can observe that features learned in the first layer exhibit high activation for low-level structures such as edges and corners, whereas features learned in later layers show high activation for semantic structures such as wings, tails, and legs. Thus, feature response shifts from point to part level by capturing more global geometry as layers deepen.
5 Conclusion
We have presented CKConv, a novel convolution operator for point clouds. CKConv exploits both continuous point convolution and discrete convolution to extract spatially extended features for the local point set, with the proposed cubic kernel. Spatial extension of feature representation induced by feature voxelization enables detailed geometry of point clouds to be better captured, leading to enriched feature learning. Moreover, local set attention (LSA) has been proposed to encode the representative feature of the local point set by imparting additional spatial attention with comprehensive structure awareness. Experiments on three different tasks have verified that our approach achieves state-of-the-art performance.
References

[1]
(2016)
3d semantic parsing of largescale indoor spaces.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pp. 1534–1543. Cited by: §4.3.  [2] (2016) Semantic segmentation of earth observation data using multimodal and multiscale deep networks. In Asian conference on computer vision, pp. 180–196. Cited by: §2.
 [3] (2017) Unstructured point cloud semantic labeling using deep segmentation networks.. 3DOR 2, pp. 7. Cited by: §2.
 [4] (2019) Generalizing discrete convolutions for unstructured point clouds.. In 3DOR, pp. 71–78. Cited by: §1.
 [5] (2020) ConvPoint: continuous convolutions for point cloud processing. Computers & Graphics 88, pp. 24–34. Cited by: §1.

[6]
(2019)
4d spatiotemporal convnets: minkowski convolutional neural networks
. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3075–3084. Cited by: §2.  [7] (2018) Know what your neighbors do: 3d semantic segmentation of point clouds. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pp. 0–0. Cited by: §1.
 [8] (2018) 3d semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9224–9232. Cited by: §1, §2.
 [9] (2018) Monte carlo convolution for learning on nonuniformly sampled point clouds. ACM Transactions on Graphics (TOG) 37 (6), pp. 1–12. Cited by: §1, §2, §3.1.
 [10] (2020) Randlanet: efficient semantic segmentation of largescale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11108–11117. Cited by: §1.
 [11] (2018) Pointwise convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 984–993. Cited by: Table 1.
 [12] (2016) Point cloud labeling using 3d convolutional neural network. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 2670–2675. Cited by: §1, §2.
 [13] (2018) Pointsift: a siftlike network module for 3d point cloud semantic segmentation. arXiv preprint arXiv:1807.00652. Cited by: §1, §1.

[14]
(2019)
Momen (e) t: flavor the moments in learning to classify shapes
. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §1.  [15] (2017) Escape from cells: deep kdnetworks for the recognition of 3d point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, pp. 863–872. Cited by: §4.1, Table 2.
 [16] (2019) Acnn: annularly convolutional neural networks on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7421–7430. Cited by: Table 1.
 [17] (2017) Deep projective 3d semantic segmentation. In International Conference on Computer Analysis of Images and Patterns, pp. 95–107. Cited by: §2.
 [18] (2018) Sonet: selforganizing network for point cloud analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9397–9406. Cited by: Table 2, Table 1.
 [19] (2018) PointCNN: convolution on transformed points. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 828–838. Cited by: Table 2, Table 1, Table 3.
 [20] (2020) Fpconv: learning local flattening for point convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4293–4302. Cited by: §1, §2, §3.2, §4.3, Table 1, Table 3.
 [21] (2019) Dynamic points agglomeration for hierarchical point sets learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7546–7555. Cited by: §2.
 [22] (2019) Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8895–8904. Cited by: §1, §1, §2, §3.2, §4.1, §4.1, Table 2, Table 1.
 [23] (2019) Interpolated convolutional networks for 3d point cloud understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1578–1587. Cited by: §3.2, Table 1.
 [24] (2015) Voxnet: a 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 922–928. Cited by: §1, §2, §4.1, Table 1.
 [25] (2019) Rangenet++: fast and accurate lidar semantic segmentation. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4213–4220. Cited by: §2.
 [26] (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660. Cited by: §1, §1, §2, §4.1, §4.2, Table 1, Table 3.
 [27] (2016) Volumetric and multi-view cnns for object classification on 3d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5648–5656. Cited by: §4.1, Table 1.
 [28] (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413. Cited by: §1, §2, Table 2, Table 1.
 [29] (2018) Mining point cloud local structures by kernel correlation and graph pooling. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4548–4557. Cited by: Table 1.
 [30] (2017) Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3693–3702. Cited by: §2.

 [31] (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §4.1.
 [32] (2018) Splatnet: sparse lattice networks for point cloud processing. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2530–2539. Cited by: §2.
 [33] (2015) Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, pp. 945–953. Cited by: §2.
 [34] (2017) Segcloud: semantic segmentation of 3d point clouds. In 2017 international conference on 3D vision (3DV), pp. 537–547. Cited by: §1, §2.
 [35] (2018) Rgcnn: regularized graph cnn for point cloud segmentation. In Proceedings of the 26th ACM international conference on Multimedia, pp. 746–754. Cited by: §2.
 [36] (2019) Kpconv: flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6411–6420. Cited by: §1, §2, §3.1, §4.3, Table 2, Table 1, Table 3.
 [37] (2018) Local spectral graph convolution for point set feature learning. In Proceedings of the European conference on computer vision (ECCV), pp. 52–66. Cited by: §2.
 [38] (2018) Deep parametric continuous convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2589–2597. Cited by: §1, §3.1, §3.2.
 [39] (2019) Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog) 38 (5), pp. 1–12. Cited by: §2, Table 2, Table 1.
 [40] (2018) Squeezeseg: convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1887–1893. Cited by: §2.
 [41] (2019) Squeezesegv2: improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud. In 2019 International Conference on Robotics and Automation (ICRA), pp. 4376–4382. Cited by: §2.
 [42] (2019) Pointconv: deep convolutional networks on 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9621–9630. Cited by: §1, §2, §3.1, §3.2, Table 1.
 [43] (2015) 3d shapenets: a deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1912–1920. Cited by: §1, §2, §4.1, §4.1, Table 1.
 [44] (2018) Spidercnn: deep learning on point sets with parameterized convolutional filters. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 87–102. Cited by: §3.2, Table 2, Table 1.
 [45] (2019) Modeling point clouds with self-attention and gumbel subset sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3323–3332. Cited by: §1.
 [46] (2019) Learning relationships for multi-view 3d object recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7505–7514. Cited by: §2.
 [47] (2016) A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (ToG) 35 (6), pp. 1–12. Cited by: §4.2.
 [48] (2017) Syncspeccnn: synchronized spectral cnn for 3d shape segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2282–2290. Cited by: Table 2.
 [49] (2019) Shellnet: efficient point cloud convolutional neural networks using concentric shells statistics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1607–1616. Cited by: §1, §2, Table 1.
 [50] (2019) Pointweb: enhancing local neighborhood features for point cloud processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5565–5573. Cited by: §1, §2, Table 3.