CKConv: Learning Feature Voxelization for Point Cloud Analysis

07/27/2021 ∙ by Sungmin Woo, et al.

Despite the remarkable success of deep learning, the optimal convolution operation on point clouds remains an open question due to their irregular data structure. In this paper, we present Cubic Kernel Convolution (CKConv), which learns to voxelize the features of local points by exploiting both continuous and discrete convolutions. Our continuous convolution uniquely employs a 3D cubic form of kernel weight representation that splits a feature into voxels in embedding space. By consecutively applying discrete 3D convolutions on the voxelized features in a spatial manner, the preceding continuous convolution is forced to learn spatial feature mapping, i.e., feature voxelization. In this way, geometric information can be detailed by encoding with subdivided features, and our 3D convolutions on this fixed, structured data do not suffer from discretization artifacts thanks to voxelization in embedding space. Furthermore, we propose a spatial attention module, Local Set Attention (LSA), to provide comprehensive structure awareness within the local point set and hence produce representative features. By learning feature voxelization with LSA, CKConv can extract enriched features for effective point cloud analysis. We show that CKConv has great applicability to point cloud processing tasks including object classification, object part segmentation, and scene semantic segmentation, achieving state-of-the-art results.


1 Introduction

Point cloud analysis has taken on increasing significance in recent years, as 3D vision relying on point cloud data becomes essential in a wide range of applications such as autonomous driving, robotics, and augmented reality. However, it is not straightforward to process point clouds due to their unique properties: they are sparse, unordered, and irregular. Such non-grid structured data cannot be directly handled by well-established 2D deep learning methods, which are designed for grid-structured data.

One simple approach is to rasterize the point cloud into 3D voxel grids [24, 43, 12, 34, 8]. Standard discrete convolution can then be performed on the discrete voxels by encoding geometric attributes of the points contained therein. However, voxelization of the raw point cloud produces a vast number of voxels [26], and 3D convolutions on such a data representation incur high computational complexity and memory consumption that grow with resolution. Moreover, it inherently fails to make full use of the available data, since voxelization inevitably produces discretization artifacts and thus information loss [26, 22, 13].

To overcome these limitations, the raw point cloud needs to be processed without an intermediate data representation. For direct processing, PointNet [26] proposes to learn pointwise features with a shared multi-layer perceptron (MLP) and a symmetric aggregation function so as to be immune to point order ambiguity. Due to its simple yet powerful description ability, this approach has been widely adopted and improved by subsequent works as a basic concept for point cloud analysis [28, 50, 13, 14, 7, 49, 10, 45].

Taking it further, various studies have focused on applying convolution directly to the points. A key challenge for point convolution is that kernel weights for an arbitrary position in the continuous receptive field should be obtainable, to cope with the irregularity of the point cloud. To construct spatially-continuous kernels, several works [9, 42, 38, 22, 20] utilize an MLP that takes a relative position in the kernel area as input and outputs the weights for that position. In this way, weights for neighboring points can be obtained by learning from their spatial distributions with respect to the center point. Alternatively, [36, 4, 5] devise kernel points whose locations are learnable. Weights for the kernel points are parameterized similarly to the kernel of image convolution, and a Gaussian or linear correlation function is used to define the continuous kernel from the kernel points. However, it is questionable whether the continuous kernel can be effectively approximated only by an MLP or a hand-crafted correlation function. An MLP trained on a limited number of sample points must handle arbitrary points in the continuous receptive field, and a hand-crafted correlation function relies heavily on human design choices. Thus, continuous convolution alone is not enough to reach optimal learning.

Based on these observations, we propose a novel point convolution operator named Cubic Kernel Convolution (CKConv) to address the aforementioned issues. The key idea of CKConv is to learn to voxelize the features of a local point set by exploiting both continuous and discrete convolutions, which complement each other by remedying each other's drawbacks. Our point convolution voxelizes the features without information loss, and the 3D convolution encodes features explicitly on fixed grids, resolving the discretization and approximation problems, respectively.

Specifically, the point convolution in CKConv utilizes a spatially extended form of kernel representation, which we call the cubic kernel. The cubic kernel can be derived by simply extending the spatial dimension of a weight, as shown in Figure 1. Convolution with this kernel representation splits the feature with each weight in the cubic kernel, producing voxelized features as output. Discrete 3D convolutions are then consecutively applied to these voxelized features. As the whole operation is performed in an end-to-end manner, the following discrete convolution forces the preceding point convolution to learn spatial feature mapping, since the discrete convolution is consistently operated on spatially adjoining voxelized features during training. This process can be interpreted as learning feature voxelization, i.e., extending the spatial dimension of the feature. We believe that this spatially extended feature representation is more suitable for point cloud analysis, since it enables the spatial geometry of the local point set to be better represented.

Furthermore, we propose Local Set Attention (LSA) that provides additional spatial attention to the voxelized features with comprehensive structure awareness of the local point set. Every set of local points in the point cloud has a different structure due to the inherent irregularity, hence LSA can play a crucial role in capturing the respective structures by adaptively learning from overall features of the local point set. With data-dependent guidance given by LSA, spatially extended features can be more representative.

We evaluate the proposed approach on three tasks including object classification, object part segmentation, and scene semantic segmentation. Experimental results verify that CKConv outperforms previous approaches.

Our main contributions are summarized as follows:

  • We propose a novel Cubic Kernel Convolution (CKConv) for effective point cloud analysis. It losslessly voxelizes the features with both continuous and discrete convolutions, and local geometry of points can be explicitly encoded with spatially extended features;

  • We introduce a learnable Local Set Attention (LSA) that provides comprehensive structural information for representative feature learning;

  • We provide extensive experiments on various point cloud processing tasks with theoretical analysis while achieving state-of-the-art performances.

Figure 1: Standard convolution (top) and the proposed CKConv (bottom) operating on a single-channel feature. Note that the cubic weight size can be adjusted, and the 3D convolutions on voxelized features are omitted in the illustration.

2 Related Works

Discrete convolution methods   To take advantage of regular grid convolution, early works convert the point cloud into a grid representation such as 2D pixels or 3D voxels. For 2D representation, the point cloud can be transformed into multi-view images [17, 3, 2, 33, 46] or a range image [40, 41, 25]. A 2D convolutional neural network (CNN) is then applied to the produced images, making feature learning comparatively fast and uncomplicated. For 3D representation, the 3D space is discretized into a set of occupancy voxels [24, 43, 12, 34]. Since volumetric data contains empty space where no value is assigned, [8, 6, 32] propose methods that focus on learning from occupied voxels, efficiently reducing memory and computational cost. However, discretizing non-grid structured data inevitably loses detailed geometric information. In contrast, our approach does not suffer from information loss, since it takes the raw point cloud without transforming the data representation. Instead, CKConv learns to voxelize features in embedding space, enabling 3D convolution on voxels.

Pointwise MLP methods   As a pioneering work, PointNet [26] proposes to exploit a shared MLP and a symmetric aggregation function to directly process the point cloud. Pointwise features are learned independently by the MLP that is shared over points, and global features are extracted by a max-pooling operation to achieve permutation invariance. PointNet++ [28] designs a hierarchical network to further capture neighborhood information for each point. Local geometric features are learned by applying PointNet on local groups of points. To enrich local region features, PointWeb [50] constructs a locally fully-linked web by connecting points in a neighborhood region; an MLP-based adaptive feature adjustment (AFA) module that learns contextual information from the region is then formulated through the web. ShellNet [49] partitions the point cloud with a set of concentric spherical shells, and features of points in each shell are encoded by an MLP and summarized by max pooling. However, although MLP is an appropriate solution for handling irregular data, these methods are overly dependent on it. Encoding with MLP alone makes it hard for a network to converge with good generalization capability. In our case, we apply discrete convolution after the MLP produces voxelized features. Thus, the MLP can focus on feature mapping while the discrete convolution learns high-level encoding on fixed grids.

Graph convolution methods   Graph-based networks construct a graph from the point cloud according to the spatial neighbors of each point. Feature learning on the graph can be performed in the spatial or spectral domain. In the spatial domain, convolution is generally defined with an MLP, and graph information for each point is aggregated by a pooling operation over the features of its neighbors [30]. DGCNN [39] progressively updates the graph in feature space after each edge convolution layer composed of an MLP and a pooling operation. DPAM [21] takes a point-similarity graph as input to a graph convolution network and learns an agglomeration matrix that is multiplied with the point feature matrix. Alternatively, graph-based networks in the spectral domain exploit spectral filtering to define convolution. RGCNN [35] updates the graph Laplacian matrix in each layer based on a Chebyshev polynomial approximation. LocalSpecGCN [37] applies spectral graph convolution on a local graph over the nearest neighbors of each point to learn relative layout. Although these graph-based networks have a strong capability to handle irregular data, the graph must be constructed in advance and its connectivity patterns are often complicated.

Point convolution methods   Point convolution methods tend to extend the image convolution concept. These methods define a continuous kernel function to obtain pointwise weights. [9, 42] approximate the continuous convolution with Monte-Carlo integration over a finite number of input points, and an MLP is utilized to construct the kernel function. RS-CNN [22] also uses an MLP for the continuous kernel function, but it acts as a mapping function; another MLP lifts the mapped feature to high-level learning by raising the channel dimension. Our CKConv similarly utilizes a continuous kernel for feature mapping, but it can better capture local geometry since features in CKConv have three spatial dimensions whereas features in RS-CNN have only one. KPConv [36] defines kernel points that carry kernel weights. The weights for the kernel points can be directly parameterized, but the correlation function that approximates the whole continuous kernel from the kernel points is manually defined. In contrast, CKConv can lead to more optimized parameters, since the entire kernel function is learnable. FPConv [20] learns a weight map to project points onto a 2D grid and applies 2D convolution on the image form of the features. This idea is shared with CKConv in terms of discretizing the local point set in feature space. However, FPConv has the severe limitation that its 2D output features lack representation ability regarding object curvature. In contrast, CKConv can retain detailed local geometry without loss of dimension, since it learns to map features into 3D voxels. We validate this reasoning on various tasks in Section 4.

3 Method

We first briefly introduce the principle of point convolution operation. Afterwards, we expound on our proposed Cubic Kernel Convolution (CKConv).

3.1 Convolution on Points

Notations   For the sake of clarity, we define the notations employed in the paper as follows.

Neighboring points around a point y within a predefined radius are denoted as x_i ∈ ℝ³, given by their Cartesian coordinates. We call the set of these points a local point set N(y), on which a single convolution operation is performed. For CKConv, K neighboring points are randomly selected in N(y) to be robust to point cloud density, thus |N(y)| = K. The feature derived at point y is denoted as f_y ∈ ℝ^{C_in}, where C_in is the number of input channels, and f_y can be initialized with additional information such as a normal vector or RGB color. g is the standard convolution kernel function and h is our proposed cubic kernel function. Both kernel functions determine kernel weights in continuous receptive fields.
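As an illustration, a local point set of this form could be gathered with the following PyTorch sketch. The radius, K = 32, and the helper name sample_local_point_set are assumptions made for the example, not the exact sampling routine of CKConv.

```python
# Minimal sketch (assumed radius/K, hypothetical helper name): gather a local
# point set N(y) of fixed size K by ball query followed by random selection.
import torch

def sample_local_point_set(points, center, radius=0.2, k=32):
    """points: (N, 3) cloud, center: (3,) query point y.
    Returns (k, 3) relative coordinates x_i - y of randomly chosen in-radius neighbors."""
    dist = torch.norm(points - center, dim=1)            # distance of every point to y
    idx = torch.nonzero(dist < radius, as_tuple=False).squeeze(1)
    if idx.numel() == 0:                                 # degenerate case: fall back to the first point
        idx = torch.zeros(1, dtype=torch.long)
    # random selection (with replacement if fewer than k points fall inside the ball)
    choice = idx[torch.randint(0, idx.numel(), (k,))]
    return points[choice] - center                       # relative positions consumed by the kernel MLP

pts = torch.rand(1024, 3)
rel = sample_local_point_set(pts, pts[0])
print(rel.shape)  # torch.Size([32, 3])
```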

Point convolution formula   As previous works [38, 9, 42, 36] showed, the standard convolution operation on an arbitrary point y can be formulated as

F(y) = Σ_{x_i ∈ N(y)} g(x_i − y) · f_{x_i},    (1)

where g(x_i − y) produces a weight vector for the neighboring point x_i with size identical to the feature f_{x_i}, and "·" denotes the dot product. This formulation is essentially identical to image convolution except for the characteristics of the kernel function g. Since a point cloud is non-grid structured data without fixed positions, g should be able to handle any point in continuous space. Thus, the kernel function needs to be designed to obtain point-dependent kernel weights from the relations between points. Generally, an MLP is employed as the kernel function, and weights for a neighboring point x_i are learned from its relative position x_i − y.
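For illustration, a minimal PyTorch sketch of such an MLP-based point convolution (Eq. 1) is given below; the hidden width of 64 and the class name MLPKernelPointConv are assumptions for the example.

```python
# Minimal sketch of Eq. 1: an MLP kernel g maps each relative position to a
# per-channel weight matrix, which is contracted with the neighbor feature and
# summed over the local point set. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class MLPKernelPointConv(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        # g: R^3 -> R^{c_in x c_out}, one weight matrix per relative position
        self.g = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, c_in * c_out))
        self.c_in, self.c_out = c_in, c_out

    def forward(self, rel_pos, feats):
        """rel_pos: (K, 3) relative positions x_i - y, feats: (K, c_in) neighbor features."""
        w = self.g(rel_pos).view(-1, self.c_in, self.c_out)   # (K, c_in, c_out)
        # dot product between kernel weights and features, summed over the K neighbors
        return torch.einsum('ki,kio->o', feats, w)             # (c_out,)

conv = MLPKernelPointConv(c_in=6, c_out=64)
print(conv(torch.rand(32, 3), torch.rand(32, 6)).shape)        # torch.Size([64])
```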

Figure 2: Overall CKConv process on a local point set with K neighboring points and kernel unit size U. Point convolution with the cubic kernel is performed as in Eq. 2, and the cubic attention is replicated for an element-wise product with the cubic features. Detailed architectures of the cubic kernel and LSA are shown in Figure 3.

3.2 Cubic Kernel Convolution

Standard image convolution represents each pixel as a feature vector with size equal to the number of channels. More concretely, each pixel has a non-dimensional scalar feature per channel, i.e., a scalar feature. The kernel weight applied to each scalar feature is also a scalar, i.e., a scalar weight. Likewise, existing point convolution methods adopt this standard convolution concept for features and weights. In contrast, we propose to use a 3D cubic form of kernel weights instead of a scalar weight, to enrich the geometric information of the feature. Figure 1 compares standard convolution and cubic kernel convolution.

Point convolution with cubic kernel   Our proposed cubic kernel is implemented with a shared MLP to be invariant to the input order of points, as in previous works [42, 38, 22, 20]. Extending the kernel weights from scalar to 3D can be enforced simply by increasing the output dimension of the MLP. Three-dimensional cubic kernel weights imply that the scalar weight is distributed into multiple weights over the voxels of a cube, which is in fact a 3D tensor. The spatial size of the cubic kernel weights per channel is then changed from 1 × 1 × 1 to U × U × U, where U is a predefined number of voxels along each axis. The scalar feature of a point is multiplied equally with each weight of the cubic kernel, producing output features with the same size as the cubic kernel.

While the weight representation is spatially extended, identical cubic weights are applied to all channels of the feature (see Figure 2), contrary to standard convolution (Eq. 1) that applies different weights to each channel. This operation is based on the concept that our point convolution aims to learn feature voxelization in a spatial manner, which is irrelevant to the channel dimension. Thus, our point convolution with cubic kernel can be formulated as

F_v(y) = Σ_{x_i ∈ N(y)} h(x_i − y) f_{x_i}^T,    (2)

where h(x_i − y) ∈ ℝ^{U×U×U} produces the cubic weights for the neighboring point x_i, whose transposed feature is f_{x_i}^T. The output sizes of standard convolution (Eq. 1) and of the point convolution in CKConv (Eq. 2) then become

1 × 1 × 1 × C_out  and  U × U × U × C_in,    (3)

respectively. Despite producing a larger output feature, our point convolution requires fewer parameters, since the kernel weights obtained from g and h have sizes C_in × C_out and U × U × U, respectively. The size of the cubic kernel weights remains constant, independent of the number of input channels C_in, and the predefined U need not be large, as discussed in Section 4.4. We set U = 4 for our final model, based on the experimental results shown in Table 4.
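To make the operation concrete, a minimal PyTorch sketch of the cubic-kernel point convolution (Eq. 2) is shown below, with U = 4 as in our final model; the MLP width and the class name CubicKernelPointConv are assumptions for the example.

```python
# Minimal sketch of Eq. 2: the MLP h outputs U^3 weights per neighbor; the same
# cube is applied to every feature channel (an outer product), so the summed
# result is a voxelized feature of size U x U x U x C_in.
import torch
import torch.nn as nn

class CubicKernelPointConv(nn.Module):
    def __init__(self, c_in, u=4):
        super().__init__()
        self.u = u
        # h: R^3 -> R^{U^3}; channel-independent cubic weights
        self.h = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, u ** 3))

    def forward(self, rel_pos, feats):
        """rel_pos: (K, 3), feats: (K, c_in) -> voxelized feature (U, U, U, c_in)."""
        k = rel_pos.shape[0]
        cubes = self.h(rel_pos).view(k, self.u, self.u, self.u)   # (K, U, U, U)
        # each scalar channel value is spread over the weight cube, then summed over neighbors
        return torch.einsum('kuvw,kc->uvwc', cubes, feats)

conv = CubicKernelPointConv(c_in=6, u=4)
print(conv(torch.rand(32, 3), torch.rand(32, 6)).shape)          # torch.Size([4, 4, 4, 6])
```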

Forcing to learn feature voxelization   Our point convolution with the cubic kernel produces spatially extended features, but the cubic feature itself does not signify that the feature has taken on a spatial meaning. Only when 3D convolution is applied is the preceding point convolution forced to infuse spatial relations between the voxelized features. This interaction between the convolutions follows from the fact that they affect each other during gradient backpropagation in training, since CKConv is trained end-to-end, as shown in Figure 2. During training, the point convolution comes to produce spatially significant features as 3D convolutions are performed on the voxelized features in a spatial manner.

Since the number of input channels C_in is maintained in the point convolution and increased to C_out in the 3D convolution, we can interpret the roles of the two convolutions as spatial feature mapping and high-level encoding, respectively. In other words, the point convolution losslessly voxelizes the features of the local point set in embedding space, and the discrete convolution extracts the final output feature by operating on the voxelized features. In this way, we also alleviate the problem that the continuous kernel is difficult to approximate with an MLP [44, 23], by giving the continuous kernel a relatively intelligible task to learn.
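A minimal sketch of this discrete stage is given below: off-the-shelf nn.Conv3d layers raise the channel dimension from C_in to C_out and collapse the U × U × U grid into a single per-point feature. The two-layer design and kernel sizes are assumptions, not the exact architecture (which is given in the supplementary material).

```python
# Minimal sketch of the discrete 3D convolutions on the voxelized feature;
# layer count and kernel sizes are illustrative assumptions.
import torch
import torch.nn as nn

c_in, c_out, u = 6, 64, 4
encoder = nn.Sequential(
    nn.Conv3d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv3d(c_out, c_out, kernel_size=u),            # collapse the U^3 grid to 1 x 1 x 1
)

vox = torch.rand(u, u, u, c_in)                         # output of the cubic point convolution
x = vox.permute(3, 0, 1, 2).unsqueeze(0)                # (1, C_in, U, U, U) layout expected by Conv3d
feat = encoder(x).flatten()                             # final per-point feature of size C_out
print(feat.shape)                                       # torch.Size([64])
```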

Furthermore, the spatially extended features are more capable of encoding detailed geometric information than a scalar feature. The structure of the local point set can be better captured with the cubic feature representation, since it inherently contains spatial attributes whereas a scalar feature does not. Thus, our cubic representation is more desirable for point cloud analysis, which focuses on learning the spatial distribution of points.

Figure 3: Cubic kernel function (top) and local set attention (bottom) architectures.

Cubic kernel normalization   Since the kernel weights are obtained from a continuous kernel function approximated by an MLP, the scale and variance of the cubic weights can vary substantially from point to point. To prevent this weight imbalance from causing unstable training, we impose restrictions on the distribution of the cubic weights by applying normalization schemes.

Let the cubic weights for a given point be W = h(x_i − y). Then L2 normalization on each cubic weight can be formulated as

N_L2(W) = W / ‖W‖_2    (4)

to stabilize the scale distribution of the cubic weights over points. To further restrict the scale and variance distributions of the cubic weights, standardization can be applied as

N_ST(W) = (W − μ_W) / σ_W,    (5)

where μ_W and σ_W are the mean and standard deviation of the elements of W. These cubic weight normalization schemes help to improve performance (see Section 4.4), and hence we apply normalization after every weight prediction in CKConv layers.
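Both schemes amount to simple per-point operations over the U³ weights, as in the sketch below; the function names and the small epsilon added for numerical stability are assumptions for the example.

```python
# Minimal sketch of Eq. 4 (L2 normalization) and Eq. 5 (standardization),
# applied independently to each neighbor's U x U x U weight cube.
import torch

def l2_normalize(w, eps=1e-8):
    """w: (K, U, U, U) cubic weights; unit L2 norm per neighbor (Eq. 4)."""
    return w / (w.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + eps)

def standardize(w, eps=1e-8):
    """Zero mean and unit standard deviation over each neighbor's cube (Eq. 5)."""
    mean = w.flatten(1).mean(dim=1).view(-1, 1, 1, 1)
    std = w.flatten(1).std(dim=1).view(-1, 1, 1, 1)
    return (w - mean) / (std + eps)

w = torch.randn(32, 4, 4, 4)
print(l2_normalize(w).flatten(1).norm(dim=1)[:3])   # ~1 for every point
print(standardize(w).flatten(1).std(dim=1)[:3])     # ~1 for every point
```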

Local set attention   Due to the irregularity of point clouds, every local point set has a different structure. Thus, it is important that the point convolution captures the unique structure of each local point set. For CKConv, we construct an additional branch for the proposed Local Set Attention (LSA), which provides comprehensive geometric information of the local point set. As shown in Figure 3, LSA shares the front MLP with the cubic kernel function, since it also needs to learn feature mapping together with the cubic kernel. The representative feature of the local point set is obtained by applying max pooling to the intermediate features in the cubic kernel function, and another MLP is used to extract the cubic attention. In this way, the cubic attention, which has the same size as the cubic kernel, contains overall structure information, since it is extracted by aggregating features of the local point set. The cubic attention is then element-wise multiplied with all cubic features by replication along the channel dimension (see Figure 2). For the formulation, let the input matrix in Figure 3 be

P = [x_1 − y, …, x_K − y]^T ∈ ℝ^{K×3}.    (6)

Then our point convolution with LSA can be extended from Eq. 2 as

F_v(y) = Σ_{x_i ∈ N(y)} ( h(x_i − y) ⊙ (A(P) + 1) ) f_{x_i}^T,    (7)

where A is the attention function and ⊙ is the element-wise product. 1 is a matrix of ones with the same size as A(P), and it realizes a skip connection for the features without LSA (the cubic attention in Figure 2 contains this skip connection). With this point convolution formula, our cubic kernel convolution can be written as

CKConv(y) = Conv3D(F_v(y)),    (8)

where Conv3D denotes the 3D convolution operations on the voxelized features with output channel C_out.
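Putting the pieces together, one CKConv layer with LSA could be sketched as follows. The layer widths, the placement of L2 normalization, and the class name CKConvLSA are assumptions for illustration; the exact architecture is described in the supplementary material.

```python
# Minimal sketch of a CKConv layer with Local Set Attention (Eq. 7 and Eq. 8):
# a shared front MLP feeds both the cubic kernel head and, after max pooling
# over the local set, the attention head; the attended voxelized feature is
# then encoded by 3D convolutions. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class CKConvLSA(nn.Module):
    def __init__(self, c_in, c_out, u=4, hidden=64):
        super().__init__()
        self.u = u
        self.front = nn.Sequential(nn.Linear(3, hidden), nn.ReLU())    # shared front MLP
        self.cubic_head = nn.Linear(hidden, u ** 3)                     # cubic kernel h
        self.lsa_head = nn.Linear(hidden, u ** 3)                       # attention MLP after max pooling
        self.conv3d = nn.Sequential(
            nn.Conv3d(c_in, c_out, 3, padding=1), nn.ReLU(),
            nn.Conv3d(c_out, c_out, u),
        )

    def forward(self, rel_pos, feats):
        """rel_pos: (K, 3), feats: (K, c_in) -> per-point feature (c_out,)."""
        k = rel_pos.shape[0]
        mid = self.front(rel_pos)                                       # intermediate features (K, hidden)
        cubes = self.cubic_head(mid).view(k, self.u, self.u, self.u)    # per-neighbor cubic weights
        cubes = cubes / (cubes.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-8)  # L2 normalization (Eq. 4)
        # LSA: max-pool intermediate features over the local set, predict one attention cube
        attn = self.lsa_head(mid.max(dim=0).values).view(self.u, self.u, self.u)
        cubes = cubes * (attn + 1.0)                                    # Eq. 7; the +1 acts as the skip connection
        vox = torch.einsum('kuvw,kc->cuvw', cubes, feats)               # voxelized feature (C_in, U, U, U)
        return self.conv3d(vox.unsqueeze(0)).flatten()                  # Eq. 8: 3D convolutions -> (C_out,)

layer = CKConvLSA(c_in=6, c_out=64)
print(layer(torch.rand(32, 3), torch.rand(32, 6)).shape)                # torch.Size([64])
```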

4 Experiments

We evaluate our proposed CKConv on three tasks: object classification, object part segmentation, and scene semantic segmentation. Network architectures, configurations, and detailed training parameters are provided in the supplementary material.

Method Input mIoU aero bag cap car chair ear. guit. knife lamp lapt. moto. mug pist. rock. skate. table
Kd-Net [15] 4k points  
SynSpecCNN [48] graph
SO-Net [18] 1k points
PointNet++ [28] 2k points
DGCNN [39] 2k points 85.1
SpiderCNN [44] 2k points 85.3
PointCNN [19] 2k points
RS-CNN [22] 2k points
KPConv [36] 2k points  
CKConv 2k points
Table 2: Object part segmentation results on ShapeNetPart. The mean of instance-wise IoU (mIoU, %) and class-wise IoU (%) are reported.
Methods Input mAcc OA
3DShapeNets [43] voxels  
VoxNet [24] voxels -
Vol. CNN [27] voxels -
SO-Net [18] 2k points
PointNet++ [28] 5k points -
SO-Net [18] 5k points
KPConv [36] 7k points -
Pointwise CNN [11] 1k points
PointNet [26] 1k points
KCNet [29] 1k points -
PointNet++ [28] 1k points -
PointCNN [19] 1k points
DGCNN [39] 1k points
SpiderCNN [44] 1k points -
PointConv [42] 1k points -
FPConv [20] 1k points -
A-CNN [16] 1k points
RS-CNN [22] 1k points -
InterpCNN [23] 1k points -
ShellNet [49] 1k points -  
CKConv 1k points
Table 1: Object classification results on ModelNet40. Overall accuracy (OA, %) and mean of class-wise accuracy (mAcc, %) are reported.

4.1 Object Classification

We use the ModelNet40 [43] dataset, which contains 9,843 training and 2,468 test models in 40 classes, for object classification. The point cloud data used to train and test our model is provided by [26], and 1,024 points are uniformly sampled as input. We use normals as an additional feature. For training, random scaling and translation are used as the augmentation strategy, as in [15, 22]. A dropout layer [31] with 0.5 probability is used in the final fully connected layers to reduce over-fitting. For evaluation, we do not apply a voting strategy that repeatedly predicts an object's class with random scaling or sampling.

We compare the overall accuracy (OA) and mean of class-wise accuracy (mAcc) of the proposed CKConv with relevant previous state-of-the-art models in Table 1. CKConv achieves higher accuracy than all other considered methods. In the case of RS-CNN [22], we report the performance without the voting strategy, as provided in the paper. Among methods that exploit 3D convolution for feature learning, our approach performs much better than those [43, 24, 27] that rasterize the raw point cloud into voxels. The performance gap stems from the fact that CKConv learns to voxelize the feature in embedding space, which prevents information loss in the discretization process. CKConv also outperforms methods that take points as input, which demonstrates the effectiveness of our unique feature representation.

Figure 4: Visualization of part segmentation results on test data.
Method OA mAcc mIoU ceil. floor wall beam col. wind. door chair table book. sofa board clut.
PointNet [26] -  
PointCNN [19]
PointWeb [50]
FPConv [20]
KPConv [36] -  
CKConv
Table 3: Semantic segmentation results evaluated on S3DIS Area-5. Overall accuracy (OA, %), mean of class-wise accuracy (mAcc, %), mean of class-wise IoU (mIoU, %), and class-wise IoU (%) are reported.
Figure 5: Visualization of semantic segmentation results on S3DIS Area-5.

4.2 Object Part Segmentation

We evaluate CKConv on the ShapeNetPart [47] dataset for object part segmentation. ShapeNetPart contains 16,881 point clouds from 16 classes. Each shape is annotated with 2-6 parts, and there are 50 different parts in total. We follow the data split used in [26], and 2,048 points with normals are randomly sampled as input. The one-hot encoding of the object label is concatenated to the last layer as in [26]. We adopt the same augmentation strategy used in the object classification task. During testing, we apply a voting strategy with random scaling.

Table 2 summarizes the part segmentation results with the mean of instance-wise intersection over union (mIoU) and class-wise intersection over union. CKConv outperforms the state-of-the-art methods in mIoU and achieves new best results for multiple classes. Examples of object part segmentation results are visualized in Figure 4, verifying that CKConv performs robustly on diverse objects. More examples are provided in the supplementary material.

4.3 Scene Semantic Segmentation

Different from the synthetic datasets used in classification and part segmentation, datasets for scene segmentation generally come from the real world, making the task challenging. We use S3DIS [1], which contains point clouds from 6 large-scale indoor areas. Each point is annotated with one semantic label from 13 classes. We follow the sampling strategy used in [20] to prepare the training data. Each point is represented by a 9D vector combining XYZ, RGB, and the normalized location. For evaluation, we report the results tested on Area 5 while training on the rest. We use evaluation metrics including overall point-wise accuracy (OA), mean of class-wise accuracy (mAcc), and mean of class-wise intersection over union (mIoU).

As shown in Table 3, CKConv outperforms all state-of-the-art methods in terms of OA and mAcc, while matching KPConv [36] for the best mIoU. L2 normalization and local set attention are applied for scene segmentation; the performances with and without normalization and LSA can be found in the supplementary material. The results verify that our approach captures semantic geometry better than previous methods, i.e., the spatially extended feature representation of CKConv contains more explicit structure information than a scalar feature. Qualitative results are shown in Figure 5, and more visualizations, including failure cases, are available in the supplementary material.

4.4 Ablation Study

To evaluate the influence of various components of CKConv, we further conduct ablation studies on object classification and object part segmentation. All results reported in this section are obtained without the voting scheme.

Cubic kernel unit size   We first explore different settings of the cubic kernel unit size U in Table 4. Note that the cubic kernel size determines the resolution of feature voxelization, since the output feature size of our point convolution is U × U × U × C_in (Eq. 3). For this experiment, normalization and LSA are employed in every setting. For U ≥ 6, the spatial size of the output feature becomes 216 or more, which is unnecessarily large to represent the local point set. We apply different 3D convolution kernel sizes for each setting of U to extract the final output features in Eq. 8 (details in the supplementary material). The results show that 4 × 4 × 4 is the most suitable cubic kernel size for object classification and object part segmentation. We also adopt U = 4 for scene segmentation, since a similar number of neighboring points is used to define the local point set.

Weight normalization and local set attention   We verify the effectiveness of cubic weight normalization and local set attention in Table 5, with a cubic kernel size of 4 × 4 × 4. Baseline model A is trained without cubic weight normalization and local set attention, and its OA on the classification task and mIoU on the part segmentation task serve as the reference. When L2 normalization is adopted for cubic weight normalization (model B), both results are slightly improved. The proposed local set attention then boosts the performance further (model C), showing a particularly large improvement on the segmentation task. When we instead apply standardization for weight normalization without local set attention (model D), the model achieves results similar to those with L2 normalization. When local set attention is additionally applied (model E), the classification accuracy is significantly improved.

Cubic kernel size ModelNet40 ShapeNetPart
OA mIoU
Table 4: Classification and part segmentation results (%) with different cubic kernel sizes. Overall accuracy (%) and mean of instance-wise IoU (%) are reported.
Model LSA ModelNet40 ShapeNetPart
OA mIoU
A
B
C
D
E
Table 5: Ablation study for CKConv on ModelNet40 and ShapeNetPart. Overall accuracy (%) and mean of instance-wise IoU (%) are reported. N_L2 and N_ST are the normalization methods defined in Eq. 4 and Eq. 5, respectively.
Figure 6: Part segmentation performance (%) on ShapeNetPart versus training epochs. L2 Normalization and ST Normalization are defined in Eq. 4 and Eq. 5, respectively.

Although we have proposed two weight normalization methods (L2 normalization and standardization), note that they could be replaced with other normalization schemes. The positive influence of weight normalization during training can be seen in Figure 6 (a). Also, from Figure 6 (b), we can observe that features extracted with local set attention contain comprehensive structure information about the local point set, thus achieving better performance.

4.5 Feature Visualization

In Figure 7, features learned in different layers are visualized. We can observe that features learned in the first layer exhibit high activation for low-level structures such as edges and corners, whereas features learned in later layers show high activation for semantic structures such as wings, tails, and legs. Thus, feature response shifts from point to part level by capturing more global geometry as layers deepen.

5 Conclusion

We have presented CKConv, a novel convolution operator for point clouds. CKConv exploits both continuous point convolution and discrete convolution to extract spatially extended features for the local point set, with the proposed cubic kernel. The spatial extension of the feature representation induced by feature voxelization enables the detailed geometry of point clouds to be better captured, leading to enriched feature learning. Moreover, local set attention (LSA) has been proposed to encode the representative feature of the local point set by imparting additional spatial attention with comprehensive structure awareness. Experiments on three different tasks have verified that our approach achieves state-of-the-art performance.

Figure 7: Feature activation from different layers for part segmentation on ShapeNetPart. Higher activation is shown as darker color.

References

  • [1] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese (2016) 3D semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1534–1543. Cited by: §4.3.
  • [2] N. Audebert, B. Le Saux, and S. Lefèvre (2016) Semantic segmentation of earth observation data using multimodal and multi-scale deep networks. In Asian conference on computer vision, pp. 180–196. Cited by: §2.
  • [3] A. Boulch, B. Le Saux, and N. Audebert (2017) Unstructured point cloud semantic labeling using deep segmentation networks.. 3DOR 2, pp. 7. Cited by: §2.
  • [4] A. Boulch (2019) Generalizing discrete convolutions for unstructured point clouds.. In 3DOR, pp. 71–78. Cited by: §1.
  • [5] A. Boulch (2020) ConvPoint: continuous convolutions for point cloud processing. Computers & Graphics 88, pp. 24–34. Cited by: §1.
  • [6] C. Choy, J. Gwak, and S. Savarese (2019) 4D spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3075–3084. Cited by: §2.
  • [7] F. Engelmann, T. Kontogianni, J. Schult, and B. Leibe (2018) Know what your neighbors do: 3d semantic segmentation of point clouds. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pp. 0–0. Cited by: §1.
  • [8] B. Graham, M. Engelcke, and L. Van Der Maaten (2018) 3d semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9224–9232. Cited by: §1, §2.
  • [9] P. Hermosilla, T. Ritschel, P. Vázquez, À. Vinacua, and T. Ropinski (2018) Monte carlo convolution for learning on non-uniformly sampled point clouds. ACM Transactions on Graphics (TOG) 37 (6), pp. 1–12. Cited by: §1, §2, §3.1.
  • [10] Q. Hu, B. Yang, L. Xie, S. Rosa, Y. Guo, Z. Wang, N. Trigoni, and A. Markham (2020) Randla-net: efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11108–11117. Cited by: §1.
  • [11] B. Hua, M. Tran, and S. Yeung (2018) Pointwise convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 984–993. Cited by: Table 1.
  • [12] J. Huang and S. You (2016) Point cloud labeling using 3d convolutional neural network. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 2670–2675. Cited by: §1, §2.
  • [13] M. Jiang, Y. Wu, T. Zhao, Z. Zhao, and C. Lu (2018) Pointsift: a sift-like network module for 3d point cloud semantic segmentation. arXiv preprint arXiv:1807.00652. Cited by: §1, §1.
  • [14] M. Joseph-Rivlin, A. Zvirin, and R. Kimmel (2019) Momen(e)t: flavor the moments in learning to classify shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §1.
  • [15] R. Klokov and V. Lempitsky (2017) Escape from cells: deep kd-networks for the recognition of 3d point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, pp. 863–872. Cited by: §4.1, Table 2.
  • [16] A. Komarichev, Z. Zhong, and J. Hua (2019) A-cnn: annularly convolutional neural networks on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7421–7430. Cited by: Table 1.
  • [17] F. J. Lawin, M. Danelljan, P. Tosteberg, G. Bhat, F. S. Khan, and M. Felsberg (2017) Deep projective 3d semantic segmentation. In International Conference on Computer Analysis of Images and Patterns, pp. 95–107. Cited by: §2.
  • [18] J. Li, B. M. Chen, and G. H. Lee (2018) So-net: self-organizing network for point cloud analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9397–9406. Cited by: Table 2, Table 1.
  • [19] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen (2018) PointCNN: convolution on X-transformed points. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 828–838. Cited by: Table 2, Table 1, Table 3.
  • [20] Y. Lin, Z. Yan, H. Huang, D. Du, L. Liu, S. Cui, and X. Han (2020) Fpconv: learning local flattening for point convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4293–4302. Cited by: §1, §2, §3.2, §4.3, Table 1, Table 3.
  • [21] J. Liu, B. Ni, C. Li, J. Yang, and Q. Tian (2019) Dynamic points agglomeration for hierarchical point sets learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7546–7555. Cited by: §2.
  • [22] Y. Liu, B. Fan, S. Xiang, and C. Pan (2019) Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8895–8904. Cited by: §1, §1, §2, §3.2, §4.1, §4.1, Table 2, Table 1.
  • [23] J. Mao, X. Wang, and H. Li (2019) Interpolated convolutional networks for 3d point cloud understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1578–1587. Cited by: §3.2, Table 1.
  • [24] D. Maturana and S. Scherer (2015) Voxnet: a 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 922–928. Cited by: §1, §2, §4.1, Table 1.
  • [25] A. Milioto, I. Vizzo, J. Behley, and C. Stachniss (2019) Rangenet++: fast and accurate lidar semantic segmentation. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4213–4220. Cited by: §2.
  • [26] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660. Cited by: §1, §1, §2, §4.1, §4.2, Table 1, Table 3.
  • [27] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas (2016) Volumetric and multi-view cnns for object classification on 3d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5648–5656. Cited by: §4.1, Table 1.
  • [28] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413. Cited by: §1, §2, Table 2, Table 1.
  • [29] Y. Shen, C. Feng, Y. Yang, and D. Tian (2018) Mining point cloud local structures by kernel correlation and graph pooling. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4548–4557. Cited by: Table 1.
  • [30] M. Simonovsky and N. Komodakis (2017) Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3693–3702. Cited by: §2.
  • [31] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §4.1.
  • [32] H. Su, V. Jampani, D. Sun, S. Maji, E. Kalogerakis, M. Yang, and J. Kautz (2018) Splatnet: sparse lattice networks for point cloud processing. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2530–2539. Cited by: §2.
  • [33] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller (2015) Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, pp. 945–953. Cited by: §2.
  • [34] L. Tchapmi, C. Choy, I. Armeni, J. Gwak, and S. Savarese (2017) Segcloud: semantic segmentation of 3d point clouds. In 2017 international conference on 3D vision (3DV), pp. 537–547. Cited by: §1, §2.
  • [35] G. Te, W. Hu, A. Zheng, and Z. Guo (2018) Rgcnn: regularized graph cnn for point cloud segmentation. In Proceedings of the 26th ACM international conference on Multimedia, pp. 746–754. Cited by: §2.
  • [36] H. Thomas, C. R. Qi, J. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas (2019) Kpconv: flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6411–6420. Cited by: §1, §2, §3.1, §4.3, Table 2, Table 1, Table 3.
  • [37] C. Wang, B. Samari, and K. Siddiqi (2018) Local spectral graph convolution for point set feature learning. In Proceedings of the European conference on computer vision (ECCV), pp. 52–66. Cited by: §2.
  • [38] S. Wang, S. Suo, W. Ma, A. Pokrovsky, and R. Urtasun (2018) Deep parametric continuous convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2589–2597. Cited by: §1, §3.1, §3.2.
  • [39] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog) 38 (5), pp. 1–12. Cited by: §2, Table 2, Table 1.
  • [40] B. Wu, A. Wan, X. Yue, and K. Keutzer (2018) Squeezeseg: convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1887–1893. Cited by: §2.
  • [41] B. Wu, X. Zhou, S. Zhao, X. Yue, and K. Keutzer (2019) Squeezesegv2: improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud. In 2019 International Conference on Robotics and Automation (ICRA), pp. 4376–4382. Cited by: §2.
  • [42] W. Wu, Z. Qi, and L. Fuxin (2019) Pointconv: deep convolutional networks on 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9621–9630. Cited by: §1, §2, §3.1, §3.2, Table 1.
  • [43] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3d shapenets: a deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1912–1920. Cited by: §1, §2, §4.1, §4.1, Table 1.
  • [44] Y. Xu, T. Fan, M. Xu, L. Zeng, and Y. Qiao (2018) Spidercnn: deep learning on point sets with parameterized convolutional filters. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 87–102. Cited by: §3.2, Table 2, Table 1.
  • [45] J. Yang, Q. Zhang, B. Ni, L. Li, J. Liu, M. Zhou, and Q. Tian (2019) Modeling point clouds with self-attention and gumbel subset sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3323–3332. Cited by: §1.
  • [46] Z. Yang and L. Wang (2019) Learning relationships for multi-view 3d object recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7505–7514. Cited by: §2.
  • [47] L. Yi, V. G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, and L. Guibas (2016) A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (ToG) 35 (6), pp. 1–12. Cited by: §4.2.
  • [48] L. Yi, H. Su, X. Guo, and L. J. Guibas (2017) Syncspeccnn: synchronized spectral cnn for 3d shape segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2282–2290. Cited by: Table 2.
  • [49] Z. Zhang, B. Hua, and S. Yeung (2019) Shellnet: efficient point cloud convolutional neural networks using concentric shells statistics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1607–1616. Cited by: §1, §2, Table 1.
  • [50] H. Zhao, L. Jiang, C. Fu, and J. Jia (2019) Pointweb: enhancing local neighborhood features for point cloud processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5565–5573. Cited by: §1, §2, Table 3.