
PointConv: Deep Convolutional Networks on 3D Point Clouds

11/17/2018
by Wenxuan Wu, et al.
Oregon State University

Unlike images which are represented in regular dense grids, 3D point clouds are irregular and unordered, hence applying convolution on them can be difficult. In this paper, we extend the dynamic filter to a new convolution operation, named PointConv. PointConv can be applied on point clouds to build deep convolutional networks. We treat convolution kernels as nonlinear functions of the local coordinates of 3D points comprised of weight and density functions. With respect to a given point, the weight functions are learned with multi-layer perceptron networks and the density functions through kernel density estimation. A novel reformulation is proposed for efficiently computing the weight functions, which allowed us to dramatically scale up the network and significantly improve its performance. The learned convolution kernel can be used to compute translation-invariant and permutation-invariant convolution on any point set in the 3D space. Besides, PointConv can also be used as a deconvolution operator to propagate features from a subsampled point cloud back to its original resolution. Experiments on ModelNet40, ShapeNet, and ScanNet show that deep convolutional neural networks built on PointConv are able to achieve state-of-the-art results on challenging semantic segmentation benchmarks on 3D point clouds. In addition, our experiments converting CIFAR-10 into a point cloud showed that networks built on PointConv can match the performance of convolutional networks on 2D images with a similar structure.


1 Introduction

In recent robotics, autonomous driving and virtual/augmented reality applications, sensors that can directly obtain 3D data are increasingly ubiquitous. These include indoor sensors such as laser scanners, time-of-flight sensors such as the Kinect, RealSense or Google Tango, structured light sensors such as the one on the iPhone X, as well as outdoor sensors such as LIDAR and MEMS sensors. The capability to directly measure 3D data is invaluable in these applications, as depth information can remove a lot of the segmentation ambiguities in 2D imagery, and surface normals provide important cues about scene geometry.

In 2D images, convolutional neural networks (CNNs) have fundamentally changed the landscape of computer vision by greatly improving results on almost every vision task. CNNs succeed by exploiting their translation-invariance property, so that the same set of convolutional filters can be applied at all locations in an image, reducing the number of parameters and improving generalization. We would hope such success transfers to the analysis of 3D data. However, 3D data often come in the form of point clouds, i.e. sets of unordered 3D points, with or without additional features (e.g. RGB) on each point. Point clouds are unordered and do not conform to the regular lattice grids of 2D images. It is difficult to apply conventional CNNs to such unordered input. An alternative approach is to treat the 3D space as a volumetric grid, but in this case, the volume will be sparse and CNNs will be computationally intractable on high-resolution volumes.

In this paper, we propose a novel approach to perform convolutions on 3D point clouds with non-uniform sampling. We note that the conventional 2D convolution operation can be viewed as a discrete approximation of a continuous convolution operator. In 3D space, we can treat the weights of this convolution operator as a (Lipschitz) continuous function of the local 3D point coordinates with respect to a reference 3D point. The continuous function can be approximated by a multi-layer perceptron (MLP), as done in [31] and [14]. But these algorithms did not take non-uniform sampling into account. We propose to use an inverse density scale to re-weight the continuous function learned by the MLP, which corresponds to the Monte Carlo approximation of the continuous convolution. We call such an operation PointConv. PointConv involves taking the positions of the point cloud as input and learning an MLP to approximate a weight function, as well as applying an inverse density scale on the learned weights to compensate for the non-uniform sampling.

The naive implementation of PointConv is memory inefficient when the channel size of the output features is very large, making it hard to train and scale up to large networks. In order to reduce the memory consumption of PointConv, we introduce an approach that greatly increases the memory efficiency using a reformulation that changes the summation order. The new structure is capable of building multi-layer deep convolutional networks on 3D point clouds with capabilities similar to those of 2D CNNs on raster images. We can achieve the same translation invariance as in 2D convolutional networks, together with invariance to permutations of the ordering of points in a point cloud.

In segmentation tasks, the ability to transfer information gradually from coarse layers to finer layers is important. Hence, a deconvolution operation [22] that can fully leverage the features from a coarse layer in a finer layer is vital for performance. Most state-of-the-art algorithms [24, 26] are unable to perform deconvolution, which restricts their performance on segmentation tasks. Since our PointConv is a full approximation of convolution, it is natural to extend PointConv to PointDeconv, which can fully utilize the information in coarse layers and propagate it to finer layers. By using PointConv and PointDeconv, we can achieve improved performance on semantic segmentation tasks.

The key contributions of our work are:

We propose PointConv, a density re-weighted convolution, which is able to fully approximate the 3D continuous convolution on any set of 3D points.

We design a memory efficient approach to implement PointConv using a change of summation order technique.

We extend our PointConv to a deconvolution version (PointDeconv) to achieve better segmentation results.

Experiments show that our deep network built on PointConv is highly competitive against other point cloud deep networks and achieves state-of-the-art results on the part segmentation [2] and indoor semantic segmentation benchmarks [5]. In order to demonstrate that our PointConv is indeed a true convolution operation, we also evaluate PointConv on CIFAR-10 by converting all pixels in a 2D image into a point cloud with 2D coordinates along with RGB features on each point. Experiments on CIFAR-10 show that the classification accuracy of our PointConv with a 5-layer network is 89.13%, comparable to the results of AlexNet [18], which has a similar depth, and far outperforming the previous best result achieved by a point cloud network. As a basic approach to CNNs on 3D data, we believe there could be many potential applications of this approach in the near future.

2 Related Work

Most work on 3D CNNs converts 3D point clouds to 2D images or 3D volumetric grids. [34, 25] proposed to project 3D point clouds or shapes into several 2D images and then apply 2D convolutional networks for classification. Although these approaches have achieved dominant performance on shape classification and retrieval tasks, it is nontrivial to extend them to high-resolution scene segmentation tasks [5]. [40, 21, 25] represent another type of approach that voxelizes point clouds into volumetric grids by quantization and then applies 3D convolutional networks. This type of approach is constrained by its 3D volumetric resolution and the computational cost of 3D convolutions. [29] improves the resolution significantly by using a set of unbalanced octrees where each leaf node stores a pooled feature representation. Kd-networks [16] compute the representations in a feed-forward bottom-up fashion on a Kd-tree of a certain size. In a Kd-network, the number of input points in the point cloud needs to be the same during training and testing, which does not hold for many tasks.

Some recent work [28, 24, 26, 33, 35, 11, 8, 37] directly takes raw point clouds as input without converting them to other formats. [24, 28] proposed to use shared multi-layer perceptrons and max-pooling layers to obtain features of point clouds. Because the max-pooling layers are applied across all the points in the point cloud, it is difficult to capture local features. PointNet++ [26] improved the network in PointNet [24] by adding a hierarchical structure. The hierarchical structure is similar to the one used in image CNNs, which extracts features starting from small local regions and gradually extending to larger regions. The key structure used in both PointNet [24] and PointNet++ [26] to aggregate features from different points is max-pooling. However, max-pooling layers keep only the strongest activation in the features across a local or global region, which may lose detailed information that is useful for segmentation. [33] presents a method, called SPLATNet, that projects the input features of the point cloud onto a high-dimensional lattice and then applies bilateral convolution on that lattice to aggregate features. SPLATNet [33] is able to give results comparable to PointNet++ [26]. The tangent convolution [35] projects the local surface geometry onto a tangent plane around every point, which gives a set of planar tangent images that can be convolved. The pointwise convolution [11] queries nearest neighbors on the fly and bins the points into kernel cells, then applies kernel weights on the binned cells to convolve on point clouds. Flex-convolution [8] introduced a generalization of the conventional convolution layer along with an efficient GPU implementation, which can be applied to point clouds with millions of points. FeaStNet [37] proposes to generalize the conventional convolution layer to 3D point clouds by adding a soft-assignment matrix.

The works [31, 14, 38, 41] propose to learn continuous filters to perform convolution. [14] proposed that the weight filter in 2D convolution can be treated as a continuous function, which can be approximated by MLPs. [31] first introduced this idea to 3D graph structures. Both [14] and [31] are able to achieve very good results on classification tasks. Dynamic graph CNN [38] proposed a depthwise convolution on graphs along with a method that can dynamically update the graph. [41] presents a special family of filters to approximate the weight function instead of using MLPs. Our paper differs from these approaches in three aspects. First, we use density to compensate for the non-uniform sampling of point clouds. Second, we propose a memory efficient version of PointConv, which was not addressed in [31] and [14]. Third, we propose a deconvolution operator based on PointConv to perform semantic segmentation.

PointCNN [19] is another approach proposed to perform convolution on point clouds. The idea in PointCNN [19] is to learn a transformation from the input points and then use it to simultaneously weight and permute the input features associated with the points. For a local region with $K$ points, a $K \times K$ transformation matrix [19] is learned from the coordinates of the input points with multi-layer perceptrons. In contrast to our approach, PointCNN is unable to achieve permutation invariance, which is desired for point clouds.

3 PointConv

We propose a convolution operation, named PointConv, which extends traditional image convolution to point clouds. PointConv is a Monte Carlo approximation of the 3D continuous convolution operator. For each convolutional filter, it uses an MLP to approximate a weight function, then applies an inverse density scale to re-weight the learned weight function. Sec. 3.1 introduces the structure of the PointConv layer. Sec. 3.2 introduces PointDeconv, which uses PointConv layers to deconvolve features.

3.1 Convolution on 3D Point Clouds

Formally, convolution is defined as in Eq.(1) for functions $f$ and $g$ of a $d$-dimensional vector $\mathbf{x}$:

$(f * g)(\mathbf{x}) = \iint_{\boldsymbol{\tau} \in \mathbb{R}^d} f(\boldsymbol{\tau})\, g(\mathbf{x} + \boldsymbol{\tau})\, d\boldsymbol{\tau}$    (1)

Images can be interpreted as 2D discrete functions, which are usually represented as grid-shaped matrices. In a CNN, each filter is restricted to a small local region, such as 3x3, 5x5, etc. Within each local region, the relative positions between different pixels are always fixed, as shown in Figure 1 (a). The filter can thus be discretized into a summation with a real-valued weight at each location within the local region.
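To make this discretization concrete, here is a minimal sketch (not taken from the paper) of a single 3x3 filter response computed as a weighted sum over the fixed grid of relative offsets; all array names are illustrative.

```python
# Sketch (not from the paper): a 3x3 image filter as a weighted sum over
# fixed relative offsets. Array names are illustrative.
import numpy as np

image = np.random.rand(5, 5)          # a small single-channel image
weights = np.random.rand(3, 3)        # one 3x3 filter

def filter_response_at(image, weights, y, x):
    """Filter response at pixel (y, x): the relative offsets are always
    the same fixed grid {-1, 0, 1} x {-1, 0, 1}."""
    out = 0.0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += weights[dy + 1, dx + 1] * image[y + dy, x + dx]
    return out

print(filter_response_at(image, weights, 2, 2))
```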

A point cloud is represented as a set of 3D points $\{p_i \mid i = 1, \dots, n\}$, where each point contains a position vector $(x, y, z)$ and its features such as color, surface normal, etc. Different from images, point clouds have more flexible shapes. The coordinate of a point in a point cloud is not located on a fixed grid but can take an arbitrary continuous value. Thus, the relative positions of different points are diverse in each local region. Conventional discretized convolution filters on raster images cannot be applied directly on the point cloud. Fig. 1 shows the difference between a local region in an image and a point cloud.

Figure 1: Image grid vs. point cloud. (a) shows a local region in an image, where the distance between points can only attain very few discrete values; (b) and (c) show that in different local regions within a point cloud, the order and the relative positions of the points can be very different.

To make convolution compatible with point sets, we propose a permutation-invariant convolution operation, called PointConv. Our idea is to first go back to the continuous version of 3D convolution as:

$\mathrm{Conv}(W, F)_{xyz} = \iiint_{(\delta_x, \delta_y, \delta_z) \in G} W(\delta_x, \delta_y, \delta_z)\, F(x + \delta_x, y + \delta_y, z + \delta_z)\, d\delta_x\, d\delta_y\, d\delta_z$    (2)

$F(x + \delta_x, y + \delta_y, z + \delta_z)$ is the feature of a point in the local region $G$ centered around point $p = (x, y, z)$. A point cloud can be viewed as a non-uniform sample from the continuous space. In each local region, $(\delta_x, \delta_y, \delta_z)$ could be any possible position in the local region. We define PointConv as a Monte Carlo estimate of the continuous 3D convolution with respect to an importance sampling:

$\mathrm{PointConv}(S, W, F)_{xyz} = \sum_{(\delta_x, \delta_y, \delta_z) \in G} S(\delta_x, \delta_y, \delta_z)\, W(\delta_x, \delta_y, \delta_z)\, F(x + \delta_x, y + \delta_y, z + \delta_z)$    (3)
Figure 2: 2D weight function for PointConv. (a) is a learned continuous weight function; (b) and (c) are different local regions in a 2D point cloud. Given the 2D points, we can obtain the weights at those particular locations. The same applies to 3D points. The regular discrete 2D convolution can be viewed as a discretization of the continuous convolution weight function, as in (d).

$S(\delta_x, \delta_y, \delta_z)$ is the inverse density at point $(\delta_x, \delta_y, \delta_z)$. $S$ is required because the point cloud can be sampled very non-uniformly¹. Intuitively, the number of points in a local region varies across the whole point cloud, as in Figure 2 (b) and (c). Besides, in Figure 2 (c), the points are very close to one another, hence the contribution of each should be smaller.

¹ To see this, note the Monte Carlo estimate with a biased sample: $\int f(x)\, dx = \int \frac{f(x)}{p(x)}\, p(x)\, dx \approx \frac{1}{n}\sum_i \frac{f(x_i)}{p(x_i)}$, for $x_i \sim p(x)$. Point clouds are often biased samples because many sensors have difficulties measuring points near plane boundaries, hence the need for this reweighting.

Our main idea is to approximate the weight function $W(\delta_x, \delta_y, \delta_z)$ by multi-layer perceptrons from the 3D coordinates $(\delta_x, \delta_y, \delta_z)$, and the inverse density $S(\delta_x, \delta_y, \delta_z)$ by a kernelized density estimation [36] followed by a nonlinear transform implemented with an MLP. Because the weight function highly depends on the distribution of the input point cloud, we call the entire convolution operation PointConv. Prior work [14, 31] considered the approximation of the weight function, but did not consider the approximation of the density scale, hence it is not a full approximation of the continuous convolution operator.

The weights of the MLP in PointConv are shared across all the points to maintain permutation invariance. In order to compute the inverse density scale $S(\delta_x, \delta_y, \delta_z)$, we first estimate the density of each point in the point cloud offline, then feed the density into an MLP for a 1D nonlinear transform. The reason for using a nonlinear transform is so that the network can decide adaptively whether to use the density estimates. In our experiments, we use kernel density estimation (KDE) to estimate the density of each point in a point cloud.
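As a rough illustration of this step, the following sketch computes a per-point Gaussian KDE and its inverse; the Gaussian kernel, the fixed bandwidth and the function name are our assumptions, since the paper defers the KDE details to [36] and learns the subsequent nonlinear transform with an MLP.

```python
# Sketch of the offline density estimate (assumed Gaussian kernel, fixed bandwidth).
import numpy as np

def kde_density(points, bandwidth=0.1):
    """Per-point Gaussian kernel density estimate for an (N, 3) point cloud."""
    n = points.shape[0]
    diff = points[:, None, :] - points[None, :, :]          # (N, N, 3) pairwise offsets
    sq_dist = np.sum(diff ** 2, axis=-1)                     # (N, N) squared distances
    kernel = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    norm = (np.sqrt(2.0 * np.pi) * bandwidth) ** 3
    return kernel.sum(axis=1) / (n * norm)                   # (N,) density estimates

points = np.random.rand(1024, 3)
density = kde_density(points)
inverse_density = 1.0 / np.maximum(density, 1e-8)            # S before the MLP transform
```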

Figure 3: PointConv. (a) shows a local region with the coordinates of the points transformed from global into local coordinates relative to the centroid, together with their corresponding features; (b) shows the process of conducting PointConv on one local region centered around one point. The input features come from the K nearest neighbors of that point, and the output feature is located at that point.

Figure 3 shows the PointConv operation on a $K$-point local region. Let $C_{in}$ and $C_{out}$ be the channel sizes of the input and output features, and let $k$, $c_{in}$, $c_{out}$ be the indices of the $k$-th neighbor, the $c_{in}$-th input feature channel and the $c_{out}$-th output feature channel. The inputs are the 3D local positions of the points $P_{local} \in \mathbb{R}^{K \times 3}$, which can be computed by subtracting the coordinates of the centroid of the local region, and the features $F_{in} \in \mathbb{R}^{K \times C_{in}}$ of the local region. We use $1\times1$ convolution to implement the MLP. The output of the weight function is $\mathbf{W} \in \mathbb{R}^{K \times (C_{in} \times C_{out})}$, so $\mathbf{W}(k, c_{in})$ is a vector of length $C_{out}$. The density scale is $S \in \mathbb{R}^{K}$. After convolution, the feature $F_{in}$ from a local region with $K$ neighbour points is encoded into the output feature $\mathbf{F}_{out} \in \mathbb{R}^{C_{out}}$, as shown in Eq.(4).

$\mathbf{F}_{out} = \sum_{k=1}^{K} \sum_{c_{in}=1}^{C_{in}} S(k)\, \mathbf{W}(k, c_{in})\, F_{in}(k, c_{in})$    (4)
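A minimal NumPy sketch of Eq.(4) on a single local region follows; the weight tensor is a random stand-in for the MLP output and the density scale is a random stand-in for the KDE-based estimate, so only the summation structure comes from the paper.

```python
# Sketch of Eq.(4) on one K-point local region (stand-in tensors, not a trained model).
import numpy as np

K, C_in, C_out = 32, 64, 64
P_local = np.random.rand(K, 3)          # local coordinates relative to the centroid
F_in = np.random.rand(K, C_in)          # input features of the K neighbors
S = np.random.rand(K)                   # inverse density scale per point (stand-in)
W = np.random.rand(K, C_in, C_out)      # stand-in for the MLP output on P_local

# F_out(c_out) = sum_k sum_{c_in} S(k) * W(k, c_in, c_out) * F_in(k, c_in)
F_out = np.einsum('k,kio,ki->o', S, W, F_in)
print(F_out.shape)                      # (C_out,)
```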

Our PointConv learns a network to approximate the continuous weights for convolution. For each input point, we can compute the weights from the MLPs using its relative coordinates. Figure 2 (a) shows an example continuous weight function for convolution. With a point cloud input as a discretization of the continuous input, a discrete convolution can be computed as in Figure 2 (b) to extract the local features, which works (with potentially different approximation accuracy) for different point cloud samples (Figure 2 (b-d)), including a regular grid (Figure 2 (d)). Note that in a raster image, the relative positions in a local region are fixed. PointConv, which takes only the relative positions as input to the weight functions, then outputs the same weights and density across the whole image, so in this case it reduces to the traditional notion of convolution.

In order to aggregate the features over the entire point set, we use a hierarchical structure that is able to combine detailed small-region features into abstract features that cover a larger spatial extent. The hierarchical structure is composed of several feature encoding modules, similar to the one used in PointNet++ [26]. Each module is roughly equivalent to one layer in a convolutional CNN. The key layers in each feature encoding module are the sampling layer, the grouping layer and PointConv. More details on the feature encoding module can be found in the supplementary material.
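As a simplified stand-in for the sampling and grouping layers (not the authors' implementation), the sketch below selects centroids with farthest point sampling and groups the K nearest neighbors of each centroid.

```python
# Sketch of sampling (farthest point sampling) and grouping (k-nearest neighbors).
import numpy as np

def farthest_point_sample(points, m):
    """Pick m well-spread centroid indices from an (N, 3) point cloud."""
    n = points.shape[0]
    chosen = [np.random.randint(n)]
    dist = np.full(n, np.inf)
    for _ in range(m - 1):
        dist = np.minimum(dist, np.sum((points - points[chosen[-1]]) ** 2, axis=1))
        chosen.append(int(np.argmax(dist)))
    return np.array(chosen)

def group_knn(points, centroid_idx, k):
    """For each centroid, return the indices of its k nearest neighbors."""
    d = np.sum((points[None, :, :] - points[centroid_idx][:, None, :]) ** 2, axis=-1)
    return np.argsort(d, axis=1)[:, :k]      # (m, k)

pts = np.random.rand(1024, 3)
centroids = farthest_point_sample(pts, 256)
neighbors = group_knn(pts, centroids, 32)    # local regions fed to PointConv
```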

The drawback of this approach is that each filter needs to be approximated by a network, which is very inefficient. In Sec. 4, we propose an efficient approach to implement PointConv.

3.2 Feature Propagation Using Deconvolution

For the segmentation task, we need point-wise predictions. In order to obtain features for all the input points, we need an approach to propagate features from a subsampled point cloud to a denser one. PointNet++ [26] proposes to use distance-based interpolation to propagate features, which is reasonable due to the local correlations inside a local region. However, this does not take full advantage of a deconvolution operation, which captures the local correlations of the propagated information from the coarse level. We propose to add a PointDeconv layer based on PointConv as a deconvolution operation to address this issue.

Figure 4: Feature encoding and propagation. This figure shows how the features are encoded and propagated in the network for a multi-class segmentation task; the annotations give the number of points and the feature channel size in each layer. Best viewed in color.

As shown in Figure 4, PointDeconv is composed of two parts: interpolation and PointConv. First, we employ interpolation to propagate coarse features from the previous layer. Following [26], the interpolation is conducted by linearly interpolating features from the 3 nearest points. Then, the interpolated features are concatenated with features from the convolutional layers at the same resolution using skip links. After concatenation, we apply PointConv on the concatenated features to obtain the final deconvolution output, similar to the image deconvolution layer [22]. We apply this process until the features of all the input points have been propagated back to the original resolution.
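The interpolation step can be sketched as follows, assuming inverse-distance weighting over the 3 nearest coarse points as in [26]; the concatenation with skip features and the subsequent PointConv are omitted, and all names are illustrative.

```python
# Sketch of the interpolation part of PointDeconv (inverse-distance weighted 3-NN).
import numpy as np

def interpolate_features(dense_xyz, coarse_xyz, coarse_feat, k=3, eps=1e-8):
    """Propagate (M, C) coarse features to (N, 3) dense points."""
    d = np.sum((dense_xyz[:, None, :] - coarse_xyz[None, :, :]) ** 2, axis=-1)  # (N, M)
    idx = np.argsort(d, axis=1)[:, :k]                       # 3 nearest coarse points
    w = 1.0 / (np.take_along_axis(d, idx, axis=1) + eps)     # inverse-distance weights
    w = w / w.sum(axis=1, keepdims=True)
    return np.einsum('nk,nkc->nc', w, coarse_feat[idx])      # (N, C) interpolated features

dense = np.random.rand(1024, 3)
coarse = np.random.rand(256, 3)
feat = np.random.rand(256, 64)
up = interpolate_features(dense, coarse, feat)               # then concat + PointConv
```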

4 Efficient PointConv

The naive implementation of PointConv is memory consuming and inefficient. Different from [31], we propose a novel reformulation that implements PointConv by reducing it to two standard operations: matrix multiplication and 2D convolution. This trick not only takes advantage of the parallel computing of the GPU, but can also be easily implemented using mainstream deep learning frameworks. Because the inverse density scale does not have such memory issues, the following discussion mainly focuses on the weight function.

Specifically, let $B$ be the mini-batch size in the training stage, $N$ the number of points in a point cloud, $K$ the number of points in each local region, $C_{in}$ the number of input channels, and $C_{out}$ the number of output channels. For a point cloud, each local region shares the same weight function, which can be learned using an MLP. However, the weights computed from the weight function at different points are different. The size of the weight filters generated by the MLP is $B \times N \times K \times (C_{in} \times C_{out})$. With a typical setting (e.g., $B = 32$, $N = 512$, $K = 32$, $C_{in} = C_{out} = 64$) and filters stored in single precision, this amounts to roughly 8 GB of filter memory for a single layer. The network would be hard to train with such high memory consumption. [31] used a very small network with few filters, which significantly degraded its performance. To resolve this problem, we propose a memory efficient version of PointConv based on Lemma 1.
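The following back-of-the-envelope calculation reproduces the memory counting argument under the example setting assumed above; it is only an illustration of the bookkeeping, not part of the method.

```python
# Filter memory for one naive PointConv layer (single precision),
# using the example setting assumed in the text above.
B, N, K, C_in, C_out = 32, 512, 32, 64, 64
bytes_per_float = 4

naive = B * N * K * (C_in * C_out) * bytes_per_float
print(naive / 2**30, "GiB")              # ~8 GiB of generated filters for one layer

C_mid = 32                                # size of the last hidden MLP layer (see Sec. 4)
efficient = B * N * K * C_mid * bytes_per_float
print(efficient / 2**30, "GiB")          # the intermediate result M is ~1/128 of that
```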

Lemma 1

The PointConv of Eq.(4) is equivalent to the following formula:

$\mathbf{F}_{out} = \mathrm{Conv}_{1\times1}\big(\mathbf{H},\ (\mathbf{S} \cdot \mathbf{F}_{in})^{T} \mathbf{M}\big)$

where $\mathbf{M} \in \mathbb{R}^{K \times C_{mid}}$ is the input to the last layer in the MLP for computing the weight function, $\mathbf{H} \in \mathbb{R}^{C_{mid} \times (C_{in} \times C_{out})}$ is the weights of that last layer, and $\mathrm{Conv}_{1\times1}$ is $1\times1$ convolution.

Proof: Generally, the last layer of the MLP is a linear layer. In one local region, let $\tilde{\mathbf{F}}_{in} = \mathbf{S} \cdot \mathbf{F}_{in}$ and rewrite the MLP as a $1\times1$ convolution so that the output of the weight function is $\mathbf{W} = \mathrm{Conv}_{1\times1}(\mathbf{H}, \mathbf{M}) \in \mathbb{R}^{K \times (C_{in} \times C_{out})}$. Let $k$ be the index of the points in a local region, and $c_{in}$, $c_{mid}$, $c_{out}$ be the indices of the input channels, the middle layer channels and the filter output channels, respectively. Then $\mathbf{W}(k, c_{in})$ is a vector of length $C_{out}$ taken from $\mathbf{W}$, and $\mathbf{M}(k)$ is a vector of length $C_{mid}$ taken from $\mathbf{M}$. According to Eq.(4), the PointConv can be expressed as in Eq.(5).

$\mathbf{F}_{out} = \sum_{k=1}^{K} \sum_{c_{in}=1}^{C_{in}} \mathbf{W}(k, c_{in})\, \tilde{F}_{in}(k, c_{in})$    (5)

Let’s explore Eq.(5) in a more detailed manner. The output of the weight function can be expressed as:

$\mathbf{W}(k, c_{in}) = \sum_{c_{mid}=1}^{C_{mid}} \mathbf{M}(k, c_{mid})\, \mathbf{H}(c_{mid}, c_{in})$    (6)

Substituting Eq.(6) into Eq.(5) and exchanging the order of summation gives Eq.(7).

$\mathbf{F}_{out} = \sum_{k=1}^{K} \sum_{c_{in}=1}^{C_{in}} \tilde{F}_{in}(k, c_{in}) \sum_{c_{mid}=1}^{C_{mid}} \mathbf{M}(k, c_{mid})\, \mathbf{H}(c_{mid}, c_{in}) = \sum_{c_{in}=1}^{C_{in}} \sum_{c_{mid}=1}^{C_{mid}} \Big( \sum_{k=1}^{K} \tilde{F}_{in}(k, c_{in})\, \mathbf{M}(k, c_{mid}) \Big) \mathbf{H}(c_{mid}, c_{in}) = \mathrm{Conv}_{1\times1}\big(\mathbf{H},\ \tilde{\mathbf{F}}_{in}^{T} \mathbf{M}\big)$    (7)

Thus, the original PointConv can be equivalently reduced to a matrix multiplication and a $1\times1$ convolution. Figure 5 shows the efficient version of PointConv.
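The equivalence can be checked numerically with the short sketch below, which compares the naive contraction of Eq.(5) against the matrix-multiplication-plus-$1\times1$-convolution route of Eq.(7) on random inputs; the tensors here are random stand-ins for the MLP's last hidden activation and last-layer weights.

```python
# Sketch verifying that the reformulation in Eq.(5)-(7) matches the naive sum
# on one local region (random stand-in tensors, not a trained model).
import numpy as np

K, C_in, C_mid, C_out = 32, 64, 16, 64
F_tilde = np.random.rand(K, C_in)             # S * F_in
M = np.random.rand(K, C_mid)                  # input to the MLP's last (linear) layer
H = np.random.rand(C_mid, C_in, C_out)        # weights of that last layer

# Naive route: materialize W(k, c_in, c_out) and contract it with F_tilde.
W = np.einsum('km,mio->kio', M, H)
naive = np.einsum('ki,kio->o', F_tilde, W)

# Efficient route: matrix multiply first, then contract with H (the 1x1 conv step).
E = F_tilde.T @ M                             # (C_in, C_mid) intermediate result
efficient = np.einsum('im,mio->o', E, H)

print(np.allclose(naive, efficient))          # True
```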

Figure 5: Efficient PointConv. The memory efficient version of PointConv on one local region with $K$ points.

In this method, instead of storing the generated filters in memory, we divide the weight filters into two parts: the intermediate result $\mathbf{M}$ and the convolution kernel $\mathbf{H}$. The memory consumption for the intermediate result reduces to $C_{mid}/(C_{in} \times C_{out})$ of the original version. With the same input setup as above and a small $C_{mid}$ (e.g., 32), this is a reduction of roughly two orders of magnitude compared with the original PointConv.

5 Experiments

In order to evaluate our new PointConv network, we conduct experiments on several widely used datasets: ModelNet40 [40], ShapeNet [2] and ScanNet [5]. In order to demonstrate that our PointConv is able to fully approximate conventional convolution, we also report experimental results on the CIFAR-10 dataset [17]. In all experiments, we implement the models with TensorFlow on a GTX 1080Ti GPU. The Adam optimizer is used. ReLU and batch normalization are applied after each layer except the last fully connected layer. The detailed network structures and more experimental results can be found in the supplementary material.

5.1 Classification on ModelNet40

ModelNet40 contains 12,311 CAD models from 40 man-made object categories. We use the official split with 9,843 shapes for training and 2,468 for testing. Following the configuration in [24], we use the source code of PointNet [24] to sample 1,024 points uniformly and compute the normal vectors from the mesh models. For fair comparison, we employ the same data augmentation strategy as [24]: randomly rotating the point cloud along the z-axis and jittering each point with zero-mean Gaussian noise of 0.02 standard deviation. In Table 1, our PointConv achieves state-of-the-art performance among methods based on 3D input. ECC [31], which is similar to our approach, cannot scale to a large network, which limits its performance.
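A minimal sketch of this augmentation (random rotation about the z-axis plus Gaussian jitter) is given below; the jitter clipping value is our assumption and is not specified in the text.

```python
# Sketch of the augmentation described above: z-axis rotation + Gaussian jitter.
import numpy as np

def augment(points, sigma=0.02, clip=0.05):
    """points: (N, 3) array. The clip value is an assumption, not from the paper."""
    theta = np.random.uniform(0.0, 2.0 * np.pi)
    rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                    [np.sin(theta),  np.cos(theta), 0.0],
                    [0.0,            0.0,           1.0]])
    jitter = np.clip(sigma * np.random.randn(*points.shape), -clip, clip)
    return points @ rot.T + jitter

cloud = np.random.rand(1024, 3)
augmented = augment(cloud)
```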

Method Input Accuracy(%)
Subvolume [25] voxels 89.2
ECC [31] graphs 87.4
Kd-Network [16] 1024 points 91.8
PointNet [24] 1024 points 89.2
PointNet++ [26] 1024 points 90.2
PointNet++ [26] 5000 points+normal 91.9
SpiderCNN [41] 1024 points+normal 92.4
PointConv 1024 points+normal 92.5
Table 1: ModelNet40 Classification Accuracy

5.2 ShapeNet Part Segmentation

Figure 6: Part segmentation results. For each pair of objects, the left one is the ground truth, the right one is predicted by PointConv. Best viewed in color.

Part segmentation is a challenging fine-grained 3D recognition task. The ShapeNet dataset contains 16,881 shapes from 16 classes with 50 parts in total. The input to the task is a shape represented by a point cloud, and the goal is to assign a part category label to each point in the point cloud. The category label for each shape is given. We follow the experimental setup of most related work [26, 33, 41, 16]. It is common to use the known input 3D object category to narrow the possible part labels down to the ones specific to that category. We also compute the normal direction at each point as an input feature to better describe the underlying shape. Figure 6 visualizes some sample results.

Method class avg. instance avg.
SSCNN [42] 82.0 84.7
Kd-net [16] 77.4 82.3
PointNet [24] 80.4 83.7
PointNet++[26] 81.9 85.1
SpiderCNN [41] 82.4 85.3
SPLATNet [33] 82.0 84.6
PointConv 82.8 85.7
Table 2: Results on the ShapeNet part dataset. Class avg. is the mean IoU averaged across all object categories, and instance avg. is the mean IoU across all objects.

We use point intersection-over-union (IoU) to evaluate our PointConv network, the same metric used by PointNet++ [26], SPLATNet [33] and other part segmentation algorithms [42, 16, 41]. The results are shown in Table 2. PointConv obtains a class-average mIoU of 82.8% and an instance-average mIoU of 85.7%, on par with the state-of-the-art algorithms that take only point clouds as input. According to [33], SPLATNet can additionally take rendered 2D views as input. Since our PointConv only takes 3D point clouds as input, for fair comparison, we only compare our result with the 3D-input version of SPLATNet in [33].

5.3 Semantic Scene Labeling

Datasets such as ModelNet40 [40] and ShapeNet [2] are man-made synthetic datasets. As we saw in the previous section, most state-of-the-art algorithms are able to obtain relatively good results on such datasets. To evaluate the capability of our approach on realistic point clouds, which contain a lot of noisy data, we evaluate our PointConv on semantic scene segmentation using the ScanNet dataset. The task is to predict a semantic object label for each 3D point, given indoor scenes represented by point clouds. The newest version of ScanNet [5] includes updated annotations for all 1513 ScanNet scans and 100 new test scans whose semantic labels are not publicly available; we submitted our results to the official evaluation server to compare against other approaches.

We compare our algorithm with Tangent Convolutions [35], SPLATNet [33], PointNet++ [26] and ScanNet [5]. All the algorithms mentioned reported their results on the new ScanNet benchmark, and their inputs use only 3D coordinates plus RGB. In our experiments, we generate training samples by randomly sampling cubes from the indoor rooms, and evaluate using a sliding window over the entire scan. We report intersection over union (IoU) as our main measure, the same as the benchmark. We visualize some example semantic segmentation results in Figure 7. The mIoU, i.e. the mean of IoU across all categories, is reported in Table 3. Our PointConv outperforms the other algorithms by a significant margin (Table 3).

Figure 7: Examples of semantic scene labeling. The images from left to right are the input scenes, the ground truth segmentation, and the prediction from PointConv. For better visualization, the point clouds are converted into mesh format. Best viewed in color.
Method mIoU(%)
ScanNet [5] 30.6
PointNet++ [26] 33.9
SPLAT Net [33] 39.3
Tangent Convolutions [35] 43.8
PointConv 55.6
Table 3: Semantic Scene Segmentation results on ScanNet

5.4 Classification on CIFAR-10

In Sec. 3.1, we claimed that PointConv can be equivalent to a 2D CNN. If this is true, then the performance of a network based on PointConv should match that of a raster image CNN. To verify this, we use the CIFAR-10 dataset as a comparison benchmark. We treat each pixel in CIFAR-10 as a 2D point with xy coordinates and RGB features. The point clouds are scaled onto the unit ball before training and testing.
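The conversion can be sketched as follows; the centering step and the division of RGB values by 255 are our assumptions about preprocessing details not spelled out in the text.

```python
# Sketch of the CIFAR-10 conversion: each pixel becomes a 2D point with (x, y)
# coordinates scaled onto the unit ball, plus RGB features.
import numpy as np

def image_to_point_cloud(img):
    """img: (32, 32, 3) uint8 image -> (1024, 2) coordinates and (1024, 3) RGB features."""
    h, w, _ = img.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    coords -= coords.mean(axis=0)                          # center (assumption)
    coords /= np.linalg.norm(coords, axis=1).max()         # scale onto the unit ball
    feats = img.reshape(-1, 3).astype(np.float32) / 255.0  # RGB features (assumption)
    return coords, feats

coords, feats = image_to_point_cloud(
    np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8))
```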

Experiments show that PointConv on CIFAR-10 indeed has the same learning capacity as a 2D CNN. Table 4 shows the results of image convolution and PointConv. From the table, we can see that the accuracy of PointCNN [19] on CIFAR-10 is only 80.22%, which is much worse than the image CNN. However, the network using PointConv is able to achieve 89.13%, which is similar to the network using image convolution.

Method Accuracy(%)
Image Convolution 88.52
AlexNet [18] 89.00
PointCNN [19] 80.22
SpiderCNN [41] 77.97
PointConv 89.13
Table 4: CIFAR-10 Classification Accuracy

6 Ablation Experiments and Visualizations

In this section, we conduct additional experiments to evaluate the effectiveness of each aspect of PointConv. Besides the ablation study on the structure of PointConv, we also give an in-depth breakdown of the performance of PointConv on the ScanNet dataset. Finally, we provide some learned filters for visualization.

6.1 The Structure of MLP

In this section, we design experiments to evaluate the choice of MLP parameters in PointConv. For fast evaluation, we generate a subset of the ScanNet dataset as a classification task. Each example in the subset is randomly sampled from the original scene scans with 1,024 points. There are 20 different scene types in the ScanNet dataset. We empirically sweep over different choices of $C_{mid}$ and different numbers of layers in the MLP of PointConv. Each experiment was run for 3 random trials. The results can be found in the supplementary material. From the results, we find that a larger $C_{mid}$ does not necessarily give better classification results, and that different numbers of MLP layers make little difference in classification accuracy. Since $C_{mid}$ is linearly correlated with the memory consumption of each PointConv layer, this result shows that we can choose a reasonably small $C_{mid}$ for greater memory efficiency.

6.2 Inverse Density Scale

In this section, we study the effectiveness of the inverse density scale $S$. We choose ScanNet as our evaluation task since the point clouds in ScanNet are generated from real indoor scenes. We follow the standard training/validation split provided by the authors. We train the network with two different kinds of PointConv: one with the inverse density scale as described in Sec. 3.1, and one without the density scale. Table 5 shows the results. As we can see, PointConv with the inverse density scale performs better than the one without by about 1% mIoU (57.9 vs. 56.9 at a 1.5m stride), which demonstrates the effectiveness of the inverse density scale. In our experiments, we observe that the inverse density scale tends to be more effective in layers closer to the input. In deep layers, the MLP tends to learn to diminish the effect of the density scale. One possible reason is that, with farthest point sampling as our sub-sampling algorithm, the point clouds in deeper layers tend to be more uniformly distributed.

6.3 Ablation Studies on ScanNet Semantic Segmentation

As one can see, our PointConv outperforms other approaches by a large margin. Since we are only allowed to submit one final result of our algorithm to the ScanNet benchmark server, we perform more ablation studies for PointConv using the public validation set provided by [5]. For the segmentation task, we train our PointConv with 8192 points randomly sampled from cubes cropped out of each scene, and evaluate the model by exhaustively choosing all points in the cube in a sliding-window fashion over the xy-plane with different stride sizes. For robustness, we use majority voting over multiple runs in all of our experiments. From Table 5, we can see that a smaller stride size improves the segmentation results, and that the RGB information on ScanNet does not seem to significantly improve the segmentation results. Even without these additional improvements, PointConv still outperforms the baselines by a large margin.
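A simplified sketch of this evaluation procedure is shown below; the window size, the stride and the stand-in `predict` function are illustrative assumptions, and the real pipeline accumulates votes from several randomized passes rather than the single pass shown here for brevity.

```python
# Sketch of sliding-window evaluation with per-point vote accumulation.
import numpy as np

def sliding_window_vote(xyz, predict, num_classes, window=1.5, stride=0.5):
    """xyz: (N, 3) scene points; predict: callable returning per-point class labels."""
    votes = np.zeros((xyz.shape[0], num_classes), dtype=np.int64)
    x_min, y_min = xyz[:, 0].min(), xyz[:, 1].min()
    x_max, y_max = xyz[:, 0].max(), xyz[:, 1].max()
    x0 = x_min
    while x0 < x_max:
        y0 = y_min
        while y0 < y_max:
            mask = ((xyz[:, 0] >= x0) & (xyz[:, 0] < x0 + window) &
                    (xyz[:, 1] >= y0) & (xyz[:, 1] < y0 + window))
            if mask.any():
                labels = predict(xyz[mask])                # per-point class labels
                votes[np.where(mask)[0], labels] += 1
            y0 += stride
        x0 += stride
    return votes.argmax(axis=1)                            # majority vote per point

# Example with a dummy predictor that always outputs class 0:
pts = np.random.rand(4096, 3) * 5.0
pred = sliding_window_vote(pts, lambda p: np.zeros(len(p), dtype=int), num_classes=20)
```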

Input Stride Size (m) mIoU (%)
xyz 0.5 60.4
xyz 1.0 58.5
xyz 1.5 57.9
xyz + no density 1.5 56.9
xyz + RGB 0.5 60.8
xyz + RGB 1.0 58.6
xyz + RGB 1.5 57.5
Table 5: Ablation study on ScanNet, with and without RGB information and the inverse density scale, and with different sliding-window stride sizes.

6.4 Visualization

Figure 8 visualizes the learned filters from the MLPs in our PointConv. In order to better visualize the filters, we sample the learned weight functions on a 2D plane. From Figure 8, we can see some patterns in the learned continuous filters.

Figure 8: Learned convolutional filters. The convolution filters learned by the MLPs on ShapeNet. For better visualization, we sample the weight filters on a 2D plane.

7 Conclusion

In this work, we proposed a novel approach to perform convolution operations on 3D point clouds, called PointConv. PointConv trains multi-layer perceptrons on local point coordinates to approximate the continuous weight and density functions in convolutional filters, which makes it naturally permutation-invariant and translation-invariant. This allows deep convolutional networks to be built directly on 3D point clouds. We also proposed an efficient implementation which greatly improves its scalability. We demonstrated its strong performance on multiple challenging benchmarks and its capability of matching the performance of a grid-based convolutional network on 2D images. In future work, we would like to adapt more mainstream image convolution network architectures, such as ResNet and DenseNet, to point cloud data using PointConv.

References

  • [1] M. M. Bronstein and I. Kokkinos. Scale-invariant heat kernel signatures for non-rigid shape recognition. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1704–1711. IEEE, 2010.
  • [2] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
  • [3] D.-Y. Chen, X.-P. Tian, Y.-T. Shen, and M. Ouhyoung. On visual similarity based 3d model retrieval. In Computer graphics forum, volume 22, pages 223–232. Wiley Online Library, 2003.
  • [4] H. Chu, W.-C. Ma, K. Kundu, R. Urtasun, and S. Fidler. Surfconv: Bridging 3d and 2d convolution for rgbd images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3002–3011, 2018.
  • [5] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), volume 1, 2017.
  • [6] Y. Fang, J. Xie, G. Dai, M. Wang, F. Zhu, T. Xu, and E. Wong. 3d deep shape descriptor. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2319–2328, 2015.
  • [7] A. Gressin, C. Mallet, J. Demantké, and N. David. Towards 3d lidar point cloud registration improvement using optimal neighborhood knowledge. ISPRS journal of photogrammetry and remote sensing, 79:240–251, 2013.
  • [8] F. Groh, P. Wieschollek, and H. Lensch. Flex-convolution (deep learning beyond grid-worlds). arXiv preprint arXiv:1803.07289, 2018.
  • [9] K. Guo, D. Zou, and X. Chen. 3d mesh labeling via deep convolutional neural networks. ACM Transactions on Graphics (TOG), 35(1):3, 2015.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [11] B.-S. Hua, M.-K. Tran, and S.-K. Yeung. Pointwise convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 984–993, 2018.
  • [12] Q. Huang, W. Wang, and U. Neumann. Recurrent slice networks for 3d segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2626–2635, 2018.
  • [13] J.-H. Jacobsen, J. van Gemert, Z. Lou, and A. W. Smeulders. Structured receptive fields in cnns. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2610–2619. IEEE, 2016.
  • [14] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool. Dynamic filter networks. In Advances in Neural Information Processing Systems, pages 667–675, 2016.
  • [15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [16] R. Klokov and V. Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 863–872. IEEE, 2017.
  • [17] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [19] Y. Li, R. Bu, M. Sun, and B. Chen. Pointcnn. arXiv preprint arXiv:1801.07791, 2018.
  • [20] H. Ling and D. W. Jacobs. Shape classification using the inner-distance. IEEE transactions on pattern analysis and machine intelligence, 29(2):286–299, 2007.
  • [21] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.
  • [22] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1520–1528, 2015.
  • [23] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas. Frustum pointnets for 3d object detection from rgb-d data. arXiv preprint arXiv:1711.08488, 2017.
  • [24] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.
  • [25] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view cnns for object classification on 3d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5648–5656, 2016.
  • [26] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5105–5114, 2017.
  • [27] X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun. 3d graph neural networks for rgbd semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5199–5208, 2017.
  • [28] S. Ravanbakhsh, J. Schneider, and B. Poczos. Deep learning with sets and point clouds. arXiv preprint arXiv:1611.04500, 2016.
  • [29] G. Riegler, A. O. Ulusoy, and A. Geiger. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 3, 2017.
  • [30] R. B. Rusu, N. Blodow, and M. Beetz. Fast point feature histograms (fpfh) for 3d registration. In Robotics and Automation, 2009. ICRA’09. IEEE International Conference on, pages 3212–3217. IEEE, 2009.
  • [31] M. Simonovsky and N. Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proc. CVPR, 2017.
  • [32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [33] H. Su, V. Jampani, D. Sun, S. Maji, E. Kalogerakis, M.-H. Yang, and J. Kautz. Splatnet: Sparse lattice networks for point cloud processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2530–2539, 2018.
  • [34] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, pages 945–953, 2015.
  • [35] M. Tatarchenko, J. Park, V. Koltun, and Q.-Y. Zhou. Tangent convolutions for dense prediction in 3d. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3887–3896, 2018.
  • [36] B. A. Turlach. Bandwidth selection in kernel density estimation: A review. In CORE and Institut de Statistique. Citeseer, 1993.
  • [37] N. Verma, E. Boyer, and J. Verbeek. Feastnet: Feature-steered graph convolutions for 3d shape analysis. In CVPR 2018-IEEE Conference on Computer Vision & Pattern Recognition, 2018.
  • [38] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon. Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018.
  • [39] Z. Wu, R. Shou, Y. Wang, and X. Liu. Interactive shape co-segmentation via label propagation. Computers & Graphics, 38:248–254, 2014.
  • [40] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
  • [41] Y. Xu, T. Fan, M. Xu, L. Zeng, and Y. Qiao. Spidercnn: Deep learning on point sets with parameterized convolutional filters. arXiv preprint arXiv:1803.11527, 2018.
  • [42] L. Yi, H. Su, X. Guo, and L. Guibas. Syncspeccnn: Synchronized spectral cnn for 3d shape segmentation. In Computer Vision and Pattern Recognition (CVPR), 2017.
  • [43] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, volume 2, page 7, 2017.
  • [44] Y. Zhou and O. Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. arXiv preprint arXiv:1711.06396, 2017.