PointShuffleNet: Learning Non-Euclidean Features with Homotopy Equivalence and Mutual Information

03/31/2021 · Linchao He et al., Sichuan University

Point cloud analysis remains a challenging task due to the disorder and sparsity of the samples that 3D sensors produce of geometric structures. In this paper, we introduce the homotopy equivalence relation (HER) to make neural networks learn the data distribution from a high-dimensional manifold. A shuffle operation is adopted to construct HER because of its randomness and zero parameters. In addition, inspired by prior works, we propose a local mutual information regularizer (LMIR) to cut off the trivial paths introduced by HER that lead to classification errors. LMIR uses mutual information to measure the distance between the original feature and the HER-transformed feature and learns common features in a contrastive learning scheme. Combining HER and LMIR gives our model the ability to learn non-Euclidean features from a high-dimensional manifold; we name this combination the non-Euclidean feature learner. Furthermore, we propose a new heuristic and efficient point sampling algorithm named ClusterFPS that obtains approximately uniform sampling at higher speed. ClusterFPS uses a clustering algorithm to divide a point cloud into several clusters and deploys the farthest point sampling algorithm on each cluster in parallel. By combining the above methods, we propose a novel point cloud analysis neural network called PointShuffleNet (PSN), which shows great promise in point cloud classification and segmentation. Extensive experiments show that PSN achieves state-of-the-art results on ModelNet40, ShapeNet and S3DIS with high efficiency. Theoretically, we provide a mathematical analysis of the data distribution that HER develops and of why LMIR can drop the trivial paths by implicitly maximizing mutual information.

1 Introduction

Point cloud analysis is challenging because it needs to process 3D data that is highly irregular and only contains Euclidean-space information (e.g., 3D coordinates and normal vectors). Adjacent points may not be highly relevant, and the connections between points are not given. Previous works used deep-learning-based methods to map the coordinates and other features (e.g., normal vectors and RGB information) to a higher-dimensional space to extract semantic information. As a result, point cloud analysis can easily be formulated as a problem of mining information from a Euclidean space. In this paper, we instead consider the space of a point cloud as a submanifold, i.e., a locally Euclidean space on a high-dimensional non-Euclidean manifold. The data distribution of an object category should be a continuous closure subspace on this high-dimensional non-Euclidean manifold, and every point cloud belonging to that category should be contained in the subspace. Thus, we can formulate point cloud analysis as the problem of learning the non-Euclidean features of the continuous closure subspaces from the corresponding discrete point clouds.

To overcome the obstacle mentioned above, we propose a non-Euclidean feature learner (NEFL) that enables point clouds to be transformed into continuous closure subspaces efficiently. The non-Euclidean feature learner contains two parts: the homotopy equivalence relation (HER) and the local mutual information regularizer (LMIR). HER utilizes the homotopy equivalence transformation [3-manifolds2004Richard, Topology2000Anderson] to describe continuously deformed bijective relations between multiple objects in the same category (e.g., different chairs). Shuffle is the most efficient operation for performing the homotopy transformation and can generate random, discrete submanifolds. There can be multiple such relations (paths, in topological terms) with the same endpoints on the manifold, so a point cloud can be transformed into other point clouds through different paths, and these paths can be learned by the neural network. This extends the generalization of the neural network. We define the paths that connect several point clouds of the same category and are bundled in the continuous closure subspace of that category as non-trivial paths; paths that are not bundled in a subspace are called trivial paths. The trivial paths are hard to cut off explicitly on the high-dimensional manifold, and they lead to poor generalization. Thus, we propose LMIR to cut off the trivial paths implicitly with mutual information estimation. LMIR estimates the mutual information between the original point clouds and the HER-generated point clouds. Inspired by Deep INFOMAX [hjelm2019learning], we use a contrastive loss to regularize our model so that the trivial paths incur a higher loss than the non-trivial paths. LMIR is used in the training phase to increase accuracy and can be removed in the inference phase for higher speed.

To further enhance efficiency, we propose a parallel-friendly point sampling algorithm named ClusterFPS, which uses a divide-and-conquer strategy to divide the point cloud into multiple sampling regions and deploys the farthest point sampling (FPS) algorithm in each region. We implement a batched k-means clustering algorithm as our dividing step. The basic FPS algorithm is iterative and context-sensitive, which makes it difficult to accelerate on modern GPU architectures. By using ClusterFPS as our sampling strategy, we can exploit the parallel computing power of GPUs to obtain faster speed than FPS with comparable performance.

Based on the non-Euclidean feature learner and ClusterFPS, we build a highly efficient neural network architecture named PointShuffleNet (PSN). Pointwise group convolution [zhang2018shufflenet, ma2018shufflenet] is introduced to replace the MLP, giving better performance with fewer parameters. We modify channel attention [hu2018senet] by concatenating a channel descriptor with the Euclidean coordinates of the points. We achieve state-of-the-art performance on ModelNet40 [wu2015modelnet] and comparable results on ShapeNet [shapenet2015] and S3DIS [Armeni2017s3dis]. PSN achieves a speedup on various tasks; remarkably, our model achieves the highest mean class accuracy on ModelNet40 at 4.6 ms per sample.

2 Related Work

Recently, deep-learning-based methods have been rapidly developed to process point clouds. They have shown great improvement in speed and accuracy.

Pointwise MLP Methods. PointNet [qi2016pointnet] was the first deep-learning method to use a sequence of multilayer perceptrons (MLPs) to process point sets directly. PointNet showed great promise in accuracy but also weaknesses in model complexity and training speed. PointNet++ [qi2017pointnetplusplus] follows the way convolutional neural networks (CNNs) extract information from local to global: it uses sampling and grouping layers to split point sets into small clusters and deploys PointNet to extract local features hierarchically. Compared with prior works, PointNet++ achieves better performance but is slower and more complicated. To further enhance performance, Yang et al. [Yang2019PAT] proposed a pointwise method called Point Attention Transformers (PATs), which uses a parameter-efficient Group Shuffle Attention (GSA) to replace the complicated multi-head attention in transformers; they also proposed a novel task-specific sampling method named Gumbel Subset Sampling (GSS). PointASNL [yan2020pointasnl] proposed a new adaptive sampling module that benefits feature learning and avoids the biased effect of outliers, and it uses a local-nonlocal module to capture the neighboring and long-range dependencies of point clouds. PosPool [liu2020closerlook3d] proposed a weight-free local aggregation operator named PosPool and combined it with a deep residual network, achieving state-of-the-art results. Although pointwise MLP methods show great promise in efficiency, they still suffer from the inefficiency of FPS and lower accuracy than other approaches. Thus, we propose ClusterFPS and NEFL to further improve speed and accuracy.

Figure 1: Homotopy equivalence (e.g.,  and ) exists between objects. We can use the homotopy equivalence relation (HER) to transform an existing object into other objects through a continuous mapping; a neural network can therefore capture this transformation and learn the data distribution from it. However, there are several trivial paths (e.g., ) for chairs that may cause the neural network to learn an incorrect data distribution, and it is difficult to cut off a trivial path explicitly. Inspired by Deep INFOMAX [hjelm2019learning], we propose the local mutual information regularizer (LMIR) to perform mutual information estimation. Trivial paths (e.g., ) have less mutual information and a higher loss than the expected paths for chairs; however, such a path can be considered a standalone non-trivial path for stools, leading the neural network to better generalization on stools.

Convolution-based Methods. CNNs have shown great success in image and video recognition, action analysis and natural language processing, and extending CNNs to process point cloud data has aroused wide interest among researchers. In PointConv [wu2018pointconv], the convolutional operation is a Monte Carlo estimate of the continuous 3D convolution with importance sampling. PointCNN [li2018pointcnn] achieves comparable accuracy by learning a local convolution order, but it has a great weakness in convergence speed. RS-CNN [liu2019rscnn] uses 10-D hand-crafted features to describe neighbor relationships and learns dynamic convolution weights from them. DensePoint [liu2019densepoint] utilizes a dense connection mode to repeatedly aggregate information at different levels and scales in a deep hierarchy. KPConv [thomas2019KPConv] proposed a new point convolution that processes radius neighborhoods with weights spatially located by a small set of kernel points; the authors also developed a deformable version of KPConv that can fit the point cloud geometry. ShellNet [zhang-shellnet-iccv19] uses statistics from concentric spherical shells to learn representative features, allowing the convolution to operate on feature spaces. However, the cost of downsampling points has become a speed bottleneck in processing large-scale point clouds.

Graph-based Methods. Simonovsky et al. [Simonovsky2017ecc] first considered each point as a vertex of a graph and proposed Edge-Conditioned Convolution (ECC), which uses a filter-generating network; max pooling is utilized to aggregate adjacency information. However, ECC has poor performance compared to pointwise and convolution-based networks. DGCNN [dgcnn] proposed a novel convolution layer named EdgeConv. DGCNN constructs a graph in feature space and dynamically updates the hierarchical structure, and EdgeConv captures local geometric features while remaining permutation invariant. While this achieves better results, DGCNN ignores the vector directions between adjacent points, which loses some local geometric information. Lei et al. [lei2020spherical] proposed a novel graph convolution that uses a spherical kernel for 3D point clouds; their spherical kernels quantize the local 3D space to encode geometric relationships. They built graph pyramids with range search and farthest point sampling and named the whole network SPH3D-GCN. However, graph-based methods need to construct graphs over the points dynamically, which is inefficient and makes the networks difficult to implement and optimize. As a result, designing a neural network for point cloud analysis remains a challenge of balancing factors such as model complexity, implementation difficulty, accuracy and speed.

3 Methods

In this paper, we propose a new end-to-end pointwise neural network named PointShuffleNet (PSN) consisting of a non-Euclidean feature learner that can capture both Euclidean space and non-Euclidean space information, and ClusterFPS, which can downsample points uniformly and efficiently.

3.1 Non-Euclidean Feature Learner

Figure 2: Illustration of PointShuffleNet for point cloud classification and segmentation. The HER module is also used in the feature decoding layers.

As shown in Figure 1, the non-Euclidean feature learner aims to obtain better generalization for the neural network from the high-dimensional manifold and consists of two modules: 1) homotopy equivalence relation (HER) and 2) local mutual information regularizer (LMIR).

3.1.1 Homotopy Equivalence Relation

Given a point cloud, we can consider it as a low-dimensional embedding, in the space of a higher-dimensional manifold, that belongs to a class. Thus, we can obtain a homotopy over a unit interval, where the homotopy contains infinitely many continuous bijective mappings. It should be noted that the homotopy between two point clouds is not unique: we can transform one into the other along infinitely many different paths. Therefore, we define a set that contains multiple homotopy paths for each point cloud pair. [basri2016efficient] showed that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of data lying on a high-dimensional manifold. Thus, neural networks have the ability to learn the latent feature that represents the data distribution on the high-dimensional manifold and project it onto a low-dimensional manifold; in particular, they can learn the homotopy transformations, which are the data distribution on the high-dimensional manifold, and classify the transformed point clouds into the correct class. The generalization of neural networks can be extended by constructing more effective homotopy equivalence relations on the input data. [coetzee1995homotropy] constructed a homotopy equivalence function that deforms linear networks into nonlinear networks. Therefore, we could implement the homotopy equivalence function as a single-layer perceptron (SLP) or a multilayer perceptron (MLP). However, SLP and MLP both incur additional computational cost and have small coverage over the data distribution.
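For reference, a minimal statement of the standard topological definitions the argument relies on is given below; the notation here is generic, since the paper's own symbols were lost in extraction.

```latex
% A homotopy between continuous maps f, g : X \to Y is a continuous map
% H defined on the unit interval:
\[
  H : X \times [0,1] \to Y, \qquad H(\cdot,\,0) = f, \qquad H(\cdot,\,1) = g .
\]
% Spaces X and Y are homotopy equivalent if there exist continuous maps
% f : X \to Y and g : Y \to X with
\[
  g \circ f \simeq \mathrm{id}_X, \qquad f \circ g \simeq \mathrm{id}_Y .
\]
```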

To overcome the above obstacle, a shuffle operation is introduced in this paper as a zero-overhead homotopy equivalence transformation. [3-manifolds2004Richard, Topology2000Anderson] proved that the modulo shuffle is a homotopy equivalence in which the manifold is glued together along primitive solid torus components of its characteristic submanifold. Compared to an SLP or MLP, the modulo shuffle is parameter-free and can be implemented by reading data from memory with a fixed stride. A modulo shuffle can generate pseudo-submanifolds from the known data distribution and improve the generalization ability of the model.
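As an illustration of the fixed-stride reading (not the authors' code; the stride value is an arbitrary assumption), the sketch below permutes the samples of a point cloud with zero learnable parameters.

```python
import torch

def modulo_shuffle(x, stride=7):
    """Sketch of a modulo shuffle along the first (sample) dimension: element i
    is read from position (i * stride) mod n. The stride must be coprime to n
    for the index map to be a permutation; 7 is an arbitrary assumption."""
    n = x.shape[0]
    idx = (torch.arange(n) * stride) % n
    return x[idx]

# usage: a point cloud of 1024 points is permuted with no parameters
shuffled = modulo_shuffle(torch.randn(1024, 3))
```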

In this paper, we propose two shuffle-based functions: sample shuffle and channel shuffle.

Sample Shuffle. Since large-scale point clouds are processed hierarchically, it is critical to sample and group information from neighboring points. The sample shuffle constructs a local homotopy equivalence transformation. For the sake of clarity, we refer to the points of a point cloud and to their corresponding features. We can concatenate each point with its corresponding feature to obtain low-dimensional manifolds, defined as

(1)

where the operator denotes concatenation. The sample shuffle function can be implemented efficiently and elegantly by reshaping, transposing and flattening [zhang2018shufflenet]. Thus, we can integrate the sample shuffle into the concatenation and define it as

(2)

Since each point no longer corresponds to its original feature, the original and shuffled representations clearly satisfy homotopy equivalence on the high-dimensional manifold.
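A minimal sketch of this operation is given below (a hypothetical helper, not the authors' implementation); it assumes the features are permuted along the point dimension by a reshape-transpose-flatten before being concatenated with the unshuffled coordinates, and the group count is an arbitrary assumption.

```python
import torch

def sample_shuffle(points, feats, groups=4):
    """Permute the features along the point (sample) dimension with a
    reshape-transpose-flatten, then concatenate them with the unshuffled
    coordinates; `groups` is an arbitrary assumption."""
    B, N, C = feats.shape
    shuffled = feats.view(B, groups, N // groups, C) \
                    .transpose(1, 2) \
                    .reshape(B, N, C)
    # each point is now paired with a feature taken from another sample index
    return torch.cat([points, shuffled], dim=-1)

# usage: coordinates (B, N, 3) and features (B, N, C)
out = sample_shuffle(torch.randn(2, 1024, 3), torch.randn(2, 1024, 64))
```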

Channel Shuffle. [zhang2018shufflenet] proposed the channel shuffle operation, which showed great promise for mobile devices in image classification; however, Zhang et al. did not explain it mathematically in their paper. The channel shuffle operation can be considered an instance of HER that operates on the channel dimension. ShuffleNet uses channel shuffle on the output of a group convolution to approximate the output distribution of a non-group convolution layer. Thus, we combine the sample shuffle and the channel shuffle as

(3)

where the mapping is a learnable function (e.g., an MLP) that maps its input into a high-dimensional space. In this way, the homotopy equivalence relation can help the model to be more general at zero computational cost. We conduct an ablation study of both shuffle operations in Section 5.3.
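The sketch below shows a ShuffleNet-style channel shuffle and one possible reading of Eq. (3), with a plain linear layer standing in for the learnable mapping; the group count and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=4):
    """ShuffleNet-style channel shuffle on (B, N, C) features: split the
    channels into groups, transpose the group and sub-channel axes, and
    flatten back; `groups` is an assumption."""
    B, N, C = x.shape
    return x.view(B, N, groups, C // groups).transpose(2, 3).reshape(B, N, C)

# one possible reading of Eq. (3): a learnable mapping (a plain Linear layer
# here, standing in for the MLP) followed by the channel shuffle
phi = nn.Linear(67, 128)
x = torch.randn(8, 1024, 67)        # e.g. concatenated coordinates + features
y = channel_shuffle(phi(x))         # (8, 1024, 128)
```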

3.1.2 Local Mutual Information Regularizer

Although the homotopy equivalence relation can improve the generalization of neural network models, it may also introduce poor generalization through trivial homotopy equivalence paths. Due to the high complexity of the high-dimensional space, these paths are difficult to cut off explicitly. Therefore, we propose a novel approach named the local mutual information regularizer (LMIR) to cut off trivial paths implicitly. Let us now focus on the original feature and the mapped feature. We define a mutual information function between them as

(4)

where the distributions involved are the original data distribution and that of the mapped features. In [hjelm2019learning], the researchers proposed a novel approach named Deep INFOMAX (DIM) to maximize the mutual information between input and output. DIM estimates and maximizes the mutual information within a single architecture and finds the most discriminative features of the input, showing great promise in unsupervised learning. DIM follows Mutual Information Neural Estimation (MINE) [belghazi2018mine] to estimate the mutual information and replaces the KL-divergence estimator with a non-KL divergence estimator (e.g., the Jensen-Shannon divergence estimator [nowozin2016fgan]) as

(5)

where the estimator uses a weight-sharing discriminator with learnable parameters and the softplus function. Deep INFOMAX maximizes the above objective in a contrastive learning scheme: it finds the discriminative information between samples and is not concerned with their classes, which works well for unsupervised learning. However, Deep INFOMAX cannot increase performance on our supervised tasks (e.g., classification and segmentation) because it only captures the discriminative information between samples and ignores the common features of samples in the same category. Moreover, Deep INFOMAX only ever treats the shuffled feature as a negative sample, whereas in our theory the shuffled feature can be treated as a positive sample if the path is non-trivial. Thus, we modify Equation 5 as

(6)
Figure 3: Structure of the PointShuffleNet layer. Orange lines and blue boxes denote the forward propagation of the input features. Features are sampled after the channel concatenation operation and the HER module, and a pair-generation operation produces two feature pairs. These pairs are processed by LMIR to generate the LMIR loss. The dotted line and dotted frame indicate components that are removed in the inference phase.

With this exchange we obtain a new mutual information estimator named the local mutual information regularizer (LMIR). LMIR can distinguish whether a feature comes from a non-trivial path through mutual information; it guides the model to learn from the more similar shuffled features and punishes the less similar ones, which are likely to share less mutual information with the original features. It constrains the neural network to learn non-trivial homotopy equivalence paths inside the same category instead of trivial homotopy equivalence paths between categories. We then define the regularization loss of our LMIR on the points, maximizing the average estimated MI:

(7)

where the normalizing constant is the number of PointShuffleNet layers (implementation details are provided in Section 4). LMIR can be implemented by simply attaching a two-layer discriminator after every HER module, at the cost of a modest increase in parameters. However, LMIR can be excluded from the inference phase without losing accuracy because it is not part of feature generation.
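A hedged sketch of such a regularizer is given below, built on the Jensen-Shannon MI estimator used in Deep INFOMAX; it treats the HER-transformed feature as the positive sample and cross-batch pairings as negatives. The two-layer discriminator follows the text, but the layer sizes, the negative-sampling scheme and all names are assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LMIR(nn.Module):
    """Sketch of a local mutual information regularizer based on the
    Jensen-Shannon MI estimator."""
    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        # the two-layer, weight-sharing discriminator mentioned in the text
        self.discriminator = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, original, shuffled):
        # positive pairs: original features with their HER-transformed version
        pos = self.discriminator(torch.cat([original, shuffled], dim=-1))
        # negative pairs: original features paired with HER features rolled
        # across the batch, i.e. drawn from the product of marginals
        neg = self.discriminator(
            torch.cat([original, shuffled.roll(1, dims=0)], dim=-1))
        # Jensen-Shannon mutual information lower bound (Deep INFOMAX style);
        # negate it so the regularizer can be minimized with the task loss
        mi = (-F.softplus(-pos)).mean() - F.softplus(neg).mean()
        return -mi

# usage: features and their HER-transformed counterpart, shape (B, N, C)
lmir = LMIR(feat_dim=64)
loss = lmir(torch.randn(8, 256, 64), torch.randn(8, 256, 64))
```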

3.2 ClusterFPS

In the original farthest point sampling (FPS) implementation, the algorithm samples a subset of points from a point cloud and returns a downsampling of the metric space in which each newly sampled point is the farthest from the points already selected. Although this gives good coverage of the point cloud, the algorithm needs the positions of all previously selected points before it can sample a new one, which makes it difficult to speed up with parallel computing. According to [hu2019randla], FPS can take up to 200 seconds to downsample a large-scale point cloud.

To address the above issue, we propose a parallel-friendly sampling algorithm named ClusterFPS that aims to ensure good coverage of the point cloud as quickly as possible.

Computing Cluster Centers. For a given point cloud, its cluster centers can be computed by the efficient k-means algorithm. A standard k-means implementation is unable to take batched data as input, so we implement a parallel, batched version of k-means to utilize the GPUs.

Finding Neighboring Points. For each cluster, we use k-Nearest Neighbors (k-NN) to query its neighboring points, sampling the nearest points around each cluster center to form a point set per cluster.

Parallel Farthest Point Sampling. For each cluster, a subset of points can be sampled by FPS in parallel.

Grouping Cluster Points. Eventually, the output of the parallel farthest point sampling is a collection of downsampled point sets, one per cluster. We obtain the final downsampled point set by grouping them into a single set.

Overall, our ClusterFPS algorithm is designed to reduce context dependencies and run in parallel, so it can exploit GPUs for faster computation. This is discussed in Section 5.2.
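The sketch below illustrates the four steps for a single point cloud; it is a rough reconstruction under stated assumptions, not the authors' implementation. A simple k-means loop replaces the batched GPU k-means, hard cluster assignments stand in for the k-NN query, and the per-cluster FPS calls are written sequentially even though they are independent and could run in parallel; cluster count, sample budget and iteration count are arbitrary.

```python
import torch

def farthest_point_sampling(xyz, m):
    """Sequential FPS on a single (n, 3) point set; returns m sampled indices.
    Each new index is the point farthest from everything chosen so far, which
    is why plain FPS is hard to parallelize."""
    n = xyz.shape[0]
    idx = torch.zeros(m, dtype=torch.long)
    dist = torch.full((n,), float('inf'))
    farthest = int(torch.randint(n, (1,)))
    for i in range(m):
        idx[i] = farthest
        d = ((xyz - xyz[farthest]) ** 2).sum(dim=-1)
        dist = torch.minimum(dist, d)
        farthest = int(torch.argmax(dist))
    return idx

def cluster_fps(xyz, n_clusters=8, n_samples=512, kmeans_iters=10):
    """ClusterFPS sketch for one (n, 3) point cloud: k-means clustering, FPS
    inside each cluster, then grouping of the per-cluster samples."""
    n = xyz.shape[0]
    # 1) compute cluster centers with a simple k-means loop
    centers = xyz[torch.randperm(n)[:n_clusters]].clone()
    for _ in range(kmeans_iters):
        assign = torch.cdist(xyz, centers).argmin(dim=1)          # (n,)
        for c in range(n_clusters):
            members = xyz[assign == c]
            if len(members) > 0:
                centers[c] = members.mean(dim=0)
    assign = torch.cdist(xyz, centers).argmin(dim=1)
    # 2) + 3) run FPS inside every cluster (independent, hence parallelizable),
    #         giving each cluster a quota proportional to its size
    sampled = []
    for c in range(n_clusters):
        member_idx = (assign == c).nonzero(as_tuple=True)[0]
        if len(member_idx) == 0:
            continue
        quota = min(len(member_idx), max(1, round(n_samples * len(member_idx) / n)))
        local = farthest_point_sampling(xyz[member_idx], quota)
        sampled.append(member_idx[local])
    # 4) group the per-cluster samples into one downsampled index set
    return torch.cat(sampled)

# usage: downsample 4096 points to roughly 512
pts = torch.randn(4096, 3)
downsampled = pts[cluster_fps(pts)]
```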

4 PointShuffleNet

By combining the three components (HER, LMIR and ClusterFPS) proposed in Sections 3.1 and 3.2, we implement a hierarchical neural network for both classification and segmentation tasks, as shown in Figure 2. We combine the cross-entropy loss with the LMIR regularization loss as our loss function.

As shown in Figure 3, we build a basic PointShuffleNet layer for both classification and segmentation tasks. It takes features as input, and the channels are split into two groups to reduce the computational complexity [ma2018shufflenet]. A 3-layer convolution is then utilized to extract features, which are concatenated with the residual features, and the output features are obtained by applying HER. We sample features from the mainstream of the layer (orange line) to construct original/shuffled pairs. The LMIR module computes the regularization loss from the outputs of the f-GAN discriminator. The dotted yellow line and dotted black frame indicate components that can be deleted in the inference phase.
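A heavily hedged skeleton of such a layer is sketched below: the channel split, the 3-layer pointwise convolution with batch normalization and ReLU, the concatenation with the residual branch, and a channel shuffle standing in for the HER module. Channel sizes, the group count and the omission of the LMIR pair sampling are all assumptions.

```python
import torch
import torch.nn as nn

class PSNLayer(nn.Module):
    """Hedged skeleton of a PointShuffleNet layer (not the authors' code)."""
    def __init__(self, channels, groups=4):
        super().__init__()
        half = channels // 2
        self.convs = nn.Sequential(
            nn.Conv1d(half, half, 1), nn.BatchNorm1d(half), nn.ReLU(inplace=True),
            nn.Conv1d(half, half, 1), nn.BatchNorm1d(half), nn.ReLU(inplace=True),
            nn.Conv1d(half, half, 1), nn.BatchNorm1d(half), nn.ReLU(inplace=True),
        )
        self.groups = groups

    def forward(self, x):                      # x: (B, C, N)
        residual, branch = x.chunk(2, dim=1)   # split channels into two groups
        out = torch.cat([residual, self.convs(branch)], dim=1)
        b, c, n = out.shape                    # HER: channel shuffle on output
        return out.view(b, self.groups, c // self.groups, n) \
                  .transpose(1, 2).reshape(b, c, n)

# usage
layer = PSNLayer(128)
features = layer(torch.randn(8, 128, 1024))    # (8, 128, 1024)
```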

For the classification task, we design a three-block feature extractor followed by a classifier. Each block contains a sampling layer (ClusterFPS), a grouping layer (the same as in [qi2017pointnetplusplus]) and three PointShuffleNet layers. The first two blocks sample 512 and 256 points, respectively, and a max-pooling layer is adopted to aggregate local features; the last block aggregates the final features from all remaining points. The final classification scores are computed by a classifier with fully connected layers, dropout and softmax activation. Each convolutional layer is followed by a batch normalization layer and the ReLU activation function. Channel attention is also deployed to enhance the representation between blocks.

Method Input Points Params Class OA Infer(ms)
PointNet [qi2016pointnet] P 1k 3.5M 86.2 89.2 2.5
SO-Net [li2018sonet] P 2k 2.4M 87.3 90.9 -
SPH3D [lei2020spherical] P 10k 0.8M 89.3 92.1 8.4
DGCNN [dgcnn] P 1k 1.8M 90.2 92.2 5.6
PointCNN [li2018pointcnn] P 1k 0.6M - 92.2 7.5
KPConv [thomas2019KPConv] P 6.8k 14.3M - 92.9 21.5
PointASNL [yan2020pointasnl] P 1k - - 92.9 -
Grid-GCN [Xu2020Grid-GCN] P 1k - 91.3 93.1 2.6
PosPool [liu2020closerlook3d] P 10k 19.4M - 93.2 -
RS-CNN [liu2019rscnn] P 1k 1.4M - 93.6 4.3
PointNet++ [qi2017pointnetplusplus] P,N 5k 1.5M 90.7 91.9 1.3
PAT [Yang2019PAT] P,N 1k 0.6M - 91.7 11
SO-Net [li2018sonet] P,N 5k 2.4M 89.3 92.3 -
SpiderCNN [xu2018spidercnn] P,N 5k - - 92.4 -
A-CNN [komarichev2019acnn] P,N 1k - 90.3 92.6 -
PointASNL [yan2020pointasnl] P,N 1k - - 93.2 -
PSN P 1k 1.4M 90.5 92.7
PSN P,N 1k 1.4M 91.6 93.2
Table 1: Mean class accuracy and overall accuracy on ModelNet40. “Params” stands for the number of parameters. “” stands for inference without LMIR. “Train” denotes forward and backward propagation time per sample, and “Infer” denotes forward propagation time per sample. “P” stands for point coordinates and “N” stands for the normal vector.
Method Size(m) Points mIoU OA Infer(ms)
PointCNN [li2018pointcnn] 4096 57.26 85.91 -
Grid-GCN [Xu2020Grid-GCN] 4096 57.75 86.94 25.9
PAT [Yang2019PAT] 2048 64.3 - -
PointASNL [yan2020pointasnl] 8192 68.7 87.7 -
PointNet [qi2016pointnet] 4096 41.09 - 20.9
DGCNN [dgcnn] 4096 47.94 83.64 178.1
PointNet++ [qi2017pointnetplusplus] 4096 53.2 - -
PSN 4096 55.2 86.34 28.3
Table 2: Mean class IoU and overall accuracy on indoor S3DIS dataset. “Infer” denotes forward propagation time per sample.
Method instance mIoU class mIoU airplane bag cap car chair earphone guitar knife lamp laptop motorbike mug pistol rocket skateboard table
Number 2690 76 55 898 3758 69 787 392 1547 451 202 184 286 66 152 5271
PointNet [qi2016pointnet] 83.7 80.4 83.4 78.7 82.5 74.9 89.6 73.0 91.5 85.9 80.8 95.3 65.2 93.0 81.2 57.9 72.8 80.6
SO-Net [li2018sonet] 84.9 81.0 82.8 77.8 88.0 77.3 90.6 73.5 90.7 83.9 82.8 94.8 69.1 94.2 80.9 53.1 72.9 83.0
PointNet++ [li2018sonet] 85.1 81.9 82.4 79.0 87.7 77.3 90.8 71.8 91.0 85.9 83.7 95.3 71.6 94.1 81.3 58.7 76.4 82.6
DGCNN [dgcnn] 85.1 82.3 84.2 83.7 84.4 77.1 90.9 78.5 91.5 87.3 82.9 96.0 67.8 93.3 82.6 59.7 75.5 82.0
P2Sequence [liu2019point2sequence] 85.2 82.2 82.6 81.8 87.5 77.3 90.8 77.1 91.1 86.9 83.9 95.7 70.8 94.6 79.3 58.1 75.2 82.8
PointCNN [li2018pointcnn] 86.1 84.6 84.1 86.5 86.0 80.8 90.6 79.7 92.3 88.4 85.3 96.1 77.2 95.2 84.2 64.2 80.0 83.0
RS-CNN [liu2019rscnn] 86.2 84.0 83.5 84.8 88.8 79.6 91.2 81.1 91.6 88.4 86.0 96.0 73.7 94.1 83.4 60.5 77.7 83.6
PointASNL [yan2020pointasnl] 86.1 83.4 84.1 84.7 87.9 79.7 92.2 73.7 91.0 87.2 84.2 95.8 74.4 95.2 81.0 63.0 76.3 83.2
PSN 85.8 82.5 83.5 81.4 87.9 78.8 91.1 74.5 90.6 87.1 84.9 95.8 71.6 95.2 81.0 57.8 75.7 83.8
Table 3: Mean instance IoU and mean class IoU on the ShapeNet part segmentation dataset. Per-class IoU is also shown.

For the segmentation task, the configuration of the extractor is similar to the configuration in the classification task. Following [qi2017pointnetplusplus]

, we construct three blocks as decoder. Each block consists of distance-based interpolation, across-level skip links and MLP with HER.

5 Experiments

Method mACC mIoU OA ceiling floor wall beam column window door table chair sofa bookcase board clutter
PointNet [qi2016pointnet] 49.0 41.1 - 88.8 97.3 69.8 0.1 3.9 46.3 10.8 52.6 58.9 40.3 5.9 26.4 33.2
PointCNN [li2018pointcnn] 63.9 57.3 85.9 92.3 98.2 79.4 0.0 17.6 22.8 62.1 74.4 80.6 31.7 66.7 62.1 56.7
PointWeb [zhao2019pointweb] 66.6 60.3 87.0 92.0 98.5 79.4 0.0 21.1 59.7 34.8 76.3 88.3 46.9 69.3 64.9 52.5
HPEIN [jiang2019hpein] 68.3 61.9 87.2 91.5 98.2 81.4 0.0 23.3 65.3 40.0 75.5 87.7 58.5 67.8 65.6 49.7
PointASNL [yan2020pointasnl] 68.5 62.6 87.7 94.3 98.4 79.1 0.0 26.7 55.2 66.2 83.3 86.8 47.6 68.3 56.4 52.1
PSN 63.85 55.2 86.3 91.5 98.2 74.2 0.0 6.0 54.7 22.4 73.6 79.3 48.2 60.3 57.5 51.4
Table 4: Mean class accuracy, mean class IoU and overall accuracy on the indoor semantic segmentation S3DIS dataset. Per-class IoU is also shown.

We evaluate PointShuffleNet on several widely used datasets, including ModelNet40 [wu2015modelnet], ShapeNet [shapenet2015] and S3DIS [Armeni2017s3dis]. To demonstrate the efficiency of our method, we also report speed and performance in a fair comparison. All experiments are conducted in PyTorch [pytorch] on a single NVIDIA GTX 1080Ti GPU with 11 GB of VRAM. We optimize the networks using Adam [kingma2014adam] with an initial learning rate of 0.001 and a batch size of 24, and we deploy a cosine-annealing decay schedule [loshchilovH2017sgdr] on the learning rate.
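For concreteness, a minimal sketch of this optimization setup is shown below; the model is a stand-in and the cosine-annealing period is a placeholder, since the paper's value was lost in extraction.

```python
import torch
import torch.nn as nn

# Stand-in model; PSN itself is described in Section 4.
model = nn.Linear(3, 40)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# T_max is a placeholder assumption for the cosine-annealing period.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

for epoch in range(200):
    # ... iterate over batches of size 24, compute loss, optimizer.step() ...
    scheduler.step()
```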

5.1 Point Cloud Classification

The ModelNet40 dataset [wu2015modelnet] includes 12,311 CAD models from 40 classes. Following the official split, we use 9,843 models to train our network and 2,468 models for evaluation. Following the configuration of [qi2016pointnet, qi2017pointnetplusplus], we sample 1024 points uniformly from the raw points and compute the normal vectors from the corresponding mesh models. We also use the same data augmentation as [qi2016pointnet, qi2017pointnetplusplus]: randomly rotating the data about one axis and jittering each point with zero-mean Gaussian noise with a standard deviation of 0.02.
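A small sketch of this augmentation is given below; the rotation axis (taken here as the vertical/up axis) is an assumption, since the axis named in the paper was lost in extraction.

```python
import math
import torch

def augment(points, sigma=0.02):
    """Random rotation about the vertical (y) axis -- the axis is an
    assumption -- plus zero-mean Gaussian jitter with std 0.02.
    `points` is an (N, 3) tensor of coordinates."""
    theta = torch.rand(1).item() * 2 * math.pi
    c, s = math.cos(theta), math.sin(theta)
    rot = points.new_tensor([[c, 0.0, s],
                             [0.0, 1.0, 0.0],
                             [-s, 0.0, c]])
    return points @ rot + sigma * torch.randn_like(points)

# usage
augmented = augment(torch.randn(1024, 3))
```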

Classification results on the test set are summarized in Table 1. Our PointShuffleNet achieves state-of-the-art accuracy on ModelNet40, second only to RS-CNN [liu2019rscnn], which improves its results with multiple abstraction scales and a voting test strategy; its training and evaluation protocols differ from the standard ones. Compared to other methods implemented in TensorFlow on better GPUs, our method still obtains a good balance between speed and accuracy. An analysis of the running time is given in Section 5.2. Remarkably, LMIR gives PSN the highest mean class accuracy thanks to its strong generalization on each class.

5.2 Point Cloud Segmentation

We evaluate our PSN on two larger-scale segmentation tasks: ShapeNet [shapenet2015] for part segmentation and Stanford 3D Large-Scale Indoor Spaces (S3DIS) [Armeni2017s3dis] for indoor scene segmentation. The ShapeNet dataset contains 16,881 shapes (14,006 models for training and 2,874 for evaluation) from 16 classes, with 50 parts in total. We follow the configuration of PointNet++ [qi2017pointnetplusplus] and use 2048 points as our input. The S3DIS dataset contains 6 large-scale indoor areas with 271 rooms, and each point is annotated with one of 13 categories. Because Area 5 is the only area that has no overlap with the other areas, we use Areas 1-4 and 6 as our training split and Area 5 as our evaluation split. Following [yan2020pointasnl], we generate training data by randomly sampling blocks of 4096 points from the rooms during training; at test time, we split the rooms into blocks of 4096 points with a fixed stride.

In Table 3, we show the quantitative results of PSN on ShapeNet alongside other state-of-the-art methods [qi2016pointnet, li2018sonet, qi2017pointnetplusplus, dgcnn, liu2019point2sequence, li2018pointcnn, liu2019rscnn, yan2020pointasnl] under the same training and evaluation protocols. PSN achieves results comparable to the other methods. Compared with PointNet [qi2016pointnet] and PointNet++ [qi2017pointnetplusplus], we notably increase both instance mIoU and class mIoU thanks to the effectiveness of NEFL.

The evaluation performance on Area 5 of the S3DIS dataset is shown in Table 2 and Table 4. Our method achieves better results than previous methods that use the same block size and number of input points, and comparable performance to methods that use larger blocks or more points as input.

5.3 Ablation Study

Model Points Normal NEFL (HER / LMIR) ClusterFPS Accuracy (class / instance)
A PointNet++ 1024 88.0 90.7
B PointNet++ 5000 - 91.9
C PointNet++ 1024 91.0 92.5
D PSN 1024 90.9 92.4
E PSN 1024 91.1 92.8
F PSN 1024 90.9 92.9
G PSN 1024 91.5 93.0
H PSN 1024 91.6 93.2
Table 5: Ablation study on ModelNet40. denotes that we use the original local mutual information loss from [hjelm2019learning] to replace our LMIR loss.
Figure 4: Train and evaluate on sparse input.
Figure 5: Only evaluate on sparse input.

In this section, we analyze the effectiveness of each component, including HER, LMIR, ClusterFPS and Deep INFOMAX, on the ModelNet40 dataset. The results of the ablation study are summarized in Table 5.

We set two baselines: A and D. Model A is the original implementation of PointNet++ [qi2017pointnetplusplus], and Model D has only the new architecture and ClusterFPS, without NEFL. Compared to B, Model D obtains an improvement in mean instance accuracy with fewer input points. When we combine HER with PointNet++ (Model C), there is a great improvement in both mean class and mean instance accuracy. Model E shows similar improvements in accuracy.

Furthermore, we deploy LMIR in Model E to obtain Model H. LMIR shows an incremental improvement in accuracy compared to Model E. To be fair, we also replace the LMIR module in Model H with the local mutual information loss (Equation 5) from Deep INFOMAX [hjelm2019learning] to obtain Model F. Compared to Model E, which uses no local mutual information loss, Model F fails to improve accuracy and even decreases mean class accuracy. In addition, there is a large gap between Models F and H. Thus, we conclude that our LMIR performs better than Deep INFOMAX in supervised learning. We also replace ClusterFPS with FPS to obtain Model G, which has performance similar to Model H. We can therefore conclude that ClusterFPS may not have the same sampling uniformity as FPS, but thanks to the additional cluster information, it still delivers comparable or slightly better performance at a faster speed.

5.4 Robustness for Sparser Input

To further verify the robustness of the PSN model, we set up two robustness experiments: training and evaluating on sparse input (i.e., 1024, 512, 256, 128, 64 and 32 points), and training on 1024 points but evaluating on sparse input (i.e., 1024, 512, 256 and 128 points). We compare our method with PointNet [qi2016pointnet], PointNet++ [qi2017pointnetplusplus], SpiderCNN [xu2018spidercnn], PCNN [Matan2018pcnn] and DGCNN [dgcnn].

As shown in Figure 4, PSN is very robust when trained and evaluated on sparse points: its accuracy drops only slightly from 1024 points down to 256. At each density, PSN outperforms the other methods and still achieves reasonable accuracy with only 32 input points.

Following [qi2016pointnet, qi2017pointnetplusplus, Matan2018pcnn, dgcnn], we also train our PSN on a fixed number of points and evaluate at different densities. As Figure 5 shows, PSN is robust enough to process sparse points randomly sampled from the 1024 training points. These experimental results demonstrate that HER and LMIR help the model improve its robustness to varying numbers of input points.

5.5 Speed Study

We test the scalability of ClusterFPS by gradually increasing the number of input points from 1024 to 1,000,000, running FPS and ClusterFPS to downsample a point cloud with a batch size of 1. As shown in Table 6, ClusterFPS outperforms FPS for every input size. Remarkably, ClusterFPS can downsample one million points to 100,000 in only 1.53 s, a speedup of more than two orders of magnitude over FPS. This shows the strong capability of ClusterFPS in processing large-scale point clouds.

6 Conclusion

Num. of Input 1024 4096 10000 1,000,000
Num. of Output 512 1024 4096 100,000
FPS 95.1 ms 187 ms 798 ms 230 s
ClusterFPS 16.3 ms 32.7 ms 108 ms 1.53 s
Table 6: Downsampling speed of FPS and ClusterFPS. Both algorithms are implemented in PyTorch with a batch size of 1.

In this paper, we proposed PointShuffleNet (PSN) for efficient point cloud analysis. PSN shows great promise in point cloud classification and segmentation by introducing the non-Euclidean feature learner (NEFL), which contains a homotopy equivalence relation (HER) grounded in topology and a local mutual information regularizer (LMIR). HER introduces the homotopy equivalence transformation into our model, makes the neural network learn the data distribution through homotopy equivalence and extends its generalization. To cut off the trivial homotopy equivalence paths that lead to bad generalization, LMIR regularizes the neural network to focus on the non-trivial paths rather than the trivial ones by maximizing the mutual information between the original features and the HER-transformed features. To further improve efficiency, we proposed a point sampling algorithm named ClusterFPS based on the farthest point sampling algorithm. ClusterFPS splits point clouds into several clusters and deploys FPS on each cluster so that the computation can run in parallel on GPUs. PSN achieves state-of-the-art accuracy on classification and strikes a great balance between speed and performance on various tasks.

References

Appendix A Glossary

  • Homotopy Equivalence Relation. The key concept of homotopy is a continuous deformation between two continuous functions in a topological space; such functions are said to be equivalent in topology. We can use the homotopy equivalence relation to help neural networks achieve better generalization. In fact, the homotopy equivalence relation can be considered a principle of feature augmentation in which the augmented features share a continuous closure subspace with the original features. We use the modulo shuffle as our deformation method rather than a single-layer perceptron because of its randomness and high efficiency.

  • Local Mutual Information Regularizer. We cannot control what kind of data the homotopy equivalence relation transforms the original data into, and there is no mathematical theory in topology for dropping unreasonable deformation paths. Therefore, we propose a regularizer based on mutual information, named the local mutual information regularizer, to cut off trivial paths (the unreasonable deformation paths) implicitly. We use f-GAN [nowozin2016fgan] to estimate the mutual information between the transformed feature and the original one, inserting multiple discriminators into the network and utilizing different segments of the backbone as generators to form multiple f-GANs inside a single network. This lets us distinguish whether a feature comes from a non-trivial path (a reasonable deformation) through mutual information, guiding the model to learn from the more similar shuffled features and punishing the less similar ones, which are likely to share less mutual information with the original features.

  • Continuous Closure Subspaces. A continuous closure subspace is a space that accurately describes the distribution of the data, and each class has one and only one such subspace. Although there are infinitely many possible decision boundaries between sets, the continuous closure subspace itself can serve as a decision boundary separating the corresponding class from the other classes.

  • Modulo Shuffle. A shuffle in which a deck of cards is divided into two halves that are interleaved; it is also called the riffle shuffle or the Faro shuffle. [Topology2000Anderson] proved that the modulo shuffle is a homotopy equivalence transformation.

  • Trivial Path. A deformation path that connects point clouds belonging to different classes and therefore intersects multiple closure subspaces. Trivial paths may lead the neural network to bad generalization.

  • Non-trivial Path. A deformation path that connects point clouds belonging to the same class and does not intersect other closure subspaces. A non-trivial path can be considered a continuous data augmentation that interpolates multiple data points inside the closure subspace and clarifies the decision boundaries.

Figure 6: Confusion matrix for PointShuffleNet
Figure 7: Confusion matrix for PointNet++