1 Introduction
Point cloud analysis is challenging because it must process 3D data that is highly irregular and contains only Euclidean space information (e.g., 3D coordinates and normal vectors). Adjacent points may not be highly relevant, and there is no connection information between points. Previous works used deep-learning-based methods to map the coordinates and other features (e.g., normal vectors and RGB information) into a higher-dimension space to extract semantic information. As a result, point cloud analysis can easily be formulated as a problem of mining information from a Euclidean space. In this paper, we consider the space of a point cloud as a submanifold, that is, a local Euclidean space on a high-dimension non-Euclidean manifold. The data distribution of an object category should be a continuous closure subspace on the high-dimension non-Euclidean manifold, and every point cloud that belongs to that category should be contained in the subspace. Thus, we can formulate point cloud analysis as the problem of learning the non-Euclidean features of the continuous closure subspaces from the corresponding discrete point clouds.
To overcome the obstacle mentioned above, we propose a non-Euclidean feature learner (NEFL) that enables the efficient transformation of point clouds into continuous closure subspaces. The non-Euclidean feature learner contains two parts: the homotopy equivalence relation (HER) and the local mutual information regularizer (LMIR). HER utilizes the homotopy equivalence transformation [3manifolds2004Richard, Topology2000Anderson] to describe continuously deformed bijective relations between multiple objects in the same category (e.g., different chairs). Shuffling is the most efficient operation for performing a homotopy transformation and can generate random, discrete submanifolds. There are multiple relations, called path-connected in topology, with the same endpoints on the manifold; thus, we can transform a point cloud into other point clouds through different paths, and these paths can be learned by neural networks. This extends the generalization of the neural networks. We define the paths that connect several point clouds of the same category and are bundled in that category's continuous closure subspace as nontrivial paths; paths that are not bundled in a subspace we call trivial paths. The trivial paths are hard to cut off explicitly on the high-dimension manifold, which leads to poor generalization. Thus, we propose LMIR to cut off the trivial paths implicitly with mutual information estimation. LMIR estimates the mutual information between the original point clouds and the HER-generated point clouds. Inspired by Deep INFOMAX [hjelm2019learning], we use a loss based on contrastive learning to regularize our model so that trivial paths incur a higher loss than nontrivial paths. LMIR can be deployed in the training phase to increase accuracy and removed in the inference phase to obtain higher speed.
To further enhance efficiency, we propose a parallel-friendly point sampling algorithm named ClusterFPS, which uses the divide-and-conquer method to divide the point cloud into multiple sampling regions and deploys the farthest point sampling (FPS) algorithm in each region. We implement a k-means clustering algorithm with batch input as our dividing algorithm. The basic FPS algorithm is iterative and context sensitive, making it difficult to exploit modern GPU architectures for acceleration. By using ClusterFPS as our sampling strategy, we can utilize the parallel computing power of GPUs to gain faster speed than FPS with comparable performance.
Based on the non-Euclidean feature learner and ClusterFPS, we build a highly efficient neural network architecture named PointShuffleNet (PSN). Pointwise group convolution [zhang2018shufflenet, ma2018shufflenet] is introduced to replace the MLP with better performance and fewer parameters. We modify channel attention [hu2018senet] by concatenating a channel descriptor with the Euclidean coordinates of points. We achieve state-of-the-art performance on ModelNet [wu2015modelnet] and comparable results on ShapeNet [shapenet2015] and S3DIS [Armeni2017s3dis]. PSN achieves notable speedups on various tasks; remarkably, our model achieves the highest mean class accuracy on ModelNet40 at 4.6 ms per inference.
2 Related Work
Recently, deep-learning-based methods have been rapidly developed to process point clouds, showing great improvements in both speed and accuracy.
Pointwise MLP Methods. PointNet [qi2016pointnet] was the first deep-learning method that used a sequence of multi-layer perceptrons (MLPs) to directly process point sets. PointNet showed great promise in accuracy but weakness in model complexity and training speed. PointNet++ [qi2017pointnetplusplus] follows the process by which convolutional neural networks (CNNs) extract information from local to global. It uses sampling and grouping layers to split point sets into small clusters and deploys PointNet to extract local features hierarchically. Compared with prior works, PointNet++ achieves better performance but is slower and more complicated. To further enhance performance, Yang et al. [Yang2019PAT] proposed a pointwise-based method called Point Attention Transformers (PATs), which uses parameter-efficient Group Shuffle Attention (GSA) to replace the complicated multi-head attention in transformers. They also proposed a novel task-specific sampling method named Gumbel Subset Sampling (GSS). PointASNL [yan2020pointasnl] proposed a new adaptive sampling module to benefit feature learning and avoid the biased effect of outliers, and uses a local-nonlocal module to capture the neighborhood and long-range dependencies of point clouds. PosPool [liu2020closerlook3d] proposed a weight-free local aggregation operator named PosPool and combines it with a deep residual network, achieving state-of-the-art results. Although pointwise MLP methods show great promise in efficiency, they still suffer from the inefficiency of FPS and lower accuracy than other approaches. Thus, we propose ClusterFPS and NEFL to further improve speed and accuracy.
Convolution-based Methods.
CNNs have shown great success in image and video recognition, action analysis and natural language processing. Extending CNNs to process point cloud data has aroused wide interest among researchers. In PointConv [wu2018pointconv], the convolutional operation is a Monte Carlo estimate of the continuous 3D convolution with importance sampling. PointCNN [li2018pointcnn] achieves comparable accuracy by learning a local convolution order, but has a great weakness in convergence speed. RSCNN [liu2019rscnn] uses 10-D hand-crafted features to describe neighbor relationships and learns dynamic convolution weights from them. DensePoint [liu2019densepoint] utilizes a dense connection mode to repeatedly aggregate information at different levels and scales in a deep hierarchy. KPConv [thomas2019KPConv] proposed a new point convolution method that processes radius neighborhoods with weights spatially located by a small set of kernel points; its authors also developed a deformable version of KPConv that can fit the point cloud geometry. The follow-up work ShellNet [zhangshellneticcv19] used statistics from concentric spherical shells to learn representative features, allowing the convolution to operate on feature spaces. However, the cost of downsampling points has become a speed bottleneck in processing large-scale point clouds.
Graph-based Methods. Simonovsky et al. [Simonovsky2017ecc] first considered each point as a vertex of a graph. They proposed Edge-Conditional Convolution (ECC), which uses a filter-generating network, and max pooling is utilized to aggregate the adjacency information. However, ECC has poor performance compared to pointwise-based and convolution-based networks. DGCNN [dgcnn] proposed a novel convolution layer named EdgeConv. DGCNN can construct a graph in the feature space and dynamically update the hierarchical structure. EdgeConv can capture local geometric features while ensuring permutation invariance. While this achieves better results, DGCNN ignores the vector direction between adjacent points, which loses some local geometric information. Lei et al. [lei2020spherical] proposed a novel graph convolution that uses a spherical kernel for 3D point clouds. Their spherical kernels quantize the local 3D space to encode geometric relationships. They built graph pyramids with range search and farthest point sampling, and named the whole network SPH3D-GCN. However, graph-based methods need to construct graphs of points dynamically, which is inefficient, and such networks are difficult to implement and optimize. As a result, designing a neural network for point cloud analysis remains a challenge of balancing model complexity, implementation difficulty, accuracy and speed.
3 Methods
In this paper, we propose a new end-to-end pointwise neural network named PointShuffleNet (PSN). It consists of a non-Euclidean feature learner, which can capture both Euclidean and non-Euclidean space information, and ClusterFPS, which can downsample points uniformly and efficiently.
3.1 Non-Euclidean Feature Learner
As shown in Figure 1, the non-Euclidean feature learner aims to obtain better generalization for the neural network from the high-dimensional manifold. It consists of two modules: 1) the homotopy equivalence relation (HER) and 2) the local mutual information regularizer (LMIR).
3.1.1 Homotopy Equivalence Relation
Given a point cloud, we can consider it a low-dimension embedding, in the space of a higher-dimension manifold, that belongs to some class. Thus, we obtain a homotopy over a unit interval, where the homotopy contains infinitely many continuous bijective mappings. It should be noted that the homotopy between two point clouds is not unique: we can transform one point cloud into another along infinitely many different paths. Therefore, we define a set that contains multiple homotopy paths for each point cloud pair. [basri2016efficient] showed that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of data on a high-dimension manifold. Thus, neural networks have the ability to learn the latent feature that represents the data distribution on the high-dimension manifold and project it onto a low-dimension manifold. Neural networks can therefore learn the homotopy transformations, which follow the data distribution on the high-dimension manifold, and classify the transformed point clouds into the correct class. The generalization of neural networks can be extended by constructing more effective homotopy equivalence relations on the input data. [coetzee1995homotropy] constructed a homotopy equivalence function that deforms linear networks into nonlinear networks. Therefore, we could implement the homotopy equivalence function as a single-layer perceptron (SLP) or a multi-layer perceptron (MLP). However, SLPs and MLPs both incur additional computational cost and have small coverage of the data distribution.
To overcome the above obstacle, a shuffle operation is introduced in this paper as a zero-overhead homotopy equivalence transformation. [3manifolds2004Richard, Topology2000Anderson] proved that the modulo shuffle is a homotopy equivalence in which the manifold is glued together along primitive solid torus components of its characteristic submanifold. Compared to SLPs and MLPs, the modulo shuffle is parameter-free and can be implemented by reading data with a fixed stride from memory. A modulo shuffle can generate pseudo submanifolds from the known data distribution and improve the generalization ability of the model.
In this paper, we propose two shuffle-based functions: the sample shuffle and the channel shuffle.
Sample Shuffle. Since large-scale point clouds are processed hierarchically, it is critical to sample and group information from neighboring points. The sample shuffle can construct a local homotopy equivalence transformation. For the sake of clarity, we consider the points of a point cloud together with their corresponding features. We can concatenate each point with its corresponding feature to obtain a low-dimension manifold, which is defined as
(1) 
where the operator denotes concatenation. The sample shuffle function can be efficiently and elegantly implemented by reshaping, transposing and flattening [zhang2018shufflenet]. Thus, we can integrate the sample shuffle into the concatenation and define this as
(2) 
Since each point no longer corresponds to its original feature, the original and shuffled concatenations clearly satisfy homotopy equivalence on the high-dimension manifold.
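As a concrete sketch (our own NumPy illustration, not the authors' released code), the sample shuffle can be realized with the reshape-transpose-flatten trick applied to an index array; the `stride` parameter is an assumption:

```python
import numpy as np

def sample_shuffle_concat(points, feats, stride=4):
    """Concatenate each point with a feature read at a fixed stride,
    so points are paired with permuted (shuffled) features.
    The permutation uses the classic reshape-transpose-flatten trick."""
    n = points.shape[0]
    idx = np.arange(n).reshape(stride, n // stride).T.flatten()
    return np.concatenate([points, feats[idx]], axis=1)
```

For eight points and `stride=4`, the feature order becomes 0, 2, 4, 6, 1, 3, 5, 7: a parameter-free permutation realized purely by strided memory reads.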
Channel Shuffle. [zhang2018shufflenet] proposed the channel shuffle operation, which showed great promise for mobile devices in image classification. However, Zhang et al. did not explain it mathematically in their paper. The channel shuffle operation can be considered an instance of HER that operates on the channel dimension. ShuffleNet uses channel shuffle on the output of a group convolution to approximate the output distribution of a non-group convolution layer. Thus, we combine the sample shuffle and the channel shuffle as
(3) 
where a learnable mapping function (e.g., an MLP) with learnable parameters maps the features into a high-dimension space. Eventually, we can ensure that the homotopy equivalence relation helps the model generalize better at zero computational cost. We conduct an ablation study on the performance of both shuffle operations in Section 5.3.
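For reference, a minimal NumPy sketch of the ShuffleNet-style channel shuffle (our illustration of the published operation, on a flat `(N, C)` feature matrix rather than an image tensor):

```python
import numpy as np

def channel_shuffle(x, groups):
    """ShuffleNet channel shuffle on an (N, C) feature matrix:
    reshape to (N, groups, C // groups), swap the two channel axes,
    and flatten back, interleaving channels across groups."""
    n, c = x.shape
    return x.reshape(n, groups, c // groups).transpose(0, 2, 1).reshape(n, c)
```

With eight channels and two groups, channel order 0..7 becomes 0, 4, 1, 5, 2, 6, 3, 7, so information from different convolution groups is mixed without any learnable parameters.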
3.1.2 Local Mutual Information Regularizer
Although the homotopy equivalence relation can improve the generalization performance of neural network models, it may suffer from bad generalization caused by trivial homotopy equivalence paths. Due to the high complexity of high-dimensional space, it is difficult to cut these paths off explicitly. Therefore, we propose a novel approach named the local mutual information regularizer (LMIR) to cut off trivial paths implicitly. Let us now focus on the original feature and the mapped feature. We define a mutual information function between them as
(4) 
where the joint and marginal distributions are those of the original data. In [hjelm2019learning], the researchers proposed a novel approach named Deep INFOMAX (DIM) to maximize the mutual information between the input and output. Deep INFOMAX shows great promise in unsupervised learning: it estimates and maximizes the mutual information in one architecture and finds the most discriminative features of the input. DIM follows Mutual Information Neural Estimation (MINE) [belghazi2018mine] to estimate mutual information and replaces the KL divergence estimator with a non-KL divergence estimator (e.g., the Jensen-Shannon divergence estimator [nowozin2016fgan]) as
(5) 
where the weight-sharing discriminator has learnable parameters and the softplus function is used. Deep INFOMAX maximizes the above equation in a contrastive learning scheme: it finds the discriminative information between samples and is not concerned with their classes, which shows great promise in unsupervised learning. However, Deep INFOMAX cannot increase performance on our supervised tasks (e.g., classification and segmentation) because it is only concerned with the discriminative information between samples and ignores the common features of samples in the same category. Moreover, Deep INFOMAX treats the shuffled feature only as a negative sample. In our theory, we can treat the shuffled feature as a positive sample if the path is nontrivial. Thus, we modify Equation 5 as
(6) 
We exchange the positive and negative samples to obtain a new mutual information estimator named the local mutual information regularizer (LMIR). LMIR can distinguish whether a feature comes from a nontrivial path through mutual information; it guides the model to learn the more similar shuffled features and punishes the less similar shuffled features, which are likely to share less mutual information with the original features. It constrains the neural network to learn nontrivial homotopy equivalence paths inside the same category instead of trivial homotopy equivalence paths between categories. We then define the regularization loss of our LMIR on the points, maximizing the average estimated MI:
(7) 
where the sum runs over the PointShuffleNet layers (implementation details are provided in Section 4). LMIR can be implemented by simply attaching a two-layer discriminator after every HER module at the cost of a small increase in parameters. However, LMIR can be excluded from the inference phase without losing accuracy, because it is not part of feature generation.
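As an illustrative sketch (function names and score inputs are our assumptions, not the paper's code), the DIM-style Jensen-Shannon lower bound can be computed from discriminator scores as follows; LMIR differs only in which feature pairs are scored as positives:

```python
import numpy as np

def softplus(x):
    """Numerically stable softplus, sp(x) = log(1 + exp(x))."""
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

def jsd_mi_lower_bound(pos_scores, neg_scores):
    """Jensen-Shannon MI lower bound used by DIM-style estimators:
    E_pos[-sp(-T)] - E_neg[sp(T)], with T the discriminator score.
    In LMIR, shuffled features from nontrivial paths are scored as
    positives -- the exchange of samples described in the text."""
    return float(np.mean(-softplus(-pos_scores)) - np.mean(softplus(neg_scores)))
```

Maximizing this bound pushes discriminator scores up on positive pairs and down on negative pairs, so a shuffled feature that shares little mutual information with its original incurs a higher loss.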
3.2 ClusterFPS
In the original farthest point sampling implementation, the algorithm samples a subset of points from a point cloud and returns a downsampling of the metric space in which each newly sampled point is the farthest from the previously sampled points. Although it achieves good coverage of the point cloud, the algorithm must know the positions of all previously sampled points before sampling a new one, which makes it difficult to speed up through parallel computing. According to [hu2019randla], FPS can take up to 200 seconds to downsample a large-scale point cloud.
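For clarity, a minimal (non-optimized) NumPy version of iterative FPS, which makes the sequential dependency explicit:

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Greedy FPS: each iteration picks the point farthest from the
    set already chosen. Updating `dist` depends on the previous pick,
    which is the sequential bottleneck discussed in the text."""
    n = points.shape[0]
    chosen = [0]                     # start from an arbitrary seed point
    dist = np.full(n, np.inf)
    for _ in range(m - 1):
        # distance of every point to the most recently chosen point
        d = np.linalg.norm(points - points[chosen[-1]], axis=1)
        dist = np.minimum(dist, d)   # nearest-chosen-point distance
        chosen.append(int(np.argmax(dist)))
    return np.array(chosen)
```

Each `argmax` needs the `dist` array produced by all previous iterations, so the m picks cannot be made independently, which is exactly what ClusterFPS works around.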
To address the above issue, we propose a parallel-friendly sampling algorithm named ClusterFPS that aims to ensure good coverage of the point cloud as quickly as possible.
Computing Cluster Centers. For a given point cloud, its clusters can be computed by the efficient k-means algorithm. A standard k-means implementation is unable to take batch data as input, so we implement a parallel version of the k-means algorithm to utilize GPUs.
Finding Neighboring Points. For each cluster, we use nearest-neighbor search to query its neighboring points, sampling the nearest points around each cluster center to obtain a point set per cluster.
Parallel Farthest Point Sampling. For each cluster, a subset of points can be sampled by FPS in parallel.
Grouping Cluster Points. Finally, the output of parallel farthest point sampling is a collection of downsampled point sets, each containing the sampled points of one cluster. We obtain the final downsampled point set by grouping these sets into one.
Overall, our ClusterFPS algorithm is designed to reduce context dependencies and run in parallel, allowing it to utilize GPUs to compute more quickly. This is discussed in Section 5.2.
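A simplified single-batch sketch of the pipeline above (the actual implementation runs batched k-means and the per-cluster FPS in parallel on GPU; the cluster count, iteration budget and per-cluster quotas here are our assumptions):

```python
import numpy as np

def _fps(points, m):
    """Plain farthest point sampling inside one cluster."""
    dist = np.full(len(points), np.inf)
    chosen = [0]
    for _ in range(m - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[-1]], axis=1))
        chosen.append(int(np.argmax(dist)))
    return np.array(chosen)

def cluster_fps(points, m, k=4, iters=10, seed=0):
    """ClusterFPS sketch: 1) divide with k-means (Lloyd's algorithm),
    2) run FPS independently inside each cluster (the parallelizable
    step), 3) group the per-cluster samples into one set."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((points[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    out = []
    for j in range(k):
        cluster = np.flatnonzero(labels == j)
        if len(cluster) == 0:
            continue
        # proportional per-cluster sampling quota (our assumption)
        quota = max(1, round(m * len(cluster) / len(points)))
        out.extend(cluster[_fps(points[cluster], min(quota, len(cluster)))])
    return np.array(out[:m])
```

Because the per-cluster FPS calls share no state, they can run concurrently, which is where the speedup over plain FPS comes from.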
4 PointShuffleNet
By combining the three components proposed in Section 3.1 and Section 3.2 (HER, LMIR and ClusterFPS), we implement a hierarchical neural network for both classification and segmentation tasks, as shown in Figure 2. We combine the cross-entropy loss with the LMIR regularization loss as our loss function.
As shown in Figure 3, we build a basic PointShuffleNet layer for both classification and segmentation tasks. The input features are split along the channel dimension into two groups to reduce computational complexity [ma2018shufflenet]. Then, a 3-layer convolution is utilized to extract features, which are concatenated with the residual features. The output features are obtained by applying HER. We sample features from the mainstream of the layer (orange line) to construct positive and negative pairs. The LMIR module computes the regularization loss from the outputs of the GAN discriminator. The dotted yellow line and dotted black frame indicate components that can be deleted in the inference phase.
For the classification task, we designed a three-block feature extractor followed by a classifier. Each block contains a sampling layer (ClusterFPS), a grouping layer (same as [qi2017pointnetplusplus]) and three PointShuffleNet layers. The first two blocks sample 512 and 256 points, and a max-pooling layer is adopted to aggregate local features. The last block aggregates the final features from all remaining points. The final classification scores are computed by a classifier with fully connected layers, dropout and softmax activation. Each convolutional layer is followed by a batch normalization layer and the ReLU activation function. Channel attention is also deployed to enhance the representation between blocks.
Method  Input  Points  Params  Class  OA  Infer(ms) 

PointNet [qi2016pointnet]  P  1k  3.5M  86.2  89.2  2.5 
SONet [li2018sonet]  P  2k  2.4M  87.3  90.9   
SPH3D [lei2020spherical]  P  10k  0.8M  89.3  92.1  8.4 
DGCNN [dgcnn]  P  1k  1.8M  90.2  92.2  5.6 
PointCNN [li2018pointcnn]  P  1k  0.6M    92.2  7.5 
KPConv [thomas2019KPConv]  P  6.8k  14.3M    92.9  21.5 
PointASNL [yan2020pointasnl]  P  1k      92.9   
GridGCN [Xu2020GridGCN]  P  1k    91.3  93.1  2.6 
PosPool [liu2020closerlook3d]  P  10k  19.4M    93.2   
RSCNN [liu2019rscnn]  P  1k  1.4M    93.6  4.3 
PointNet++ [qi2017pointnetplusplus]  P,N  5k  1.5M  90.7  91.9  1.3 
PAT [Yang2019PAT]  P,N  1k  0.6M    91.7  11 
SONet [li2018sonet]  P,N  5k  2.4M  89.3  92.3   
SpiderCNN [xu2018spidercnn]  P,N  5k      92.4   
ACNN [komarichev2019acnn]  P,N  1k    90.3  92.6   
PointASNL [yan2020pointasnl]  P,N  1k      93.2   
PSN  P  1k  1.4M  90.5  92.7  
PSN  P,N  1k  1.4M  91.6  93.2 
Method  Size(m)  Points  mIoU  OA  Infer(ms) 
PointCNN [li2018pointcnn]  4096  57.26  85.91    
GridGCN [Xu2020GridGCN]  4096  57.75  86.94  25.9  
PAT [Yang2019PAT]  2048  64.3      
PointASNL [yan2020pointasnl]  8192  68.7  87.7    
PointNet [qi2016pointnet]  4096  41.09    20.9  
DGCNN [dgcnn]  4096  47.94  83.64  178.1  
PointNet++ [qi2017pointnetplusplus]  4096  53.2      
PSN  4096  55.2  86.34  28.3 
Method  mIoU  cls. mIoU  aero  bag  cap  car  chair  earphone  guitar  knife  lamp  laptop  motor  mug  pistol  rocket  skateboard  table  
Number      2690  76  55  898  3758  69  787  392  1547  451  202  184  286  66  152  5271  
PointNet [qi2016pointnet]  83.7  80.4  83.4  78.7  82.5  74.9  89.6  73.0  91.5  85.9  80.8  95.3  65.2  93.0  81.2  57.9  72.8  80.6  
SONet [li2018sonet]  84.9  81.0  82.8  77.8  88.0  77.3  90.6  73.5  90.7  83.9  82.8  94.8  69.1  94.2  80.9  53.1  72.9  83.0  
PointNet++ [qi2017pointnetplusplus]  85.1  81.9  82.4  79.0  87.7  77.3  90.8  71.8  91.0  85.9  83.7  95.3  71.6  94.1  81.3  58.7  76.4  82.6  
DGCNN [dgcnn]  85.1  82.3  84.2  83.7  84.4  77.1  90.9  78.5  91.5  87.3  82.9  96.0  67.8  93.3  82.6  59.7  75.5  82.0  
P2Sequence [liu2019point2sequence]  85.2  82.2  82.6  81.8  87.5  77.3  90.8  77.1  91.1  86.9  83.9  95.7  70.8  94.6  79.3  58.1  75.2  82.8  
PointCNN [li2018pointcnn]  86.1  84.6  84.1  86.5  86.0  80.8  90.6  79.7  92.3  88.4  85.3  96.1  77.2  95.2  84.2  64.2  80.0  83.0  
RSCNN [liu2019rscnn]  86.2  84.0  83.5  84.8  88.8  79.6  91.2  81.1  91.6  88.4  86.0  96.0  73.7  94.1  83.4  60.5  77.7  83.6  
PointASNL [yan2020pointasnl]  86.1  83.4  84.1  84.7  87.9  79.7  92.2  73.7  91.0  87.2  84.2  95.8  74.4  95.2  81.0  63.0  76.3  83.2  
PSN  85.8  82.5  83.5  81.4  87.9  78.8  91.1  74.5  90.6  87.1  84.9  95.8  71.6  95.2  81.0  57.8  75.7  83.8 
For the segmentation task, the configuration of the extractor is similar to that in the classification task. Following [qi2017pointnetplusplus], we construct three blocks as the decoder. Each block consists of distance-based interpolation, across-level skip links and an MLP with HER.
5 Experiments
Method  mACC  mIoU  OA  ceiling  floor  wall  beam  column  window  door  table  chair  sofa  bookcase  board  clutter 

PointNet [qi2016pointnet]  49.0  41.1    88.8  97.3  69.8  0.1  3.9  46.3  10.8  52.6  58.9  40.3  5.9  26.4  33.2 
PointCNN [li2018pointcnn]  63.9  57.3  85.9  92.3  98.2  79.4  0.0  17.6  22.8  62.1  74.4  80.6  31.7  66.7  62.1  56.7 
PointWeb [zhao2019pointweb]  66.6  60.3  87.0  92.0  98.5  79.4  0.0  21.1  59.7  34.8  76.3  88.3  46.9  69.3  64.9  52.5 
HPEIN [jiang2019hpein]  68.3  61.9  87.2  91.5  98.2  81.4  0.0  23.3  65.3  40.0  75.5  87.7  58.5  67.8  65.6  49.7 
PointASNL [yan2020pointasnl]  68.5  62.6  87.7  94.3  98.4  79.1  0.0  26.7  55.2  66.2  83.3  86.8  47.6  68.3  56.4  52.1 
PSN  63.85  55.2  86.3  91.5  98.2  74.2  0.0  6.0  54.7  22.4  73.6  79.3  48.2  60.3  57.5  51.4 
We evaluate PointShuffleNet on several widely used datasets, including ModelNet40 [wu2015modelnet], ShapeNet [shapenet2015] and S3DIS [Armeni2017s3dis]. To demonstrate the efficiency of our method, we also report speed and performance in a fair comparison. All experiments are conducted in PyTorch [pytorch] on a single NVIDIA GTX 1080 Ti GPU with 11 GB of VRAM. We optimize the networks using Adam [kingma2014adam] with an initial learning rate of 0.001 and a batch size of 24. We deploy a cosine-annealing decay schedule [loshchilovH2017sgdr] on the learning rate.
5.1 Point Cloud Classification
The ModelNet40 dataset [wu2015modelnet] includes 12,311 CAD models from 40 classes. Following the official split, we use 9,843 models to train our network and 2,468 models for evaluation. Following the configuration of [qi2016pointnet, qi2017pointnetplusplus], we sample 1,024 points uniformly from the raw points and compute the normal vectors from the corresponding mesh models. We also use the same data augmentation as [qi2016pointnet, qi2017pointnetplusplus]: randomly rotating the data about the vertical axis and jittering each point with zero-mean Gaussian noise of 0.02 standard deviation.
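This augmentation can be sketched as follows (a NumPy illustration; the jitter clipping threshold is our assumption, following the common practice in the cited works):

```python
import numpy as np

def augment(points, sigma=0.02, clip=0.05, rng=None):
    """Randomly rotate an (N, 3) point cloud about the vertical axis
    and jitter each point with zero-mean Gaussian noise (sigma = 0.02
    as in the text; the clip value is an assumption)."""
    if rng is None:
        rng = np.random.default_rng()
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    # rotation matrix about the y (up) axis
    rot = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    jitter = np.clip(sigma * rng.standard_normal(points.shape), -clip, clip)
    return points @ rot.T + jitter
```

Rotation leaves distances from the vertical axis unchanged, so the augmented clouds stay on the same category's closure subspace while the jitter perturbs them locally.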
Classification results on the test set are summarized in Table 1. Our PointShuffleNet achieves state-of-the-art accuracy on ModelNet40, with the exception of RSCNN [liu2019rscnn]. RSCNN improves its results with multiple abstraction scales and a tricky voting test strategy; its training and evaluation protocols differ from the standard ones. Compared to other methods implemented in TensorFlow with better GPUs, our method still strikes a good balance between speed and accuracy. An analysis of the running time is presented in Section 5.2. Remarkably, LMIR gives PSN the highest mean class accuracy due to its strong generalization on each class.
5.2 Point Cloud Segmentation
We evaluate our PSN on two larger-scale segmentation tasks: ShapeNet [shapenet2015] for part segmentation and Stanford 3D Large-Scale Indoor Spaces (S3DIS) [Armeni2017s3dis] for indoor scene segmentation. The ShapeNet dataset contains 16,881 shapes (14,006 models for training and 2,874 for evaluation) from 16 classes, with 50 parts in total. We follow the configuration of PointNet++ [qi2017pointnetplusplus], using 2,048 points as our input. The S3DIS dataset contains 6 large-scale indoor areas with 271 rooms, and each point is annotated with one of 13 categories. Because Area 5 is the only area that has no overlap with other areas, we use Areas 1-4 and 6 as our training split and Area 5 as our evaluation split. Following [yan2020pointasnl], we generate training data by randomly sampling blocks of 4,096 points from the rooms during training. At test time, we split the rooms into blocks of 4,096 points with a fixed stride.
In Table 3, we show the quantitative results of PSN on ShapeNet alongside other state-of-the-art methods [qi2016pointnet, li2018sonet, qi2017pointnetplusplus, dgcnn, liu2019point2sequence, li2018pointcnn, liu2019rscnn, yan2020pointasnl] under the same training and evaluation protocols. PSN shows comparable results with the other methods. Compared with PointNet [qi2016pointnet] and PointNet++ [qi2017pointnetplusplus], we notably increase both mIoU and class mIoU owing to the effectiveness of NEFL.
The evaluation performance on Area 5 of the S3DIS dataset is shown in Table 2 and Table 4. Our method achieves better results than previous methods that use the same experiment setting, and comparable performance to methods that use bigger areas or more points as input.
5.3 Ablation Study
Model  Points  Normal  NEFL  ClusterFPS  Accuracy  
HER  LMIR  class  instance  
A  PointNet++  1024  88.0  90.7  
B  PointNet++  5000  ✓    91.9  
C  PointNet++  1024  ✓  ✓  91.0  92.5  
D  PSN  1024  ✓  ✓  90.9  92.4  
E  PSN  1024  ✓  ✓  ✓  91.1  92.8  
F  PSN  1024  ✓  ✓  ✓  90.9  92.9  
G  PSN  1024  ✓  ✓  ✓  91.5  93.0  
H  PSN  1024  ✓  ✓  ✓  ✓  91.6  93.2 
In this section, we analyze the effectiveness of each component, including HER, LMIR, ClusterFPS and Deep INFOMAX, on the ModelNet40 dataset. The results of the ablation study are summarized in Table 5.
We set two baselines: A and D. Model A is the original implementation of PointNet++ [qi2017pointnetplusplus], and Model D has only the new architecture and ClusterFPS, without NEFL. Compared to Model B, Model D obtains an improvement in mean instance accuracy with fewer input points. When we combine HER with PointNet++ (Model C), there is a great improvement in both mean class and instance accuracy. Model E shows similar improvements in accuracy.
Furthermore, we deploy LMIR into Model E to get Model H. LMIR shows an incremental improvement in accuracy compared to Model E. To be fair, we replace the LMIR module in Model H with the local mutual information loss (Equation 5) from Deep INFOMAX [hjelm2019learning] to get Model F. Compared to Model E, which has no local mutual information loss, Model F fails to improve accuracy and even decreases mean class accuracy. In addition, there is a large gap between Models F and H. Thus, we conclude that our LMIR performs better than Deep INFOMAX in supervised learning. We also replace ClusterFPS with FPS to get Model G, which has similar performance to Model H. We can therefore conclude that although ClusterFPS may not match the sampling uniformity of FPS, the additional cluster information lets it retain comparable performance at faster speed.
5.4 Robustness to Sparser Input
To further verify the robustness of the PSN model, we set up two robustness experiments: training and evaluating on sparse input (i.e., 1024, 512, 256, 128, 64 and 32 points), and training on 1024 points but evaluating on sparse input (i.e., 1024, 512, 256 and 128 points). We compare our method with PointNet [qi2016pointnet], PointNet++ [qi2017pointnetplusplus], SpiderCNN [xu2018spidercnn], PCNN [Matan2018pcnn] and DGCNN [dgcnn].
As shown in Figure 5, PSN is very robust when trained and evaluated on sparse points: its accuracy drops only slightly from 1024 points to 256. At each density, PSN outperforms the other methods and still achieves reasonable accuracy with only 32 input points.
Following [qi2016pointnet, qi2017pointnetplusplus, Matan2018pcnn, dgcnn], we train our PSN on a fixed number of points and evaluate at different densities. As Figure 5 shows, PSN is robust enough to process sparse points randomly sampled from the 1024 training points. These experimental results demonstrate that HER and LMIR help models improve robustness to a variety of input sizes.
5.5 Speed Study
We test the efficiency of ClusterFPS by gradually increasing the number of input points from 1,024 to 1,000,000, running FPS and ClusterFPS to downsample a point cloud with a batch size of 1. As shown in Table 6, ClusterFPS outperforms FPS for every input size. Remarkably, ClusterFPS can downsample one million points to 100,000 in only 1.53 s, roughly a 150× speedup over FPS. This demonstrates the capability of our model for processing large-scale point clouds.
Num. of Input  1024  4096  10000  1,000,000 

Num. of Output  512  1024  4096  100,000 
FPS  95.1 ms  187 ms  798 ms  230 s 
ClusterFPS  16.3 ms  32.7 ms  108 ms  1.53 s 
6 Conclusion
In this paper, we proposed PointShuffleNet (PSN) for efficient point cloud analysis. PSN shows great promise in point cloud classification and segmentation by introducing the non-Euclidean feature learner (NEFL), which contains a homotopy equivalence relation (HER) grounded in topology and a local mutual information regularizer (LMIR). HER introduces the homotopy equivalence transformation to our model, enables the neural network to learn the data distribution through homotopy equivalence and extends its generalization. To cut off the trivial homotopy equivalence relations that lead to bad generalization, LMIR regularizes the neural network to focus on nontrivial paths rather than trivial paths by maximizing the mutual information between the original features and the HER-transformed features. To further improve efficiency, we proposed a point sampling algorithm named ClusterFPS based on the farthest point sampling algorithm; it splits point clouds into several clusters and deploys FPS on each cluster to run parallel computations on GPUs. PSN achieves state-of-the-art accuracy on classification and a great balance between speed and performance on various tasks.
References
Appendix A Glossary

Homotopy Equivalence Relation. The key concept of homotopy is a continuous deformation between two continuous functions in a topological space; such functions are said to be equivalent in topology. We can use the homotopy equivalence relation to help neural networks generalize better. In fact, the homotopy equivalence relation can be considered a principle of feature augmentation in which the augmented features may share a continuous closure subspace with the original features. We use the modulo shuffle as our deformation method rather than a single-layer perceptron because of its randomness and high efficiency.

Local Mutual Information Regularizer. We cannot control what the homotopy equivalence relation transforms the original data into, and there is no mathematical tool in topology for dropping unreasonable deformation paths explicitly. So we propose a mutual-information-based regularizer, named the local mutual information regularizer, to cut off trivial paths (the unreasonable deformation paths) implicitly. We use a GAN [nowozin2016fgan] to estimate the mutual information between the transformed feature and the original one: we insert multiple discriminators into the network and utilize different segments of the backbone as generators, forming multiple GANs inside one network. We can thus distinguish whether a feature comes from a nontrivial path (a reasonable deformation) through mutual information, guiding the model to learn the more similar shuffled features and punishing the less similar shuffled features, which are likely to share less mutual information with the original features.

Continuous Closure Subspaces. The continuous closure subspace is a space that accurately describes the distribution of data; each class has one and only one such subspace, although there are infinitely many possible decision boundaries between classes. We can use the continuous closure subspace to separate the corresponding class from the other classes, acting as a decision boundary.

Modulo Shuffle. A shuffle in which a deck of cards is divided into several equal piles that are then interleaved. The modulo shuffle is also called the riffle shuffle or the Faro shuffle when the deck is divided into two halves. [Topology2000Anderson] proved that the modulo shuffle is a homotopy equivalence transformation.
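For illustration (our own sketch of the card-shuffle analogy, not code from the paper), a modulo shuffle that deals the deck into equal piles and interleaves them:

```python
def modulo_shuffle(deck, piles=2):
    """Deal `deck` into `piles` equal piles, then interleave them by
    taking one card from each pile in turn (the riffle/Faro shuffle
    when piles == 2)."""
    n = len(deck)
    size = n // piles
    groups = [deck[i * size:(i + 1) * size] for i in range(piles)]
    return [g[j] for j in range(size) for g in groups]
```

On a deck 0..7 with two piles this yields 0, 4, 1, 5, 2, 6, 3, 7, the same fixed-stride memory-read pattern used by the sample and channel shuffles in Section 3.1.1.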

Trivial Path. A deformation path that connects several point clouds belonging to different classes and intersects multiple closure subspaces. Trivial paths may lead neural networks to bad generalization.

Nontrivial Path. A deformation path that connects several point clouds belonging to the same class and has no intersection with other closure subspaces. A nontrivial path can be considered a continuous data augmentation method that interpolates multiple data points inside the closure subspace and clarifies the decision boundaries.