1 Introduction
Scene understanding has long been a challenging problem in computer vision. Recently, there have been significant advances in applying deep learning [19] to train neural networks for numerous tasks such as object classification and semantic segmentation. With the wide availability of consumer-grade depth sensors, acquiring 3D data has become more intuitive and robust, with many 3D datasets publicly available [40, 4, 14, 7, 2, 43, 36]. This has led to increased interest in tackling scene understanding in the 3D domain.
Among the representations for 3D data, a promising direction is to let neural networks consume point cloud data directly, since point clouds are the common data format acquired from depth sensors such as RGB-D cameras or LiDAR. However, a point cloud is mathematically a set and thus fundamentally differs from an image, so passing a point cloud to a traditional neural network like those used in the image domain does not work. In principle, it is necessary to design a convolution-equivalent operator in the 3D domain that takes a point cloud as input and outputs per-point features. Several attempts have been made with promising results [27, 29, 15, 20, 42, 47].
Despite such research efforts, a problem often overlooked in point cloud convolution is that the operator does not exhibit rotation invariance. A viable solution in 2D deep learning is to augment training data with random rotations. However, in 3D, such data augmentation becomes less effective due to the additional degrees of freedom in representing 3D rotations, which can make training prohibitively expensive. A few works instead learn rotation-invariant features [46, 30, 26, 8, 5], which allow consistent predictions given arbitrarily rotated point clouds.
Unfortunately, a limitation of previous works is that rotation-invariant convolution does not yield features that are as distinctive as those of translation-invariant convolution. As a result, object classification with aligned data is more accurate than the same task with arbitrarily rotated data. With exact rotation invariance, one would expect rotation-invariant convolution to be as accurate as its translation-invariant sibling.
In this paper, we propose a novel approach for performing rotation-invariant convolution on point clouds. Our key observation is that adding rotation invariance introduces ambiguities and thus reduces feature distinctiveness. To address this problem, we propose to integrate global context information from the input point cloud into the convolution, resulting in a global context aware convolution for 3D point clouds. The main contributions of this work are:

GCAConv, a novel rotation-invariant convolution operator that outputs features from local point sets and global anchors. Each anchor is built from subdivided spaces using a globally-weighted local reference frame at each keypoint. By explicitly encoding the relation between local point sets and the global anchors, GCAConv can capture both local and global context;

GCANet, a neural network architecture that uses GCAConv for learning rotation-invariant features for 3D point clouds. The network allows consistent performance across training/testing scenarios that involve different rotation modes;

Applications of GCANet to object classification, object part segmentation, shape retrieval, and normals estimation that achieve state-of-the-art performance under challenging rotations.
2 Related Works
Deep learning in the 2D domain has witnessed great success in solving scene understanding tasks such as object classification, semantic segmentation, and normal estimation. Drawing on this inspiration, techniques for deep learning in the 3D domain have recently been developed with promising results. In this section, we review the state-of-the-art research in deep learning with 3D data, and then focus on techniques that enable feature learning on point clouds for scene understanding tasks.
Early research in 3D deep learning focused on regular and structured representations of 3D scenes such as multiple 2D images [33, 28, 10], 3D volumes [28, 21], and hierarchical data structures like octrees [31] or kd-trees [18, 38]. Such representations yield good performance. However, they face practical challenges due to memory consumption, imprecise representation, or lack of scalability when high-resolution data is employed.
Many recent works in 3D deep learning have switched to investigating how to learn with 3D point clouds, a more compact and intuitive representation compared to volumes and image sets. However, performing deep learning with 3D point clouds is not as straightforward as extending 2D image convolution to 3D because, mathematically, a point cloud is a set. To define a valid convolution for a point cloud, it is necessary to ensure that the output features of the convolution are invariant to the permutation of the point set. PointNet [27] pioneered such a solution, outputting global features by max-pooling per-point features from MLPs. Several follow-up works focus on designing convolutions that can learn local features for a point cloud efficiently [15, 29, 20, 42, 39, 47]. Please also refer to the technical report by Guo et al. [13] for a further summary of deep learning techniques for 3D point clouds.
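The permutation-invariance requirement above can be illustrated with a small NumPy sketch. It is not PointNet's actual architecture: the two-layer shared MLP, the layer widths, and the names `global_feature`, `w1`, and `w2` are assumptions chosen only to show that a per-point transform followed by max-pooling yields the same global feature for any ordering of the points.

```python
import numpy as np

def global_feature(points, w1, w2):
    """Shared two-layer MLP applied to every point, then max-pooling over points."""
    hidden = np.maximum(points @ w1, 0.0)      # (N, H) per-point features with ReLU
    per_point = np.maximum(hidden @ w2, 0.0)   # (N, F) per-point features
    return per_point.max(axis=0)               # (F,) symmetric aggregation

rng = np.random.default_rng(0)
points = rng.normal(size=(128, 3))             # toy point cloud
w1 = rng.normal(size=(3, 16))                  # hypothetical weights (untrained)
w2 = rng.normal(size=(16, 32))

feat = global_feature(points, w1, w2)
shuffled = points[rng.permutation(len(points))]
feat_shuffled = global_feature(shuffled, w1, w2)
```

Because the maximum over a set does not depend on the order of its elements, `feat` and `feat_shuffled` agree up to floating-point rounding.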
A fundamental feature missing from the previously mentioned convolutions for point clouds is rotation invariance. A common solution is to augment the training data with arbitrary rotations, but generalizing the predictions to unseen rotations remains challenging, not to mention that the training time becomes longer due to the increased amount of training data. Instead, it is desirable to have a point cloud convolution with rotation-invariant features.
To this end, Rao et al. [30] map a point cloud to a spherical domain to define a rotation-invariant convolution. Zhang et al. [46] proposed a convolution that operates on features built from Euclidean distances and angles. Poulenard et al. [26] proposed to integrate spherical harmonics into a convolution. You et al. [44] transform the point cloud onto spherical voxel grids and apply convolution in the transformed domain. A great benefit of such techniques is that they allow consistent predictions across training/testing scenarios with or without rotations being applied to the data, and they can generalize robustly to inputs with unseen rotations. Despite that, so far these techniques share a common limitation: their performance is inferior to that of translation-invariant point cloud convolution. A typical example is the accuracy of object classification on the ModelNet40 dataset [40]. State-of-the-art techniques such as PointNet [27], PointNet++ [29], PointCNN [20], or ShellNet [47] report between 89% and 93% accuracy, while techniques with rotation-invariant convolution only report up to 86% accuracy [46, 26]. Our work in this paper is dedicated to analyzing and addressing this problem.
3 Background
Let us first analyze the performance of existing point cloud convolutions and their rotation-invariant counterparts. We select object classification as the key task for our analysis. An observation is that the classification accuracy drops when rotation-invariant convolution is applied. We further dissect this phenomenon by visualizing the latent space learnt by the neural networks using t-SNE [37]. The results are shown in Figure 1.
In this figure, we follow Esteves et al. [9] and Zhang et al. [46] to evaluate three scenarios for object classification: z/z, SO3/SO3, and z/SO3. In case z/z, we use data augmented with rotations about the gravity axis for training and testing. In case SO3/SO3, we use data augmented with arbitrary rotations for training and testing. In case z/SO3, we train with data augmented by z-rotations and test with data augmented by SO3 rotations. The first scenario has been extensively evaluated by previous point cloud convolution methods. The second and third scenarios are specially designed to evaluate rotation invariance. The third scenario is the most challenging as it is designed to test whether a convolution can generalize well to unseen rotations.
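The two rotation modes used in these scenarios can be sketched as follows. This is a minimal illustration, not the evaluation code of the cited works: it assumes the gravity axis is the z axis, and the function names `random_z_rotation` and `random_so3_rotation` are hypothetical. The SO3 sampler draws an orthogonal matrix from the QR decomposition of a Gaussian matrix and flips one column if needed so that the determinant is +1.

```python
import numpy as np

def random_z_rotation(rng):
    """Rotation by a random angle about the z axis (assumed gravity axis)."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def random_so3_rotation(rng):
    """Random 3D rotation obtained from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))        # fix the sign ambiguity of the QR factors
    if np.linalg.det(q) < 0:           # turn a reflection into a proper rotation
        q[:, 0] = -q[:, 0]
    return q

rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))     # toy point cloud, one point per row
rot_z = random_z_rotation(rng)
rot_so3 = random_so3_rotation(rng)
cloud_z = cloud @ rot_z.T              # "z" mode augmentation
cloud_so3 = cloud @ rot_so3.T          # "SO3" mode augmentation
```

In the z/SO3 scenario, a network would be trained on clouds like `cloud_z` and tested on clouds like `cloud_so3`.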
As can be seen, the latent space learnt by a rotation-invariant convolution such as RIConv by Zhang et al. [46] does not exhibit good discrimination among classes. The main difference between such a convolution and a traditional point cloud convolution is that it no longer works with point coordinates at the start. In the case of RIConv, the points are transformed into Euclidean-based features including distances and angles, which are not as unique as point coordinates since many points can share the same distances and angles. This is well reflected in the t-SNE plots in the first column (z/z) of Figure 1. PointNet++ [29] has a good separation among the clusters while RIConv [46] has more condensed clusters in the center, resulting in more ambiguities during classification.
Similarly, in the second column (SO3/SO3), PointNet++ and RIConv have similar clustering, which explains their similar classification performance (see more quantitative comparisons in Table 1). Finally, the third column (z/SO3) highlights the strength of rotation-invariant convolutions, as they can still maintain consistent predictions and generalize well to unseen conditions. In this case, the t-SNE plots show that PointNet++ cannot generalize effectively.
The goal of our work is to devise a convolution that can output highly distinctive rotation-invariant features. Here we achieve this by introducing features from a global context to design a new rotation-invariant convolution. We are inspired by the fact that for each point in a point cloud, its 3D coordinates encode global information. Such global information is lost when one converts the coordinates into rotation-invariant features such as distances and angles, as done by Zhang et al. [46].
4 Our Method
Our rotation-invariant convolution is built upon two key concepts: a repeatable and robust local reference frame and a global context using anchors. The idea of using local reference frames is related to the spatial transformer [17], which is also leveraged by PointNet [27]. However, as the spatial transformer is data-driven, it does not generalize well to unseen conditions such as the z/SO3 test in Figure 1. To achieve robustness, we build local reference frames (LRFs) at the keypoints of the point cloud so that features can be learnt in such local spaces. At a keypoint, not only do points in its local neighborhood strongly affect the construction of the reference frame, but non-neighboring points also contribute to it. It is well known that repeatable and robust LRFs are key to traditional 3D point descriptors [35].
After the LRFs are constructed, theoretically we can simply proceed to learn features of the local point sets. However, as previously mentioned, global shape information is also useful for feature learning. We therefore retain such global information and integrate it into the convolution. Here we achieve this through anchors. Each anchor is defined as a representative point in each subspace formed by the axes of the LRF. Given an LRF, it is possible to construct eight subspaces. At each LRF, the anchors thus approximate global features of the point cloud and we integrate such features to define our convolution.
4.1 Globally Weighted Local Reference Frames
For an input point set, we use farthest point sampling to select a set of keypoints which can fully cover the underlying point cloud and denoted as . For each keypoint , we use it as a query to obtain local region centroid at
. We wish to use deep learning to extract rotation invariant features from the local region. To begin with local features learning, it is necessary to construct a local reference frame (LRF) such that the 3D coordinates can be transformed into rotation invariant features. The unit vectors of the LRF at
can be determined by normalizing the eigenvectors of the covariance matrix
(1) 
where is the number of points in the local region and . However, the LRF via such computation is unstable and sensitive to noise. Slight point variations can affect the LRF and make it not repeatable. Moreover, when a local region undergoes some rotations, ambiguity can arise, reducing the distinctiveness of the local features. For example, it is hard to tell apart a corner region on a bed and on a floor/wall/ceiling in the presence of arbitrary rotations. To solve these problems, we establish more reliable LRFs by utilizing all query points of in the construction:
(2) 
where is the weight that controls how a point in the point set contributes to the matrix. The weight is defined by
(3) 
where . Intuitively, this weight allows nearby points of to have large contributions to the covariance matrix, and thus greatly affect the LRF. Points further away from however can contribute globally to the robustness of the LRF. Such weighted LRF construction is a fundamental step in 3D handcrafted features [35], which can be easily integrated into our proposed convolution.
A typical problem in defining LRFs is the sign flipping, i.e., the LRF signs should not vary for the same point set [35]. There are multiple ways to resolve the ambiguity; here we disambiguate the signs of the eigenvectors by orienting them to the global vector defined by
(4) 
which represents the main orientation of the whole model from the perspective of point .
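The construction in this subsection can be sketched in NumPy as follows. The exact weight and global-vector formulas are not recoverable from the text above, so this sketch assumes a weight that decreases linearly with distance from the keypoint (normalized to sum to one) and a global vector equal to the weighted sum of the keypoint-to-point vectors. The function name `weighted_lrf`, the choice to orient only the first two axes, and the cross product for the third axis are likewise assumptions for illustration.

```python
import numpy as np

def weighted_lrf(keypoint, points):
    """Return a 3x3 matrix whose columns are the axes of an LRF at `keypoint`."""
    diffs = points - keypoint                       # vectors from keypoint to all points
    dists = np.linalg.norm(diffs, axis=1)
    raw = dists.max() - dists                       # assumed: nearer points weigh more
    weights = raw / raw.sum()
    cov = (weights[:, None] * diffs).T @ diffs      # weighted 3x3 covariance matrix
    _, eigvecs = np.linalg.eigh(cov)                # unit eigenvectors, ascending eigenvalues
    axes = eigvecs[:, ::-1].copy()                  # largest-variance axis first
    global_vec = weights @ diffs                    # assumed form of the global vector
    for k in range(2):                              # resolve the sign ambiguity
        if axes[:, k] @ global_vec < 0:
            axes[:, k] = -axes[:, k]
    axes[:, 2] = np.cross(axes[:, 0], axes[:, 1])   # keep the frame right-handed
    return axes

rng = np.random.default_rng(0)
cloud = rng.normal(size=(256, 3)) * np.array([3.0, 2.0, 1.0])
axes = weighted_lrf(cloud[0], cloud)

# Rotate the whole cloud with a fixed rotation and rebuild the frame.
cz, sz = np.cos(0.7), np.sin(0.7)
cx, sx = np.cos(1.1), np.sin(1.1)
rot_z = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
rot_x = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
rotation = rot_z @ rot_x
axes_rot = weighted_lrf(rotation @ cloud[0], cloud @ rotation.T)
```

Under these assumptions the frame rotates together with the point cloud, so `axes_rot` matches `rotation @ axes` for generic inputs. This equivariance is what makes coordinates expressed in the frame rotation invariant.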
4.2 Anchor Point Generation
Theoretically, it is possible to perform convolution on the point set transformed into local coordinates using the constructed LRF. However, it is wasteful to discard global information from the original coordinates as such information can further improve feature distinctiveness. Our idea here is to use anchor points to retain such information in a compact way.
Specifically, to establish the anchors, we divide the whole input point cloud into eight bins, as shown in Figure 2. In each bin, we use the barycenter of the local point set in that bin as the anchor point. Such anchors are crude approximations to the global input shape, and therefore they convey useful information for the convolution.
It is worth noting that there are many ways to define anchors in our case. For example, one can choose to use more bins or all the original point coordinates as anchors, but those will significantly increase computation time for the convolution. We empirically use eight bins as it strikes a balance between the amount of global information retained and the running time.
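A minimal sketch of the eight-bin anchor construction is given below. It assumes the bins are the octants defined by the signs of the coordinates in the LRF, and it falls back to the keypoint when a bin is empty. The fallback, the bin ordering, and the name `octant_anchors` are illustrative assumptions and are not specified in the text.

```python
import numpy as np

def octant_anchors(points, origin, axes):
    """Barycenter of the points falling in each octant of the frame (origin, axes)."""
    local = (points - origin) @ axes                # coordinates in the LRF (axes as columns)
    signs = (local >= 0).astype(int)                # one bit per axis
    bin_ids = signs[:, 0] * 4 + signs[:, 1] * 2 + signs[:, 2]
    anchors = np.zeros((8, 3))
    for b in range(8):
        members = points[bin_ids == b]
        # Assumed fallback: use the keypoint itself when a bin contains no points.
        anchors[b] = members.mean(axis=0) if len(members) > 0 else origin
    return anchors

# Toy check: the eight cube corners each fall into their own octant.
corners = np.array([[x, y, z] for x in (-1.0, 1.0)
                              for y in (-1.0, 1.0)
                              for z in (-1.0, 1.0)])
anchors = octant_anchors(corners, np.zeros(3), np.eye(3))

rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))
cloud_anchors = octant_anchors(cloud, cloud[0], np.eye(3))
```

Using more bins would only change how `bin_ids` is computed, at the cost of a larger anchor set per keypoint.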
4.3 Global Context Aware Convolution
With the LRFs and anchors points defined, we are now ready to construct our Global Context Aware Convolution (GCAConv) to learn the rotation invariant features. Let us consider a point set where represents 3D coordinates of the point . Let be a local point set centered at . A typical convolution to learn the features of can be written as
(5) 
This formula indicates that features of each point in the point set are first transformed before being aggregated by the aggregation function
and passed to an activation function
. A popular choice of is max-pooling, which supports permutation invariance in the order of the input point features [27]. There are a few ways to define the transformation function . In PointNet [27], it is defined by
(6) 
where indicates the element-wise product. This product, however, ignores the contribution of features from neighboring points to center . To further incorporate such neighbor information, Liu et al. [22] proposed to define the weights by a mapping from a relation vector between a point and its neighbor .
Here our goal is to define the weights by using the local point set and the anchors. We project both the local point set and anchor points onto the LRF system such that the global 3D coordinates are transformed to a local frame:
(7) 
where and represents the global point and anchor, and and represents the local point and anchor, respectively. From here, we aim to relate the weights to such coordinates. Given a pair of a local point and an anchor , we define their relation as
(8) 
which can be represented by a vector. We stack the features over eight anchors into an matrix.
Our convolution can then be defined as a 1D convolution that transforms such matrix into a feature vector. The kernel of the convolution is .
(9) 
Note that in this formula, we operate on local coordinates, and we use the anchors to approximate features from neighboring points. This gives us two main advantages. First, our convolution only needs local features to operate. Second, the LRFs make the learnt features rotation invariant by definition, without the need for data augmentation during training. Our features can generalize easily to unseen rotations, and we also save a lot of computation during training.
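The claim that the features are rotation invariant by definition rests on the projection onto the LRF, which can be checked numerically. The sketch below uses an arbitrary orthonormal frame as a stand-in for an LRF and assumes that the frame rotates together with the point cloud, which is the property the construction in Section 4.1 is designed to provide. The name `to_local` is hypothetical.

```python
import numpy as np

def to_local(points, origin, axes):
    """Express global points in the frame centered at `origin` with column axes `axes`."""
    return (points - origin) @ axes

rng = np.random.default_rng(1)
points = rng.normal(size=(64, 3))
origin = points[0]
axes, _ = np.linalg.qr(rng.normal(size=(3, 3)))      # stand-in orthonormal frame

rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(rotation) < 0:                       # make it a proper rotation
    rotation[:, 0] *= -1.0

local_before = to_local(points, origin, axes)
# Rotate the points, the keypoint, and the frame by the same rotation.
local_after = to_local(points @ rotation.T, rotation @ origin, rotation @ axes)
```

The local coordinates are unchanged because the rotation cancels in the product of the rotated differences with the rotated axes, so any function of these coordinates, including relations to projected anchors, is unaffected by the rotation.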
4.4 Network Architecture
We use the proposed convolution to design three neural networks for object classification, object part segmentation, and normals estimation, respectively. The architecture is shown in Figure 3. Our classification network has a standard architecture and uses three consecutive layers of convolution (with point downsampling) followed by fully connected layers (256, 128) to output the probability map. In the three convolution layers, the output channels are set to 128, 256, and 512, respectively, and the numbers of downsampled points are set to 512, 128, and 32, respectively. The neural networks for object part segmentation and normals estimation have a decoder branch that includes skip connections and gradually upsamples the point cloud to the original resolution. We use an MLP after a skip connection to unify and transform the combined features to a valid size before deconvolution. Our deconvolution is defined similarly to GCAConv. The minor difference is that it gradually outputs denser points with fewer features.
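The point downsampling between layers can be illustrated with farthest point sampling, which Section 4.1 names as the keypoint selection method. The following is a plain NumPy sketch of the standard greedy algorithm, not the implementation used in the paper. The function name, the fixed starting index, and the chaining of the three stages on the indices of the previous stage are assumptions.

```python
import numpy as np

def farthest_point_sampling(points, num_samples, start=0):
    """Greedy farthest point sampling; returns indices into `points`."""
    selected = np.empty(num_samples, dtype=int)
    selected[0] = start
    # Distance from every point to the closest already selected point.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for i in range(1, num_samples):
        selected[i] = int(np.argmax(min_dist))
        new_dist = np.linalg.norm(points - points[selected[i]], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected

rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))

# Downsampling sizes of the three convolution layers: 512, 128, 32.
idx_512 = farthest_point_sampling(cloud, 512)
idx_128 = farthest_point_sampling(cloud[idx_512], 128)
idx_32 = farthest_point_sampling(cloud[idx_512][idx_128], 32)
```

Each stage keeps a subset of the previous stage's points that covers the shape as evenly as the greedy criterion allows.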
5 Experimental Results
In this section, we evaluate our method on the 3D object classification, object part segmentation, shape retrieval, and normal estimation tasks. We implemented our method in TensorFlow [1]. We use a batch size of to train object classification and to train object part segmentation, shape retrieval, and normal estimation. The training is performed with the Adam optimizer with an initial learning rate of 0.001. The experiments are conducted on a machine with an Intel(R) Core(TM) i7-6900K CPU and an NVIDIA GTX TITAN X GPU.
5.1 Classification on ModelNet40
Method  Format  Input size  Params.  z/z  SO3/SO3  z/SO3  Average acc.  Acc. std. 
VoxNet [16]  voxel  0.9M  83.0  87.3    85.2  3.0  
SubVolSup [28]  voxel  17M  88.5  82.7  36.6  69.3  28.4  
Spherical CNN [9]  voxel  0.5M  88.9  86.9  78.6  84.8  5.5  
MVCNN 80x [33]  view  99M  90.2  86.0  81.5  85.9  4.3  
PointNet [27]  xyz  3.5M  87.0  80.3  21.6  63.0  41.0  
PointNet++ [29]  xyz  1.4M  89.3  85.0  28.6  67.6  33.8  
PointCNN [20]  xyz  0.60M  91.3  84.5  41.2  72.3  27.2  
RSCNN [22]  xyz  1.41M  90.3  82.6  48.7  73.9  22.1  
RIConv [46]  xyz  1024  0.70M  86.5  86.4  86.4  86.4  0.1 
SPHNet [26]  xyz  1024  2.9M  87.0  87.6  86.6  87.1  0.5 
SFCNN [30]  xyz  1024    91.4  90.1  84.8  88.8  3.5 
ClusterNet [5]  xyz  1024    87.1  87.1  87.1  87.1  0.0 
Ours (w/o anchor)  xyz  1024  0.21M  86.3  86.2  86.2  86.2  0.0 
Ours  xyz  1024  0.39M  89.0  89.2  89.1  89.1  0.0 
Object classification is the main task in our evaluation. We train the classification network by using the ModelNet40 variant of the ModelNet dataset [41]. ModelNet40 contains CAD models from 40 categories such as airplane, bottle, chair, dresser, vase, etc. We use the preprocessed data from PointNet [27] that consists of models for training and models for testing. We use point clouds of size 1024 in this task. Each point is represented by
coordinates in the Euclidean space. The training takes approximately 11 hours to converge in 250 epochs.
Following Esteves et al. [9] and Zhang et al. [46], we evaluate the performance of object classification in three scenarios: (1) using data augmented with rotations about the gravity axis (z/z) for training and testing, (2) using data augmented with arbitrary rotations (SO3/SO3) for training and testing, and (3) training with data augmented by z-rotations and testing with data augmented by SO3 rotations (z/SO3). Rotation-invariant convolutions are expected to work well in the z/SO3 scenario.
Table 1 details the results of this experiment, which confirm the effectiveness of the proposed rotation-invariant convolution. As can be seen, on average, not only does our classification accuracy outperform that of the state-of-the-art translation-invariant point cloud convolutions, but the performance is also consistent across the three scenarios. Among rotation-invariant convolutions, our method outperforms RIConv [46], SPHNet [26], and ClusterNet [5] by a good margin. Our method is slightly more accurate than SFCNN [30] on average but much more consistent.
5.1.1 Ablation Studies
Network Design.
Model  Weight  Vector  Anchor  Rot. Aug.  Acc.  








We conduct an ablation study on the ModelNet40 dataset for the classification task (Table 2). We examine four settings in our convolution: (1) the globally weighted LRFs with main orientation (Weight), (2) the use of main orientation to resolve the LRF sign ambiguity ( vector), (3) the use of anchors for global context (Anchor), and (4) the data augmentation with rotations used for the training (Rot. Aug.). Five models (A to E) are used to study the effects of these settings by turning them on/off.
Model A is our baseline with all settings on. Model B tests the importance of the weights for computing LRFs and the main orientation. It can be seen that without such weights, the accuracy decreases to 87.1%. The main reason is that the LRFs and the main orientation are noisier and less repeatable in this case. Next, in model C we further turn off the
vector to test the stability of the LRFs without sign correction. The accuracy further decreases to 86.7%. This verifies that constructing stable LRFs is key to good network performance. In model D, we turn off the global anchor. In this case, only the local points are used for feature extraction. Thanks to the LRFs, the local features are still effective despite a mild accuracy drop. In model E, we test the performance without the rotation augmentation scheme during training. We find that the accuracy is not affected by data augmentation, as GCAConv already achieves exact rotation invariance.
Number of Anchors  1  2  4  8 

Accuracy  87.3  87.8  88.5  89.2 
Comparison to learned LRFs.
It is generally tempting to learn the LRFs when designing a rotation-invariant convolution. Here we compare this approach to our proposed LRFs. We use a two-layer MLP to predict the LRFs and then use them to transform the input point coordinates into local coordinates before proceeding to the convolution as described in the main paper. We found that predicting LRFs works well in the z/z and SO3/SO3 modes, achieving accuracies of 89.3% and 89.2%, respectively. However, using data-driven LRFs makes the convolution only rotation-aware, but not exactly rotation-invariant. Such a convolution fails to generalize to unseen rotations in the z/SO3 scenario, with an accuracy of 36.2%.
Number of Anchors.
From the ablation studies, we see that without global anchors, the performance decreases. Here, we further analyze the effect of the number of anchors by investigating the performance on ModelNet40 with different numbers of anchors. The quantitative results are shown in Table 3. We can see that with only one anchor, the accuracy decreases to 87.3%, which is still higher than that of RIConv at around 86.4%. This shows the advantage of global information. As the number of anchors increases, the accuracy also increases. We empirically use eight anchors as it strikes a balance between the amount of global information retained and the running time.
5.2 Object Part Segmentation on ShapeNet
In addition to object classification, we evaluate our method on outputting a label for each point in the point cloud, resulting in object part segmentation. We use the 3D models in ShapeNet [4] to train our network with point clouds of size 2048 in this task. It takes roughly 36 hours for the training to complete 300 epochs.
The quantitative and qualitative results are shown in Table 4 and Figure 4, respectively. In this task, we achieve state-of-the-art results for both the SO3/SO3 and z/SO3 scenarios. Our method outperforms RIConv [46] by almost of accuracy. From Figure 4, we can clearly see that in z/SO3 mode, methods like PointNet++ and SpiderCNN do not work well. This is easy to explain, as these methods use the raw xyz coordinates as input for training and thus cannot handle unknown rotations well. RIConv [46] works better as it converts xyz coordinates into a rotation-invariant format such as distances and angles before training. However, it still has difficulties in recognizing the boundaries, while our method handles these regions well by incorporating global context information (see columns 2 and 3 in Figure 4).
Method  input  SO3/SO3  z/SO3 

PointNet [27]  xyz  74.4  37.8 
PointNet++ [29]  xyz+normal  76.7  48.2 
PointCNN [20]  xyz  71.4  34.7 
DGCNN [39]  xyz  73.3  37.4 
SpiderCNN [42]  xyz+normal  72.3  42.9 
RSCNN [22]  xyz  72.5  36.5 
RIConv [46]  xyz  75.5  75.3 
Ours (w/o anchor)  xyz  73.2  73.6 
Ours  xyz  77.3  77.2 
5.3 Shape Retrieval
A popular evaluation of rotation invariance on 3D shapes is the shape retrieval task [32]. Here we conducted experiments on ShapeNet Core [41], following the perturbed protocol of the SHREC'17 3D shape retrieval contest [32] and the experiment setting of SFCNN [30]. We use the same output features from the bottleneck layer in the network (similar to the features used in the classification task; see Figure 3). We compare with methods proposed in SHREC'17 [11, 34, 3] and two recent methods on rotation-invariant convolution [9, 30]. The results are shown in Table 5. It can be seen that our method achieves state-of-the-art accuracy, outperforming previous methods on most evaluation metrics.
micro  macro  
Method  P@N  R@N  F1@N  mAP  NDCG  P@N  R@N  F1@N  mAP  NDCG  Score 
Furuya [11]  81.4  68.3  70.6  65.6  75.4  60.7  53.9  50.3  47.6  56.0  56.6 
Tatsuma [34]  70.5  76.9  71.9  69.6  78.3  42.4  56.3  43.4  41.8  47.9  55.7 
Zhou [3]  66.0  65.0  64.3  56.7  70.1  44.3  50.8  43.7  40.6  51.3  48.7 
Spherical CNN [9]  71.7  73.7    68.5    45.0  55.0    44.4    56.5 
SFCNN [30]  77.8  75.1  75.2  70.5  81.3  65.6  53.9  53.6  48.3  58.0  59.4 
Ours  82.9  76.3  74.8  70.8  81.3  66.8  55.9  51.2  49.0  58.2  61.2 
The accuracy (%) is reported based on the standard evaluation metrics including precision, recall, F-score, mean average precision (mAP), and normalized discounted cumulative gain (NDCG).
5.4 Normals Estimation
Normals estimation for point clouds is instrumental in many applications such as point cloud rendering, feature extraction, and surface reconstruction. Here we conduct normals estimation on point clouds using the ModelNet40 dataset. For each model, we uniformly sample points from the original data for training. We compute a loss based on the cosines between the predicted unit vectors and the ground truth normals to guide the training. Our results are shown in Table 6.
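A cosine-based loss of this kind, together with the angular error between predicted and ground truth normals, can be sketched as follows. The exact loss used in the paper is not stated, so the form `1 - cos` averaged over points is an assumption, as are the function names and the small `eps` added for numerical safety.

```python
import numpy as np

def cosine_normal_loss(pred, target, eps=1e-12):
    """Mean of (1 - cosine) between predicted and ground truth normals (assumed form)."""
    pred_unit = pred / (np.linalg.norm(pred, axis=1, keepdims=True) + eps)
    target_unit = target / (np.linalg.norm(target, axis=1, keepdims=True) + eps)
    cos = np.sum(pred_unit * target_unit, axis=1)
    return float(np.mean(1.0 - cos))

def angular_error_degrees(pred, target, eps=1e-12):
    """Per-point angle in degrees between predicted and ground truth normals."""
    pred_unit = pred / (np.linalg.norm(pred, axis=1, keepdims=True) + eps)
    target_unit = target / (np.linalg.norm(target, axis=1, keepdims=True) + eps)
    cos = np.clip(np.sum(pred_unit * target_unit, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

gt = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
perfect = gt.copy()
flipped = -gt
orthogonal = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

loss_perfect = cosine_normal_loss(perfect, gt)
loss_flipped = cosine_normal_loss(flipped, gt)
angles = angular_error_degrees(orthogonal, gt)
```

With this form, the loss is zero for perfectly aligned normals and reaches its maximum of two when every prediction points in the opposite direction.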
Method  z/z  SO3/SO3  z/SO3  Err. std. 

PointNet++ [29]  0.34  0.55  0.81  0.24 
RSCNN [22]  0.26  0.50  0.83  0.29 
RIConv [46]  1.33  1.30  1.30  0.02 
Ours  0.42  0.42  0.44  0.01 
In this table, our method achieves the best consistency in predicting normals across the three test scenarios. In the SO3/SO3 and z/SO3 cases, our method is the most accurate and outperforms other methods by a wide margin. The predicted normals are depicted in Figure 5. We quantify the errors by calculating the angles between the predicted and ground truth normals. In Figure 5, the blue and red vectors depict normals with less than and greater than of error. It can be seen that our method is the most accurate visually. It is worth noting that RIConv [46] performs poorly in the normals estimation task because it uses rotation-invariant features that discard the reference coordinate frames, and so the normals of RIConv are not globally consistent.
6 Conclusion
In this work, we introduced a novel approach to designing rotation-invariant convolution for 3D point clouds. We show that building robust and repeatable local reference frames is critical to boosting the performance of rotation-invariant object classification. In this task, our newly proposed convolution can match the performance of state-of-the-art translation-invariant convolutions. Our work opens up opportunities to narrow the performance gap between rotation-invariant and translation-invariant convolution in general 3D deep learning, making robust convolutions for 3D point clouds feasible.
Here we detail a few potential ideas for future research. First, while our proposed method achieves good performance, it is not clear whether local reference frames can be set robustly by a neural network. There is a recent work [48] that attempts to solve this problem, but the performance on object classification needs further investigation. Second, generalizing point cloud convolutions and object classification to support non-rigid transformations and deformable objects could further improve overall robustness. Finally, more thorough benchmarking of rotation-invariant convolutions with real-world data [36] is necessary to understand the impact of such data on the learning of rotation-invariant features.
References

[1]
(2016)
Tensorflow: a system for largescale machine learning
. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §5.  [2] (2016) 3D semantic parsing of largescale indoor spaces. In CVPR, Cited by: §1.
 [3] (2016) Gift: a realtime and scalable 3d shape search engine. In CVPR, pp. 5023–5032. Cited by: §5.3, Table 5.
 [4] (2015) ShapeNet: an informationrich 3d model repository. arXiv preprint arXiv:1512.03012. Cited by: §1, §5.2, Table 4.

[5]
(2019)
ClusterNet: deep hierarchical cluster network with rigorously rotationinvariant representation for point cloud analysis
. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pp. 4994–5002. Cited by: §1, §5.1, Table 1.  [6] (1996) A volumetric method for building complex models from range images. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pp. 303–312. Cited by: Figure 7, Appendix B.
 [7] (2017) ScanNet: richlyannotated 3d reconstructions of indoor scenes. In CVPR, pp. 5828–5839. Cited by: §1.

[8]
(2018)
Ppffoldnet: unsupervised learning of rotation invariant 3d local descriptors
. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 602–618. Cited by: §1.  [9] (2018) Learning so (3) equivariant representations with spherical cnns. In ECCV, pp. 52–68. Cited by: §3, §5.1, §5.3, Table 1, Table 5.
 [10] (2019) Equivariant multiview networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1568–1577. Cited by: §2.
 [11] (2016) Deep aggregation of local 3d geometric features for 3d model retrieval.. In BMVC, Vol. 7, pp. 8. Cited by: §5.3, Table 5.
 [12] (2013) Rotational projection statistics for 3d local surface description and object recognition. International journal of computer vision 105 (1), pp. 63–86. Cited by: Appendix B.
 [13] (2019) Deep learning for 3d point clouds: a survey. arXiv preprint arXiv:1912.12033. Cited by: §2.
 [14] (2016) SceneNN: a scene meshes dataset with annotations. In International Conference on 3D Vision. Cited by: §1.
 [15] (2018) Pointwise convolutional neural network. In CVPR. Cited by: §1, §2.
 [16] (2018) Recurrent slice networks for 3d segmentation on point clouds. In CVPR. Cited by: Table 1.
 [17] (2015) Spatial transformer networks. In Advances in Neural Information Processing Systems 28. Cited by: §4.
 [18] (2017) Escape from cells: deep kd-networks for the recognition of 3d point cloud models. In International Conference on Computer Vision, pp. 863–872. Cited by: §2.
 [19] (2015) Deep learning. Nature 521 (7553), pp. 436. Cited by: §1.
 [20] (2018) PointCNN: convolution on X-transformed points. Advances in Neural Information Processing Systems. Cited by: Table 10, Table 8, Table 9, §1, §2, §2, Table 1, Table 4.
 [21] (2016) Fpnn: field probing neural networks for 3d data. In Advances in Neural Information Processing Systems, pp. 307–315. Cited by: §2.
 [22] (2019) Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8895–8904. Cited by: Table 10, Table 9, §4.3, Figure 5, Table 1, Table 4, Table 6.
 [23] (2010) On the repeatability and quality of keypoints for local feature-based 3d object retrieval from cluttered scenes. International Journal of Computer Vision 89 (2–3), pp. 348–361. Cited by: Appendix B.
 [24] (2008) Scale-dependent/invariant local 3d shape descriptors for fully automatic registration of multiple sets of range images. In European conference on computer vision, pp. 440–453. Cited by: Appendix B.
 [25] (2011) On the repeatability of the local reference frame for partial shape matching. In 2011 International Conference on Computer Vision, pp. 2244–2251. Cited by: Appendix B.
 [26] (2019) Effective rotation-invariant point cnn with spherical harmonics kernels. International Conference on 3D Vision (3DV). Cited by: §1, §2, §5.1, Table 1.
 [27] (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In CVPR. Cited by: Table 10, Table 8, Table 9, §1, §2, §2, §4.3, §4, §5.1, Table 1, Table 4.
 [28] (2016) Volumetric and multi-view cnns for object classification on 3d data. In CVPR, pp. 5648–5656. Cited by: §2, Table 1.
 [29] (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5105–5114. Cited by: Table 10, Table 8, Table 9, §1, §2, §2, Figure 1, §3, Figure 4, Figure 5, Table 1, Table 4, Table 6.
 [30] (2019) Spherical fractal convolutional neural networks for point cloud recognition. In Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §2, §5.1, §5.3, Table 1, Table 5.
 [31] (2017) OctNet: learning deep 3d representations at high resolutions. In CVPR. Cited by: §2.
 [32] (2016) Shrec16 track: large-scale 3d shape retrieval from shapenet core55. In Proceedings of the eurographics workshop on 3D object retrieval, Vol. 10. Cited by: §5.3.
 [33] (2015) Multi-view convolutional neural networks for 3d shape recognition. In International Conference on Computer Vision, pp. 945–953. Cited by: §2, Table 1.

 [34] (2009) Multi-Fourier spectra descriptor and augmentation with spectral clustering for 3d shape retrieval. The Visual Computer 25 (8), pp. 785–804. Cited by: §5.3, Table 5.
 [35] (2010) Unique signatures of histograms for local surface description. In ECCV. Cited by: Appendix B, §4.1, §4.1, §4.
 [36] (2019) Revisiting point cloud classification: a new benchmark dataset and classification model on real-world data. In International Conference on Computer Vision (ICCV). Cited by: §1, §6.
 [37] (2008) Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research. Cited by: §3.
 [38] (2017) O-CNN: octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics 36 (4), pp. 72. Cited by: §2.
 [39] (2019) Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics. Cited by: Table 10, Table 9, §2, Table 4.
 [40] (2015) 3d shapenets: a deep representation for volumetric shapes. In CVPR, pp. 1912–1920. Cited by: §1, §2.
 [41] (2015) 3d shapenets: a deep representation for volumetric shapes. In CVPR, pp. 1912–1920. Cited by: Table 8, §5.1, §5.3, Table 3, Table 5.
 [42] (2018) SpiderCNN: deep learning on point sets with parameterized convolutional filters. In ECCV. Cited by: Table 10, Table 9, §1, §2, Figure 4, Table 4.
 [43] (2016) A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics. Cited by: §1.

 [44] (2020) Pointwise rotation-invariant network with adaptive sampling and 3d spherical voxel convolution. In AAAI Conference on Artificial Intelligence. Cited by: §2.
 [45] (2009) Surface feature detection and description with applications to mesh matching. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 373–380. Cited by: Appendix B.
 [46] (2019) Rotation invariant convolutions for 3d point clouds deep learning. In International Conference on 3D Vision (3DV), pp. 204–213. Cited by: Table 10, Table 8, Table 9, §1, §2, Figure 1, §3, §3, §3, Figure 4, Figure 5, §5.1, §5.1, §5.2, §5.4, Table 1, Table 4, Table 6.
 [47] (2019) ShellNet: efficient point cloud convolutional neural networks using concentric shells statistics. In International Conference on Computer Vision (ICCV), pp. 1607–1616. Cited by: §1, §2, §2.
 [48] (2020) LRF-Net: learning local reference frames for 3d local shape description and matching. arXiv preprint arXiv:2001.07832. Cited by: §6.
Appendix A Learning-based LRFs
A.1 Baseline 1: Predicting LRFs
As mentioned in the main text, it could be tempting to learn the LRFs in order to design a rotation-aware convolution. For completeness, we discuss this baseline again here. We use a two-layer MLP to predict the LRFs and then use them to transform the input point coordinates into local coordinates before proceeding with the convolution as described in the main paper. We found that predicting LRFs works well in the z/z and SO3/SO3 modes, achieving accuracies of 89.3% and 89.2%, respectively. However, using data-driven LRFs makes the convolution only rotation-aware, not exactly rotation-invariant. Such a convolution fails to generalize to unseen rotations in the z/SO3 scenario, where the accuracy drops to 36.2%.
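The gap between rotation-aware and rotation-invariant can be illustrated with a small numerical sketch. It is not the network used in this baseline: the helper names (`to_local`, `rot_x`, `rot_z`) and the toy data are assumptions made for illustration, and the LRF is a hand-picked orthonormal matrix rather than an MLP prediction. The sketch shows only that local coordinates are preserved when the LRF rotates together with the input, and change when it does not, which is the failure a learned LRF can exhibit on unseen rotations.

```python
import math

def matmul(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(a, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def transpose(a):
    """Transpose a 3x3 matrix."""
    return [[a[j][i] for j in range(3)] for i in range(3)]

def rot_z(t):
    """Rotation by t radians about the z axis."""
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_x(t):
    """Rotation by t radians about the x axis."""
    c, s = math.cos(t), math.sin(t)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def to_local(point, center, lrf):
    """Coordinates of `point` in the frame centred at `center` whose
    axes are the columns of the orthonormal matrix `lrf`."""
    offset = [p - c for p, c in zip(point, center)]
    return matvec(transpose(lrf), offset)

# Hypothetical toy data: one point, one reference point, one LRF.
point, center = [0.3, -1.2, 0.8], [0.1, 0.4, -0.5]
lrf = rot_z(0.3)
rotation = matmul(rot_x(0.7), rot_z(1.1))  # an arbitrary rotation of the input

reference = to_local(point, center, lrf)
rotated_point = matvec(rotation, point)
rotated_center = matvec(rotation, center)

# Case 1: the LRF rotates with the input, so local coordinates are unchanged.
equivariant = to_local(rotated_point, rotated_center, matmul(rotation, lrf))
# Case 2: the LRF does not follow the rotation, so local coordinates change.
stale = to_local(rotated_point, rotated_center, lrf)

drift_equivariant = max(abs(a - b) for a, b in zip(reference, equivariant))
drift_stale = max(abs(a - b) for a, b in zip(reference, stale))
print(drift_equivariant, drift_stale)
```

In the first case the drift is at the level of floating-point round-off, while in the second it is of the same order as the point offsets themselves.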
A.2 Baseline 2: Pooling with Sign-Ambiguous LRFs
Taking the insight from Baseline 1, we use learning only to resolve the ambiguity in constructing the LRFs, while the LRF axes themselves are determined from the covariance matrices and their eigenvectors. Here the signs of the LRF axes are left undetermined: instead of resolving this ambiguity as described in the main paper, we establish all eight candidate LRFs and perform feature learning with all of them. The final output features are pooled from the features of the individual candidates. We call the convolution in this baseline the Pooling Convolution (PoolConv).
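A minimal sketch of the candidate enumeration and pooling step follows. The function names, the hand-written `toy_feature` that stands in for the learned per-candidate features, and the choice of max pooling are all assumptions made for illustration; the text above only states that the features are pooled. The sketch shows why pooling over all eight sign combinations removes the dependence on the arbitrary eigenvector signs.

```python
from itertools import product

def sign_candidates(axes):
    """Return the eight sign-flipped versions of an LRF.
    `axes` is a list of three axis vectors (x, y, z)."""
    return [[[s * c for c in axis] for s, axis in zip(signs, axes)]
            for signs in product((1.0, -1.0), repeat=3)]

def local_coords(point, center, axes):
    """Project the offset from `center` to `point` onto each axis."""
    offset = [p - c for p, c in zip(point, center)]
    return [sum(a * b for a, b in zip(axis, offset)) for axis in axes]

def toy_feature(local):
    """Hypothetical stand-in for learned features; deliberately
    sensitive to the sign of each local coordinate."""
    x, y, z = local
    return [x + 2.0 * y, y * z + x, z - x * x]

def pooled_feature(point, center, axes):
    """Compute features for every sign candidate and max-pool them
    channel by channel (max pooling is an assumption)."""
    feats = [toy_feature(local_coords(point, center, cand))
             for cand in sign_candidates(axes)]
    return [max(channel) for channel in zip(*feats)]

# Two eigenvector sets that differ only by arbitrary sign flips.
axes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
flipped = [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]]
point, center = [0.5, -1.0, 2.0], [0.0, 0.0, 0.0]

# A single candidate depends on the sign choice ...
single_a = toy_feature(local_coords(point, center, axes))
single_b = toy_feature(local_coords(point, center, flipped))
# ... whereas the pooled feature does not.
pooled_a = pooled_feature(point, center, axes)
pooled_b = pooled_feature(point, center, flipped)
print(single_a, single_b, pooled_a, pooled_b)
```

Because the feature extractor runs once per candidate, the cost of this design grows roughly eightfold.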
More illustrations can be found in Figure 6. In general, PoolConv can reach the same accuracy (89.1%) as our method, but at a much higher computational cost. To analyze network efficiency, we measure network complexity by the number of trainable parameters, the number of floating point operations (FLOPs), and the running time. With a batch size of 16 and point clouds of 1024 points from the ModelNet40 dataset, we report the statistics in Table 7. Given the minor performance difference but substantially more FLOPs and longer running time, PoolConv is not as efficient as our proposed method.
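The overhead can be quantified directly from the Table 7 entries. The short sketch below only divides the reported PoolConv figures by ours; the dictionary keys are names chosen here for illustration, and the values are copied from the table.

```python
# Values copied from Table 7 (parameters in M, FLOPs in B, time in seconds).
table7 = {
    "PoolConv": {"params_m": 0.40, "flops_train_b": 116.3, "flops_infer_b": 12.8,
                 "time_train_s": 0.66, "time_infer_s": 0.38},
    "Ours": {"params_m": 0.39, "flops_train_b": 11.0, "flops_infer_b": 1.3,
             "time_train_s": 0.21, "time_infer_s": 0.16},
}

# Ratio of PoolConv to our method for every reported quantity.
ratios = {key: table7["PoolConv"][key] / table7["Ours"][key]
          for key in table7["Ours"]}

for key, value in ratios.items():
    print(f"{key}: {value:.2f}x")
```

The parameter counts are nearly identical, while PoolConv needs about ten times the FLOPs and roughly 2.4 to 3.1 times the running time.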
Appendix B Repeatability
We further clarify the repeatability of the LRFs, as it serves as the backbone of our feature learning. We follow Guo et al. [12] to conduct this experiment (see their Section 3.3). Note that there are also methods that compute LRFs for meshes, such as MeshHOG [45] and RoPS [12]. In this study we assume that no normal vectors or triangle faces are available, so we omit such methods from our comparison. We use six models from the Stanford 3D Scanning Repository [6] (Figure 7). The scenes are created by resampling the models down to 1/2 of their original mesh resolution and adding Gaussian noise (0.1 mesh resolution).
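Table 7: Network complexity of PoolConv and our method (batch size 16, 1024 points per point cloud, ModelNet40).
Method  Params  FLOPs (Train / Infer)  Time (Train / Infer)
PoolConv  0.40M  116.3B / 12.8B  0.66s / 0.38s
Ours  0.39M  11.0B / 1.3B  0.21s / 0.16s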
From each model, 1000 points are randomly selected, and their correspondences in the scene are obtained by searching for the closest point in Euclidean space. Let us denote a pair of corresponding points as $(p_s, p_m)$, from the scene and the model respectively. The LRFs for these two points are computed as $L_s$ and $L_m$. To measure the similarity between $L_s$ and $L_m$, we use the error evaluation metric provided by Mian et al. [23]:
$$\epsilon = \arccos\left(\frac{\mathrm{trace}\left(L_s L_m^{-1}\right) - 1}{2}\right)\frac{180}{\pi} \qquad (10)$$
Ideally, $\epsilon$ is zero when there is no error. We compare with four existing methods: EM [24], Mian [23], SHOT [35], and P [25]. The results are shown in Figure 8, where the horizontal axis indicates the angular error range and the vertical axis represents the percentage of points. The more points that fall into the low-error ranges on the left, the better the method. As can be seen, our proposed LRFs yield many more low-range angular errors than the other methods and significantly fewer high-range errors. This means that our LRFs vary more slowly, and thus allow more consistent predictions.
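The evaluation can be sketched in a few lines. The sketch assumes that each LRF is an orthonormal 3x3 matrix, so that its inverse equals its transpose, and uses the standard trace-based rotation angle as the error metric of Mian et al. [23]. The function names, the 10-degree bin width, and the synthetic test frames are assumptions made for illustration; they are not the settings used for Figure 8.

```python
import math

def lrf_angular_error_deg(lrf_s, lrf_m):
    """Angular error in degrees between a scene LRF and a model LRF:
    arccos((trace(L_s L_m^{-1}) - 1) / 2) * 180 / pi."""
    # For orthonormal frames the inverse is the transpose, and
    # trace(A B^T) equals the sum of element-wise products.
    trace = sum(lrf_s[i][j] * lrf_m[i][j] for i in range(3) for j in range(3))
    cosine = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # guard against round-off
    return math.degrees(math.acos(cosine))

def error_histogram(errors, bin_width=10.0, max_error=180.0):
    """Percentage of points whose error falls into each angular range."""
    if not errors:
        raise ValueError("errors must not be empty")
    n_bins = int(math.ceil(max_error / bin_width))
    counts = [0] * n_bins
    for e in errors:
        counts[min(int(e // bin_width), n_bins - 1)] += 1
    return [100.0 * c / len(errors) for c in counts]

def rot_z_deg(angle_deg):
    """Synthetic LRF: rotation by angle_deg degrees about the z axis."""
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Synthetic check: frames that deviate from the identity by known angles.
identity = rot_z_deg(0.0)
errors = [lrf_angular_error_deg(identity, rot_z_deg(a))
          for a in (0.0, 4.0, 12.0, 95.0)]
histogram = error_histogram(errors)
print(errors)
print(histogram[:3])
```

A method with good repeatability concentrates most of its percentage mass in the first few bins of `histogram`.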
Appendix C Per-Class Accuracies
To further demonstrate the advantages of our proposed convolution operator, we show the per-class accuracies for both the classification and part segmentation tasks in this section.
C.1 Per-Class Accuracies for Object Classification on ModelNet40
The per-class accuracies for object classification on ModelNet40 under the z/SO3 scenario are shown in Table 8. Our method outperforms previous methods significantly (ranking 1st in 32 out of 40 classes).
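Table 8: Per-class accuracies (%) for object classification on ModelNet40 under the z/SO3 scenario.
Network  aero  bathtub  bed  bench  bookshelf  bottle  bowl  car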
PointNet [27]  12.0  2.0  8.0  10.0  15.0  14.0  5.0  12.0 
PointNet++ [29]  53.0  2.0  18.0  10.0  29.0  22.0  20.0  13.0 
PointCNN [20]  60.0  10.0  20.0  10.0  20.0  37.0  25.0  34.0 
RIConv [46]  100.0  82.0  94.0  80.0  93.0  94.0  100.0  98.0 
Ours  100.0  90.0  98.0  80.0  95.0  97.0  100.0  98.0 
Network  chair  cone  cup  curtain  desk  door  dresser  flower pot
PointNet [27]  9.0  15.0  0.0  0.0  16.3  5.0  8.1  0.0
PointNet++ [29]  32.0  20.0  15.0  45.0  2.3  30.0  9.3  15.0 
PointCNN [20]  46.0  25.0  15.0  40.0  34.9  30.0  32.6  25.0 
RIConv [46]  96.0  90.0  60.0  95.0  79.1  85.0  73.3  30.0 
Ours  98.0  90.0  55.0  95.0  81.4  80.0  68.6  10.0 
Network  glass box  guitar  keyboard  lamp  laptop  mantel  monitor  night stand
PointNet [27]  4.0  36.0  5.0  15.0  15.0  4.0  11.0  3.5 
PointNet++ [29]  11.0  47.0  50.0  10.0  15.0  10.0  36.0  1.2 
PointCNN [20]  35.0  46.0  50.0  20.0  20.0  38.0  35.0  40.7 
RIConv [46]  96.0  99.0  95.0  80.0  95.0  91.9  97.0  77.9 
Ours  97.0  100.0  95.0  85.0  100.0  93.0  98.0  73.3 
Network  person  piano  plant  radio  range hood  sink  sofa  stairs
PointNet [27]  5.0  36.7  55.0  5.0  4.0  20.0  11.0  25.0 
PointNet++ [29]  20.0  5.0  71.0  20.0  9.0  5.0  21.0  10.0 
PointCNN [20]  15.0  34.0  26.0  10.0  28.0  20.0  32.0  30.0 
RIConv [46]  85.0  90.8  83.0  55.0  87.0  75.0  92.0  85.0 
Ours  90.0  91.0  93.0  65.0  86.0  70.0  93.0  80.0 
Network  stool  table  tent  toilet  tv stand  vase  wardrobe  xbox
PointNet [27]  5.0  3.0  5.0  20.0  4.0  26.3  0.0  10.0 
PointNet++ [29]  10.0  9.0  15.0  13.0  2.0  85.0  15.0  20.0 
PointCNN [20]  20.0  36.0  15.0  33.0  29.0  70.0  40.0  15.0 
RIConv [46]  60.0  80.0  70.0  95.0  78.0  76.8  70.0  65.0 
Ours  75.0  84.0  95.0  99.0  81.0  77.0  70.0  75.0 
C.2 Per-Class Accuracies for Part Segmentation on ShapeNet
Here, we also show the per-class accuracies for part segmentation under the SO3/SO3 and z/SO3 scenarios in Table 9 and Table 10, respectively.
Table 9: Per-class accuracies (%) for part segmentation on ShapeNet under the SO3/SO3 scenario.
Network  aero  bag  cap  car  chair  earph.  guitar  knife
PointNet [27]  81.6  68.7  74.0  70.3  87.6  68.5  88.9  80.0 
PointNet++ [29]  79.5  71.6  87.7  70.7  88.8  64.9  88.8  78.1 
PointCNN [20]  78.0  80.1  78.2  68.2  81.2  70.2  82.0  70.6 
DGCNN [39]  77.7  71.8  77.7  55.2  87.3  68.7  88.7  85.5 
SpiderCNN [42]  74.3  72.4  72.6  58.4  82.0  68.5  87.8  81.3 
RSCNN [22]  71.8  76.4  78.9  68.1  80.2  62.5  82.6  76.6 
RIConv [46]  80.6  80.2  70.7  68.8  86.8  70.4  87.2  84.3 
Ours  81.2  82.6  81.6  70.2  88.6  70.6  86.2  86.6 
Network  lamp  laptop  motor  mug  pistol  rocket  skate  table 
PointNet [27]  74.9  83.6  56.5  77.6  75.2  53.9  69.4  79.9 
PointNet++ [29]  79.2  94.9  54.3  92.0  76.4  50.3  68.4  81.0 
PointCNN [20]  68.9  80.8  48.6  77.3  63.2  50.6  63.2  82.0 
DGCNN [39]  81.8  81.3  36.2  86.0  77.3  51.6  65.3  80.2 
SpiderCNN [42]  71.3  94.5  45.7  88.1  83.4  50.5  60.8  78.3 
RSCNN [22]  73.2  90.2  54.8  89.8  72.8  43.6  65.3  72.6 
RIConv [46]  78.0  80.1  57.3  91.2  71.3  52.1  66.6  78.5 
Ours  81.6  79.6  58.9  90.8  76.8  53.2  67.2  81.6 
Table 10: Per-class accuracies (%) for part segmentation on ShapeNet under the z/SO3 scenario.
Network  aero  bag  cap  car  chair  earph.  guitar  knife
PointNet [27]  40.4  48.1  46.3  24.5  45.1  39.4  29.2  42.6 
PointNet++ [29]  51.3  66.0  50.8  25.2  66.7  27.7  29.7  65.6 
PointCNN [20]  21.8  52.0  52.1  23.6  29.4  18.2  40.7  36.9 
DGCNN [39]  37.0  50.2  38.5  24.1  43.9  32.3  23.7  48.6 
SpiderCNN [42]  48.8  47.9  41.0  25.1  59.8  23.0  28.5  49.5 
RSCNN [22]  26.9  49.7  44.7  25.3  36.5  30.0  33.3  39.4 
RIConv [46]  80.6  80.0  70.8  68.8  86.8  70.3  87.3  84.7 
Ours  80.9  82.6  81.0  70.2  88.4  70.6  87.1  87.2 
Network  lamp  laptop  motor  mug  pistol  rocket  skate  table 
PointNet [27]  52.7  36.7  21.2  55.0  29.7  26.6  32.1  35.8 
PointNet++ [29]  59.7  70.1  17.2  67.3  49.9  23.4  43.8  57.6 
PointCNN [20]  51.1  33.1  18.9  48.0  23.0  27.7  38.6  39.9 
DGCNN [39]  54.8  28.7  17.8  74.4  25.2  24.1  43.1  32.3 
SpiderCNN [42]  45.0  83.6  20.9  55.1  41.7  36.5  39.2  41.2 
RSCNN [22]  54.9  36.1  20.6  53.3  29.0  29.4  32.3  42.6 
RIConv [46]  77.8  80.6  57.4  91.2  71.5  52.3  66.5  78.4 
Ours  81.8  78.9  58.7  91.0  77.9  52.3  66.8  80.3 