1 Introduction
The development of neural networks for point cloud analysis has drawn a lot of interest in recent years [24, 26, 34, 30, 16, 31, 44, 21, 43], and such networks have been applied to various 3D applications, e.g., shape classification, object detection, and semantic scene segmentation. However, the features learned by these networks are not rotation invariant, meaning that they treat a point cloud and an arbitrary rotation of it as two different shapes. Thus, they may extract different features for the same shape that is merely embedded in different poses in 3D.
To alleviate this fundamental problem, a common approach is to apply rotation augmentation to the training data. However, aggressive rotation augmentation, like arbitrary 3D rotations, often harms the recognition performance, since most existing networks lack the capacity to learn effective features from such unstable inputs. Thus, often, only azimuthal rotations (around the gravity axis) are considered. Such limited augmentation, however, does not generalize well, which can lead to a significant performance drop. See Figure 1(a): e.g., the classification performance of the recent ShellNet [43] drops from 93.1% down to 19.9% on rotations about arbitrary axes.
Recently, some works attempted to design neural networks with rotation invariance [23, 27, 9, 42, 5]. One approach employs spherical-related convolutions, and the other employs local rotation-invariant features, e.g., distances and angles, to replace Cartesian coordinates as the network inputs. However, as we shall show, both approaches have limited success. Typically, the former has limited capability to embed features and is still sensitive to rotations, while the latter encodes mainly local information, which may not be unique, and its performance is often much lower than that of using global Cartesian coordinates; see the part segmentation comparison in Figure 1(b) and the various quantitative comparisons in Section 3.3, which demonstrate the superiority of our method over the state-of-the-art rotation-invariant methods.
In this work, we revisit the problem of rotation invariance in deep 3D point cloud analysis, and enumerate the considerations for achieving rotation invariance in the aspects of network inputs and network processing. Accordingly, we design an effective low-level representation to replace 3D Cartesian coordinates as the network inputs. Our representation is purely rotation-invariant and encodes both local and global information, while also being robust to noise and outliers. Also, we present a deep hierarchical network to embed these low-level representations into high-level features and to extract local relations between points and their neighbors, together with the global shape information. Further, to alleviate the global information loss caused by the rotation-invariant representations, we enrich the network features with more global information by introducing a novel region relation convolution to extract both local and non-local information across the hierarchy. Lastly, we evaluate the effectiveness of our method on various point cloud analysis tasks, including shape classification, part segmentation, and shape retrieval. Experimental results confirm that our method achieves not only consistent results on inputs at any orientation, but also the best performance on all tasks compared with the state-of-the-art methods.

1.1 Related Works
Deep learning on 3D point sets. The design of robust and effective neural networks to embed point features has been an emerging topic in recent years. The pioneering networks PointNet [24] and PointNet++ [26] show the potential of deep networks to directly process 3D point sets. To better capture local neighborhoods, several works [34, 30, 21] suggested extracting point features by considering local graphs. To address the irregular and orderless properties of point sets, some works defined convolutions on non-Euclidean domains, e.g., the self-organizing map [18], X-transformation [20], permutohedral lattice [31], and parameterized embedding [38]. Others designed new convolution operators on points, e.g., Monte Carlo convolution [16], PointConv [36], ShellConv [43], and KPConv [32]. Besides supervised methods, some unsupervised and self-supervised networks [39, 45, 6, 13, 28, 14] were designed recently to avoid tedious manual labeling. Beyond object recognition, some networks were designed for point set registration [2, 33, 22], upsampling [41, 40, 19], and denoising [15, 46]. Although these networks are translation and permutation invariant, they are not rotation invariant: they embed different features, and likely produce different outputs, for the same input given in different orientations.

Rotation-invariant networks for 3D shapes. Since vanilla CNNs only have translation invariance, some works attempted to learn rotationally-equivariant features by designing spherical CNNs [11, 8] and 3D steerable CNNs [35]. These features rotate correspondingly with the input. While these methods generalize well to unseen orientations, their convolutions are defined in a non-spatial domain, thus leading to poorer learning capability than spatial convolutions on regular grids. Also, they can only handle meshes or regular voxel grids.
Recently, some works explored rotationinvariant networks for point clouds. Poulenard et al. [23] represented points using volume functions, then used spherical harmonics kernels for convolution. The feature embedding capability of such convolution is, however, limited. Rao et al. [27] adaptively projected points on a discretized sphere and designed a hierarchical feature learning architecture to capture patterns on the sphere. However, the discretized sphere still carries a global orientation and cannot guarantee perfect symmetry, so the learned features are not purely rotation invariant. Hence, a notable performance drop still exists for inputs at arbitrary orientations.
On the other hand, some other methods suggested using low-level rotation-invariant geometric features to replace 3D Cartesian coordinates as the network inputs. Deng et al. [9] suggested relative angles between point-wise normal vectors and pairwise distances. Chen et al. [5] suggested relative angles between two-point vectors, vector norms, etc. Zhang et al. [42] constructed a point's neighborhood with local triangles, each formed by a reference point, a neighbor point, and the local neighborhood centroid; they then take the triangle side lengths and angles as the rotation-invariant features. Though these representations are rotation invariant, they encode mainly local information, which may not be unique and sufficient; see Section 2.1 for a detailed analysis. In this work, we present a new rotation-invariant representation, capturing both local neighborhood and global shape structures, while being robust to noise and outliers. Also, we formulate a deep network and introduce a novel region relation convolution to hierarchically process the point regions.

2 Method
2.1 General Model for Point Feature Extraction
To start, we review a general model for point feature extraction. Denote P = {p_1, ..., p_N} as a point cloud of N points, where p_i ∈ R^3 is the 3D Cartesian coordinate of the i-th point in P. To extract features for a point, say p_i, a general model would include both local and global information, so it can be written as

f_i = A_{j=1,...,K} h_Θ( α(p_i), β(p_i, p_{i,j}) ),   (1)

where α(p_i) denotes the global shape information at p_i; β(p_i, p_{i,j}) denotes the local shape information at p_i with its j-th neighbor point p_{i,j} (j = 1, ..., K); h_Θ is a nonlinear function with learnable parameters Θ; and A is a symmetric aggregation operation, e.g., max or summation, over the K neighbor points of p_i.

For general point processing networks without considering rotation invariance, α(p_i) is simply represented by p_i, since 3D coordinates are global information. For β, different networks have different choices, e.g., PointNet++ [26] uses p_{i,j} as β, while DGCNN [34] uses the relative position, i.e., p_{i,j} − p_i, as β. Clearly, both α and β are based on 3D coordinates, so they are not rotation invariant.
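As a concrete illustration of Eq. (1), the sketch below implements a one-layer nonlinear function with max aggregation over neighbors (the layer size and weights are illustrative, not the paper's actual architecture). It uses the raw coordinate as the global term and the DGCNN-style relative position as the local term, which makes the feature permutation invariant over neighbors but, as the text argues, not rotation invariant:

```python
import numpy as np

rng = np.random.default_rng(0)

def general_point_feature(p_i, neighbors, W):
    """Sketch of Eq. (1): aggregate a nonlinear function of global and
    local information over the K neighbors of p_i, with max as the
    symmetric aggregation A and a single ReLU layer as h."""
    feats = []
    for p_ij in neighbors:
        g = p_i                               # global info: the 3D coordinate itself
        l = p_ij - p_i                        # local info: relative position (DGCNN-style)
        x = np.concatenate([g, l])
        feats.append(np.maximum(W @ x, 0.0))  # h: one ReLU layer (illustrative)
    return np.max(feats, axis=0)              # A: channel-wise max over neighbors

# toy usage with random weights
pts = rng.normal(size=(8, 3))
W = rng.normal(size=(16, 6))
f = general_point_feature(pts[0], pts[1:], W)
```

Because max is symmetric, reordering the neighbors leaves the feature unchanged, whereas rotating the whole cloud generally changes it.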
To achieve rotation invariance, current attempts [9, 42, 5] proposed different point-wise purely rotation-invariant representations as the network inputs. However, they focus mainly on encoding the local relations between nearby point pairs using, e.g., distances and relative angles, and ignore the global shape information. Also, most existing methods suffer from ambiguity in distinguishing between local shapes, meaning that they may produce the same representation for points of different local configurations. More seriously, we could have information loss, where the embedded features are insufficient to describe the underlying shapes.
2.2 Considerations for Rotation Invariance
For a deep point cloud processing network to be rotation invariant, both the network inputs and the network operations should be rotation invariant. Hence, before presenting the design of our network inputs (Section 2.3) and network architecture (Section 2.4), we first discuss the relevant design considerations that we have taken:
Considerations for designing the network inputs.

Denoting E as the function to extract rotation-invariant representations (network inputs) from point cloud P, a purely rotation-invariant E should satisfy

E(R P) = E(P),  ∀ R ∈ SO(3),   (2)

where SO(3) is the space of all 3D rotations in R^3 and R is an arbitrary rotation. Most geometric quantities are rotation-variant, e.g., Cartesian coordinates and vectors in 3D space. Hence, we build our point-wise representation by carefully choosing rotation-invariant information inside a not-so-small local neighborhood around the point.
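The kinds of quantities that satisfy Eq. (2), such as point-pair distances and angles at the origin, can be checked numerically. A minimal sketch, assuming the standard unit-quaternion method for drawing uniform random rotations:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_rotation(rng):
    """Uniform random rotation in SO(3) via a normalized quaternion."""
    q = rng.normal(size=4)
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def invariants(p, q):
    """Distance and cosine of the angle at the origin between two points:
    both quantities satisfy Eq. (2)."""
    d = np.linalg.norm(p - q)
    cos_a = p @ q / (np.linalg.norm(p) * np.linalg.norm(q))
    return d, cos_a

p, q2 = rng.normal(size=3), rng.normal(size=3)
R = random_rotation(rng)
d1, a1 = invariants(p, q2)
```

Applying any such R to both points leaves the distance and angle unchanged, which is exactly the property Eq. (2) demands of E.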

Second, simply using distances and relative angles between nearby point pairs may cause a large amount of information loss and easily introduce ambiguity in the representations, as explained earlier in Section 2.1. Hence, we avoid these issues by combining both rotationinvariant global information and rotationinvariant local point representations.

Last, noise is often unavoidable when scanning 3D point clouds. Hence, the extraction function should be noise-tolerant, meaning that the rotation-invariant representations it extracts should not be too sensitive to noise in the input point cloud.
Considerations for designing the network architecture.

A rotation-invariant network should not take point coordinates but only relative geometric information, such as distances and angles, as its inputs. However, without absolute information defined in a global coordinate frame, the network would lack global information. Hence, we should extract more global features, even from the relative geometric inputs, by considering more global relations among points. Existing rotation-invariant methods did not explore such global point relations as our work does.

Besides, the network should not assume a specific point order (which may not be rotation invariant) when processing/aggregating point and regional features.
2.3 Our Rotation-Invariant Representations
Before extracting our rotation-invariant representations, we first normalize the input point cloud to fit it in the origin-centered unit sphere. Then, for each point p in the cloud (e.g., the red point in Figure 2(a)), we follow PointNet++ [26] to use a query ball of radius r to locate K neighbor points (including p itself) as its local neighborhood (blue points in Figure 2(b)).
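A minimal sketch of this query-ball grouping step, assuming PointNet++-style padding (repeating the first found neighbor) to keep a fixed K × 3 group size:

```python
import numpy as np

def ball_query(points, center, radius, K):
    """Collect up to K neighbors of `center` within `radius`.
    If fewer than K points fall inside the ball, the first neighbor is
    repeated, following the common PointNet++ grouping convention."""
    d = np.linalg.norm(points - center, axis=1)
    idx = np.where(d <= radius)[0]
    if len(idx) == 0:
        idx = np.array([np.argmin(d)])  # fall back to the nearest point
    if len(idx) >= K:
        idx = idx[:K]
    else:
        idx = np.concatenate([idx, np.repeat(idx[0], K - len(idx))])
    return points[idx]
```

Since the center itself belongs to the cloud, the group is never empty, and the output always has shape K × 3.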
Now, we are ready to extract the global and local representations of p. The global representation should capture p's location relative to the whole object and to its local neighborhood, serving as global but rotation-invariant coordinates of p in the object. On the other hand, the local representation should capture the local shape around p, so we model a local representation for each neighbor point of p, serving as local rotation-invariant coordinates of that neighbor in p's local neighborhood.
Global representation. Our global representation includes the following five pieces of rotation-invariant information about p (see Figure 2(b) for an illustration of these five components).

(i) ||Op||, the distance from the origin O (the unit-sphere center) to p: simple but global and rotation-invariant information about p.

(ii) ||pm||, the distance from p to its local neighborhood center (denoted as m). Here, a common choice of m is the centroid (arithmetic mean) of p's neighborhood, but such m is sensitive to outliers and noise, so it may not be stable. We propose to use the geometric median instead, i.e., the point with the minimal distance sum to all neighbor points. Such a choice is more stable, but computationally expensive [7]. So, we resort to a fast but approximate procedure based on the idea of divide and conquer: we first randomly and independently pick several subsets of points in the neighborhood, find the centroid of each subset, and cluster the centroids. Then, we take the mean of the centroids in the largest cluster as m. Please refer to Section 3.1 for the associated hyperparameters, Section 3.6 for a noise tolerance experiment, and our supplementary material for an evaluation of the approximated m.
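A sketch of this divide-and-conquer approximation. The subset count, subset size, and simple density-based clustering rule below are placeholder choices (the actual hyperparameter settings are deferred to Section 3.1):

```python
import numpy as np

def approx_geometric_median(nbrs, T=8, S=4, cluster_radius=0.1, rng=None):
    """Approximate the geometric median m of a neighborhood:
    draw T random subsets of S points, take each subset's centroid, and
    average the centroids in the largest cluster.  T, S, and the
    clustering rule here are illustrative placeholders."""
    if rng is None:
        rng = np.random.default_rng(0)
    cents = np.stack([
        nbrs[rng.choice(len(nbrs), size=S, replace=False)].mean(axis=0)
        for _ in range(T)
    ])
    # simple clustering: for each centroid, count centroids within
    # cluster_radius; average the densest centroid's neighborhood
    d = np.linalg.norm(cents[:, None] - cents[None, :], axis=-1)
    counts = (d <= cluster_radius).sum(axis=1)
    member = d[np.argmax(counts)] <= cluster_radius
    return cents[member].mean(axis=0)
```

On a neighborhood with one far outlier, the largest cluster of subset centroids stays near the inliers, so the estimate is far less biased than the plain arithmetic mean.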
(iii)–(v) We locate a (the purple point in Figure 2(b)), the intersection between the query ball and the line extended from the origin O through p, and form triangle Δpma. Then, we consider ||ma||, the distance from m to a, and the cosines of the two angles subtended at m and a (denoted as α and β) as the last three components; see Figure 2(b).

In our implementation, the query ball radius r increases with the network layer (see Section 2.4), so the underlying structure described by triangle Δpma will enlarge gradually. To sum up, our global rotation-invariant representation is

R_g(p) = [ ||Op||, ||pm||, ||ma||, cos α, cos β ].   (3)

Also, note that all distances range over [0, 2] (since the input point cloud has been normalized to the unit sphere), whereas the angles α and β range over [0, π]. To avoid numerical instability, which hinders the network learning, we use the cosines of these two angles in R_g(p).
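A sketch of Eq. (3), under our reconstruction of the symbols (origin O, neighborhood center m, and intersection point a on the query ball along the ray from O through p); the exact triangle labeling is an assumption based on Figure 2(b):

```python
import numpy as np

def global_rep(p, m, r):
    """Sketch of Eq. (3): the five rotation-invariant components
    [ |Op|, |pm|, |ma|, cos(alpha), cos(beta) ] built from point p,
    neighborhood center m, and intersection point a = p pushed outward
    by r along the ray O -> p (symbol assignment is our assumption)."""
    O = np.zeros(3)
    a = p * (1.0 + r / np.linalg.norm(p))
    d_op = np.linalg.norm(p - O)
    d_pm = np.linalg.norm(p - m)
    d_ma = np.linalg.norm(m - a)

    def cos_at(v, u, w):
        # cosine of the angle at vertex v in triangle (u, v, w)
        e1, e2 = u - v, w - v
        return e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2))

    return np.array([d_op, d_pm, d_ma, cos_at(m, p, a), cos_at(a, p, m)])
```

Every component is a distance or an angle cosine, so the whole vector is unchanged by any rotation of the cloud.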
Local representation. The local representation should help uniquely locate each neighbor q relative to p in p's local neighborhood. First, we construct a tetrahedron by joining q to triangle Δpma, and consider the three distances ||qp||, ||qm||, and ||qa|| from q to points p, m, and a, respectively, and the three angles θ_1, θ_2, and θ_3 subtended at q on the three tetrahedron faces; see Figure 2(c). Using this information alone may be ambiguous, since a mirror point of q on the opposite side of triangle Δpma can have the same set of distances and angles; see Figure 2(c). So, we further consider γ, the angle for rotating the plane of triangle Δqpa to the plane of triangle Δmpa about line pa.

Again, to avoid numerical instability, we take the cosines of θ_1, θ_2, and θ_3. As for γ, since it ranges over [0, 2π), we use the nonlinear function cos(γ/2), which is monotonic for γ ∈ [0, 2π) and also ranges over (−1, 1]. To sum up, our ambiguity-free local rotation-invariant representation for point q relative to p is

R_l(p, q) = [ ||qp||, ||qm||, ||qa||, cos θ_1, cos θ_2, cos θ_3, cos(γ/2) ].   (4)

Please refer to the supplementary file for the proof of the ambiguity-free property.
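A sketch of Eq. (4). The assignment of the two triangles and the rotation line for γ follows our reading of Figure 2(c) and should be treated as an assumption; the signed dihedral angle is computed about the oriented line pa so that a mirror point of q gets a different value:

```python
import numpy as np

def local_rep(q, p, m, r):
    """Sketch of Eq. (4): [ |qp|, |qm|, |qa|, cos t1, cos t2, cos t3,
    cos(g/2) ] for a neighbor q, where g is the angle rotating the plane
    of triangle (q, p, a) onto the plane of triangle (m, p, a) about
    line pa.  The triangle/line assignment is our assumption."""
    a = p * (1.0 + r / np.linalg.norm(p))
    d = lambda u, v: np.linalg.norm(u - v)

    def cos_at_q(u, w):
        # cosine of the angle at q between directions q->u and q->w
        e1, e2 = u - q, w - q
        return e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2))

    axis = (a - p) / np.linalg.norm(a - p)
    n1 = np.cross(axis, q - p)   # perpendicular of q about line pa
    n2 = np.cross(axis, m - p)   # perpendicular of m about line pa
    # signed dihedral angle in [0, 2*pi) about the oriented line pa
    g = np.arctan2(np.cross(n1, n2) @ axis, n1 @ n2) % (2 * np.pi)
    return np.array([d(q, p), d(q, m), d(q, a),
                     cos_at_q(p, m), cos_at_q(p, a), cos_at_q(m, a),
                     np.cos(g / 2.0)])
```

Reflecting q across the plane of triangle Δpma preserves the first six entries but flips the sign of cos(γ/2), which is exactly how the representation resolves the mirror ambiguity.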
Overall, for each point p, we employ Eq. (3) to obtain its global representation and Eq. (4) to obtain a local representation for each of its K neighbor points. Then, we pack K copies of the global representation with the K local representations to form a K × 12 matrix (see Figure 3) to store the global and local rotation-invariant representations for p.
Global relation between points. To supplement the above representations with more global and rotation-invariant information, we further construct an N × N matrix to encode the global relations between all point pairs in a point cloud (say, of N points): for each point pair (p_i, p_j), we encode the distance between p_i and p_j, and the angle between the two vectors from the origin to p_i and p_j, respectively. This information is later fed into the region relation convolution in the network to regress point-wise relation weights; see Section 2.4.
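A sketch of this pairwise global relation computation; the packing of the distance and angle entries into a single N × N × 2 array is our assumption, since the text only specifies which two quantities are encoded per pair:

```python
import numpy as np

def global_relations(points):
    """Pairwise rotation-invariant relations for the region relation
    convolution: for each pair (p_i, p_j), the Euclidean distance and
    the cosine of the angle between the vectors O->p_i and O->p_j,
    stacked into an N x N x 2 array (packing is an assumption)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    norms = np.linalg.norm(points, axis=1)
    denom = np.clip(norms[:, None] * norms[None, :], 1e-12, None)
    cosang = (points @ points.T) / denom
    return np.stack([dist, cosang], axis=-1)
```

Both channels are symmetric in (i, j) and unchanged under any rotation of the cloud about the origin.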
2.4 Network Architecture
Guided by the considerations presented in Section 2.2, we design a deep hierarchical network of three layers to embed a rotation-invariant codeword of the input point cloud. Figure 4 illustrates the network architecture, where green boxes denote 3D point coordinates sampled from the input point cloud; yellow boxes denote the extracted rotation-invariant representations (see Section 2.3); purple boxes denote point indices from farthest point sampling; and blue boxes denote embedded features in the network.
Specifically, given a point set of N points, like PointNet++ [26], we first adopt a sample-and-group operator. That is, we use farthest point sampling to select a subset of N_1 points; then, for each sampled point, we use a query ball to find its K_1 neighbor points and group an N_1 × K_1 × 3 volume of 3D point coordinates; see Figure 4. We then follow the steps in Section 2.3 to map this volume into our rotation-invariant representations (yellow box) and to compute an N_1 × N_1 global relation matrix (yellow box) from the sampled points. Further, we feed both into the region relation convolution (to be presented later) to obtain the feature map (blue box) of the first layer.
The second layer continues to sample-and-group the first-layer points into a smaller point subset and uses the same set of indices (Idx) to group the first-layer features. Note that we reduce the number of sampled points and enlarge the query ball radius to allow a progressively enlarging receptive field in the hierarchy. Instead of directly feeding the grouped low-level representations into the region relation convolution for feature embedding, we avoid information loss by concatenating them with high-level features extracted from the low-level representations via a series of multi-layer perceptrons (MLPs); see Figure 4. We then feed the concatenated features, together with another global relation matrix computed from the second-layer sampled points, to another region relation convolution to generate the output of the second layer.

Further, the third layer samples-and-groups the points into one single point together with its neighbors, and uses the concatenated features for convolution. Since only one point remains, we directly use MLPs followed by max-pooling along the neighbor dimension to produce the global feature vector, which is a rotation-invariant codeword of the input point set.

Next, we can use this codeword in various point cloud analysis tasks. For example, for shape classification, we can follow the common routine of using fully-connected layers to regress the class scores. For part segmentation, we can adopt the point feature propagation and interpolation of [26] to recover the per-point features, then use MLPs to regress per-point scores; please see [26] for details. For shape retrieval, we can directly compare the cosine similarity between the codewords of the query and target point clouds.
Region relation convolution. To alleviate the inevitable global information loss in the rotation-invariant representations, we further formulate the region relation convolution (see Figure 5 for an illustration) to regress global region relation weights from the per-layer global relation matrix and to refine the features extracted from our rotation-invariant representations. Here, for each reference point and its local neighbors, previous networks [26, 34] commonly apply shared MLPs to the point features and max-pooling along the neighbor dimension to obtain a feature vector encoding the local structure around the reference point. The same operation is applied to all points to obtain an N × C feature map F. Such an operation, however, considers only each point's own local region when extracting its features, without looking at its relations with other points more globally.

To introduce more global information into the embedded features, compared with conventional convolutions [26, 34], after the shared MLPs and max-pooling, we refine the features by regressing a rotation-invariant region relation weight matrix W from the global relation matrix; see the top branch in Figure 5. The weights in each row of W, say W_i, are regressed based on the distances and angles of the i-th point relative to all the other points (see the last paragraph in Section 2.3 for details), so W_i reveals certain global relations between the i-th point and the other points. We then bring such global information into F by computing F ⊕ (W ⊗ F), where ⊕ and ⊗ denote element-wise addition and multiplication, respectively. Hence, the refined features of each point encode not only the local structure around the associated point, but also certain non-local relations with other local structures.
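A sketch of this refinement step. The two-layer MLP (with random weights here) and the fusion rule F + (W ⊙ F) are illustrative assumptions reconstructed from the description above, not a verified implementation of the paper's layer:

```python
import numpy as np

rng = np.random.default_rng(5)

def region_relation_refine(F, M, W1, W2):
    """Sketch of the region relation refinement: regress a per-point
    relation weight row W_i from point i's distances/angles to all
    other points (a tiny ReLU MLP here), then fuse it back into the
    feature map as F + (W * F), i.e., element-wise add and multiply."""
    rel = M.reshape(M.shape[0], -1)       # row i: relations of point i to all points
    W = np.maximum(rel @ W1, 0.0) @ W2    # N x C relation weights
    return F + W * F

# toy usage with a random feature map and relation matrix
N, C = 16, 8
F = rng.normal(size=(N, C))
pts = rng.normal(size=(N, 3))
diff = pts[:, None] - pts[None]
M = np.stack([np.linalg.norm(diff, axis=-1), pts @ pts.T], axis=-1)
W1 = rng.normal(size=(2 * N, 32))
W2 = rng.normal(size=(32, C))
F_ref = region_relation_refine(F, M, W1, W2)
```

Because the relation matrix M is built from distances and angles only, the regressed weights, and hence the refinement, stay rotation invariant.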
3 Experiments
3.1 Implementation Details
We implemented our network using TensorFlow [1] and trained it for 200 epochs in all tasks. The Adam optimizer [17] was used with a learning rate of 0.001 and a mini-batch size of six. Also, we followed [26] to capture multi-scale local regions with different query ball radii and neighbor numbers in each layer. Besides, we empirically set the number of random subsets and the subset size to balance the computing time and stability in finding the approximate geometric median. For details on the hyperparameter settings, please refer to the supplementary material. We shall release our trained models with code upon the publication of this work.

To evaluate the robustness of our network on inputs of arbitrary orientations, besides conventional data augmentation strategies of random scaling and jittering, we followed the settings in recent rotation-invariant methods [42, 5] to train and test our network in three scenarios: (i) z/z (as a reference): train and test with rotation augmentation about the azimuthal (gravity) axis; (ii) z/SO3: train with azimuthal rotations and test with arbitrary rotations; and (iii) SO3/SO3: train and test with arbitrary rotations. Overall, an effective rotation-invariant approach is expected to have consistent performance in all scenarios. In the following, we evaluate the performance of our method against others, both quantitatively and qualitatively, on three tasks: shape classification (Section 3.2), part segmentation (Section 3.3), and shape retrieval (Section 3.4). Then, we present the network component analysis (Section 3.5) and noise tolerance test (Section 3.6).
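The three scenarios correspond to two rotation samplers, sketched below (assuming z is the gravity axis; the unit-quaternion method is one standard way to sample uniform SO(3) rotations):

```python
import numpy as np

def azimuthal_rotation(theta):
    """Rotation about the gravity (z) axis: the z/z and z-training cases."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def uniform_so3_rotation(rng):
    """Uniform rotation in SO(3) via a random unit quaternion (SO3 cases)."""
    q = rng.normal(size=4)
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def augment(points, mode, rng):
    """Apply the rotation augmentation for a given scenario ('z' or 'SO3')."""
    if mode == "z":
        R = azimuthal_rotation(rng.uniform(0, 2 * np.pi))
    else:
        R = uniform_so3_rotation(rng)
    return points @ R.T
```

An azimuthal rotation leaves the z-coordinates untouched, while the SO(3) sampler produces proper rotations (orthogonal, determinant +1) with arbitrary axes.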
Table 1: Classification accuracy (%) of rotation-variant methods and the relative drop when handling inputs at arbitrary rotations.

| Method | z/z (reference) | z/SO3 | drop by | SO3/SO3 | drop by |
| SubVolSup MO [25] | 89.5 | 45.5 | 49.2% | 85.0 | 5.0% |
| PointNet [24] | 89.2 | 16.4 | 81.6% | 75.5 | 15.4% |
| PointNet++ (MSG) [26] | 90.7 | 28.6 | 68.5% | 85.0 | 6.3% |
| PointCNN [20] | 92.5 | 41.2 | 55.6% | 84.5 | 8.6% |
| DGCNN [34] | 92.9 | 20.6 | 77.8% | 81.1 | 12.7% |
| ShellNet [43] | 93.1 | 19.9 | 78.6% | 87.8 | 5.7% |
| Ours | 89.4 | 89.4 | 0% | 89.3 | 0.1% |
3.2 Evaluation: 3D Shape Classification
First, we evaluate our method on the 3D shape classification task by comparing it with both rotationvariant and rotationinvariant methods using the standard ModelNet40 dataset [37], which has 12,311 CAD models from 40 categories. We adopted the standard split to train our network using 9,843 models and tested it using the remaining 2,468 models. Each input point cloud has 1024 points.
Comparison with rotation-variant methods. Table 1 compares the drop in accuracy (%) for handling inputs at arbitrary rotations. First, existing rotation-variant methods have significant accuracy drops in z/SO3 as compared with z/z, showing that they are not rotation invariant. Second, for the results in SO3/SO3, their performance still drops considerably, though arbitrary rotations in data augmentation can help improve their performance when testing in SO3. This means their networks cannot learn in SO3/SO3 as effectively as in z/z. In contrast, our method has no accuracy drop for z/SO3, which validates the rotation invariance of our method. Also, it outperforms others when testing on SO3, no matter whether trained in z or in SO3. Note that the slight drop in accuracy (i.e., 0.1%) of our method in SO3/SO3 is caused by the network retraining.
Table 2: Classification accuracy (%) compared with rotation-invariant methods.

| Method | z/z (reference) | z/SO3 | drop by |
| Spherical CNN [11] | 88.9 | 78.6 | 11.6% |
| SFCNN [27] | 91.4 | 84.8 | 7.2% |
| RIShellConv [42] | 86.5 | 86.4 | 0.1% |
| ClusterNet [5] | 87.1 | 87.1 | 0% |
| Ours | 89.4 | 89.4 | 0% |
Comparison with rotation-invariant methods. Next, we compare our method with four recent rotation-invariant methods. From the results shown in Table 2, we can see that the accuracy of the spherical-related methods (Spherical CNN and SFCNN) drops considerably; since their models cannot guarantee perfect symmetry under rotations, their results are still sensitive to rotations. RIShellConv [42] and ClusterNet [5] are formulated with pure rotation invariance, so they have consistent performance. Yet, our method still outperforms them, since our rotation-invariant representations encode both local and global information, and our network can effectively learn high-level features more globally with the help of the region relation convolution and global relations.
Table 3 (top): Per-category and average mIoU (%) on ShapeNet part segmentation in the z/SO3 scenario.

| Method (z/SO3) | aero | bag | cap | car | chair | earph. | guitar | knife | lamp | laptop | motor | mug | pistol | rocket | skate | table | avg. mIoU |
| PointNet [24] | 40.4 | 48.1 | 46.3 | 24.5 | 45.1 | 39.4 | 29.2 | 42.6 | 52.7 | 36.7 | 21.2 | 55.0 | 29.7 | 26.6 | 32.1 | 35.8 | 37.8 |
| PointNet++ (MSG) [26] | 51.3 | 66.0 | 50.8 | 25.2 | 66.7 | 27.7 | 29.7 | 65.6 | 59.7 | 70.1 | 17.2 | 67.3 | 49.9 | 23.4 | 43.8 | 57.6 | 48.3 |
| PointCNN [20] | 21.8 | 52.0 | 52.1 | 23.6 | 29.4 | 18.2 | 40.7 | 36.9 | 51.1 | 33.1 | 18.9 | 48.0 | 23.0 | 27.7 | 38.6 | 39.9 | 34.7 |
| DGCNN [34] | 37.0 | 50.2 | 38.5 | 24.1 | 43.9 | 32.3 | 23.7 | 48.6 | 54.8 | 28.7 | 17.8 | 74.4 | 25.2 | 24.1 | 43.1 | 32.3 | 37.4 |
| ShellNet [43] | 55.8 | 59.4 | 49.6 | 26.5 | 40.3 | 51.2 | 53.8 | 52.8 | 59.2 | 41.8 | 28.9 | 71.4 | 37.9 | 49.1 | 40.9 | 37.3 | 47.2 |
| RIShellConv [42] | 80.6 | 80.0 | 70.8 | 68.8 | 86.8 | 70.3 | 87.3 | 84.7 | 77.8 | 80.6 | 57.4 | 91.2 | 71.5 | 52.3 | 66.5 | 78.4 | 75.3 |
| Ours | 81.4 | 82.3 | 86.3 | 75.3 | 88.5 | 72.8 | 90.3 | 82.1 | 81.3 | 81.9 | 67.5 | 92.6 | 75.5 | 54.8 | 75.1 | 78.9 | 79.2 |
Table 3 (bottom): Per-category and average mIoU (%) on ShapeNet part segmentation in the SO3/SO3 scenario.

| Method (SO3/SO3) | aero | bag | cap | car | chair | earph. | guitar | knife | lamp | laptop | motor | mug | pistol | rocket | skate | table | avg. mIoU |
| PointNet [24] | 81.6 | 68.7 | 74.0 | 70.3 | 87.6 | 68.5 | 88.9 | 80.0 | 74.9 | 83.6 | 56.5 | 77.6 | 75.2 | 53.9 | 69.4 | 79.9 | 74.4 |
| PointNet++ (MSG) [26] | 79.5 | 71.6 | 87.7 | 70.7 | 88.8 | 64.9 | 88.8 | 78.1 | 79.2 | 94.9 | 54.3 | 92.0 | 76.4 | 50.3 | 68.4 | 81.0 | 76.7 |
| PointCNN [20] | 78.0 | 80.1 | 78.2 | 68.2 | 81.2 | 70.2 | 82.0 | 70.6 | 68.9 | 80.8 | 48.6 | 77.3 | 63.2 | 50.6 | 63.2 | 82.0 | 71.4 |
| DGCNN [34] | 77.7 | 71.8 | 77.7 | 55.2 | 87.3 | 68.7 | 88.7 | 85.5 | 81.8 | 81.3 | 36.2 | 86.0 | 77.3 | 51.6 | 65.3 | 80.2 | 73.3 |
| ShellNet [43] | 79.0 | 79.6 | 80.2 | 64.1 | 87.4 | 71.3 | 88.8 | 81.9 | 79.1 | 95.1 | 57.2 | 91.2 | 69.8 | 55.8 | 73.0 | 79.3 | 77.1 |
| RIShellConv [42] | 80.6 | 80.2 | 70.7 | 68.8 | 86.8 | 70.4 | 87.2 | 84.3 | 78.0 | 80.1 | 57.3 | 91.2 | 71.3 | 52.1 | 66.6 | 78.5 | 75.3 |
| Ours | 81.4 | 84.5 | 85.1 | 75.0 | 88.2 | 72.4 | 90.7 | 84.4 | 80.3 | 84.0 | 68.8 | 92.6 | 76.1 | 52.1 | 74.1 | 80.0 | 79.4 |
3.3 Evaluation: 3D Object Part Segmentation
Next, we evaluate our method on 3D object part segmentation by comparing it with both rotation-variant methods and the recent rotation-invariant method RIShellConv [42] using the ShapeNet dataset [3]. This dataset has 16,881 models from 16 categories, annotated with 50 parts in total. We adopt the per-category averaged intersection over union (mIoU) metric [24] in the evaluation. Note that we do not compare with ClusterNet [5], since it is designed only for classification.
Table 3 shows the per-category mIoU and averaged mIoU (over all 16 categories) produced by different methods in the z/SO3 and SO3/SO3 scenarios. Comparing the results shown in the top and bottom tables, we can see that the rotation-variant methods yield very different segmentation results for z/SO3 and SO3/SO3, while both RIShellConv and our method achieve more consistent performance when tested on inputs at arbitrary rotations. Also, our method outperforms RIShellConv and the others with the highest averaged mIoU; see the rightmost columns in the two tables. Again, due to network retraining, although both RIShellConv and our method are rotation invariant, there are slight differences between their results for z/SO3 and SO3/SO3. Further, we show some typical visual comparisons in z/SO3 in Figures 1(b) and 6, where the segmentation results produced by our method are the closest to the ground truths, compared with the others. Please see the supplementary material for more visual comparisons.
3.4 Evaluation: 3D Shape Retrieval
Besides 3D shape classification and object part segmentation, we further evaluate our method on 3D shape retrieval using the perturbed ShapeNet Core55 dataset [3]. Here, we followed the rules of the SHREC'17 3D shape retrieval contest [29], where each model has been randomly rotated by a uniformly-sampled rotation in SO(3). For a fair comparison, we trained and tested all methods on the provided training/validation/testing sets, and evaluated their performance with the official evaluation metrics, i.e., precision (P@N), recall (R@N), F1-score (F1@N), mean average precision (mAP), and normalized discounted cumulative gain (NDCG). On each shape, 2,048 points are sampled as the network input. To combine the retrieval results of different categories, we followed [29] to use the macro and micro average strategies on the above five metrics. For a better demonstration, we also compute the average score over all the metrics.

Table 4 reports the evaluation results. Overall, a larger metric value indicates better retrieval performance. Compared with the contest winner [12] and the recent rotation-invariant methods [27, 42], our method achieves the best performance on most evaluation metrics (six out of ten), as well as the best average score by a large margin.
Table 4: 3D shape retrieval results on the perturbed ShapeNet Core55 (SHREC'17), under micro and macro averaging.

| Method | micro: P@N | R@N | F1@N | mAP | NDCG | macro: P@N | R@N | F1@N | mAP | NDCG | avg |
| Furuya [12] (contest winner) | 0.814 | 0.683 | 0.706 | 0.656 | 0.754 | 0.607 | 0.539 | 0.503 | 0.476 | 0.560 | 0.630 |
| SFCNN [27] | 0.778 | 0.751 | 0.752 | 0.705 | 0.813 | 0.656 | 0.539 | 0.536 | 0.483 | 0.580 | 0.659 |
| RIShellConv [42] | 0.641 | 0.698 | 0.639 | 0.786 | 0.883 | 0.325 | 0.608 | 0.368 | 0.659 | 0.821 | 0.643 |
| Ours | 0.847 | 0.456 | 0.522 | 0.928 | 0.937 | 0.701 | 0.495 | 0.501 | 0.889 | 0.926 | 0.720 |
3.5 Network Component Analysis
Table 5: Classification accuracy (%) for the network component analysis under z/SO3.

| Scenario | Ablation: Case #1 | Ablation: Case #2 | Rot.-inv. rep.: RIShellConv | Net. arch.: PointNet++ | Net. arch.: DGCNN | Full pipeline |
| z/SO3 | 88.4 | 88.4 | 87.8 | 88.6 | 82.6 | 89.4 |
Next, we conduct an ablation study, an analysis of our rotation-invariant representation, and a network architecture analysis to evaluate different aspects of our method using the shape classification task on ModelNet40.
Ablation study. First, we evaluate two major modules in our method:

Case #2. To verify our proposed region relation convolution, we degenerate it into just the shared MLPs followed by max-pooling (see Figure 5).
The leftmost portion of Table 5 shows the results of the two cases. Since both cases are rotation invariant, their classification accuracies are consistent across z/z, z/SO3, and SO3/SO3, so we report only the accuracies under z/SO3. Comparing the results with our full pipeline (rightmost in Table 5), we can see that each module contributes to better classification performance.
Rotation-invariant representation analysis. To verify the effectiveness of our rotation-invariant representation (both the global and local parts, as depicted in Figure 3), we replace it with the state-of-the-art rotation-invariant representation proposed in [42]. The resulting classification accuracy is shown in the middle portion of Table 5. Comparing with the full-pipeline result (rightmost in Table 5), we can see that our network achieves better performance with our rotation-invariant representation (89.4%) than with the representation in [42] (87.8%). Note, however, that this 87.8% is still higher than the 86.4% achieved by the original network of [42] (see Table 2). The difference reveals that, although both cases use the same rotation-invariant representation, our network with the region relation convolution and global relation information can achieve better performance.
Network architecture analysis. To verify the effectiveness of our network (Figure 4), we replace it with PointNet++ [26] and DGCNN [34], respectively, while keeping our rotation-invariant representations as the network inputs. The "network architecture" column in Table 5 shows the results. Apparently, our network (full pipeline) achieves higher performance. Also, we explore our network's performance with different numbers of layers; see the supplementary material.
3.6 Noise Tolerance Test
Noise is common in the acquisition of 3D point clouds. This motivates us to introduce the geometric median (which is less sensitive to noise) into the formulation of our rotation-invariant representations. To study our method's robustness to noise, we test its shape classification performance on ModelNet40 using inputs corrupted by Gaussian noise of increasing level (variance). In this test, we consider four cases: (i) our method with the geometric median; (ii) our method with the arithmetic mean; (iii) ClusterNet [5]; and (iv) RIShellConv [42].

Figure 7 plots the shape classification accuracy for the four cases over shape inputs with increasing amounts of noise. From the results, we can see that using the geometric median consistently achieves better performance than using the arithmetic mean, while existing rotation-invariant methods are more sensitive to noise.
4 Conclusion
We presented a rotation-invariant framework for deep 3D point cloud analysis. Given an input cloud at an arbitrary orientation, our framework produces consistent results and achieves the best performance on multiple point cloud analysis tasks, including shape classification, part segmentation, and shape retrieval, compared with the state-of-the-art methods. To achieve this, we introduce a novel low-level, purely rotation-invariant representation as the network input, which encodes both local and global information and is robust to noise and outliers. Further, we formulate the region relation convolution to enrich the network features with more global information. Extensive experimental results confirm the rotation invariance of our method, as well as its superiority over the state of the art.
Despite the effectiveness of our method on handling noisy inputs compared with others (see Figure 7), its performance still drops progressively as the noise becomes larger. In the future, we plan to explore the design of a noise-resistant network, since real LiDAR-scanned inputs are often contaminated by a large amount of noise, particularly in outdoor situations. We also plan to extend our rotation-invariant framework to the problems of point cloud registration and partial shape matching.
References

[1] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: TensorFlow: A system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation. pp. 265–283 (2016)
[2] Aoki, Y., Goforth, H., Srivatsan, R.A., Lucey, S.: PointNetLK: Robust & efficient point cloud registration using PointNet. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 7163–7172 (2019)
[3] Chang, A.X., Funkhouser, T., Guibas, L.J., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012 (2015)
[4] Chen, C., Fragonara, L.Z., Tsourdos, A.: GAPNet: Graph attention based point neural network for exploiting local feature of point cloud. arXiv preprint arXiv:1905.08705 (2019)
[5] Chen, C., Li, G., Xu, R., Chen, T., Wang, M., Lin, L.: ClusterNet: Deep hierarchical cluster network with rigorously rotation-invariant representation for point cloud analysis. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 4994–5002 (2019)
[6] Chen, S., Duan, C., Yang, Y., Li, D., Feng, C., Tian, D.: Deep unsupervised learning of 3D point clouds via graph topology inference and filtering. arXiv preprint arXiv:1905.04571 (2019)
[7] Cohen, M.B., Lee, Y.T., Miller, G., Pachocki, J., Sidford, A.: Geometric median in nearly linear time. In: Proceedings of the Forty-eighth Annual ACM Symposium on Theory of Computing. pp. 9–21. ACM (2016)
[8] Cohen, T., Geiger, M., Köhler, J., Welling, M.: Spherical CNNs. In: Int. Conf. on Learning Representations (ICLR) (2018)
[9] Deng, H., Birdal, T., Ilic, S.: PPF-FoldNet: Unsupervised learning of rotation invariant 3D local descriptors. In: European Conf. on Computer Vision (ECCV). pp. 602–618 (2018)
[10] Duan, Y., Zheng, Y., Lu, J., Zhou, J., Tian, Q.: Structural relational reasoning of point clouds. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 949–958 (2019)
[11] Esteves, C., Allen-Blanchette, C., Makadia, A., Daniilidis, K.: Learning SO(3) equivariant representations with spherical CNNs. In: European Conf. on Computer Vision (ECCV). pp. 52–68 (2018)
[12] Furuya, T., Ohbuchi, R.: Deep aggregation of local 3D geometric features for 3D model retrieval. In: British Machine Vision Conf. (BMVC). pp. 121.1–121.12 (2016)
[13] Han, Z., Wang, X., Liu, Y.S., Zwicker, M.: Multi-angle point cloud-VAE: Unsupervised feature learning for 3D point clouds from multiple angles by joint self-reconstruction and half-to-half prediction. In: IEEE Int. Conf. on Computer Vision (ICCV). pp. 10442–10451 (2019)
[14] Hassani, K., Haley, M.: Unsupervised multi-task feature learning on point clouds. In: IEEE Int. Conf. on Computer Vision (ICCV). pp. 8160–8171 (2019)
[15] Hermosilla, P., Ritschel, T., Ropinski, T.: Total Denoising: Unsupervised learning of 3D point cloud cleaning. In: IEEE Int. Conf. on Computer Vision (ICCV). pp. 52–60 (2019)
[16] Hermosilla, P., Ritschel, T., Vázquez, P.P., Vinacua, À., Ropinski, T.: Monte Carlo convolution for learning on non-uniformly sampled point clouds. ACM Trans. on Graphics (SIGGRAPH Asia) 37(6), 235:1–12 (2018)
[17] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Int. Conf. on Learning Representations (ICLR) (2015)
[18] Li, J., Chen, B.M., Hee Lee, G.: SO-Net: Self-organizing network for point cloud analysis. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 9397–9406 (2018)
[19] Li, R., Li, X., Fu, C.W., Cohen-Or, D., Heng, P.A.: PU-GAN: A point cloud upsampling adversarial network. In: IEEE Int. Conf. on Computer Vision (ICCV). pp. 7203–7212 (2019)
[20] Li, Y., Bu, R., Sun, M., Wu, W., Di, X., Chen, B.: PointCNN: Convolution on transformed points. In: Conference and Workshop on Neural Information Processing Systems (NeurIPS). pp. 820–830 (2018)
[21] Liu, Y., Fan, B., Xiang, S., Pan, C.: Relation-shape convolutional neural network for point cloud analysis. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 8895–8904 (2019)
[22] Lu, W., Wan, G., Zhou, Y., Fu, X., Yuan, P., Song, S.: DeepVCP: An end-to-end deep neural network for point cloud registration. In: IEEE Int. Conf. on Computer Vision (ICCV). pp. 12–21 (2019)
[23] Poulenard, A., Rakotosaona, M.J., Ponty, Y., Ovsjanikov, M.: Effective rotation-invariant point CNN with spherical harmonics kernels. In: Int. Conf. on 3D Vision (3DV). pp. 47–56 (2019)
[24] Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: Deep learning on point sets for 3D classification and segmentation. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 652–660 (2017)
[25] Qi, C.R., Su, H., Nießner, M., Dai, A., Yan, M., Guibas, L.J.: Volumetric and multi-view CNNs for object classification on 3D data. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 5648–5656 (2016)
[26] Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: Deep hierarchical feature learning on point sets in a metric space. In: Conference and Workshop on Neural Information Processing Systems (NeurIPS). pp. 5099–5108 (2017)
[27] Rao, Y., Lu, J., Zhou, J.: Spherical fractal convolutional neural networks for point cloud recognition. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 452–460 (2019)
[28] Sauder, J., Sievers, B.: Self-supervised deep learning on point clouds by reconstructing space. In: Conference and Workshop on Neural Information Processing Systems (NeurIPS) (2019), to appear
[29] Savva, M., Yu, F., Su, H., Kanezaki, A., Furuya, T., Ohbuchi, R., Zhou, Z., Yu, R., Bai, S., Bai, X., et al.: SHREC'17 track: Large-scale 3D shape retrieval from ShapeNet Core55. In: Proceedings of the Eurographics Workshop on 3D Object Retrieval (2017)
[30] Shen, Y., Feng, C., Yang, Y., Tian, D.: Mining point cloud local structures by kernel correlation and graph pooling. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 4548–4557 (2018)
[31] Su, H., Jampani, V., Sun, D., Maji, S., Kalogerakis, E., Yang, M.H., Kautz, J.: SPLATNet: Sparse lattice networks for point cloud processing. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 2530–2539 (2018)
[32] Thomas, H., Qi, C.R., Deschaud, J.E., Marcotegui, B., Goulette, F., Guibas, L.J.: KPConv: Flexible and deformable convolution for point clouds. In: IEEE Int. Conf. on Computer Vision (ICCV). pp. 6411–6420 (2019)
[33] Wang, Y., Solomon, J.M.: Deep closest point: Learning representations for point cloud registration. In: IEEE Int. Conf. on Computer Vision (ICCV). pp. 3523–3532 (2019)
[34] Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph CNN for learning on point clouds. ACM Trans. on Graphics 38(5), 146:1–12 (2019)
[35] Weiler, M., Geiger, M., Welling, M., Boomsma, W., Cohen, T.: 3D steerable CNNs: Learning rotationally equivariant features in volumetric data. In: Conference and Workshop on Neural Information Processing Systems (NeurIPS). pp. 10381–10392 (2018)
[36] Wu, W., Qi, Z., Fuxin, L.: PointConv: Deep convolutional networks on 3D point clouds. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 9621–9630 (2019)
[37] Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, J.: 3D ShapeNets: A deep representation for volumetric shapes. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 1912–1920 (2015)
[38] Xu, Y., Fan, T., Xu, M., Zeng, L., Qiao, Y.: SpiderCNN: Deep learning on point sets with parameterized convolutional filters. In: European Conf. on Computer Vision (ECCV). pp. 87–102 (2018)
[39] Yang, Y., Feng, C., Shen, Y., Tian, D.: FoldingNet: Point cloud auto-encoder via deep grid deformation. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 206–215 (2018)
[40] Yifan, W., Wu, S., Huang, H., Cohen-Or, D., Sorkine-Hornung, O.: Patch-based progressive 3D point set upsampling. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 5958–5967 (2019)
[41] Yu, L., Li, X., Fu, C.W., Cohen-Or, D., Heng, P.A.: PU-Net: Point cloud upsampling network. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 2790–2799 (2018)
[42] Zhang, Z., Hua, B.S., Rosen, D.W., Yeung, S.K.: Rotation invariant convolutions for 3D point clouds deep learning. In: Int. Conf. on 3D Vision (3DV). pp. 204–213 (2019)
[43] Zhang, Z., Hua, B.S., Yeung, S.K.: ShellNet: Efficient point cloud convolutional neural networks using concentric shells statistics. In: IEEE Int. Conf. on Computer Vision (ICCV). pp. 1607–1616 (2019)
[44] Zhao, H., Jiang, L., Fu, C.W., Jia, J.: PointWeb: Enhancing local neighborhood features for point cloud processing. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 5565–5573 (2019)
[45] Zhao, Y., Birdal, T., Deng, H., Tombari, F.: 3D point capsule networks. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 1009–1018 (2019)
[46] Zhou, H., Chen, K., Zhang, W., Fang, H., Zhou, W., Yu, N.: DUP-Net: Denoiser and upsampler network for 3D adversarial point clouds defense. In: IEEE Int. Conf. on Computer Vision (ICCV). pp. 1961–1970 (2019)