Massive attention has been paid to point clouds, a basic type of 3D data representation. As a pioneering deep learning method for point cloud analysis, PointNet employs MLPs to extract features from raw 3D coordinates, an approach that has been extensively followed. Most previous works are evaluated on synthetic datasets, i.e., ModelNet40 [33] and ShapeNet [36], where the point-cloud models are well aligned. Nevertheless, it is non-trivial to obtain well-aligned point clouds in the real world, where rotation is inevitable. Specifically, the pose of a point-cloud model is arbitrary and includes both translation and rotation. PointNet and its variants fail in this case because these transformations change the input coordinates. As shown in Fig. 1(a), the classification and segmentation results are significantly degraded by rotation.
Considering that the issue of translation can be easily addressed by centring the point-cloud models, several attempts have concentrated on rotation robustness. An intuitive solution is to augment the training data with arbitrary rotations. However, the augmentation is unable to cover all rotations. […] [8, 4], which still needs some extra processing such as max pooling to achieve rotation invariance, and the loss of information is inevitable during the projection.
The issue of rotation sensitivity boils down to changes in the input coordinates. Inspired by this observation, we transform the raw coordinates into rotation-invariant representations that serve as the input of the network, which makes the network intrinsically invariant to rotation. Although some rotation-invariant representations have been designed [37, 3], existing methods focus on distances and angles in local regions and lack the constraint of global information, which limits their distinctiveness. In this regard, we present a simple yet effective solution to the rotation problem that combines global and local rotation-invariant features. Specifically, for the local representation, we extend the Darboux feature [25] into a more distinctive feature space, where relative locations are distinguished by measuring the distance between the query point and its neighbors and the difference between their local coordinate systems. For the global representation, we estimate a global coordinate system on a down-sampled point-cloud model using singular value decomposition (SVD) [10]. Subsequently, the original points are projected into the estimated coordinate system, which makes their coordinates invariant to rotation.
Additionally, in order to extract high-dimensional features from the presented representations, we propose a two-branch network in which the global and local representations are processed individually. A group convolution layer is designed as the basic module, which hierarchically extracts and aggregates features. As illustrated in Fig. 1(b), the presented method is fully invariant to rotation in classification and segmentation tasks. Extensive experiments are conducted on both synthetic datasets, i.e., ModelNet40 [33] and ShapeNet [36], and a real-world dataset, i.e., ScanObjectNN [29], which show that our method achieves state-of-the-art performance for classification and segmentation on rotation-augmented benchmarks.
In a nutshell, our major contributions are summarized as follows:
We present a combination of global and local representations which are intrinsically invariant to rotation changes.
We propose a two-branch network (the code will be available at https://github.com/sailor-z/GLR-Net) which employs group convolutions to hierarchically extract and aggregate features.
Our method achieves state-of-the-art performance on comprehensive evaluation benchmarks which contain rotation in both synthetic and real-world data.
2 Related Work
To alleviate the rotation issue, a straightforward way is to augment the training data with arbitrary rotation transformations [22]. However, there are three rotational degrees of freedom in the real world, i.e., pitch, yaw, and roll, each of which spans a continuous range of angles, which leads to innumerable rotations. Consequently, it is impractical to cover all rotations with limited model capacity. In order to improve the robustness to rotation more efficiently, an alternative line of work employs deep learning to directly learn spatial transformations [22]. Specifically, T-Net is used in PointNet to regress a spatial transformation and a high-dimensional feature transformation, with the expectation of transforming the point clouds into a canonical coordinate system. Nevertheless, the learned transformations are still vulnerable to rotation, because the regression procedure lacks theoretical support from the perspective of rotation invariance.
Rotation equivariance convolutions.
Inspired by the tremendous progress of convolutional networks in 2D computer vision, numerous works have sought to carry the success of convolutions over from images to point clouds [34, 17, 32]. However, most of these works do not take rotation invariance into account and are sensitive to rotation. In this regard, some efforts utilize spherical convolutions to achieve rotation equivariance [8, 4, 18]. First, the 3D mesh or voxel models are projected onto spheres, translating the coordinates into angles. Second, a series of spherical convolutions, implemented via the Fourier transform, is carried out on the spheres to generate a set of feature maps. Third, the inverse Fourier transform is employed to recover the angles. Note, however, that equivariance means that the output varies in the same way as the input, which is not the same as intrinsic invariance to rotation. A global operation such as global max pooling is therefore required to achieve rotation invariance. Additionally, loss of information is inevitable during the generation of the mesh or voxel model and during the forward and inverse transformations, which limits performance.
Rotation invariance representations. For the sake of intrinsic rotation invariance, some approaches transform the raw point clouds into rotation-invariant representations, where distances and angles are the most widely used features. Specifically, Deng et al. [5] proposed a 4D point pair feature (PPF) for rotation-invariant descriptors, which utilizes the distances and angles between the central reference point and its neighbors in a local patch, combined with the information of normal vectors. For classification and segmentation, Chen et al. [3] integrated distance, angle, and related features in local k-NN graphs into a cluster network. Zhang et al. [37] combined distance and angle features in local graphs with those computed at reference points generated by down-sampling. Nevertheless, all previous works concentrate on local features, i.e., relative distances and angles in local graphs, and lack effective global features. Employing only local information makes sense for local descriptors, but local features are prone to ambiguity in classification and segmentation. For instance, the relative distances and angles tend to be similar among different regions on the same plane of a desk. In this regard, we present a combination of global and local rotation-invariant representations, which fills the gap of global constraints. The details are introduced in the following section.
3.1 Problem Statement
Our method directly processes raw point clouds as input, each of which is represented as a set of 3D points. The normal vector of each point is also utilized. The rotation issue is formulated as transforming the point set by a 3×3 orthogonal matrix, which has three degrees of freedom. The task of rotation invariance then boils down to requiring that the function that generates the presented representations from raw points return the same output for the rotated points as for the original ones.
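For concreteness, the invariance requirement can be written out as below. The symbols $P$, $p_i$, $n_i$, $N$, $R$, and $f$ are notation assumed here for illustration; the authors' own symbols may differ.

```latex
% Assumed notation: P is the point set, n_i the normal of p_i,
% R a rotation matrix, f the map from raw points to representations.
P = \{p_i\}_{i=1}^{N}, \qquad p_i \in \mathbb{R}^3, \qquad n_i \in \mathbb{R}^3, \\
R \in \mathbb{R}^{3 \times 3}, \qquad R^\top R = I, \qquad \det R = 1, \\
f\big(\{R\,p_i\}, \{R\,n_i\}\big) = f\big(\{p_i\}, \{n_i\}\big) \quad \text{for every such } R .
```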
For the classification task, our approach outputs one score per class, where the maximum score is expected to correspond to the correct class label. For the semantic segmentation task, our method outputs a score map that gives the score of every category for every point. The outputs of both tasks are supposed to be invariant to rotation.
3.2 Local Branch
As mentioned above, local features have proved critical for point-cloud classification and segmentation [34, 17, 32]. To this end, we design a local branch to extract local patterns in graph-based structures. Intuitively, distance and angle are two kinds of rotation-invariant features; the core issue is how to utilize them effectively. Inspired by the classical Darboux feature [25], we extract local geometrical features by estimating the relative relationship between the local coordinate systems of the central point and of its neighbors.
The representation is illustrated in Fig. 3. First, for a query point, a local graph is generated by k-nearest-neighbor search. Second, the Euclidean distance between the query point and each neighbor, i.e., the norm of the relative vector between the two points, is calculated to indicate the local density around the query point. Third, the relationship between the local coordinate systems centred at the query point and at the neighbor is recovered. Note that normal vectors are required to generate the local coordinate systems. Specifically, three orthogonal vectors are determined at each point by using cross products to estimate the two remaining axes.
Subsequently, the relationship is represented by a set of angles between pairs of vectors drawn from the two local coordinate systems and the relative vector.
Additional terms are employed to alleviate the ambiguity. Given the k neighbors of a query point, the generated representation is a feature map that captures the local pattern around the query point.
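The local representation can be illustrated with a minimal NumPy sketch. It assumes a Darboux-style frame built from the normal and the relative vector, and it uses all nine pairwise axis cosines as the angular part; the exact set of angles and the disambiguation terms used by the authors are not recoverable from the text, so `local_frame`, `local_features`, and the 10-dimensional layout are hypothetical.

```python
import numpy as np

def local_frame(normal, direction):
    """Build three orthogonal axes from a normal and a second vector (assumed construction)."""
    u = normal / np.linalg.norm(normal)
    v = np.cross(u, direction)
    v = v / (np.linalg.norm(v) + 1e-12)
    w = np.cross(u, v)
    return np.stack([u, v, w])                  # rows are the axes

def local_features(points, normals, query, k=8):
    """Distance plus frame-to-frame cosines for the k nearest neighbours of one point."""
    d2 = np.sum((points - points[query]) ** 2, axis=1)
    nbrs = np.argsort(d2)[1:k + 1]              # skip the query point itself
    feats = []
    for j in nbrs:
        rel = points[j] - points[query]         # relative vector
        dist = np.linalg.norm(rel)
        rel_unit = rel / dist
        frame_q = local_frame(normals[query], rel_unit)
        frame_j = local_frame(normals[j], rel_unit)
        cosines = (frame_q @ frame_j.T).ravel() # 9 cosines between the two frames
        feats.append(np.concatenate([[dist], cosines]))
    return np.stack(feats)                      # shape (k, 10)
```

Because distances, inner products, and cross products of jointly rotated vectors are preserved by a proper rotation, the returned feature map does not change when the points and normals are rotated together.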
3.3 Global Branch
Although local information has been extensively employed in rotation-invariant point cloud analysis, the extraction of global information remains a difficult issue. As mentioned in [37], the limited accuracy of that method is due to the lack of original point coordinates, which reflect the absolute locations in a global coordinate system. The classification accuracy increases significantly when the presented features are replaced with raw 3D coordinates, but the approach is then no longer invariant to rotation. This observation is reasonable because local features represent relative relationships, which are inevitably ambiguous in some cases. For instance, for points located on the flat top of a table, the local representations, i.e., distances and angles, tend to be similar among different neighbors. To this end, we design a global branch that contains a global feature extraction module and takes rotation invariance into account.
An intuitive solution to extract global features is to establish a global coordinate system by using singular value decomposition (SVD) to find three main directions that are equivariant to rotation. Nevertheless, applying SVD to the original point-cloud model, which may contain thousands of points, is time consuming, and SVD is also sensitive to missing data. In order to achieve an efficient and robust solution, as shown in Fig. 4, we establish a down-sampled subset of the original model, which contains far fewer points while retaining the major geometrical structure, i.e., the skeleton. The down-sampling is implemented by farthest point sampling, which increases the robustness against nuisances. SVD is then carried out on this subset, and the decomposition yields three orthogonal axes. Invariance to rotation is achieved by transforming the points of the original model into the established global coordinate system.
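The global branch can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: points are stored as rows, the subset size `m` and the fixed starting index are arbitrary choices, and the third-moment sign fix is added here because SVD axes are only defined up to sign; none of these details are confirmed by the text.

```python
import numpy as np

def farthest_point_sampling(points, m, start=0):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen."""
    chosen = [start]
    dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(m - 1):
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return points[chosen]

def global_representation(points, m=32):
    """Project centred points onto SVD axes estimated from an FPS subset."""
    centred = points - points.mean(axis=0)
    subset = farthest_point_sampling(centred, m)
    # Rows of vt are the three orthogonal axes of the subset (m x 3 layout assumed).
    _, _, vt = np.linalg.svd(subset, full_matrices=False)
    projected = centred @ vt.T
    # Sign fix (an assumption, not from the text): make the third moment
    # along each axis non-negative so that the axis directions are unique.
    signs = np.sign(np.sum(projected ** 3, axis=0))
    signs[signs == 0] = 1.0
    return projected * signs
```

The sketch relies on the subset having distinct singular values; for shapes with strong symmetries the axes are not unique and the projection can differ between runs.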
3.4 Group Convolution
In order to extract high-dimensional features from the presented representations, we integrate the rotation-invariant (RI) feature extraction module into a deep learning framework. Inspired by the success of deep learning in 2D computer vision [6, 27, 12], massive efforts have been devoted to point cloud analysis [22], which mainly employ MLPs as the basic feature extraction module. Further, to aggregate local information, which has proved critical in 2D convolutional neural networks, approaches such as graph-based convolutions [11, 30, 31] and local max pooling have been developed. However, these works have inherent drawbacks. For instance, graph-based methods are memory consuming, and local max pooling applied after an MLP that still processes each point individually leads to an inevitable loss of information. To this end, we design a series of group convolution layers that hierarchically extract and aggregate features in an efficient way. Note that our framework is intrinsically invariant to rotation owing to the rotation-invariant representations.
As shown in Fig. 5, for a central point (red dot) with a given feature dimension, a graph is established by k-nearest-neighbor search in the feature space. The neighbors (black dots) are distributed into several groups according to their Euclidean distance from the central point, which transforms the unordered points into a sorted format. The group convolution is then conducted on each group with a set of learnable weights that are shared among the groups, producing one aggregated feature block per group. The output feature map is obtained by concatenating all feature blocks.
Instead of using the combination of an MLP and local max pooling, which is responsible for the loss of information, we employ a set of group convolutions to hierarchically aggregate the input representation into a feature map, which thoroughly extracts the effective local information.
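One possible reading of the group convolution is sketched below in NumPy. Since the defining equations are not available here, everything about the layout is an assumption: the neighbors are sorted by distance, split into equal-sized groups, each group is flattened and multiplied by a single shared weight matrix, and the resulting blocks are concatenated. The function name and parameters are hypothetical.

```python
import numpy as np

def group_convolution(features, center, weights, k=8, groups=4):
    """Sort the k nearest neighbours by distance, split them into groups,
    apply one shared weight matrix to each group, and concatenate the blocks."""
    d = np.linalg.norm(features - features[center], axis=1)
    order = np.argsort(d)[1:k + 1]            # nearest first, centre excluded
    blocks = []
    for idx in np.split(order, groups):       # k must be divisible by groups
        patch = features[idx].reshape(-1)     # (k // groups) * C values
        blocks.append(patch @ weights)        # weights shared by all groups
    return np.concatenate(blocks)             # groups * C_out values
```

Sorting by distance is what turns the unordered neighborhood into a fixed layout, so the output does not depend on the order in which the points are stored.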
3.5 Rotation-Invariant Analysis
As demonstrated in Fig. 6, we visualize the extracted global and local representations in 3D space using t-SNE [19]. Compared with the raw point locations in Fig. 6(a), which are sensitive to orientation changes, the projected locations of our representations in Fig. 6(b) remain identical under rotation, which intrinsically guarantees robustness against rotation for the subsequent learning process.
A theoretical justification is given as follows.
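Distance. The Euclidean norm of a vector is unchanged when the vector is multiplied by an orthogonal matrix, so the distance between a query point and its neighbor is invariant to rotation.
Angle. The angle between two vectors depends only on their inner product and their norms, all of which are preserved when the same rotation is applied to both vectors, so the angle is invariant to rotation.
Singular Value Decomposition. Consider two point clouds, the second of which is obtained by rotating the first. Performing singular value decomposition on each of them shows that the axes estimated from the rotated cloud are the rotated axes of the original cloud. Consequently, the point locations obtained by projecting each cloud onto its own estimated axes are identical.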
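The three arguments can be written out as follows. The notation is assumed for illustration: $R$ is a rotation with $R^\top R = I$, $a$ and $b$ are 3D vectors, and $P$ is an $N \times 3$ matrix whose rows are the points, so the rotated cloud is $P' = P R^\top$.

```latex
% Distance: orthogonal matrices preserve the Euclidean norm.
\|R a\|_2 = \sqrt{a^\top R^\top R\, a} = \sqrt{a^\top a} = \|a\|_2 \\[4pt]
% Angle: inner products and norms are both preserved.
\cos\theta' = \frac{(R a)^\top (R b)}{\|R a\|_2 \, \|R b\|_2}
            = \frac{a^\top b}{\|a\|_2 \, \|b\|_2} = \cos\theta \\[4pt]
% SVD: the estimated axes rotate together with the cloud.
P = U \Sigma V^\top, \qquad
P' = P R^\top = U \Sigma (R V)^\top \;\Rightarrow\; V' = R V \\[4pt]
% Projection onto the estimated axes is therefore unchanged.
P' V' = P R^\top R V = P V
```

The identity $V' = R V$ holds up to the sign of each column and requires distinct singular values, so a sign convention is needed in practice.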
In this section, we conduct experiments on three datasets designed for different tasks, i.e., ModelNet40 [33] (synthetic shape classification), ScanObjectNN [29] (real-world shape classification), and ShapeNet [36] (part segmentation). An ablation study is also performed to evaluate the effectiveness of our network design.
4.1 Implementation Details
For local graph generation, we use k-nearest-neighbor search to find the neighbors of each central point. In the global branch, we down-sample the original model using farthest point sampling. The group convolutions are not shared between the two branches, and each branch uses its own sequence of feature dimensions. Each group convolution is followed by Batch Normalization and LeakyReLU [13, 9]. We use three fully connected layers to predict classification results and three MLP layers to generate segmentation results, with output sizes equal to the number of candidate labels of the respective task.
4.2 Synthetic Shape Classification
We evaluate our method on ModelNet40, which has been extensively used for synthetic shape classification [16, 14]. ModelNet40 includes 12,311 CAD models from 40 categories, split into 9,843 for training and 2,468 for testing. We randomly sample a fixed number of points from each model. These points are then centred and normalized into a unit sphere.
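The preprocessing described here can be sketched as follows. This is a minimal NumPy illustration; the function name and the `num_points` argument are hypothetical, and sampling without replacement is an assumption.

```python
import numpy as np

def sample_and_normalize(points, num_points, rng):
    """Randomly sample points, centre them, and scale them into the unit sphere."""
    idx = rng.choice(len(points), size=num_points, replace=False)
    sampled = points[idx]
    centred = sampled - sampled.mean(axis=0)           # centroid moved to the origin
    scale = np.max(np.linalg.norm(centred, axis=1))    # farthest point from the origin
    return centred / scale
```

Centring removes the translation component of the pose, which is why the rest of the pipeline only has to deal with rotation.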
We divide previous works into two categories, i.e., rotation-sensitive methods and rotation-robust methods. The experiments are performed in three cases, i.e., raw training and testing data, raw training data with 3D rotation-augmented testing data, and 3D rotation-augmented training and testing data, which are respectively denoted by z/z, z/SO3, and SO3/SO3. Table 1 lists the experimental results. First, in the case of z/z, our method (GLR-Net) surpasses the other rotation-robust methods. Compared with Spherical-CNN and S-CNN, where mesh information is necessary, our method achieves superior performance even though we use raw points as input, which verifies that our framework is more effective than spherical solutions. ClusterNet and Riconv also propose local rotation-invariant representations, but their lack of global information leads to inferior performance compared with our method. Second, in the cases of z/SO3 and SO3/SO3, the results of GLR-Net are almost identical and exceed the others by a large margin, while the results of rotation-sensitive algorithms decline considerably. The previous state-of-the-art approach (DGCNN) is vulnerable in z/SO3, where its accuracy drops sharply. Its performance remains unsatisfactory in SO3/SO3, even though the training data is augmented by 3D rotations. These phenomena show that it is crucial to take rotation robustness into account for real-world applications.
|Rotation-sensitive Method||input||# views||z/z(%)||z/SO3(%)||SO3/SO3(%)|
|Rotation-robust Method||input||# views||z/z(%)||z/SO3(%)||SO3/SO3(%)|
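The two kinds of rotation that appear in the z/z, z/SO3, and SO3/SO3 settings can be generated as in the sketch below. It assumes the common convention in which z denotes a rotation about the vertical axis and SO3 an arbitrary 3D rotation; the function names are hypothetical and the sampling scheme is not taken from the text.

```python
import numpy as np

def random_z_rotation(rng):
    """Rotation by a uniformly random angle about the z axis."""
    t = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def random_so3_rotation(rng):
    """Random 3D rotation from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # normalise column signs of the QR factor
    if np.linalg.det(q) < 0:         # reflect one axis to obtain det = +1
        q[:, 0] = -q[:, 0]
    return q

def rotate(points, rotation):
    """Apply a rotation matrix to points stored as rows."""
    return points @ rotation.T
```

For example, a z/SO3 run would train on `rotate(points, random_z_rotation(rng))` and test on `rotate(points, random_so3_rotation(rng))`.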
4.3 Real World Shape Classification
To analyse the limitations of our method, we compute the confusion matrix shown in Fig. 7(a). Unexpectedly, we observe that ModelNet40 contains inherent ambiguity. Specifically, as illustrated in Fig. 7(a), the two most frequently confused categories are flower pot and plant, so we show models belonging to these categories in Fig. 7(b), where both models include similar plants and pots. They are so ambiguous that they cannot be reliably classified even by human beings.
Additionally, the objects in ModelNet40 are man-made CAD models, which are well aligned and noise-free, so there is a significant gap between the synthetic data and real-world data, which tend to include differently oriented objects and various nuisances, e.g., missing data, occlusion, and non-uniform density. In order to reliably evaluate the performance of shape classification in the real world and the robustness against noise, we perform experiments on ScanObjectNN [29], which is collected from real-world indoor scenes and is stated to exclude ambiguous objects. This dataset contains objects from 15 categories and takes into account one-degree-of-freedom rotation, translation, missing data, background noise, occlusion, and non-uniform density. Some examples from this dataset are shown in Fig. 8.
We conduct experiments on the easiest variant, OBJ_BG, which involves no rotation, translation, or scaling, and on the hardest variant, PB_T50_RS, which contains bounding-box translation, rotation around the gravity axis, and random scaling. The results are shown in Table 2. Our method achieves the best performance compared with previous works, which indicates that GLR-Net is not only invariant to rotation but also robust to common nuisances. Consequently, it is promising to apply our method to classification in the real world. However, the performance of GLR-Net declines considerably compared with that in Table 1, which suggests that there is still considerable room for improvement in robustness and generalization.
4.4 Part Segmentation
Given a point-cloud model, the target of segmentation is to accurately predict per-point labels. Compared with shape classification, segmentation is a more challenging task that requires the capacity to capture fine-grained patterns. Consequently, we extend our experiments to ShapeNet [36], which is a widely used dataset for part segmentation evaluation. We use a part of ShapeNet that includes 16,881 3D models from 16 kinds of objects with 50 part categories. The overall average category mIoU (Cat. mIoU) [26] is utilized to measure the segmentation performance, which is calculated by directly averaging the results over categories.
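The metric can be illustrated with a short NumPy sketch. It assumes the widely used convention in which a part that is absent from both the prediction and the ground truth of a shape counts as an IoU of 1; the function names and the input layout are hypothetical.

```python
import numpy as np

def shape_iou(pred, target, part_ids):
    """Mean IoU over the parts of one shape; a part absent from both counts as 1."""
    ious = []
    for part in part_ids:
        inter = np.sum((pred == part) & (target == part))
        union = np.sum((pred == part) | (target == part))
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))

def category_miou(shapes, parts_of_category):
    """Average shape IoUs within each category, then average over categories."""
    per_category = {}
    for category, pred, target in shapes:
        iou = shape_iou(pred, target, parts_of_category[category])
        per_category.setdefault(category, []).append(iou)
    return float(np.mean([np.mean(v) for v in per_category.values()]))
```

Here `shapes` is a list of `(category, predicted_labels, true_labels)` tuples and `parts_of_category` maps each category name to its part label ids. Averaging over categories rather than over shapes gives rare categories the same weight as frequent ones.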
|Method||z/z (%)||z/SO3 (%)||SO3/SO3 (%)|
The general results are reported in Table 5, and the specific results are listed in Table 3 and Table 4. In the situation without rotation (z/z), our approach considerably surpasses the previous rotation-invariant algorithm (Riconv). In the cases with rotations (z/SO3 and SO3/SO3), GLR-Net achieves consistent performance, significantly exceeding other algorithms, which empirically confirms that GLR-Net strikes a good trade-off between rotation invariance and Cat. mIoU.
4.5 Evaluations of Network Design
In order to further verify the effectiveness of our two-branch network design, we perform an ablation study. Specifically, we separate the global branch and local branch and individually employ each branch to train the classification network on ModelNet40.
As reported in Table 6, accuracy declines considerably when either branch is used on its own during training. This confirms that the combination of global and local representations is a promising way to increase distinctiveness in the embedded feature space. The two branches play complementary roles, covering two different views, i.e., global observation and local fine-grained patterns.
|Global Branch||Local Branch||Acc. (%)|
Although our method achieves state-of-the-art performance, its limitation cannot be ignored. During RI-feature extraction in the global branch, each model is projected into a global coordinate system estimated by SVD, which leads to different orientations among different models. Considering that the objects in existing datasets are well aligned, these varying orientations reduce the underlying consistency among objects of the same category, which causes a loss of performance. However, due to rotations in the real world, it is impractical to obtain well-aligned instances in practical applications. Our method therefore still shows promise for classification and segmentation tasks in the real world.
We have presented a combination of global and local representations which are intrinsically invariant to rotations. For further high-dimensional feature extraction, we integrate the representations into a two-branch network in which a series of group convolutions is designed to hierarchically extract and aggregate features. Both theoretical and empirical evidence for the invariance against rotations is provided. Experiments also demonstrate the superiority of our two-branch network design. Our method shows promise for real-world applications.
-  (2017) Towards subjective quality assessment of point cloud imaging in augmented reality. In 2017 IEEE 19th International Workshop on Multimedia Signal Processing (MMSP), pp. 1–6. Cited by: §1.
-  (2006) Simultaneous localization and mapping (slam): part ii. IEEE robotics & automation magazine 13 (3), pp. 108–117. Cited by: §1.
-  (2019) ClusterNet: deep hierarchical cluster network with rigorously rotation-invariant representation for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4994–5002. Cited by: §1, §2, Table 1.
-  (2018) Spherical cnns. In International Conference on Learning Representations, Cited by: §1, §2.
-  (2018) Ppf-foldnet: unsupervised learning of rotation invariant 3d local descriptors. In Proceedings of the European Conference on Computer Vision, pp. 602–618. Cited by: §2.
-  (2009) Imagenet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §3.4.
-  (2006) Simultaneous localization and mapping: part i. IEEE robotics & automation magazine 13 (2), pp. 99–110. Cited by: §1.
-  (2018) Learning so (3) equivariant representations with spherical cnns. In Proceedings of the European Conference on Computer Vision, pp. 52–68. Cited by: §1, §2, Table 1.
-  (2011) Deep sparse rectifier neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 315–323. Cited by: §4.1.
-  (1971) Singular value decomposition and least squares solutions. In Linear Algebra, pp. 134–151. Cited by: §1.
-  (2018) Multi-kernel diffusion cnns for graph-based learning on point clouds. In Proceedings of the European Conference on Computer Vision, pp. 0–0. Cited by: §3.4.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §3.4.
-  (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §4.1.
-  (2018) Pointsift: a sift-like network module for 3d point cloud semantic segmentation. arXiv preprint arXiv:1807.00652. Cited by: §4.2.
-  (2019) GS3D: an efficient 3d object detection framework for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1019–1028. Cited by: §1.
-  (2018) So-net: self-organizing network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9397–9406. Cited by: §4.2.
-  (2018) Pointcnn: convolution on x-transformed points. In Advances in Neural Information Processing Systems, pp. 820–830. Cited by: §2, §3.2, Table 1, Table 2.
-  (2018) Deep learning 3d shapes using alt-az anisotropic 2-sphere convolution. Cited by: §2, Table 1.
-  (2008) Visualizing data using t-sne. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: Figure 6, §3.5.
-  (2015) Voxnet: a 3d convolutional neural network for real-time object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 922–928. Cited by: Table 1.
-  (2019) LaserNet: an efficient probabilistic 3d object detector for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12677–12686. Cited by: §1.
-  (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660. Cited by: §1, §2, §3.4, Table 1, Table 2.
-  (2016) Volumetric and multi-view cnns for object classification on 3d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5648–5656. Cited by: Table 1.
-  (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pp. 5099–5108. Cited by: Table 1, Table 2.
-  (2008) Aligning point cloud views using persistent feature histograms. In 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3384–3391. Cited by: §1, §3.2.
-  (2018) Mining point cloud local structures by kernel correlation and graph pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4548–4557. Cited by: §4.4.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.4.
-  (2015) Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 945–953. Cited by: Table 1.
-  (2019) Revisiting point cloud classification: a new benchmark dataset and classification model on real-world data. arXiv preprint arXiv:1908.04616. Cited by: §1, §4.3, §4.
-  (2018) Learning localized generative models for 3d point clouds via graph convolution. Cited by: §3.4.
-  (2018) Local spectral graph convolution for point set feature learning. In Proceedings of the European Conference on Computer Vision, pp. 52–66. Cited by: §3.4.
-  (2019) Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics 38 (5), pp. 146. Cited by: §2, §3.2, Table 1, Table 2.
-  (2015) 3d shapenets: a deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1912–1920. Cited by: §1, §1, §4.
-  (2018) Spidercnn: deep learning on point sets with parameterized convolutional filters. In Proceedings of the European Conference on Computer Vision, pp. 87–102. Cited by: §2, §3.2, Table 2.
-  (2018) IPOD: intensive point-based object detector for point cloud. arXiv preprint arXiv:1812.05276. Cited by: §1.
-  (2016) A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics 35 (6), pp. 210. Cited by: §1, §1, §4.4, §4.
-  (2019) Rotation invariant convolutions for 3d point clouds deep learning. arXiv preprint arXiv:1908.06297. Cited by: §1, §2, §3.3, Table 1, Table 2.