1 Introduction
With the rapid development of 3D sensors, 3D point cloud analysis techniques have drawn increasing attention in recent years. Inspired by the great success of Deep Neural Networks (DNNs) in the image analysis field, a large number of works [1, 2, 3, 4, 5, 6] have utilized DNNs to handle various tasks in the field of 3D point cloud analysis.
It is generally believed that one of the main factors impeding many existing DNNs for point cloud classification and segmentation is that the input point clouds belonging to the same object category are generally view-dependent: they undergo different rigid transformations (translations and rotations) relative to a unified view. Compared with translation transformations, whose influence can easily be eliminated through coordinate centralization, rotation transformations are much harder to handle. Additionally, the literature still lacks experimental analysis of the influence of object poses on the performances of these DNNs. It is noted that some recent works [1, 4] showed that a learnable module allowing spatial manipulation of data could significantly boost the performances of DNNs on various point cloud processing tasks, such as point cloud classification and segmentation. For example, the popular TNet [1, 4] is a learnable module that predicts a transformation matrix with an orthogonality constraint for transforming all the input point clouds into a 3-dimensional latent canonical space, and it has significantly improved the performance of many existing DNNs. Despite its excellent performance, the poses of different point clouds transformed via TNet are still up to some 3-degree-of-freedom rotations, as analyzed in Section 3.
Motivated by the aforementioned issues, in this paper we first empirically compare and analyze the influence of the rotational degree of freedom (RDF) of input 3D objects on several popular DNNs, observing that the smaller the RDF of objects is, the better these DNNs consistently perform. This observation encourages us to further investigate how to reduce the RDF of objects via a learnable DNN module. Then, we evaluate the performance of the TNet used in [1] and [4], and find that although it can manipulate 3D objects spatially and improve the DNNs' performance to some extent, in most cases it cannot transform the input view-dependent data into view-invariant data with 0 RDF. Finally, we propose a rotation transformation network, called RTN, which utilizes a Euler-angle-based rotation discretization to learn the pose of input 3D objects and then transforms them to a unified view. The proposed RTN has a two-stream architecture, where one stream extracts global features while the other extracts local features, and we also design a self-supervised scheme to train the RTN.
In sum, our major contributions are threefold:
- We empirically verify that the smaller the RDF of objects is, the more easily these objects are handled by some state-of-the-art DNNs, and we find that the popular TNet could not reduce the RDF of objects in most cases.
- To the best of our knowledge, the proposed RTN is the first attempt to learn the poses of 3D objects for point cloud analysis in a self-supervised manner. It can effectively transform view-dependent data into view-invariant data, and can be easily inserted into many existing DNNs to boost their performance on point cloud analysis.
- Extensive experimental results on point cloud classification and segmentation demonstrate that the proposed RTN helps several state-of-the-art methods improve their performances significantly.
2 Related Work
2.1 Deep Learning for 3D Point Clouds
PointNet [1] is the pioneering method to directly process 3D point clouds using shared multi-layer perceptrons (MLPs) and max-pooling layers. PointNet++ [2] extends PointNet by extracting multi-scale features of local patterns. Spatial graph convolution based methods have also been applied to 3D point clouds. SpiderCNN [3] treats the convolutional kernel weights as a product of a simple step function and a Taylor polynomial. EdgeConv is proposed in DGCNN [4], where a channel-wise symmetric aggregation operation is applied to the edge features in both Euclidean and semantic spaces.
2.2 Rotation-Invariant Representation for 3D Point Clouds
Rotation invariance is one of the most desired properties for object recognition. To address this issue, many existing works investigate how to learn rotation-invariant representations from 3D point clouds. In [7, 8, 9, 10, 11], different types of convolutional kernels are designed to directly extract approximately rotation-invariant features from the input 3D point clouds. The works in [12, 13, 14] manually craft a strictly rotation-invariant representation in the input space and use it to replace the 3D Euclidean coordinates as model input, which inevitably results in information loss. Unlike the above methods, this paper aims to learn a spatial transformation that transforms the input view-dependent 3D objects into view-invariant objects with 0 RDF.
3 Methodology
In this section, we first compare and analyze the influences of the rotational degree of freedom (RDF) of objects on the performances of four popular DNNs for point cloud analysis. Second, we investigate whether TNet [15] can reduce the RDF of objects. Finally, we describe the proposed rotation transformation network (RTN) in detail.
Table 1: Instance accuracy (Ins, %) and average per-class accuracy (mCls, %) of point cloud classification on ModelNet40 under different RDF settings.

| Method | SO(0) (Ins/mCls) | SO(1) (Ins/mCls) | SO(3) (Ins/mCls) |
|---|---|---|---|
| PointNet [1] | 89.1/85.9 | 88.1/85.2 | 84.4/79.9 |
| PointNet++ [2] | 90.6/86.8 | 89.9/86.2 | 85.7/80.6 |
| DGCNN [4] | 92.4/90.2 | 91.4/88.8 | 88.7/84.4 |
| SpiderCNN [3] | 91.5/87.8 | 90.2/87.8 | 83.9/78.7 |
3.1 Influences of RDF of Objects on DNNs
We investigate the influences of the RDF of objects on four state-of-the-art methods, PointNet [1], PointNet++ [2], DGCNN [4], and SpiderCNN [3], in which no special modules are employed to explicitly extract rotation-invariant representations from 3D point clouds. These methods are trained and evaluated on point cloud classification with the following three sets of data (a sketch of how such data can be generated is shown after the list):
- Data SO(0): the input objects belonging to each category are located at the same pose in a centralized 3D space. The RDF of these objects is 0.
- Data SO(1): the input objects belonging to each category are located on a reference plane in a centralized 3D space. The RDF of these objects is 1.
- Data SO(3): the input objects belonging to each category are located at an arbitrary pose in a centralized 3D space. The RDF of these objects is 3.
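As referenced above, the three regimes can be simulated by applying rotations with the corresponding degrees of freedom to aligned, centralized shapes. Below is a minimal numpy sketch of this idea; the helper names (`make_so_data`, etc.) are ours, not from the paper:

```python
import numpy as np

def random_z_rotation(rng):
    """Rotation about the Z axis by a uniform random angle (1 RDF)."""
    a = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def random_so3_rotation(rng):
    """Random 3D rotation via QR decomposition of a Gaussian matrix (3 RDF)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.diag(r))      # fix column signs for a unique factorization
    if np.linalg.det(q) < 0:      # ensure a proper rotation (det = +1)
        q[:, 0] *= -1.0
    return q

def make_so_data(points, mode, rng):
    """points: (N, 3) aligned shape; mode in {0, 1, 3} selects the RDF."""
    points = points - points.mean(axis=0)  # centralization removes translation
    if mode == 0:
        rot = np.eye(3)                    # Data SO(0): keep the aligned pose
    elif mode == 1:
        rot = random_z_rotation(rng)       # Data SO(1): rotation on a plane
    else:
        rot = random_so3_rotation(rng)     # Data SO(3): arbitrary pose
    return points @ rot.T
```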
The instance accuracy (Ins (%)) and average per-class accuracy (mCls (%)) of the four methods on the classification task on the public ModelNet40 dataset are reported in Table 1. We also investigate the influences of the RDF of objects on ShapenetPart for point cloud segmentation; see the supplementary material. As seen from Table 1, the classification performances of the referred methods on Data SO(0) and Data SO(1) are significantly higher than those on Data SO(3), and their performances on Data SO(0) are the best in most cases. This demonstrates that the smaller the RDF of objects is, the more easily these objects are handled, which encourages us to investigate whether the popular TNet used in some state-of-the-art methods [1, 4] could reduce the RDF of objects, and how to design a more effective DNN module to do so, in the following two subsections respectively.
3.2 Could TNet Reduce the RDF of Objects?
The observation in the above subsection naturally raises the following question: could the TNet extensively used in some state-of-the-art methods [1, 4] reduce the RDF of objects? In theory, TNet aims to learn a spatial transformation matrix with only an orthogonality constraint, and the learnt orthogonal matrix cannot strictly guarantee that the input view-dependent objects are transformed into a unified view.
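This limitation can be seen from the constraint itself: every rotation matrix is orthogonal, so two differently rotated copies of the same shape can both be mapped through valid but distinct orthogonal outputs. A minimal numpy check (our own illustration, not TNet code):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two candidate outputs of a TNet-style module: both are orthogonal, hence
# both satisfy the constraint, yet they differ by an arbitrary 3D rotation.
q1, _ = np.linalg.qr(rng.normal(size=(3, 3)))
q2, _ = np.linalg.qr(rng.normal(size=(3, 3)))
for q in (q1, q2):
    assert np.allclose(q @ q.T, np.eye(3))  # orthogonality holds for both
# The relative transform q1 @ q2.T is itself a (generally non-identity)
# orthogonal matrix, so orthogonality alone cannot pin down a unified view.
print(q1 @ q2.T)
```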
In order to further investigate the above question, we visualize many samples from each category in ModelNet40 and the corresponding point clouds transformed by the TNet used in [4] (since TNet is used similarly in [1] and [4], we only visualize the predictions of the TNet used in [4]). Due to limited space, the second row of Figure 1 shows four samples of the plane category, where the orange point clouds with 3 RDF are the inputs to TNet and the blue point clouds are the corresponding transformed outputs. As seen in Figure 1, the point clouds transformed by TNet still have 3 RDF. This demonstrates that TNet could not reduce the RDF of objects.
3.3 Rotation Transformation Network
Inspired by the above observations, we investigate how to design a network that effectively reduces the RDF of input object point clouds. We propose a rotation transformation network (RTN), which learns the rotations of the input 3D objects and then uses the learnt rotations to obtain view-invariant objects by performing the inverse rotations. The architecture of the proposed RTN is shown in Figure 2.
In the proposed RTN, the rotation learning problem is cast as a classification problem via a Euler-angle-based rotation discretization, and a self-supervised learning scheme is designed to train the network. In the following, we first explain the Euler-angle-based rotation discretization, then describe the network architecture in detail, and lastly present the proposed self-supervised learning scheme.
3D Rotation Discretization. Here, our goal is to discretize the infinite set of 3-degree-of-freedom rotations into a finite group of rotation classes. We use the ZYZ Euler-angle representation under a world coordinate system: an arbitrary 3D rotation is accomplished by first rotating the object around the Z axis by angle α, then rotating it around the Y axis by angle β, and lastly rotating it around the Z axis by angle γ, which is formulated as:

R(α, β, γ) = R_Z(γ) · R_Y(β) · R_Z(α),    (1)

where R(α, β, γ) denotes an arbitrary 3D rotation, R_Z(·) denotes a rotation around the Z axis, R_Y(·) denotes a rotation around the Y axis, and · denotes matrix multiplication.
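As a concrete companion to Eq. (1), the following numpy sketch composes the three elementary rotations in the stated world-frame (extrinsic) order; the convention is our reading of the text:

```python
import numpy as np

def rot_z(t):
    """Elementary rotation about the Z axis by angle t (radians)."""
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(t):
    """Elementary rotation about the Y axis by angle t (radians)."""
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_zyz(alpha, beta, gamma):
    """Eq. (1): rotate about Z by alpha, then Y by beta, then Z by gamma,
    all in the world frame, so later rotations are left-multiplied."""
    return rot_z(gamma) @ rot_y(beta) @ rot_z(alpha)
```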
After defining the ZYZ Euler-angle representation of 3D rotations, we discretize the continuous range of {α, β, γ} into a set of discrete values. In detail, we uniformly discretize the range of α with a prefixed interval s. To avoid singular points, we adopt a sphere equiangular discretization to jointly discretize β and γ with the same interval s. The total number of resulting rotation classes is denoted by K. Note that the discretized rotation classes become more fine-grained (larger K) as the interval s becomes smaller.
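To make the class construction concrete, the hypothetical `rotation_class` helper below sketches one way to bin {α, β, γ} with a shared interval s; the paper's exact sphere equiangular scheme (which avoids singular bins at the poles) is only approximated here:

```python
import numpy as np

def rotation_class(alpha, beta, gamma, s):
    """Map continuous ZYZ angles to a discrete class index.

    Simplified sketch: alpha in [0, 2*pi) is binned uniformly with
    interval s; (beta, gamma) are binned on an equiangular grid over
    [0, pi) x [0, 2*pi). The paper's sphere discretization may merge
    bins near the poles, which this sketch does not model.
    """
    n_alpha = int(round(2 * np.pi / s))
    n_beta = int(round(np.pi / s))
    n_gamma = int(round(2 * np.pi / s))
    ia = int(alpha // s) % n_alpha
    ib = int(beta // s) % n_beta
    ig = int(gamma // s) % n_gamma
    # Under this simplification, K = n_alpha * n_beta * n_gamma.
    return (ia * n_beta + ib) * n_gamma + ig
```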
Network Architecture. As shown in Figure 2, the proposed RTN employs a global branch and a local branch, where the local branch uses a local aggregation method to extract features and the global branch only extracts point-wise features of key points. The inputs to the RTN are point clouds with an arbitrary view, while its outputs are the corresponding view-invariant point clouds.
The global branch first samples key points of the 3D objects, as described in the supplementary material. These key points are then passed through three shared MLP layers to extract point-wise features, and a max-pooling followed by a fully-connected layer is applied to the features of these key points.
The local branch takes dense point clouds as inputs and employs five EdgeConv [4] layers to extract features. The last EdgeConv layer takes as input the concatenation of the outputs of the preceding EdgeConv layers to aggregate local features of the point clouds, and the final feature is obtained by a max-pooling layer followed by a fully-connected layer.
After obtaining the features from the global and local branches, we concatenate them and feed them into fully-connected layers to predict a discretized rotation class. Once the rotation of an input object relative to the unified view is obtained, the inverse rotation is applied to the input object to obtain its corresponding view-invariant point cloud.
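Concretely, undoing the estimated rotation only requires the transpose of the representative rotation matrix of the predicted class. A hypothetical sketch, reusing `rot_zyz` from above (`class_to_angles` is an assumed lookup, not from the paper):

```python
def undo_rotation(points, predicted_class, class_to_angles):
    """points: (N, 3) cloud whose rows were rotated as p @ R.T;
    class_to_angles is a hypothetical lookup from a class index to the
    representative (alpha, beta, gamma) bin centers."""
    alpha, beta, gamma = class_to_angles(predicted_class)
    rot = rot_zyz(alpha, beta, gamma)  # estimated rotation of the input
    return points @ rot                # right-multiplying by R inverts p @ R.T
```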
Self-Supervised Rotation Training. Here, a self-supervised scheme for generating labeled training samples is introduced. Assuming that some samples with a fixed view are given, for each sample we first generate a random ZYZ Euler-angle-based rotation. Its rotation label k is obtained according to the discretized {α, β, γ} rotation angles, where k ∈ {1, …, K} and K is the total number of discretized rotation classes. We then apply the generated 3D rotation to the sample under a world coordinate system to generate a new sample. Accordingly, we obtain a large number of labeled samples with different views and use them to train the RTN via a multi-class cross-entropy loss.
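A minimal sketch of this sample generation, reusing the `rot_zyz` and `rotation_class` helpers above (the uniform angle sampling is our simplification):

```python
import numpy as np

def make_training_sample(aligned_points, s, rng):
    """Sample a random ZYZ rotation, apply it to an aligned shape, and
    return (rotated shape, class label) for cross-entropy training."""
    alpha = rng.uniform(0.0, 2.0 * np.pi)
    beta = rng.uniform(0.0, np.pi)
    gamma = rng.uniform(0.0, 2.0 * np.pi)
    label = rotation_class(alpha, beta, gamma, s)   # self-supervised label
    rot = rot_zyz(alpha, beta, gamma)               # world-frame rotation
    return aligned_points @ rot.T, label
```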
4 Experiments
In this section, we first introduce the experimental setup. Second, we evaluate the rotation estimation performance of the proposed RTN. Then we give comparative experimental results on the classification and segmentation tasks. Lastly, we end with an ablation analysis. Additionally, we provide experiments on the effect of different rotation representations in the supplementary material. The code will be available at https://github.com/ds0529/RTN.
4.1 Experimental Setup
We evaluate the proposed method on the ModelNet40 shape classification benchmark [16] and the ShapenetPart part segmentation benchmark [17]. The poses of the shapes in ModelNet40 are not fully aligned, so we manually rotated the shapes belonging to the same category to the same pose for precise alignment. The poses of all shapes in ShapenetPart are precisely aligned. The discretization interval s is set to the best-performing value from the ablation study in Section 4.5, which determines the total number of rotation classes K. The details of the datasets and network parameters are described in the supplementary material.
Table 2: Mean Chamfer Distance (CD) between the point clouds and their 0-RDF counterparts, before (inCD) and after (outCD) correction by the proposed RTN.

| Dataset | ModelNet40 | ShapenetPart |
|---|---|---|
| Mean inCD | 0.19 | 0.21 |
| Mean outCD | 0.09 | 0.08 |
4.2 Performance of RTN on Rotation Estimation
We evaluate the rotation estimation performance of the proposed RTN on ModelNet40 and ShapenetPart through the Chamfer Distance (CD) [18] and the rotation classification accuracy. CD directly evaluates the quality of rotation estimation, whereas the classification accuracy cannot, owing to symmetric 3D objects. The detailed rotation classification results are described in the supplementary material.
CD calculates the average closest-point distance between two point clouds. For each 3D object, we calculate two CD values: one between the input rotated point cloud and the point cloud with 0 RDF (inCD), and the other between the output point cloud of the proposed RTN and the point cloud with 0 RDF (outCD). We then average the calculated CD values over all 3D objects. We perform the experiments five times independently and report the mean results.
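For reference, a minimal numpy version of this computation (assuming the common symmetric average-closest-point form of CD; the paper's exact variant follows [18]):

```python
import numpy as np

def chamfer_distance(a, b):
    """Average closest-point distance between clouds a (N, 3) and b (M, 3),
    symmetrized over both directions."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())
```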
The mean CD values are listed in Table 2. As seen from the table, the mean outCD values on both datasets are much lower than the mean inCD values, which indicates that the proposed RTN is able to transform the input 3-RDF point clouds into 0-RDF point clouds in most cases. Furthermore, we visualize the input rotated point clouds in ModelNet40 and the corrected counterparts produced by two spatial manipulation modules (RTN and TNet [4]) in Figure 1. The visualization shows that TNet could not reduce the RDF of objects, whereas the proposed RTN effectively reduces it.
Table 3: Classification results on ModelNet40, trained and tested with Data SO(3).

| Method | Input (size) | Ins/mCls |
|---|---|---|
| PointNet (with TNet) [1] | pc (1024×3) | 84.4/79.9 |
| PointNet++ [2] | pc (1024×3) | 85.7/80.6 |
| DGCNN (with TNet) [4] | pc (1024×3) | 88.7/84.4 |
| SpiderCNN [3] | pc (1024×3) | 84.0/78.7 |
| Zhang et al. [9] | pc (1024×3) | 86.4/– |
| Poulenard et al. [7] | pc (1024×3) | 87.6/– |
| Li et al. [8] | pc+normal (1024×6) | 88.8/– |
| ClusterNet [13] | pc (1024×3) | 87.1/– |
| SRINet [12] | pc+normal (1024×6) | 87.0/– |
| REQNNs [14] | pc (1024×3) | 83.0/– |
| Ours (RTN+PointNet) | pc (1024×3) | 86.0/81.0 |
| Ours (RTN+PointNet++) | pc (1024×3) | 87.4/82.7 |
| Ours (RTN+DGCNN) | pc (1024×3) | 90.2/86.5 |
| Ours (RTN+SpiderCNN) | pc (1024×3) | 86.6/82.4 |
4.3 3D Point Cloud Classification
Here, we combine the proposed RTN with four state-of-the-art methods, PointNet [1], PointNet++ [2], DGCNN [4], and SpiderCNN [3], denoted as RTN+PointNet, RTN+PointNet++, RTN+DGCNN, and RTN+SpiderCNN respectively, and evaluate their performances on the 3D point cloud classification task. The models are trained and tested with Data SO(3) on ModelNet40 to compare their robustness to 3D rotations, and two criteria are used for evaluation: instance accuracy (denoted as Ins (%)) and average per-class accuracy (denoted as mCls (%)). We perform the experiments five times independently and report the mean results. We compare the results of the proposed methods with recent state-of-the-art methods, as summarized in Table 3. In Table 3, the results of PointNet, PointNet++, DGCNN, and SpiderCNN are obtained by our re-implementation, because these methods are not evaluated on Data SO(3) in the original papers, while the results of the remaining methods are cited from their original papers directly. As noted from Table 3, the proposed RTN helps the existing DNNs improve their performances under 3D rotation variance by transforming the input view-dependent point clouds into view-invariant point clouds. The comparative results also show that the RTN-based DNNs are superior to the TNet-based DNNs, which indicates that the proposed RTN is better at reducing RDF than TNet. DGCNN equipped with the proposed RTN outperforms the current state-of-the-art methods by a significant margin.
Table 4: Part segmentation results on ShapenetPart, trained and tested with Data SO(3).

| Method | Input (size) | mIoU/Acc |
|---|---|---|
| PointNet (with TNet) [1] | pc (2048×3) | 79.1/90.6 |
| PointNet++ [2] | pc (2048×3) | 75.4/88.4 |
| DGCNN (with TNet) [4] | pc (2048×3) | 78.9/90.8 |
| SpiderCNN [3] | pc (2048×3) | 74.5/87.9 |
| Zhang et al. [9] | pc (2048×3) | 75.5/– |
| SRINet [12] | pc+normal (2048×6) | 77.0/89.2 |
| Ours (RTN+PointNet) | pc (2048×3) | 80.1/91.2 |
| Ours (RTN+PointNet++) | pc (2048×3) | 80.0/91.0 |
| Ours (RTN+DGCNN) | pc (2048×3) | 82.8/92.6 |
| Ours (RTN+SpiderCNN) | pc (2048×3) | 80.1/90.7 |
4.4 3D Point Cloud Segmentation
Although the results on the classification task have demonstrated the effectiveness of the proposed RTN, we further evaluate it on the 3D point cloud segmentation task. We perform segmentation on ShapenetPart, using the average per-shape IoU (denoted as mIoU (%)) and point-level classification accuracy (denoted as Acc (%)) as evaluation metrics. We again perform the experiments five times independently and report the mean results, where the models are trained and tested with Data SO(3). The results are compared with six recent state-of-the-art methods, as listed in Table 4. A more detailed comparison between the RTN-based DNNs and the comparative methods is provided in the supplementary material. As seen in Table 4, the methods equipped with RTN achieve significant improvements over the corresponding original methods without RTN. DGCNN equipped with the proposed RTN outperforms all the compared methods.
Table 5: Ablation on the RTN backbone for classification on ModelNet40 (DGCNN is used as the classification network after RTN).

| Backbone | GA | LA | GLA |
|---|---|---|---|
| Ins | 89.7 | 89.6 | 90.2 |
| mCls | 85.1 | 85.8 | 86.5 |
Table 6: Ablation on the discretization interval s for classification on ModelNet40.

| Quantization Interval | | | | |
|---|---|---|---|---|
| Ins | 89.7 | 90.2 | 89.8 | 89.5 |
| mCls | 86.0 | 86.5 | 85.9 | 85.2 |
4.5 Ablation Analysis
Effect of backbone. To demonstrate the superiority of the proposed global-local architecture (GLA), we perform the classification task on ModelNet40 with RTN variants using the global architecture (GA), the local architecture (LA), and the global-local architecture. DGCNN is used as the classification network after RTN. The results under the different backbone configurations are summarized in Table 5. The proposed global-local architecture achieves the best performance among all configurations, which demonstrates its benefit.
Effect of Discretization Interval. The interval s affects the rotation classification performance of RTN, and thus the performance of existing DNNs equipped with RTN for point cloud analysis. Here we analyze the effect of the discretization interval by testing a group of intervals on the ModelNet40 classification task. The results are listed in Table 6. As seen from the table, the classification accuracies under these intervals are quite close, demonstrating that the proposed method is not sensitive to the angle interval. The best-performing interval is used in both the classification and segmentation experiments.
5 Conclusion
In this paper, we first find that the smaller the RDF of objects is, the more easily these objects are handled by existing DNNs. Then, we find that the TNet module has a limited effect on reducing the RDF of input 3D objects. Motivated by these two observations, we propose a rotation transformation network, called RTN, which explicitly transforms input view-dependent point clouds into view-invariant point clouds by learning the rotation transformation based on a Euler-angle-based rotation discretization. Extensive experimental results indicate that the proposed RTN helps existing DNNs significantly improve their performances on point cloud classification and segmentation.
References

[1] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas, "Pointnet: Deep learning on point sets for 3d classification and segmentation," in CVPR, 2017, pp. 652–660.
[2] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas, "Pointnet++: Deep hierarchical feature learning on point sets in a metric space," in NeurIPS, 2017, pp. 5099–5108.
[3] Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao, "Spidercnn: Deep learning on point sets with parameterized convolutional filters," in ECCV, 2018, pp. 87–102.
[4] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon, "Dynamic graph cnn for learning on point clouds," TOG, vol. 38, no. 5, pp. 1–12, 2019.
[5] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen, "Pointcnn: Convolution on x-transformed points," in NeurIPS, 2018, pp. 820–830.
[6] Qiangeng Xu, Xudong Sun, Cho-Ying Wu, Panqu Wang, and Ulrich Neumann, "Grid-gcn for fast and scalable point cloud learning," in CVPR, 2020, pp. 5661–5670.
[7] Adrien Poulenard, Marie-Julie Rakotosaona, Yann Ponty, and Maks Ovsjanikov, "Effective rotation-invariant point cnn with spherical harmonics kernels," in 3DV, 2019, pp. 47–56.
[8] Jiaxin Li, Yingcai Bi, and Gim Hee Lee, "Discrete rotation equivariance for point cloud recognition," arXiv:1904.00319, 2019.
[9] Zhiyuan Zhang, Binh-Son Hua, David W. Rosen, and Sai-Kit Yeung, "Rotation invariant convolutions for 3d point clouds deep learning," in 3DV, 2019, pp. 204–213.
[10] Yongming Rao, Jiwen Lu, and Jie Zhou, "Spherical fractal convolutional neural networks for point cloud recognition," in CVPR, 2019, pp. 452–460.
[11] Yang You, Yujing Lou, Qi Liu, Yu-Wing Tai, Lizhuang Ma, Cewu Lu, and Weiming Wang, "Pointwise rotation-invariant network with adaptive sampling and 3d spherical voxel convolution," in AAAI, 2020, pp. 12717–12724.
[12] Xiao Sun, Zhouhui Lian, and Jianguo Xiao, "Srinet: Learning strictly rotation-invariant representations for point cloud classification and segmentation," in MM, 2019, pp. 980–988.
[13] Chao Chen, Guanbin Li, Ruijia Xu, Tianshui Chen, Meng Wang, and Liang Lin, "Clusternet: Deep hierarchical cluster network with rigorously rotation-invariant representation for point cloud analysis," in CVPR, 2019, pp. 4994–5002.
[14] Binbin Zhang, Wen Shen, Shikun Huang, Zhihua Wei, and Quanshi Zhang, "3d-rotation-equivariant quaternion neural networks," arXiv:1911.09040, 2019.
[15] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al., "Spatial transformer networks," in NeurIPS, 2015, pp. 2017–2025.
[16] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao, "3d shapenets: A deep representation for volumetric shapes," in CVPR, 2015, pp. 1912–1920.
[17] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al., "Shapenet: An information-rich 3d model repository," arXiv:1512.03012, 2015.
[18] Haoqiang Fan, Hao Su, and Leonidas J. Guibas, "A point set generation network for 3d object reconstruction from a single image," in CVPR, 2017, pp. 2463–2471.