1. Introduction
3D shape analysis has attracted increasing attention with the advancement of 3D sensors and computing resources. Extracting discriminative features from 3D models or scenes is in demand for widespread applications such as autonomous vehicles and robotics. As a fundamental format for representing 3D models, the point cloud can be conveniently acquired by laser scanners, but is unsuitable for direct input to deep neural networks due to its varying number of points and unordered permutation. Previous works usually convert a point cloud into voxels or a collection of views from multiple perspectives, so that these regular data can be further processed by powerful convolutional neural networks. Besides the time consumed in the conversion, shape information is unavoidably lost, since the newly generated data cannot cover all details of the original geometry.
Independent of networks that deal with regular data, point-based methods, like PointNet (Qi et al., 2017a), PointCNN (Li et al., 2018) and PointSIFT (Jiang et al., 2018), create a new paradigm for analyzing unordered point clouds. Despite their remarkable performance, they limit themselves to aligned 3D point clouds with a canonical orientation. Even for the commonly used ModelNet dataset (Wu et al., 2015), the orientations of the models are manually and roughly aligned, and dealing with orientation brings in many inconveniences. Existing methods usually apply heavy data augmentation to make the model robust to rotation. In our preliminary experiments, however, we find that rotating the point cloud along several axes may cause some degree of degradation in performance (see Figure 2). This suggests that such networks learn too much orientation information and obscure the intrinsic geometric characteristics. PRIN (You et al., 2018) translates the sparse point signal into voxel grids by Density-Aware Adaptive Sampling and employs Spherical Voxel Convolution to obtain approximately rotation-invariant features for every point. Such operations only improve robustness to orientation, but cannot achieve strict rotation invariance. PPF-FoldNet (Deng et al., 2018) associates every point with a reference point and uses the point pair feature (Drost et al., 2010) to substitute for the original coordinates. Despite its strict rotation invariance, it is easy to construct different point pairs that are mapped to the same high-dimensional feature, which we believe impairs the representational capability.

In this paper, we propose the point projection feature. Specifically, we select three main axes of the object and project every point onto these axes to obtain a 3-dimensional feature, together with the norm of the vector from the central point to this point. The intuition is that the relative location relationship between points and the selected axes stays fixed under rotation. Thus, the original 3D coordinates are mapped to a specially designed 4D feature space, in which the representation of every point is invariant to orientation. We use this mapping in two branches. In the main branch, we map all points to the 4D space and feed the new representations to a PointNet-based backbone to extract global features; in the side branch, we find K nearest neighboring points for every point and apply a graph aggregation operation to perceive local shape structure. Thus, the final encoded feature is independent of the orientation of the input point cloud. In addition, we hold the opinion that points at different positions are of unequal importance for geometry perception. Specifically, corner points and edges are more visually salient than points in flat regions, and automatically emphasizing such key points is essential for improving the quality of the obtained feature. Since there is no evidence that PointNet-like networks automatically detect key points without extra supervision, we manually design a simple but effective key point detector to guide the neural network to focus on such points.

Major contributions of this paper are threefold. First, we propose the point projection feature, a rotation-invariant representation that encodes the original coordinates of the point cloud. Second, we design a graph aggregation operation to mine local structure and explicitly introduce a key point descriptor to emphasize the regions that are crucial for recognition. Third, extensive experiments demonstrate the superiority of our method in dealing with rotated point clouds.
2. Related Work
2.1. Deep learning on regular 3D data
3D shape analysis relies on the quality of features extracted from 3D shapes. The appearance of large-scale 3D shape repositories and the development of hardware make it possible to leverage powerful deep networks to understand 3D data, and deep-feature-based methods outperform traditional handcrafted descriptors in most 3D vision tasks. Pioneering works
(Maturana and Scherer, 2015; Wu et al., 2015; Qi et al., 2016) typically operate on voxels, since voxels are regularly arranged and suitable to feed into 3D convolutional networks. However, 3D convolution occupies far more memory than 2D convolution due to the extra dimension, which limits the resolution of the voxels that can be processed. In addition, a 3D shape is perceived through its surface, so operating on the elements inside the surface is a waste of computing resources. Another intuitive idea is to convert 3D shapes into a collection of views from multiple perspectives, after which proven 2D convolution techniques can be adopted (Su et al., 2015; Feng et al., 2018). View-based methods adopt view pooling to eliminate the order of views and become increasingly robust to orientation as the number of views grows. Even so, they can only form a global descriptor, meaning that they cannot carry out fine-grained tasks like point labeling and matching.

2.2. Deep learning on point clouds
Different from other regular 3D representations, it is difficult to mine local geometry from point clouds and to process them with deep neural networks. Recently, researchers have shown increasing interest in point cloud processing (Hermosilla et al., 2018; Yin et al., 2018; Roveri et al., 2018). PointNet (Qi et al., 2017a)
is a pioneering work on point cloud learning. The main idea of PointNet is to use pointwise convolution to map the original 3D coordinates to a high-dimensional feature space, followed by a max pooling or average pooling operation to eliminate the effect of point permutation. Despite its capacity to extract a global feature representing the point cloud, it neglects the formation of local shape descriptors, making it hard to distinguish tiny differences between shapes with similar contours. Several follow-ups attempt to mine the local structure of point clouds in different ways. PointNet++
(Qi et al., 2017b) divides the whole point set into subsets, and a simplified PointNet is applied repeatedly to each subset. These local features are grouped to make up a global representation. Due to the complicated process of division, grouping, and repeated forward propagation, PointNet++ becomes time-consuming and sensitive to tuning, which often leads to worse results than PointNet in our preliminary experiments. PointCNN (Li et al., 2018) proposes to find the K nearest neighbor points of every point, from which a transformation matrix is learned to re-permute these points, aiming to achieve permutation equivariance and perceive local regions. Since premultiplying the original feature matrix by the learned transformation matrix amounts to swapping or linearly recombining the rows of the original matrix, permutation equivariance cannot be guaranteed. KCNet (Shen et al., 2018) proposes a shallow Kernel Correlation (KC) layer and a KNN graph to incorporate local features. We believe that deepening the KC layer may help mine more abstract and discriminative signals. All methods mentioned above need orientation-aligned point clouds as input, which limits their application in practice when the orientation prior is unknown. PRIN (You et al., 2018) resorts to Spherical Voxel Convolution to extract features that are robust to orientation without data augmentation during training, but its performance degrades when testing with rotated data, indicating that it does not achieve strictly rotation-invariant representations.

3. Method Description
In this section, we introduce the proposed SRINet. The goal of our method is to obtain rotation-invariant representations of 3D point clouds, which can be used in applications ranging from classification to segmentation. We map the 3D coordinates into a 4D point projection feature space and mine features in both local and global receptive fields. Details are described in the following subsections.
3.1. Point Projection Feature
Suppose the input point cloud is a set of points P = {p_1, …, p_N} with random orientation, and we put its mass center at the origin. The coordinates are also interpreted as vectors starting from the origin. From the input vectors, we can freely choose three linearly independent axes (orthogonality is not necessary). Without loss of generality, we choose the vector with the maximum norm as axis a_1, the vector with the minimal norm as axis a_2, and the cross product of a_1 and a_2 as axis a_3. These three axes are scaled to unit norm. Clearly, no matter how the 3D object rotates, the relative location relationship between these axes and the points stays consistent. Then the original point cloud is encoded as a collection of point projection features

F = { φ(p_i, A) : i = 1, …, N },   (1)

where A = (a_1, a_2, a_3) represents the three axes and φ denotes the point projection mapping

φ(p_i, A) = ( p_i · a_1, p_i · a_2, p_i · a_3, ‖p_i‖ ).   (2)

We do not further calculate the angles between vectors because such patterns are difficult for neural networks to learn. Obviously, different points will not collide after being mapped to the 4-dimensional feature space.
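A minimal NumPy sketch of this mapping, with brute-force axis selection following the rule above (maximum-norm vector, minimum-norm vector, and their cross product); function and variable names are our own:

```python
import numpy as np

def point_projection_features(points):
    """Map an (N, 3) point cloud to (N, 4) rotation-invariant features:
    projections onto three shape-derived axes plus the distance to the
    mass center. Illustrative sketch, not the authors' implementation."""
    p = points - points.mean(axis=0)            # move mass center to origin
    norms = np.linalg.norm(p, axis=1)
    a1 = p[norms.argmax()]                      # axis 1: maximum-norm vector
    a2 = p[norms.argmin()]                      # axis 2: minimum-norm vector
    a3 = np.cross(a1, a2)                       # axis 3: cross product of the two
    axes = np.stack([a1, a2, a3])
    axes /= np.linalg.norm(axes, axis=1, keepdims=True)   # scale to unit norm
    return np.concatenate([p @ axes.T, norms[:, None]], axis=1)
```

Because the axes are selected from the points themselves, applying any proper rotation to the input leaves the returned features unchanged.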
Proposition 1. The point projection feature is invariant to rotation of the point cloud.
Proof. Consider a point p and the three selected axes a_1, a_2, a_3. The first three components of the 4D feature are

d_j = p · a_j,  j = 1, 2, 3,   (3)

where ‖a_j‖ = 1. For simplicity, we denote d = (d_1, d_2, d_3)ᵀ, and thus d = Ap, where A is the matrix constructed from the axis vectors:

A = (a_1, a_2, a_3)ᵀ.   (4)

Given the matrix A and the vector d, the point p can be obtained, for instance by applying singular value decomposition to A and inverting it. Note that the axes are determined by the point cloud itself, so the components of d keep fixed as long as the axes keep fixed relative to the points. Rotating the original point cloud with an orthogonal rotation matrix R maps p to Rp and each axis a_j to Ra_j, and the result does not change: (Ra_j) · (Rp) = a_jᵀRᵀRp = a_j · p = d_j, while the fourth component ‖Rp‖ = ‖p‖ is likewise unchanged.

Proposition 2. Given the 4D point projection feature and the 3 selected axes, the original point can be uniquely identified.
Proof. By constructing the matrix A and solving the linear system Ap = d (e.g., via SVD), the coordinates of the point p can be easily calculated, since the three axes are linearly independent and A is therefore invertible. Note that the solution is unique rather than determined only up to an orthogonal matrix, because the three axes are fixed.
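Proposition 2 can be checked numerically: stacking the (linearly independent) axes as rows of a matrix, the three projection values form a linear system that a plain solve inverts. A small sketch under that assumption, with names of our own:

```python
import numpy as np

def recover_point(d, axes):
    """Invert the projection given the 3 projection values d = axes @ p
    and the (3, 3) matrix of unit-norm axis rows. SVD would work equally
    well; a direct solve suffices because the axes are independent."""
    return np.linalg.solve(axes, d)
```

Together with the stored norm this recovers the original coordinates exactly, which is what makes the 4D feature lossless given the axes.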
3.2. Exploiting local structure by graph aggregation
In PointNet, a transform matrix is learned from the whole point set to integrate the features attached to every point, which is considered to impair the capability of perceiving local structure (Qi et al., 2017b; Shen et al., 2018). Thus, it is necessary to extract features from local regions. Here we define the local region as the K nearest neighboring points of a center point, and subtract the coordinates of the center point to remove the dependence on the absolute position within the whole point cloud. Intuitively, neighboring points jointly construct the local geometric structure. Mining local geometric features requires exchanging information between neighboring points while eliminating the effect of point permutation.
Only applying graph convolution among local points lacks interaction between features, and we conjecture that combining similar signatures may produce more salient ones. Inspired by PointNet, which learns a transform matrix and postmultiplies the feature matrix by it, we learn a similar transform matrix T from the local points and premultiply the feature matrix by it. Since every row of the feature matrix X represents the feature vector attached to one point, premultiplying X by T linearly recombines the features across points. After that, graph convolution and pooling operations are used for feature update and fusion, which can be formulated as

y = P( F( TX ) ),   (5)

where P denotes the pooling operation across neighboring points and F denotes a graph-based convolution applied to each of the local points,

F(X)_i = σ( Σ_{j=1}^{K} θ_{ij} x_j ),   (6)

where x_j is the j-th row of the input feature matrix, θ_{ij} are learnable weights, and σ is a nonlinearity.
Here, we update each of the local features using the points in the neighboring region around the center point, which is slightly different from the original definition of GCN (Kipf and Welling, 2016; Bronstein et al., 2017).
Compared to the EdgeConv operation proposed in (Wang et al., 2018), which updates the local feature point by point, our newly generated feature for a point takes all points in the local region into consideration. We use max pooling to achieve permutation invariance and to screen out the most salient signature among the local points. This aggregated signature reflects a high-level abstract feature of the local region and can be concatenated with the global feature to form a complete point cloud representation.
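The data flow above can be sketched at the shape level: premultiply the (K, C) local feature matrix by a learned (K, K) transform, apply a shared pointwise map as a stand-in for the graph convolution, and max-pool across neighbors. T and W would be learned in the real network; here they are placeholder arrays and the exact convolution form is an assumption:

```python
import numpy as np

def graph_aggregate(local_feats, T, W):
    """local_feats: (K, C) features of the K neighbors of one center point.
    T: (K, K) transform that linearly recombines rows (neighbor features).
    W: (C, C_out) shared weights standing in for the graph convolution.
    Returns a (C_out,) local descriptor."""
    x = T @ local_feats                # premultiply: recombine neighbor features
    x = np.maximum(x @ W, 0.0)         # shared map + ReLU (stand-in for F)
    return x.max(axis=0)               # max pool across the K neighbors
```

The max pooling at the end is what makes the descriptor independent of the order in which the K neighbors are listed.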
3.3. Key points detection
In this context, key points denote the points lying on the edges or corners of the object. Existing works, such as (Liu et al., 2018), use an attention module to highlight regions that are beneficial for recognition. In such data-driven methods, the degree of importance of each point is learned automatically without ground-truth key points for supervision, making it difficult to tell which points are truly important. Besides, it is hard to say whether the improvement in performance comes from the attention mechanism or from the increased number of parameters. We believe that more accurate information can be obtained by exploiting the intrinsic properties of the point cloud. The commonly used 3D corner detector, Harris 3D (Sipiran and Bustos, 2011), achieves satisfying results, but is time-consuming and depends on parameter settings. It is universally acknowledged that point normals reflect shape features. Thus, we assign a response to every point by considering the changes of normals in its neighboring region
r_i = (1/K) Σ_{j ∈ N(i)} ( 1 − n_i · n_j ),   (7)

where n_i denotes the normal at point p_i and N(i) is the set of its K nearest neighbors. Though simple, we find that this detector works well; the responses over a point cloud are visualized in Figure 6, where high responses appear along edges and especially at corners. The calculated responses are integrated into the global representation of the point cloud before the global max pooling operation.
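One simple reading of this normal-based response, averaging the deviation 1 − ⟨n_i, n_j⟩ over each point's k nearest neighbors, can be sketched with brute-force neighbor search (the exact response form and all names here are our assumptions):

```python
import numpy as np

def keypoint_response(points, normals, k=8):
    """Assign each point a response measuring how much normals vary in its
    neighborhood: ~0 on flat regions, high near edges and corners.
    Brute-force KNN; illustrative only."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)                  # (N, N) pairwise distances
    idx = np.argsort(dist, axis=1)[:, 1:k + 1]            # k nearest, excluding self
    cos = np.einsum('ic,ikc->ik', normals, normals[idx])  # cosines with neighbor normals
    return (1.0 - cos).mean(axis=1)
```

On a perfectly flat patch all normals agree and the response vanishes, matching the intuition that flat regions carry little shape information.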
3.4. Network Architecture
The overall pipeline of the proposed method is illustrated in Figure 1. The input point cloud is fed into two branches to extract both global and local features. Both branches begin with the point projection operation, mapping the 3D coordinates into the 4-dimensional feature space. For the backbone, we use a multilayer perceptron (MLP) to abstract pointwise features. For the side branch, we leverage the Graph Aggregation operation, which first learns a transform matrix from local points and premultiplies the feature matrix by it to recombine signatures, followed by graph convolutions and a max pooling layer to update features and form a local descriptor. The features from the two branches are concatenated and then combined with the key point response values in one of two ways: pointwise multiplication or summation. We use a global max pooling operation to eliminate the effect of point permutation and obtain a complete representation of the point cloud. The classification and segmentation tasks share the same representation. In the classification task, three extra fully-connected layers serve as a classifier. In the segmentation task, we replicate the representation, concatenate it with the features of the previous layer, and feed it to a three-layer MLP to produce per-point scores.
4. Experiments
In this section, we validate the effectiveness of the proposed architecture on point cloud classification and part segmentation tasks, and conduct an ablation study to evaluate the contribution of each component. SRINet is implemented in TensorFlow and runs on a GTX 1080Ti. In all experiments we use the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.001, decayed by a factor of 0.3 every 20 epochs. For data augmentation, noise is added to perturb the point locations. We train the networks for 250 epochs to guarantee convergence.
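Assuming the schedule is a multiplicative step decay (our reading of "decrease by 0.3"), the learning rate as a function of the epoch would be:

```python
def learning_rate(epoch, base=1e-3, decay=0.3, step=20):
    """Step-decay schedule: multiply the base rate by `decay`
    every `step` epochs. The multiplicative form is an assumption."""
    return base * decay ** (epoch // step)
```

Such a schedule keeps the rate constant within each 20-epoch window, which pairs naturally with Adam's per-parameter adaptivity.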
4.1. Point Cloud Classification
Dataset. We conduct classification experiments on ModelNet40 (Wu et al., 2015). The dataset consists of 12,311 CAD models from 40 categories; 9,843 of them are used for training and 2,468 for testing. Note that the orientations of these models are only roughly aligned. We follow the same experimental settings as (Qi et al., 2017a) and uniformly sample 1024 points along with their normals from each model as the network input.
Table 1 compares our method with several state-of-the-art works. NR/NR means no rotation of the point clouds in either training or testing; NR/AR means training without rotation augmentation and testing with arbitrary rotations. Our method achieves the highest accuracy when testing with arbitrary rotations, outperforming other methods by a large margin, and achieves results comparable to PointNet on non-rotated data. Moreover, we obtain identical accuracy in the rotated and non-rotated test settings, which means the obtained point cloud representation is strictly rotation-invariant. PRIN degrades only slightly when testing with rotations and shows strong robustness; the other works, however, fail to recognize objects with unseen orientations.
Table 1. Classification accuracy (%) on ModelNet40.

Method         | NR/NR | NR/AR
-------------- | ----- | -----
—              | 88.45 | 12.47
—              | 89.42 | 21.35
—              | 92.60 | 10.53
—              | 86.20 |  8.49
PRIN           | 80.13 | 69.85
Ours (SRINet)  | 87.01 | 87.01
Table 2. Part segmentation on ShapeNet (accuracy % / mIoU %) under different rotation settings.

Method         | NR/NR       | NR/AR       | R10         | R20         | R30         | Input size
-------------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------
—              | 93.42/83.43 | 45.66/28.26 | 61.02/41.59 | 67.85/50.54 | 74.91/58.66 | 2048×3
—              | 94.00/84.62 | 60.15/38.16 | 69.06/47.26 | 70.01/49.26 | 70.82/49.95 | 1024×3
—              | 93.78/83.53 | 47.13/30.41 | 61.33/41.40 | 68.10/50.76 | 73.44/58.03 | 2048×33
—              | 90.33/82.36 | 40.66/24.76 | 59.11/38.70 | 64.50/47.60 | 69.33/51.06 | —
PRIN           | 88.97/73.96 | 78.13/57.41 | 80.94/64.25 | 83.83/67.68 | 84.76/68.76 | 2048×3
Ours (SRINet)  | 89.24/76.95 | 89.24/76.95 | —           | —           | —           | 2048×3
4.2. Part Segmentation
Dataset. We evaluate SRINet on the part segmentation task on the ShapeNet part dataset (Yi et al., 2016). The dataset consists of 16,881 3D point cloud objects from 16 categories. The objects are segmented into 50 parts in total, with no overlapping parts across categories, and each object contains no more than 5 parts. A semantic label is assigned to every point. We use the processed dataset provided by (Qi et al., 2017a) and randomly sample 2048 points with their normals from each object.
The rotation-invariant representations can also be used for the part segmentation task. We compare our work with PointNet (Qi et al., 2017a), PointNet++ (Qi et al., 2017b), SyncSpecCNN (Yi et al., 2017), Kd-Network (Klokov and Lempitsky, 2017) and PRIN (You et al., 2018). We follow the same evaluation settings as (You et al., 2018), with three groups of settings:
1. Train and test with no rotations.
2. Train with no rotation augmentations and test with arbitrary rotations.
3. Train with 10/20/30 rotations for every model and test with arbitrary rotations.
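For the arbitrary-rotation settings above, a uniformly random proper rotation (orthogonal, determinant +1) is needed; one common sketch draws it from the QR decomposition of a Gaussian matrix:

```python
import numpy as np

def random_rotation(rng):
    """Draw a random 3D rotation matrix (orthogonal, det = +1)
    via QR decomposition of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.normal(size=(3, 3)))
    Q = Q * np.sign(np.diag(R))          # fix column signs to make Q unique
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]               # flip one axis: proper rotation only
    return Q
```

Applying `points @ random_rotation(rng).T` then produces a rotated copy of a point cloud for training augmentation or rotated-test evaluation.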
The results are shown in Table 2. State-of-the-art methods such as PointNet use orientation-aligned point clouds as input and achieve good performance in the original task, but show great performance degradation when dealing with rotated point clouds. Training with more rotation augmentations helps improve robustness to rotation, but the results are still worse than PRIN and ours, and the improvement comes at the price of an increased computational burden. PRIN is not sensitive to rotations, yet it fails to achieve strict rotation invariance. Our method is unaffected by orientation and obtains the best performance in segmenting rotated point clouds. Our segmentation results are visualized in Figure 7; these models are trained without rotation augmentation, and the input point clouds are rotated by a random angle at test time.
Table 3. Ablation study: removing graph aggregation (GA) or key points detection (KPD).

Task    | Classification | Segmentation
Metric  | Acc(%)         | Acc(%) | IoU(%)
Full    | 87.01          | 89.24  | 76.95
w/o GA  | 82.22          | 87.72  | 74.30
w/o KPD | 85.59          | 88.73  | 76.29
4.3. Ablation Study
Graph Aggregation. The graph aggregation operation is introduced in the side branch to exploit local geometric structure, aggregating the features attached to the neighboring points around each center. We find it useful in the point cloud recognition task, meaning that incorporating local structure helps perceive the global geometry. It also helps in the segmentation task, though the improvement there is slight; this is because precise segmentation requires global perception, and local information plays only a secondary role. The quantitative results for removing the Graph Aggregation module are shown in Table 3.

Table 4. Combining key point responses with point features by multiplication vs. summation.

Method         | Classification | Segmentation
Metric         | Acc(%)         | Acc(%) | IoU(%)
Multiplication | 85.26          | 88.81  | 75.52
Summation      | 87.01          | 89.24  | 76.95
Key Points Detection. Intuitively, mining the skeleton and key points of an object helps in recognizing the whole shape. We directly define the key point response value instead of adopting a learnable neural-network-based attention mechanism. We combine the response values with the global point cloud representations in two ways: multiplication and summation. As shown in Table 4, combination by summation proves useful, whereas combination by multiplication results in worse performance than having no key point detection module at all. We also remove the key points detection module to observe its effect on the whole model: without detecting key points, the classification accuracy drops (from 87.01% to 85.59%) and the segmentation IoU drops (from 76.95% to 76.29%). Though simple, this module brings a stable improvement for both classification and segmentation.
4.4. The effect of parameters
Number of nearest neighboring points. We need to find the K nearest neighboring points of each point in the Graph Aggregation operation. From Table 5, we can see that the number of neighboring points is not crucial for classification, but greatly affects segmentation: as the number of neighboring points increases, segmentation performance keeps going up. We conjecture that segmentation relies on the receptive field of the local region, and a broader receptive field may lead to better perception of the global shape.
Table 5. Effect of the number K of nearest neighboring points.

KNN point number      | 16    | 25    | 36    | 49    | 64
Classification Acc(%) | 86.85 | 87.01 | 86.93 | 86.89 | 86.56
Segmentation Acc(%)   | 88.19 | 88.42 | 88.89 | 89.00 | 89.24
Segmentation IoU(%)   | 74.96 | 75.34 | 76.19 | 76.56 | 76.95
Number of input points. We vary the number of sampled points in the input point cloud to see whether the proposed model is robust to point cloud resolution. The number ranges from 256 to 2048, as shown in Table 6. We obtain the best results for both tasks when the number of points is set to 1024, with only a tiny swing on either side of 1024. This suggests that SRINet is capable of extracting valid local information despite different distributions of local regions, and that sampling 1024 points from the original point cloud is sufficient to cover the whole object.
Table 6. Effect of the number of input points.

Point number          | 256   | 512   | 1024  | 2048
Classification Acc(%) | 85.87 | 86.32 | 87.01 | 85.83
Segmentation Acc(%)   | 88.71 | 88.97 | 89.28 | 89.24
Segmentation IoU(%)   | 77.07 | 77.24 | 77.28 | 76.95
4.5. Comparing with Point Pair Feature
Several existing works adopt the point pair feature to reformulate the coordinates of a point cloud and achieve strict rotation invariance (Deng et al., 2018; Birdal and Ilic, 2015, 2017). Here, we compare it with the proposed point projection feature. In preliminary experiments, we found it difficult for neural networks to extract discriminative patterns from the original point pair features, which involve calculating angles between two defined vectors. Thus, we replace the angles with their cosine values, which can be calculated as the inner product of two normalized vectors. For a fair comparison, we use the PointNet architecture and conduct the classification task on ModelNet40. The original 3D coordinates of the point cloud are converted to 4D point pair features and point projection features respectively, and then fed to the network. The network using point projection features attains a clearly higher accuracy than its point pair feature counterpart, which implies that the proposed point projection feature preserves more relative location information between points and shows great superiority for point cloud recognition.
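The cosine variant used in this comparison can be sketched as follows: the distance between the two points plus three cosines obtained from inner products of normalized vectors (the function name is our own; the components follow the PPF of Drost et al., 2010 with angles replaced by cosines):

```python
import numpy as np

def point_pair_feature(p1, n1, p2, n2):
    """4D cosine-variant point pair feature for points p1, p2 with unit
    normals n1, n2: [distance, cos(n1, d), cos(n2, d), cos(n1, n2)]."""
    d = p2 - p1
    dist = np.linalg.norm(d)
    d_hat = d / dist                    # unit vector from p1 to p2
    return np.array([dist, n1 @ d_hat, n2 @ d_hat, n1 @ n2])
```

All four components depend only on relative quantities, so the feature is invariant to any rigid rotation of the pair, at the cost of discarding the points' absolute configuration.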
5. Conclusion
In this paper, we proposed SRINet to extract strictly rotation-invariant representations of point clouds. The point projection feature was introduced to reformulate the original 3D coordinates. We used graph aggregation to mine local structure and key point detection to guide the network in perceiving the 3D shape. Experiments on classification and part segmentation tasks showed that our method outperforms other methods in dealing with rotated point clouds. In future work, the choice of more stable axes needs further exploration to reduce the information loss when converting 3D coordinates to point projection features. Besides, better understanding the point projection feature and generalizing it to more applications is also worth pursuing.
6. Acknowledgments
This work was supported by National Key Research and Development Program of China (2017YFB1002601), National Natural Science Foundation of China (Grant No.: 61672043 and 61672056) and Key Laboratory of Science, Technology and Standard in Press Industry (Key Laboratory of Intelligent Press Media Technology).
References

Birdal, T. and Ilic, S. (2015). Point pair features based object detection and pose estimation revisited. In 2015 International Conference on 3D Vision, pp. 527–535.

Birdal, T. and Ilic, S. (2017). CAD priors for accurate and flexible instance reconstruction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 133–142.

Bronstein, M. M. et al. (2017). Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34(4), pp. 18–42.

Deng, H. et al. (2018). PPF-FoldNet: unsupervised learning of rotation invariant 3D local descriptors. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 602–618.

Drost, B. et al. (2010). Model globally, match locally: efficient and robust 3D object recognition. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 998–1005.

Feng, Y. et al. (2018). GVCNN: group-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 264–272.

Hermosilla, P. et al. (2018). Monte Carlo convolution for learning on non-uniformly sampled point clouds. In SIGGRAPH Asia 2018 Technical Papers, p. 235.

Jiang, M. et al. (2018). PointSIFT: a SIFT-like network module for 3D point cloud semantic segmentation. arXiv preprint arXiv:1807.00652.

Kingma, D. P. and Ba, J. (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kipf, T. N. and Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Klokov, R. and Lempitsky, V. (2017). Escape from cells: deep Kd-networks for the recognition of 3D point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, pp. 863–872.

Li, Y. et al. (2018). PointCNN: convolution on X-transformed points. In Advances in Neural Information Processing Systems, pp. 828–838.

Liu, X. et al. (2018). Point2Sequence: learning the shape representation of 3D point clouds with an attention-based sequence to sequence network. arXiv preprint arXiv:1811.02565.

Maturana, D. and Scherer, S. (2015). VoxNet: a 3D convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 922–928.

Qi, C. R. et al. (2017a). PointNet: deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660.

Qi, C. R. et al. (2016). Volumetric and multi-view CNNs for object classification on 3D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5648–5656.

Qi, C. R. et al. (2017b). PointNet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108.

Roveri, R. et al. (2018). PointProNets: consolidation of point clouds with convolutional neural networks. Computer Graphics Forum 37, pp. 87–99.

Shen, Y. et al. (2018). Mining point cloud local structures by kernel correlation and graph pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4548–4557.

Sipiran, I. and Bustos, B. (2011). Harris 3D: a robust extension of the Harris operator for interest point detection on 3D meshes. The Visual Computer 27(11), p. 963.

Su, H. et al. (2015). Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 945–953.

Wang, Y. et al. (2018). Dynamic graph CNN for learning on point clouds. arXiv preprint arXiv:1801.07829.

Wu, Z. et al. (2015). 3D ShapeNets: a deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1912–1920.

Yi, L. et al. (2016). A scalable active framework for region annotation in 3D shape collections. ACM Transactions on Graphics (TOG) 35(6), p. 210.

Yi, L. et al. (2017). SyncSpecCNN: synchronized spectral CNN for 3D shape segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2282–2290.

Yin, K. et al. (2018). P2P-Net: bidirectional point displacement net for shape transform. ACM Transactions on Graphics (TOG) 37(4), p. 152.

You, Y. et al. (2018). PRIN: pointwise rotation-invariant network. arXiv preprint arXiv:1811.09361.