3D shape analysis has attracted increasing attention with the advancement of 3D sensors and computing resources. Extracting discriminative features from 3D models or scenes is in growing demand for widespread applications such as autonomous vehicles, robotics, and many other fields. As a fundamental format for representing 3D models, the point cloud can be conveniently acquired by laser scanners, but it is unsuitable to be fed directly into deep neural networks due to its uncertain point number and unordered permutation. Previous works usually convert a point cloud into voxels or a collection of views from multiple perspectives; these regular data can then be further processed by powerful convolutional neural networks. Besides the time consumed in the transformation, shape information is unavoidably lost, since the newly-generated data cannot cover all details of the original geometry.
Independent of networks that deal with regular data, point-based methods, such as PointNet (Qi et al., 2017a), PointCNN (Li et al., 2018) and PointSIFT (Jiang et al., 2018), create a new paradigm for analyzing unordered point clouds. Despite their remarkable performance, they limit themselves to processing aligned 3D point clouds with canonical orientations. Even in the commonly used ModelNet dataset (Wu et al., 2015), the orientations of the models are manually and roughly aligned, so handling arbitrary orientations brings in many inconveniences. Existing methods usually apply plenty of data augmentation to make the model robust to rotation. In our preliminary experiments, however, we find that rotating the point cloud along several axes may cause some degree of performance degradation (see Figure 2). This suggests that the networks learn excessive orientation information, which obscures the intrinsic geometric characteristics. PRIN (You et al., 2018) translates the sparse point signal into voxel grids by Density-Aware Adaptive Sampling and employs Spherical Voxel Convolution to obtain approximately rotation-invariant features for every point. Such operations only improve robustness to orientation but cannot achieve strict rotation invariance. PPF-FoldNet (Deng et al., 2018) associates every point with a reference point and uses the point pair feature (Drost et al., 2010) to substitute the original coordinates. Despite its strict rotation invariance, it is easy to construct different point pairs that are mapped to the same high-dimensional feature, which we believe impairs the representational capability. In this paper, we propose the point projection feature. Specifically, we select three main axes of the object and project every point onto these axes to obtain three projection values, which, together with the norm of the vector from the central point to this point, form a 4-dimensional feature. The intuition is that the relative location relationship between the points and the selected axes stays fixed under rotation. Thus, the original 3D coordinates are mapped to a specially-designed 4D feature space in which the representation of every point is invariant to orientation. We use this mapping in two branches. In the main branch, we map all points to the 4D space and feed the new representations to a PointNet-based backbone to extract global features; in the side branch, we find the K nearest neighboring points for every point and apply a graph aggregation operation to perceive local shape structure. Thus, the final encoded feature is independent of the orientation of the input point cloud. In addition, we hold the opinion that points at different positions are of unequal importance for geometry perception. Specifically, corner points and edges are more visually salient than points in flat regions, and automatically emphasizing such key points is essential for improving the quality of the obtained feature. Since there is no evidence that PointNet-like networks can automatically detect key points without extra supervision, we manually design a simple but effective key point detector to guide the neural network to focus on such key points.
The major contributions of this paper are threefold. First, we propose the point projection feature, a rotation-invariant representation that encodes the original coordinates of a point cloud. Second, we design a graph aggregation operation to mine local structure and explicitly introduce a key point descriptor to emphasize the regions that are crucial for recognition. Third, extensive experiments demonstrate the superiority of our method in dealing with rotated point clouds.
2. Related Work
2.1. Deep learning on regular 3D data
3D shape analysis relies on the quality of the features extracted from 3D shapes. The appearance of large-scale 3D shape repositories and the development of hardware make it possible to leverage powerful deep networks to understand 3D data, and deep-feature-based methods outperform traditional hand-crafted descriptors in most 3D vision tasks. Pioneering works (Maturana and Scherer, 2015; Wu et al., 2015; Qi et al., 2016) are typically based on voxels, since voxels are regularly arranged and suitable to feed into 3D convolution networks. However, 3D convolution occupies far more memory than 2D convolution due to the extra dimension, which limits the resolution of the voxels that can be processed. In addition, a 3D shape is perceived through its surface, so operating on the elements inside the surface wastes computing resources. Another intuitive idea is to convert 3D shapes into a collection of views from multiple perspectives, after which the proven techniques of 2D convolution can be adopted (Su et al., 2015; Feng et al., 2018). View-based methods adopt view pooling to eliminate the order of views and become gradually more robust to orientation as the number of views increases. Even so, they can only form a global descriptor, meaning that they cannot carry out delicate tasks such as point labeling and matching.
2.2. Deep learning on point cloud
Different from other regular 3D representations, the point cloud makes it difficult to mine local geometry and to apply deep neural networks. Recently, researchers have shown increasing interest in point cloud processing (Hermosilla et al., 2018; Yin et al., 2018; Roveri et al., 2018). PointNet (Qi et al., 2017a)
is a pioneering work in the study of point clouds. The main idea of PointNet is to use point-wise convolution to map the original 3D coordinates to a high-dimensional feature space, followed by a max pooling or average pooling operation to eliminate the effect of point permutation. Despite its capacity to extract a global feature representing the point cloud, it neglects the formation of local shape descriptors, making it hard to distinguish tiny differences between shapes with analogous contours. Several follow-ups attempt to mine the local structure of point clouds in different ways. PointNet++ (Qi et al., 2017b) divides the whole point set into subsets and applies a simplified PointNet repeatedly to each subset; these local features are then grouped to make up a global representation. Due to the complicated process of division and grouping and the repeated forward propagation, PointNet++ becomes time-consuming and sensitive to tuning, which often leads to worse results than PointNet in our preliminary experiments. PointCNN (Li et al., 2018) proposes to find the K nearest neighboring points for every point and learn from them a transformation matrix that re-permutes these points, aiming to achieve permutation equivariance and perceive local regions. However, since pre-multiplying the original feature matrix by the learned transformation matrix amounts to swapping or linearly recombining the features in each row of the original matrix, permutation equivariance cannot be guaranteed. KCNet (Shen et al., 2018) proposes a shallow Kernel Correlation (KC) layer and a K-NN graph to incorporate local features; we believe that deepening the KC layer may help mine more abstract and discriminative signals. All the methods mentioned above need orientation-aligned point clouds as input, which limits their practical use when the prior orientation is unknown.
PRIN (You et al., 2018) resorts to Spherical Voxel Convolution to extract features that are robust to orientation without data augmentation in the training process, but its performance degrades when testing with rotated data, indicating that it cannot achieve strictly rotation-invariant representations.
3. Method Description
In this section, we introduce the proposed SRINet. The goal of our method is to obtain rotation-invariant representations for 3D point clouds, which can be used in applications ranging from classification to segmentation. We map the 3D coordinates into a 4D point projection feature space and mine the features in both local and global receptive fields. We describe the details in the following subsections.
3.1. Point Projection Feature
Suppose the input point cloud is a set of points with random orientation, and we translate its mass center to the origin. The coordinates are then interpreted as vectors starting from the origin. From these vectors, we can freely choose three linearly independent axes (orthogonality is not necessary). Without loss of generality, we choose the vector with the maximum norm as axis $a_1$, the vector with the minimum norm as axis $a_2$, and the cross product of $a_1$ and $a_2$ as axis $a_3$. These three axes are scaled to unit norm. Clearly, no matter how the 3D object rotates, the relative location relationship between these axes and the points remains consistent. The original point cloud is then encoded as a collection of point projection features
$$\{\phi(p_i)\}_{i=1}^{N},$$ where $\{a_1, a_2, a_3\}$ represents the three axes and $\phi$ denotes the point projection mapping $$\phi(p) = (p \cdot a_1,\ p \cdot a_2,\ p \cdot a_3,\ \|p\|).$$
We do not further calculate the angles between vectors because such patterns are difficult for neural networks to learn. Obviously, different points will not collide after being mapped to the 4-dimensional feature space.
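The point projection feature described above can be sketched numerically. This is a minimal reading of Section 3.1 in numpy; the tie-breaking for axis selection and the exact ordering of the four components are our assumptions, not the authors' released code.

```python
import numpy as np

def select_axes(points):
    """Pick three linearly independent axes from a centered point cloud:
    the point with maximum norm (a1), the point with minimum norm (a2),
    and their cross product (a3), each scaled to unit length."""
    norms = np.linalg.norm(points, axis=1)
    a1 = points[np.argmax(norms)]
    a2 = points[np.argmin(norms)]
    a3 = np.cross(a1, a2)
    axes = np.stack([a1, a2, a3])
    return axes / np.linalg.norm(axes, axis=1, keepdims=True)

def point_projection_feature(points):
    """Map each 3D point p to the 4D feature (p.a1, p.a2, p.a3, ||p||)."""
    points = points - points.mean(axis=0)        # move mass center to origin
    axes = select_axes(points)
    proj = points @ axes.T                       # (N, 3) projections onto axes
    norm = np.linalg.norm(points, axis=1, keepdims=True)
    return np.concatenate([proj, norm], axis=1)  # (N, 4) feature per point
```

Applying any proper rotation to the input cloud leaves the output of `point_projection_feature` unchanged, since the selected axes rotate together with the points.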
Proposition 1. Point Projection Feature is invariant to the rotation of point cloud.
Proof. Consider a point $p$ and the three selected axes $a_1, a_2, a_3$. The first three components of the 4D feature are
$$d_i = a_i^{\top} p, \quad i = 1, 2, 3,$$
and the fourth component is $\|p\|$. Stacking the unit axes as rows, we construct the matrix $A = [a_1, a_2, a_3]^{\top}$ so that the projection vector $d = (d_1, d_2, d_3)^{\top}$ is obtained as $d = Ap$. Note that the elements of $A$ keep fixed if the axes keep fixed. Rotating the original point cloud with an orthogonal rotation matrix $R$ maps $p$ to $Rp$ and each axis $a_i$ to $Ra_i$, i.e., $A$ to $AR^{\top}$, and the resulting projections do not change:
$$(AR^{\top})(Rp) = A(R^{\top}R)p = Ap = d.$$
The norm is likewise preserved, $\|Rp\| = \|p\|$. Hence the point projection feature is invariant to rotation.
Proposition 2. Given the 4D point projection feature and 3 selected axes, the original point can be uniquely identified.
Proof. By constructing the matrix $A = [a_1, a_2, a_3]^{\top}$ as above, the coordinate of point $p$ can be easily calculated from the first three feature components by solving the linear system $Ap = d$, e.g., via singular value decomposition; $A$ is invertible because the three axes are linearly independent. Note that the particular solution is unique, not merely determined up to an orthogonal matrix, since the three axes are settled.
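Proposition 2 can be checked numerically: with the three axes known, the first three feature components form a linear system with an invertible coefficient matrix, so the point is recovered uniquely. A small sketch (the function name is ours):

```python
import numpy as np

def recover_point(feature4d, axes):
    """Recover p from the 4D projection feature given the three axes.

    axes is a (3, 3) matrix with the (linearly independent) axes as rows,
    so the first three feature components satisfy axes @ p = d."""
    return np.linalg.solve(axes, feature4d[:3])
```

The fourth component, the norm, is redundant for recovery but can serve as a consistency check on the solution.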
3.2. Local structure exploiting by graph aggregation
In PointNet, a transform matrix is learned from the whole point set to integrate the features attached to every vertex, which is considered to impair the capability of perceiving local structure (Qi et al., 2017b; Shen et al., 2018). Thus, it is necessary to extract features from local regions. Here we define the local region as the K nearest neighboring points of a center point, and subtract the coordinate of the center point to discard the position relative to the whole point cloud. Intuitively, neighboring points together construct the local geometric structure. Mining local geometric features requires exchanging information between neighboring points while eliminating the effect of point permutation.
Only applying graph convolution among local points lacks interaction between features, and we conjecture that combining similar signatures may result in more salient ones. Inspired by PointNet, which learns a transform matrix and post-multiplies the feature matrix by it, we learn a similar transform matrix from the local points and pre-multiply the feature matrix. Since every row of the feature matrix represents the feature vector attached to a point, pre-multiplying the feature matrix by the transform matrix linearly recombines these features. After that, graph convolution and pooling operations are used for feature update and fusion, which can be formulated as
$$y = P\big(F(T X)\big),$$
where $X \in \mathbb{R}^{K \times C}$ is the local feature matrix, $T \in \mathbb{R}^{K \times K}$ is the learned transform matrix, $F$ denotes the graph-based convolution applied on each of the local points, and $P$ denotes the pooling operation across the neighboring points.
Here, we update each of the local features using the points in the neighboring region around the center point, which is slightly different from the original definition of GCN (Kipf and Welling, 2016; Bronstein et al., 2017).
Compared to the EdgeConv operation proposed in (Wang et al., 2018), which updates the local feature point by point, our newly generated feature of one point takes all points in the local region into consideration. We use max pooling to achieve permutation invariance and screen out the most salient signature among the local points. This aggregated signature reflects a high-level abstract feature of the local region and can be concatenated with the global feature to form a complete point cloud representation.
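The graph aggregation step above can be sketched shape-by-shape in numpy. Random matrices stand in for the learned transform and convolution weights, and the function names and dimensions are our assumptions; this only illustrates the data flow (gather K neighbors, subtract the center, pre-multiply by a K x K transform, apply a shared convolution, max-pool across neighbors).

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest points for every point (self included)."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, :k]

def graph_aggregate(points, k, out_dim, rng):
    """One graph aggregation pass: y_i = maxpool(relu(T @ X_i @ W))."""
    idx = knn_indices(points, k)
    local = points[idx] - points[:, None, :]  # (N, K, 3), center-relative
    T = rng.normal(size=(k, k))               # stand-in for the learned transform
    W = rng.normal(size=(3, out_dim))         # stand-in for shared conv weights
    mixed = T @ local                         # recombine neighbor features
    feat = np.maximum(mixed @ W, 0.0)         # (N, K, out_dim), ReLU
    return feat.max(axis=1)                   # max pool over neighbors -> (N, out_dim)
```

In the network, `T` and `W` would be trained end-to-end; max pooling over the K neighbors makes the output independent of neighbor ordering.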
3.3. Key points detection
In this context, key points denote the points lying on the edges or corners of the object. Existing works, such as (Liu et al., 2018), use attention modules to highlight regions that are beneficial for recognition. In such data-driven methods, the degree of importance of each point is learned automatically without ground-truth key points for supervision, making it difficult to tell which points are truly important. Besides, it is hard to say whether the improvement in performance comes from the attention mechanism or from the increased number of parameters. We believe that more accurate information can be obtained by exploiting the intrinsic properties of the point cloud. The commonly used 3D corner detector, Harris 3D (Sipiran and Bustos, 2011), achieves satisfying results but is time-consuming and depends on parameter settings. It is universally acknowledged that the normals of points reflect shape features. Thus, we assign a response to every point by considering the changes of the normals in its neighboring region. Though simple, we find that this works well; the response over a point cloud is visualized in Figure 6, with high responses appearing in edge regions, especially at corners. The calculated responses are integrated into the global representation of the point cloud before the global max pooling operation.
3.4. Network Architecture
The overall pipeline of the proposed method is illustrated in Figure 1. The input point cloud is fed into two branches to extract both global and local features. Both branches begin with the point projection operation, mapping the 3D coordinates into the 4-dimensional feature space. For the backbone, we use a multilayer perceptron (MLP) to abstract pointwise features. For the side branch, we leverage the graph aggregation operation, which first learns a transform matrix from the local points and pre-multiplies the feature matrix to recombine signatures, followed by graph convolutions and a max pooling layer to update features and form a local descriptor. The features from the two branches are concatenated and then decorated with the key point response values in one of two ways: pointwise multiplication or summation. We use a global max pooling operation to eliminate the effect of point permutation and obtain a complete representation of the point cloud. The classification and segmentation tasks share the same representation. In the classification task, three extra fully-connected layers serve as a classifier. In the segmentation task, we replicate the representation, concatenate it with the features from the previous layer, and feed the result to a three-layer MLP to produce scores for each point.
4. Experiments

In this section, we validate the effectiveness of the proposed architecture on point cloud classification and part segmentation tasks, and conduct an ablation study to evaluate the contribution of each component. SRINet is implemented in TensorFlow and runs on a GTX 1080Ti. We use the Adam (Kingma and Ba, 2014) optimizer with an initial learning rate of 0.001 and decay it by a factor of 0.3 every 20 epochs in all experiments. For data augmentation, noise is added to perturb the point locations. We train the networks for 250 epochs to guarantee convergence.
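The schedule above ("initial learning rate 0.001, decreased by 0.3 every 20 epochs") can be written as a step decay; whether "by 0.3" means multiplying by 0.3 is our reading of the original phrasing.

```python
def learning_rate(epoch, base_lr=0.001, decay=0.3, step=20):
    """Step-decay schedule: multiply the base rate by `decay`
    once every `step` epochs."""
    return base_lr * decay ** (epoch // step)
```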
4.1. Point Cloud Classification
Dataset. We conduct classification experiments on ModelNet40 (Wu et al., 2015). The dataset consists of 12,311 CAD models from 40 categories, of which 9,843 are split for training and 2,468 for testing. Note that the orientations of these models are only roughly aligned. We follow the same experimental settings as (Qi et al., 2017a). For each model, we uniformly sample 1024 points along with their normals as the network input.
Table 1 compares the results of our method with several state-of-the-art works. NR/NR means no rotation of the point clouds in either training or testing; NR/AR means training without rotation augmentation and testing with arbitrary rotations. Our method obtains the highest accuracy when testing with arbitrary rotations and outperforms the other methods by a large margin. We also achieve results comparable to PointNet on non-rotated data. Moreover, we achieve equal accuracy in the rotated and non-rotated test settings, which means the obtained representation of the point cloud is strictly rotation-invariant. PRIN degrades only slightly when testing with rotations, showing strong robustness to rotation. The other works, however, fail to recognize objects with unseen orientations.
4.2. Part Segmentation
Dataset. We evaluate SRINet on the part segmentation task using the ShapeNet part dataset (Yi et al., 2016). The dataset consists of 16,881 3D point cloud objects from 16 categories. The objects from the various categories are segmented into 50 parts in total, with no overlapping parts across categories, and each object contains no more than 5 parts. A semantic label is assigned to every point of each object. We use the processed dataset provided by (Qi et al., 2017a) and randomly sample 2048 points with their normals from each object.
The rotation-invariant representations of point clouds can also be used for the part segmentation task. We compare our work with PointNet (Qi et al., 2017a), PointNet++ (Qi et al., 2017b), SyncSpecCNN (Yi et al., 2017), Kd-Network (Klokov and Lempitsky, 2017) and PRIN (You et al., 2018). We follow the same experimental settings as (You et al., 2018) in evaluation, with three groups of settings listed as follows:
1. Train and test with no rotations.
2. Train with no rotation augmentations and test with arbitrary rotations.
3. Train with 10/20/30 rotations for every model and test with arbitrary rotations.
The results are shown in Table 2. State-of-the-art methods such as PointNet use orientation-aligned point clouds as input and achieve good performance on the original task, but show great performance degradation when dealing with rotated point clouds. Training with more rotation augmentations helps improve robustness to rotation, but the results are still worse than PRIN and ours; besides, the improvement comes at the price of an increased computational burden. PRIN is not sensitive to rotations; however, it fails to achieve strict rotation invariance. Our method is unaffected by orientation and obtains the best performance in segmenting rotated point clouds. Our segmentation results are visualized in Figure 7, where the three models are trained without rotation augmentation and the input point clouds are rotated by a random angle at test time.
4.3. Ablation Study
Graph Aggregation. The graph aggregation operation is introduced in the side branch to exploit the local geometric structure by aggregating the features attached to the neighboring points around each center. We find it useful in the point cloud recognition task, indicating that incorporating local structure helps perceive the global geometry. It also helps in the segmentation task, though it improves the performance only slightly; this is because precise segmentation requires global perception, and local information plays only a secondary role. The quantitative results for removing the graph aggregation module are shown in Table 3.
Key Points Detection. Intuitively, mining the skeleton and key points of an object helps in recognizing the whole shape. We directly define the key point response value instead of adopting a learnable neural-network-based attention mechanism. We combine the response values with the global point cloud representations in two ways: multiplication and summation. As shown in Table 4, combination by summation proves useful, whereas combination by multiplication results in worse performance than having no key point detection module at all. We also remove the key point detection module to observe its effect on the whole model: both the classification accuracy and the segmentation IoU drop without it. Though simple, this module brings a stable improvement to the classification and segmentation tasks.
4.4. The effect of parameters
Number of nearest neighboring points. We need to find K nearest neighboring points for each point in Graph Aggregation operation. From Table 5, we can see that the number of neighboring points is not crucial in classification, but greatly affects the segmentation task. As the number of neighboring points increases, the performance of segmentation keeps going up. We conjecture that segmentation relies on the receptive field of local region, and broader receptive field may lead to better perception of global shape.
The tested numbers of KNN points in Table 5 are 16, 25, 36, 49 and 64.
Number of input points. We vary the number of sampled points in the input point cloud to test whether the proposed model is robust to point cloud resolution. The number ranges from 256 to 2048, as shown in Table 6. We obtain the best results on both tasks when the number of points is set to 1024, and performance varies only slightly on either side of this value. This suggests that SRINet is capable of extracting valid local information despite the different distributions of local regions, and that sampling 1024 points from the original point cloud is an optimal choice for covering the whole object.
4.5. Comparing with Point Pair Feature
Several works adopt the point pair feature to reformulate the coordinates of a point cloud and achieve strict rotation invariance (Deng et al., 2018; Birdal and Ilic, 2015, 2017). Here, we compare it with the proposed point projection feature. In preliminary experiments, we found it difficult for neural networks to extract discriminative patterns from the original point pair features, which involve computing the angles between defined vectors. Thus, we step back and replace the angles with their cosine values, which can be calculated as the inner products of the normalized vectors. For a fair comparison, we use the PointNet architecture and conduct the classification task on ModelNet40. The original 3D coordinates of the point cloud are converted to 4D point pair features and point projection features respectively, and then fed to the network. The variant with point projection features achieves clearly higher accuracy than the point pair feature counterpart. This implies that the proposed point projection feature preserves more relative location information between points than the point pair feature, and shows great superiority in point cloud recognition.
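For reference, the cosine variant of the 4D point pair feature used in this comparison can be sketched as follows, after Drost et al. (2010) but with angles replaced by cosines; the exact reference-point pairing scheme is not reproduced here and the function name is ours.

```python
import numpy as np

def point_pair_feature(p1, n1, p2, n2):
    """4D point pair feature with cosines instead of angles:
    (||d||, cos<n1, d>, cos<n2, d>, cos<n1, n2>) for d = p2 - p1.
    Assumes unit normals n1, n2 and distinct points."""
    d = p2 - p1
    dist = np.linalg.norm(d)
    dn = d / dist                    # normalized offset vector
    return np.array([dist, n1 @ dn, n2 @ dn, n1 @ n2])
```

Note that unlike the point projection feature, this descriptor depends only on the two points' relative configuration, which is why different pairs can map to identical features.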
5. Conclusion

In this paper, we proposed SRINet to extract strictly rotation-invariant representations of point clouds. The point projection feature was introduced to reformulate the original 3D coordinates. We used graph aggregation to mine local structure and key point detection to guide the network in perceiving the 3D shape. Experiments on classification and part segmentation tasks showed that our method outperforms other methods in dealing with rotated point clouds. In future work, the choice of more stable axes needs to be further explored to reduce the information loss when converting 3D coordinates to point projection features. Besides, better understanding the point projection feature and generalizing it to more applications is also an interesting direction.
This work was supported by National Key Research and Development Program of China (2017YFB1002601), National Natural Science Foundation of China (Grant No.: 61672043 and 61672056) and Key Laboratory of Science, Technology and Standard in Press Industry (Key Laboratory of Intelligent Press Media Technology).
References

- Birdal and Ilic (2015). Point pair features based object detection and pose estimation revisited. In 2015 International Conference on 3D Vision, pp. 527–535.
- Birdal and Ilic (2017). CAD priors for accurate and flexible instance reconstruction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 133–142.
- Bronstein et al. (2017). Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34(4), pp. 18–42.
- Deng et al. (2018). PPF-FoldNet: unsupervised learning of rotation invariant 3D local descriptors. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 602–618.
- Drost et al. (2010). Model globally, match locally: efficient and robust 3D object recognition. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 998–1005.
- Feng et al. (2018). GVCNN: group-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 264–272.
- Hermosilla et al. (2018). Monte Carlo convolution for learning on non-uniformly sampled point clouds. In SIGGRAPH Asia 2018 Technical Papers, pp. 235.
- Jiang et al. (2018). PointSIFT: a SIFT-like network module for 3D point cloud semantic segmentation. arXiv preprint arXiv:1807.00652.
- Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kipf and Welling (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
- Klokov and Lempitsky (2017). Escape from cells: deep Kd-networks for the recognition of 3D point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, pp. 863–872.
- Li et al. (2018). PointCNN: convolution on X-transformed points. In Advances in Neural Information Processing Systems, pp. 828–838.
- Liu et al. (2018). Point2Sequence: learning the shape representation of 3D point clouds with an attention-based sequence to sequence network. arXiv preprint arXiv:1811.02565.
- Maturana and Scherer (2015). VoxNet: a 3D convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 922–928.
- Qi et al. (2016). Volumetric and multi-view CNNs for object classification on 3D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5648–5656.
- Qi et al. (2017a). PointNet: deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660.
- Qi et al. (2017b). PointNet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108.
- Roveri et al. (2018). PointProNets: consolidation of point clouds with convolutional neural networks. In Computer Graphics Forum, Vol. 37, pp. 87–99.
- Shen et al. (2018). Mining point cloud local structures by kernel correlation and graph pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4548–4557.
- Sipiran and Bustos (2011). Harris 3D: a robust extension of the Harris operator for interest point detection on 3D meshes. The Visual Computer 27(11), pp. 963.
- Su et al. (2015). Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 945–953.
- Wang et al. (2018). Dynamic graph CNN for learning on point clouds. arXiv preprint arXiv:1801.07829.
- Wu et al. (2015). 3D ShapeNets: a deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1912–1920.
- Yi et al. (2016). A scalable active framework for region annotation in 3D shape collections. ACM Transactions on Graphics (TOG) 35(6), pp. 210.
- Yi et al. (2017). SyncSpecCNN: synchronized spectral CNN for 3D shape segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2282–2290.
- Yin et al. (2018). P2P-Net: bidirectional point displacement net for shape transform. ACM Transactions on Graphics (TOG) 37(4), pp. 152.
- You et al. (2018). PRIN: pointwise rotation-invariant network. arXiv preprint arXiv:1811.09361.