With the development of 3D sensors such as structured light, time-of-flight cameras and LiDAR, 3D data can be easily acquired and directly processed in many applications, such as autonomous driving and 3D face recognition. In general, 3D data is encoded in the form of a point cloud, which directly records the coordinates of points sampled on the object surface. A key challenge for point cloud processing is that the input data is highly rotation-variant: a 3D object possesses infinitely many rotated copies in different attitudes. This remains an intractable problem even for recently proposed deep 3D models such as PointNet [Qi et al.2017a], PointNet++ [Qi et al.2017b] and DGCNN [Wang et al.2018].
To alleviate the rotation-variance problem, typical approaches either use a spatial transformer module, as in the original PointNet [Qi et al.2017a], or apply extensive data augmentation during the training phase. However, this requires higher model capacity and imposes an extra computational burden. Other methods, such as Spherical CNN [Esteves et al.2018] and SFCNN [Rao, Lu, and Zhou2019], convert the point cloud into special structures to extract a rotation-invariant feature, which might suffer from loss of information.
In this paper, we introduce a novel PCA-RI (PCA Rotation-Invariant) representation that endows deep 3D models with rotation invariance by expressing the point cloud in an intrinsic frame. Such a frame should be stable under arbitrary rotations; in other words, the expressed coordinates do not change no matter how the object rotates. Besides, the intrinsic frame should tolerate small distortions, thus providing a consistent representation for similar objects. Recall that PCA (principal component analysis) is designed to detect the main directions along which the variance of high-dimensional input data is large. These directions encode the intrinsic structure of the input data and are exactly rotation-equivariant, which offers an effective way to define the desired frames.
More specifically, we apply PCA to obtain the three principal components of a point cloud, which are used as the axes of the intrinsic frame. After that, we project the point cloud onto the new frame and use the transformed coordinates as our PCA-RI representation of the point cloud, as shown in Figure 1. As we prove in later sections, complete rotation invariance follows immediately from this construction. Compared with previous works, our PCA-RI representation has the advantages of simplicity and generality. It can be flexibly embedded into current deep neural networks to fundamentally improve their robustness against rotation transformations.
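The construction just described can be sketched in a few lines of NumPy (an illustrative snippet; the function name `pca_ri` and its exact interface are ours, not part of any released implementation):

```python
import numpy as np

def pca_ri(points):
    """Express an (n, 3) point cloud in its PCA intrinsic frame.

    The three principal components, ordered by decreasing eigenvalue,
    serve as the axes of the intrinsic frame.
    """
    centered = points - points.mean(axis=0)        # move the centroid to the origin
    cov = centered.T @ centered / len(points)      # 3x3 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigh returns ascending eigenvalues
    frame = eigvecs[:, ::-1]                       # columns v1, v2, v3, descending order
    return centered @ frame                        # coordinates in the intrinsic frame
```

Because eigenvector signs are arbitrary, the output is unique only up to a per-axis sign flip; this is exactly the frame ambiguity discussed below.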
One concern with our proposed approach is that the direction of each principal component is not uniquely determined. Thus, for a point cloud, there exist two directions for each frame axis, which we call frame ambiguity. To address the problem, we propose a multi-frame approach that enumerates all the possible frames derived from the principal component analysis. We then feed the PCA-RI representations of all these frames to the deep model and aggregate the output features via a self-attention module, followed by an average-pooling operation to extract a final feature vector for downstream tasks. To empirically validate the effectiveness of our method, we conduct a comprehensive experimental study on ModelNet40 [Wu et al.2015] classification and the SHREC'17 [Savva et al.2017] perturbed retrieval task. The experimental results demonstrate that our approach achieves near state-of-the-art performance on rotation-augmented ModelNet40 [Wu et al.2015] classification and outperforms other models on the SHREC'17 [Savva et al.2017] perturbed retrieval task.
In summary, the key contributions of this paper are as follows:
We propose a provably rotation-invariant and completely information-lossless point cloud representation.
We further introduce a multi-frame approach based on a self-attention module, which can effectively address the problem of frame ambiguity.
Extensive experiments further demonstrate the correctness and effectiveness of our method.
Deep Learning for 3D Objects
Motivated by the breakthrough results of convolutional neural networks on 2D images, increasing attention has been drawn to developing such methods for geometric data. One intuitive idea is to convert irregular point clouds into regular 3D grids by voxelization [Maturana and Scherer2015, Qi et al.2016], since the grid format resembles image pixels and transfers easily to existing frameworks. However, voxelization inevitably suffers from loss of resolution and high computational demand. To avoid these shortcomings, kd-tree [Klokov and Lempitsky2017] and octree [Riegler, Osman Ulusoy, and Geiger2017] based methods hierarchically partition space to exploit input sparsity. But these methods focus more on the subdivision of a volume than on local geometric structure.
An important architectural model that directly processes point clouds is PointNet [Qi et al.2017a], which adopts spatial transformer networks and a symmetric function to achieve permutation invariance. After that, many point-based learning approaches focus on how to efficiently capture local features on top of PointNet [Qi et al.2017a]. For instance, PointNet++ [Qi et al.2017b] applies the PointNet [Qi et al.2017a] structure to local point sets at different resolutions and accumulates local features in a hierarchical architecture. In DGCNN [Wang et al.2018], EdgeConv is proposed as a basic block for building networks, in which the edge features between points and their neighbors are exploited.
Recently, attention mechanisms [Bahdanau, Cho, and Bengio2014, Xu et al.2015, Gregor et al.2015, Yang et al.2016, Chen et al.2017] have become an integral part of models that must capture global dependencies. In particular, self-attention [Cheng, Dong, and Lapata2016, Parikh et al.2016, Vaswani et al.2017], also called intra-attention, exhibits a good balance between the ability to model long-range dependencies and computational efficiency. A self-attention module calculates the response at a position as a weighted sum of the features at all positions, where the weights, called attention vectors, are computed at only a small computational cost. Vaswani et al. [Vaswani et al.2017] further demonstrate that machine translation models can achieve state-of-the-art results by relying solely on self-attention.
Rotation-Invariant Network for 3D Objects
Rotation robustness is essential in real-world applications of point cloud processing systems. Previous works have attempted to equip existing neural networks with the property of rotation invariance. A straightforward method is to train a deep model with large amounts of rotation-augmented data. Although data augmentation is effective to some extent, it is computationally expensive during the training phase. Furthermore, a previous study [Esteves et al.2018] has shown that aggressive data augmentation such as arbitrary 3D rotations of the input data still harms recognition performance. PointNet [Qi et al.2017a] applies a spatial transformer network (STN) to canonicalize the input data, but further experiments demonstrate that a model with STN still suffers a large performance drop on arbitrarily rotation-augmented 3D datasets.
In closely related works, Esteves et al. [Esteves et al.2018] propose a special convolutional operation with local rotation invariance, which generalizes well to unseen rotations. Besides, Rao et al. [Rao, Lu, and Zhou2019] design a trainable neural network that adaptively projects the original points onto a fractal structure, which makes their model resistant to arbitrary rotations. While the theoretical foundations of these approaches are well studied, they rely on projecting the input onto spherical or fractal structures, which might suffer from loss of information. ClusterNet [Chen et al.2019] introduces a point cloud representation built from rigorously rotation-invariant operators such as inner products between points. Although ClusterNet [Chen et al.2019] claims its representation is conditionally information-lossless, its experiments on ModelNet40 [Wu et al.2015] reveal that the representation still degrades performance on 3D tasks.
In this section, we first introduce our PCA-RI representation for point clouds based on principal component analysis. Then we explain how to address the problem of frame ambiguity in a deep neural network by multi-frame fusion based on a self-attention module. In the end, we present how our method can be embedded into deep 3D models.
The main idea of our method is to find an intrinsic frame determined by the object shape. The intrinsic frame should provide the same representation for all rotated copies of an identical object. In addition, it should tolerate small distortions of the object shape; that is, similar objects yield similar frames and representations. To this end, we propose the PCA-RI (PCA Rotation-Invariant) representation based on classical PCA (principal component analysis).
Let $P = \{p_1, p_2, \dots, p_n\} \subset \mathbb{R}^3$ denote a point cloud, which directly encodes the coordinates of the sampled points on the object surface. Note that the coordinate value of each point depends on the choice of the coordinate system, namely the frame. The intrinsic frame is a frame that can be automatically detected from the object structure.
Use $\bar{p} = \frac{1}{n}\sum_{i=1}^{n} p_i$ to denote the mean of the point cloud and $C$ to denote its covariance matrix, which is a positive semi-definite symmetric matrix. Then $C$ can be calculated as follows:
$$C = \frac{1}{n}\sum_{i=1}^{n}(p_i - \bar{p})(p_i - \bar{p})^T. \tag{1}$$
Then we use eigendecomposition to find the eigenvectors of the covariance matrix $C$, which satisfy the following equation:
$$C v_j = \lambda_j v_j, \quad j = 1, 2, 3. \tag{2}$$
Obviously, there are three eigenvalues, denoted as $\lambda_1 \ge \lambda_2 \ge \lambda_3$, with three corresponding unit eigenvectors $v_1, v_2, v_3$. After that, we use $v_1, v_2, v_3$ to define the intrinsic frame and express the point cloud in the new frame with the order of the eigenvalues:
$$\hat{p}_i = [v_1, v_2, v_3]^T (p_i - \bar{p}), \tag{3}$$
in which $\hat{p}_i$ represents the redefined coordinate value of $p_i$ in the intrinsic frame. Now we prove that the intrinsic coordinate of each point does not change under rigid rotations.
Suppose the point cloud is rotated in the original frame, giving a rotated point cloud $P' = \{p'_i\}$ with $p'_i = R p_i$, where $R$ represents a rigid rotation matrix. It is not hard to see that the covariance matrix $C'$ of $P'$ satisfies
$$C' = R C R^T. \tag{4}$$
Obviously, we have:
$$C'(R v_j) = R C R^T R v_j = R C v_j = \lambda_j (R v_j), \tag{5}$$
which means $\lambda_j$ and $R v_j$ are the eigenvalues and eigenvectors of $C'$, respectively. Denote the redefined coordinate of $p'_i$ in the intrinsic frame as $\hat{p}'_i$. Thus we have:
$$\hat{p}'_i = [R v_1, R v_2, R v_3]^T (R p_i - R \bar{p}) = [v_1, v_2, v_3]^T R^T R (p_i - \bar{p}) = \hat{p}_i. \tag{6}$$
As shown in Equation (6), the redefined coordinate of each point in the intrinsic frame remains invariant no matter how the point cloud is rotated. From a geometric perspective, our PCA-RI representation merely adjusts the arbitrarily rotated copies of an identical point cloud to a consistent pose. This shows that our PCA-RI representation is completely information-lossless. Apart from this, our approach is very general and can be applied to current neural network architectures.
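The invariance proved above can also be checked numerically. The snippet below is a self-contained sketch (the helper `intrinsic_coords` restates the construction; the name is ours): it applies a random rigid rotation and verifies that the intrinsic coordinates agree up to per-axis signs.

```python
import numpy as np

def intrinsic_coords(points):
    """Center a point cloud and express it in its PCA intrinsic frame."""
    centered = points - points.mean(axis=0)
    _, eigvecs = np.linalg.eigh(centered.T @ centered / len(points))
    return centered @ eigvecs[:, ::-1]       # axes ordered by decreasing eigenvalue

rng = np.random.RandomState(1)
pts = rng.randn(256, 3) * [4.0, 2.0, 1.0]    # anisotropic cloud, distinct eigenvalues

q, _ = np.linalg.qr(rng.randn(3, 3))         # random orthogonal matrix
if np.linalg.det(q) < 0:
    q[:, 0] = -q[:, 0]                       # force det(q) = +1: a proper rotation

rep_a = intrinsic_coords(pts)
rep_b = intrinsic_coords(pts @ q.T)          # rotate every point, then recompute
# Identical up to the per-axis sign ambiguity of the eigenvectors:
assert np.allclose(np.abs(rep_a), np.abs(rep_b), atol=1e-8)
```

Comparing absolute values sidesteps the sign ambiguity of the eigenvectors, which is treated explicitly in the next subsection.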
Note that the above rotation invariance assumes that there are three distinct eigenvalues so that we can define the axes according to the order of $\lambda_1 > \lambda_2 > \lambda_3$, which we call axis significance. If the axis significance is weak, i.e., the eigenvalues are close, we will not be able to detect a stable intrinsic frame. For instance, if the shape is composed of three intersecting orthogonal lines of the same length, then the covariance matrix is a scalar multiple of the identity matrix and $\lambda_1 = \lambda_2 = \lambda_3$. This results in infinitely many intrinsic frames, and the property of rotation invariance no longer holds. Fortunately, axis significance is preserved in general cases, as shown in the later experiments.
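The degenerate case mentioned above is easy to reproduce numerically: for three orthogonal segments of equal length, the covariance matrix is a multiple of the identity, so every orthonormal basis is an eigenbasis. A small illustration:

```python
import numpy as np

# Three orthogonal line segments of equal length through the origin.
t = np.linspace(-1.0, 1.0, 50)
zeros = np.zeros_like(t)
cloud = np.concatenate([
    np.stack([t, zeros, zeros], axis=1),     # segment along the x-axis
    np.stack([zeros, t, zeros], axis=1),     # segment along the y-axis
    np.stack([zeros, zeros, t], axis=1),     # segment along the z-axis
])

cov = cloud.T @ cloud / len(cloud)           # the cloud is already centered
eigvals = np.linalg.eigvalsh(cov)
# All three eigenvalues coincide, so no stable axis order exists:
assert np.allclose(eigvals, eigvals[0])
```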
Another noteworthy point is whether the intrinsic frames are consistent across intra-class objects. For example, the axes of the intrinsic frames of desks should all lie roughly along the edges. We argue that within the same category the principal components are close, thus keeping the frames consistent. As shown in Figure 2, we list some examples from the cup, chair and lamp categories. For each category, the first row shows the manually aligned objects while the second row shows the objects aligned with our intrinsic frames. It is not hard to see that the intrinsic frames are consistent across intra-class samples in most cases. Note that we still cannot achieve absolute alignment like manual alignment, because PCA attends to the data distribution of the point cloud rather than its semantics. Despite this, theoretical analysis and extensive experiments demonstrate that the canonicalization provided by our PCA-RI representation essentially reduces the learning difficulty of the neural network by replacing infinitely many rotated attitudes with a few fixed poses while keeping the original point cloud information intact.
Frame Ambiguity Elimination
One concern with our proposed method is that when we try to define the new coordinates using eigendecomposition, we are not sure about the direction of each $v_j$, as the following equation also holds:
$$C(-v_j) = \lambda_j (-v_j). \tag{7}$$
Specifically, the eigenvector computation provides no means of assessing the sign of each eigenvector, so each individual eigenvector has an arbitrary sign. This means that for an identical point cloud, there exist two directions for each frame axis, which we call frame ambiguity. Figure 3 illustrates this phenomenon.
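Because each of the three axes can be flipped independently, there are $2^3 = 8$ candidate frames. They can be enumerated as follows (an illustrative sketch; `candidate_frames` is our name for the routine):

```python
import numpy as np
from itertools import product

def candidate_frames(points):
    """Enumerate the 2^3 = 8 intrinsic frames arising from eigenvector signs."""
    centered = points - points.mean(axis=0)
    _, eigvecs = np.linalg.eigh(centered.T @ centered / len(points))
    v = eigvecs[:, ::-1]                      # principal axes v1, v2, v3 as columns
    return [v * np.array([s1, s2, s3])        # flip each axis direction independently
            for s1, s2, s3 in product((1, -1), repeat=3)]
```

Each returned frame yields one PCA-RI representation by projecting the centered cloud onto it.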
To address the issue, we adopt a multi-frame approach to fuse the results. Denote by $f$ the deep model which we are going to endow with rotation invariance; $f(P)$ denotes the feature vector generated by the deep model given the input point cloud $P$. Here we suppose the centroid of the input point cloud is at the origin. Our fusion scheme can be abstracted as follows:
$$F = g\big(f(T_1(P)), f(T_2(P)), \dots, f(T_8(P))\big), \tag{8}$$
in which we introduce a fusion function $g$ to obtain a final feature descriptor $F$ from the multiple PCA-RI representations, with $T_k$ denoting the $k$-th frame.
In order to achieve absolute rotation invariance, we require the fusion function $g$ to be independent of the order of the frames. That is, with $x_k = f(T_k(P))$ and $\sigma$ denoting a permutation of $\{1, \dots, 8\}$:
$$g(x_1, \dots, x_8) = g(x_{\sigma(1)}, \dots, x_{\sigma(8)}). \tag{9}$$
A naive approach is to directly apply an average- or max-pooling operation on $x_1, \dots, x_8$. However, we find that the direct pooling operation disregards much of the relationship among the features, which limits the discriminability of the final feature.
To alleviate the problem, we apply a self-attention module derived from [Vaswani et al.2017] before the pooling layer to pay more attention to the relationships between the multi-frame features. Following the notation of [Vaswani et al.2017], the transformed features derived from the self-attention module can be expressed as follows, with shared parameter matrices $W_Q$, $W_K$ and $W_V$:
$$y_i = \sum_{j=1}^{8} \mathrm{softmax}_j\!\left(\frac{(W_Q x_i)^T (W_K x_j)}{\sqrt{d}}\right) W_V x_j, \tag{10}$$
where $d$ denotes the feature dimension.
From Equation (10) we can observe that the attention module allocates weights to the multi-frame features and accumulates the weighted features.
Use $\mathcal{A}$ to denote the self-attention transformation as follows:
$$(y_1, \dots, y_8) = \mathcal{A}(x_1, \dots, x_8). \tag{11}$$
We care about whether $\mathcal{A}$ respects permutations of the input $x_1, \dots, x_8$, as Equation (12) shows:
$$(y_{\sigma(1)}, \dots, y_{\sigma(8)}) = \mathcal{A}(x_{\sigma(1)}, \dots, x_{\sigma(8)}). \tag{12}$$
Fortunately, this equation holds, as the sum operation in Equation (10) does not depend on the order of its items. In summary, the set of transformed features derived from the self-attention module is independent of the order of the input frames.
With these transformed features $y_1, \dots, y_8$, we further apply an average-pooling operation to obtain a final feature vector $F$ for further processing, which can be summarized as follows:
$$F = \frac{1}{8}\sum_{i=1}^{8} y_i. \tag{13}$$
Note that we adopt the average-pooling operation as it achieves better performance than max-pooling in our experiments.
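The permutation behaviour of this fusion scheme can be verified with a small NumPy experiment (a sketch with random weights; in the actual model $W_Q$, $W_K$ and $W_V$ are learned):

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention over a set of feature vectors.

    x: (m, d) multi-frame features; wq, wk, wv: (d, d) shared projections.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # row-wise softmax
    return weights @ v

rng = np.random.RandomState(0)
x = rng.randn(8, 16)                                  # features of the 8 frames
wq, wk, wv = rng.randn(3, 16, 16)
perm = rng.permutation(8)

y = self_attention(x, wq, wk, wv)
y_perm = self_attention(x[perm], wq, wk, wv)
assert np.allclose(y[perm], y_perm)                   # permutation-equivariant
assert np.allclose(y.mean(axis=0), y_perm.mean(axis=0))  # pooled feature invariant
```

The two assertions mirror Equations (12) and (13): the attention outputs permute with the inputs, so the average-pooled descriptor is order-independent.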
Embedded into Deep Architectures
Table 1: Classification accuracy (%) on ModelNet40 under the z/z, SO3/SO3 and z/SO3 settings.

| Method | Input | z/z | SO3/SO3 | z/SO3 |
| --- | --- | --- | --- | --- |
| SubVolSup MO [Qi et al.2016] | voxel | 89.5 | 85.0 | 45.5 |
| Spherical CNN [Esteves et al.2018] | projected voxel | 88.9 | 86.9 | 76.7 |
| MVCNN 80x [Su et al.2015] | views | 90.2 | 86.0 | 81.5 |
| RotationNet 20x [Kanezaki, Matsushita, and Nishida2018] | views | 92.4 | 80.0 | 20.2 |
| PointNet [Qi et al.2017a] | xyz | 89.2 | 83.6 | 14.7 |
| PointNet++ [Qi et al.2017b] | xyz | 89.3 | 85.0 | 28.6 |
| SFCNN [Rao, Lu, and Zhou2019] | xyz | 91.4 | 90.1 | 84.8 |
| ClusterNet [Chen et al.2019] | xyz | 87.1 | 87.1 | 87.1 |
| DGCNN [Wang et al.2018] | xyz | 91.9 | 88.3 | 37.8 |
| DGCNN (without STN) [Wang et al.2018] | xyz | 91.6 | 88.1 | 36.3 |
Table 2: Retrieval results on the SHREC'17 perturbed dataset (micro and macro metrics; Score is the average of the micro and macro mAP).

| Method | micro P | micro R | micro F1 | micro mAP | micro NDCG | macro P | macro R | macro F1 | macro mAP | macro NDCG | Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Furuya [Furuya and Ohbuchi2016] | 0.814 | 0.683 | 0.706 | 0.656 | 0.754 | 0.607 | 0.539 | 0.503 | 0.476 | 0.560 | 0.566 |
| Tatsuma [Tatsuma and Aono2009] | 0.705 | 0.769 | 0.719 | 0.696 | 0.783 | 0.424 | 0.563 | 0.434 | 0.418 | 0.479 | 0.557 |
| Zhou [Bai et al.2016] | 0.660 | 0.650 | 0.643 | 0.567 | 0.701 | 0.443 | 0.508 | 0.437 | 0.406 | 0.513 | 0.487 |
| Spherical CNN [Esteves et al.2018] | 0.717 | 0.737 | - | 0.685 | - | 0.450 | 0.550 | - | 0.444 | - | 0.565 |
| SFCNN [Rao, Lu, and Zhou2019] | 0.778 | 0.751 | 0.752 | 0.705 | 0.813 | 0.656 | 0.539 | 0.536 | 0.483 | 0.580 | 0.594 |
| DGCNN (without STN) [Wang et al.2018] | 0.768 | 0.717 | 0.719 | 0.672 | 0.782 | 0.640 | 0.527 | 0.515 | 0.449 | 0.564 | 0.561 |
| DGCNN [Wang et al.2018] | 0.774 | 0.723 | 0.725 | 0.679 | 0.789 | 0.640 | 0.531 | 0.521 | 0.454 | 0.567 | 0.567 |
As we have claimed, our method can be flexibly embedded into current neural architectures. In this part, we adopt DGCNN [Wang et al.2018] as our basic architecture and demonstrate how to endow it with rotation invariance.
The extended architecture, depicted in Figure 4, consists of four modules: the PCA-RI representation module, the EdgeConv module, the self-attention module and the classification module. The EdgeConv module contains eight EdgeConv blocks, which share the same weight parameters. Each block consists of five layers with output sizes 64, 64, 64, 128 and 1024, respectively. Since our PCA-RI representation maintains rotation invariance, we remove the spatial transformer network (STN) of DGCNN [Wang et al.2018], as the STN is mainly designed to make the model resistant to affine transformations.
For each input point cloud, we first convert it into eight PCA-RI representations and feed them to the eight EdgeConv blocks during the training phase. These blocks produce eight output features, which are aggregated by a self-attention module followed by an average-pooling layer to obtain a final feature for downstream tasks. For simplicity, we refer to this model as our multi-frame model.
Another possible architecture is to apply only one EdgeConv block and remove the self-attention module and the average-pooling layer, with the other parts unchanged. We call this the single-frame model. During the training phase, the PCA-RI module randomly selects one of the eight representations as input, which can also improve the rotation robustness of models, as shown in the following experiments.
Note that our approach does not require any rotation augmentation of the training data, which greatly reduces the computational burden.
In this section, ModelNet40 [Wu et al.2015] is first used as the benchmark for the 3D classification task. Next, we conduct experiments on ShapeNet Core55 [Chang et al.2015] for the retrieval task. In the end, we provide ablation analyses of our approach.
ModelNet 3D Shape Classification
We first evaluate the rotation robustness of our proposed method on the ModelNet40 [Wu et al.2015] benchmark for the 3D classification task and compare it with other state-of-the-art 3D shape classification models.
The ModelNet40 [Wu et al.2015] dataset consists of 12,311 CAD models from 40 man-made object categories. We use the standard split following PointNet [Qi et al.2017a], with 9,843 models for training and 2,468 for testing. Since each CAD model in ModelNet40 [Wu et al.2015] is composed of many mesh faces, we sample 2,048 points from them uniformly with respect to face area and then shift and normalize each point cloud so that its centroid lies at the origin. Only the (x, y, z) coordinates of the sampled points are used; the original meshes are discarded.
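The preprocessing just described can be sketched as follows (a hypothetical helper for area-weighted sampling; the function name and parameters are ours, not from the released data pipeline):

```python
import numpy as np

def sample_mesh(vertices, faces, n=2048, rng=None):
    """Sample n points uniformly on a triangle mesh, weighted by face area,
    and shift the resulting cloud so its centroid lies at the origin."""
    if rng is None:
        rng = np.random.RandomState(0)
    a, b, c = (vertices[faces[:, i]] for i in range(3))     # triangle corners
    areas = 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1)
    idx = rng.choice(len(faces), size=n, p=areas / areas.sum())
    # Uniform barycentric sampling inside each chosen triangle:
    u, v = rng.rand(n), rng.rand(n)
    flip = u + v > 1.0
    u[flip], v[flip] = 1.0 - u[flip], 1.0 - v[flip]
    pts = a[idx] + u[:, None] * (b[idx] - a[idx]) + v[:, None] * (c[idx] - a[idx])
    return pts - pts.mean(axis=0)
```

Weighting face selection by area and folding the barycentric coordinates back into the triangle yields points distributed uniformly over the surface.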
Following Spherical CNN [Esteves et al.2018], we evaluate our model using three different settings: 1) training and testing with azimuthal rotations (z/z), 2) training and testing with arbitrary rotations (SO3/SO3), and 3) training with azimuthal rotations while testing with arbitrary rotations (z/SO3).
Table 1 shows the comparison between our proposed method and previous methods. All competing methods trained with azimuthal-rotation augmentation suffer a sharp drop on the arbitrarily rotated test set, even the SO(3)-equivariant methods Spherical CNN [Esteves et al.2018] (2% and 12.2% drops in the SO3/SO3 and z/SO3 settings, respectively) and SFCNN [Rao, Lu, and Zhou2019] (1.3% and 6.6% drops in SO3/SO3 and z/SO3, respectively), while our approach consistently maintains superior performance across the different settings. Furthermore, the results illustrate that rotation augmentation can indeed improve the rotation robustness of models, but a large margin to our proposed method and SFCNN [Rao, Lu, and Zhou2019] remains in the SO3/SO3 setting.
Note that SFCNN [Rao, Lu, and Zhou2019] achieves 0.3% better performance than ours in the SO3/SO3 setting. Nevertheless, SFCNN [Rao, Lu, and Zhou2019] has to apply a complicated operation to project the point cloud onto a fractal structure, which might lose information from the original point cloud. Given the rather simple architecture of our model and the information-lossless input representation we use, we interpret our performance as strong empirical support for the effectiveness of our method.
SHREC’17 3D Shape Retrieval
We also conduct 3D shape retrieval experiments on the ShapeNet Core55 [Chang et al.2015] benchmark using its perturbed dataset, which contains arbitrary SO(3) rotations.
The ShapeNet Core55 [Chang et al.2015] benchmark has two evaluation datasets: normal and perturbed. In the normal dataset, all models are consistently aligned, while in the perturbed dataset each model has been rotated by a uniformly sampled random rotation. In order to validate the rotation robustness of our approach, we only consider the perturbed dataset, which contains a total of 51,190 3D models in 55 categories; 70% of the dataset is used for training, 10% for validation, and 20% for testing.
Following the experimental settings of Spherical CNN [Esteves et al.2018], we train the classification model on the 55 core classes with joint supervision of a triplet loss and a softmax loss. We use the output of the layer before the score-prediction layer as our feature vector and compute the distance between samples by cosine similarity.
SHREC'17 [Savva et al.2017] provides several evaluation metrics, including Precision, Recall, F1, mAP and normalized discounted cumulative gain (NDCG). These metrics are computed in both micro and macro contexts. We evaluate our method and compare it to prior models using the official metrics. In addition, following [Savva et al.2017], we use the average of the micro and macro mAP as the final score to rank performance.
In Table 2, comprehensive comparisons between our approach and various state-of-the-art methods are presented. As we can see, our approach outperforms all other models, including the previous state-of-the-art SFCNN [Rao, Lu, and Zhou2019], in both macro and micro contexts on most metrics. More importantly, our method is more scalable and flexible, requiring no extra complicated operations.
Analysis of Architecture
Since our PCA-RI representation is compatible with many architectures for point clouds, we further enhance PointNet [Qi et al.2017a] and PointNet++ [Qi et al.2017b] with our PCA-RI representation. As shown in Table 3, DGCNN [Wang et al.2018] (without STN), PointNet [Qi et al.2017a] and PointNet++ [Qi et al.2017b] enhanced with the PCA-RI representation outperform the original models by a large margin on the arbitrarily rotation-augmented ModelNet40 [Wu et al.2015] classification task.
Table 3: ModelNet40 classification accuracy (%).

| Method | Accuracy |
| --- | --- |
| DGCNN (without STN) | 87.4 |
| DGCNN (without STN) | 88.2 |
| DGCNN (without STN) | 88.8 |
Analysis of Self-Attention Module
For our multi-frame approach, how to aggregate the features across all intrinsic frames is important for extracting a discriminative feature. As shown in Table 4, our experimental results demonstrate that directly applying a pooling layer to the individual features can be improved by adding a self-attention module before the pooling layer. In addition, the results illustrate that the average-pooling scheme is more robust and beneficial for our classification tasks than max-pooling.
Analysis of Frame Stability
An important requirement for our approach is that the intrinsic frame should be stable. For an identical object, we hope that the intrinsic frames derived from different sampled point clouds remain consistent. To this end, we collect statistics on stability with respect to sampling. Our experiment shows that the average rotation angle between the intrinsic frames derived from two sampled point clouds of the same mesh is small for both ModelNet40 [Wu et al.2015] and ShapeNet Core55 [Chang et al.2015], indicating that the sampled point clouds yield consistent intrinsic frames.
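The rotation angle between two intrinsic frames, used in the statistic above, follows from the trace of the relative rotation (a minimal sketch; sign disambiguation between the two frames is assumed to have been resolved beforehand):

```python
import numpy as np

def frame_angle(f1, f2):
    """Rotation angle in radians between two 3x3 orthonormal frames."""
    r = f1.T @ f2                                   # relative rotation f1 -> f2
    cos = np.clip((np.trace(r) - 1.0) / 2.0, -1.0, 1.0)
    return np.arccos(cos)
```

For two samplings of the same mesh, a small `frame_angle` between their intrinsic frames indicates stable frame detection.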
As aforementioned, another influential factor in frame stability is axis significance, which depends on how different the eigenvalues are. We plot the distributions of the eigenvalue ratio in Figure 5 for ModelNet40 [Wu et al.2015] and ShapeNet Core55 [Chang et al.2015]. Our experimental results demonstrate that more than eighty percent of the point clouds have a significant axis order, with the ratio between adjacent eigenvalues smaller than 0.8.
Table 4: Comparison of fusion schemes (classification accuracy, %).

| Fusion scheme | Accuracy |
| --- | --- |
| Self Attention + Max pooling | 88.5 |
| Self Attention + Avg pooling | 88.8 |
| Self Attention + Max pooling | 89.5 |
| Self Attention + Avg pooling | 89.8 |
In this paper, we introduce a rotation-invariant representation based on principal component analysis to enhance the rotation robustness of deep 3D models. To handle the sign ambiguity of eigenvectors, we adopt a multi-frame strategy that aggregates all the feature vectors with a self-attention mechanism, which theoretically preserves rotation invariance while achieving better performance than direct pooling. Despite its simplicity, our approach is effective and can be easily embedded into deep 3D models. Extensive experimental results on the ModelNet40 and ShapeNet Core55 benchmarks demonstrate the superiority of our novel representation.
- [Bahdanau, Cho, and Bengio2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- [Bai et al.2016] Bai, S.; Bai, X.; Zhou, Z.; Zhang, Z.; and Jan Latecki, L. 2016. Gift: A real-time and scalable 3d shape search engine. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5023–5032.
- [Chang et al.2015] Chang, A. X.; Funkhouser, T.; Guibas, L.; Hanrahan, P.; Huang, Q.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; et al. 2015. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012.
- [Chen et al.2017] Chen, X.; Mishra, N.; Rohaninejad, M.; and Abbeel, P. 2017. Pixelsnail: An improved autoregressive generative model. arXiv preprint arXiv:1712.09763.
- [Chen et al.2019] Chen, C.; Li, G.; Xu, R.; Chen, T.; Wang, M.; and Lin, L. 2019. Clusternet: Deep hierarchical cluster network with rigorously rotation-invariant representation for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4994–5002.
- [Cheng, Dong, and Lapata2016] Cheng, J.; Dong, L.; and Lapata, M. 2016. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733.
- [Esteves et al.2018] Esteves, C.; Allen-Blanchette, C.; Makadia, A.; and Daniilidis, K. 2018. Learning so (3) equivariant representations with spherical cnns. In Proceedings of the European Conference on Computer Vision (ECCV), 52–68.
- [Furuya and Ohbuchi2016] Furuya, T., and Ohbuchi, R. 2016. Deep aggregation of local 3d geometric features for 3d model retrieval. In BMVC, 121–1.
- [Gregor et al.2015] Gregor, K.; Danihelka, I.; Graves, A.; Rezende, D. J.; and Wierstra, D. 2015. Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.
- [Kanezaki, Matsushita, and Nishida2018] Kanezaki, A.; Matsushita, Y.; and Nishida, Y. 2018. Rotationnet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5010–5019.
- [Klokov and Lempitsky2017] Klokov, R., and Lempitsky, V. 2017. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, 863–872.
- [Maturana and Scherer2015] Maturana, D., and Scherer, S. 2015. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 922–928. IEEE.
- [Parikh et al.2016] Parikh, A. P.; Täckström, O.; Das, D.; and Uszkoreit, J. 2016. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933.
- [Qi et al.2016] Qi, C. R.; Su, H.; Nießner, M.; Dai, A.; Yan, M.; and Guibas, L. J. 2016. Volumetric and multi-view cnns for object classification on 3d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, 5648–5656.
- [Qi et al.2017a] Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017a. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 652–660.
- [Qi et al.2017b] Qi, C. R.; Yi, L.; Su, H.; and Guibas, L. J. 2017b. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, 5099–5108.
- [Rao, Lu, and Zhou2019] Rao, Y.; Lu, J.; and Zhou, J. 2019. Spherical fractal convolutional neural networks for point cloud recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 452–460.
- [Riegler, Osman Ulusoy, and Geiger2017] Riegler, G.; Osman Ulusoy, A.; and Geiger, A. 2017. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3577–3586.
- [Savva et al.2017] Savva, M.; Yu, F.; Su, H.; Kanezaki, A.; Furuya, T.; Ohbuchi, R.; Zhou, Z.; Yu, R.; Bai, S.; Bai, X.; et al. 2017. Large-scale 3d shape retrieval from shapenet core55: Shrec’17 track. In Proceedings of the Workshop on 3D Object Retrieval, 39–50. Eurographics Association.
- [Xu et al.2015] Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044.
- [Su et al.2015] Su, H.; Maji, S.; Kalogerakis, E.; and Learned-Miller, E. 2015. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, 945–953.
- [Tatsuma and Aono2009] Tatsuma, A., and Aono, M. 2009. Multi-fourier spectra descriptor and augmentation with spectral clustering for 3d shape retrieval. The Visual Computer 25(8):785–804.
- [Vaswani et al.2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in neural information processing systems, 5998–6008.
- [Wang et al.2018] Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S. E.; Bronstein, M. M.; and Solomon, J. M. 2018. Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829.
- [Wu et al.2015] Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; and Xiao, J. 2015. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1912–1920.
- [Yang et al.2016] Yang, Z.; He, X.; Gao, J.; Deng, L.; and Smola, A. 2016. Stacked attention networks for image question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, 21–29.