Introduction
With the development of 3D sensors such as structured light, time-of-flight and LiDAR, 3D data can be easily acquired and directly processed in many applications, such as autonomous driving and 3D face recognition. In general, 3D data is encoded in the form of a point cloud, which directly records the coordinates of points sampled on the object surface. A key challenge for point cloud processing is that the input data is highly rotation-variant: a 3D object possesses rotated clones in infinitely many attitudes. This remains an intractable problem even for recently proposed deep 3D models such as PointNet [Qi et al. 2017a], PointNet++ [Qi et al. 2017b] and DGCNN [Wang et al. 2018].
To alleviate the rotation variance problem, typical approaches either use a spatial transformer module as in the original PointNet [Qi et al. 2017a] or apply extensive data augmentation during the training phase. However, this requires higher model capacity and brings extra computational burden. Other methods such as Spherical CNN [Esteves et al. 2018] and SFCNN [Rao, Lu, and Zhou 2019] focus on converting the point cloud into special structures to extract a rotation-invariant feature, which may suffer from loss of information.
In this paper, we introduce a novel PCA-RI (PCA Rotation-Invariant) representation that endows deep 3D models with rotation invariance by expressing the point cloud in an intrinsic frame. Such a frame should be stable regardless of arbitrary rotations; in other words, the expressed coordinates do not change no matter how the object rotates. Besides, the intrinsic frame should tolerate small distortions, thus providing a consistent representation for similar objects. Recall that PCA (principal component analysis) is designed to detect the main directions along which the variance of high-dimensional input data is large. These directions encode the intrinsic structure of the input data and are exactly rotation-equivariant, which offers an effective way to define the desired frames.
More specifically, we apply PCA to obtain the three principal components of a point cloud, which are used as the corresponding axes of the intrinsic frame. After that, we project the point cloud onto the new frame and use the transformed coordinates as our PCA-RI representation of the point cloud, as shown in Figure 1. Complete rotation invariance follows immediately from this construction, as proven in later sections. Compared with previous works, our PCA-RI representation has the advantages of simplicity and generality. It can be flexibly embedded into current deep neural networks to fundamentally improve their robustness against rotation transformations.
One concern with our proposed approach is that the direction of each principal component is not determined. Thus, for a point cloud there exist two directions for each frame axis, which we call frame ambiguity. To address this problem, we propose a multi-frame approach that enumerates all the possible frames derived from the principal component analysis. We feed the PCA-RI representations of all these frames to the deep model and aggregate the output features via a self-attention module. In the end, we apply an average-pooling operation after the self-attention module to extract a final feature vector for downstream tasks. To empirically validate the effectiveness of our method, we conduct a comprehensive experimental study on the ModelNet40 [Wu et al. 2015] classification and SHREC'17 [Savva et al. 2017] perturbed retrieval tasks. The experimental results demonstrate that our approach achieves near state-of-the-art performance on the rotation-augmented ModelNet40 [Wu et al. 2015] classification dataset and outperforms other models on the SHREC'17 [Savva et al. 2017] perturbed retrieval task.
In summary, the key contributions of this paper are as follows:

We propose a theoretically rotation-invariant and completely information-lossless point cloud representation.

We further introduce a multi-frame approach based on a self-attention module, which effectively addresses the problem of frame ambiguity.

Extensive experiments further demonstrate the correctness and effectiveness of our method.
Related Work
Deep Learning for 3D Objects
Motivated by the breakthrough results of convolutional neural networks on 2D images, increasing attention has been drawn to developing such methods for geometric data. One intuitive idea is to convert irregular point clouds into regular 3D grids by voxelization [Maturana and Scherer 2015, Qi et al. 2016], since this format is similar to pixels and easy to transfer to existing frameworks. However, it inevitably suffers from loss of resolution and high computational demand. To avoid the shortcomings of naive voxelization, kd-tree [Klokov and Lempitsky 2017] and octree [Riegler, Osman Ulusoy, and Geiger 2017] based methods hierarchically partition space to exploit input sparsity. But these methods focus more on subdivision of a volume than on local geometric structure.
An important architecture that directly processes point clouds is PointNet [Qi et al. 2017a], which adopts spatial transformer networks and a symmetric function to maintain permutation invariance. After that, many point-based learning approaches focus on how to efficiently capture local features on top of PointNet [Qi et al. 2017a]. For instance, PointNet++ [Qi et al. 2017b] applies the PointNet [Qi et al. 2017a] structure to local point sets at different resolutions and accumulates local features in a hierarchical architecture. In DGCNN [Wang et al. 2018], EdgeConv is proposed as a basic block to build networks, in which the edge features between points and their neighbors are exploited.
Self-Attention
Recently, attention mechanisms [Bahdanau, Cho, and Bengio 2014, Xu et al. 2015, Gregor et al. 2015, Yang et al. 2016, Chen et al. 2017] have become an integral part of models that must capture global dependencies. In particular, self-attention [Cheng, Dong, and Lapata 2016, Parikh et al. 2016, Vaswani et al. 2017], also called intra-attention, exhibits a better balance between the ability to model long-range dependencies and computational efficiency. The self-attention module calculates the response at a position as a weighted sum of the features at all positions, where the weights, called attention vectors, are computed at only a small computational cost. Vaswani et al. [Vaswani et al. 2017] further demonstrate that machine translation models can achieve state-of-the-art results by using self-attention alone.
RotationInvariant Network for 3D Objects
Rotation robustness is essential in real-world applications of point cloud processing systems. Previous works have attempted to equip existing neural networks with the property of rotation invariance. A straightforward method is to train a deep model with large amounts of rotation-augmented data. Although data augmentation is effective to some extent, it is computationally expensive during the training phase. Furthermore, a previous study [Esteves et al. 2018] has shown that aggressive data augmentation such as arbitrary 3D rotations of the input data still harms recognition performance. PointNet [Qi et al. 2017a] applies a spatial transformer network (STN) to canonicalize the input data, but further experiments demonstrate that the model with STN still suffers a large performance drop on arbitrarily rotation-augmented 3D datasets.
In closely related works, Esteves et al. [Esteves et al. 2018] propose a special convolutional operation with local rotation invariance, which generalizes well to unseen rotations. Besides, Rao et al. [Rao, Lu, and Zhou 2019] design a trainable neural network that adaptively projects the original points onto a fractal structure, which makes their model resistant to arbitrary rotations. While the theoretical foundations of these approaches are well studied, they project the input onto spherical shapes or special structures, which may suffer from loss of information. ClusterNet [Chen et al. 2019] introduces a point cloud representation built from rigorously rotation-invariant operators such as inner products between points. Although ClusterNet [Chen et al. 2019] claims its representation is conditionally information-lossless, its experiments on ModelNet40 [Wu et al. 2015] reveal that this representation still degrades performance on 3D tasks.
Approach
In this section, we first introduce our PCA-RI representation for point clouds based on principal component analysis. Then we explain how to address the problem of frame ambiguity in a deep neural network by multi-frame fusion based on a self-attention module. In the end, we present how our method can be embedded into deep 3D models.
PCA-RI Representation
The main idea of our method is to find an intrinsic frame determined by the object shape. The intrinsic frame should provide the same representation for all rotated clones of an identical object. In addition, it should be capable of tolerating small distortions of the object shape; that is, similar objects will yield similar frames and representations. To this end, we propose a PCA-RI (PCA Rotation-Invariant) representation based on classical principal component analysis.
Let $P = \{p_1, p_2, \dots, p_n\} \subset \mathbb{R}^3$ denote a point cloud, which directly encodes the coordinates of the points sampled on the object surface. Note that the coordinate value of each point depends on the selection of the coordinate system, namely the frame. The intrinsic frame is a frame that can be automatically detected from the object structure.

Use $\mu = \frac{1}{n}\sum_{i=1}^{n} p_i$ to denote the mean of the point cloud and $\Sigma$ to denote its corresponding covariance matrix, which is a positive semi-definite symmetric matrix. Then $\Sigma$ can be calculated as follows:

$$\Sigma = \frac{1}{n}\sum_{i=1}^{n}(p_i - \mu)(p_i - \mu)^{T} \quad (1)$$

Then we use eigendecomposition to find the eigenvectors of the covariance matrix $\Sigma$, which satisfy the following equation:

$$\Sigma v = \lambda v \quad (2)$$

Obviously, there are three eigenvalues, denoted as $\lambda_1 \ge \lambda_2 \ge \lambda_3$, with three corresponding unit-normalized eigenvectors $v_1, v_2, v_3$. After that, we use $v_1, v_2, v_3$ to define the intrinsic frame and express the point cloud in the new frame, ordered by $\lambda_1 \ge \lambda_2 \ge \lambda_3$:

$$\hat{p}_i = [v_1, v_2, v_3]^{T}(p_i - \mu) \quad (3)$$

in which $\hat{p}_i$ represents the redefined coordinate value in the intrinsic frame. We now prove that the intrinsic coordinates $\hat{p}_i$ do not change under rigid rotations.

Suppose the point cloud is rotated in the original frame, giving a rotated point cloud $P' = \{p'_1, \dots, p'_n\}$ with $p'_i = R p_i$ (and hence $\mu' = R\mu$), where $R$ represents a rigid rotation matrix. It is not hard to see that the corresponding covariance matrix $\Sigma'$ of $P'$ satisfies

$$\Sigma' = R \Sigma R^{T} \quad (4)$$

Obviously, we have:

$$\Sigma' (R v_j) = R \Sigma R^{T} R v_j = R \Sigma v_j = \lambda_j (R v_j), \quad j = 1, 2, 3 \quad (5)$$

which means $\lambda_j$ and $R v_j$ are the eigenvalues and eigenvectors of $\Sigma'$, respectively. Denote the redefined coordinates of $P'$ in its intrinsic frame as $\hat{p}'_i$. Thus we have:

$$\hat{p}'_i = [R v_1, R v_2, R v_3]^{T}(p'_i - \mu') = [v_1, v_2, v_3]^{T} R^{T} R (p_i - \mu) = \hat{p}_i \quad (6)$$
As shown in Equation (6), the redefined coordinate value of each point in the intrinsic frame remains invariant no matter how the point cloud is rotated. From a geometric perspective, our PCA-RI representation merely adjusts the arbitrarily rotated clones of an identical point cloud to a consistent pose. This reveals that our PCA-RI representation is completely information-lossless. Apart from this, our approach is very general and can be applied to current neural network architectures.
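As a concrete illustration, the projection of Equation (3) and the invariance of Equation (6) can be checked numerically. The following numpy sketch is our own illustration (the function name and test data are assumptions, not the authors' code); it compares the two representations up to the eigenvector sign ambiguity discussed later.

```python
import numpy as np

def pca_ri(points):
    """Express a centered point cloud in its PCA intrinsic frame (Eq. 3).

    points: (n, 3) array with centroid at the origin.
    """
    cov = points.T @ points / len(points)      # covariance matrix (Eq. 1)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    V = eigvecs[:, np.argsort(eigvals)[::-1]]  # columns v1, v2, v3, descending
    return points @ V                          # row-wise version of V^T p_i

rng = np.random.default_rng(0)
P = rng.normal(size=(1024, 3)) * np.array([3.0, 2.0, 1.0])  # distinct variances
P -= P.mean(axis=0)

Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
if np.linalg.det(Q) < 0:                      # make it a proper rotation
    Q[:, 0] = -Q[:, 0]
P_rot = P @ Q.T                               # rotate every point: p' = Q p

# Up to the eigenvector sign ambiguity, both representations agree.
assert np.allclose(np.abs(pca_ri(P)), np.abs(pca_ri(P_rot)), atol=1e-6)
```

The absolute value in the final check absorbs the per-axis sign ambiguity; the multi-frame strategy described below handles it within the model instead.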
Note that the above rotation invariance assumes that there are three distinct eigenvalues, so that we can define the axes according to the order $\lambda_1 > \lambda_2 > \lambda_3$, which we call axis significance. If the axis significance is weak, i.e. the eigenvalues are close, we will not be able to detect a stable intrinsic frame. For instance, if the shape is composed of three intersecting orthogonal lines of the same length, then the covariance matrix is a scaled identity matrix and $\lambda_1 = \lambda_2 = \lambda_3$. This results in infinitely many intrinsic frames, and the property of rotation invariance no longer holds. Fortunately, axis significance is preserved in general cases, as shown in the later experiments.

Another noteworthy point is whether the intrinsic frames are consistent across intra-class objects; for example, whether the axes of the intrinsic frames for desks are all roughly along the edges. We argue that for the same category the principal components are close, thus keeping the frames consistent. As shown in Figure 2, we list some examples from the cup, chair and lamp categories. For each category, the first row shows the manually aligned objects while the second row shows the objects aligned with our intrinsic frames. It is not hard to see that the intrinsic frames are consistent across the intra-class samples in most cases. Note that we still cannot achieve absolute alignment as with manual alignment, because PCA pays more attention to the data distribution of the point cloud. Despite this, theoretical analysis and extensive experiments demonstrate that the canonicalization of our PCA-RI representation essentially reduces the learning difficulty of the neural network by replacing infinitely many rotation attitudes with a few fixed poses while keeping the original point cloud information intact.
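Axis significance can be checked with a simple eigenvalue-ratio heuristic, sketched below in numpy. The 0.8 threshold echoes the ratio statistic reported in the experiments section, but the exact criterion here is our assumption, not the paper's.

```python
import numpy as np

def axis_significance(points, threshold=0.8):
    """Heuristic check that the PCA axes are well ordered: both
    consecutive eigenvalue ratios must fall below a threshold
    (an assumed criterion for illustration)."""
    cov = np.cov(points, rowvar=False, bias=True)
    lam = np.sort(np.linalg.eigvalsh(cov))[::-1]   # lambda1 >= lambda2 >= lambda3
    return lam[1] / lam[0] < threshold and lam[2] / lam[1] < threshold

# An elongated box has well-separated eigenvalues...
box = np.random.default_rng(1).uniform(-1, 1, size=(2048, 3)) * [4.0, 2.0, 1.0]
assert axis_significance(box)

# ...while a near-isotropic ball does not.
ball = np.random.default_rng(2).normal(size=(2048, 3))
assert not axis_significance(ball)
```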
Frame Ambiguity Elimination
One concern with our proposed method is that when we define the new coordinates using eigendecomposition, the direction of each eigenvector $v_j$ is not determined, as the following equation also holds:

$$\Sigma (-v) = \lambda (-v) \quad (7)$$

Specifically, the eigenvector computation provides no means of assessing the sign of each eigenvector, so each individual eigenvector has an arbitrary sign. This means that for an identical point cloud, there exist two directions for each frame axis, which we denote as frame ambiguity. Figure 3 illustrates the phenomenon of frame ambiguity.
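Since each of the three axes carries an independent sign, the candidate frames can be enumerated directly. This numpy sketch (our illustration, not the authors' code) generates one PCA-RI representation per sign assignment of the three axes.

```python
import itertools
import numpy as np

def candidate_frames(points):
    """Enumerate all 2^3 = 8 sign assignments of the principal axes.

    Returns a list of eight (n, 3) PCA-RI representations, one per
    candidate intrinsic frame.
    """
    cov = points.T @ points / len(points)
    eigvals, eigvecs = np.linalg.eigh(cov)
    V = eigvecs[:, np.argsort(eigvals)[::-1]]      # v1, v2, v3 as columns
    reps = []
    for signs in itertools.product([1.0, -1.0], repeat=3):
        reps.append(points @ (V * np.array(signs)))  # flip chosen axes
    return reps

P = np.random.default_rng(3).normal(size=(512, 3)) * [3.0, 2.0, 1.0]
P -= P.mean(axis=0)
reps = candidate_frames(P)
assert len(reps) == 8
# Flipping a sign only negates the corresponding coordinate axis.
assert np.allclose(reps[0][:, 0], -reps[4][:, 0])
```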
To address the issue, we adopt a multi-frame approach to fuse the results. Denote by a function $f$ the deep model that we are going to endow with rotation invariance; $f(\hat{P})$ denotes the feature vector generated by the deep model given the input point cloud $\hat{P}$. Here we suppose the centroid of the input point cloud is at the origin. Since each of the three axes has two possible directions, there are $2^3 = 8$ candidate frames, yielding PCA-RI representations $\hat{P}^{(1)}, \dots, \hat{P}^{(8)}$. Our fusion scheme can be abstracted as follows:

$$F = g\left(f(\hat{P}^{(1)}), f(\hat{P}^{(2)}), \dots, f(\hat{P}^{(8)})\right) \quad (8)$$

in which we introduce a fusion function $g$ to obtain a final feature descriptor $F$ from the multiple PCA-RI representations, with $\hat{P}^{(k)}$ denoting the representation under the $k$-th frame.

In order to achieve absolute rotation invariance, we require the fusion function $g$ to be independent of the order of the frames; that is, writing $x_k = f(\hat{P}^{(k)})$, for any permutation $\pi$ of $\{1, \dots, 8\}$:

$$g(x_1, \dots, x_8) = g(x_{\pi(1)}, \dots, x_{\pi(8)}) \quad (9)$$
A naive approach is to directly apply an average- or max-pooling operation on $x_1, \dots, x_8$. However, we find that direct pooling disregards much of the relationship among the features, which limits the discriminability of the final feature. To alleviate this problem, we apply a self-attention module derived from [Vaswani et al. 2017] before the pooling layer to pay more attention to the relationships between the multi-frame features. Following the notation of [Vaswani et al. 2017], the transformed feature derived from the self-attention module can be expressed as follows, with shared parameter matrices $W_Q$, $W_K$ and $W_V$, where $d$ is the dimension of the key vectors:

$$y_i = \sum_{j=1}^{8} \mathrm{softmax}_j\!\left(\frac{(W_Q x_i)^{T}(W_K x_j)}{\sqrt{d}}\right) W_V x_j \quad (10)$$

From Equation (10) we can observe that the attention module allocates weights to the multi-frame features and accumulates the weighted features.

Use $\mathrm{SA}$ to denote the self-attention transformation as follows:

$$y_i = \mathrm{SA}(x_i; x_1, \dots, x_8) \quad (11)$$

We care about whether $y_i$ is invariant to the input order of $x_1, \dots, x_8$, as Equation (12) shows:

$$\mathrm{SA}(x_i; x_1, \dots, x_8) = \mathrm{SA}(x_i; x_{\pi(1)}, \dots, x_{\pi(8)}) \quad (12)$$
Fortunately, this equation holds because the sum in Equation (10) is independent of the order of its terms. In summary, the transformed features derived from the self-attention module are independent of the order of the input frames.
With these transformed features $y_1, \dots, y_8$, we further apply an average-pooling operation to obtain a final feature vector $F$ for further processing, which can be summarized as follows:

$$F = \frac{1}{8} \sum_{i=1}^{8} y_i \quad (13)$$
Note that we adopt the average-pooling operation as it achieves better performance than max-pooling in our experiments.
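The fusion scheme of Equations (10)-(13) can be sketched as a single-head self-attention layer over the eight frame features followed by average pooling. The numpy code below is our simplified illustration with random weights (not the trained module); it also verifies the order-independence property of Equation (9) numerically.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_and_pool(X, Wq, Wk, Wv):
    """Single-head self-attention over the eight frame features,
    followed by average pooling (sketch of Eqs. (10)-(13))."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=-1)  # (8, 8) attention weights
    Y = A @ V                                            # transformed features y_i
    return Y.mean(axis=0)                                # average pooling -> F

rng = np.random.default_rng(4)
d = 16
X = rng.normal(size=(8, d))                  # one feature vector per candidate frame
Wq, Wk, Wv = [rng.normal(size=(d, d)) for _ in range(3)]

out = attend_and_pool(X, Wq, Wk, Wv)
perm = rng.permutation(8)
out_perm = attend_and_pool(X[perm], Wq, Wk, Wv)
assert np.allclose(out, out_perm)            # the fused feature is order-independent
```

Permuting the rows of X only permutes the attention matrix symmetrically, so the pooled output is unchanged, which is exactly the requirement of Equation (9).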
Embedded into Deep Architectures
Table 1: Classification accuracy (%) on ModelNet40 under different train/test rotation settings (z: azimuthal rotations; SO3: arbitrary rotations).

Method | input | z/z | SO3/SO3 | z/SO3
SubVolSup MO [Qi et al. 2016] | voxel | 89.5 | 85.0 | 45.5
Spherical CNN [Esteves et al. 2018] | projected voxel | 88.9 | 86.9 | 76.7
MVCNN 80x [Su et al. 2015] | views | 90.2 | 86.0 | 81.5
RotationNet 20x [Kanezaki, Matsushita, and Nishida 2018] | views | 92.4 | 80.0 | 20.2
PointNet [Qi et al. 2017a] | xyz | 89.2 | 83.6 | 14.7
PointNet++ [Qi et al. 2017b] | xyz | 89.3 | 85.0 | 28.6
SFCNN [Rao, Lu, and Zhou 2019] | xyz | 91.4 | 90.1 | 84.8
ClusterNet [Chen et al. 2019] | xyz | 87.1 | 87.1 | 87.1
DGCNN [Wang et al. 2018] | xyz | 91.9 | 88.3 | 37.8
DGCNN (without STN) [Wang et al. 2018] | xyz | 91.6 | 88.1 | 36.3
Ours (single-frame) | xyz | 89.1 | 89.1 | 89.1
Ours (multi-frame) | xyz | 89.8 | 89.8 | 89.8
Table 2: Retrieval results on the SHREC'17 perturbed dataset. The first five metric columns are computed in the micro setting and the next five in the macro setting; score is the average of the micro and macro mAP.

Method | P@N | R@N | F1@N | mAP | NDCG | P@N | R@N | F1@N | mAP | NDCG | score
Furuya [Furuya and Ohbuchi 2016] | 0.814 | 0.683 | 0.706 | 0.656 | 0.754 | 0.607 | 0.539 | 0.503 | 0.476 | 0.560 | 0.566
Tatsuma [Tatsuma and Aono 2009] | 0.705 | 0.769 | 0.719 | 0.696 | 0.783 | 0.424 | 0.563 | 0.434 | 0.418 | 0.479 | 0.557
Zhou [Bai et al. 2016] | 0.660 | 0.650 | 0.643 | 0.567 | 0.701 | 0.443 | 0.508 | 0.437 | 0.406 | 0.513 | 0.487
Spherical CNN [Esteves et al. 2018] | 0.717 | 0.737 | – | 0.685 | – | 0.450 | 0.550 | – | 0.444 | – | 0.565
SFCNN [Rao, Lu, and Zhou 2019] | 0.778 | 0.751 | 0.752 | 0.705 | 0.813 | 0.656 | 0.539 | 0.536 | 0.483 | 0.580 | 0.594
DGCNN (without STN) [Wang et al. 2018] | 0.768 | 0.717 | 0.719 | 0.672 | 0.782 | 0.640 | 0.527 | 0.515 | 0.449 | 0.564 | 0.561
DGCNN [Wang et al. 2018] | 0.774 | 0.723 | 0.725 | 0.679 | 0.789 | 0.640 | 0.531 | 0.521 | 0.454 | 0.567 | 0.567
Ours (single-frame) | 0.789 | 0.738 | 0.739 | 0.703 | 0.803 | 0.671 | 0.546 | 0.539 | 0.479 | 0.585 | 0.591
Ours (multi-frame) | 0.801 | 0.747 | 0.749 | 0.714 | 0.814 | 0.679 | 0.563 | 0.553 | 0.495 | 0.601 | 0.605
As we have claimed, our method can be flexibly embedded into current neural architectures. In this part, we adopt DGCNN [Wang et al. 2018] as our basic architecture and demonstrate how to endow it with rotation invariance.
The extended architecture, depicted in Figure 4, consists of four modules: a PCA-RI representation module, an EdgeConv module, a self-attention module and a classification module. The EdgeConv module contains eight EdgeConv blocks, which share the same weight parameters. Each block consists of five layers with output sizes 64, 64, 64, 128 and 1024, respectively. Since our PCA-RI representation already maintains rotation invariance, we remove the spatial transformer network (STN) of DGCNN [Wang et al. 2018], as the STN is mainly designed to make the model resistant to affine transformations.
For each input point cloud, we first convert it into eight PCA-RI representations and feed them to the eight EdgeConv blocks during the training phase. These blocks produce eight output features, which are aggregated by a self-attention module followed by an average-pooling layer to obtain a final feature for the downstream tasks. For simplicity, we denote this model as our multi-frame model.
Another possible architecture is to apply only one EdgeConv block and remove the self-attention module and the average-pooling layer, with the other parts unchanged. We call this the single-frame model. During the training phase, its PCA-RI module randomly selects one of the eight representations as input, which also improves the rotation robustness of the model, as shown in the following experiments.
Note that our approach does not need any rotation augmentation of the training data, which greatly reduces the computational burden.
Experiments
In this section, we first use ModelNet40 [Wu et al. 2015] as the benchmark for the 3D classification task. Next, we conduct experiments on ShapeNet Core55 [Chang et al. 2015] for the retrieval task. In the end, we provide ablation analysis of our approach.
ModelNet 3D Shape Classification
We first evaluate the rotation robustness of our proposed method on the ModelNet40 [Wu et al. 2015] benchmark for the 3D classification task and compare it with other state-of-the-art 3D shape classification models.
Data
ModelNet40 [Wu et al. 2015] is used as the benchmark for 3D classification. The dataset consists of 12,311 CAD models from 40 man-made object categories. We use the standard split following PointNet [Qi et al. 2017a], where 9,843 models are used for training and 2,468 for testing. Since each CAD model in ModelNet40 [Wu et al. 2015] is composed of many mesh faces, we uniformly sample 2,048 points from them with respect to face area and then shift and normalize each point cloud so that its centroid lies at the origin. Only the (x, y, z) coordinates of the sampled points are used and the original meshes are discarded.
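The sampling step can be sketched as standard area-weighted triangle sampling with uniform barycentric coordinates. This numpy sketch is our own illustration of the preprocessing described above; the authors' exact sampler is not specified.

```python
import numpy as np

def sample_mesh(vertices, faces, n_points=2048, seed=0):
    """Sample points uniformly w.r.t. face area, then center the cloud."""
    rng = np.random.default_rng(seed)
    tri = vertices[faces]                                  # (m, 3, 3) triangles
    areas = 0.5 * np.linalg.norm(
        np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1)
    idx = rng.choice(len(faces), size=n_points, p=areas / areas.sum())
    # Uniform barycentric coordinates inside each chosen triangle.
    u, v = rng.random(n_points), rng.random(n_points)
    flip = u + v > 1.0
    u[flip], v[flip] = 1.0 - u[flip], 1.0 - v[flip]
    t = tri[idx]
    pts = t[:, 0] + u[:, None] * (t[:, 1] - t[:, 0]) + v[:, None] * (t[:, 2] - t[:, 0])
    return pts - pts.mean(axis=0)                          # centroid at the origin

# Toy mesh: a unit square split into two triangles.
verts = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float)
faces = np.array([[0, 1, 2], [0, 2, 3]])
pts = sample_mesh(verts, faces, n_points=1000)
assert pts.shape == (1000, 3)
assert np.allclose(pts.mean(axis=0), 0.0)
```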
Results
Following Spherical CNN [Esteves et al.2018], we evaluate our model using three different settings: 1) training and testing with azimuthal rotations (z/z), 2) training and testing with arbitrary rotations (SO3/SO3), and 3) training with azimuthal rotations while testing with arbitrary rotations (z/SO3).
Table 1 shows the comparisons between our proposed method and previous methods. All competing methods trained with azimuthal rotation augmentation suffer a sharp drop on the arbitrarily rotation-augmented test set, even the SO(3)-equivariant methods Spherical CNN [Esteves et al. 2018] (2% and 12.2% drops in the SO3/SO3 and z/SO3 settings, respectively) and SFCNN [Rao, Lu, and Zhou 2019] (1.3% and 6.6% drops, respectively), while our approach maintains the same performance across all settings. Furthermore, the table illustrates that rotation augmentation can indeed improve the rotation robustness of models but still leaves a large margin to our proposed method and SFCNN [Rao, Lu, and Zhou 2019] in the SO3/SO3 setting.
Note that SFCNN [Rao, Lu, and Zhou 2019] achieves 0.3% better performance than ours in the SO3/SO3 setting. Nevertheless, SFCNN [Rao, Lu, and Zhou 2019] has to apply a complicated operation to project the point cloud onto a fractal structure, which may lose information from the original point cloud. Given the rather simple architecture of our model and the information-lossless input representation we use, we interpret our performance as strong empirical support for the effectiveness of our method.
SHREC’17 3D Shape Retrieval
We also conduct 3D shape retrieval experiments on the ShapeNet Core55 [Chang et al. 2015] benchmark using its perturbed dataset, in which each model is perturbed by a random SO(3) rotation.
Data
The ShapeNet Core55 [Chang et al. 2015] benchmark has two evaluation datasets: normal and perturbed. In the normal dataset, all models are consistently aligned, while in the perturbed dataset each model has been rotated by a uniformly sampled random rotation. To validate the rotation robustness of our approach, we only consider the perturbed dataset, which contains a total of 51,190 3D models in 55 categories. 70% of the dataset is used for training, 10% for validation and 20% for testing.
Results
Following the experimental settings of Spherical CNN [Esteves et al. 2018], we train the classification model on the 55 core classes with joint supervision of a triplet loss and a softmax loss. We use the output of the layer before the score prediction layer as our feature vector and compute the distance between samples by cosine similarity.
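The retrieval step itself reduces to nearest-neighbor ranking under cosine similarity, which can be sketched as follows (our illustration with random features; the function names are assumptions).

```python
import numpy as np

def retrieve(query_feat, gallery_feats, top_k=5):
    """Rank gallery samples by cosine similarity to the query feature."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity against every gallery item
    return np.argsort(-sims)[:top_k]  # indices of the most similar items

rng = np.random.default_rng(5)
gallery = rng.normal(size=(100, 32))
query = gallery[7] + 0.01 * rng.normal(size=32)  # near-duplicate of item 7
assert retrieve(query, gallery)[0] == 7
```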
SHREC'17 [Savva et al. 2017] provides several evaluation metrics, including precision, recall, F1, mAP and normalized discounted cumulative gain (NDCG). These metrics are computed in both the micro and macro settings. We evaluate our method and compare it to prior models using the official metrics. In addition, following [Savva et al. 2017], we use the average of the micro and macro mAP as the final score to rank performance.
In Table 2, comprehensive comparisons between our approach and various state-of-the-art methods are presented. As we can see, our approach outperforms all other models, including the previous state-of-the-art SFCNN [Rao, Lu, and Zhou 2019], in both the macro and micro settings in terms of most metrics. More importantly, our method is more scalable and flexible, requiring no extra complicated operations.
Ablation Analysis
Analysis of Architecture
Since our PCA-RI representation is compatible with many architectures that process point clouds, we further enhance PointNet [Qi et al. 2017a] and PointNet++ [Qi et al. 2017b] with our PCA-RI representation. As shown in Table 3, DGCNN (without STN) [Wang et al. 2018], PointNet [Qi et al. 2017a] and PointNet++ [Qi et al. 2017b] enhanced with the PCA-RI representation outperform the original models by a large margin on the arbitrarily rotation-augmented ModelNet40 [Wu et al. 2015] classification task.
Table 3: Classification accuracy (%) on arbitrarily rotation-augmented ModelNet40 for the original models and the models enhanced with our PCA-RI representation.

Method | Model | Accuracy (%)
Original model | PointNet | 82.3
Original model | PointNet++ | 85.0
Original model | DGCNN (without STN) | 87.4
Ours (single-frame) | PointNet | 85.7
Ours (single-frame) | PointNet++ | 87.4
Ours (single-frame) | DGCNN (without STN) | 88.2
Ours (multi-frame) | PointNet | 86.2
Ours (multi-frame) | PointNet++ | 87.9
Ours (multi-frame) | DGCNN (without STN) | 88.8
Analysis of SelfAttention Module
For our multi-frame approach, how to aggregate the features from all intrinsic frames is important for extracting a discriminative feature for further processing. As shown in Table 4, our experimental results demonstrate that directly applying a pooling layer to the individual features can be improved by adding a self-attention module before the pooling layer. In addition, the results illustrate that the average-pooling scheme is more robust and beneficial for our classification tasks than a max-pooling operation.
Analysis of Frame Stability
An important requirement for our approach is that the intrinsic frame should be stable: for an identical object, we expect the intrinsic frames derived from differently sampled point clouds to remain consistent. To this end, we collect statistics on the stability with respect to sampling. Our experiment shows that the average rotation angle between the intrinsic frames derived from two point clouds sampled from the same mesh is small for both ModelNet40 [Wu et al. 2015] and ShapeNet Core55 [Chang et al. 2015], indicating that the sampled point clouds have consistent intrinsic frames.
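One natural way to measure the rotation angle between two intrinsic frames is the geodesic distance between the corresponding rotation matrices. The formulation below is ours and may differ from the authors' exact statistic.

```python
import numpy as np

def frame_angle(V1, V2):
    """Geodesic angle (degrees) between two rotation matrices whose
    columns are intrinsic-frame axes (our formulation for illustration)."""
    R = V1 @ V2.T
    cos = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)  # trace formula for SO(3)
    return np.degrees(np.arccos(cos))

identity = np.eye(3)
# A 10-degree rotation about the z-axis should measure exactly 10 degrees.
c, s = np.cos(np.radians(10)), np.sin(np.radians(10))
Rz = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
assert np.isclose(frame_angle(Rz, identity), 10.0)
```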
As aforementioned, another factor influencing frame stability is axis significance, which is related to how different the eigenvalues are. We plot the distributions of the eigenvalue ratios in Figure 5 for ModelNet40 [Wu et al. 2015] and ShapeNet Core55 [Chang et al. 2015]. Our experimental results demonstrate that more than eighty percent of the point clouds have a significant axis order, with eigenvalue ratios smaller than 0.8.
Conclusion
Table 4: Comparison of fusion schemes for the multi-frame features on ModelNet40 classification.

Input Size | Method | Accuracy (%)
– | Max pooling | 87.9
– | Avg pooling | 88.4
– | Self-attention + max pooling | 88.5
– | Self-attention + avg pooling | 88.8
– | Max pooling | 89.1
– | Avg pooling | 89.5
– | Self-attention + max pooling | 89.5
– | Self-attention + avg pooling | 89.8
In this paper, we introduce a rotation-invariant representation based on principal component analysis to enhance the rotation robustness of deep 3D models. To handle the sign ambiguity of eigenvectors, we adopt a multi-frame strategy that aggregates all the feature vectors via a self-attention mechanism, which preserves the property of rotation invariance theoretically while achieving better performance than direct pooling. Despite its simplicity, our approach is very effective and can be easily embedded into 3D deep models. Extensive experimental results on the ModelNet40 and ShapeNet Core55 benchmarks demonstrate the superiority of our novel representation.
References
[Bahdanau, Cho, and Bengio 2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

[Bai et al. 2016] Bai, S.; Bai, X.; Zhou, Z.; Zhang, Z.; and Jan Latecki, L. 2016. GIFT: A real-time and scalable 3D shape search engine. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5023–5032.
[Chang et al. 2015] Chang, A. X.; Funkhouser, T.; Guibas, L.; Hanrahan, P.; Huang, Q.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; et al. 2015. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012.
[Chen et al. 2017] Chen, X.; Mishra, N.; Rohaninejad, M.; and Abbeel, P. 2017. PixelSNAIL: An improved autoregressive generative model. arXiv preprint arXiv:1712.09763.

[Chen et al. 2019] Chen, C.; Li, G.; Xu, R.; Chen, T.; Wang, M.; and Lin, L. 2019. ClusterNet: Deep hierarchical cluster network with rigorously rotation-invariant representation for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4994–5002.
[Cheng, Dong, and Lapata 2016] Cheng, J.; Dong, L.; and Lapata, M. 2016. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733.
[Esteves et al. 2018] Esteves, C.; Allen-Blanchette, C.; Makadia, A.; and Daniilidis, K. 2018. Learning SO(3) equivariant representations with spherical CNNs. In Proceedings of the European Conference on Computer Vision (ECCV), 52–68.
[Furuya and Ohbuchi 2016] Furuya, T., and Ohbuchi, R. 2016. Deep aggregation of local 3D geometric features for 3D model retrieval. In BMVC.
[Gregor et al. 2015] Gregor, K.; Danihelka, I.; Graves, A.; Rezende, D. J.; and Wierstra, D. 2015. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.

[Kanezaki, Matsushita, and Nishida 2018] Kanezaki, A.; Matsushita, Y.; and Nishida, Y. 2018. RotationNet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5010–5019.
[Klokov and Lempitsky 2017] Klokov, R., and Lempitsky, V. 2017. Escape from cells: Deep kd-networks for the recognition of 3D point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, 863–872.
[Maturana and Scherer 2015] Maturana, D., and Scherer, S. 2015. VoxNet: A 3D convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 922–928. IEEE.
[Parikh et al. 2016] Parikh, A. P.; Täckström, O.; Das, D.; and Uszkoreit, J. 2016. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933.
[Qi et al. 2016] Qi, C. R.; Su, H.; Nießner, M.; Dai, A.; Yan, M.; and Guibas, L. J. 2016. Volumetric and multi-view CNNs for object classification on 3D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5648–5656.

[Qi et al. 2017a] Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017a. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 652–660.
[Qi et al. 2017b] Qi, C. R.; Yi, L.; Su, H.; and Guibas, L. J. 2017b. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, 5099–5108.
[Rao, Lu, and Zhou 2019] Rao, Y.; Lu, J.; and Zhou, J. 2019. Spherical fractal convolutional neural networks for point cloud recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 452–460.
[Riegler, Osman Ulusoy, and Geiger 2017] Riegler, G.; Osman Ulusoy, A.; and Geiger, A. 2017. OctNet: Learning deep 3D representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3577–3586.
[Savva et al. 2017] Savva, M.; Yu, F.; Su, H.; Kanezaki, A.; Furuya, T.; Ohbuchi, R.; Zhou, Z.; Yu, R.; Bai, S.; Bai, X.; et al. 2017. Large-scale 3D shape retrieval from ShapeNet Core55: SHREC'17 track. In Proceedings of the Workshop on 3D Object Retrieval, 39–50. Eurographics Association.
[Xu et al. 2015] Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044.
[Su et al. 2015] Su, H.; Maji, S.; Kalogerakis, E.; and Learned-Miller, E. 2015. Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, 945–953.

[Tatsuma and Aono 2009] Tatsuma, A., and Aono, M. 2009. Multi-Fourier spectra descriptor and augmentation with spectral clustering for 3D shape retrieval. The Visual Computer 25(8):785–804.
[Vaswani et al. 2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
[Wang et al. 2018] Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S. E.; Bronstein, M. M.; and Solomon, J. M. 2018. Dynamic graph CNN for learning on point clouds. arXiv preprint arXiv:1801.07829.
[Wu et al. 2015] Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; and Xiao, J. 2015. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1912–1920.
[Yang et al. 2016] Yang, Z.; He, X.; Gao, J.; Deng, L.; and Smola, A. 2016. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 21–29.