3D shape representation learning plays a central role in shape analysis and understanding, with a wide range of applications such as shape classification [28, 13, 19], retrieval [9, 5, 10], semantic segmentation [24, 21] and instance segmentation [41, 15]. Among the multiple representation forms of 3D shapes, 3D point clouds, owing to their easy accessibility, have become one of the most popular forms in recent years. Specifically, a point cloud consists of a set of unordered points, each of which is composed of 3D coordinates, possibly with additional attributes such as normal, color and material.
However, learning discriminative shape representations directly on point clouds is still challenging in 3D shape analysis and understanding. Recent studies on learning point cloud representations usually involve the following three steps. Each input point cloud is first split into local regions. Then, the features of these local regions are extracted using a shared Multi-Layer Perceptron (MLP) or kd-trees. Finally, the extracted local region features are aggregated into a global feature vector as the shape representation [29, 23, 24]. Most of the previous methods focus on enhancing the process of local region feature extraction, while often employing a simple pooling-based layer [28, 29, 23] to aggregate the extracted features. However, such pooling-based feature aggregation methods do not adequately take the spatial relationships among local regions into account. So far, how to aggregate the learned local region features together with their spatial relationships remains a challenge for existing methods of point cloud representation learning. In this paper, we first argue the importance of learning spatial relationships for aggregating local region features, for the following two reasons. (1) For point clouds with similar local regions, the differences in the spatial arrangements of these local regions are important for learning discriminative features. (2) Considering the permutation invariant nature of point clouds, it is important to learn the intrinsic spatial relationship between each part and the whole, in order to constitute permutation invariant knowledge for point cloud recognition.
The usual practice of pooling-based methods is to extract the most significant characteristics (such as the engine of an airplane) in local regions of the point cloud step by step, through a deep neural network with a pooling structure (such as max-pooling). However, pooling filters out the spatial relationships between different areas on the feature map. Thus, it only considers the existence of characteristics in local regions, while the spatial arrangements between these regions are not preserved. As a result, most of the existing methods fail to learn the spatial relationships among the local regions, which further limits the ability of the network to learn discriminative 3D shape representations.
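The limitation described above can be illustrated with a minimal Python sketch. It is not taken from the paper: the helper `max_pool` and the toy feature values are hypothetical, chosen only to show that an element-wise maximum yields the same global vector regardless of which region each feature came from.

```python
def max_pool(region_features):
    """Element-wise maximum over a list of equal-length feature vectors."""
    return [max(column) for column in zip(*region_features)]

# Three hypothetical local-region features (toy values, not from the paper).
regions = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.3, 0.7]]

# The same features assigned to the regions in a different arrangement.
rearranged = [regions[2], regions[0], regions[1]]

pooled_original = max_pool(regions)
pooled_rearranged = max_pool(rearranged)

# Both arrangements collapse to the same global vector.
arrangement_lost = pooled_original == pooled_rearranged
```

Because the pooled vector records only which feature responses exist, two shapes built from the same parts in different layouts become indistinguishable after this step.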
To address the aforementioned problem, we propose a novel deep learning network, named Point2SpatialCapsule, for aggregating geometric features and spatial relationships of local regions on point clouds, which aims to learn more discriminative shape representations. Inspired by the recently developed capsule network, Point2SpatialCapsule employs dynamic routing to aggregate the local region features and their spatial relationships. Fig. 1 compares our dynamic routing based Point2SpatialCapsule with previous feature aggregation methods such as the max-pooling used in PointNet. Note that the max-pooling feature aggregation in PointNet only considers the existence of characteristics in local regions, whereas our Point2SpatialCapsule can explicitly handle the spatial relationships between local region features through the dynamic routing of the capsule network. This advantage encourages us to adopt the capsule network for 3D point cloud representation learning.
However, the original implementation of the capsule network is designed for 2D image recognition, where the log priors in the capsule network are bound to fixed locations on the 2D feature maps. In contrast, for 3D point clouds, the locations of randomly sampled input points are disordered and their absolute position coordinates may not remain consistent. As a result, it is difficult to find a direct mapping that can generate features encoded with fixed spatial locations. What is worse, the previous capsule based methods failed to address this problem, and most of them directly generate the capsules from a single global feature vector. Such practice leads to the loss of spatial relationships between local regions. As a result, the log priors in the routing algorithm between capsules cannot learn the spatial relationships of local regions, which greatly limits the representation ability of capsules. In this paper, we argue the importance of encoding fixed spatial locations into capsules, which aims to efficiently utilize the representation ability of log priors for learning the spatial relationships between local regions on point clouds.
In order to solve the above limitations, two novel modules are specially designed in Point2SpatialCapsule to achieve local region feature aggregation as follows. (1) The first module, named geometric feature aggregation, aims to aggregate the extracted local region features in the feature space. Here, the term “geometric” indicates that this module aggregates the geometric information, such as the coordinates of central points and the shapes of local regions represented by the feature vectors, into the centers of local feature clusters, which aims to resolve the disorder problem of local regions. (2) The second module, named spatial relationship aggregation, applies the routing algorithm on the learned feature clusters through spatial-aware capsules. The term “spatial-aware” indicates that the capsules are encoded with the spatial locations, which guarantees a direct mapping between the log priors and the fixed locations in the 3D space and allows the network to efficiently learn the spatial relationships between local regions. Fig. 2 illustrates the advantage of spatial relationship aggregation. Because of the shifting and rotation of point clouds, the changing locations of local region features in 3D space also change the log priors. To resolve this issue, the geometric feature aggregation clusters the input local region features into learnable cluster centers, which are independent of the input points and relatively invariant in the feature space. Therefore, the routing algorithm can efficiently learn the log priors for aggregating the spatial relationships between local regions. Our main contributions are summarized as follows.
We propose a novel deep network, i.e. Point2SpatialCapsule, for learning more discriminative shape representations of point clouds. Compared with the traditional pooling-based methods, Point2SpatialCapsule can explicitly learn not only geometric features of local regions but also the spatial relationships among them.
We propose the geometric feature aggregation to resolve the disorder problem of local regions, where the local region features are aggregated into the learnable cluster centers, which are explicitly encoded with the spatial locations from the original 3D space.
We propose the spatial relationship aggregation to further utilize the spatial locations encoded in the feature clusters. Compared to the previous capsule network based methods, the spatial relationship aggregation can learn more discriminative spatial relationships between local regions by establishing a direct mapping between log priors and the spatial locations through feature clusters.
II Related Work
In this section, we mainly review the methods related to 3D shape representation learning based on deep learning networks. The existing methods can be roughly divided into four categories according to various 3D shape forms that are learned from, including voxels, point clouds, views and meshes.
II-A Point Cloud Based Methods
Recent studies of point cloud representation learning mainly focus on local feature extraction and integration. PointNet is the pioneering work that introduced deep learning into point cloud representation learning; it independently learns the features of each point and aggregates the learned features into a global feature with a max-pooling layer. After that, plenty of follow-up studies [23, 24, 49, 42] focused on how to better integrate the contextual information of local regions on point clouds. For example, PointNet++ designed a hierarchical feature learning architecture based on PointNet to encode multi-scale local areas. Following the convolutional structure of PointNet++, successors such as PointCNN and SpiderCNN investigated improved convolution operations which aggregate the neighbors of a given point by edge attributes in the local region graph. Different from the idea of using a convolutional structure, Point2Sequence introduced a sequential model (i.e. an RNN) to capture the fine-grained contextual information of features in local regions. Specifically, Point2Sequence arranges the features into a sequence according to the size of the region scale, and then uses an RNN to capture the contextual information within the local regions. However, most of the above methods fail to consider the spatial relationships among different local regions when aggregating the extracted local region features, since their usual practice is to use a pooling layer to learn the global feature from the local ones.
More recent studies focus on how to improve the local region feature extraction [19, 26, 54]. These methods have shown impressive potential in the semantic segmentation task on point clouds. For example, A-CNN was proposed to annularly arrange the neighbor points and apply a convolutional network on these arranged points to learn the local region features. RS-CNN designed a shape-aware convolution to learn the local region features from the relations between points.
The proposed Point2SpatialCapsule mainly focuses on how to aggregate the features and relationships of local regions after extracting local features. The usual practice of previous methods is to apply a bottom-to-top strategy for point cloud feature aggregation [18, 40, 22, 21, 31]. For example, Kd-Net performs multiplicative transformations according to the subdivisions of point clouds based on kd-trees. SO-Net employs a SOM to build the spatial distribution of the input point cloud, which allows hierarchical feature extraction on both individual points and SOM nodes. However, most of the above methods use max-pooling for feature aggregation, which inevitably filters out the spatial relationships among local regions. On the other hand, PVNet is also a notable method that considers local feature aggregation, focusing on mining the difference in importance between the local features. It employs high-level global features from the multi-view data of input 3D shapes to mine the relative correlations between different local features from the point cloud data. Like the above-mentioned methods, PVNet only learns the different contributions of local regions, while the spatial relationships among these regions are not considered.
II-B View-based Methods
The dominant performance of multi-view based methods on the task of 3D shape retrieval comes from the research progress in measuring the similarities between 2D image features [8, 3, 12, 11]. As one of the pioneering works, GIFT adopted the Hausdorff distance to measure the similarity between the view sets of two 3D shapes. Another notable research direction focuses on PANORAMA views of 3D shapes, where a PANORAMA view can be regarded as the seamless aggregation of multiple views captured on a circle. For example, DeepPano introduced a row-wise max-pooling to relieve the effect of rotation about the up-oriented direction, and Sfikas et al. introduced a CNN for learning the global features from the PANORAMA views in a consistent order. To explore the potential of the attention mechanism, methods like 3DViewGraph have been proposed to integrate the spatial pattern correlations of unordered views with attention weights, and Part4Features developed a novel multi-attention mechanism for aggregating the learned local parts.
More recently, SeqViews2SeqLabels was proposed to learn 3D features by aggregating sequential views with an RNN, which aims to eliminate the effect of rotation of 3D shapes. Compared with the previous pooling based methods, the RNN-based SeqViews2SeqLabels suffers less from the loss of content and spatial location. Similarly, as an unsupervised approach, VIP-GAN trains an RNN-based neural network architecture to solve multiple view inter-prediction tasks for each shape.
II-C Voxel-based Methods
For supervised learning of the representation of 3D voxels, 3DShapeNets adopted the convolutional restricted Boltzmann machine to learn the representation of 3D voxels. O-CNN learns the representation of 3D voxels based on a novel octree structure. Han et al. proposed a novel permutation voxelization strategy to learn high-level and hierarchical 3D local features from raw 3D voxels. For unsupervised learning, methods like VConv-DAE use a fully convolutional autoencoder to learn the voxel representation by reconstruction. However, considering the induced complexity and the limitations of directly exploiting the sparsity of voxel grids, it is difficult to introduce large-scale or flexible deep networks for representation learning. Therefore, more recent methods such as OctNet and Kd-Net utilize scalable indexing structures to solve this problem, so that deep neural networks can be further adopted to achieve more impressive results.
II-D Mesh-based Methods
As for mesh-based methods, to explore the effectiveness of the heat diffusion based descriptor, Xie et al. proposed a shape feature learning scheme based on auto-encoders, where the model can extract features that are insensitive to deformations. By fully utilizing the spectral domain, Xie et al. further proposed to learn a novel binary spectral shape descriptor with a deep neural network for 3D shape correspondence. Recently, BoSCC was introduced for a spatially enhanced 3D shape representation based on a bag of spatial context correlations. More recently, Deep Spatiality was also proposed to simultaneously learn 3D global and local features with a novel coupled softmax.
II-E Capsule Networks
The ability of the capsule network to capture spatial relationships comes from the dynamic routing algorithm and the log priors, which are bound to the absolute locations on the input feature maps. Specifically, the capsule network learns the log priors by considering the relationships between the absolute locations on the feature map and the high-level capsules. Then, through the dynamic routing algorithm, which is based on the learned log priors, the high-level capsules can integrate the low-level features and their spatial relationships among different locations on the feature maps. This advantage motivates us to consider applying the capsule network to 3D point cloud representation learning.
Capsule networks have been applied to 2D image recognition and natural language processing (NLP) [51, 45, 27]. However, as for the application of the capsule network to 3D shape representation learning, only a few methods have been proposed in recent years. For example, 3D-CapsNet adopts the capsule network for 3D shape classification tasks based on volumetric data, and 3D-Point-Capsule learns the point cloud representation and part segmentation in an unsupervised way. For supervised learning, 3DCapsule applies the capsule network as an extension of fully-connected layers for point cloud classification.
An important problem of the above methods is that they all build the capsule layers over the global feature of point clouds (usually produced by a fully-connected layer or max-pooling), where the spatial relationships between local region features have already been filtered out by the network. Therefore, the log priors in the routing algorithm cannot learn the spatial distribution of the extracted local features, which limits the biggest advantage of the capsule network, namely aggregating the spatial relationships of local regions.
Therefore, to address this problem of previous methods, Point2SpatialCapsule aggregates the features into clusters in the feature space, and applies the routing algorithm between these aggregated clusters. In the research of point cloud representation learning, methods like PointNetVLAD have adopted a similar clustering strategy, i.e. NetVLAD, for feature aggregation. However, different from the previous methods that only cluster features to aggregate regions with similar geometric characteristics (e.g. shapes), our method goes one step further by not only considering geometric characteristics, but also exploring the potential of aggregating the spatial relationships between these regions. Specifically, Point2SpatialCapsule produces clusters for both the features and their coordinates, in order to explicitly preserve the features and their spatial locations.
III Shape Representation Learning with Point2SpatialCapsule
An overview of shape representation learning network with Point2SpatialCapsule is shown in Fig. 3. The whole network consists of three main parts as follows. (1) The first part is the multi-scale local feature extraction, which is a PointNet++ based network for extracting the features from multi-scale local regions on point clouds (see Sec. III-A). (2) The second part is Point2SpatialCapsule, which is composed of two main modules for aggregating the learned features into the global shape representation. Here, the first module, i.e. geometric feature aggregation, is to aggregate local region features into clusters (see Sec. III-B). The second module, i.e. spatial relationship aggregation, is to aggregate the feature clusters and their spatial relationships into global feature representation (see Sec. III-C). In this section, we will also detail the training procedure of Point2SpatialCapsule (see Sec. III-D). (3) The third part is the task oriented network used for various tasks such as shape segmentation (see Sec. III-E).
III-A Multi-scale Local Feature Extraction
The first part of our network is the multi-scale local feature extraction, as shown in Fig. 3(a). Given a set of $N$ input points, by following the practice of PointNet++ and ShapeContextNet, we iteratively subsample $N'$ points as the centroids of the local regions using farthest point sampling (FPS), such that each newly added point is the farthest point (in metric distance) from the points already sampled. Then, for each sampled point, $k$-nearest-neighbor (kNN) search is employed to find $K_t$ neighbors for this point at each of $T$ scales, $t = 1, \dots, T$. In a grouping layer, each sampled point and its neighbors are grouped into one tensor for scale $t$. After that, a simple but effective MLP layer is employed to extract the features of all neighbor points, producing a $D$-dimensional feature for each of the $K_t$ neighbors of each sampled point. Finally, a max-pooling layer is applied to integrate the point features in each scale to produce the scale feature of dimension $D$ for scale $t$. For $N'$ points in total and $T$ scales for each point, the multi-scale local feature extraction layer produces $N' \times T$ multi-scale features, forming a tensor of shape $N' \times T \times D$ as its output.
In the implementation, we apply two layers of multi-scale local feature extraction for hierarchically extracting features from point clouds.
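The sampling and grouping steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the fixed starting index, and the brute-force neighbor search are all hypothetical choices made for readability.

```python
import math

def farthest_point_sampling(points, num_samples, start=0):
    """Greedily pick indices so each new point is farthest from those chosen."""
    chosen = [start]
    # Distance from every point to its nearest already-chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < num_samples:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen

def knn(points, centroid, k):
    """Indices of the k points nearest to centroid (brute force)."""
    order = sorted(range(len(points)),
                   key=lambda i: math.dist(points[i], centroid))
    return order[:k]

def multi_scale_groups(points, num_centroids, scales):
    """For each FPS centroid, collect one kNN neighborhood per scale."""
    centroids = farthest_point_sampling(points, num_centroids)
    return [[knn(points, points[c], k) for k in scales] for c in centroids]

# Toy usage: four collinear points, two centroids, two neighborhood sizes.
toy_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 0.0, 0.0)]
toy_groups = multi_scale_groups(toy_points, num_centroids=2, scales=[1, 2])
```

In a full pipeline each neighborhood would then pass through a shared MLP and a max-pooling step to yield one feature per (centroid, scale) pair; those stages are omitted here.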
III-B Point2SpatialCapsule: Geometric Feature Aggregation
In this subsection, we detail the first module of Point2SpatialCapsule, which aims to aggregate the extracted features into clusters and encodes these features with spatial locations.
As shown in Fig. 3(b), before clustering features, the module of geometric feature aggregation first applies the multi-scale shuffling to enhance the diversity of features. Then the features are aggregated into clusters and encoded with the spatial locations (e.g. the absolute locations in the 3D space) from the original 3D space.
III-B1 Multi-scale Shuffling
Different from the previous methods that apply the pooling-based strategy for integrating the features extracted from multi-scale regions, we propose the multi-scale shuffling layer to build the shuffled features. The reason for adding this layer is demonstrated in Fig. 4, as explained below. When searching the neighbor points in a large scale, the searching areas of two adjacent centroids will overlap with each other and output the same neighbor points. As a result, the adjacent points will tend to have the similar features for large scale, which can reduce the diversity of features and introduce an initial clustering center for the subsequent clustering layer. On the other hand, the features of small scales between two centroids are dissimilar because of small overlaps. Therefore, the multi-scale shuffling is introduced to smooth the perceived range of features between different scales and enhance the feature diversity, by mixing the dissimilar features of small scales with the similar features of large scales. As a result, the multi-scale shuffling can promote the network to consider all input features equally and alleviate the problem of similar features.
The effect of multi-scale shuffling is shown in Fig. 5. Specifically, given a point with $T$ scale features of dimension $D$, which form a tensor of shape $T \times D$, the multi-scale shuffling periodically rearranges the elements of the tensor into a tensor of shape $rT \times D/r$, where $r$ is an integer. Thus, for $N'$ points in total, the multi-scale shuffling produces $N' \times rT$ shuffled features of dimension $D/r$, resulting in a tensor of size $N' \times rT \times D/r$.
The multi-scale shuffling is inspired by the subpixel convolution for image upsampling, where the number of area scales can be considered as the size of the image, and the feature dimension can be regarded as the channels of the feature maps. However, different from subpixel convolution, which is designed for speeding up the calculations and reducing the number of parameters in the network, the multi-scale shuffling used in our method aims to enhance the diversity of scale features. We will quantitatively explore the importance of the multi-scale shuffling in the ablation studies in Sec. IV-E.
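A periodic rearrangement in the spirit of subpixel shuffling can be sketched as follows. The exact element order used by the authors is not recoverable from the text, so the mapping below (a one-dimensional analogue of pixel shuffle) is an assumption, as are the function name `multi_scale_shuffle` and the factor name `r`. The sketch only demonstrates the change of shape: more, shorter features per point.

```python
def multi_scale_shuffle(scale_features, r):
    """Rearrange a T x D list of lists into an (r*T) x (D//r) list of lists.

    Assumed mapping: output row t*r + j holds channels j, j+r, j+2r, ...
    of input row t (a 1D analogue of subpixel / pixel shuffle).
    """
    dim = len(scale_features[0])
    if dim % r != 0:
        raise ValueError("feature dimension must be divisible by r")
    shuffled = []
    for row in scale_features:
        for j in range(r):
            # Every r-th channel, starting at offset j, forms one new row.
            shuffled.append(row[j::r])
    return shuffled

# Toy usage: T = 2 scales, D = 4 channels, r = 2.
toy_features = [[1, 2, 3, 4], [5, 6, 7, 8]]
toy_shuffled = multi_scale_shuffle(toy_features, r=2)
```

With two scale features of dimension four and a factor of two, the result has four features of dimension two, so the subsequent clustering layer sees twice as many, shorter inputs per point.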
III-B2 Feature Aggregation with Spatial Encodings
The purpose of this layer is to aggregate the shuffled features into the learnable feature cluster centers, which can be regarded as the latent embeddings describing the semantic patterns of the local regions features. To achieve this purpose, we propose to cluster the features in the feature space and their coordinates in the original 3D space. After that, the cluster centers in both the feature space and the 3D space are fused to produce the feature-spatial embeddings, as illustrated in Fig. 6(a).
Although traditional clustering methods like k-means can be adopted to produce the feature cluster centers, their computational cost may be very high because of the huge number of features to be clustered. Therefore, inspired by the recent development of NetVLAD, we adopt soft-assignment for learning the cluster centers of the input shuffled local features. Specifically, the network learns $K$ cluster centers for the input features, denoted as $\{c_1, \dots, c_K\}$, as colored in yellow in Fig. 6(a). For each cluster center $c_k$, the layer produces a feature embedding $f_k$, which is an aggregated representation over the whole set of input shuffled features $\{x_i\}$, denoted by
$$f_k = \sum_{i} \frac{e^{w_k^{\top} x_i + b_k}}{\sum_{k'} e^{w_{k'}^{\top} x_i + b_{k'}}} \, (x_i - c_k),$$
where $w_k$ and $b_k$ are the weights and biases, respectively, that determine the contribution of each local feature $x_i$ to the cluster center $c_k$. During training, all the weights, biases and cluster centers are updated through the back-propagation algorithm.
To explicitly encode the spatial locations of local features into their cluster centers, we first cluster the coordinates of the input points into coordinate cluster centers, which follows the same process as described above and is colored in green in Fig. 6(a), yielding one spatial embedding per cluster.
Then, each produced local feature embedding $f_k$ and its corresponding spatial embedding are concatenated to form an explicit feature-spatial embedding.
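The soft-assignment aggregation described above can be sketched in plain Python, following the standard NetVLAD form (softmax assignment weights multiplied by residuals to each center). All names here are hypothetical, and the parameters would be learned by back-propagation in a real network rather than passed in by hand. How the coordinate clustering is parameterized is not specified in the text, so `fuse` simply concatenates two embeddings produced separately.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def soft_assign_aggregate(features, centers, weights, biases):
    """NetVLAD-style aggregation: one residual-sum embedding per center."""
    dim = len(centers[0])
    embeddings = [[0.0] * dim for _ in centers]
    for x in features:
        # Assignment score of feature x to every cluster.
        scores = [sum(w_d * x_d for w_d, x_d in zip(w, x)) + b
                  for w, b in zip(weights, biases)]
        assign = softmax(scores)
        for k, center in enumerate(centers):
            for d in range(dim):
                embeddings[k][d] += assign[k] * (x[d] - center[d])
    return embeddings

def fuse(feature_embedding, spatial_embedding):
    """Concatenate a feature embedding with its spatial embedding."""
    return list(feature_embedding) + list(spatial_embedding)

# Toy usage with zero weights, so every feature is split evenly between clusters.
toy_embeddings = soft_assign_aggregate(
    features=[[1.0, 0.0], [0.0, 1.0]],
    centers=[[0.0, 0.0], [1.0, 1.0]],
    weights=[[0.0, 0.0], [0.0, 0.0]],
    biases=[0.0, 0.0],
)
```

Because the output has one embedding per learned center rather than one per input point, its layout does not depend on the order of the input features.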
III-C Point2SpatialCapsule: Spatial Relationship Aggregation
In Fig. 6(b), we show the overall architecture of previous methods [55, 1] for building the capsules, and compare it with our proposed Point2SpatialCapsule shown in Fig. 6(a). The main difference is that Point2SpatialCapsule builds the spatial-aware capsules based on cluster centers with spatial encodings, while the previous studies simply build the capsules based on a single representation vector generated by a fully-connected or pooling based local feature aggregator. As a result, the previous methods fail to preserve the spatial relationships between local regions, which further limits the representation learning ability of dynamic routing.
In this subsection, in order to efficiently learn the log priors, we first independently generate the spatial-aware capsules from the feature-spatial embeddings using rearrange and squashing. Then, we propose to apply the routing algorithm between the spatial-aware capsules.
III-C1 Rearrange and Squashing
To build the spatial-aware capsules from the feature-spatial embeddings produced by the geometric feature aggregation module, we deviate from the 2D practice of the original capsule network. In the original capsule network, a capsule aggregates its representation vector by collecting the output logits across different channels at the same location on the feature maps. In Point2SpatialCapsule, since we have built the feature-spatial embeddings encoded with the spatial locations, we can consider that each embedding corresponds to a fixed location, which is the learnable cluster center. Therefore, we can directly rearrange the output and use a fully-connected layer with a squashing activation to produce the spatial-aware capsules. The rearrange layer splits each feature-spatial embedding into several short vectors. As shown in Fig. 7, the input feature-spatial embedding is split into short vectors, each of which is combined with the spatial embedding. Then, a squashing layer follows, denoted by
$$u_i = \frac{\|s_i\|^2}{1 + \|s_i\|^2} \, \frac{s_i}{\|s_i\|},$$
where $s_i$ is the $i$th short vector and the spatial-aware capsules $u_i$ are generated as the final output of this layer.
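The rearrange and squashing steps can be sketched as follows. This is a simplified illustration: the squashing function is the standard one from the original capsule network, while the helper names are hypothetical and the fully-connected layer and the combination with the spatial embedding mentioned above are omitted.

```python
import math

def rearrange(embedding, num_parts):
    """Split one embedding into num_parts equal-length short vectors."""
    size = len(embedding) // num_parts
    return [embedding[i * size:(i + 1) * size] for i in range(num_parts)]

def squash(vector, eps=1e-9):
    """Shrink a vector to length in [0, 1) while keeping its direction."""
    sq_norm = sum(v * v for v in vector)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm + eps)
    return [scale * v for v in vector]

def to_capsules(embedding, num_parts):
    """Rearrange an embedding and squash each short vector into a capsule."""
    return [squash(part) for part in rearrange(embedding, num_parts)]

# Toy usage: an 8-dimensional embedding split into four 2-dimensional capsules.
toy_capsules = to_capsules([3.0, 4.0, 0.0, 0.0, 1.0, 0.0, 0.0, 2.0], num_parts=4)
```

Squashing maps long vectors to a length close to one and short vectors to a length close to zero, which is what allows a capsule's length to be read as a probability later on.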
III-C2 Routing Algorithm
Given the input spatial-aware capsules, we follow the original capsule network and apply the dynamic routing algorithm to obtain the digit capsules. Specifically, the digit capsule $v_j$ is the output of the weighted sum of the prediction vectors $\hat{u}_{j|i}$ followed by the squashing layer, which can be formulated as
$$v_j = \mathrm{squash}\Big(\sum_{i} c_{ij}\, \hat{u}_{j|i}\Big), \qquad \hat{u}_{j|i} = W_{ij} u_i,$$
where $u_i$ is the $i$th spatial-aware capsule and $W_{ij}$ is a learnable matrix. The coupling coefficients $c_{ij}$ are determined by the iterative dynamic routing process, denoted by
$$c_{ij} = \frac{\exp(b_{ij})}{\sum_{k} \exp(b_{ik})}.$$
In the 2D capsule network, the $b_{ij}$ are log priors that only depend on the fixed locations and the types of the two capsules. In our network, because the disordered input features are clustered into the feature-spatial embeddings by soft-assignment, these features are bound to fixed locations (which are the cluster centers) in the feature space. Therefore, the dynamic routing can learn the log priors between these centers and the digit capsules.
Before training, all of the log priors $b_{ij}$ are initialized to zero. During training, the $b_{ij}$ are learned discriminatively at the same time as the other parameters in the network, by adding the scalar product of $\hat{u}_{j|i}$ and $v_j$, i.e.
$$b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j.$$
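A minimal sketch of routing by agreement, in the standard form introduced with the original capsule network, is given below. It assumes the prediction vectors have already been computed by the learnable matrices; the function names and the choice of three iterations are assumptions for illustration and are not stated in the text.

```python
import math

def squash(vector, eps=1e-9):
    """Shrink a vector to length in [0, 1) while keeping its direction."""
    sq_norm = sum(v * v for v in vector)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm + eps)
    return [scale * v for v in vector]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_routing(predictions, iterations=3):
    """predictions[i][j] is the vector predicted by input capsule i for output j."""
    num_in, num_out = len(predictions), len(predictions[0])
    dim = len(predictions[0][0])
    logits = [[0.0] * num_out for _ in range(num_in)]  # log priors start at zero
    for _ in range(iterations):
        # Coupling coefficients: softmax of each input capsule's logits.
        coupling = [softmax(row) for row in logits]
        outputs = []
        for j in range(num_out):
            weighted = [sum(coupling[i][j] * predictions[i][j][d]
                            for i in range(num_in)) for d in range(dim)]
            outputs.append(squash(weighted))
        # Agreement update: add the scalar product of prediction and output.
        for i in range(num_in):
            for j in range(num_out):
                logits[i][j] += sum(p * v for p, v in
                                    zip(predictions[i][j], outputs[j]))
    return outputs, coupling

# Toy usage: two input capsules that both predict output 0 strongly.
toy_predictions = [[[1.0, 0.0], [0.0, 0.1]], [[1.0, 0.0], [0.0, 0.1]]]
toy_outputs, toy_coupling = dynamic_routing(toy_predictions)
```

In the toy case the two inputs agree on output capsule 0, so the coupling toward it grows with every iteration while the coupling toward output 1 shrinks.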
III-D Point2SpatialCapsule: Training
Following the practice of the original capsule network, Point2SpatialCapsule uses the reconstruction loss and the classification loss for supervised point cloud representation learning.
The length of each digit capsule indicates the probability that the characteristic represented by this capsule exists in the input point cloud. During training, the margin loss is adopted for shape classification, defined as
$$L_k = T_k \max(0, m^{+} - \|v_k\|)^2 + \lambda\, (1 - T_k) \max(0, \|v_k\| - m^{-})^2,$$
where $T_k = 1$ if class $k$ is the true label; otherwise, $T_k = 0$. $m^{+}$, $m^{-}$ and $\lambda$ are the hyperparameters.
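The margin loss can be sketched in a few lines of Python. The default values below (0.9, 0.1 and 0.5) are the ones commonly used with the original capsule network and are an assumption here, since the values used in this paper are not recoverable from the text; the function name is likewise hypothetical.

```python
import math

def margin_loss(capsules, true_class, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Sum of per-class margin losses over the digit capsules.

    The true class is penalized when its capsule is shorter than m_plus;
    every other class is penalized when its capsule is longer than m_minus.
    """
    total = 0.0
    for k, capsule in enumerate(capsules):
        length = math.sqrt(sum(v * v for v in capsule))
        if k == true_class:
            total += max(0.0, m_plus - length) ** 2
        else:
            total += lam * max(0.0, length - m_minus) ** 2
    return total

# Toy usage: one long capsule and one zero capsule.
confident = [[0.6, 0.8], [0.0, 0.0]]
loss_correct = margin_loss(confident, true_class=0)
loss_wrong = margin_loss(confident, true_class=1)
```

A long capsule for the true class and short capsules elsewhere give zero loss, while a confident wrong prediction is penalized by both terms.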
We further reconstruct the input point clouds using four fully-connected layers, with each layer followed by an activation function and batch normalization except for the last layer. The digit capsule corresponding to the true label is used as the input representation vector of the reconstruction network. The chamfer loss between the original point cloud and the reconstructed point cloud is adopted as the reconstruction loss.
The total loss for training is the weighted sum of the margin loss and the reconstruction loss, with the same weighting used for all the experiments in this paper.
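The reconstruction term and its combination with the margin loss can be sketched as below. Several variants of the chamfer distance are in use; this one (squared Euclidean distances, averaged per cloud, summed over both directions) is an assumption, as are the function names and the convention of placing the weight on the reconstruction term.

```python
import math

def chamfer_distance(cloud_a, cloud_b):
    """Symmetric chamfer distance with squared distances (one common variant)."""
    def one_way(src, dst):
        # Mean squared distance from each source point to its nearest target.
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)
    return one_way(cloud_a, cloud_b) + one_way(cloud_b, cloud_a)

def total_loss(margin, reconstruction, weight):
    """Weighted sum of the two losses; weight placement is an assumption."""
    return margin + weight * reconstruction

# Toy usage: a one-point cloud against a two-point cloud.
cloud_a = [(0.0, 0.0, 0.0)]
cloud_b = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
toy_chamfer = chamfer_distance(cloud_a, cloud_b)
```

The distance is zero only when every point in each cloud has an exact counterpart in the other, which makes it a natural measure of how well the decoder reproduces the input shape without requiring any point ordering.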
III-E Model Adjustments for Part Segmentation
The goal of part segmentation is to predict a semantic label for each point in the point cloud. There are two alternative ways of acquiring the per-point feature for each point from the global feature: duplicating the global feature once for every point [28, 42], or performing upsampling by interpolation [29, 22]. In this paper, we follow the second way: we duplicate the vectors in the digit capsule belonging to the true label and concatenate the duplicated vectors with the shuffled features. The interpolation layers are then used for propagating the features from shape level to point level by upsampling, as shown in Fig. 8.
IV-A Experimental Setup
The 3D shape classification and retrieval experiments are conducted on two subsets of the Princeton ModelNet dataset, i.e. ModelNet40 and ModelNet10. The ModelNet40 dataset contains 12,311 shapes which belong to 40 categories. We follow the same training and testing split as previous work, which contains 9,843 shapes for training and 2,468 shapes for testing. The ModelNet10 dataset is a relatively small dataset which contains 10 common categories of ModelNet40. Following previous work, we split ModelNet10 into 3,991 training samples and 909 testing samples. Since the original ModelNet provides CAD models represented by vertices and faces, we use the prepared ModelNet10/40 data from previous work for fair comparison. The part segmentation task is conducted on the ShapeNet part dataset, which contains 16,881 models from 16 categories and is split into training, validation and testing sets following PointNet++. There are 2048 points sampled for each 3D shape, where each point in a point cloud object belongs to one of 50 part classes and each point cloud contains 2 to 5 parts.
IV-A2 Classification and Retrieval Settings
The length of the representation vector in a digit capsule indicates the probability that a certain characteristic exists in the input point cloud. In the case of Point2SpatialCapsule, the characteristic of a digit capsule is the class label. Thus, we choose the digit capsule with the largest length as the predicted label for shape classification. For the shape retrieval task, we use the Euclidean distances between the length vectors of point clouds for similarity measurement. Such a similarity measurement is in accordance with the way a capsule stores information. Moreover, a direct comparison between the length vectors requires less computational cost than comparing the representation vectors in capsules.
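The classification and retrieval rules described above can be sketched as follows. The function names and the toy capsule values are hypothetical; the sketch only assumes that each shape is represented by one digit capsule per class.

```python
import math

def length_vector(digit_capsules):
    """One length per digit capsule, i.e. one score per class."""
    return [math.sqrt(sum(v * v for v in capsule)) for capsule in digit_capsules]

def predict_label(digit_capsules):
    """Index of the longest digit capsule."""
    lengths = length_vector(digit_capsules)
    return max(range(len(lengths)), key=lambda k: lengths[k])

def retrieve(query_capsules, gallery):
    """Rank gallery shapes by Euclidean distance between length vectors."""
    query = length_vector(query_capsules)
    dists = [math.dist(query, length_vector(item)) for item in gallery]
    return sorted(range(len(gallery)), key=lambda i: dists[i])

# Toy usage: a query whose second capsule is the longest, and two gallery shapes.
toy_query = [[0.1, 0.0], [0.0, 0.9]]
toy_gallery = [[[0.9, 0.0], [0.0, 0.1]], [[0.1, 0.0], [0.0, 0.8]]]
toy_label = predict_label(toy_query)
toy_ranking = retrieve(toy_query, toy_gallery)
```

Each shape is compared through a vector with one entry per class rather than through the full capsule contents, which keeps the pairwise comparison cheap.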
IV-A3 Implementation Details
In this paper, we use two multi-scale local feature extraction layers for hierarchically extracting features from point clouds. For the first feature extractor, the input is 1024 points associated with their x, y and z coordinates, from which 512 points are sampled using farthest point sampling. For each sampled point, we select nearest neighbor points of four scales. The MLPs used in the first block have units for each layer. The second feature extractor samples 256 points out of the 512 points. The number of points for kNN search is the same as in the first block. The MLPs for the second block have the units of for each layer. The parameter for multi-scale shuffling is 2. The number of the cluster centers is and the dimension is . In the rearrange and squashing layer, we split each embedding into 16 16-dimensional short vectors, which form 1024 16-dimensional spatial-aware capsules in total.
IV-B 3D Shape Classification
Table I compares Point2SpatialCapsule with the existing state-of-the-art methods of point cloud representation learning in terms of shape classification accuracy under ModelNet10 and ModelNet40, respectively. For fair comparison, all the results in Table I are obtained with the same type of input, namely raw point sets. Point2SpatialCapsule achieves a superior result () under ModelNet40, which is higher than the baseline method PointNet++ by . In particular, Point2SpatialCapsule with additional normal vectors achieves the best results ( and ), compared with the best additional-input method SO-Net  ( and ), under ModelNet10 and ModelNet40, respectively.
We note that both PointNet++ and Point2SpatialCapsule use a multi-scale local feature extraction strategy; the difference lies in the method used for aggregating local features. PointNet++ applies max-pooling to aggregate the local features, while Point2SpatialCapsule combines geometric feature aggregation with spatial relationship aggregation to learn the global representation. Therefore, the improvement in classification accuracy of Point2SpatialCapsule supports the effectiveness of the proposed network for local feature aggregation.
3DCapsule  is the work most related to our Point2SpatialCapsule in Table I. As already discussed in Sec. II, 3DCapsule simply applies the capsule network on the global features produced by a pooling/fully-connected layer, which loses the information of spatial locations.
In contrast, our Point2SpatialCapsule applies dynamic routing on the feature-spatial embeddings generated by the geometric feature aggregation module, which can aggregate both the features and their spatial locations. The experimental results in Table I show that the capsule network implementation in our Point2SpatialCapsule is more effective than that of 3DCapsule.
Compared with PointCNN  and DGCNN , Point2SpatialCapsule still achieves the best results. We note that PointCNN and DGCNN are also CNN-based neural networks, which aim to preserve the spatial locations and spatial relationships of local regions. However, both of them use max-pooling to aggregate the local region features, which filters out the spatial locations and relationships, especially when aggregating the local features into the global features. As shown in Table I, the proposed Point2SpatialCapsule yields better performance than PointCNN and DGCNN, which demonstrates the advantage of Point2SpatialCapsule in preserving the spatial locations and relationships.
As seen in Table I, our Point2SpatialCapsule outperforms most of the xyz-input methods on point clouds. Specifically, our result ranks first under ModelNet10 () and second under ModelNet40 (), which is slightly lower than RS-CNN  by . As claimed in , RS-CNN performed “ten voting tests with random scaling and averages the predictions” during testing. In contrast, we only apply single model prediction for fair comparison with most of the existing methods [22, 1, 18]. Moreover, when using additional normal vectors as the input, the proposed Point2SpatialCapsule achieves the best performance among all reported results under ModelNet10 () and ModelNet40 (), respectively. This supports the effectiveness of Point2SpatialCapsule.
|Method|Input|ModelNet10|ModelNet40|
|---|---|---|---|
|PointNet++|+ norm|-|91.9|
|SO-Net|+ norm|95.7|93.4|
IV-C 3D Shape Retrieval
In Table II, we compare the proposed Point2SpatialCapsule with counterpart methods on the 3D shape retrieval task, in terms of mean average precision (mAP). Since most of the methods focusing on 3D shape retrieval are based on multi-views of 3D models, in this subsection we also quote the experimental results of the multi-view based methods to verify the effectiveness of Point2SpatialCapsule. Note that the results of PointNet and PointNet++ are obtained by following the same training procedure as described in their original papers, which are denoted by in this table.
As shown in Table II, our method achieves retrieval accuracy comparable to that of multi-view based methods on both ModelNet10 and ModelNet40. Specifically, Point2SpatialCapsule achieves the best retrieval accuracy on ModelNet40 among all reported retrieval results. Point2SpatialCapsule achieves the second-best result () on ModelNet10, which is slightly lower than SPNet  by . However, Point2SpatialCapsule still beats SPNet by on ModelNet40 in terms of mAP, which shows a more balanced performance of Point2SpatialCapsule over datasets of different scales. The comparison of precision-recall (PR) curves under ModelNet40 and ModelNet10 is shown in Fig. 9, where Point2SpatialCapsule shows high performance on the 3D shape retrieval task.
The better performance of Point2SpatialCapsule can be attributed to the following two reasons. First, Point2SpatialCapsule is able to learn to encode the spatial locations of local features, which produces a more discriminative representation for point clouds. Second, the digit capsules provide a more interpretable feature for representing the point clouds, namely the length vector. Compared with the traditional single vector representations, in which the high-level characteristics are implicitly encoded in the latent feature space, the length of a digit capsule explicitly indicates the probability that the characteristic appears in the point cloud. Therefore, using the distance between length vectors of digit capsules is more effective and interpretable for 3D shape retrieval.
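For reference, the mAP metric used in this subsection can be computed along the following lines. This is a generic sketch: it assumes each shape is used as a query against all other shapes and that `features` holds one descriptor per shape (for example the length vector); the exact evaluation protocol behind Table II may differ.

```python
import numpy as np

def average_precision(ranked_labels, query_label):
    """Average precision of one ranked list (best match first)."""
    relevant = np.asarray(ranked_labels) == query_label
    if not relevant.any():
        return 0.0
    hits = np.cumsum(relevant)                   # relevant items seen so far
    ranks = np.arange(1, len(relevant) + 1)
    # Precision measured at the rank of every relevant item.
    return float((hits[relevant] / ranks[relevant]).mean())

def mean_average_precision(features, labels):
    """mAP where every shape queries all the other shapes."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    aps = []
    for i in range(len(labels)):
        dists = np.linalg.norm(features - features[i], axis=1)
        order = np.argsort(dists, kind="stable")
        order = order[order != i]                # drop the query itself
        aps.append(average_precision(labels[order], labels[i]))
    return float(np.mean(aps))
```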
IV-D 3D Shape Part Segmentation
In Table III, we also report the performance of Point2SpatialCapsule on the part segmentation task in terms of the Intersection over Union (IoU) . As shown in Table III, our Point2SpatialCapsule achieves the mean instance IoU of , which outperforms the baseline method PointNet++ on 13 out of 16 categories. Note that, like PointNet++, Point2SpatialCapsule also employs the multi-scale sampling and grouping strategy for local feature extraction. Therefore, the experimental results suggest that Point2SpatialCapsule improves the quality of local feature extraction and leads to better performance on the segmentation task. Fig. 10 visualizes some examples of our segmentation results, which are highly consistent with the ground truth.
Note that segmentation requires discriminative features of local regions. Although Point2SpatialCapsule is designed to learn global shape features by encoding the spatial locations of local regions, rather than to produce more discriminative local region features like RS-CNN  and PointCNN , we still achieve comparable segmentation results.
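The mean instance IoU reported above can be computed roughly as in the sketch below. It assumes the convention of the publicly available PointNet/PointNet++ evaluation code, in which a part that is absent from both prediction and ground truth contributes an IoU of 1; whether Table III follows exactly this convention is an assumption.

```python
import numpy as np

def shape_iou(pred, target, part_ids):
    """Mean IoU over the parts valid for this shape's category.

    pred, target: (num_points,) integer part labels of one shape.
    part_ids: part labels that belong to the shape's category.
    """
    ious = []
    for part in part_ids:
        p, t = pred == part, target == part
        union = np.logical_or(p, t).sum()
        inter = np.logical_and(p, t).sum()
        # Assumed convention: a part missing from both counts as IoU = 1.
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))

def mean_instance_iou(preds, targets, part_ids_per_shape):
    """Average of shape_iou over all shapes (instances)."""
    return float(np.mean([shape_iou(p, t, ids)
                          for p, t, ids in zip(preds, targets, part_ids_per_shape)]))
```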
IV-E Ablation Studies
In this section, we keep the settings of the network the same as described in Sec. III, except for the part under ablation. We first investigate the influence of each part on our model, and then analyze three important hyper-parameters in terms of classification accuracy on ModelNet40.
IV-E1 The Influence of Each Part to Point2SpatialCapsule
In order to investigate the effect of each part in Point2SpatialCapsule, we develop and evaluate three different variations of our model as follows. (1) ‘No-Multi’ is the model without multi-scale shuffling, where the output of the multi-scale feature extractor is the direct input to the soft-assignment layer. (2) ‘No-VLAD’ is the model without the geometric feature aggregation, where the output of multi-scale feature extraction layer is directly reshaped as spatial-aware capsules and input to the dynamic routing layer. (3) ‘No-Caps’ is the model without the capsule net, where the output of geometric feature aggregation module is concatenated as a single vector and directly fed into the fully-connected layer for shape classification.
The experimental results are shown in Table IV. The results show that each part of Point2SpatialCapsule contributes to the model performance. We note that the No-VLAD model achieves the worst performance among the four models, which means that directly applying the capsule network on the point cloud impairs the model’s representational ability. The result of the No-VLAD model supports our view that dynamic routing cannot learn the log priors directly from disordered point clouds, and verifies the effectiveness of the proposed geometric feature aggregation. The result of No-Multi shows the importance of applying multi-scale shuffling for smoothing the perceived range between features of different scales. The significant improvement of the Full-Model over No-Caps verifies the advantage of the capsule network for aggregating local features in point cloud recognition.
IV-E2 The Analysis of Capsule Network
Following the common practice of , we investigate the influence of iterations in dynamic routing. As shown in Table V, we report the model performance with 1, 3 and 5 iterations of dynamic routing. According to , multiple iterations will increase the model’s learning ability but may also cause the problem of overfitting. As for Point2SpatialCapsule, we find that dynamic routing with 1 iteration is already enough for learning the point cloud features.
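To make the role of the iteration count concrete, the sketch below implements routing-by-agreement as formulated in the original capsule network paper, in NumPy. The array sizes used in any call are hypothetical, and the way Point2SpatialCapsule produces the prediction vectors `u_hat` is not shown.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Squashing non-linearity: keeps direction, maps length into [0, 1)."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iterations=1):
    """Routing-by-agreement (num_iterations must be at least 1).

    u_hat: prediction vectors of shape (num_in, num_out, dim_out), the
           vote of each input capsule for each output capsule.
    Returns output capsules of shape (num_out, dim_out).
    """
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))                    # routing logits (log priors)
    for it in range(num_iterations):
        # Coupling coefficients: softmax of the logits over output capsules.
        c = np.exp(b - b.max(axis=1, keepdims=True))
        c /= c.sum(axis=1, keepdims=True)
        s = np.einsum("ij,ijd->jd", c, u_hat)          # weighted sum of votes
        v = squash(s)
        if it < num_iterations - 1:
            b = b + np.einsum("ijd,jd->ij", u_hat, v)  # agreement update
    return v
```

With a single iteration the coupling coefficients stay uniform, so the output is simply the squashed sum of votes scaled by `1 / num_out`; additional iterations reweight the votes according to their agreement with the current output.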
IV-E3 The Analysis of Geometric Feature Aggregation
We also analyze the influence of the number of cluster centers in NetVLAD. As shown in Table VI, the model achieves the best result with 64 cluster centers. The explanation is two-fold: (1) a small number of cluster centers reduces the representational ability of the feature embeddings; (2) the slight drop in performance with a large number of cluster centers results from producing similar feature embeddings, which leads to information redundancy and hinders the model from learning more discriminative local features.
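The role of the cluster centers can be seen in a compact sketch of NetVLAD-style soft-assignment aggregation. This follows the standard NetVLAD formulation and is not the paper's exact module; the parameter names `weights` and `biases` are illustrative.

```python
import numpy as np

def netvlad_aggregate(features, centers, weights, biases):
    """Soft-assignment residual aggregation in the style of NetVLAD.

    features: (N, D) local region features.
    centers:  (K, D) cluster centers.
    weights:  (D, K) and biases: (K,) parametrize the soft assignment.
    Returns (K, D) embeddings, one per cluster, each L2-normalized.
    """
    logits = features @ weights + biases            # (N, K)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)     # softmax over clusters
    # vlad[k] = sum_i assign[i, k] * (features[i] - centers[k])
    vlad = assign.T @ features - assign.sum(axis=0)[:, None] * centers
    norms = np.linalg.norm(vlad, axis=1, keepdims=True)
    return vlad / np.maximum(norms, 1e-12)
```

Each of the K rows aggregates residuals of all local features with respect to one center, so the output has a fixed size and does not depend on the order of the input features. K is the number of cluster centers varied in Table VI.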
IV-E4 The Analysis of Reconstruction Loss
In Table VII, we discuss the influence of reconstruction loss, where is the weight factor as specified in Eq. (10). From the results, we find that a large leads to a decrease in model performance, which in our opinion results from a slower learning process caused by the large reconstruction loss weight, especially during the early stage of training. On the other hand, the experimental results also show that the reconstruction loss is useful. Compared with a small weight () and the model without reconstruction (), the model with outperforms them by 0.65% and 0.98%, respectively. In Fig. 11, we visualize the reconstruction results on the test set of ModelNet40, which show that Point2SpatialCapsule can learn to produce relatively satisfactory results despite the simple reconstruction network employed in the model.
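The trade-off controlled by the weight factor can be sketched as a weighted sum of a classification loss and a reconstruction loss. The sketch below assumes the margin loss of the original capsule network paper and a Chamfer distance as the reconstruction term; both choices, and the name `recon_weight`, are assumptions for illustration and may differ from Eq. (10).

```python
import numpy as np

def margin_loss(lengths, labels, m_pos=0.9, m_neg=0.1, down_weight=0.5):
    """Capsule margin loss; lengths: (B, C) capsule lengths, labels: (B,)."""
    onehot = np.eye(lengths.shape[1])[labels]
    pos = onehot * np.maximum(0.0, m_pos - lengths) ** 2
    neg = down_weight * (1.0 - onehot) * np.maximum(0.0, lengths - m_neg) ** 2
    return float(np.mean(np.sum(pos + neg, axis=1)))

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a: (N, 3), b: (M, 3)."""
    sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return float(sq.min(axis=1).mean() + sq.min(axis=0).mean())

def total_loss(lengths, labels, recon, target, recon_weight):
    """Classification loss plus weighted reconstruction loss for a batch."""
    rec = np.mean([chamfer_distance(r, t) for r, t in zip(recon, target)])
    return margin_loss(lengths, labels) + recon_weight * float(rec)
```

Setting `recon_weight` to zero recovers the model without reconstruction, while a large value lets the reconstruction term dominate the gradient early in training.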
In this paper, we propose a spatial-aware network, named Point2SpatialCapsule, to jointly aggregate geometric features and spatial relationships of local regions on point clouds. The proposed Point2SpatialCapsule has a wide range of potential applications, since it can be combined with other multi-scale local feature extraction methods for learning the global shape representation of 3D point clouds. Compared with previous feature aggregation methods, Point2SpatialCapsule is able to integrate both the geometric features of local regions and the spatial relationships among them. The features of local regions are aggregated by spatial-aware capsules with dynamic routing, which preserves the spatial relationships between the extracted features. Experiments show that our network achieves superior performance on point cloud classification, retrieval and part segmentation tasks under different datasets.
- 3DCapsule: Extending the capsule architecture to classify 3D point clouds. In IEEE Winter Conference on Applications of Computer Vision.
- NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5297–5307.
- (2017) GIFT: Towards scalable 3D shape retrieval. IEEE Transactions on Multimedia 19 (6), pp. 1257–1271.
- (2018) Object classification from 3D volumetric data with 3D capsule networks. In IEEE Global Conference on Signal and Information Processing.
- Parts4Feature: Learning 3D global features from generally semantic parts in multiple views. In International Joint Conference on Artificial Intelligence.
- (2019) Unsupervised learning of 3D local features from raw voxels based on a novel permutation voxelization strategy. IEEE Transactions on Cybernetics 49 (2), pp. 481–494.
- (2017) BoSCC: Bag of spatial context correlations for spatially enhanced 3D shape representation. IEEE Transactions on Image Processing 26 (8), pp. 3707–3720.
- (2019) 3D2SeqViews: Aggregating sequential views for 3D global feature learning by CNN with hierarchical attention aggregation. IEEE Transactions on Image Processing 28 (8), pp. 3986–3999.
- (2019) View inter-prediction GAN: Unsupervised representation learning for 3D shapes by learning global shape memories to support local view predictions. In 33rd AAAI Conference on Artificial Intelligence.
- (2019) SeqViews2SeqLabels: Learning 3D global features via aggregating sequential views by RNN with attention. IEEE Transactions on Image Processing 28 (2), pp. 658–672.
- (2019) Y2Seq2Seq: Cross-modal representation learning for 3D shape and text by joint reconstruction and prediction of view and word sequences. In 33rd AAAI Conference on Artificial Intelligence.
- (2019) Multi-Angle Point cloud-VAE: Unsupervised feature learning for 3D point clouds from multiple angles by joint self-reconstruction and half-to-half prediction. In IEEE International Conference on Computer Vision (ICCV).
- (2019) 3DViewGraph: Learning global features for 3D shapes from a graph of unordered views with attention. In International Joint Conference on Artificial Intelligence.
- (2018) Deep Spatiality: Unsupervised learning of spatially-enhanced global and local 3D features by deep neural network with coupled softmax. IEEE Transactions on Image Processing 27 (6), pp. 3049–3063.
- (2019) 3D-SIS: 3D semantic instance segmentation of RGB-D scans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4421–4430.
- (2018) CapsuleGAN: Generative adversarial capsule network. In Proceedings of the European Conference on Computer Vision.
- Exploiting the PANORAMA representation for convolutional neural network classification and retrieval. In 3DOR.
- (2017) Escape from cells: Deep kd-networks for the recognition of 3D point cloud models. In IEEE International Conference on Computer Vision (ICCV), pp. 863–872.
- (2019) A-CNN: Annularly convolutional neural networks on point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7421–7430.
- (2018) Capsules for object segmentation. arXiv:1804.04241.
- (2019) Octree guided CNN with spherical kernels for 3D point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- (2018) SO-Net: Self-organizing network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9397–9406.
- (2018) PointCNN: Convolution on x-transformed points. In Advances in Neural Information Processing Systems, pp. 820–830.
- (2019) Point2Sequence: Learning the shape representation of 3D point clouds with an attention-based sequence to sequence network. In 33rd AAAI Conference on Artificial Intelligence.
- (2011) Computing the inner distances of volumetric models for articulated shape description with a visibility graph. IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (12), pp. 2538–2544.
- (2019) Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8895–8904.
- (2018) Attention-based capsule networks with dynamic routing for relation extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
- (2017) PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- (2017) PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108.
- (2016) Volumetric and multi-view CNNs for object classification on 3D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5648–5656.
- (2017) OctNet: Learning deep 3D representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- (2017) Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pp. 3856–3866.
- (2016) SHREC’16 track large-scale 3D shape retrieval from ShapeNet core55. In Proceedings of the Eurographics Workshop on 3D Object Retrieval.
- (2016) VConv-DAE: Deep volumetric shape learning without object labels. In European Conference on Computer Vision, pp. 236–250.
- (2018) Mining point cloud local structures by kernel correlation and graph pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- (2015) DeepPano: Deep panoramic representation for 3-d shape recognition. IEEE Signal Processing Letters 22 (12), pp. 2339–2343.
- (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883.
- (2018) PointNetVLAD: Deep point cloud based retrieval for large-scale place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4470–4479.
- (2012) Robust shape normalization of 3D articulated volumetric models. Computer-Aided Design 44 (12), pp. 1253–1268.
- (2017) O-CNN: Octree-based convolutional neural networks for 3D shape analysis. ACM Transactions on Graphics 36 (4), pp. 72.
- (2019) Associatively segmenting instances and semantics in point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4096–4105.
- (2018) Dynamic graph CNN for learning on point clouds. arXiv:1801.07829.
- (2019) PointConv: Deep convolutional networks on 3D point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9621–9630.
- (2015) 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1912–1920.
- (2018) MCapsNet: Capsule network for text with multi-task learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 4565–4574.
- (2015) DeepShape: Deep learned shape descriptor for 3D shape matching and retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1275–1283.
- (2016) Learned binary spectral shape descriptor for 3D shape correspondence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3309–3317.
- (2018) Attentional ShapeContextNet for point cloud recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4606–4615.
- (2018) SpiderCNN: Deep learning on point sets with parameterized convolutional filters. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 87–102.
- (2019) Modeling point clouds with self-attention and gumbel subset sampling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3323–3332.
- (2018) Investigating capsule networks with dynamic routing for text classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 3110–3119.
- (2018) SPNet: Deep 3D object classification and retrieval using stereographic projection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- (2018) PVNet: A joint convolutional network of point cloud and multi-view for 3D shape recognition. In Proceedings of the 26th ACM International Conference on Multimedia, pp. 1310–1318.
- (2019) PointWeb: Enhancing local neighborhood features for point cloud processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5565–5573.
- (2019) 3D Point Capsule Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.