3D object classification of point clouds is an important task as laser scanners, or other depth sensors, generating point clouds are now a commodity on, e.g., autonomous vehicles, surveying vehicles, service robots and drones. There has been less progress using deep learning methods in the area of point clouds compared to 2D images and videos, partly because the data in a point cloud are typically unordered as opposed to the pixels in a 2D image, which means standard deep learning architectures are not applicable. Nevertheless, some works in the area use architectures circumventing this issue.
An architecture for classification that works on raw, unordered 3D point cloud data often contains three key modules: a feature extraction layer that obtains a richer representation of the input point set, an aggregation layer that acts as a symmetric function, aggregating information from the points while removing the effect of point ordering (point permutation), and lastly a classifier layer that classifies the entire input point set (Figure 1). The pioneering work in this area is PointNet [14], which uses a multilayer perceptron (mlp) as a feature extraction layer, a max pooling operation as a symmetric function for aggregation, and a fully connected (fc) perceptron as a classifier.
More recent methods improve upon the first two stages, feature extraction and aggregation [16, 23, 12, 21, 22]. Feature extraction is improved by obtaining a richer description that uses local information instead of only global information [16, 23, 12]. In [21], aggregation is improved by using NetVlad [2]. In the context of 3D point cloud classification, however, little effort has gone into improving the last part, the actual classification layer.
In this paper, we propose a novel extension of Capsule Networks [18] for the purpose of classification of 3D point clouds. The capsule concept was first proposed in [4] and later extended using a dynamic routing algorithm [18]. To date, the capsule concept has not yet found widespread use as it is a recent innovation. However, where it has been used it has improved the overall classification performance; for example, [18] improves upon the state of the art on the long-saturated MNIST dataset [11]. The capsule routing algorithm learns global coherence by enforcing part-whole relationships to be learned. For example, if a person’s lip is replaced by an eye in a face, the capsule routing algorithm reduces the chance of classifying the image as a face. Thanks to this capability, the capsule concept has some structural advantages over conventional networks, which typically are only capable of extracting local translation invariances.
In this paper, we extend the Capsule Network concept to be applicable to 3D point cloud data, and we investigate the utility of such a structure when inserted in recent popular methods for 3D point cloud classification. To the best of our knowledge, this is the first time that the capsule concept has been applied to 3D point cloud data. Specifically, we investigate the effect on classification performance as well as robustness to noise via an ablation study considering alternative algorithms for the various parts. This new network design, called 3DCapsule, is non-trivial as more powerful aggregation functions used to introduce an ordering of the data, such as NetVlad, may not preserve a spatial relationship between the data points which the intuition behind the original Capsule Network design relies on.
The main contributions of our work are as follows:
- We present a novel classifier, 3DCapsule, applicable to 3D point cloud classification. It is a drop-in replacement for the typically used fully connected classifier.
- We add a new layer, ComposeCaps, which, in lieu of the spatial relationships lost through permutation invariance, learns a new, useful mapping of capsules that can be exploited by the capsule network.
- We demonstrate the benefit of our 3DCapsule and ComposeCaps via ablation studies against baseline architectures. We investigate both classification performance and robustness to outlier points and point perturbation.
2 Related Work
There are three common categories of 3D object classification based on deep learning networks. The first category uses a volumetric representation of the data as input to the network [24, 13]. The second category is view-based [20, 15]: it projects the 3D shape onto 2D images such that more classical 2D approaches can be utilized. The third category uses raw point cloud data as input, which the network then operates on directly [16, 23, 12, 25, 22]. In this work, we focus on 3D object classification based on the last of these, the raw point cloud representation. This is because it does not suffer from severe scalability issues like the volumetric representation does, and, unlike the view-based methods, it does not make any a priori assumptions about which 2D planes, and how many, the point cloud should be projected onto.
PointNet [14] is the first work which consumes raw point clouds directly without using any volumetric or view-based data representations. PointNet uses a multi-layer perceptron (mlp) to extract features from point sets, and instances of Spatial Transformer Networks (STN) in order to make it robust to certain transformations. Furthermore, a max pooling layer removes the dependence on point ordering and aggregates information from the points. Classification is done via a fully connected layer at the end.
PointNet++ [16] extracts richer features which also consider local information. A hierarchical neural network is proposed which applies PointNet recursively on partitions of point sets, giving it the ability to learn local features from a range of contextual scales. [23], on the other hand, proposes an Edge Convolution (EdgeConv) operation in order to capture local structures by constructing a local neighborhood graph and applying convolution-like operations on the edges connecting neighboring pairs of points. [12] introduces a permutation invariant network, called the Self-Organizing Network (SO-Net), which utilizes the spatial distribution of point clouds. SO-Net also uses a point cloud autoencoder as pre-training to enhance the performance.
Other methods instead improve the aggregation module of PointNet [14]. [22] proposes a structure similar to PointNet++ [16] but replaces the mini-PointNet with a spectral convolution operation. To achieve this, a new pooling method, called recursive cluster pooling, is introduced. This pooling strategy aggregates information in the spectral domain from clusters that are related to each other. [21] proposes a new pooling method based on the NetVlad [2] idea. There, unordered point sets are mapped, based on clustering, to a new space that is invariant to permutation, enabling a richer feature representation than, e.g., max pooling.
To the best of our knowledge, classification of the representations in these architectures has been limited to using a fully connected layer. Here, we propose an alternative, 3DCapsule, an extension of Capsule Networks [18]. Capsules were originally introduced as an alternative to the convolutional neural network (CNN), which is the most common network in deep learning. Capsules can learn a representation of an image that is more robust to spatial and pose variations than a typical CNN thanks to using a vector output instead of a scalar one, and by replacing the max-pooling operation with a dynamic routing algorithm. Although Capsules have some advantages over CNNs, as demonstrated by the encouraging results on 2D image classification [18, 7] and segmentation [10], there is currently no work based on the Capsule Network for 3D object classification. Therefore, in this paper, we propose an extension of Capsule Networks such that it becomes applicable to 3D point cloud data, and investigate its utility for the purpose of 3D object classification.
In this section, we describe our contribution named 3DCapsule, which is an essential building block in our proposed method. Putting 3DCapsule into context, we first describe briefly a typical 3D object classification architecture operating on point sets from point clouds. This includes a feature extraction stage and an aggregation stage, the latter which should be invariant to any permutation of the points in the point set. Once these typical elements are explained, we move on to describing our proposed method, 3DCapsule, in detail.
3.1 Feature extraction and aggregation
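Suppose an unordered point set $\{x_1, x_2, \dots, x_n\}$ where $x_i \in \mathbb{R}^3$. A set function needs to be defined that is invariant to any permutation of the point set,
$$f(x_1, x_2, \dots, x_n) = g\big(h(\Theta_1), h(\Theta_2), \dots, h(\Theta_n)\big),$$
where $f$ is the set function, $h$ is the feature extraction function, $g$ is the aggregation function which needs to be invariant to permutation of points, and $\Theta_i$ represents the set of arguments associated with $h$ for point $x_i$.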
The feature extraction module maps raw point sets to a new, in most cases high dimensional, space where each point (or region around a point) has a richer representation. One choice of the feature extraction function $h$, which is used in PointNet [14], is $h(x_i)$. It is there implemented using a shared mlp layer, where the mlp is applied separately to each point without considering any local neighborhood. This results in extracting features carrying global information only. Another choice of $h$, proposed by [23] and called EdgeConv, is $h(x_i, x_j - x_i)$, where $x_j$ is a neighbor of $x_i$. In this case, point sets are represented by a directed graph and edge features based on $k$-nearest neighbors are calculated. EdgeConv [23] is applied to edges, extracting features which encode both global and local information from each point based on its neighborhood (see Figure 2).
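The neighborhood-based inputs described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the EdgeConv implementation of [23]: the helper names `knn_indices` and `edge_inputs` are hypothetical, neighbors are found by brute force, and the shared mlp that would consume each edge input (followed by a pooling over neighbors) is omitted.

```python
def knn_indices(points, i, k):
    """Indices of the k nearest neighbors of points[i], excluding i itself."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: sq_dist(points[i], points[j]))[:k]


def edge_inputs(points, k):
    """For every point x_i, build the k edge inputs (x_i, x_j - x_i)."""
    result = []
    for i, x_i in enumerate(points):
        edges = []
        for j in knn_indices(points, i, k):
            offset = [b - a for a, b in zip(x_i, points[j])]
            # Concatenate the point itself (global part) with the offset (local part).
            edges.append(x_i + offset)
        result.append(edges)
    return result


# Toy point cloud (hypothetical data, four points in 3D).
points = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [5.0, 5.0, 5.0]]
edges = edge_inputs(points, k=2)
```

Each entry of `edges[i]` is a 6-dimensional vector, so a per-edge mlp sees both where the point is and how its neighborhood is arranged around it.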
Point sets resulting from a typical 3D scanning device, such as a LIDAR, are inherently unordered, although their spatial distribution is still invariant to permutation. Hence, we need a function that can aggregate per-point features into a feature vector that is independent of any ordering of the points in the point set. A simple and effective choice of this aggregation function, $g$, is the max pooling operation, which is used in PointNet [14] and is invariant to permutation. It is also possible to use aggregation methods based on clustering, such as NetVlad [2]. In NetVlad, a few initial point centers are selected via k-means clustering and residual distances between each center and its neighborhood points are calculated. As proven in [21], this renders a representation invariant to permutation (see Figure 3).
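The clustering-based aggregation described above can be illustrated with a simplified VLAD-style sketch. It is an assumption-laden stand-in rather than NetVlad itself: it uses a hard nearest-center assignment where NetVlad learns a soft assignment, the centers are given rather than learned, and the function name `vlad_aggregate` is hypothetical.

```python
def vlad_aggregate(features, centers):
    """Sum, per cluster center, the residuals of the features assigned to it."""
    dim = len(centers[0])
    residual_sums = [[0.0] * dim for _ in centers]
    for f in features:
        # Hard-assign the feature to its nearest center
        # (NetVlad replaces this with a learned soft assignment).
        nearest = min(
            range(len(centers)),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centers[c])),
        )
        for d in range(dim):
            residual_sums[nearest][d] += f[d] - centers[nearest][d]
    # Flatten to one descriptor of length (number of centers * feature dim).
    return [value for row in residual_sums for value in row]


# Toy per-point features and centers (hypothetical values).
centers = [[0.0, 0.0], [10.0, 10.0]]
features = [[1.0, 2.0], [9.0, 12.0], [-1.0, 0.0], [11.0, 10.0]]
descriptor = vlad_aggregate(features, centers)
reordered = vlad_aggregate(list(reversed(features)), centers)
```

Because the descriptor is built from per-center sums, reordering the input features leaves it unchanged, which is the property the aggregation stage needs.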
Finally, via a collection of feature extraction functions $h$, corresponding values of the set function $f$ can be computed to form a vector $F = [f_1, f_2, \dots, f_K]$ where $F \in \mathbb{R}^K$, which represents a feature vector of an input point set. This feature vector is invariant to any permutation of the input point set.
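The permutation invariance claimed here can be checked with a small sketch. It assumes a toy shared per-point layer (one linear map plus ReLU) in place of the real mlp and max pooling as the aggregator; the names `shared_feature`, `max_pool`, and `set_feature` are hypothetical.

```python
import random


def shared_feature(point, weights):
    """Toy stand-in for a shared per-point mlp: one linear layer plus ReLU."""
    return [max(0.0, sum(w * c for w, c in zip(row, point))) for row in weights]


def max_pool(features):
    """Channel-wise max over all points; a symmetric, order-independent function."""
    return [max(channel) for channel in zip(*features)]


def set_feature(points, weights):
    """Apply the same feature extractor to every point, then aggregate."""
    return max_pool([shared_feature(p, weights) for p in points])


rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(8)]  # 8 channels
points = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(16)]  # 16 points

shuffled = points[:]
rng.shuffle(shuffled)
same = set_feature(points, weights) == set_feature(shuffled, weights)
```

Here `same` is true for any shuffle, since each point is processed independently and the channel-wise max does not depend on the order of its arguments.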
Suppose we have $N$ primary capsules where each one has dimension $d_p$, and $M$ 3DCaps each of dimension $d_c$. ComposeCaps is, here, a fully connected layer mapping the feature vector $F$ to primary capsules:
$$P = \sigma(W_c F + b_c),$$
where $W_c$ is the weight matrix of ComposeCaps, $b_c$ is its bias, $\sigma$ is a sigmoid activation function, and $P \in \mathbb{R}^{N d_p}$ contains the collection of all newly composed primary capsules as a vector.
The new capsules are then retrieved by splitting $P$ as follows,
$$P = [p_1, p_2, \dots, p_N],$$
where $p_i \in \mathbb{R}^{d_p}$ is a capsule at the primary capsule layer.
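A minimal sketch of this step, assuming a plain fully connected layer with a sigmoid whose output is cut into equal consecutive chunks. The function name `compose_caps`, the random weights, and the toy sizes are illustrative assumptions; the experiments in this paper use 500 primary capsules of dimension 8.

```python
import math
import random


def compose_caps(feature, weight, bias, num_caps, cap_dim):
    """Fully connected layer with sigmoid, then split the output into capsules."""
    assert len(weight) == num_caps * cap_dim
    flat = []
    for row, b in zip(weight, bias):
        z = sum(w * x for w, x in zip(row, feature)) + b
        flat.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid activation
    # Consecutive chunks of length cap_dim become the primary capsules.
    return [flat[i * cap_dim:(i + 1) * cap_dim] for i in range(num_caps)]


rng = random.Random(0)
feature = [rng.uniform(-1, 1) for _ in range(16)]  # toy feature vector
num_caps, cap_dim = 5, 8                            # toy sizes (assumption)
weight = [[rng.uniform(-0.5, 0.5) for _ in range(16)]
          for _ in range(num_caps * cap_dim)]
bias = [0.0] * (num_caps * cap_dim)
capsules = compose_caps(feature, weight, bias, num_caps, cap_dim)
```

Since the layer is learned end to end, which feature entries end up in which capsule is decided by training rather than by any spatial layout of the input.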
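A squashing function is defined which maps the length of a capsule vector to the range $[0, 1)$,
$$v = \frac{\|s\|^2}{1 + \|s\|^2} \frac{s}{\|s\|},$$
where v is the output vector of a capsule and s is its input. So, in order to find the output of each primary capsule, the above squashing function is applied,
$$u_i = \frac{\|p_i\|^2}{1 + \|p_i\|^2} \frac{p_i}{\|p_i\|},$$
where $u_i$ is the output vector of primary capsule $p_i$.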
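The squashing non-linearity can be written in a few lines. The sketch below follows the form used in the dynamic routing paper by Sabour et al.; the small `eps` term is an added assumption to avoid division by zero for an all-zero input, and the helper `length` is hypothetical.

```python
import math


def squash(s, eps=1e-9):
    """Keep the direction of s and shrink its length to |s|^2 / (1 + |s|^2)."""
    norm_sq = sum(x * x for x in s)
    norm = math.sqrt(norm_sq)
    scale = norm_sq / (1.0 + norm_sq) / (norm + eps)
    return [scale * x for x in s]


def length(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


short = squash([0.1, 0.0])   # short inputs are shrunk towards zero length
long_ = squash([3.0, 4.0])   # long inputs approach, but never reach, length 1
```

For the input `[3.0, 4.0]` (length 5) the output length is 25/26, up to the effect of `eps`.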
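The prediction vectors are obtained by
$$\hat{u}_{j|i} = W_{ij} u_i,$$
where $u_i$ is the output of primary capsule $i$ and $W_{ij} \in \mathbb{R}^{d_c \times d_p}$, with $d_p$ and $d_c$ the dimensions of the primary capsules and the 3DCaps respectively, and the input to the 3DCaps layer is a weighted sum over all prediction vectors from the capsules in the primary capsule layer,
$$s_j = \sum_i c_{ij} \hat{u}_{j|i},$$
where $c_{ij}$ are coupling coefficients that are determined by the routing algorithm in an iterative loop,
$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})},$$
where the initial logits $b_{ij}$ are the log prior probabilities that capsule $i$ should be coupled to capsule $j$.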
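The iterative loop can be sketched as follows, following the routing-by-agreement procedure of Sabour et al.: coupling coefficients are a softmax over the logits, the outputs are squashed weighted sums, and the logits are raised where a prediction agrees with the output. The function names, the three iterations, and the toy prediction vectors are assumptions for illustration.

```python
import math


def squash(s, eps=1e-9):
    norm_sq = sum(x * x for x in s)
    scale = norm_sq / (1.0 + norm_sq) / (math.sqrt(norm_sq) + eps)
    return [scale * x for x in s]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(u_hat, iterations=3):
    """u_hat[i][j]: prediction vector from primary capsule i for output capsule j."""
    num_in, num_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * num_out for _ in range(num_in)]  # routing logits, uniform start
    for _ in range(iterations):
        c = [softmax(row) for row in b]  # coupling coefficients, sum to 1 over j
        v = []
        for j in range(num_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(num_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        for i in range(num_in):
            for j in range(num_out):
                # Agreement between prediction and output raises the logit.
                b[i][j] += sum(a * q for a, q in zip(u_hat[i][j], v[j]))
    return v, c


# Three primary capsules whose predictions agree on output capsule 0.
u_hat = [
    [[1.0, 0.0], [0.1, 0.0]],
    [[1.0, 0.0], [0.0, 0.1]],
    [[0.9, 0.1], [0.0, -0.1]],
]
v, c = route(u_hat)
```

In this toy case the predictions for output capsule 0 reinforce each other, so every primary capsule ends up routing more of its output there than to capsule 1.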
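The loss function used in the 3DCapsule is the sum of the margin loss and the reconstruction loss,
$$L = \sum_k L_k + L_{rec}, \qquad L_k = T_k \max(0, m^{+} - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^{-})^2,$$
where $v_k$ is the output of the 3DCaps capsule for class $k$ and $T_k$ is equal to 1 if the input belongs to class $k$, otherwise 0. The parameters $m^{+}$, $m^{-}$, and $\lambda$ are the upper margin, the lower margin, and the down-weighting factor for absent classes, respectively. The total margin loss is calculated by summing the individual margin losses for all classes. The reconstruction loss $L_{rec}$ is computed over the points in the set, where $n$ is the number of points and $\hat{x}_i$ is the reconstruction of point $x_i$.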
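The margin part of the loss can be sketched as below. The function name `margin_loss` is hypothetical, and the example values 0.9, 0.1, and 0.5 for the two margins and the down-weighting factor are an assumption taken from the original dynamic routing paper, used here only to make the arithmetic concrete. The reconstruction term is not included.

```python
def margin_loss(lengths, target, m_plus, m_minus, lam):
    """Sum of per-class margin losses given the capsule output lengths."""
    total = 0.0
    for k, length in enumerate(lengths):
        if k == target:
            # The true class is penalized when its capsule is too short.
            total += max(0.0, m_plus - length) ** 2
        else:
            # Other classes are penalized (down-weighted) when too long.
            total += lam * max(0.0, length - m_minus) ** 2
    return total


# Capsule lengths for three classes; class 0 is the true class.
loss = margin_loss([0.95, 0.05, 0.3], target=0, m_plus=0.9, m_minus=0.1, lam=0.5)
```

Only the third class contributes here: its length 0.3 exceeds the lower margin by 0.2, giving 0.5 * 0.2^2 = 0.02.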
In this section, we provide an ablation study to investigate the effect of the 3DCapsule using different feature extractors and aggregation modules. The investigation includes accuracy measures as well as robustness with respect to corruptions such as outlier points and Gaussian perturbation noise. We also evaluate the effect of ComposeCaps and reconstruction loss within the 3DCapsule.
4.1 Network Architecture and Training
In all our experiments, we used the following parameters during network training. The Adam optimizer [8] was used. Batch normalization [5] was also used, except in the 3DCapsule part. The network was implemented using TensorFlow and executed on an NVIDIA GTX1080Ti.
We used feature extraction modules from two different architectures (PointNet [14] and EdgeConv [23]). In the case of the PointNet structure, we employed a spatial transformer network, followed by two shared mlp layers (64,64), followed by a second spatial transformer network, ending with three mlp layers (64,128,1024). For the EdgeConv-based feature extraction module, the first EdgeConv block used three shared mlp layers (64,64,64) and the second EdgeConv block consisted of a shared mlp of size (128). Finally, a shared mlp of size (1024) was used to concatenate all features together.
In terms of aggregation modules, we also used two different approaches (max pooling and NetVlad [2]). In the case of max pooling, we obtained a feature vector of size 1024. When using NetVlad to retrieve a point feature vector, the number of cluster centers was set to 128 and the last mlp layer was set to size 128. In order to be compatible with NetVlad, the last mlp layer of the PointNet and EdgeConv feature extractors was replaced with an mlp layer of size 128, which produces the feature vector output in the case of NetVlad.
Finally, as the classifier, we again used two different approaches (a fully connected classifier as the baseline, and our contribution, 3DCapsule). In the case of the fully connected classifier, we used three fully connected layers of sizes (512,256,#classes), with dropout [19] in the first two layers. When investigating our 3DCapsule, we used a fully connected layer of size 4000 as our ComposeCaps layer to map the feature vector output to the primary capsule input. 500 primary capsules with a dimension of 8 were used. In the 3DCaps layer, each capsule was of dimension 4.
In our experiments, we used two versions of the ModelNet dataset [24], ModelNet10 and ModelNet40. ModelNet40 contains CAD models from 40 different classes, which were divided into training samples and testing samples. ModelNet10 was likewise split into training samples and testing samples. We followed the same experimental settings as in . In order to have a fair comparison, the prepared ModelNet10/40 dataset from  was used, where each model is represented by 1024 points.
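Table 1: Classification accuracy for combinations of feature extractor, aggregation module, and classifier.
|Feature extractor||Aggregation||Classifier||Accuracy (%)|
|PointNet||NetVlad||FC||87.2|
|PointNet||NetVlad||3DCapsule||90.6|
|EdgeConv||NetVlad||FC||91.2|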
4.3 Ablation study of the 3DCapsule
In this section, we evaluate the effect of the 3DCapsule with respect to different feature extractors and aggregation modules. The baseline classifier used for comparison is a fully connected layer (fc), as that is what is employed in comparable works operating on raw point clouds. Furthermore, we use two feature extraction modules based on PointNet [14] and EdgeConv [23], and two aggregation approaches, max pooling and NetVlad [2].
From Table 1, it is clear that regardless of which feature extraction method and/or aggregation method is utilized, our 3DCapsule improves over using a regular fully connected layer as a classifier. The biggest improvement over a fully connected classifier is observed when the 3DCapsule is combined with the NetVlad aggregator. This is natural since, compared to max pooling, NetVlad outputs a much richer representation of the data. A fully connected classifier may not be able to make use of this richer information, whereas our 3DCapsule can, because it takes advantage of the routing algorithm and the capsule vector representation.
4.4 Effect of ComposeCaps
In this section, we evaluate the effect of ComposeCaps in our proposed method. We, again, test the effect using the two feature extraction operators PointNet [14] and EdgeConv [23] as well as the two aggregation methods, max pooling and NetVlad [2]. The result is shown in Table 3. It can be observed that the ComposeCaps layer does indeed improve the performance in all combinations, demonstrating the utility of ComposeCaps.
Table 3: Effect of ComposeCaps on classification accuracy.
|Architecture||ComposeCaps||Accuracy (%)|
|PointNet+NetVlad||No||89.9|
|PointNet+NetVlad||Yes||90.6|
|EdgeConv+NetVlad||No||92.4|
|EdgeConv+NetVlad||Yes||92.7|
4.5 Effect of Reconstruction Loss
In this section, we evaluate the effect of reconstruction loss in the 3DCapsule. As shown in Table 4, the reconstruction loss is crucial in all structures used in this paper.
|recons loss||PointNet+Max||PointNet +NetVlad||EdgeConv+Max||EdgeConv+NetVlad|
4.6 Robustness to Noise
In this section, we evaluate the robustness of our proposed method, 3DCapsule, with respect to two different types of data corruption: 1) outlier points, and 2) point perturbation. This robustness test was carried out in a similar way as the ablation study in Section 4.3, with the only difference being that the different network combinations were here subjected to varying levels of training and testing noise.
Input points were normalized to the unit sphere in this experiment. When creating the training data, both distorted and undistorted data were added together, enabling the networks to better learn how to deal with distorted data without unnecessarily reducing performance when exposed to undistorted data.
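A minimal sketch of the normalization step, assuming the common convention of centering the cloud at its centroid and dividing by the largest distance from the centroid. The exact procedure is not spelled out in the text, so both the convention and the function name `normalize_to_unit_sphere` are assumptions.

```python
import math


def normalize_to_unit_sphere(points):
    """Center a 3D point cloud and scale it so the farthest point has norm 1."""
    n = len(points)
    centroid = [sum(p[d] for p in points) / n for d in range(3)]
    centered = [[p[d] - centroid[d] for d in range(3)] for p in points]
    radius = max(math.sqrt(sum(c * c for c in p)) for p in centered)
    if radius == 0.0:
        return centered  # degenerate cloud: all points coincide
    return [[c / radius for c in p] for p in centered]


cloud = [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0]]  # toy cloud (hypothetical data)
normalized = normalize_to_unit_sphere(cloud)
```

With a fixed scale, a given number of outliers or a given noise standard deviation means the same thing for every model in the dataset.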
Outlier Test: In the outlier experiment, a varying number of the original points (0, 1, 2, 5, 10, 20, 50, 100) were replaced by outlier points, both in the training and the testing samples. The result is shown in Figure 5. It is evident that a 3DCapsule enabled architecture is significantly more robust against outlier points in almost every combination of training and testing noise levels, compared to the baseline (fully connected classifier), irrespective of which feature extractor and aggregator is used. Furthermore, it can be seen that NetVlad [2] generally achieves better performance than max pooling, and that the gap tends to be greater when combined with 3DCapsule, as expected.
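The corruption used in this test can be sketched as replacing randomly chosen points with random ones. The outlier distribution is an assumption here (uniform in the cube $[-1, 1]^3$ that encloses the unit sphere), as is the function name `replace_with_outliers`.

```python
import random


def replace_with_outliers(points, num_outliers, rng):
    """Replace num_outliers randomly chosen points with uniform random points."""
    corrupted = [list(p) for p in points]
    for idx in rng.sample(range(len(points)), num_outliers):
        # Assumed outlier model: uniform in the cube enclosing the unit sphere.
        corrupted[idx] = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    return corrupted


rng = random.Random(0)
cloud = [[i / 100.0, 0.0, 0.0] for i in range(100)]  # toy cloud (hypothetical)
corrupted = replace_with_outliers(cloud, 10, rng)
```

The total number of points stays fixed, so the network input size is the same at every outlier level.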
In the point perturbation test, Gaussian noise was added to each point independently. The standard deviation of the added noise was varied in fixed steps over a range of values. The result is shown in Figure 6. It is evident that, again, a 3DCapsule enabled architecture is significantly more robust, also in the case of Gaussian perturbation noise, compared to the baseline that instead utilises a fully connected classifier. It can also be seen in the graphs that the fully connected classifier is not good at making use of the richer representation that NetVlad [2] provides, whereas 3DCapsule can unlock the information provided by the more powerful NetVlad aggregator.
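A minimal sketch of the perturbation, assuming zero-mean Gaussian noise drawn independently for every coordinate of every point. The function name `perturb` and the standard deviation used in the example are illustrative assumptions, not the levels used in the experiments.

```python
import random


def perturb(points, sigma, rng):
    """Add independent zero-mean Gaussian noise to every coordinate."""
    return [[c + rng.gauss(0.0, sigma) for c in p] for p in points]


rng = random.Random(0)
cloud = [[0.0, 0.0, 0.0], [0.5, -0.5, 0.25]]  # toy cloud (hypothetical data)
noisy = perturb(cloud, 0.05, rng)              # example noise level (assumption)
clean = perturb(cloud, 0.0, rng)               # zero noise leaves the cloud as is
```

Unlike the outlier test, every point is moved slightly, so this probes sensitivity to small geometric errors rather than to a few gross ones.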
4.7 Absolute Classification Performance
In this paper, we do not explicitly set out to construct a new architecture that will achieve state-of-the-art classification performance, but rather contribute with the design of a new component, 3DCapsule, that can be drop-in inserted into many other architectures. However, for completeness, we here compare our proposed method to state-of-the-art classification architectures operating on raw, unordered, point clouds that are unaffected by any additional noise.
We chose to compare the best combination of feature extractor and aggregator together with our 3DCapsule that we had found in the experiments in the previous sections. That is, EdgeConv plus NetVlad were used as feature extraction and aggregation modules respectively, together with our 3DCapsule. The comparison was made using the ModelNet40/ModelNet10 3D object recognition benchmark, and the results are shown in Table 5.
The lower part of the table shows those methods that are apples-to-apples comparable with ours. That is, they use the same input data representation that we do (3D points only, without any additional information such as surface normals), and they also use the exact same number of input points, 1024. It is evident that in an apples-to-apples comparison where we have applied our 3DCapsule, we outperform the current state of the art. In fact, our method is only outperformed by SO-Net [12] when it uses significantly more points than we do (5000 vs 1024) as well as surface normals.
|Method||Input||ModelNet40||ModelNet10|
|SpiderCNN||pc + normal||92.4||-|
|Local Spectral||pc + normal||92.1||-|
|PointNet++||pc + normal||91.9||-|
|SO-Net||pc + normal||93.4||95.7|
4.8 Time and Space Complexity
It is clear from Table 6 that the 3DCapsule adds additional complexity to the architectures to which it is applied. However, we believe that the benefit of significant increase in robustness to various kinds of noise (as shown in Figure 5 and Figure 6) outweighs these drawbacks. The performance on classification, without added noise, is also increased to a degree, which demonstrates the benefit of 3DCapsule in comparison to the common fully connected classifier, albeit at the cost of an increase in memory usage and inference time.
In this paper, we introduced the 3DCapsule, which is an extension of the Capsule concept that makes it applicable to unordered point sets. It is intended as an alternative, drop-in replacement, in the case of 3D point cloud classification, for the commonly used fully connected classifier. A key insight is that the original Capsule concept, which operated on 2D images, implicitly relied on the fact that the pixels in such images are ordered and have a spatial relationship. Modern architectures for 3D point set classification typically introduce symmetric functions to achieve invariance to point set permutation; however, this also invalidates the spatial relationship between feature vectors. Therefore, in this work, a ComposeCaps layer was introduced to find a new, meaningful mapping of the feature vectors to capsules that the 3DCapsule network could better exploit.
While the 3DCapsule achieves improved classification performance in an apples-to-apples comparison against state-of-the-art methods on data that is not affected by additional noise, its advantage is most clearly demonstrated on corrupted data, such as data with a varying number of outliers or Gaussian noise. This is demonstrated against the typical fully connected classifier, irrespective of which feature extractor or aggregator module it is combined with.
References
[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[2] R. Arandjelovic, P. Gronát, A. Torii, T. Pajdla, and J. Sivic. Netvlad: Cnn architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5297–5307, 2016.
[3] J. A. Hartigan and M. A. Wong. Algorithm as 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108, 1979.
[4] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.
[5] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[6] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015.
[7] A. Jaiswal, W. AbdAlmageed, and P. Natarajan. Capsulegan: Generative adversarial capsule network. arXiv preprint arXiv:1802.06167, 2018.
[8] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[9] R. Klokov and V. S. Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 863–872, 2017.
[10] R. LaLonde and U. Bagci. Capsules for object segmentation. arXiv preprint arXiv:1804.04241, 2018.
[11] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[12] J. Li, B. M. Chen, and G. H. Lee. So-net: Self-organizing network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9397–9406, 2018.
[13] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.
[14] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.
[15] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view cnns for object classification on 3d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5648–5656, 2016.
[16] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
[17] F. Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, 1962.
[18] S. Sabour, N. Frosst, and G. E. Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3856–3866, 2017.
[19] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[20] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, pages 945–953, 2015.
[21] M. A. Uy and G. H. Lee. Pointnetvlad: Deep point cloud based retrieval for large-scale place recognition. arXiv preprint arXiv:1804.03492, 2018.
[22] C. Wang, B. Samari, and K. Siddiqi. Local spectral graph convolution for point set feature learning. arXiv preprint arXiv:1803.05827, 2018.
[23] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon. Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018.
[24] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
[25] Y. Xu, T. Fan, M. Xu, L. Zeng, and Y. Qiao. Spidercnn: Deep learning on point sets with parameterized convolutional filters. arXiv preprint arXiv:1803.11527, 2018.