3DCapsule: Extending the Capsule Architecture to Classify 3D Point Clouds

11/06/2018 ∙ by Ali Cheraghian, et al. ∙ CSIRO ∙ Australian National University

This paper introduces the 3DCapsule, which is a 3D extension of the recently introduced Capsule concept that makes it applicable to unordered point sets. The original Capsule relies on the existence of a spatial relationship between the elements in the feature map it is presented with, whereas in point permutation invariant formulations of 3D point set classification methods, such relationships are typically lost. Here, a new layer called ComposeCaps is introduced that, in lieu of a spatially relevant feature mapping, learns a new mapping that can be exploited by the 3DCapsule. Previous works in the 3D point set classification domain have focused on other parts of the architecture, whereas instead, the 3DCapsule is a drop-in replacement of the commonly used fully connected classifier. It is demonstrated via an ablation study, that when the 3DCapsule is applied to recent 3D point set classification architectures, it consistently shows an improvement, in particular when subjected to noisy data. Similarly, the ComposeCaps layer is evaluated and demonstrates an improvement over the baseline. In an apples-to-apples comparison against state-of-the-art methods, again, better performance is demonstrated by the 3DCapsule.




1 Introduction

3D object classification of point clouds is an important task as laser scanners, or other depth sensors, generating point clouds are now a commodity on, e.g., autonomous vehicles, surveying vehicles, service robots and drones. There has been less progress using deep learning methods in the area of point clouds compared to 2D images and videos, partly because the data in a point cloud are typically unordered as opposed to the pixels in a 2D image, which means standard deep learning architectures are not applicable. Nevertheless, some works in the area use architectures circumventing this issue.

An architecture for classification that works on raw, unordered, 3D point cloud data often contains three key modules: a feature extraction layer for obtaining a richer representation of input point sets, an aggregation layer acting as a symmetric function aggregating information from points as well as removing the effect of varying ordering (point permutation), and lastly a classifier layer to classify the entire input point set (Figure 1). The pioneering work in this area is PointNet [14], which uses a multilayer perceptron (mlp) as a feature extraction layer, a max pooling operation as a symmetric function for aggregation, and a fully connected (fc) perceptron as a classifier.

More recent methods improve upon the first two stages, feature extraction and aggregation [16, 23, 12, 21, 22]. Feature extraction is improved by getting a richer description using local information instead of only global information [16, 23, 12]. In [21], aggregation is improved by using NetVlad [2]. In the context of 3D point classification, there has, however, not been much effort improving the last part, the actual classification layer.

Figure 1: A typical architecture of a 3D point cloud classification system. The system takes a set of input points and extracts per-point features, followed by an aggregation module that is tasked with building a feature vector invariant to point permutation. Lastly, a classifier is used to classify the resulting feature vector into one of the classes.

In this paper, we propose a novel extension of Capsule Networks [18] for the purpose of classification of 3D point clouds. The capsule concept was first proposed in [4] and later extended using a dynamic routing algorithm [18]. As it is a recent innovation, the capsule concept has not yet found widespread use. Where it has been used, however, it has improved the overall classification performance; for example, [18] improves upon the state of the art on the long-saturated MNIST dataset [11]. The capsule routing algorithm learns global coherence by enforcing part-whole relationships to be learned. For example, if a person's lip is replaced by an eye in a face, the routing algorithm reduces the chances of classifying the image as a face. Thanks to this capability, the capsule concept has some structural advantages over conventional networks, which typically are only capable of extracting local translation invariances.

In this paper, we extend the Capsule Network concept to be applicable to 3D point cloud data, and we investigate the utility of such a structure when inserted in recent popular methods for 3D point cloud classification. To the best of our knowledge, this is the first time that the capsule concept has been applied to 3D point cloud data. Specifically, we investigate the effect on classification performance as well as robustness to noise via an ablation study considering alternative algorithms for the various parts. This new network design, called 3DCapsule, is non-trivial, as the more powerful aggregation functions that are used to introduce an ordering of the data, such as NetVlad, may not preserve the spatial relationship between the data points that the intuition behind the original Capsule Network design relies on.

The main contributions of our work are as follows:

  • We present a novel classifier, 3DCapsule, applicable to 3D point cloud classification. It is a drop-in replacement of the typically used fully connected classifier.

  • We add a new layer, ComposeCaps, which in lieu of lost spatial relationships caused by permutation invariance, learns a new, useful, mapping of capsules that can be exploited by the capsule network.

  • We demonstrate the benefit of our 3DCapsule and ComposeCaps via ablation studies against baseline architectures. We investigate both classification performance as well as robustness to outlier points and point perturbation.

2 Related Work

There are three common categories of 3D object classification based on deep learning networks. The first category uses a volumetric representation of the data as input to the network [24, 13]. The second category is a view-based method [20, 15], which projects the 3D shape onto a 2D image such that more classical 2D approaches can be utilized. The third category uses raw point cloud data as input which the network then operates on directly [16, 23, 12, 25, 22]. In this work, we focus on 3D object classification based on the last of these, the raw point cloud representation. This is because it does not suffer from severe scalability issues like the volumetric representation does, and it does not make any a priori assumptions about which 2D planes, and how many, the point cloud should be projected onto, as the view-based methods do.

Methods operating directly on raw point cloud data typically contain three main components: feature extraction, aggregation, and a classifier (see Figure 1). PointNet [14] is the first work which consumes raw point clouds directly without using any volumetric or view-based data representations. PointNet uses a multi-layer perceptron (mlp) to extract features from point sets, and instances of Spatial Transformer Networks (STN) [6] in order to make it robust to certain transformations. Furthermore, a max pooling layer removes the dependence on point ordering and aggregates information from the points. Classification is done via a fully connected layer at the end.

Recently, a few methods have been introduced which improve the performance of the feature extraction module of PointNet [14]. PointNet++ [16] extracts richer features which also consider local information. A hierarchical neural network is proposed which applies PointNet [14] recursively on partitions of point sets, giving it the ability to learn local features from a range of contextual scales. [23], on the other hand, proposes an Edge Convolution (EdgeConv) operation in order to capture local structures by constructing a local neighborhood graph and applying convolution-like operations on the edges connecting neighboring pairs of points. [12] introduces a permutation invariant network, called Self-Organization (SO-Net) network, which utilizes a spatial distribution of point clouds. SO-Net also uses a point cloud autoencoder as pre-training to enhance the performance.

Other methods instead improve the aggregation module of PointNet [14]. [22] proposes a structure similar to PointNet++ [16] but replaces the mini-pointnet with a spectral convolution operation. To achieve this, a new pooling method, called recursive cluster pooling, is introduced. This pooling strategy aggregates information in the spectral domain from clusters that are related to each other. [21] proposes a new pooling method based on the NetVlad [2] idea. There, unordered point sets are mapped, based on clustering, to a new space that is invariant to permutation, enabling a richer feature representation than, e.g., max pooling.

Figure 2: Two examples of feature extraction: PointNet [14] (left) and EdgeConv (right). In PointNet [14], the output of the feature extraction module is only related to the point itself, whereas the EdgeConv output depends on the point itself as well as points in its neighborhood.

To the best of our knowledge, classification of the representations in these architectures has been limited to using a fully connected layer. Here, we propose an alternative, 3DCapsule, an extension of Capsule Networks [18]. Capsules were originally introduced as an alternative to the convolutional neural network (CNN), which is the most common network in deep learning. Capsules can learn a representation of an image that is more robust to spatial and pose variations than a typical CNN thanks to using a vector output instead of a scalar one, and by replacing the max-pooling operation with a dynamic routing algorithm. Although Capsules have some advantages over CNNs, as demonstrated by the encouraging results on 2D image classification [18, 7] and segmentation [10], there is currently no work proposed based on the Capsule Network for 3D object classification. Therefore, in this paper, we propose an extension of Capsule Networks such that it becomes applicable to 3D point cloud data, and investigate the utility for the purpose of 3D object classification.

3 Method

In this section, we describe our contribution named 3DCapsule, which is an essential building block in our proposed method. To put 3DCapsule into context, we first briefly describe a typical 3D object classification architecture operating on point sets from point clouds. This includes a feature extraction stage and an aggregation stage, the latter of which should be invariant to any permutation of the points in the point set. Once these typical elements are explained, we move on to describing our proposed method, 3DCapsule, in detail.

3.1 Feature extraction and aggregation

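Suppose an unordered point set $\{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ where $\mathbf{x}_i \in \mathbb{R}^{d}$. A set function needs to be defined that is invariant to any permutation of the point set,

$$\gamma(\{\mathbf{x}_1, \dots, \mathbf{x}_N\}) = g\big(h(\Theta_1), \dots, h(\Theta_N)\big) \tag{1}$$

where $\gamma$ is the set function, $h$ is the feature extraction function, $g$ is the aggregation function which needs to be invariant to permutation of points, and $\Theta_i$ represents the set of arguments associated with $h$ for point $\mathbf{x}_i$.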
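The structure of such a set function can be illustrated with a minimal sketch. The per-point function `h` below is a hypothetical stand-in for a learned shared mlp, and `g_max` is element-wise max pooling; neither is the network used in the paper. The sketch only shows that composing a per-point function with a symmetric aggregation gives the same result for any ordering of the points.

```python
import random

def h(point):
    """Toy per-point feature extractor (hypothetical stand-in for a shared mlp)."""
    x, y, z = point
    return [x + y, y * z, x - z, x * x + y * y + z * z]

def g_max(features):
    """Symmetric aggregation: element-wise max over all per-point features."""
    return [max(column) for column in zip(*features)]

def set_function(points):
    """Apply h to every point, then aggregate with the symmetric function g_max."""
    return g_max([h(p) for p in points])

points = [(0.1, 0.2, 0.3), (-0.5, 0.4, 0.0), (0.9, -0.7, 0.2)]
shuffled = points[:]
random.Random(0).shuffle(shuffled)

# The aggregated feature vector does not depend on the point order.
assert set_function(points) == set_function(shuffled)
```

Because `h` is applied to each point separately and `max` ignores order, any permutation of the input yields the same feature vector.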

The feature extraction module maps raw point sets to a new, in most cases high dimensional, space where each point (or region around a point) has a richer representation. One choice of the feature extraction function $h$, which is used in PointNet [14], is $h(\mathbf{x}_i)$. It is there implemented using a shared mlp layer, where the mlp is applied separately to each point without considering any local neighborhood. This results in extracting features carrying global information only. Another choice of $h$, proposed by [23] and called EdgeConv, is $h(\mathbf{x}_i, \mathbf{x}_j - \mathbf{x}_i)$, where $\mathbf{x}_j$ is a neighbor of $\mathbf{x}_i$. In this case, point sets are represented by a directed graph and edge features based on $k$-nearest neighbors are calculated. EdgeConv [23] is applied to edges, extracting features which encode both global and local information from each point based on its neighborhood (see Figure 2).
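To make the difference between the two choices concrete, the following sketch builds the inputs an EdgeConv-style function would receive: for each point, the point itself paired with the offset to each of its nearest neighbors. The helper names `knn_indices` and `edge_inputs` are hypothetical, the brute-force neighbor search is chosen only for readability, and the learned mlp that would consume these pairs is omitted.

```python
import math

def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding i itself."""
    dists = [(math.dist(points[i], p), j) for j, p in enumerate(points) if j != i]
    return [j for _, j in sorted(dists)[:k]]

def edge_inputs(points, k):
    """For each point x_i, build the pairs (x_i, x_j - x_i) over its k nearest neighbours."""
    all_pairs = []
    for i, xi in enumerate(points):
        pairs = []
        for j in knn_indices(points, i, k):
            offset = tuple(b - a for a, b in zip(xi, points[j]))
            pairs.append((xi, offset))
        all_pairs.append(pairs)
    return all_pairs

cloud = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (5, 5, 5)]
pairs_for_first_point = edge_inputs(cloud, k=2)[0]
```

A PointNet-style extractor would see only `xi`; the offset term is what carries the local neighborhood information.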

Point sets resulting from a typical 3D scanning device, such as a LIDAR, are inherently unordered, although their spatial distribution is still invariant to permutation. Hence, we need a function that can aggregate per-point features into a feature vector that is independent of any ordering of the points in the point set. A simple and effective choice of this aggregation function, $g$, is the max pooling operation, which is used in PointNet [14] and is invariant to permutation. It is also possible to use aggregation methods based on clustering, such as NetVlad [2]. In NetVlad, a few initial point centers are selected via k-means [3] clustering and residual distances between each center and its neighborhood points are calculated. As proven in [21], this renders a representation invariant to permutation (see Figure 3).
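The idea of aggregating residuals to cluster centers can be sketched with a simplified VLAD-style function. This is not NetVlad [2] itself: it assumes fixed centers and hard nearest-center assignment, whereas NetVlad learns its centers and uses soft assignment. The function name `vlad_aggregate` is hypothetical.

```python
import math

def vlad_aggregate(features, centers):
    """Sum, per center, the residuals (feature - center) of the features assigned to it."""
    dim = len(centers[0])
    residual_sums = [[0.0] * dim for _ in centers]
    for feat in features:
        # Hard assignment: pick the nearest center (a simplification of NetVlad).
        nearest = min(range(len(centers)), key=lambda c: math.dist(feat, centers[c]))
        for d in range(dim):
            residual_sums[nearest][d] += feat[d] - centers[nearest][d]
    # Flatten to one descriptor of size len(centers) * dim.
    return [value for row in residual_sums for value in row]

features = [(1.0, 1.0), (2.0, 0.0), (9.0, 9.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
descriptor = vlad_aggregate(features, centers)
```

Since each residual is added to the slot of its own center, reordering the input features leaves the descriptor unchanged (up to floating-point summation order), and the descriptor length grows with the number of centers.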

Figure 3: Two examples of aggregation methods: Maxpooling (left) and NetVlad [2] (right) in a two dimensional space. In Maxpooling, the maximum point is selected in each dimension, so it derives a point which is invariant to permutation. NetVlad, based on predefined centers, computes the residuals between each center and all the points in its neighborhood rendering an invariant representation of the point set.

Finally, via a collection of feature extraction functions $h$, corresponding values of the set function can be computed to form a vector $\mathbf{f} = [f_1, \dots, f_M]$ where $\mathbf{f} \in \mathbb{R}^{M}$, which represents a feature vector of an input point set. This feature vector is invariant to any permutation of the input point set.

3.2 3DCapsule

Here, we introduce our contribution called 3DCapsule. The derivation will follow the notation shown in Figure 4 and [18].

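Suppose we have $K$ primary capsules, each of dimension $d_p$, and $C$ 3DCaps, each of dimension $d_c$. ComposeCaps is, here, a fully connected layer mapping the feature vector $\mathbf{f}$ to primary capsules:

$$\mathbf{P} = \sigma(\mathbf{W}_c \mathbf{f} + \mathbf{b}_c) \tag{2}$$

where $\mathbf{W}_c$ is the weight matrix of ComposeCaps, $\mathbf{b}_c$ is its bias, $\sigma$ is a sigmoid activation function, and $\mathbf{P} \in \mathbb{R}^{K d_p}$ contains the collection of all newly composed primary capsules as a vector.

The new capsules are then retrieved by splitting $\mathbf{P}$ as follows,

$$\mathbf{P} = [\mathbf{p}_1, \mathbf{p}_2, \dots, \mathbf{p}_K], \quad \mathbf{p}_i \in \mathbb{R}^{d_p}$$

where $\mathbf{p}_i$ is a capsule at the primary capsule layer.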
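The ComposeCaps mapping described above, a fully connected layer with a sigmoid followed by a split into capsules, can be sketched in a few lines. The function name `compose_caps` and the toy sizes are assumptions made for illustration; the weights here are placeholders rather than learned values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def compose_caps(f, W, b, num_caps, caps_dim):
    """Fully connected layer with sigmoid, then split into num_caps capsules of size caps_dim."""
    assert len(W) == len(b) == num_caps * caps_dim
    # One output unit per row of W: sigmoid(row . f + bias).
    P = [sigmoid(sum(w * x for w, x in zip(row, f)) + bias) for row, bias in zip(W, b)]
    # Consecutive chunks of P form the primary capsules.
    return [P[i * caps_dim:(i + 1) * caps_dim] for i in range(num_caps)]

# Toy example: a 2-dimensional feature vector mapped to 3 capsules of dimension 2.
f = [1.0, 2.0]
W = [[0.0, 0.0] for _ in range(6)]  # placeholder weights
b = [0.0] * 6
capsules = compose_caps(f, W, b, num_caps=3, caps_dim=2)
```

The layer width must equal the number of primary capsules times their dimension, which is why the split is a plain reshape of the output vector.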

Figure 4: The 3DCapsule Architecture. The network takes a set of input points. Feature extraction and aggregation create the feature vector f, which is input to the 3DCapsule. The 3DCapsule has three main parts: ComposeCaps maps the feature vector to PrimaryCaps, which in turn creates the primary capsules. The routing-by-agreement algorithm is used to couple the PrimaryCaps to the 3DCaps. The reconstruction block attempts to reconstruct the input data and has a corresponding reconstruction loss.

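A squashing function is defined which maps the length of a capsule vector to the range $[0, 1)$,

$$\mathbf{v} = \frac{\|\mathbf{s}\|^2}{1 + \|\mathbf{s}\|^2} \frac{\mathbf{s}}{\|\mathbf{s}\|} \tag{3}$$

where $\mathbf{v}$ is the output vector of a capsule and $\mathbf{s}$ is its input. So, in order to find the output of each primary capsule, the above squashing function is applied to each capsule $\mathbf{p}_i$ obtained from ComposeCaps,

$$\mathbf{u}_i = \frac{\|\mathbf{p}_i\|^2}{1 + \|\mathbf{p}_i\|^2} \frac{\mathbf{p}_i}{\|\mathbf{p}_i\|} \tag{4}$$

where $\mathbf{u}_i$ is the output vector of a primary capsule.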
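As a quick numerical illustration, the squashing function of [18] can be written as below. The small `eps` term is an assumption added here to avoid division by zero for an all-zero input; it is not part of the formulation in the text.

```python
import math

def squash(s, eps=1e-9):
    """Scale vector s so its length lies in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in s)
    norm = math.sqrt(norm_sq)
    # eps guards against division by zero (an assumption of this sketch).
    scale = norm_sq / (1.0 + norm_sq) / (norm + eps)
    return [scale * x for x in s]

v = squash([3.0, 4.0])                      # input length 5
length = math.sqrt(sum(x * x for x in v))   # close to 25 / 26
```

Short vectors are shrunk towards zero and long vectors approach unit length, so the length of a capsule output can be read as a confidence-like quantity.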

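The prediction vectors are obtained by

$$\hat{\mathbf{u}}_{j|i} = \mathbf{W}_{ij} \mathbf{u}_i \tag{5}$$

where $\mathbf{W}_{ij}$ is the weight matrix that maps the output $\mathbf{u}_i$ of primary capsule $i$ to the space of capsule $j$ in the 3DCaps layer, and the input to the 3DCaps layer is a weighted sum over all prediction vectors from the capsules in the primary capsule layer

$$\mathbf{s}_j = \sum_i c_{ij} \hat{\mathbf{u}}_{j|i} \tag{6}$$

where $c_{ij}$ are coupling coefficients that are determined by the routing algorithm in an iterative loop

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})} \tag{7}$$

where the initial logits $b_{ij}$ are the log prior probabilities that capsule $i$ should be coupled to capsule $j$.

Finally, the output of 3DCaps is determined by the same routing algorithm as in the original Capsule formulation [18] (see Algorithm 1).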

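procedure Routing($\hat{\mathbf{u}}_{j|i}$, $r$)
    for all capsule $i$ in PrimaryCaps layer and capsule $j$ in 3DCaps layer: $b_{ij} \leftarrow 0$
    for $r$ iterations do
        for all capsule $i$ in PrimaryCaps layer: $\mathbf{c}_i \leftarrow \mathrm{softmax}(\mathbf{b}_i)$, eq 7
        for all capsule $j$ in 3DCaps layer: $\mathbf{s}_j \leftarrow \sum_i c_{ij} \hat{\mathbf{u}}_{j|i}$
        for all capsule $j$ in 3DCaps layer: $\mathbf{v}_j \leftarrow \mathrm{squash}(\mathbf{s}_j)$, eq 3
        for all capsule $i$ in PrimaryCaps and capsule $j$ in 3DCaps layer: $b_{ij} \leftarrow b_{ij} + \hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j$
    return $\mathbf{v}_j$
Algorithm 1 Routing Algorithm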
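The routing procedure can be sketched in plain Python as follows. This is an illustrative, unoptimized version of routing-by-agreement as described in [18], not the authors' TensorFlow implementation; the nested-list layout `u_hat[i][j]` (prediction vector from primary capsule i for output capsule j) and the `eps` guard in `squash` are assumptions of the sketch.

```python
import math

def squash(s, eps=1e-9):
    norm_sq = sum(x * x for x in s)
    return [norm_sq / (1.0 + norm_sq) / (math.sqrt(norm_sq) + eps) * x for x in s]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def routing(u_hat, r):
    """Routing-by-agreement over r >= 1 iterations; returns one output vector per output capsule."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]          # routing logits, initialised to zero
    v = []
    for _ in range(r):
        c = [softmax(row) for row in b]               # coupling coefficients per input capsule
        s = [[sum(c[i][j] * u_hat[i][j][d] for i in range(n_in)) for d in range(dim)]
             for j in range(n_out)]                   # weighted sum of prediction vectors
        v = [squash(s_j) for s_j in s]                # output capsule vectors
        for i in range(n_in):
            for j in range(n_out):
                # Agreement: dot product between the prediction and the current output.
                b[i][j] += sum(p * q for p, q in zip(u_hat[i][j], v[j]))
    return v

# Two input capsules that both "vote" for output capsule 0 and not for capsule 1.
u_hat = [[[1.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 0.0]]]
outputs = routing(u_hat, r=3)
```

In this toy case the agreeing votes increase the logits towards output capsule 0 with every iteration, so its output vector grows in length while the other stays at zero.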

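The loss function used in the 3DCapsule is the sum of the margin loss and the reconstruction loss,

$$L = \sum_k L_k + L_{rec}, \qquad L_k = T_k \max(0, m^{+} - \|\mathbf{v}_k\|)^2 + \lambda (1 - T_k) \max(0, \|\mathbf{v}_k\| - m^{-})^2$$

where $T_k$ is equal to 1 if class $k$ is the correct class, otherwise 0, and $\mathbf{v}_k$ is the output vector of the 3DCaps capsule for class $k$. The margin loss has the parameters $m^{+}$, $m^{-}$, and $\lambda$. The total margin loss is calculated by summing the individual margin loss for all classes. The reconstruction loss $L_{rec}$ is computed over the $N$ points in the set by comparing each point $\mathbf{x}_i$ with its reconstruction $\hat{\mathbf{x}}_i$.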
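The margin part of the loss can be sketched as below. The default values `m_plus=0.9`, `m_minus=0.1`, and `lam=0.5` are the ones reported in [18] and are used here only as an assumption; the values used for the 3DCapsule are not reproduced in this sketch, and the reconstruction term is left out.

```python
import math

def margin_loss(v, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Sum of per-class margin losses; v[k] is the output vector of class capsule k."""
    total = 0.0
    for k, v_k in enumerate(v):
        length = math.sqrt(sum(x * x for x in v_k))
        t_k = 1.0 if k == target else 0.0
        # Penalise a short vector for the true class ...
        total += t_k * max(0.0, m_plus - length) ** 2
        # ... and a long vector for every other class (down-weighted by lam).
        total += lam * (1.0 - t_k) * max(0.0, length - m_minus) ** 2
    return total

capsule_outputs = [[1.0, 0.0], [0.0, 0.0]]
loss_when_right = margin_loss(capsule_outputs, target=0)
loss_when_wrong = margin_loss(capsule_outputs, target=1)
```

With these toy outputs the loss is zero when class 0 is the target and positive when class 1 is the target, since the long vector then belongs to the wrong class and the true class capsule is too short.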

4 Experiments

In this section, we provide an ablation study to investigate the effect of the 3DCapsule using different feature extractors and aggregation modules. The investigation includes accuracy measures as well as robustness with respect to corruptions such as outlier points and Gaussian perturbation noise. We also evaluate the effect of ComposeCaps and reconstruction loss within the 3DCapsule.

4.1 Network Architecture and Training

In all our experiments, we used the following parameters during network training. The Adam optimizer [8] was used with an initial learning rate of 0.001, which was divided by 2 every 20 epochs, and a batch size of 16. ReLU and Batch Normalization (BN) were also used, except in the 3DCapsule part. The network was implemented using TensorFlow [1] and executed on an NVIDIA GTX 1080 Ti.
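The step decay of the learning rate described above can be written as a small function. Whether the first halving happens at epoch 20 with zero-based epoch counting is an assumption of this sketch, and the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=0.001, decay_every=20, factor=0.5):
    """Learning rate after halving base_lr once every decay_every epochs (epochs start at 0)."""
    return base_lr * factor ** (epoch // decay_every)

schedule = [learning_rate(e) for e in (0, 19, 20, 40, 60)]
```

Under this assumption the rate stays at 0.001 for epochs 0 to 19, drops to 0.0005 at epoch 20, and to 0.00025 at epoch 40.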

We used feature extraction modules from two different architectures (PointNet [14] and EdgeConv [23]). In the case of the PointNet structure, we employed a spatial transformer network of size $3 \times 3$, followed by two shared mlp layers (64,64), followed by the second spatial transformer network of size $64 \times 64$, ending with three mlp layers (64,128,1024). For the EdgeConv-based feature extraction module, the first EdgeConv block used three shared mlp layers (64,64,64) and the second EdgeConv consisted of a shared mlp of size (128). Finally, a shared mlp of size (1024) was used to concatenate all features together.

In terms of aggregation modules, we also used two different approaches (max pooling and NetVlad [2]). In the case of max pooling, we got a feature vector of size 1024. Using NetVlad to retrieve a point feature vector, the number of cluster centers was set to 128 and the last mlp layer was set to size 128. In order to be compatible with NetVlad, the last mlp layer of the PointNet and EdgeConv feature extractors was replaced with an mlp layer of size 128, which determines the size of the feature vector output in the case of NetVlad.

Finally, as the classifier, we again used two different approaches (a fully connected classifier as the baseline, and our contribution, 3DCapsule). In the case of the fully connected classifier, we used three fully connected layers of sizes (512,256,#classes), with dropout [19] in the first two layers. When investigating our 3DCapsule, we used a fully connected layer of size 4000 as our ComposeCaps layer to map the feature vector output to the primary capsule input. 500 primary capsules with a dimension of 8 were used (500 × 8 = 4000). In the 3DCaps layer, each capsule was of dimension 4.

4.2 Datasets

In our experiments, we used two versions of the ModelNet dataset [24], ModelNet10 and ModelNet40. ModelNet40 contains 12,311 CAD models from 40 different classes, which were divided into 9,843 training samples and 2,468 testing samples. ModelNet10 was split into 3,991 training samples and 908 testing samples. We followed the same experimental settings as in [14]. In order to have a fair comparison, the prepared ModelNet 10/40 dataset from [16] is used, where each model is represented by a set of points.

| Feature extraction | Aggregation | Classifier | Accuracy (%) |
|---|---|---|---|
| PointNet [14] | Maxpooling | FC | 89.2 |
| PointNet [14] | Maxpooling | 3DCapsule | 89.9 |
| PointNet [14] | NetVlad [2] | FC | 87.2 |
| PointNet [14] | NetVlad [2] | 3DCapsule | 90.6 |
| EdgeConv [23] | Maxpooling | FC | 92.2 |
| EdgeConv [23] | Maxpooling | 3DCapsule | 92.4 |
| EdgeConv [23] | NetVlad [2] | FC | 91.2 |
| EdgeConv [23] | NetVlad [2] | 3DCapsule | 92.7 |

Table 1: Ablation study on ModelNet40. We evaluate the effect of the 3DCapsule in comparison to the fully connected classifier baseline with various combinations of feature extraction and aggregation modules.

| Method | Classifier | Representation | Accuracy (%) |
|---|---|---|---|
| PointNet++ [16] | FC | pc | 90.7 |
| PointNet++ [16] | 3DCapsule | pc | 91.5 |
| PointNet++ [16] | FC | pc + normal | 91.9 |
| PointNet++ [16] | 3DCapsule | pc + normal | 92.4 |

Table 2: Ablation study on ModelNet40. We evaluate the effect of 3DCapsule on PointNet++ [16] with different point representations.

4.3 Ablation study of the 3DCapsule

In this section, we evaluate the effect of the 3DCapsule with respect to different feature extractors and aggregation modules. The baseline classifier used for comparison is a fully connected layer (fc), as that is what is employed in comparable works operating on raw point clouds. Furthermore, we use two feature extraction modules based on PointNet [14] and EdgeConv [23], and two aggregation approaches, max pooling and NetVlad [2].

From Table 1, it is clear that regardless of which feature extraction method and/or aggregation method is utilized, our 3DCapsule improves over using a regular fully connected layer as a classifier. The biggest improvements over a fully connected classifier are observed when it is combined with the aggregator NetVlad (3.4 and 1.5 percentage points, compared to 0.7 and 0.2 with max pooling). This is natural, as NetVlad outputs a much richer representation of the data than max pooling does. A fully connected classifier may not be able to make use of this richer information, whereas our 3DCapsule can, because it takes advantage of the routing algorithm and the capsule vector representation.

In Table 2, we demonstrate that our method also benefits PointNet++ [16], which further shows the applicability to commonly used point cloud classification architectures. PointNet++ is capable of taking advantage of a richer point representation with additional surface normal features.

4.4 Effect of ComposeCaps

In this section, we evaluate the effect of ComposeCaps in our proposed method. We, again, test the effect using the two feature extraction operators PointNet [14] and EdgeConv [23] as well as the two aggregation methods, Maxpooling and NetVlad [2]. The result is shown in Table 3. It can be observed that the ComposeCaps layer does indeed improve the performance in all combinations, demonstrating the utility of ComposeCaps.

| Method | ComposeCaps | Accuracy (%) |
|---|---|---|
| PointNet [14] + Maxpooling | No | 89.4 |
| PointNet [14] + Maxpooling | Yes | 89.9 |
| PointNet [14] + NetVlad [2] | No | 89.9 |
| PointNet [14] + NetVlad [2] | Yes | 90.6 |
| EdgeConv [23] + Maxpooling | No | 92.3 |
| EdgeConv [23] + Maxpooling | Yes | 92.4 |
| EdgeConv [23] + NetVlad [2] | No | 92.4 |
| EdgeConv [23] + NetVlad [2] | Yes | 92.7 |

Table 3: The effect of the capsule mapping layer ComposeCaps inside the 3DCapsule.

4.5 Effect of Reconstruction Loss

In this section, we evaluate the effect of reconstruction loss in the 3DCapsule. As shown in Table 4, the reconstruction loss is crucial in all structures used in this paper: removing it lowers accuracy by 2.2 to 3.2 percentage points.

| Reconstruction loss | PointNet + Max | PointNet + NetVlad | EdgeConv + Max | EdgeConv + NetVlad |
|---|---|---|---|---|
| No | 86.7 | 87.4 | 90.2 | 90.4 |
| Yes | 89.9 | 90.6 | 92.4 | 92.7 |

Table 4: The effect of the reconstruction loss function inside the 3DCapsule.

4.6 Robustness to Noise

In this section, we evaluate the robustness of our proposed method, 3DCapsule, with respect to two different types of data corruption: 1) outlier points, and 2) point perturbation. This robustness test was carried out in a similar way as the ablation study in Section 4.3, with the only difference being that the different network combinations were here subjected to varying levels of training and testing noise.

Figure 5: Robustness test with respect to outlier points. The bottom legend applies to all graphs, the x-axis of each graph represents the number of outlier points in the test set, and each graph is an experiment with a different number of outlier points in the training set (0, 1, 2, 5, 10, 20, 50, and 100).

Input points were normalized to the unit sphere in this experiment. When creating the training data, both distorted and undistorted data were added together, enabling the networks to better learn how to deal with distorted data without unnecessarily reducing performance when exposed to undistorted data.
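One common way to normalize a point cloud to the unit sphere is sketched below. Centering at the centroid before scaling is an assumption of this sketch; the text only states that the points were normalized to the unit sphere, and the function name is hypothetical.

```python
import math

def normalize_to_unit_sphere(points):
    """Center the points at their centroid and scale so the farthest point has norm 1."""
    n = len(points)
    centroid = [sum(p[d] for p in points) / n for d in range(3)]
    centered = [[p[d] - centroid[d] for d in range(3)] for p in points]
    radius = max(math.sqrt(sum(c * c for c in p)) for p in centered)
    if radius == 0.0:
        return centered  # all points coincide; nothing to scale
    return [[c / radius for c in p] for p in centered]

cloud = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 2.0)]
normalized = normalize_to_unit_sphere(cloud)
```

Normalizing first gives the noise levels a fixed meaning relative to the object size, so that a given standard deviation or outlier range affects every model comparably.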

Outlier Test: In the outlier experiment, a varying number of the original points (0, 1, 2, 5, 10, 20, 50, 100) were replaced by outlier points, both in the training and the testing samples. The result is shown in Figure 5. It is evident that a 3DCapsule-enabled architecture is significantly more robust against outlier points in almost every combination of training and testing noise levels, compared to the baseline (fully connected classifier), irrespective of which feature extractor and aggregator is used. Furthermore, it can be seen that NetVlad [2] generally achieves better performance than Maxpooling, and that the gap tends to be greater when combined with 3DCapsule, as expected.

Perturbation Test: In the point perturbation test, Gaussian noise was added to each point independently. The standard deviation of the added noise ranged from 0.00 to 0.10, and was varied in steps of 0.02 for the training set. The result is shown in Figure 6. It is evident that, again, a 3DCapsule-enabled architecture is significantly more robust, also in the case of Gaussian perturbation noise, compared to the baseline that instead utilises a fully connected classifier. It can also be seen in the graphs that the fully connected classifier is not good at making use of the richer representation that NetVlad [2] provides, whereas 3DCapsule can unlock the information provided by the more powerful NetVlad aggregator.
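The two corruption types used in this section can be sketched as follows. The outlier distribution (uniform in the cube from -1 to 1) and the function names are assumptions of this sketch; the text specifies only that original points are replaced by outliers and that Gaussian noise is added to each point independently.

```python
import random

def replace_with_outliers(points, num_outliers, rng, low=-1.0, high=1.0):
    """Replace num_outliers randomly chosen points with uniform random points (assumed range)."""
    corrupted = [tuple(p) for p in points]
    for idx in rng.sample(range(len(points)), num_outliers):
        corrupted[idx] = tuple(rng.uniform(low, high) for _ in range(3))
    return corrupted

def perturb(points, std, rng):
    """Add independent zero-mean Gaussian noise with the given std to every coordinate."""
    return [tuple(c + rng.gauss(0.0, std) for c in p) for p in points]

rng = random.Random(0)
cloud = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.5, 0.0), (0.0, 0.0, 0.5)]
with_outliers = replace_with_outliers(cloud, 1, rng)
with_noise = perturb(cloud, 0.02, rng)
```

Replacing points rather than appending them keeps the number of points per cloud constant, so the same network input size can be used at every noise level.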

Figure 6: Robustness test with respect to perturbed points. The bottom legend applies to all graphs, the x-axis of each graph represents the magnitude of the Gaussian noise, in terms of standard deviation, that was added to the test set, and each graph is an experiment with a different magnitude of Gaussian noise in the training set (std = 0.00, 0.02, 0.04, 0.06, 0.08, and 0.10).

4.7 Absolute Classification Performance

In this paper, we do not explicitly set out to construct a new architecture that will achieve state-of-the-art classification performance, but rather contribute with the design of a new component, 3DCapsule, that can be drop-in inserted into many other architectures. However, for completeness, we here compare our proposed method to state-of-the-art classification architectures operating on raw, unordered, point clouds that are unaffected by any additional noise.

We chose to compare the best combination of feature extractor and aggregator together with our 3DCapsule that we had found in the experiments in the previous sections. That is, EdgeConv plus NetVlad were used as feature extraction and aggregation modules respectively, together with our 3DCapsule. The comparison was made using the ModelNet40/ModelNet10 3D object recognition benchmark, and the results are shown in Table 5.

The lower part of the table shows those methods that are apples-to-apples comparable with ours. That is, they use the same input data representation that we do (3D points only, without any additional information such as surface normals), and they also use the exact same number of input points, 1024. It is evident that in an apples-to-apples comparison where we have applied our 3DCapsule, we outperform the current state of the art. In fact, our method is only outperformed by SO-Net [12] when they are using significantly more points than we do (5000 vs 1024) as well as using surface normals.

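Top part (restrictions on the input data relaxed):

| Method | Representation | ModelNet40 | ModelNet10 |
|---|---|---|---|
|  | pc | 90.9 | 93.9 |
| Kd-Net [9] | pc | 91.8 | 93.5 |
| SpiderCNN [25] | pc + normal | 92.4 | - |
| Local Spectral [22] | pc + normal | 92.1 | - |
| PointNet++ [16] | pc + normal | 91.9 | - |
| SO-Net [12] | pc + normal | 93.4 | 95.7 |

Bottom part (same type and size of input as ours):

| Method | Representation | ModelNet40 | ModelNet10 |
|---|---|---|---|
| PointNet [14] | pc | 89.2 | - |
| PointNet++ [16] | pc | 90.6 | - |
| EdgeConv [23] | pc | 92.2 | - |
| Local Spectral [22] | pc | 91.5 | - |
| Ours (EdgeConv + NetVlad + 3DCapsule) | pc | 92.7 | 94.7 |

Table 5: Classification results on ModelNet40/ModelNet10. The bottom part represents an apples-to-apples comparison with ours using the same type and size of input, and the top part is the case when the restrictions on the input data are relaxed. Here, pc stands for point cloud.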

4.8 Time and Space Complexity

It is clear from Table 6 that the 3DCapsule adds additional complexity to the architectures to which it is applied. However, we believe that the benefit of significant increase in robustness to various kinds of noise (as shown in Figure 5 and Figure 6) outweighs these drawbacks. The performance on classification, without added noise, is also increased to a degree, which demonstrates the benefit of 3DCapsule in comparison to the common fully connected classifier, albeit at the cost of an increase in memory usage and inference time.

| Method | Size (MB) | Forward (ms) | Train (h) |
|---|---|---|---|
| PointNet++ [16] | 12 | 163.2 | 20 |
| SO-Net [12] | 11.5 | 59.6 | 1.5 |
| PointNet [14] | 40 | 25.3 | 3-6 |
| PointNet [14] + 3DCapsule | 71 | 116 | 8 |
| EdgeConv [23] | 21 | 94.6 | - |
| EdgeConv [23] + 3DCapsule | 52 | 154 | 12 |

Table 6: Time and space complexity of point cloud based networks in ModelNet40 classification.

5 Conclusion

In this paper, we introduced the 3DCapsule, which is an extension of the Capsule concept that makes it applicable to unordered point sets. It is intended as an alternative, drop-in replacement, in the case of 3D point cloud classification, for the commonly used fully connected classifier. A key insight is that the original Capsule concept, which operated on 2D images, implicitly relied on the fact that the pixels in such images are ordered and have a spatial relationship. Modern architectures for 3D point set classification typically introduce symmetric functions to achieve invariance to point set permutation; however, this also invalidates the spatial relationship between feature vectors. Therefore, in this work, a ComposeCaps layer was introduced to find a new, meaningful mapping of the feature vectors to capsules that the 3DCapsule network could better exploit.

While the 3DCapsule achieves improved classification performance in an apples-to-apples comparison against state-of-the-art methods on data that is not affected by additional noise, its main advantage is demonstrated on corrupted data, such as data with a varying number of outliers or with Gaussian noise. This is demonstrated against the typical fully connected classifier, irrespective of which feature extractor or aggregator module it is combined with.