
PointCaps: Raw Point Cloud Processing using Capsule Networks with Euclidean Distance Routing

by Dishanika Denipitiyage, et al.

Raw point cloud processing using capsule networks is widely adopted in classification, reconstruction, and segmentation due to its ability to preserve spatial agreement of the input data. However, most existing capsule-based network approaches are computationally heavy and fail at representing the entire point cloud as a single capsule. We address these limitations of existing capsule network based approaches by proposing PointCaps, a novel convolutional capsule architecture with parameter sharing. Along with PointCaps, we propose a novel Euclidean distance routing algorithm and a class-independent latent representation. The latent representation captures physically interpretable geometric parameters of the point cloud, and with dynamic Euclidean routing, PointCaps well-represents the spatial (point-to-part) relationships of points. PointCaps has significantly fewer parameters and requires significantly fewer FLOPs while achieving better reconstruction with comparable classification and segmentation accuracy for raw point clouds compared to state-of-the-art capsule networks.





1 Introduction

Point clouds have been widely adopted in computer vision due to their applications in autonomous driving, augmented reality, robotics, and drones. A large variety of 3D sensors (e.g., LiDARs) used in such applications produce raw point clouds as their default output, requiring no additional processing. Even though raw point clouds are disordered and irregular, they are still a popular choice for 3D processing due to their ability to preserve the geometric information in 3D space without any discretization.

Deep learning based raw point cloud processing has gained wide adoption in object classification, reconstruction, and segmentation. A prominent early attempt at directly processing raw point clouds is PointNet Qi et al. (2017a), which learns a spatial representation of a point cloud and aggregates individual features to generate a global representation. One limitation of this work is that it discards the spatial arrangements of the points in local regions while aggregating features through the pooling operation. Considering spatial arrangements is important, because similar local regions can have distinct spatial arrangements due to permutation invariance. Following the PointNet Qi et al. (2017a) architecture, PointNet++ Qi et al. (2017b) proposed a hierarchical network architecture to combine local features. Furthermore, EdgeConv Wang et al. (2019) and PointCNN Li et al. (2018b) proposed convolutional local feature aggregators based on the neighbourhood graph. The above methods have mainly focused on improving the global feature vector via different local feature extraction methods, predominantly using k-NN-like clustering techniques to represent the point-to-part relationship. A learnable point-to-part relationship, manifested in the form of capsules Sabour et al. (2017), is more powerful in this context. Therefore, we extend capsule networks Sabour et al. (2017) to identify spatial relationships in local regions while considering the feature existence of local regions.

A capsule's ability to learn the spatial relationships of local regions stems from the dynamic routing algorithm, which establishes the mapping between lower-level and higher-level capsules. In other words, a capsule's activity vector is able to represent a specific type of object or object part through this routing agreement. 3D-PointCapsNet Zhao et al. (2019a) is the first architecture to formulate capsules with raw point clouds, generating the latent representation through fully-connected capsules resembling the multi-layer perceptron architecture. This approach is computationally intensive, resulting in longer training and testing times. Furthermore, 3D-PointCapsNet Zhao et al. (2019a) fails at representing the entire point cloud as a single capsule, causing the latent representation to be not linearly separable and requiring a separate SVM for classification. A recent work Wen et al. (2020) uses such a representation, but still suffers from high complexity due to feature aggregation through clustering. Nevertheless, there are two main problems associated with capsule networks: 1) since the logit values in dynamic routing are bounded, the dissimilarity between capsules ranges between -1 and 1; as a result, the similarity gap between dissimilar and similar capsules is reduced. 2) The original capsule network implementation assumes static pixel locations; however, point clouds are irregular.
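To make the bounded-similarity problem concrete, the following toy NumPy sketch (ours, not from the paper) contrasts cosine similarity, which saturates at -1 for all strongly dissimilar capsule vectors, with Euclidean distance, which keeps growing and therefore preserves the gap:

```python
import numpy as np

def cosine_similarity(u, v):
    # Bounded in [-1, 1]: the agreement measure of standard dynamic routing.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    # Unbounded above ([0, inf)): the measure used by Euclidean routing.
    return float(np.linalg.norm(u - v))

a = np.array([1.0, 0.0])
b = np.array([-5.0, 0.0])   # dissimilar to a
c = np.array([-50.0, 0.0])  # far more dissimilar to a

# Cosine similarity saturates: both pairs score exactly -1...
assert cosine_similarity(a, b) == cosine_similarity(a, c) == -1.0
# ...while Euclidean distance still separates them.
assert euclidean_distance(a, c) > euclidean_distance(a, b)
```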

In order to address these limitations, we propose PointCaps: a novel capsule-based auto-encoder architecture with two novel convolutional capsule layers, to capture point-to-part spatial relationships and vice versa. Instead of using a traditional transformation matrix to transform low-dimensional features to high-dimensional features, our approach adapts the 2D convolutional capsule idea to sparse 3D point clouds by creating capsules along the feature axis. Due to parameter sharing in convolutional capsules, there is a significant reduction in the number of parameters in the capsule layer, which reduces the computational complexity while providing better identification of geometric and spatial relationships between the parts. As a solution to the lower similarity gap in dynamic routing, we employ a novel routing algorithm, dynamic Euclidean distance based routing (ER), in multiple capsule layers instead of dynamic routing (DR). This increases the resolution of highly dissimilar capsules to the range between 0 and ∞ instead of -1 to 1. Moreover, we represent the entire point cloud as a single capsule by adopting the approach of Sabour et al. Sabour et al. (2017) and replacing the decoder with the class-independent decoder proposed by Rajasegaran et al. Rajasegaran et al. (2019). PointCaps's ability to compress a single point cloud to a vector of instantiation parameters enables us to explore the robustness of the model to noise while completing classification and reconstruction tasks simultaneously. To recover lost fine-grained spatial information, we introduce a skip connection between the encoder and the decoder. Our contributions are three-fold:

  • We propose a novel capsule auto-encoder architecture to classify, reconstruct, and segment raw point clouds. Further, we propose a novel convolution capsule layer with dynamic Euclidean routing instead of dynamic routing to capture part-whole relationships.

  • To the best of our knowledge, PointCaps is among the first to adapt a class-independent decoder to reconstruct 3D point clouds.

  • We evaluate classification accuracy, reconstruction error, and segmentation accuracy (in terms of mean intersection over union (IoU)) of PointCaps using standard benchmarks, where our approach surpasses the current state-of-the-art reconstruction error and provides comparable performance in point cloud classification and segmentation, despite having fewer parameters and requiring fewer floating-point operations (FLOPs) compared to previous capsule-based architectures.

2 Related Work

Deep learning applications of point clouds include 3D object detection, object classification Qi et al. (2017a, b); Wang et al. (2021); Lee and Kim (2021), reconstruction Chen et al. (2015); Qi et al. (2017a); Cheraghian and Petersson (2019), scene labeling Lai et al. (2014); Qin et al. (2018), segmentation Qi et al. (2017a, b); Maturana and Scherer (2015), point cloud completion Yuan et al. (2018); Wu et al. (2021), layout inference Geiger and Wang (2015), and point cloud registration Wang and Solomon (2019); Zhang et al. (2019); Maiseli et al. (2017). The three main categories of 3D object classification based on the input to the deep learning network are volumetric representation Maturana and Scherer (2015); Wu et al. (2015), view-based Qi et al. (2016) and raw point cloud methods Qi et al. (2017a); Wen et al. (2020). In this paper, we will be focusing on raw 3D point cloud object classification and reconstruction.

Deep networks on point clouds: The capability of processing irregular, unordered point clouds through point-wise convolution and permutation-invariant pooling proposed by PointNet Qi et al. (2017a) paved the way for various point cloud-specific architectures such as PointNet++ Qi et al. (2017b), spherical convolution Lei et al. (2018), Monte-Carlo convolution Hermosilla et al. (2018), graph convolution Wang et al. (2021); Defferrard et al. (2016), and SO-Net Li et al. (2018a). Unlike PointNet, PointNet++ Qi et al. (2017b) aggregates local features into a global feature vector, forming a hierarchical feature learning architecture through farthest point sampling. Thereafter, improved convolution operations Li et al. (2018b) have been proposed to group local region features. SO-Net Li et al. (2018a) proposed self-organizing networks where the spatial distribution of point clouds is used in an auto-encoder architecture to enhance performance. A better upsampling method was introduced in PU-Net Yu et al. (2018), and conversion of a 2D grid into a 3D surface was proposed by FoldingNet Yang et al. (2018). AtlasNet Groueix et al. (2018) is an extension of FoldingNet Yang et al. (2018) which uses multiple data patches. PPF-FoldNet Deng et al. (2018a), which is based on the supervised PPFNet Deng et al. (2018b), uses the FoldingNet decoder Yang et al. (2018) to enhance local feature extraction. However, all of the above methods use a pooling operation to learn global features from local features based on feature existence. We focus on both the existence of features in local regions and their spatial relationship through a capsule-based architecture, and employ a new routing algorithm to aggregate the geometric features and spatial relationships in the local region.

Capsule networks: Hinton et al. Hinton et al. (2011) proposed capsule networks, a novel method to group neurons, which greatly impacted object classification in deep learning. Sabour et al. Sabour et al. (2017) extended this idea by proposing dynamic routing between capsules. The success of capsule networks in object classification translates well into 3D point cloud classification due to their ability to capture spatial relationships through dynamic routing. Moreover, the instantiation parameters available in these networks are capable of capturing various properties (e.g., size, position, and texture) of a particular entity. In view of this, our work focuses on a novel auto-encoder architecture to classify and segment raw point clouds, and achieves minimal reconstruction error using capsule networks.

Several recent works address the use of capsule networks in point cloud classification, reconstruction, and segmentation. 3D-PointCapsNet Zhao et al. (2019a) is the first to devise a capsule network for raw point clouds, where part segmentation is completed in an unsupervised way. However, it fails to model an entire point cloud as a single capsule. Several previous works Zhao et al. (2019b); Cheraghian and Petersson (2019) have proposed supervised capsule architectures for point cloud classification. Cheraghian et al. Cheraghian and Petersson (2019) apply capsule networks as a drop-in replacement for a fully-connected classifier. However, these models are trained in a supervised manner, in contrast to our auto-encoder architecture in PointCaps. Point2SpatialCapsule Wen et al. (2020) uses capsule networks to encode fixed spatial locations into capsules. These capsule network architectures Cheraghian and Petersson (2019); Zhao et al. (2019a); Wen et al. (2020) directly use a classification capsule layer with a fully-connected capsule architecture for feature representation. This, together with the presence of k-NN clustering Wen et al. (2020), leads to high computational complexity. In contrast, the convolutional capsule layers (PointCapA and PointCapB, described in Sec. 3) and the absence of k-NN clustering in PointCaps provide a significant reduction in computational complexity.

3 Method

3D-PointCapsNet Zhao et al. (2019a) proposed an end-to-end trainable auto-encoder architecture for several common point cloud-related tasks. Inspired by the benefits of capsule networks, we propose PointCaps for processing a point cloud by simultaneous classification and reconstruction, and later achieve segmentation. Point cloud processing differs significantly from regular deep-network-based vision tasks due to the irregular and unordered nature of point clouds. The high-level processing pipeline of PointCaps: 1) reduces the size of the original data using convolutional capsule networks, and 2) generates the latent vector representation, followed by reconstructing the point cloud using deconvolutions.

In the following sections, we first describe the overall PointCaps architecture. Second, we describe the different types of capsule layers we employ. Finally, we elaborate the routing mechanisms with Euclidean distance.

3.1 PointCaps Architecture

The proposed point-cloud classifier-reconstructor network comprises an encoder with Euclidean and dynamic routing, and a class-independent decoder. The encoder architecture contains three types of capsule layers to learn spatial and geometric features of irregular, unordered 3D data. We designed the overall architecture as shown in Fig. 1. The input to PointCaps is a 3D point cloud with a fixed number of points, chosen following the work by Zhao et al. Zhao et al. (2019a). Two 1D convolution layers process this input to produce a feature vector for each point. This is followed by a PointCapA capsule layer (Fig. 1: path A), which creates the point-to-part relationships. The second path (Fig. 1: path B) generates point-to-part relationships through a sparse subspace. The direct use of a 2D capsule layer gives a larger parameter space, which increases the use of computational resources. Therefore, we use PointCapA as a dimension reduction technique in the second path and follow it with PointCapC and PointCapB capsule layers to retain the essential properties of parts. Then the parts are regenerated using a PointCapA capsule layer. The two paths, path A and path B, are concatenated. Using the concatenated output, the DigitCap capsule layer generates the latent representation of the point cloud. Note that PointCapB is an upsampling layer, deployed as an intermediate capsule layer after the PointCapA capsule layer, which generates various properties of a given part that is available in the point cloud. Furthermore, the PointCapC layer is a generic convolutional layer with squashed output.

It is important to note that the original argument for dynamic routing in capsule networks Sabour et al. (2017) was the capturing of intrinsic geometric properties of the object. These capsule representations may not always correspond to human-visible part segments in the object. However, we expect to centralize semantically similar regions in the object so that a human can identify them. Furthermore, the coupling coefficient determines the agreement between the current output and the prediction using cosine similarity. As the coupling coefficients and logits are bounded, the gap between highly dissimilar capsules lies within -1 and 1. We increase this gap by making the logits unbounded while keeping the coupling coefficient bounded, using Euclidean distance. This increases the dissimilarity range to between 0 and ∞. We employ the novel PointCapA, which predicts the possible point-to-part representation for each point using the dynamic Euclidean distance routing algorithm (see Sec. 3.2).

[Figure 1 appears here: the architecture diagram showing the input point cloud, 1D Conv. and Batch Norm. (BN) layers, Path A with ER (i=1), Path B with ER (i=3), a reshape, and DR (i=3), Path C with ER (i=3), and a final DR (i=3) stage.]

Figure 1: Model architecture. PointCapA and PointCapB are convolutional capsule layers; PointCapA uses Euclidean routing, and dynamic routing is applied in the other capsules.

3.1.1 PointCapA

The input to the PointCapA layer is a tensor whose dimensions are the number of input capsules and the number of input atoms (the capsule dimension). Initially, each point with its features is considered as an input capsule to the layer, and the layer produces a set of output capsules. The output capsules correspond to different local regions in the point cloud. The activity vector interprets different properties of a local region, such as size, orientation, and texture.

The operation of the PointCapA 1D convolutional capsule is as follows. First, the input is convolved with a set of 1D kernels to form an intermediate feature map. Then the swish activation function Ramachandran et al. (2017) is applied as the pre-activation, and the result is reshaped to generate the vote matrix. Using a 1D convolution and choosing the kernel height as one in the transformation matrix has two advantages: 1) it provides a solution to the order invariance problem, and 2) it allows the network to keep the number of input capsules unchanged. Then we feed the vote matrix to the routing algorithm as described in Sec. 3.2.

Similar to the approach used in Sabour et al. (2017), the transformation matrix learns the part-to-whole relationships between the lower and higher level capsules by updating the logits based on the similarity between the input capsule and the output capsule.
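As an illustration of the convolutional vote generation described above, here is a minimal NumPy sketch (our own, not the authors' code; the layer sizes `N`, `F`, `C_out`, and `A_out` are illustrative assumptions). A 1D convolution with kernel height one is simply a shared linear map applied to every point's feature vector:

```python
import numpy as np

N, F = 8, 16          # number of points, per-point feature length (assumed)
C_out, A_out = 4, 8   # output capsules per point, atoms per capsule (assumed)

rng = np.random.default_rng(0)
features = rng.normal(size=(N, F))            # per-point input features
kernels = rng.normal(size=(F, C_out * A_out)) # shared 1x1 conv weights

def swish(x):
    return x / (1.0 + np.exp(-x))  # swish(x) = x * sigmoid(x)

pre = features @ kernels                        # 1D conv, kernel height one
votes = swish(pre).reshape(N, C_out, A_out)     # vote matrix for routing

# Parameter sharing: the same `kernels` are reused for every point, so the
# parameter count is independent of N, and point order does not matter.
assert votes.shape == (N, C_out, A_out)
```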

3.2 Routing Algorithm

Routing is a standard method used in capsule networks to identify the relevance between a lower-level capsule and an upper-level capsule Sabour et al. (2017). In PointCaps, we employ routing to generate point-to-part relationships. Unlike Sabour et al. Sabour et al. (2017), where the agreement between the current output and the prediction is the dot product of the two quantities and the logits are updated based on this measurement for the next iteration, our novel Euclidean distance routing employs the Euclidean distance to find the relevance between capsule layers; we experimentally show that Euclidean distance provides better performance than cosine similarity.

The operation of the routing algorithm is as follows. The routing algorithm maps a block of capsules in the child layer to the parent capsules. Let the vote tensor (votes) be denoted by $\hat{u}_{j|i}$, the prediction of child capsule $i$ in layer $l$ for parent capsule $j$ in layer $l+1$. Following Sabour et al. (2017), we initialize the logits as $b_{ij} = 0$. Then the corresponding coupling coefficients are generated by applying the softmax function,

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})},$$

on the logits $b_{ij}$. This drives the iterative dynamic routing process. Here the logits are normalized over all the predicted capsules in layer $l+1$, because each single capsule in layer $l$ predicts the outputs for all the capsules in layer $l+1$. Then these predictions are weighted by $c_{ij}$ as

$$s_j = \sum_i c_{ij}\,\hat{u}_{j|i},$$

and $s_j$ is passed to the squash function, given by

$$v_j = \frac{\lVert s_j \rVert^2}{1 + \lVert s_j \rVert^2} \frac{s_j}{\lVert s_j \rVert}.$$

The squash function is a non-linear function that drives a high probability of the existence of an entity to a length of nearly 1 and a low probability to a length of almost 0.

Sabour et al. Sabour et al. (2017) proposed cosine similarity as the agreement between the current output $v_j$ and the prediction $\hat{u}_{j|i}$, with the logits updated based on this similarity measure. We instead use the Euclidean distance between the two quantities and update the logit values using

$$b_{ij} \leftarrow b_{ij} - \lVert v_j - \hat{u}_{j|i} \rVert_2.$$

1:procedure Routing
2:Require: votes $\hat{u}_{j|i}$, number of iterations $r$, and layer $l$
3:      Let $b_{ij} \leftarrow 0$ for all $i$, $j$
4:     for $r$ iterations do
5:         $c_i \leftarrow \mathrm{softmax}(b_i)$ for all capsule $i$ in layer $l$
6:         $s_j \leftarrow \sum_i c_{ij}\,\hat{u}_{j|i}$ for all capsule $j$ in layer $l+1$
7:         $v_j \leftarrow \mathrm{squash}(s_j)$ for all capsule $j$ in layer $l+1$
8:         $b_{ij} \leftarrow b_{ij} - \lVert v_j - \hat{u}_{j|i} \rVert_2$ for all $i$, $j$
9:     end for
10:     return $v_j$
11:end procedure
Algorithm 1 Dynamic Euclidean Routing Algorithm
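The routing procedure above can be sketched in NumPy as follows. This is our reconstruction from the paper's description rather than the authors' code; the only departure from Sabour et al.'s dynamic routing is the logit update, which subtracts an unbounded Euclidean distance instead of adding a bounded dot product:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Shrinks short vectors toward length 0 and long vectors toward length 1.
    norm_sq = np.sum(s * s, axis=axis, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * (s / np.sqrt(norm_sq + eps))

def euclidean_routing(votes, iterations=3):
    # votes: (num_child, num_parent, atoms), the predictions u_hat_{j|i}.
    num_child, num_parent, _ = votes.shape
    logits = np.zeros((num_child, num_parent))
    for _ in range(iterations):
        # Softmax over the parent axis: each child distributes its vote.
        coupling = np.exp(logits)
        coupling /= coupling.sum(axis=1, keepdims=True)
        s = np.einsum('ij,ija->ja', coupling, votes)  # weighted sum s_j
        v = squash(s)                                 # parent outputs v_j
        # Unbounded update: large distance => strongly suppressed route.
        dist = np.linalg.norm(v[None, :, :] - votes, axis=-1)
        logits = logits - dist
    return v

votes = np.random.default_rng(1).normal(size=(6, 3, 4))
out = euclidean_routing(votes)
assert out.shape == (3, 4)
assert np.all(np.linalg.norm(out, axis=-1) < 1.0)  # squash keeps length < 1
```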

3.2.1 PointCapB

PointCapB is a 2D convolutional capsule layer. We use PointCapB in the PointCaps architecture to identify the properties of entities such as length, elongation, and texture. PointCapB operates as follows. The input to PointCapB is a tensor whose dimensions are the number of entities, the number of capsules in the layer, and the capsule dimension. First, the input tensor is reshaped into the standard input format for a 2D convolution. Then the reshaped tensor is convolved with a set of 2D kernels, where the height and width of the input feature map for the 2D convolution correspond to the number of entities and the flattened capsule dimensions, respectively. Maintaining the kernel height as one, together with the chosen kernel width and stride, enables PointCapB to obtain a vote for a single capsule from the layer below. This process generates intermediate votes, where the width of the output feature map follows the standard convolution arithmetic, $W' = (W - k_w)/s + 1$, for input width $W$, kernel width $k_w$, and stride $s$.

The intermediate votes are then reshaped into the vote tensor. We apply the pre-activation swish function Ramachandran et al. (2017). Then the votes are fed to the routing algorithm as proposed by Rajasegaran et al. Rajasegaran et al. (2019).

During routing, the softmax function is applied to the logits (initialized as zeros) to generate the coupling coefficients. Here, we normalize the logits over all the predicted capsules from the capsule tensor in the layer. Each generated prediction is weighted by its coupling coefficient, resulting in a single prediction, to which the squash function is applied. The level of agreement between the squashed output and each prediction is measured using cosine similarity to update the logits for the next iteration of the routing.

3.2.2 PointCapC

Now we describe the architecture of PointCapC. The input to PointCapC is a tensor whose dimensions are the number of entities (or points), the number of capsules, and the capsule dimension. First, the input is reshaped into a matrix by flattening the capsule dimensions, and a 1D convolution is applied. Then the output is reshaped back into a capsule tensor, followed by the squash function to produce the output.
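A minimal NumPy sketch of such a reshape-convolve-squash layer (our illustration; all sizes are assumptions, and the shared linear map stands in for the 1D convolution):

```python
import numpy as np

N, C_in, A_in = 8, 4, 6    # entities, input capsules, input atoms (assumed)
C_out, A_out = 3, 8        # output capsules and atoms (assumed)

rng = np.random.default_rng(2)
x = rng.normal(size=(N, C_in, A_in))                       # input capsules
w = rng.normal(size=(C_in * A_in, C_out * A_out), scale=0.1)

def squash(s, axis=-1, eps=1e-8):
    norm_sq = np.sum(s * s, axis=axis, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * (s / np.sqrt(norm_sq + eps))

flat = x.reshape(N, C_in * A_in)          # flatten capsule dimensions
out = squash((flat @ w).reshape(N, C_out, A_out))  # conv, reshape, squash

# Capsule lengths now lie in (0, 1) and can be read as existence probabilities.
assert out.shape == (N, C_out, A_out)
assert np.all(np.linalg.norm(out, axis=-1) < 1.0)
```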

3.3 Class Independent Decoder with Skip Connection

The decoder network reconstructs the input point cloud using the instantiation vector extracted from the DigitCap in the encoder network. In the original capsule network Sabour et al. (2017), the DigitCap is masked to extract activity vectors, which are then used with three fully-connected layers to reconstruct the input image. During training, the digit capsule output is masked with the true label, and the activity vector of maximum length is used at the inference stage. This vectorization results in a matrix that is zero everywhere except the row corresponding to the true or predicted class, where the matrix dimensions are the number of classes and the classification capsule dimension. Hence, the network gets class information, which indirectly makes the decoder class-dependent. DeepCaps Rajasegaran et al. (2019) has claimed that capsule networks with a class-independent decoder provide better regularization.

To address this issue, we use the class-independent decoder proposed by Rajasegaran et al. Rajasegaran et al. (2019), which provides better regularization in terms of capsule encoding. The decoder network is made class-independent by passing only the activity vector. Here, the masked activity vector corresponds to the true label in the training stage, whereas the predicted label is used for testing. The decoder learns different distributions of different physical parameters irrespective of the class, which makes it class-independent. The network consists of a single fully-connected layer followed by five deconvolution layers. Moreover, the convolution layer from the encoder is skip-connected to the intermediate layer in the decoder, as shown in Fig. 1. Further, the Chamfer distance loss is used as the reconstruction loss, and the input point cloud is recreated at the final layer.
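The difference between the two masking schemes can be sketched as follows (our illustration of the idea; `num_classes`, `cap_dim`, and the chosen label are arbitrary values, not the paper's):

```python
import numpy as np

num_classes, cap_dim = 10, 16
digit_caps = np.random.default_rng(3).normal(size=(num_classes, cap_dim))
label = 7  # true class at training time; argmax of capsule lengths at test time

# Class-dependent masking (original CapsNet): the flattened matrix is zero
# except for one row, so the *position* of that row leaks the class identity.
mask = np.zeros((num_classes, 1))
mask[label] = 1.0
class_dependent = (digit_caps * mask).flatten()   # length num_classes*cap_dim

# Class-independent masking (DeepCaps-style): the decoder receives only the
# selected activity vector itself, with no positional class information.
class_independent = digit_caps[label]             # length cap_dim

assert class_dependent.shape == (num_classes * cap_dim,)
assert class_independent.shape == (cap_dim,)
# The non-zero row of the class-dependent code is exactly the activity vector.
assert np.allclose(class_dependent.reshape(num_classes, cap_dim)[label],
                   class_independent)
```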

Loss function: The total auto-encoder loss is defined as the summation of the classification and reconstruction losses, i.e.,

$$L = L_{cls} + L_{rec},$$

where $L_{cls}$ is the margin loss summed over all classes and $L_{rec}$ is the Chamfer distance loss. For each class $k$ we use the margin loss

$$L_k = T_k \max(0, m^+ - \lVert v_k \rVert)^2 + \lambda (1 - T_k) \max(0, \lVert v_k \rVert - m^-)^2,$$

where $T_k = 1$ if class $k$ is present and zero otherwise, and $m^+$ and $m^-$ are the lower bound for the correct class and the upper bound for the incorrect class, respectively. We use $\lambda$ to reduce the effect of absent classes. The total classification loss is defined as the sum of the losses of all digit capsules.

Similar to other auto-encoders, we use the Chamfer distance loss to measure the similarity between point clouds,

$$L_{rec} = \sum_{x \in S_1} \min_{y \in S_2} \lVert x - y \rVert_2^2 + \sum_{y \in S_2} \min_{x \in S_1} \lVert x - y \rVert_2^2,$$

where $S_1$ and $S_2$ are two point clouds with the same number of points.
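The Chamfer distance can be written directly in NumPy. This is a standard symmetric formulation; the exact normalization (sum versus mean per direction) used by the paper is an assumption here:

```python
import numpy as np

def chamfer_distance(s1, s2):
    # s1, s2: (n, 3) point clouds with the same number of points.
    diff = s1[:, None, :] - s2[None, :, :]   # (n, n, 3) pairwise differences
    d2 = np.sum(diff * diff, axis=-1)        # squared pairwise distances
    # For each point, squared distance to its nearest neighbour in the other
    # cloud, averaged, then summed over both directions.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

cloud = np.random.default_rng(4).normal(size=(128, 3))
# Identical clouds have zero Chamfer distance...
assert chamfer_distance(cloud, cloud) == 0.0
# ...and a small rigid shift yields a small positive value.
assert 0.0 < chamfer_distance(cloud, cloud + 0.01) < 0.01
```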
4 Results and Discussion

Figure 2: Point clouds generated by our decoder. The first row of point clouds is generated using ShapeNet Core13 Chang et al. (2015) and the second row is generated using the ShapeNetPart Yi et al. (2016) dataset. This shows PointCaps's ability to capture geometric properties of the point cloud. For example, we can observe compression along one axis when a single dimension of the instantiation vector of the ShapeNetPart lamp is varied.


We implemented our system using Keras and TensorFlow. For training, we used the RAdam optimizer Liu et al. (2019a) with an initial learning rate that was decayed over several steps. We conducted an extensive evaluation of our network for point cloud classification and reconstruction using four datasets: ModelNet10 Wu et al. (2015), ModelNet40 Wu et al. (2015), ShapeNetPart Yi et al. (2016), and ShapeNet Core13 Chang et al. (2015). For ShapeNetPart and ShapeNet Core13 we used point coordinates, and for ModelNet10 and ModelNet40 we additionally used normal vectors. The model was trained using two GPUs: an Nvidia P100 and an Nvidia V100.

4.1 Quantitative Evaluations

3D Reconstruction: We compare our method with both point-cloud based Deprelle et al. (2019); Zhao et al. (2019a); Sun et al. (2020) and mesh-based Chen et al. (2020); Mescheder et al. (2019); Chen and Zhang (2019) reconstruction methods using the Chamfer Distance metric. We report quantitative results in Table 1. Our method outperforms the state-of-the-art methods on the ShapeNet Core13 Chang et al. (2015) dataset. PointCaps (with 2048 points) surpasses the results of AtlasNet Groueix et al. (2018), 3D-PointCapsNet Zhao et al. (2019a), and Canonical Capsules Sun et al. (2020). It is worthwhile to note that PointCaps is better than all the mesh-based reconstruction methods and achieves a clear improvement over BSPNet Chen et al. (2020).

Method Input CD
OccNet Mescheder et al. (2019) mesh (1511) 2.54
IM-NET Chen and Zhang (2019) mesh (1204) 2.36
BSPNet Chen et al. (2020) mesh (1073) 1.43
AtlasNetV2 Deprelle et al. (2019) pc (2048) 1.22
3D-PointCapsNet Zhao et al. (2019a) pc (2048) 1.49
Canonical Cap Sun et al. (2020) pc (2048) 0.97
PointCaps pc (1024) 0.56
PointCaps pc (2048) 0.25
Table 1: Reconstruction quality on the ShapeNet Core13 Chang et al. (2015) dataset, reported as scaled Chamfer Distance (CD). pc (#) denotes a point cloud with # points and mesh (#) denotes a mesh with # vertices.

Point Cloud Classification: We test our model on the ModelNet10 Wu et al. (2015) and ModelNet40 Wu et al. (2015) datasets and compare it with existing network architectures. As shown in Table 2, PointCaps achieves the second-best performance among capsule-based methods with significantly lower computational cost (see Table 5). Moreover, PointCaps is slightly below the baseline method PointNet++ Qi et al. (2017b), by 0.2% on ModelNet40.

Method Input ModelNet10 ModelNet40
PointNet Qi et al. (2017a) - 89.2%
PointNet++ Qi et al. (2017b) - 91.9%
DGCNN Wang et al. (2019) - 92.2%
SAF-Net Lee and Kim (2021) - 93.4%
Kd-Net Klokov and Lempitsky (2017) 94.0% 91.8%
SO-Net Li et al. (2018a) 94.1% 90.9%
PointCNN Li et al. (2018b) - 91.7%
RS-CNN Liu et al. (2019b) - 93.6%
Grid-CNN Xu et al. (2020) 97.5 % 93.1%
CurveNet Xiang et al. (2021) 96.3% 94.2%
3DCapsule Cheraghian and Petersson (2019) 94.7% 91.5%
P2SCapsule Wen et al. (2020) 95.9% 93.7%
PointCaps 94.7% 91.7%
Table 2: ModelNet40 and ModelNet10 classification accuracy comparison
Dataset Input PointCaps All DR All ER W/o skip connection
Accuracy CD Accuracy CD Accuracy CD Accuracy CD
ShapeNet core13 Chang et al. (2015) 94.12% 0.25 94.12% 0.34 93.84% 0.315 94.08% 17.42
ShapeNet part Yi et al. (2016) 98.33% 0.117 98.43% 0.337 98.29% 0.294 98.15% 4.07
ModelNet10 Wu et al. (2015) 95.13% 1.71 94.59% 1.19 94.69% 2.11 93.90% 9.67
ModelNet40 Wu et al. (2015) 87.6% 0.891 86.52% 1.023 87.5% 0.928 86.6% 18.13
Table 3: Comparison of classification accuracy and Chamfer Distance (CD) error of PointCaps with different routing algorithms on four datasets, where PointCaps provides better reconstruction and comparable accuracy. Here, PointCaps denotes Euclidean routing at PointCapA while dynamic routing (DR) is used in the other capsules (see Fig. 1); All DR uses DR for all capsules, and All ER uses ER for all capsules in the model. The performance of the PointCaps model without the skip connection between encoder and decoder is denoted as W/o skip connection.

[Figure 3 appears here: row 1 shows 3D-PointCapsNet Zhao et al. (2019a); row 2 shows PointCaps (ours).]

Figure 3: Part representation with dynamic routing in 3D-PointCapsNet Zhao et al. (2019a) (row 1) and with Euclidean routing in PointCapA (row 2). 3D-PointCapsNet interprets 32 parts, each having 64 points, whereas PointCaps has 64 parts with varying numbers of points. Note that PointCaps captures spatial relationships to form local regions closer to human-annotated ones compared to 3D-PointCapsNet Zhao et al. (2019a).


Segmentation: In this section we evaluate the part segmentation of PointCaps. We use the ShapeNetPart dataset to train our model. Following the approach of 3D-PointCapsNet, we train two models: 1) using 1% of the dataset (hereafter referred to as the 1% training set) and 2) using 5% of the dataset (hereafter referred to as the 5% training set). We used the complete test set to test our models. We use the same part segmentation evaluation metrics as SO-Net Li et al. (2018a) and 3D-PointCapsNet Zhao et al. (2019a): accuracy and IoU.

Method 1% data 5% Data
Acc IoU Acc IoU
SO-Net Li et al. (2018a) 0.78 0.64 0.84 0.69
3D-PointsCapsNet Zhao et al. (2019a) 0.85 0.67 0.86 0.70
PointCaps 0.85 0.69 0.87 0.72
Table 4: Part segmentation on ShapeNet-Part by learning on only 1% and 5% of the training data.

As seen in Table 4, PointCaps surpasses 3D-PointCapsNet Zhao et al. (2019a) and SO-Net Li et al. (2018a) with respect to accuracy and IoU. PointCaps achieves accuracies of 0.85 and 0.87 for the 1% and 5% training sets, whereas the respective values are 0.78 and 0.84 for SO-Net Li et al. (2018a) and 0.85 and 0.86 for 3D-PointCapsNet Zhao et al. (2019a). We also observe that PointCaps achieves better IoU compared to SO-Net Li et al. (2018a) and 3D-PointCapsNet Zhao et al. (2019a). For the 1% training set, PointCaps achieves an IoU that is 0.05 higher than SO-Net Li et al. (2018a) and 0.02 higher than 3D-PointCapsNet Zhao et al. (2019a). For the 5% training set, PointCaps surpasses SO-Net Li et al. (2018a) and 3D-PointCapsNet Zhao et al. (2019a) by 0.03 and 0.02, respectively. These results show that PointCaps achieves better segmentation quality than 3D-PointCapsNet Zhao et al. (2019a) and SO-Net Li et al. (2018a).

Computational Complexity: In this section, we compare the number of model parameters and the FLOPs of PointCaps for ModelNet40 classification against recent capsule-domain state-of-the-art models. Even though Point2SpatialCapsule outperforms PointCaps in terms of accuracy, PointCaps achieves the second-best performance whilst significantly reducing the number of FLOPs (615 M versus 2251 M for Point2SpatialCapsule Wen et al. (2020)). Moreover, PointCaps has 3.52 million parameters, far fewer than Point2SpatialCapsule Wen et al. (2020) (22.95 M) and 3D-PointCapsNet Zhao et al. (2019a) (69.38 M). Table 5 summarizes the results. It is worthwhile to note that, compared to the backbone structure PointNet++, PointCaps achieves a substantial reduction in the number of FLOPs (615 M versus 3136 M). Overall, PointCaps achieves performance comparable to the state-of-the-art models, with significantly lower computational complexity.

Method                                Params     FLOPs
PointNet Qi et al. (2017a)            3.48 M     957 M
PointNet++ Qi et al. (2017b)          1.99 M     3136 M
3D-PointCapsNet Zhao et al. (2019a)   69.38 M    2231 M
P2SCapsule Wen et al. (2020)          22.95 M    2251 M
PointCaps                             3.52 M     615 M
Table 5: The number of model parameters and FLOPs for ModelNet40 classification.
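The relative savings quoted in this section can be recomputed directly from the values in Table 5; the small script below is purely illustrative.

```python
# Recomputing the relative savings quoted above from Table 5
# (values in millions of parameters / FLOPs).
table5 = {
    "PointNet":        {"params": 3.48,  "flops": 957},
    "PointNet++":      {"params": 1.99,  "flops": 3136},
    "3D-PointCapsNet": {"params": 69.38, "flops": 2231},
    "P2SCapsule":      {"params": 22.95, "flops": 2251},
    "PointCaps":       {"params": 3.52,  "flops": 615},
}

def reduction(baseline, key):
    """Percentage reduction of PointCaps relative to a baseline model."""
    ours = table5["PointCaps"][key]
    base = table5[baseline][key]
    return 100.0 * (base - ours) / base

flops_vs_p2s = reduction("P2SCapsule", "flops")          # ~73%
params_vs_3dcaps = reduction("3D-PointCapsNet", "params")  # ~95%
flops_vs_pointnetpp = reduction("PointNet++", "flops")   # ~80%
```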

Robustness to Noise: To evaluate the robustness of our architecture to noise, we train on a noise-free version of the ModelNet10 Wu et al. (2015) dataset and then corrupt the data with two techniques: 1) point perturbation and 2) adding outliers, and evaluate the reconstruction loss and the accuracy metric. In the perturbation test, zero-mean Gaussian noise of increasing variance is added to the points. As shown in Fig. 4.(1), even though the network shows a considerable accuracy drop at the highest noise variances, it still retains a substantial portion of its accuracy. Our outlier test replaces varying numbers of points in both the training and testing sets. Fig. 4.(2, 3) depicts this behaviour: the x-axis denotes the number of outlier points in the test set, and the three colours represent different numbers of outliers in the training set. As shown in Fig. 4.(2, 3), PointCaps maintains high accuracy over a wide range of outlier counts in the test set. We also observe that accuracy increases when we add outliers during the training phase. Hence, we conclude that PointCaps is significantly robust to Gaussian noise and to anomalies, and provides good reconstruction.
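The two corruption schemes can be sketched as follows; the function names and noise levels are illustrative assumptions, not the exact values used in the experiments (those are read off Fig. 4).

```python
import math
import random

# Sketch of the two corruption schemes used in the robustness test:
# per-point Gaussian perturbation, and replacing points with outliers.
def perturb(points, sigma):
    """Add zero-mean Gaussian noise with std 'sigma' to every coordinate."""
    return [[c + random.gauss(0.0, sigma) for c in p] for p in points]

def add_outliers(points, k, sigma=0.3):
    """Replace k randomly chosen points with Gaussian outlier points."""
    out = [list(p) for p in points]
    for i in random.sample(range(len(out)), k):
        out[i] = [random.gauss(0.0, sigma) for _ in out[i]]
    return out

random.seed(0)
cloud = [[0.0, 0.0, 0.0] for _ in range(1024)]   # toy 1024-point cloud
noisy = perturb(cloud, sigma=0.05)
corrupted = add_outliers(cloud, k=100)
```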

Figure 4: Noise analysis on the ModelNet10 dataset. (1) The network is trained without any perturbation and tested with Gaussian noise of increasing variance. (2-3) The network is subjected to different numbers of anomaly points (on the x-axis we increase the number of outlier points in the test set) and the performance (accuracy and reconstruction) is analyzed on the ModelNet10 dataset; the network is trained with various numbers of Gaussian-noise outliers. (4) An example of 100 points replaced with Gaussian noise.

Data Generation by Perturbation: We analyze the ability of PointCaps to generate data by perturbing the instantiation parameters. To do so, we add random noise to only one non-zero instantiation parameter at a time. As seen in Fig. 2, each instantiation parameter creates a specific change in the reconstructed point cloud. Furthermore, we observe that the new data samples are not distorted. This shows that the latent representation of PointCaps captures interpretable geometric properties and that PointCaps can augment data with little distortion. We achieve low distortion by applying an upper bound on the noise added to the instantiation parameter, where the maximum noise variance was determined by manual inspection.

Points to Part Capsule: Here we analyse the capability of PointCapA (Path A in Fig. 1) to represent point-to-part relationships with dynamic Euclidean routing. We compare the point-to-part representation ability of PointCaps (Euclidean routing based) with that of 3D-PointCapsNet (dynamic routing based). In 3D-PointCapsNet, the latent representation of the capsule is generated through cosine-similarity agreement, whereas in PointCapA the agreement is simply the Euclidean distance. Fig. 3 illustrates the local part representation of the capsules. As described in Sec. 3.2, each parent capsule has a logit that increases for the most likely parent during routing; this represents the contribution of the lower-level capsules to the higher-level capsule. We use these logits to identify the relevant part label for each point. As shown in Fig. 3, PointCapA specializes on local regions better than the dynamic-routing-based 3D-PointCapsNet.
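A minimal sketch of the routing idea described here, with agreement measured as the negative Euclidean distance between a vote and its parent capsule (standard dynamic routing would use a dot-product agreement instead). The squashing and iteration structure follow Sabour et al. (2017) and are simplified; this is not the paper's exact implementation.

```python
import math

# Minimal sketch of dynamic routing with Euclidean-distance agreement.
# votes[i][j] is lower capsule i's predicted vector for parent capsule j.
def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def squash(v):
    n2 = sum(c * c for c in v)
    scale = n2 / (1.0 + n2) / math.sqrt(n2 + 1e-9)
    return [scale * c for c in v]

def euclidean_routing(votes, iterations=3):
    n_in, n_out, dim = len(votes), len(votes[0]), len(votes[0][0])
    logits = [[0.0] * n_out for _ in range(n_in)]
    for _ in range(iterations):
        coupling = [softmax(row) for row in logits]
        parents = []
        for j in range(n_out):
            s = [sum(coupling[i][j] * votes[i][j][d] for i in range(n_in))
                 for d in range(dim)]
            parents.append(squash(s))
        # agreement: negative Euclidean distance between vote and parent,
        # instead of the dot product used in standard dynamic routing
        for i in range(n_in):
            for j in range(n_out):
                dist = math.sqrt(sum((votes[i][j][d] - parents[j][d]) ** 2
                                     for d in range(dim)))
                logits[i][j] -= dist
    return parents, logits

# Toy example: two input capsules, two parent capsules, 2-D votes.
votes = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [0.1, 0.9]],
]
parents, logits = euclidean_routing(votes)
```

Per-point part labels, as visualized in Fig. 3, can then be read off by taking the argmax of each input capsule's row of logits.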

4.2 Ablation Study

We first evaluate the impact of our novel routing algorithm, Euclidean distance routing (ER), on accuracy and CD error. To do so, we compare three implementations: 1) PointCaps, where PointCapA uses ER while PointCapB and DigitCap use DR; 2) All-DR, where all capsule layers use DR; and 3) All-ER, where all capsule layers use ER. As shown in Table 3, the accuracy of PointCaps is slightly above or on par with the two other routing configurations, and except for the ModelNet10 dataset, PointCaps achieves the best CD. These observations confirm that using ER in PointCaps yields better accuracy and CD. Moreover, we observe that PointCaps converges faster on all datasets.
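The CD (Chamfer distance) metric used as the reconstruction error throughout the ablation can be sketched as follows, assuming the common symmetric squared-distance formulation; the paper's exact normalization may differ.

```python
# Sketch of the Chamfer distance (CD) between an input point cloud and its
# reconstruction: average nearest-neighbour squared distance, symmetrized.
def chamfer_distance(a, b):
    def sq_dist(p, q):
        return sum((pc - qc) ** 2 for pc, qc in zip(p, q))
    a_to_b = sum(min(sq_dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(sq_dist(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a

# Toy example: one reconstructed point is displaced by 0.1 along z.
cloud = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
recon = [[0.0, 0.0, 0.1], [1.0, 0.0, 0.0]]
cd = chamfer_distance(cloud, recon)
```

This brute-force version is O(|a|·|b|); real implementations use spatial data structures or batched GPU kernels.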

Secondly, we evaluate the impact of the skip connection on accuracy and CD error. We compare two implementations: 1) PointCaps, where a skip connection is used between the encoder and decoder, and 2) W/o-skip-connection, which does not contain a skip connection. As shown in Table 3, PointCaps achieves a better CD error on all datasets. This observation confirms our intuition for using a skip connection: it results in lower reconstruction error.

5 Conclusion

In this work, we presented a novel capsule-network-based architecture for raw point cloud reconstruction, classification, and segmentation. Our use of a 1D convolutional capsule architecture significantly reduces computational complexity while retaining the global context. Our PointCapA layer is capable of representing human-interpretable point-to-part relationships. We also introduced a novel routing mechanism, dynamic Euclidean distance routing (as opposed to dynamic routing), and a class-independent latent representation. These improve the reconstruction, classification, and segmentation accuracy of raw point clouds. Furthermore, the proposed architecture can augment data by perturbing instantiation parameters with minimal distortion.


  • A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. (2015) Shapenet: an information-rich 3D model repository. arXiv preprint arXiv:1512.03012. Cited by: Figure 2, §4.1, Table 1, Table 3, §4.
  • X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun (2015) 3D object proposals for accurate object class detection. In NIPS, pp. 424–432. Cited by: §2.
  • Z. Chen, A. Tagliasacchi, and H. Zhang (2020) BSP-net: generating compact meshes via binary space partitioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §4.1, Table 1.
  • Z. Chen and H. Zhang (2019) Learning implicit fields for generative shape modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5939–5948. Cited by: §4.1, Table 1.
  • A. Cheraghian and L. Petersson (2019) 3Dcapsule: extending the capsule architecture to classify 3D point clouds. In WACV, pp. 1194–1202. Cited by: §2, §2, Table 2.
  • M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS. Cited by: §2.
  • H. Deng, T. Birdal, and S. Ilic (2018a) Ppf-foldnet: unsupervised learning of rotation invariant 3D local descriptors. In ECCV, pp. 602–618. Cited by: §2.
  • H. Deng, T. Birdal, and S. Ilic (2018b) Ppfnet: global context aware local features for robust 3D point matching. In CVPR, pp. 195–205. Cited by: §2.
  • T. Deprelle, T. Groueix, M. Fisher, V. Kim, B. Russell, and M. Aubry (2019) Learning elementary structures for 3d shape generation and matching. In Advances in Neural Information Processing Systems, pp. 7433–7443. Cited by: §4.1, Table 1.
  • A. Geiger and C. Wang (2015) Joint 3D object and layout inference from a single RGB-D image. In German Conference on Pattern Recognition, pp. 183–195. Cited by: §2.
  • T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry (2018) A papier-mâché approach to learning 3D surface generation. In CVPR, pp. 216–224. Cited by: §2, §4.1.
  • P. Hermosilla, T. Ritschel, P. Vázquez, À. Vinacua, and T. Ropinski (2018) Monte Carlo convolution for learning on non-uniformly sampled point clouds. ACM Transactions on Graphics (TOG) 37 (6), pp. 1–12. Cited by: §2.
  • G. E. Hinton, A. Krizhevsky, and S. D. Wang (2011) Transforming auto-encoders. In International Conference on Artificial Neural Networks, pp. 44–51. Cited by: §2.
  • R. Klokov and V. Lempitsky (2017) Escape from cells: deep kd-networks for the recognition of 3d point cloud models. In ICCV, pp. 863–872. Cited by: Table 2.
  • K. Lai, L. Bo, and D. Fox (2014) Unsupervised feature learning for 3D scene labeling. In ICRA, pp. 3050–3057. Cited by: §2.
  • S. Lee and C. Kim (2021) SAF-Nets: shape-adaptive filter networks for 3D point cloud processing. Journal of Visual Communication and Image Representation 79, pp. 103246. Cited by: §2, Table 2.
  • H. Lei, N. Akhtar, and A. Mian (2018) Spherical convolutional neural network for 3D point clouds. arXiv preprint arXiv:1805.07872. Cited by: §2.
  • J. Li, B. M. Chen, and G. Hee Lee (2018a) SO-Net: self-organizing network for point cloud analysis. In CVPR, pp. 9397–9406. Cited by: §2, §4.1, §4.1, Table 2, Table 4.
  • Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen (2018b) Pointcnn: convolution on x-transformed points. In NIPS, pp. 820–830. Cited by: §1, §2, Table 2.
  • L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han (2019a) On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265. Cited by: §4.
  • Y. Liu, B. Fan, S. Xiang, and C. Pan (2019b) Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8895–8904. Cited by: Table 2.
  • B. Maiseli, Y. Gu, and H. Gao (2017) Recent developments and trends in point set registration methods. Journal of Visual Communication and Image Representation 46, pp. 95–106. Cited by: §2.
  • D. Maturana and S. Scherer (2015) Voxnet: a 3D convolutional neural network for real-time object recognition. In IROS, pp. 922–928. Cited by: §2.
  • L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger (2019) Occupancy networks: learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4460–4470. Cited by: §4.1, Table 1.
  • C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017a) Pointnet: deep learning on point sets for 3D classification and segmentation. In CVPR, pp. 652–660. Cited by: §1, §2, §2, Table 2, Table 5.
  • C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas (2016) Volumetric and multi-view CNNs for object classification on 3D data. In CVPR, pp. 5648–5656. Cited by: §2.
  • C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017b) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In NIPS, pp. 5099–5108. Cited by: §1, §2, §2, §4.1, Table 2, Table 5.
  • N. Qin, X. Hu, and H. Dai (2018) Deep fusion of multi-view and multimodal representation of ALS point cloud for 3D terrain scene recognition. ISPRS Journal of Photogrammetry and Remote Sensing 143, pp. 205–212. Cited by: §2.
  • J. Rajasegaran, V. Jayasundara, S. Jayasekara, H. Jayasekara, S. Seneviratne, and R. Rodrigo (2019) Deepcaps: going deeper with capsule networks. In CVPR, pp. 10725–10733. Cited by: §1, §3.2.1, §3.3, §3.3.
  • P. Ramachandran, B. Zoph, and Q. V. Le (2017) Searching for activation functions. arXiv preprint arXiv:1710.05941. Cited by: §3.1.1, §3.2.1.
  • S. Sabour, N. Frosst, and G. E. Hinton (2017) Dynamic routing between capsules. In NIPS, pp. 3856–3866. Cited by: §1, §1, §2, §3.1.1, §3.1, §3.2, §3.2, §3.2, §3.3.
  • W. Sun, A. Tagliasacchi, B. Deng, S. Sabour, S. Yazdani, G. Hinton, and K. M. Yi (2020) Canonical capsules: unsupervised capsules in canonical pose. arXiv preprint arXiv:2012.04718. Cited by: §4.1, Table 1.
  • W. Wang, Y. You, W. Liu, and C. Lu (2021) Point cloud classification with deep normalized reeb graph convolution. Image and Vision Computing 106, pp. 104092. Cited by: §2, §2.
  • Y. Wang and J. M. Solomon (2019) Prnet: self-supervised learning for partial-to-partial registration. arXiv preprint arXiv:1910.12240. Cited by: §2.
  • Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph CNN for learning on point clouds. Acm Transactions On Graphics 38 (5), pp. 1–12. Cited by: §1, Table 2.
  • X. Wen, Z. Han, X. Liu, and Y. Liu (2020) Point2SpatialCapsule: aggregating features and spatial relationships of local regions on point clouds using spatial-aware capsules. IEEE Transactions on Image Processing 29, pp. 8855–8869. Cited by: §1, §2, §2, §4.1, Table 2, Table 5.
  • H. Wu, Y. Miao, and R. Fu (2021) Point cloud completion using multiscale feature fusion and cross-regional attention. Image and Vision Computing 111, pp. 104193. Cited by: §2.
  • Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3D shapenets: a deep representation for volumetric shapes. In CVPR, pp. 1912–1920. Cited by: §2, §4.1, §4.1, Table 3, §4.
  • T. Xiang, C. Zhang, Y. Song, J. Yu, and W. Cai (2021) Walk in the cloud: learning curves for point clouds shape analysis. arXiv preprint arXiv:2105.01288. Cited by: Table 2.
  • Q. Xu, X. Sun, C. Wu, P. Wang, and U. Neumann (2020) Grid-gcn for fast and scalable point cloud learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5661–5670. Cited by: Table 2.
  • Y. Yang, C. Feng, Y. Shen, and D. Tian (2018) Foldingnet: point cloud auto-encoder via deep grid deformation. In CVPR, pp. 206–215. Cited by: §2.
  • L. Yi, V. G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, and L. Guibas (2016) A scalable active framework for region annotation in 3D shape collections. ACM Transactions on Graphics (ToG) 35 (6), pp. 1–12. Cited by: Figure 2, Table 3, §4.
  • L. Yu, X. Li, C. Fu, D. Cohen-Or, and P. Heng (2018) Pu-net: point cloud upsampling network. In CVPR, pp. 2790–2799. Cited by: §2.
  • W. Yuan, T. Khot, D. Held, C. Mertz, and M. Hebert (2018) Pcn: point completion network. In 2018 International Conference on 3D Vision (3DV), pp. 728–737. Cited by: §2.
  • S. Zhang, H. Wang, J. Gao, and C. Xing (2019) Frequency domain point cloud registration based on the Fourier transform. Journal of Visual Communication and Image Representation 61, pp. 170–177. Cited by: §2.
  • Y. Zhao, T. Birdal, H. Deng, and F. Tombari (2019a) 3D point capsule networks. In CVPR, pp. 1009–1018. Cited by: §1, §2, §3.1, §3, Figure 3, Figure 3, §4.1, §4.1, §4.1, §4.1, Table 1, Table 4, Table 5.
  • Y. Zhao, T. Birdal, J. E. Lenssen, E. Menegatti, L. Guibas, and F. Tombari (2019b) Quaternion equivariant capsule networks for 3D point clouds. arXiv preprint arXiv:1912.12098. Cited by: §2.