3D Object Recognition with Ensemble Learning --- A Study of Point Cloud-Based Deep Learning Models

by Daniel Koguciuk, et al.

In this study, we present an analysis of model-based ensemble learning for 3D point-cloud object classification and detection. An ensemble of multiple model instances is known to outperform a single model instance, but ensemble learning for 3D point clouds has received little study. First, an ensemble of multiple model instances trained on the same part of the ModelNet40 dataset was tested for seven deep learning, point cloud-based classification algorithms: PointNet, PointNet++, SO-Net, KCNet, DeepSets, DGCNN, and PointCNN. Second, the ensemble of different architectures was tested. Results of our experiments show that the tested ensemble learning methods improve over the state of the art on the ModelNet40 dataset, from 92.65% to 93.64% for the ensemble of single architecture instances, 94.03% for two different architectures, and 94.15% for five different architectures. We show that the ensemble of two models with different architectures can be as effective as the ensemble of 10 models with the same architecture. Third, classic bagging (i.e. with different subsets used for training multiple model instances) was tested, and the sources of ensemble accuracy growth were investigated for the best-performing architecture, i.e. SO-Net. We also investigate ensemble learning for the Frustum PointNet approach in the task of 3D object detection, increasing the average precision of 3D box detection on the KITTI dataset from 63.1% to 66.5% using only three model instances. We measure the inference time of all 3D classification architectures on an NVIDIA Jetson TX2, a common embedded computer for mobile robots, to indicate the suitability of these models for real-life applications.




1 Introduction

Over the last few years, with the rapid development of sensor technology, the processing of three-dimensional (3D) data has become an important research topic. High-quality, long-range laser scanners are widely used in autonomous cars [1], and the availability of cheap RGB-D sensors has resulted in significant progress in 3D perception for mobile robots [2]. Accurate object detection, segmentation, and classification from 3D point clouds are challenging problems, especially in real-world settings, and are crucial for performing robotic tasks.

Many handcrafted approaches to 3D point cloud analysis have been developed previously [3, 4, 5]; however, in recent years, deep learning-based approaches have increased in popularity [6, 7, 8, 9]. The results of deep learning models are strongly correlated with the amount of available data and with the quality of the regularization techniques used [10].

In the point cloud classification task, we assume that the object is already segmented, which means that all the points belong to that single-class object. There are three main approaches to point cloud classification:

  • 3D ConvNets: the point cloud is converted to a voxel grid with a given resolution. This approach is not memory efficient for large volumes, and the sparsity of 3D data leads to inefficient and redundant computation. However, some octree- or kd-tree-based approaches reduce these disadvantages and provide encouraging results [6, 11].

  • Rendering a set of 2D views of the 3D object: the problem is transformed into a set of 2D vision problems. A view-pooling layer [12] may be used to aggregate features from different views. This technique leverages the performance of 3D ConvNets, but the loss of information during rendering makes this approach impractical for point-level segmentation tasks [7, 13].

  • Direct point cloud processing: architectures that directly process point clouds in an order-invariant manner, first presented by the PointNet [8] architecture. This approach can be adapted to different kinds of problems, ranging from the estimation of population statistics and anomaly detection in piezometer data of embankment dams [15] to cosmology [16].

In this paper, we focus on direct point cloud processing, because such architectures can perform well not only in classification, but also in segmentation and detection tasks. Seven architectures are used in our experiments: PointNet [8], PointNet++ [17], SO-Net [18], KCNet [19], DeepSets [20], DGCNN [21], and PointCNN [22]. We chose these because of their prominence and the availability of the authors' implementations, which are open to the research community.

For object classification, two types of datasets can be considered. The first type is based on 3D CAD models: PrincetonSB [23], ModelNet [11], ShapeNet [24], and many others. The second type comprises datasets of 3D objects or scenes acquired from the real world with depth sensors [1, 25, 26]. In this work, we focus on ModelNet40 [11], because it is one of the most popular benchmarks for object classification. It contains CAD models of objects from 40 classes. For real-world applications, the KITTI dataset [1] is the most prominent and widely used benchmark for 3D perception of autonomous vehicles and is thus our focus in this study.

Ensemble learning [27] increases performance of the prediction with the help of multiple models. Several methods are reported in the literature: bagging, boosting, stacking, a bucket of models, Bayesian methods, and many others. In this article, we focus on bagging, also known as bootstrap aggregating. We test three voting methods: direct output averaging, soft voting, and hard voting. We compare the ensemble of model instances trained on the same training set and its different subsets and evaluate their performance.

Knowledge distillation [28] is a training procedure in which a so-called student model (usually a fast mobile network) is trained to mimic a teacher (usually an ensemble of complex models) on a large unlabelled dataset rather than to fit ground truth labels. An ensemble is often useful as a teacher model, as its execution time is not crucial [29]. Experiments with knowledge distillation are beyond the scope of this work, which is focused on leveraging the ensemble results.

Our experiments show that an ensemble of neural networks trained on the whole training set is better than bagging using random parts of the training set. An ensemble of models with different architectures can improve the prediction performance even further.

We also count the number of trainable parameters and measure inference execution times on the NVIDIA Jetson TX2 platform for each approach. According to previous studies [30, 31], using the Jetson platform as a high-level driver is a reasonable choice for energy-efficient mobile robotic applications.

2 Related Work

To the best of our knowledge, there are no studies reporting specifically on the influence of different ensemble methods and of the number of aggregated models on the prediction accuracy of direct point cloud classification architectures. Su et al. [32] studied a model combining different types of representations, but because the field advances quickly, the number of available point cloud classification models keeps growing. A previous article [33] reported a significant performance gain when using an ensemble of 10 instances of one voxel-based deep learning architecture, introduced in that article.

This section starts with a brief description of each tested architecture. We compare the reported results with the test accuracy reproduced in our experiments. The whole setup, with exact versions of all libraries and code used, has been shared online (https://github.com/dkoguciuk/ensemble_learning_for_point_clouds). Differences from the original reports may be caused by implementation details, hyperparameters, or test methodology (reporting maximum vs. average score). Finally, we briefly describe bagging.

2.1 DeepSets

DeepSets [20] is one of the first two deep learning approaches to shape classification using bare point sets, developed independently of PointNet. The general idea is quite similar in both methods, but DeepSets focuses primarily on producing the permutation equivariance layer, which consists of three operations: $\phi$, sum, and $\rho$. According to the paper, $\phi$ could be an arbitrary neural network architecture applied to every point in the point cloud, and the output should be summed along the set dimension. The reduced vector can then be passed into the multilayer perceptron ($\rho$) working as a classifier.

Given the symmetry in weight sharing, the final features of the whole point set (after the summation) are invariant to the ordering of the input. This article demonstrates a strong mathematical background, many sample applications, and great results on the ModelNet40 dataset. Despite this, there is no direct successor of the model in the literature.
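The order invariance described above can be illustrated with a minimal NumPy sketch. It is not the DeepSets implementation: the names `phi` and `rho` follow the notation of the DeepSets paper, while the layer sizes and the random weights standing in for trained parameters are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes; random weights stand in for trained parameters.
W_phi = rng.normal(size=(3, 16))    # phi: per-point map, 3D coordinates -> 16 features
W_rho = rng.normal(size=(16, 40))   # rho: pooled set feature -> 40 class scores

def phi(points):
    """Apply the same map to every point independently (shared weights)."""
    return np.maximum(points @ W_phi, 0.0)    # shape (N, 16), ReLU activation

def rho(set_feature):
    """Classifier operating on the pooled set feature."""
    return set_feature @ W_rho                # shape (40,)

def deep_sets_scores(points, pool=np.sum):
    """Class scores for one point cloud of shape (N, 3)."""
    return rho(pool(phi(points), axis=0))

cloud = rng.normal(size=(128, 3))
shuffled = cloud[rng.permutation(len(cloud))]

# The scores do not depend on the order of the points (up to rounding).
print(np.allclose(deep_sets_scores(cloud), deep_sets_scores(shuffled)))
```

Passing `pool=np.max` replaces the sum with max pooling, which is also a symmetric function and therefore keeps the result independent of the point order.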

2.2 PointNet

The basic idea of PointNet [8] is to learn a spatial encoding of each point using a series of nonlinear mappings and then aggregate all point features into one global point cloud signature. The first part plays a role similar to the per-point operation in DeepSets (there are some slight differences in the weight sharing scheme), and the second one (a symmetric function) is also similar to DeepSets, but the authors suggest using the max pooling operation as the one achieving the best performance.

The model is invariant to the order in which the points are presented and can operate directly on point clouds without any additional preprocessing such as spatial partitioning or graph construction. Moreover, the model is extremely robust to deformation and noise, but by design it cannot detect small local structure details; thus, it leaves ample room for modifications.

2.3 PointNet++

PointNet++ [17] is an extended version of the original PointNet architecture, in which the authors introduced hierarchical feature extraction by building a pyramid-like aggregation scheme to combine features from multiple scales. There are three steps on each pyramid level: sampling, grouping, and feature extraction. The first two steps partition the input point cloud into overlapping local regions according to the distance metric of the underlying space. The third step learns a higher-dimensional representation of an input region with a so-called local learner, which is a standard PointNet model. There are three such pyramid levels in the original PointNet++ article, which produce the features of the whole point set.

2.4 SO-Net

SO-Net [18] is another hierarchical feature extraction model based on the PointNet approach, but it has a different sampling and grouping strategy than PointNet++. The main idea is to build a Self-Organizing Map (SOM) to model the spatial distribution of the input point cloud and then assign each point to its $k$ nearest SOM nodes, which plays a role similar to the sampling and grouping steps in PointNet++. Each local region is processed with a local PointNet-like learner, and a channel-wise max pooling operation is applied to aggregate point features into node signatures. Each SOM node with its features is then processed with the second-level learner and again aggregated with max pooling into a feature vector that represents the whole input point cloud.

2.5 KCNet

The main idea of the KCNet model [19] is to construct a kernel correlation layer as an affinity measure between a query point with its neighbors and kernel points, where the latter are allowed to move and adjust during training freely. KCNet uses several kernels at the local level to augment the original 3D coordinates’ input of the PointNet architecture. The second modification to the plain PointNet model is a recursive max-pooling operation performed in the neighborhood of each node.

2.6 DGCNN

Instead of generating the point embedding directly from the point coordinates, the DGCNN [21] introduces EdgeConv operation, which incorporates a point’s neighboring structure. For each point of a point cloud, they construct a local neighborhood graph and apply deep learning feature extraction on edges of this graph. EdgeConv is designed to be invariant to the ordering of neighbors, and thus is permutation invariant.

The operation could be applied hierarchically just like in the traditional convolutional networks for 2D images, but the authors propose to build the graph of neighboring points for each layer independently.

2.7 PointCNN

PointCNN [22] uses an MLP on the local neighborhood to organize points into a latent canonical order. Learning the so-called $\mathcal{X}$-transformation results not only in permuting the input but also in weighting the features associated with the points. Next, the element-wise product and sum operations of the typical convolution operator are applied to the $\mathcal{X}$-transformed features.

The operations described above can be applied hierarchically: after each convolution, a subset of points is retained by downsampling, and these points thus contain richer information aggregated from the expanding neighborhood.

2.8 Comparison of Models

The general idea behind PointNet and DeepSets approaches is similar, but they differ mostly in weights sharing scheme in the MLP network. Both are pioneers in the field but do not explicitly use local structure information.

PointNet++ and SO-Net both apply PointNet hierarchically but differ in the sampling and grouping strategy. KCNet adds local structure information that is learnable, not designed by hand as in PointNet++ or SO-Net. However, the kernel size is fixed, so KCNet cannot capture the spatial distribution as SO-Net does.

The last two models incorporate information about the local neighborhood of a query point by building a graph structure around it and applying a kind of graph convolution. DGCNN learns feature extraction between pairs of neighboring points, whereas PointCNN tries to organize the nearby points in a canonical order.

2.9 Frustum PointNet

The main author of PointNet and PointNet++ extended his network into a 3D object detection framework called Frustum PointNet [34]. The main idea is to combine both mature 2D and novel 3D approaches by splitting the task into three main stages: 3D frustum proposal based on 2D object detection, 3D instance segmentation, and 3D bounding box estimation. The modules are based on PointNet (denoted as v1) or PointNet++ (v2) architectures.

In the first stage, a 2D CNN object detector is used to generate 2D region proposals. Each region is then lifted to 3D and thus becomes a frustum proposal containing a point cloud: all points in the LiDAR point cloud that lie inside the 2D region when projected onto the image plane. This point cloud is then fed into a 3D instance segmentation network, parameterized by the model instance weights and conditioned on the object class, whose binary output indicates whether a given point is part of the 3D object; this assumes that there is only one meaningful object in the frustum. The points belonging to the object form an object point cloud. Then, the T-Net module finds a transformation that centers the object point cloud. In the last step, the centered point cloud is used to estimate the amodal 3D bounding box of the object, which is finally transformed to the global frame. The amodal bounding box is described by its position, size, and heading angle. Size and heading are represented as discrete probability distributions, not raw values. Summing up, the network predicts a per-point segmentation mask and an amodal 3D bounding box for each frustum proposal.
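The frustum proposal step, i.e. keeping the points that fall inside a 2D region when projected onto the image plane, can be sketched as follows. This is only an illustration under stated assumptions, not the Frustum PointNet code: the function name `frustum_mask`, the camera-frame convention, and the toy projection matrix are all hypothetical.

```python
import numpy as np

def frustum_mask(points, P, box):
    """Boolean mask of points whose image projection falls inside a 2D box.

    points: (N, 3) coordinates in the camera frame (assumed convention).
    P:      (3, 4) camera projection matrix.
    box:    (xmin, ymin, xmax, ymax) in image coordinates.
    """
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    uvw = homogeneous @ P.T                       # shape (N, 3)
    in_front = uvw[:, 2] > 0                      # ignore points behind the camera
    depth = np.where(in_front, uvw[:, 2], 1.0)    # avoid division by zero
    u, v = uvw[:, 0] / depth, uvw[:, 1] / depth
    xmin, ymin, xmax, ymax = box
    return in_front & (u >= xmin) & (u <= xmax) & (v >= ymin) & (v <= ymax)

P = np.hstack([np.eye(3), np.zeros((3, 1))])      # toy pinhole camera
points = np.array([[1.0, 1.0, 2.0],     # projects to (0.5, 0.5): inside the box
                   [4.0, 0.0, 2.0],     # projects to (2.0, 0.0): outside the box
                   [0.0, 0.0, -1.0]])   # behind the camera
mask = frustum_mask(points, P, box=(0.0, 0.0, 1.0, 1.0))
frustum_cloud = points[mask]
```

The resulting `frustum_cloud` is what the subsequent segmentation stage would receive as input.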


2.10 Ensemble Learning with Bagging and Boosting

An ensemble consists of a set of individually trained models, whose predictions are combined. It is well known that ensemble methods can be used for improving prediction performance [27, 35].

Individually trained models may be trained using different training sets. In bagging, the training set is selected independently for each classifier from the full training set. The selected set can be a subset of the entire training set (later referred to as bagging without replacement) or can have the same size, but some samples repeated several times (bagging with replacement). However, the result of neural network training depends on several random factors so that the ensemble can consist of classifiers trained on the same training set, which we refer to as a simple ensemble.

Boosting trains a series of classifiers, with the training set (or samples loss weights) of the next classifier focusing on the samples with a higher error for the previous classifier. This can reduce errors, but noise in the training data often results in boosting overfitting [27].

Given the outputs of the individual classifiers, the output of the ensemble can be calculated in different ways. Boosting uses an individual weight for each classifier in the series. Stacking trains a learning algorithm to combine predictions. In bagging and simple ensembles, all classifiers are equivalent, so three aggregation methods are commonly used in classification tasks: direct output averaging, soft voting (the activations of all hypotheses for each sample sum to one), and hard voting (each classifier output is in one-hot form, i.e. each classifier votes for one hypothesis).
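The three aggregation methods can be compared on a toy example. The sketch below is a minimal illustration; it assumes that the normalization used for soft voting is a softmax over the class dimension, which the text does not specify, and the function names are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax; the outputs sum to one along `axis`."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ensemble_predict(outputs, method="average"):
    """Aggregate raw outputs of shape (n_models, n_samples, n_classes)."""
    if method == "average":      # direct averaging of raw activations
        scores = outputs.mean(axis=0)
    elif method == "soft":       # average of normalized activations
        scores = softmax(outputs).mean(axis=0)
    elif method == "hard":       # one vote per model; ties go to the lowest class index
        votes = outputs.argmax(axis=-1)
        scores = np.eye(outputs.shape[-1])[votes].sum(axis=0)
    else:
        raise ValueError(f"unknown method: {method}")
    return scores.argmax(axis=-1)

# Three models, one sample, three classes: one very confident model
# disagrees with two mildly confident ones.
outputs = np.array([[[10.0, 0.0, 0.0]],
                    [[0.0, 1.0, 0.0]],
                    [[0.0, 1.0, 0.0]]])
for method in ("average", "soft", "hard"):
    print(method, ensemble_predict(outputs, method))
```

In this example, averaging and soft voting follow the confident model (class 0), whereas hard voting follows the majority (class 1).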

3 Ensemble Learning for 3D Object Recognition

We performed several experiments in this study. Seven deep network architectures were selected: PointNet, PointNet++, SO-Net, KCNet, DeepSets, DGCNN, and PointCNN. All networks were tested on one task, i.e. ModelNet40 object classification. All these networks take a raw 3D point cloud as input and output a vector of class scores for a given object, which can be denoted as follows:

$$f: \mathbb{R}^{N \times 3} \rightarrow \mathbb{R}^{C},$$

where $N$ is the number of points in the point cloud, and $C$ is the number of classes in the classification task.

One set of hyperparameters is selected for each network based on the authors' settings. Ten model instances are trained for each architecture. The influence of the number of models in the simple ensemble is tested for each architecture, as described in subsection 3.1.

Different architectures achieve the best result for different object classes. This motivates studying whether an ensemble of different models can outperform every single model. Such a comparison for PointNet, PointNet++, KCNet, DGCNN, PointCNN, and SO-Net is described in subsection 3.2. DeepSets was not included due to challenges in technical implementation.

The SO-Net architecture achieved the highest accuracy in our experiments, so it was selected for further bagging tests. We tested the classification accuracy for bagging with and without replacement, as described in subsection 3.3.

The ensemble of several model instances is computationally expensive. However, a deep network architecture can be viewed as an encoder (transforming a sample into a feature vector) followed by a classifier (e.g. an MLP calculating class probabilities based on a feature vector). The question arises whether classification accuracy can be improved by an ensemble of classifiers based on the same feature vector. This was tested for the SO-Net architecture, as described in subsection 3.4.

Our work shows that a simple ensemble of several model instances of the same architecture increases classification performance. Random factors cause differences between model instances. The influence of each factor is evaluated in subsection 3.5.

An ensemble of three model instances of Frustum PointNet was tested on the KITTI dataset, as described in subsection 3.6. Note that Frustum PointNet is a pioneering approach in the sense that it performs detection and classification on raw LiDAR scans, so it cannot be compared directly with the other models presented.

3.1 Simple Ensemble of Model Instances

We experiment with the simple ensemble, which is a special case of bagging in which the full training dataset is used to train every model instance. The ensembling is performed by averaging the raw output activations, by soft voting, or by hard voting.

For each architecture, 10 model instances, each identified by a model index, were trained. Tests were performed for a range of ensemble sizes. For each ensemble size, all combinations of model instances were selected. The mean and standard deviation for each ensemble size are reported in the experimental results.
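The evaluation over all combinations of model instances can be sketched as below. This is a minimal illustration that assumes activation averaging as the aggregation rule; the function name, array shapes, and toy data are hypothetical.

```python
from itertools import combinations

import numpy as np

def ensemble_accuracy_stats(outputs, labels, size):
    """Mean and standard deviation of ensemble accuracy over all subsets.

    outputs: (n_models, n_samples, n_classes) raw outputs of trained instances.
    labels:  (n_samples,) ground-truth class indices.
    size:    number of model instances in each ensemble.
    """
    accuracies = []
    for subset in combinations(range(outputs.shape[0]), size):
        predictions = outputs[list(subset)].mean(axis=0).argmax(axis=-1)
        accuracies.append((predictions == labels).mean())
    accuracies = np.asarray(accuracies)
    return accuracies.mean(), accuracies.std()

# Toy data: three model instances, two samples, two classes.
labels = np.array([0, 1])
outputs = np.array([[[2.0, 0.0], [0.0, 2.0]],    # both samples correct
                    [[0.0, 1.0], [0.0, 1.0]],    # first sample wrong
                    [[1.0, 0.0], [1.0, 0.0]]])   # second sample wrong
for size in (1, 2, 3):
    print(size, ensemble_accuracy_stats(outputs, labels, size))
```

With 10 trained instances, the largest number of subsets is 252 (for ensembles of five models), so exhaustive enumeration remains cheap once the outputs are stored.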

Figure 1: Dependency between instance classification accuracy (left) or mean class accuracy (right) and the number of models in the ensemble for a given voting method. For visibility, only the results for the best-performing architecture and the mean result of all architectures are plotted. Results for soft voting and activation averaging are approximately equal and usually outperform hard voting.

Figure 1 shows the comparison of voting methods. Raw activation averaging performs approximately the same as soft voting and usually outperforms hard voting. For simplicity, only activation averaging is used in the rest of this article.

Figure 2: Dependency between instance classification accuracy (left) or mean class accuracy (right) and the number of models in the ensemble. We trained each architecture independently 10 times (70 different models); then, for each possible number of models in the ensemble, we randomly chose ten different subsets of that size and reported the mean accuracy with its standard deviation across those subsets. As one can observe, ensemble learning makes the output more stable and slightly raises the classification accuracy.

Figure 2 presents the instance and mean class accuracy as a function of the number of models used for each architecture, along with their standard deviation. As the number of models in the ensemble increases, the classification accuracy rises slightly and its standard deviation decreases, which means that the output is more stable and depends less on a single training session.
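The two metrics used throughout can be sketched as follows, assuming their common definitions: instance accuracy is the fraction of correctly classified samples, and mean class accuracy is the average of the per-class accuracies. The function names are hypothetical.

```python
import numpy as np

def instance_accuracy(labels, predictions):
    """Fraction of samples classified correctly."""
    return float(np.mean(labels == predictions))

def mean_class_accuracy(labels, predictions):
    """Average of per-class accuracies; every class counts equally."""
    classes = np.unique(labels)
    per_class = [np.mean(predictions[labels == c] == c) for c in classes]
    return float(np.mean(per_class))

# Imbalanced toy example: a model that always predicts class 0.
labels = np.array([0, 0, 0, 1])
predictions = np.array([0, 0, 0, 0])
print(instance_accuracy(labels, predictions))     # 0.75
print(mean_class_accuracy(labels, predictions))   # 0.5
```

The gap between the two values shows why both are reported: on an imbalanced test set, instance accuracy can hide poor performance on small classes.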

| Architecture | Instance accuracy (reported by the authors) | Instance accuracy (single model) | Instance accuracy (ensemble) | Class accuracy (single model) | Class accuracy (ensemble) | Instance accuracy increase | Class accuracy increase |
|---|---|---|---|---|---|---|---|
| PointNet | 89.20 | 88.65 | 89.38 | 85.77 | 86.62 | 0.74 | 0.86 |
| PointNet++ | 90.70 | 90.14 | 90.48 | 87.71 | 88.19 | 0.34 | 0.48 |
| DeepSets | 90.30 | 89.71 | 90.27 | 85.79 | 86.46 | 0.56 | 0.67 |
| KCNet | 91.00 | 89.62 | 92.14 | 85.38 | 88.28 | 2.52 | 2.89 |
| SO-Net | 93.40 | 92.65 | 93.64 | 89.98 | 91.02 | 0.99 | 1.05 |
| DGCNN | 92.20 | 91.55 | 92.02 | 89.0 | 89.30 | 0.47 | 0.27 |
| PointCNN | 92.20 | 91.82 | 92.22 | 87.85 | 88.36 | 0.41 | 0.50 |

Table 1: Average instance and mean class accuracy (in %) for each architecture. Results for the single model, the simple ensemble of 10 models, and the accuracy increase are detailed. The simple ensemble of KCNet instances has a high increase in the classification accuracy (2.52%).

Table 1 shows the numerical comparison of the classification accuracy increase for all approaches. The simple ensemble of KCNet instances has a noticeably higher increase in classification accuracy (2.52%), followed by SO-Net (0.99%) and the other architectures (with a mean instance accuracy increase of 0.50%).

Figure 3: Per-class classification accuracy of the plain version, with its gain (green) or loss (red) under the simple ensemble, for all architectures, for the less accurate classes. Note that flower pot is the hardest class to classify, and almost all architectures using the point neighborhood do slightly worse with ensemble learning here. This could suggest that the general shape of a flower pot is more meaningful than local structure information.

Figure 4: Per-class classification accuracy of the plain version, with its gain (green) or loss (red) under the simple ensemble, for all architectures, for the more accurate classes (with accuracy between 80 and 100%). Note that for many classes, for example bed or cone, all architectures achieve better classification accuracy using ensemble learning.

Figures 3 and 4 show the loss or gain in per-class classification accuracy between the plain version and the simple ensemble for all tested approaches. Some classes are easy to classify, and all approaches achieve 100% accuracy, for example airplane and laptop. Interestingly, however, there are classes where the PointNet approach does better than the others despite having the lowest overall accuracy, for example glass box and stool. This suggests that different methods focus on different aspects of point clouds; in particular, focusing on local structure leads to overfitting for some classes with a more discriminative global shape.

Figure 5: Number of classes for which a given architecture is the best in its plain version, with its gain (green) or loss (red). The high gain of the DGCNN and PointCNN architectures suggests that they can be much better at classifying some classes and much worse at others.

One more interesting question can be asked about those approaches and their ensembles: in how many classes does a particular model have the highest accuracy? Figure 5 answers that question and shows how this number changes after applying the simple ensemble (if several architectures reach the same best accuracy for a class, each of them scores in this ranking). Note that the order of architectures is different from that given in Table 1.

3.2 Ensemble of Different Models

We evaluate ensembles of pairs of different models. First, the ensemble output scores were calculated for pairs of architectures as a weighted sum of the outputs of the two architectures, with the weights listed in Table 2.

Note that each output of the architecture is scaled to have unit standard deviation on the training set. The ensemble of different models improved both instance and mean class accuracy. The top pairs consist of the SO-Net model, with the higher weight, and a second architecture. Ensemble results are calculated for 10 model instances: 5 of one architecture and 5 of the other. Table 2 shows the most interesting results. As two model instances are used in the ensemble to calculate its plain accuracy, results for an ensemble of two SO-Net instances are listed for reference.
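A weighted ensemble of two architectures with scaled outputs can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the standard deviation is taken over all output values on the training set (the exact normalization is not specified in the text), the weights are assumed to sum to one, as they do in the tables, and all names are hypothetical.

```python
import numpy as np

def unit_std_scale(train_outputs):
    """Factor that gives the outputs unit standard deviation on the training set."""
    return 1.0 / train_outputs.std()

def weighted_ensemble(outputs, scales, weights):
    """Weighted sum of scaled outputs; each outputs[i] has shape (n_samples, n_classes)."""
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0), "weights are assumed to sum to one"
    return sum(w * s * o for w, s, o in zip(weights, scales, outputs))

# Toy outputs of two hypothetical architectures A and B.
train_a = np.array([[2.0, -2.0], [-2.0, 2.0]])   # standard deviation 2
train_b = np.array([[1.0, -1.0], [-1.0, 1.0]])   # standard deviation 1
scales = [unit_std_scale(train_a), unit_std_scale(train_b)]

test_a = np.array([[2.0, 0.0]])   # A prefers class 0
test_b = np.array([[0.0, 3.0]])   # B prefers class 1

for weights in ((0.7, 0.3), (0.9, 0.1)):
    scores = weighted_ensemble([test_a, test_b], scales, weights)
    print(weights, scores.argmax(axis=-1))
```

The same function accepts more than two architectures, so it also covers a weighted sum over all architectures.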

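| Architecture 1 | Architecture 2 | Weight 1 | Weight 2 | Instance accuracy (mean) | Instance accuracy (ensemble) | Class accuracy (mean) | Class accuracy (ensemble) |
|---|---|---|---|---|---|---|---|
| SO-Net | PointNet | 0.7 | 0.3 | 93.23% | 93.65% | 90.95% | 91.44% |
| SO-Net | PointNet++ | 0.7 | 0.3 | 93.41% | 93.75% | 91.33% | 91.61% |
| SO-Net | KCNet | 0.8 | 0.2 | 93.21% | 93.73% | 90.62% | 90.99% |
| SO-Net | DGCNN | 0.9 | 0.1 | 93.64% | 93.95% | 91.59% | 92.00% |
| SO-Net | PointCNN | 0.8 | 0.2 | 93.55% | 94.03% | 90.97% | 91.50% |
| SO-Net | SO-Net | 0.5 | 0.5 | 93.18% | 93.64% | 90.57% | 91.02% |

Table 2: Example results for the ensemble of models with two different architectures.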

An ensemble of all considered architectures together with the best-performing SO-Net was tested using a weighted sum of the outputs of the individual architectures.

Note that this is an ensemble of different architectures that includes only one training instance of each architecture. The resulting weightings are then tested with multiple training instances per architecture, as described in subsection 3.1. The ensemble of all architectures, with SO-Net playing the major role, achieves the highest overall accuracy. The results of the ensemble are calculated for five instances of each architecture with a nonzero weight. Table 3 shows the most interesting results.

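| Weight 1 | Weight 2 | Weight 3 | Weight 4 | Weight 5 | Weight 6 | Instance accuracy (mean) | Instance accuracy (ensemble) | Class accuracy (mean) | Class accuracy (ensemble) |
|---|---|---|---|---|---|---|---|---|---|
| - | 0.1 | - | - | 0.2 | 0.7 | 93.73% | 94.13% | 91.24% | 91.77% |
| 0.05 | 0.05 | - | - | 0.25 | 0.65 | 93.74% | 94.14% | 91.17% | 91.57% |
| - | 0.05 | - | 0.05 | 0.3 | 0.6 | 93.88% | 94.14% | 91.52% | 91.92% |
| - | 0.05 | - | - | 0.2 | 0.75 | 93.65% | 94.15% | 91.14% | 91.76% |
| 0.05 | 0.05 | - | - | 0.2 | 0.7 | 93.67% | 94.15% | 91.12% | 91.71% |
| - | 0.3 | - | 0.05 | - | 0.65 | 93.73% | 94.04% | 91.66% | 92.20% |
| 0.05 | 0.3 | - | 0.05 | - | 0.6 | 93.78% | 94.04% | 91.79% | 92.20% |
| - | 0.15 | - | 0.1 | 0.05 | 0.7 | 93.76% | 94.03% | 91.69% | 92.21% |
| - | 0.15 | - | 0.1 | - | 0.75 | 93.70% | 94.05% | 91.64% | 92.22% |
| - | 0.2 | - | 0.1 | - | 0.7 | 93.68% | 94.06% | 91.60% | 92.24% |

Table 3: Example results for the ensemble of models with different architectures.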

3.3 Ensemble Learning with Model Bagging

Model bagging was tested for the SO-Net architecture, which achieves the highest overall accuracy. For bagging with replacement, 10 training sets were generated by randomly sampling, with replacement, a number of samples equal to the size of the training set. For bagging without replacement, ten different training subset sizes were used in the experiments.

For each subset size, 10 training subsets were generated (sampled without replacement), and one model instance was trained on each of them. The outputs of the models for a given subset size were aggregated as detailed in subsection 3.1.
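The two sampling schemes can be sketched as follows. This is a minimal illustration: the function name, the training set size of 1000, and the subset fraction of 0.5 are hypothetical and do not correspond to the values used in the experiments.

```python
import numpy as np

def bagging_indices(n_samples, n_sets, fraction=1.0, replace=True, seed=0):
    """Draw index sets for training the individual model instances.

    With replacement, each set has n_samples entries and may contain repeats.
    Without replacement, each set is a subset holding `fraction` of the data.
    """
    rng = np.random.default_rng(seed)
    size = n_samples if replace else int(round(fraction * n_samples))
    return [rng.choice(n_samples, size=size, replace=replace) for _ in range(n_sets)]

with_replacement = bagging_indices(1000, n_sets=10, replace=True)
without_replacement = bagging_indices(1000, n_sets=10, fraction=0.5, replace=False)

# With replacement, some samples repeat, so fewer distinct samples are seen.
print(len(np.unique(with_replacement[0])), len(np.unique(without_replacement[0])))
```

A simple ensemble corresponds to skipping the sampling step altogether and training every instance on all `n_samples` samples.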

Figure 6: Results of the simple ensemble, bagging without replacement for ten different training subset sizes, and bagging with replacement. Results for the SO-Net architecture: instance (left) and class (right) classification accuracy. Error bars are not plotted to avoid clutter. The biggest gain is achieved for the smallest training subset size, but the overall classification accuracy is the best for the simple ensemble.

Figure 6 shows the results. The biggest gain is achieved for the smallest training subset size. The accuracy increase for bagging with replacement is higher than that for bagging without replacement. However, none of the bagging methods outperforms the simple ensemble in the given task.

3.4 Simple Ensemble of Last Layers

The SO-Net architecture consists of an explicitly defined encoder (computationally expensive) and a (fast) classifier. We can therefore check whether the accuracy growth of ensemble learning for the SO-Net architecture is determined mostly by the encoder or by the classifier. In the latter case, one could train an ensemble of classifiers only and thus reduce training time by a significant factor. To check this, for each of the 10 SO-Net encoder instances, 5 additional classifiers were trained with the same hyperparameters, constant encoder weights, and 31 training epochs. The average result of the 5-classifier ensemble is compared to the average result of the model with a single classifier (both averages are calculated over the 10 encoder instances).

| Parameter | Mean | Standard deviation |
|---|---|---|
| Instance accuracy mean | 92.43% | 0.19% |
| Instance accuracy ensemble | 92.69% | 0.28% |
| Class accuracy mean | 89.80% | 0.21% |
| Class accuracy ensemble | 89.98% | 0.28% |
| Instance accuracy increase | 0.25% | 0.12% |
| Class accuracy increase | 0.18% | 0.12% |

Table 4: The influence of a classifier ensemble with one encoder for the SO-Net architecture. Five classifiers were trained for each of the 10 encoders. As one can see, the computationally cheaper classifier ensemble does not yield a worthwhile accuracy gain.

Table 4 shows the results. They indicate that the encoder is responsible for most of the advantage of the ensemble. This means that the computationally cheaper classifier ensemble does not yield a worthwhile accuracy gain.

3.5 Influence of Random Factors in Simple Ensembles

We identified four random factors in SO-Net model training:

  • The order of training samples and random data augmentation (note that eliminating this factor means that samples are shuffled between training epochs but in the same way for every model instance);

  • Initial values of weights and biases of the neural network;

  • Random dropout regularization (eliminating this factor means fixing the dropout seed so certain neuron would be dropped, for example, always in epoch number 3, 7, 17, etc.);

  • Random order of massively parallel computations, resulting in different summation order, which is not alternating for floating point numbers.

The first three factors can be eliminated, whereas the last one cannot be eliminated. To verify the influence of each factor, five SO-Net model instances were trained for each configuration with one, two, or three random factors eliminated.
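The fourth factor can be illustrated with a short example. The sketch below is not taken from the SO-Net code; it only shows that floating-point addition depends on the order of operations, which is why a parallel reduction with a non-deterministic order can return slightly different results for identical inputs. The helper `sum_in_order` and the random test values are assumptions made for illustration.

```python
import random
from typing import List, Sequence


def sum_in_order(values: Sequence[float], order: Sequence[int]) -> float:
    """Accumulate the values sequentially in the given index order."""
    total = 0.0
    for index in order:
        total += values[index]
    return total


# Three numbers are enough to show that the grouping changes the result.
grouped_left = (0.1 + 0.2) + 0.3
grouped_right = 0.1 + (0.2 + 0.3)
print(grouped_left == grouped_right)  # False

# Summing many values of mixed magnitude in shuffled orders, as a parallel
# reduction might, typically produces several slightly different totals.
rng = random.Random(0)
values: List[float] = [
    rng.uniform(-1.0, 1.0) * 10.0 ** rng.randint(-8, 8) for _ in range(10_000)
]
totals = set()
for _ in range(20):
    order = rng.sample(range(len(values)), len(values))
    totals.add(sum_in_order(values, order))
print(len(totals), "distinct totals from 20 summation orders")
```

Each individual discrepancy is tiny, but during training such differences can be amplified over many gradient updates, so two runs with identical seeds may still end with different weights.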

| Removable random factors | Instance accuracy (mean) | Instance accuracy (ensemble) | Class accuracy (mean) | Class accuracy (ensemble) | Instance accuracy (increase) | Class accuracy (increase) |
| --- | --- | --- | --- | --- | --- | --- |
| const / const / const | 92.34% | 93.19% | 89.58% | 90.96% | 0.85% | 1.38% |
| const / const / random | 92.48% | 93.11% | 89.89% | 90.36% | 0.63% | 0.48% |
| const / random / const | 92.50% | 93.15% | 90.10% | 90.58% | 0.65% | 0.48% |
| random / const / const | 92.51% | 93.07% | 89.88% | 90.40% | 0.56% | 0.52% |
| random / random / const | 92.20% | 92.99% | 89.62% | 90.25% | 0.79% | 0.63% |
| random / const / random | 92.47% | 93.35% | 89.78% | 90.51% | 0.88% | 0.73% |
| const / random / random | 92.49% | 93.15% | 89.84% | 90.65% | 0.66% | 0.81% |
| random / random / random | 92.65% | 93.57% | 89.98% | 90.87% | 0.92% | 0.90% |

Table 5: The influence of random factors in the training of model instances on the ensemble accuracy gain. Accuracy increases even if the results of massively parallel computations are the only remaining (irremovable) random factor. All removable random factors in the training procedure have little influence on classification accuracy in ensemble learning.

The experiments were time-consuming (35 additional training sessions), and the results presented in Table 5 are still coarse, because only one constant sequence of values was considered for each factor. Nevertheless, accuracy increases even if all model instances in the ensemble were trained with the same training data order and augmentation, the same initial weights, and the same dropout order. This leads to the conclusion that, for the SO-Net architecture, diversity among models is caused mainly by the numerical effects of massively parallel computations.

3.6 Ensemble Methodology for Frustum PointNet

Since the first step of the Frustum PointNet approach uses a 2D CNN object detector, which is beyond the scope of this paper, we look at ensemble learning for the three point cloud processing modules of the network. A single experiment on the KITTI dataset was performed to verify the conclusions based on ModelNet40 and to plan further work on knowledge distillation with ensemble learning for real-world 3D data.

Frustum PointNet consists of three trainable modules. Ensembling can be used either for the last bounding-box predicting module or for all modules (denoted as and , respectively). Results for other configurations (e.g. ensemble of only segmentation modules) have been shared online. Given independently trained model instances, and can be defined as follows (compare with equation 1):

| Model | Car | Pedestrian | Cyclist |
| --- | --- | --- | --- |
| 2D proposal | 91.5 | 77.8 | 77.4 |
| Accuracy ground | | | |
| v1, no ensemble | 81.7 | 62.6 | 63.8 |
| v1, last module ensemble | 82.6 | 64.4 | 65.5 |
| v1, all modules ensemble | 82.9 | 65.9 | 66.9 |
| v2, no ensemble | 82.3 | 63.3 | 65.0 |
| v2, last module ensemble | 83.3 | 65.7 | 67.8 |
| v2, all modules ensemble | 83.6 | 66.5 | 68.3 |
| Accuracy 3D | | | |
| v1, no ensemble | 72.4 | 58.1 | 58.9 |
| v1, last module ensemble | 74.1 | 59.4 | 61.5 |
| v1, all modules ensemble | 74.4 | 60.1 | 62.9 |
| v2, no ensemble | 72.0 | 56.5 | 60.8 |
| v2, last module ensemble | 73.6 | 59.6 | 63.7 |
| v2, all modules ensemble | 74.0 | 60.5 | 65.0 |

Table 6: Frustum PointNet ensemble results for object classes.

| Model | Easy | Medium | Hard | All |
| --- | --- | --- | --- | --- |
| 2D proposal | 90.0 | 79.8 | 76.9 | 82.2 |
| Accuracy ground | | | | |
| v1, no ensemble | 79.1 | 67.6 | 61.5 | 69.4 |
| v1, last module ensemble | 80.3 | 69.0 | 63.2 | 70.8 |
| v1, all modules ensemble | 81.3 | 70.6 | 63.9 | 71.9 |
| v2, no ensemble | 78.8 | 68.9 | 63.0 | 70.2 |
| v2, last module ensemble | 80.8 | 71.2 | 64.8 | 72.3 |
| v2, all modules ensemble | 81.3 | 71.8 | 65.3 | 72.8 |
| Accuracy 3D | | | | |
| v1, no ensemble | 74.3 | 60.5 | 54.5 | 63.1 |
| v1, last module ensemble | 76.4 | 62.4 | 56.1 | 65.0 |
| v1, all modules ensemble | 77.3 | 63.2 | 56.9 | 65.8 |
| v2, no ensemble | 73.5 | 60.8 | 54.9 | 63.1 |
| v2, last module ensemble | 76.6 | 63.2 | 57.2 | 65.6 |
| v2, all modules ensemble | 77.5 | 64.0 | 58.2 | 66.5 |

Table 7: Frustum PointNet ensemble results for difficulty levels.

Three Frustum PointNet instances were trained for each of the v1 and v2 configurations. Accuracy is calculated for ground and 3D detection. Results are limited by the accuracy of the 2D bounding box proposals. Objects are divided into three classes (car, pedestrian, and cyclist) and three difficulty levels (easy, medium, and hard). Tables 6 and 7 present the accuracy for object classes and for difficulty levels, respectively.

Note that configuration v1 (based on PointNet) often outperforms v2 (based on PointNet++) when no ensemble is used, whereas with an ensemble of three model instances, v2 is better. The performance gain is also higher than that for ModelNet40 classification, which suggests that the training set is small relative to the complexity of the task. Averaging the outputs of all modules outperforms averaging only the output of the last module.
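The size of these gains can be read directly from the "All" column of the 3D detection results in Table 7. The short calculation below only restates the tabulated values; the dictionary layout and the helper `ensemble_gains` are illustrative choices.

```python
from typing import Dict

# "All" column of Table 7, 3D detection accuracy (values copied from the table).
accuracy_3d_all: Dict[str, Dict[str, float]] = {
    "v1": {"no ensemble": 63.1, "last module ensemble": 65.0, "all modules ensemble": 65.8},
    "v2": {"no ensemble": 63.1, "last module ensemble": 65.6, "all modules ensemble": 66.5},
}


def ensemble_gains(scores: Dict[str, float]) -> Dict[str, float]:
    """Absolute gain of each ensemble variant over the single-model baseline."""
    baseline = scores["no ensemble"]
    return {
        name: round(value - baseline, 1)
        for name, value in scores.items()
        if name != "no ensemble"
    }


gains = {version: ensemble_gains(scores) for version, scores in accuracy_3d_all.items()}
for version, gain in gains.items():
    print(version, gain)
```

Both configurations start from the same baseline of 63.1, but v2 gains 3.4 with the all-modules ensemble compared to 2.7 for v1. In both configurations, the all-modules ensemble gains more than the last-module ensemble.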

3.7 Comparison of Computational Runtime

We benchmark the speed of all the architectures on the Jetson TX2 platform. We chose a mini-batch of four point clouds because it seemed to be a reasonable number of segmented objects visible in a typical mobile robot environment. One has to keep in mind that we have not performed any target-specific optimization. All approaches used GPU acceleration (NVIDIA CUDA) and were run in three different deep learning frameworks, following the original authors' implementations of these methods.
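A timing measurement of this kind can be sketched as follows. This is an assumed procedure rather than the benchmarking code used in the study: the function `benchmark`, the stand-in `toy_model`, the number of warm-up and timed runs, and the 256 points per cloud are all hypothetical. Only the mini-batch size of four comes from the text.

```python
import statistics
import time
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]
PointCloud = List[Point]


def benchmark(
    infer: Callable[[Sequence[PointCloud]], object],
    batch: Sequence[PointCloud],
    warmup: int = 3,
    repeats: int = 10,
) -> Tuple[float, float]:
    """Return mean and standard deviation of the inference time in seconds."""
    for _ in range(warmup):  # warm-up runs are excluded from the statistics
        infer(batch)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        infer(batch)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)


def toy_model(batch: Sequence[PointCloud]) -> List[Point]:
    """Stand-in for a network: returns the centroid of each point cloud."""
    centroids = []
    for cloud in batch:
        n = len(cloud)
        centroids.append(tuple(sum(p[axis] for p in cloud) / n for axis in range(3)))
    return centroids


# Mini-batch of four point clouds, as in the benchmark described above.
batch = [[(float(i), float(i % 7), float(i % 3)) for i in range(256)] for _ in range(4)]
mean_s, std_s = benchmark(toy_model, batch)
print(f"{mean_s * 1e3:.3f} ms +/- {std_s * 1e3:.3f} ms per mini-batch")
```

When the model runs asynchronously on a GPU, the device has to be synchronized before the clock is read. Otherwise only the kernel launch time is measured.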

Figure 7: Comparison of the inference time on the Jetson TX2 platform (bottom) and the number of parameters (top) for each tested model. DeepSets has the smallest number of parameters and is significantly faster than the other approaches.

As depicted in Figure 7, DeepSets has the smallest number of parameters and is significantly faster than the other approaches. Surprisingly, PointNet has a considerable number of parameters but is also fast. It is worth pointing out that an increase of a few percent in classification accuracy (e.g. with SO-Net or PointCNN) comes at the cost of significantly longer execution times.

4 Conclusion

In this article, we examine ensemble learning for 3D point cloud classification with seven of the most popular architectures that operate on raw point sets, and the possibility of improving the classification accuracy of each of these models through ensemble learning. First, we observed which voting policy is best for the task. Second, we found slightly better classification accuracy, along with a smaller standard deviation, as the number of models in the ensemble increases. This indicates that the ensemble's output is more stable and reliable. The biggest gains in mean instance classification accuracy were observed for KCNet and SO-Net; the gain for the other architectures is reported as an average. The significant increase in classification accuracy achieved by KCNet in comparison with all other architectures could be caused by its different underlying working principle, which uses kernel correlation as a measure of neuron activation. It could also suggest that there is room for further hyperparameter tuning in KCNet, e.g. the number of filters (sets of kernels).

The comparative study of different models yields some interesting observations. For example, there are classes for which the simplest (global) approaches (PointNet, DeepSets) achieve the best classification accuracy. This suggests that the general shape of some objects is more important than the shape of local structures, which could mislead more complex models.

We also showed that ensembles of models with different architectures can further improve the overall accuracy. SO-Net achieved the highest instance and mean class classification accuracy, but PointCNN wins in the number of classes for which a given network obtained the highest accuracy after ensembling (this score is also high for DGCNN). This suggests that the latter can be much better at classifying some classes and much worse at others. This could also explain why the ensemble of only two model instances with different architectures leads to an increase in instance classification accuracy over the state-of-the-art results that is equal to the one obtained with 10 instances of the SO-Net model. Further gain can be achieved by using multiple model instances for each of the mixed architectures. Instance accuracy of 94.03% can be obtained for two architectures and 94.15% when combining three or four architectures.

The performance gain is even higher for the Frustum PointNet architecture evaluated on the real-world KITTI dataset, using only three model instances.

We analyzed the sources of randomness in ensemble learning for SO-Net. We observed that the numerical effects of massively parallel computations in deep neural networks are essential and beneficial in ensemble learning. An ensemble of several classifiers with the same encoder does not result in a significant performance gain. A simple ensemble outperforms classic bagging for the tested approaches.

In addition, we provide some tips for deploying point cloud classification on a mobile robot equipped with the Jetson TX2 platform by comparing the inference time of all the tested models.

There are more questions one can ask around the topic of ensemble learning for point cloud processing. In our opinion, results of this study could leverage the benefit of knowledge distillation in real-world 3D object detection for autonomous cars and mobile robots.