Code for Revisiting Point Cloud Classification: A New Benchmark Dataset and Classification Model on Real-World Data, ICCV 2019
Deep learning techniques for point cloud data have demonstrated great potential in solving classical problems in 3D computer vision such as 3D object classification and segmentation. Several recent 3D object classification methods have reported state-of-the-art performance on CAD model datasets such as ModelNet40 with high accuracy (around 92%). In this paper, we argue that object classification is still a challenging task when objects are framed in real-world settings. To prove this, we introduce ScanObjectNN, a new real-world point cloud object dataset based on scanned indoor scene data. From our comprehensive benchmark, we show that our dataset poses great challenges to existing point cloud classification techniques, as objects from real-world scans are often cluttered with background and/or are partial due to occlusions. We identify three key open problems for point cloud object classification, and propose new point cloud classification neural networks that achieve state-of-the-art performance on classifying objects with cluttered background. Our dataset and code are publicly available on our project page: https://hkust-vgd.github.io/scanobjectnn/
The task of understanding our real world has achieved a great leap in recent years. The rise of powerful computational resources such as GPUs and the availability of 3D data from depth sensors have accelerated the fast-growing field of 3D deep learning. Among various 3D data representations, point clouds are widely used in computer graphics and computer vision thanks to their simplicity. Recent works have shown great promises in solving classical scene understanding problems with point clouds such as 3D object classification and segmentation.
However, the current progress on classification with 3D point clouds shows a trend of performance saturation. For example, many recent object classification methods reported very high accuracies in 2018, and the trend of pushing accuracy towards perfection is still ongoing. This phenomenon inspires us to ask whether problems such as 3D object classification have truly been solved, and to think about how to move forward.
To answer this question, we perform a benchmark of existing point cloud object classification techniques with both synthetic and real-world data. For synthetic objects, we use ModelNet40 , the most popular dataset in point cloud object classification that contains about 10,000 CAD models. To support the investigation of object classification methods on real-world data, we introduce ScanObjectNN, a new point cloud object dataset from the state-of-the-art scene mesh datasets SceneNN  and ScanNet . Based on the initial instance segmentation from the scene datasets, we manually filter and select objects for 15 common categories, and further enrich the dataset by considering additional object perturbations.
Our study shows that while the accuracy on CAD data is approaching perfection, learning to classify a real-world object dataset is still a very challenging task. By analyzing the benchmark results, we identify three open issues that are worth further exploration in future research. First, classification models trained on synthetic data often do not generalize well to real-world data such as point clouds reconstructed from RGB-D scans [19, 9], and vice versa. Second, challenging in-context and partial observations of real-world objects are common due to occlusions and reconstruction errors; for example, they arise in window-based object detectors used in many robotics or autonomous vehicle applications. Finally, background elements that appear together with objects due to clutter in real-world scenes must be handled effectively.
As our dataset opens up opportunities to tackle such open problems in real-world object classification, we also present a new method for point cloud object classification that can improve upon the state-of-the-art results on our dataset by jointly learning the classification and segmentation tasks in a single neural network.
In summary, we make the following contributions:
A new object dataset from meshes of scanned real-world scene for training and testing point cloud classification,
A comprehensive benchmark of existing object classification techniques on synthetic and real-world point cloud data,
A new network architecture that is able to classify objects observed in a real-world setting by a joint learning of classification and segmentation.
In this paper, we focus on object classification with point cloud data, which has advanced greatly in the past few years. We briefly discuss the related works and their datasets, below.
Early attempts at classifying point clouds adapted ideas from deep learning on images, e.g., using multi-view images [39, 48, 46, 22] or applying convolutions on 3D voxel grids [27, 43]. While it seems natural to extend convolution from 2D to 3D, performing convolutions on a point cloud is not a trivial task [30, 49]: a point cloud has no well-defined order of points on which convolutions can be performed. Qi et al. addressed this problem by learning global features of point clouds with a symmetric function that is invariant to the order of points. Alternatively, other methods learn local features from convolutions, e.g., [31, 25, 20, 42, 44, 18, 2, 24, 33, 11], or from autoencoders. There are also methods that jointly learn features from point clouds and multi-view projections, treat point clouds and views as sequences [26, 17, 15], or use unsupervised learning.
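The order-invariance idea can be illustrated with a toy sketch (our own simplification, not the authors' code): a shared per-point transform followed by max-pooling produces the same global feature for any permutation of the input points.

```python
import numpy as np

def symmetric_global_feature(points, weight, bias):
    """Shared per-point transform + ReLU, then max-pooling over points.
    Max-pooling is a symmetric function, so the result is order-invariant."""
    per_point = np.maximum(points @ weight + bias, 0.0)  # (N, F)
    return per_point.max(axis=0)                          # (F,)

rng = np.random.default_rng(0)
pts = rng.normal(size=(128, 3))
w, b = rng.normal(size=(3, 16)), np.zeros(16)

feat = symmetric_global_feature(pts, w, b)
feat_shuffled = symmetric_global_feature(rng.permutation(pts), w, b)
assert np.allclose(feat, feat_shuffled)  # same feature for any point order
```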
Recent works demonstrate very competitive and compelling performance on standard datasets. For example, the gap between state-of-the-art methods such as SpecGCN, SpiderCNN, DGCNN, and PointCNN is less than 1% on the ModelNet40 dataset. In the online leaderboard maintained by the authors of ModelNet40, the accuracy of the object classification task is approaching perfection, with point cloud methods reaching about 92% [25, 42, 44, 26].
There are a limited number of datasets that can be used to train and test 3D object classification methods. ModelNet40 was originally developed by Wu et al. for learning a convolutional deep-belief network to model 3D shapes represented as voxel grids. Objects in ModelNet40 are CAD models of 40 common categories such as airplane, motorbike, chair, and table. This dataset has been a common benchmark for point cloud object classification. ShapeNet is an alternative large-scale dataset of 3D CAD shapes spanning 55 categories. However, this set is usually used for benchmarking part segmentation.
So far, object classification on ModelNet40 has been done under the assumption that objects are clean, complete, and free from any background noise. Unfortunately, this assumption does not often hold in practice. It is common to see incomplete (partial) objects due to imperfect 3D reconstruction. In addition, objects in real-world settings are often scanned while placed in a scene, which makes them appear in clutter, possibly attached to background elements. A potential treatment is to remove such background using human annotators. However, this solution is tedious, prone to errors, and subjective to the experience of annotators. Other works synthesize challenges on CAD data by introducing noise simulated by Gaussians [4, 12] or created with a parametric model to mimic real-world scenarios. Recently, the trend of sim2real also aims to bridge the gap between synthetic and real data.
Prior to our work, there were also a few datasets of real-world object scans [10, 8, 5], but most are small in scale and are not suitable for training object classification networks, which often have a large number of parameters. For example, in robotics, the Sydney urban objects dataset contains only 631 objects of 26 categories captured by a LiDAR sensor, and is mainly used for evaluation [27, 2] but not for training. Some datasets [36, 5] were captured in controlled environments, which may differ greatly from real-world scenes. Choi et al. proposed a dataset of more than 10,000 object scans in the real world. However, not all of their scans can be successfully reconstructed; the online repository by the authors provides only about 400 reconstructed objects. RGB-D and 3D scene mesh datasets [19, 9, 1, 37, 34] have more objects that are reconstructed along with the scenes, but such objects are usually considered in a scene segmentation or object detection task, not under an object classification setup. The RGBD-to-CAD object classification challenge [21, 29] provides an object dataset that mixes CAD models and real-world scans. Its goal is to classify RGB-D objects so that a retrieval can be done to find similar CAD models. However, several categories are ambiguous, and objects are supposed to be well segmented before classification. ScanNet has a benchmark on 3D object classification with partially scanned objects. However, this dataset is designed for volume-based object classification, and quite few techniques report results on this data.
Our goal is to quantitatively analyze the performances of existing object classification methods on point clouds. We split our task into two parts: benchmarking with synthetic data and with real-world data.
For synthetic data, we experiment with the well-known ModelNet40 dataset. This set is a collection of CAD models with 40 object categories, including 9,840 objects for training and 2,468 objects for testing. The objects in ModelNet40 are synthetic, and thus are complete, well-segmented, and noise-free. In this experiment, we use the uniformly dense point cloud variant as preprocessed by Qi et al. Each point cloud is randomly sampled to 1024 points as input to the networks unless otherwise stated. The point clouds are centered at zero, and we use normalized local coordinates as point attributes. We follow the default train/test split, and use the default parameters as in the original implementations of the methods. Our benchmark is performed with an NVIDIA Tesla P100 GPU. We re-trained PointNet, PointNet++, PointCNN, Dynamic Graph CNN (DGCNN), 3D modified Fisher Vector (3DmFV), and SpiderCNN. For the remaining methods, we provide the results reported in the original papers. We additionally report each method's best performance when provided with additional information such as point normals. The results are shown in Table 1. It can be observed that the performance gains of recent methods are becoming incremental, fluctuating around 92%. This saturating score inspires us to revisit the object classification problem: Can classification methods trained on ModelNet40 perform well on real-world data? Or is there still room for more research problems to be explored?
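The preprocessing described above can be sketched as follows (our own code; the exact sampling and normalization details may differ between implementations):

```python
import numpy as np

def preprocess(points, n_samples=1024, seed=0):
    """Randomly sample a fixed number of points, center the cloud at zero,
    and scale coordinates into [-1, 1] (one common normalization choice)."""
    rng = np.random.default_rng(seed)
    # Sample with replacement only if the cloud is smaller than n_samples.
    idx = rng.choice(len(points), n_samples, replace=len(points) < n_samples)
    pts = points[idx]
    pts = pts - pts.mean(axis=0)        # center at zero
    return pts / np.abs(pts).max()      # normalize local coordinates
```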
Table 1 (excerpt): accuracy on ModelNet40.

| Method | Avg. class acc. (%) | Overall acc. (%) |
| --- | --- | --- |
| Kd-Net | 88.5 | 90.6 (91.8 *) |
| PointNet++ | 87.8 | 90.7 (91.9 w/ normal) |
| SO-Net | 87.3 | 90.9 (93.4 w/ normal) |
| SpecGCN | - | 91.5 (92.1 w/ normal) |
| SpiderCNN | 86.8 | 90.0 (92.4 w/ normal) |
Objects obtained from real-world 3D scans are significantly different from CAD models due to the presence of background noise and the non-uniform density caused by holes from incomplete scans/reconstructions and occlusions. This situation is often seen in sliding window-based object detection, in which a window may enclose an object of interest only partially and also include background elements. Due to these properties, applying existing point cloud classification methods to real-world data may not produce results as good as on CAD models.
To study this potential issue, we build a real-world object dataset based on two popular scene mesh datasets: SceneNN and ScanNet. SceneNN has 100 annotated scenes with highly cluttered objects, while ScanNet has a larger collection of 1513 indoor scenes. From a total of more than 1600 scenes in SceneNN and ScanNet, we selected 700 unique scenes. We then manually examined each object, fixed inconsistent labels, and discarded objects that are ambiguous, have low reconstruction quality, have unknown labels, are too sparse, or have too few instances to form a category for training. During categorization, we also took inter-class balance into account to avoid any bias potentially coming from classes with more samples.
The result is a set of objects categorized into 15 categories. The raw objects are represented by a list of points with global and local coordinates, normals, color attributes, and semantic labels. As in the experiment with synthetic data, we sample all raw objects to 1024 points as input to the networks, and all methods were trained using only the local coordinates. We will make our dataset publicly available for future research. Table 2 summarizes the classes and objects in our dataset.
Based on the selected objects, we construct several variants that represent different levels of difficulty of our dataset. This allows us to explore the robustness of existing classification methods in more extreme real-world scenarios.
The first variant is referred to as OBJ_ONLY which includes only ground truth segmented objects extracted from the scene meshes datasets. This variant has the closest form analogous to its CAD counterpart, and is used to investigate the robustness of classification methods to noisy objects with deformed geometric shape and non-uniform surface density. Sample objects of this variant are shown in Figure 2(a).
The previous variant assumes that an object can be accurately segmented before being classified. However, in real-world scans, objects are often under-segmented, i.e., background elements or parts of nearby objects are included, and accurate annotations for such under-segmentations are not always available. Background elements may provide the context an object belongs to, and thus be a good hint for classification, e.g., laptops often sit on desks. However, they may also introduce distractions that corrupt the classification, e.g., a pen may be under-segmented with the table on which it sits and thus be considered a part of the table rather than a separate object. To study these factors, we introduce a variant of our dataset where objects are attached with background data (OBJ_BG). We determine such background by using the ground truth axis-aligned object bounding boxes. Specifically, given a bounding box, all points in the box are extracted to form an object. Sample objects with background are shown in Figure 2(b).
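Extracting an OBJ_BG sample amounts to keeping every scene point inside the ground-truth axis-aligned box. A minimal sketch (hypothetical helper name, not the released code):

```python
import numpy as np

def crop_with_background(scene_points, box_min, box_max):
    """Return every scene point inside an axis-aligned bounding box,
    so background points inside the box stay attached to the object."""
    inside = np.all((scene_points >= box_min) & (scene_points <= box_max), axis=1)
    return scene_points[inside]
```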
The given bounding boxes from the ground-truth tightly enclose the objects. However, in real-world scenarios bounding boxes may over- or under-cover, or even split objects. For example, in object detection techniques such as R-CNN , object category has to be predicted from a rough bounding box that localizes a candidate object. To simulate this challenge, we extend our dataset by translating, rotating (about the gravity axis), and scaling the ground truth bounding boxes before extracting the geometry in the box. We name the variants of these perturbations with a common prefix PB.
The perturbations introduce various degrees of background and partiality to objects. In this work, we use four perturbation variants in increasing order of difficulty: PB_T25, PB_T25_R, PB_T50_R, and PB_T50_RS. The suffixes _T25 and _T50 denote translations that randomly shift the bounding box by up to 25% and 50% of its size from the box centroid along each world axis. The suffixes _R and _S denote rotation and scaling. Each perturbation variant contains five random samples for each original object, resulting in up to five times as many perturbed objects in total. Since perturbation might introduce invalid objects, e.g., objects that are almost completely out of the bounding box of interest, we perform an additional check after perturbation by ensuring that at least 50% of the original object points remain in the bounding box. Objects that do not satisfy this condition are discarded. Sample point clouds of these variants are shown in Figure 3. More details about perturbing objects can be found in our supplementary material.
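The perturbation procedure can be sketched as follows (our own simplification; the scaling range and the random distributions are assumptions, not values taken from the paper):

```python
import numpy as np

def perturb_box(center, size, max_shift=0.5, rotate=True, scale=True, rng=None):
    """Translate a box by up to max_shift * size per axis, rotate it about
    the gravity (z) axis, and rescale it. The 0.8-1.2 scale range is assumed."""
    rng = rng or np.random.default_rng()
    new_center = center + rng.uniform(-max_shift, max_shift, 3) * size
    theta = rng.uniform(0.0, 2.0 * np.pi) if rotate else 0.0
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])  # rotation about z
    new_size = size * rng.uniform(0.8, 1.2) if scale else size
    return new_center, new_size, rot

def valid_sample(in_box_mask, min_ratio=0.5):
    """Keep a perturbed sample only if at least min_ratio of the original
    object's points still fall inside the perturbed box."""
    return in_box_mask.mean() >= min_ratio
```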
For a clearer picture of the maturity of point cloud-based object classification, we benchmark several representative methods on our dataset. We aim to identify the limitations of current works on real-world data. We choose 3DmFV , PointNet , SpiderCNN , PointNet++ , DGCNN  and PointCNN as our representative works.
We first study the case when training is done on ModelNet40 and testing is done on ScanObjectNN. Since objects in ModelNet40 are standalone with no background objects, we also removed background in all our variants for fair evaluations. Furthermore, we only evaluated the current methods on 11 (out of 15) common classes between ModelNet40 and our dataset. Please refer to the supplementary material for the details on these common classes.
Evaluation results are reported in Table 3. These results show that current techniques trained on CAD models are not able to generalize to real-world data; all techniques achieved less than 50% accuracy. This is expected because real-world objects and CAD objects differ significantly in their geometry. Real-world objects are often incomplete and partial due to reconstruction errors and occlusions; their surfaces carry low-frequency noise; and object boundaries are inaccurate. In contrast, CAD objects are often clean and noise-free. We also found that the harder the data (more noise and partiality), the lower the performance, consistently across all techniques. In other words, knowledge learned from synthetic objects in ModelNet40 is not well transferable and/or applicable to real-world data.
In this experiment, we train and test the techniques on ScanObjectNN to demonstrate that training on datasets with real-world properties should improve the performance in classifying real-world objects. We also analyze how different perturbations affect the classification performance. We randomly split our dataset into a training (80%) and a test (20%) set. We ensure that the training and test sets contain objects from different scenes so that similar objects do not occur in both sets, e.g., the same types of chairs are often found in the same room. We report the performance of all the techniques on the hardest split in Table 4. Full performances on all splits are provided in our supplementary material.
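A scene-level split of this kind can be sketched as follows (illustrative code, not the released split script):

```python
import numpy as np

def split_by_scene(objects, scene_ids, train_ratio=0.8, seed=0):
    """Assign whole scenes to either split, so near-identical objects from
    the same room never appear in both the training and the test set."""
    rng = np.random.default_rng(seed)
    scenes = np.unique(scene_ids)
    rng.shuffle(scenes)
    train_scenes = set(scenes[:int(train_ratio * len(scenes))])
    train = [o for o, s in zip(objects, scene_ids) if s in train_scenes]
    test = [o for o, s in zip(objects, scene_ids) if s not in train_scenes]
    return train, test
```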
For fair comparisons, we kept the same data augmentation process in all the methods (e.g., random rotation and per-point jitter). We trained the methods to convergence rather than selecting the best performance on the test set.
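The augmentation just described can be sketched as follows (the jitter magnitude and clipping values are common defaults in point cloud pipelines, not values specified in this paper):

```python
import numpy as np

def augment(points, rng=None, jitter_sigma=0.01, jitter_clip=0.05):
    """Random rotation about the up (z) axis plus clipped per-point
    Gaussian jitter; sigma/clip are assumed defaults, not paper values."""
    rng = rng or np.random.default_rng()
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    jitter = np.clip(rng.normal(0.0, jitter_sigma, points.shape),
                     -jitter_clip, jitter_clip)
    return points @ rot.T + jitter
```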
Vanilla. The 2nd column in Table 4 shows the overall performance of existing methods when trained on the simplest variant of our dataset (OBJ_ONLY). The classification accuracy increases significantly when training and testing are both done on ScanObjectNN versus when training is done on ModelNet40 (Table 3, Column 2). However, we also notice an observable performance drop compared to the purely synthetic setting in Table 1. This gives an important message: point cloud classification on real-world data is still open; a dataset with real-world properties helps, but further research is necessary to regain the high performance of the synthetic setting.
In the following, we investigate the performance change in different types of perturbations in our dataset.
Background. As shown in Table 4, Columns 3-7, background has a strong impact on the classification performance of all methods. Specifically, except for PointCNN, all methods performed worse on OBJ_BG compared with OBJ_ONLY. This can be explained by background elements distracting the learning, as existing methods confuse foreground and background points.
To further confirm the negative effect of background, we conduct a control experiment using the hardest perturbation variant, i.e., PB_T50_RS. Table 5 shows that the overall accuracy of all existing models decreases when they are trained and tested with background present.
Table 4 also shows the impact of the perturbations on classification performance (compared with Column 2). We observe that translation and rotation both decrease the classification performance significantly, especially with larger perturbations that introduce more background and partiality. Scaling further degrades the performance by a small margin.
Figure 4 illustrates the confusion matrices of all methods on our hardest variant PB_T50_RS. It can be seen that there are no major ambiguity issues in our categories, and our dataset is challenging due to the high variations in real-world data.
Generalization to CAD Data. While networks trained on synthetic data generalize poorly to our dataset (Table 3), the reverse direction fares better. Here we tested the generalization capability of existing methods when trained on ScanObjectNN. In this experiment, all methods were trained on our PB_T50_RS (with and without background) and tested on ModelNet40. The results in the last two columns of Table 5 clearly show that existing methods generalize better when trained on real-world data (compared with the results in Table 3). Performance on individual classes is presented in Table 6. As shown there, lower accuracies are achieved on classes such as bed, cabinet, and desk, whose complete structures are never observed in real scans because these objects are often situated adjacent to walls or in corners of rooms. Therefore, we advocate using real-world data in training object classification because the generalization is shown to be much better.
We further support part-based annotation in our dataset. So far, point cloud classification methods only evaluate part segmentation task on ShapeNet . However, there has been no publicly available dataset for part segmentation on real-world data despite the availability of scene meshes datasets [19, 9]. We close this gap with our dataset, which will be released for future research. Figure 5 shows a visualization of part segmentation on our data. Table 7 and Table 8 provide a baseline part segmentation evaluation on our data. Using these part annotations may also improve partial object classification in the future.
Our quantitative evaluations show that performing object classification on real-world data is challenging. The state-of-the-art methods in our benchmark have up to 78.5% accuracy on our hardest variant (PB_T50_RS). The benchmark also helps us recognize the following open problems:
Background is expected to provide context information but also introduce noise. It is desirable to have an approach that can distinguish foreground from background to effectively exploit context information in the classification task.
Object partiality, caused by low reconstruction quality or inaccurate object proposals, also needs to be addressed. Part segmentation techniques [30, 25] could help to describe partial objects.
Generalization between CAD models and real-world scans needs more investigations. In general, we found that training on real-world data and testing on CADs can generalize better than the opposite case. It could be explained that real-world data have more variations including background and partiality as discussed above. However, CAD models are still important because real-world scans are seldom complete and noise free. Bridging this domain gap could be an important research direction.
To facilitate future work, in the next sections, we propose ideas and baseline solutions.
We propose here a simple deep network to handle the occurrence of background in point clouds obtained from real scans; this is one of the open problems raised in the previous section. An issue with existing point cloud classification networks is their lack of capability to distinguish between foreground and background points. In other words, existing methods take point clouds as a whole and directly compute features for classification. This issue stems from the design of these networks and also from the simplicity of available training datasets, e.g., ModelNet40.
To tackle this issue, our idea is to make the network aware of the presence of background by adding a segmentation-guided branch to the classification network. The segmentation branch predicts an object mask that separates the foreground from the background. Note that the mask can be easily obtained from our training data since our objects are originally from scene instance segmentation datasets [19, 9].
In particular, we use three levels of set abstraction from PointNet++ to extract point cloud global features. The global features are then passed through three fully connected layers to produce the object classification score. Dropout is used in a similar manner to the original PointNet++ architecture. Three PointNet feature propagation modules are then employed to compute object masks for segmentation. The feature vector just before the last fully connected layer of the classification branch is used as input to the first feature propagation module, making the predicted object mask driven by the classification output. We train both branches jointly. The loss function is the weighted sum of the classification and segmentation losses:

L = L_cls + λ L_seg,

where L_cls and L_seg are cross-entropy losses between the predicted and ground-truth class labels and object masks, respectively. The weight λ is fixed in our experiments.
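The joint objective, the sum of a classification loss and a per-point segmentation loss, can be sketched in a few lines of numpy (illustrative only; `lam` stands for the segmentation weight, whose exact value is not specified here):

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross entropy; labels are integer class indices."""
    z = logits - logits.max(axis=-1, keepdims=True)   # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def joint_loss(cls_logits, cls_label, seg_logits, seg_labels, lam=1.0):
    """Object-classification loss plus lam-weighted per-point
    foreground/background segmentation loss, trained jointly."""
    l_cls = cross_entropy(cls_logits[None, :], np.array([cls_label]))
    l_seg = cross_entropy(seg_logits, seg_labels)
    return l_cls + lam * l_seg
```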
Joint learning for both classification and segmentation with the use of object masks allows the network to be aware of relevant points (, acknowledge the presence of background points). In addition, using classification prediction as a prior to segmentation guides the network to learn object masks that are consistent with the true shape of desired object classes. As to be detailed in our experiments, jointly learning classification and mask prediction results in better classification accuracy in noisy scenarios.
Furthermore, we also introduce BGA-DGCNN, a background-aware network based on DGCNN. We apply the same concept as in BGA-PN++, jointly predicting both classification and segmentation, where the last fully connected layer of the classification branch is used as input to the segmentation branch. Our experimental results show that our BGA model adapts to different network architectures.
We evaluate our network on both our dataset and ModelNet40. Table 9 compares our network with existing ones on our hardest variant PB_T50_RS and on ModelNet40. Our BGA models, BGA-PN++ and BGA-DGCNN, both outperform their vanilla counterparts, with BGA-PN++ achieving the best performance on PB_T50_RS. On ModelNet40, our BGA-PN++ improves upon PointNet++ by almost 5% (reaching 52.6% accuracy), while our BGA-DGCNN achieves the top performance of 56.5%. Note that in this evaluation all methods were trained on our PB_T50_RS. As shown, our BGA models gain improvements on both ModelNet40 and our dataset.
In addition, we also evaluated the segmentation performance of our network. Experimental results showed that our BGA-PN++ performed at 77.6% and 71.0%, while our BGA-DGCNN achieved 78.5% and 74.3% of segmentation accuracy on our PB_T50_RS and ModelNet40, respectively. We visualize some of the object masks predicted by our BGA-PN++ in Figure 7. It can be seen that our proposed network is able to mask out the background fairly accurately.
While both BGA models demonstrate good performance, we found that DGCNN-based networks generalize well between real and CAD data, e.g., when trained on real data and tested on CAD data (Table 9) and vice versa (Table 3). Moreover, Table 3 also shows that the same holds for DGCNN-based models in the synthetic-to-real case. More investigation of the DGCNN architecture could lead to models that generalize better and bridge the gap between synthetic and real data.
Our proposed BGA is not without limitations. In general, it requires object masks and background to be included in the data. Figure 8(a) shows a failure case of our method when evaluated on a background-free ModelNet40 object.
This paper revisits state-of-the-art object classification methods on point cloud data. We found that existing methods are successful on synthetic data but struggle on realistic data. To show this, we built a new real-world object dataset containing objects in 15 categories. Compared with current datasets, our dataset offers more practical challenges, including background occurrence, object partiality, and different deformation variants. We benchmarked existing methods on our new dataset, discussed issues, identified open problems, and suggested possible solutions. We also proposed a new point cloud network to classify objects with background. Experimental results showed the advantage of our method on both synthetic and real-world object datasets.
This research project is partially supported by an internal grant from HKUST (R9429).
3DmFV: Three-Dimensional Point Cloud Classification in Real-Time Using Convolutional Neural Networks. IEEE Robotics and Automation Letters.
RotationNet: Joint Object Categorization and Pose Estimation Using Multiviews from Unsupervised Viewpoints. In CVPR.