SESS: Self-Ensembling Semi-Supervised 3D Object Detection

12/26/2019 ∙ by Na Zhao, et al. ∙ National University of Singapore

The performance of existing point cloud-based 3D object detection methods heavily relies on large-scale, high-quality 3D annotations. However, such annotations are often tedious and expensive to collect. Semi-supervised learning is a good alternative to mitigate the data annotation issue, but it has remained largely unexplored in 3D object detection. Inspired by the recent success of the self-ensembling technique in semi-supervised image classification, we propose SESS, a self-ensembling semi-supervised 3D object detection framework. Specifically, we design a thorough perturbation scheme to enhance generalization of the network on unlabeled and new unseen data. Furthermore, we propose three consistency losses to enforce consistency between two sets of predicted 3D object proposals, to facilitate the learning of structure and semantic invariances of objects. Extensive experiments conducted on the SUN RGB-D and ScanNet datasets demonstrate the effectiveness of SESS in both inductive and transductive semi-supervised 3D object detection. Our SESS achieves competitive performance compared to the state-of-the-art fully-supervised method by using only 50% labeled data.




1 Introduction

Figure 1: Semi-supervised 3D object detection pipeline. Our SESS can predict 3D bounding boxes and semantic labels of objects for an unlabeled scene after training with a mixture of labeled data and unlabeled data.

Point cloud-based 3D object detection is the task of estimating the object category and oriented 3D bounding box for every object in a scene. This task has long been of great interest to the computer vision and robotics communities due to its potential real-world applications in areas such as autonomous driving, domestic robotics, and augmented/virtual reality. In recent years, many deep learning-based approaches for point cloud-based 3D object detection [1, 7, 9, 11, 12, 16, 17, 18, 24, 27, 29] have emerged and achieved strong performance on various benchmark datasets [2, 3, 19]. Despite these impressive results, most existing deep learning-based approaches for 3D object detection on point clouds are strongly supervised and require a large amount of well-annotated 3D data that is often time-consuming and expensive to collect.

Semi-supervised learning is a promising alternative to strongly supervised learning for point cloud-based 3D object detection. This is because semi-supervised learning requires only a small amount of labeled data, which largely alleviates the difficulty of collecting an enormous amount of labels. Furthermore, the few available strong labels can still provide the necessary supervision to guide the deep network into learning the correct information for 3D object detection, and information from these labels can also be propagated to the unlabeled data to improve learning. A complete removal of strong labels from the training data would make it extremely challenging for the deep network to learn anything meaningful. This is due to the inherent difficulty for a deep network to precisely detect 3D bounding boxes of objects in a point cloud, where points are sparsely distributed, and/or the scene is partially visible and incomplete due to occlusions and 3D amodal perception. To the best of our knowledge, [21] is currently the only existing work that learns a deep network for point cloud-based 3D object detection without full strong supervision. More specifically, they propose cross-category semi-supervised learning, where 3D ground truth labels are needed for a subset of object categories, i.e. the strong object classes, and 2D ground truth labels are required for all object classes. Although promising results are achieved in [21], the approach requires RGB-D input and does not work on pure 3D point clouds. Moreover, it still requires a large amount of 3D labels for the strong object classes.

In view of the potential of semi-supervised learning and the limitations of [21], we address the in-category semi-supervised 3D object detection problem with 3D point clouds as the only input in this paper. In contrast to cross-category semi-supervision, in-category semi-supervision means that the training data contains few strongly labeled point clouds and a large number of unlabeled point clouds. Furthermore, the strongly labeled point clouds are assumed to contain all object classes of interest, albeit few examples per object class. To this end, we propose SESS: a self-ensembling semi-supervised 3D object detection framework for point clouds. More specifically, our SESS achieves semi-supervision with a Mean Teacher paradigm [22] that contains a teacher and a student 3D object detection network. The teacher guides the predictions of the student to be consistent with its predictions under random perturbations, where these predictions are sets of 3D object proposals. In other words, we want the 3D object proposals from both the teacher and student networks to be aligned at the end of the training stage. We propose three consistency losses based on the center, class and size of the 3D object proposals to encourage their alignment. Our three consistency losses encode both geometric and semantic information to guide the network towards learning precise coordinates of the 3D bounding boxes and accurate object categories. We conduct experiments with our SESS framework on two benchmark datasets. Promising results over baseline and strongly supervised approaches validate our semi-supervised learning approach for the challenging task of point cloud-based 3D object detection.

The main contributions of this work are as follows.

  • We propose SESS: a novel self-ensembling semi-supervised point cloud-based 3D object detection framework. Our semi-supervised learning framework requires only a small amount of strong labels, alleviating the need for large amounts of 3D annotations.

  • Our SESS follows the Mean Teacher paradigm, where we design a perturbation scheme and three consistency losses. Our three consistency losses encode both geometry and semantic information to guide the network towards learning precise coordinates of the 3D bounding boxes and accurate object categories.

  • We achieve competitive performance as compared to the state-of-the-art fully supervised methods with only 50% labeled data, and we experimentally verify the effectiveness of our SESS in both inductive and transductive semi-supervised learning settings.

2 Related work

2.1 3D Object Detection

A number of approaches have been proposed for the 3D object detection task; they can be briefly categorized into three types based on their input data formats: 2D projection [8, 9, 18, 26], voxel grid [1, 7, 15, 16, 20, 25, 29], and point cloud [5, 11, 12, 17, 24, 27, 28]. The 2D projection and voxel grid based methods circumvent the difficulty of processing irregular point clouds by either projecting the 3D data into 2D representations (e.g. front view or bird's eye view) or voxelizing it into regular grids. To efficiently localize 3D objects in the point cloud of a 3D space, [5, 12, 24] leverage mature 2D object detectors to trim a 3D bounding frustum for each detected object, for the sake of 3D search space reduction, while [11, 17, 27, 28] exploit the sparsity of 3D data and generate 3D proposals around seed points that are determined in different manners (e.g. segmentation [17] or voting [11]).

Despite the significant improvements achieved by existing detection models, a huge number of high-quality 3D ground truths is required for training. This limits their applicability in practice, where ground truths are expensive to acquire. In order to leverage abundant unlabeled data, which is much easier to access, semi-supervised 3D object detection is a promising direction to exploit. To the best of our knowledge, there is no existing semi-supervised point cloud-based 3D object detection approach that involves only a small set of labeled data. The most closely related work was proposed recently by Tang and Lee [21], who present a cross-category semi-supervised 3D object detection method. However, it requires both the 2D box labels and some of the 3D box labels. We consider this setting "mixed supervised" to differentiate it from our semi-supervised setting, where few labeled samples are used together with plentiful unlabeled samples. Furthermore, [21] follows the two-step pipeline of [12] to restrict the object localization space: the first step is 2D object detection on RGB images, and the second step is 3D object detection in the frustum point clouds yielded by the 2D detections. This two-step pipeline means that the performance is tightly coupled to the performance of the 2D detector. In this work, we directly process the raw point cloud in one step to remove the dependency on the 2D modality.

Figure 2: The architecture of our SESS.

2.2 Semi-Supervised Learning

Semi-Supervised Learning (SSL) has attracted growing interest in a wide range of research areas (e.g. image classification and semantic segmentation) by virtue of its aim to learn from both labeled and unlabeled data simultaneously. Many approaches have been proposed for semi-supervised learning. Due to space limitations, we only review self-ensembling based approaches, which have recently been among the most promising in SSL.

The idea behind self-ensembling approaches is to improve the generalization of a model by encouraging consensus among ensemble predictions of unknown samples under small perturbations of the inputs or network parameters. For instance, the Γ model [13], a variation of the ladder network [23], consists of two identical parallel branches that respectively take one image and a corrupted version of that image as input. The consistency loss is computed based on the difference between the (pre-activated) predictions from the clean branch and the (pre-activated) corrupted branch processed by an explicit denoising layer. In contrast to the Γ model, the Π model [6] discards the explicit denoising layer and inputs the same image under different corruption conditions into a single branch. Virtual Adversarial Training [10] shares a similar idea with the Π model, but uses adversarial perturbation instead of independent noise. The temporal model [6], an extension of the Π model, forces consistency between the recent network output and the aggregation of network predictions over multiple previous training epochs, rather than predictions from an auxiliary corrupted input. However, this model becomes cumbersome when applied to large datasets because it needs to maintain a per-sample moving average of the historical network predictions. Mean Teacher [22] tackles the weakness of the temporal model by replacing the network prediction average with a network parameter average. It contains two network branches, teacher and student, with the same architecture. The parameters of the teacher network are the exponential moving average of the student network parameters, which are updated by stochastic gradient descent. The student network is trained to yield predictions consistent with the teacher network. We choose the Mean Teacher architecture as the basis of our framework and adapt it to the 3D object detection task.

3 Our Method

3.1 Problem Definition

Given any point cloud of a scene as input, our objective is to classify and localize amodal 3D bounding boxes for the objects in the 3D scene. In the semi-supervised setting, we have access to N training samples, including N_l labeled point clouds {(x_i^l, y_i^l)} and N_u unlabeled point clouds {x_i^u}. Here x denotes the point cloud of a 3D scene, containing n points with 3D coordinates, and y denotes the ground truth annotations for all the objects of interest in the 3D point cloud x. Each object is represented by a semantic class (1-of-K predefined classes) and an amodal 3D bounding box parameterized by its center c ∈ R^3, size s ∈ R^3, and orientation θ along the upright-axis.

3.2 SESS Architecture

The illustration of our SESS architecture is shown in Figure 2. We use the Mean Teacher paradigm [22] in our semi-supervised 3D object detection task, where the student and teacher networks are 3D object detectors. Both networks take perturbed point clouds as input and output 3D object proposals, which represent the estimated classes and 3D bounding boxes of all the objects of interest in the point cloud. We adopt the state-of-the-art VoteNet [11] as the backbone for the student and teacher networks. (It is worth highlighting that, rather than designing a specific detector model, our proposed framework is model-agnostic: any existing point cloud-based 3D object detection network can be used.) More specifically, SESS takes a training batch with a mixture of labeled and unlabeled point clouds, where B_l and B_u denote the numbers of labeled and unlabeled samples in a batch, respectively. We randomly sample points from each training point cloud twice to get two sets of points. The first set of points is perturbed by a stochastic transformation T and then passed to the student network, while the second set of points is directly passed to the teacher network. The output proposals from the teacher network are further transformed by the T applied previously. For each proposal in the transformed teacher output, we find its closest alignment among the output proposals of the student network based on Euclidean distance. Subsequently, the error between each aligned proposal pair is computed from three consistency losses. Concurrently, the set of ground truths is also transformed by the same T, and the transformed ground truths are compared with the labeled output of the student network using a supervised loss. Finally, the weights θ_t of the student network are updated at training step t, and then the updated student weights are used in an exponential moving average (EMA) to update the weights θ′_t of the teacher network:


θ′_t = α θ′_{t−1} + (1 − α) θ_t,

where α is a smoothing hyper-parameter. For the supervised loss, we use the same multi-task loss as in [11]. We introduce our perturbation scheme and consistency losses for adapting the Mean Teacher paradigm to the 3D object detection task in the following subsections.
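As a concrete sketch of the EMA update above (illustrative code, not the authors' implementation; network parameters are represented here as plain NumPy arrays):

```python
import numpy as np

def ema_update(teacher_params, student_params, alpha):
    """Exponential moving average: theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t.

    teacher_params / student_params: lists of NumPy arrays holding the
    corresponding weights of the teacher and student networks.
    """
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_params, student_params)]

# One training step: the student is updated by SGD, then the teacher follows.
teacher = [np.array([1.0, 1.0])]
student = [np.array([0.0, 0.0])]
teacher = ema_update(teacher, student, alpha=0.99)  # teacher slowly tracks the student
```

Because α is close to 1, the teacher changes slowly and effectively averages the student over many recent steps, which is what makes its proposals a stable target.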

3.3 Perturbation Scheme

As mentioned in [6, 22], input perturbation or data augmentation plays an essential role in the success of self-ensembling approaches. The perturbation schemes of the Mean Teacher on image-based tasks, e.g. image recognition, include random translations and horizontal flips of the input images, adding Gaussian noise at the input layer and applying dropout within the network. However, none of these image-based perturbation schemes can be used directly for our point cloud-based 3D object detection task. Consequently, we propose a perturbation scheme suitable for point cloud-based 3D object detection in this paper.

Random Sub-sampling

We apply random sub-sampling on the input point cloud to both the student and teacher networks as part of our perturbation scheme. The local geometrical relationship of the points in two random sub-samples of a given point cloud might differ significantly, but the global geometry, i.e. the 3D bounding box locations of the objects, in the sub-sampled point clouds should remain the same. As a result, our model is trained to exploit the underlying geometry in the global context by forcing the consistency between the stochastic outputs from student and teacher networks.
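A minimal sketch of this sub-sampling step (a hypothetical helper, assuming the point cloud is an (n, 3) array of coordinates):

```python
import numpy as np

def random_subsample(points, m, rng=None):
    """Randomly draw m points from an (n, 3) point cloud.

    Two independent calls yield different local point patterns, while the
    global geometry (object locations) of the scene stays the same.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = points.shape[0]
    idx = rng.choice(n, size=m, replace=n < m)  # with replacement only if m > n
    return points[idx]

# Two stochastic views of the same scene, e.g. for the student and the teacher.
scene = np.random.default_rng(0).normal(size=(20000, 3))
view_student = random_subsample(scene, 4000)
view_teacher = random_subsample(scene, 4000)
```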

Stochastic Transform

We apply stochastic transformations that include flipping, rotation and scaling on the randomly sub-sampled point cloud in the student network to prevent the network from memorizing unintended properties of the training point clouds, e.g. the absolute position of each point. More specifically, we formulate the transformation operations as a set of stochastic variables T = {F_x, F_y, R, S}. Here F_x represents a random flip along the x-axis, and its binary value is determined by:

F_x = 1[a > 0.5],

where a is a random variable uniformly sampled from [0, 1]. F_y represents a random flip along the y-axis and is generated the same way as F_x. R denotes the rotation around the upright-axis, parameterized by a rotation angle θ sampled uniformly from [−θ_0, +θ_0]:

R = [ cos θ  −sin θ  0 ;  sin θ  cos θ  0 ;  0  0  1 ],

and S, which is uniformly sampled from a bounded scale range, represents the scaling of the points. Finally, a T is randomly sampled and applied on each input training point cloud x to the student network. Note that the ground truth labels of the labeled input point cloud are also transformed by the corresponding T before computing the supervised loss. Additionally, the output proposals from the teacher network are also transformed by T to enable the alignment between the outputs of the two networks. Specifically, T is applied to each center point of the proposals from the teacher network.
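The stochastic transform can be sketched as follows (illustrative code; we assume the z-axis is the upright axis, and the default angle bound and scale range are placeholders, not necessarily the values used in the experiments):

```python
import numpy as np

def sample_transform(theta_bound_deg=30.0, scale_range=(0.9, 1.1), rng=None):
    """Draw T = {F_x, F_y, R, S}: random flips, a rotation angle and a scale."""
    rng = np.random.default_rng() if rng is None else rng
    flip_x = rng.random() > 0.5           # F_x = 1[a > 0.5], a ~ U[0, 1]
    flip_y = rng.random() > 0.5           # F_y, generated the same way
    theta = np.deg2rad(rng.uniform(-theta_bound_deg, theta_bound_deg))
    scale = rng.uniform(*scale_range)
    return flip_x, flip_y, theta, scale

def apply_transform(points, flip_x, flip_y, theta, scale):
    """Apply flips, then rotation around the upright (z) axis, then scaling."""
    pts = points.copy()
    if flip_x:
        pts[:, 0] = -pts[:, 0]
    if flip_y:
        pts[:, 1] = -pts[:, 1]
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    return scale * pts @ rot.T
```

The same sampled transform must also be applied to the ground-truth boxes of labeled scenes and to the centers of the teacher's proposals, so that both sides of each consistency loss live in the same coordinate frame.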

3.4 Consistency Loss

Unlike the direct computation of consistency between class predictions of perturbed images in the context of a recognition task [22], the consistency between two sets of 3D object proposals cannot be computed directly. We circumvent this problem by pairing up the predicted proposals from the student and teacher networks with an alignment scheme, followed by applying three consistency losses on the paired proposals. The objective of the three consistency losses is to enforce the consensus of object locations, semantic categories and sizes. Let C^S denote the centers of the predicted 3D bounding boxes from the student network, and C^T denote those from the teacher network after transformation. For each center c^s in C^S, we do the alignment by searching for its nearest neighbor in C^T based on the minimum Euclidean distance between the centers of the bounding boxes. We use Ĉ^T to denote the elements from C^T that are aligned with each element in C^S. More formally,

ĉ^t = argmin_{c^t ∈ C^T} ‖c^s − c^t‖₂ .

Similarly, we can also collect Ĉ^S with elements from C^S that are aligned with each element in C^T. It is important to note that the alignments are not bijective, hence the aligned sets may contain repeated elements. Intuitively, the alignment errors, i.e., the total distance between all corresponding elements in the aligned pairs, should be zero when the bounding boxes predicted by the teacher and student networks are consistent. To this end, we propose the center-aware consistency loss:

L_center = (1/|C^S|) Σ_{c^s ∈ C^S} ‖c^s − ĉ^t‖ + (1/|C^T|) Σ_{c^t ∈ C^T} ‖c^t − ĉ^s‖

to minimize the alignment errors between the teacher and student networks.
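The nearest-neighbor alignment and the bidirectional center term can be sketched as follows (illustrative code; squared Euclidean distances are used here, which is an assumption about the exact distance form):

```python
import numpy as np

def align(src, dst):
    """For each center in src (k, 3), the index of its nearest center in dst (m, 3)."""
    d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(axis=-1)  # (k, m) pairwise distances
    return d2.argmin(axis=1)

def center_consistency(student_centers, teacher_centers):
    """Bidirectional alignment error between the two sets of box centers."""
    s2t = align(student_centers, teacher_centers)  # each student center -> nearest teacher center
    t2s = align(teacher_centers, student_centers)  # each teacher center -> nearest student center
    loss_s = np.mean(((student_centers - teacher_centers[s2t]) ** 2).sum(axis=1))
    loss_t = np.mean(((teacher_centers - student_centers[t2s]) ** 2).sum(axis=1))
    return loss_s + loss_t
```

Note that the alignment is not a bijection: several student centers may map to the same teacher center, which is why both directions are averaged separately.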

In addition to center consistency, we also consider two other properties of the 3D proposals, semantic class and size, to enforce consistency between the two sets of proposals. Following the principle in classic self-ensembling learning, where the teacher network produces targets for the student to learn, we only consider a uni-directional alignment, i.e., from the teacher proposals to the student proposals, in computing the class- and size-aware consistency losses. More specifically, let P^S and P^T denote the class probabilities of the predicted objects from the student and the teacher networks, respectively. The aligned P̂^S is easily obtained based on minimum center distance. We define the class-aware consistency loss as the Kullback-Leibler (KL) divergence between P̂^S and P^T:

L_class = (1/|P^T|) Σ_{p^t ∈ P^T} D_KL(p^t ‖ p̂^s) .

In a similar vein, the sizes of the bounding boxes predicted by the student and the teacher networks are denoted as S^S and S^T, respectively. We use the same minimum center distance to get the aligned Ŝ^S. The size-aware consistency loss can now be computed as the Mean Square Error (MSE) between Ŝ^S and S^T:

L_size = (1/|S^T|) Σ_{s^t ∈ S^T} ‖s^t − ŝ^s‖² .
Finally, the total consistency loss is a weighted sum of the three consistency terms described earlier:

L_consistency = λ₁ L_center + λ₂ L_class + λ₃ L_size ,

where λ₁, λ₂, and λ₃ are the weights that control the importance of the corresponding consistency terms.
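The class- and size-aware terms can be sketched as follows (illustrative code; taking the teacher's class distribution as the target of the KL divergence is an assumption about the direction):

```python
import numpy as np

def class_consistency(p_student_aligned, p_teacher, eps=1e-8):
    """Mean KL(p_teacher || p_student) over aligned proposal pairs.

    Both inputs are (m, num_classes) arrays of class probabilities, where
    row j of p_student_aligned is the student proposal matched to teacher
    proposal j by minimum center distance.
    """
    p_t = np.clip(p_teacher, eps, 1.0)
    p_s = np.clip(p_student_aligned, eps, 1.0)
    return np.mean(np.sum(p_t * np.log(p_t / p_s), axis=-1))

def size_consistency(s_student_aligned, s_teacher):
    """Mean squared error between aligned box sizes, each an (m, 3) array."""
    return np.mean((s_student_aligned - s_teacher) ** 2)
```

Both terms vanish when the student reproduces the teacher's aligned class distributions and box sizes exactly, so minimizing them pushes the paired proposals toward agreement.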

Dataset Model 10% 20% 30% 40% 50% 70%
SUNRGB-D VoteNet [11] 32.3 41.8 47.6 50.4 52.1 55.7
SESS 40.7 48.4 52.3 54.1 56.1 58.6
Improv.(%) 26.01 15.79 9.87 7.34 7.68 5.21
ScanNetV2 VoteNet [11] 32.3 42.4 45.0 49.7 52.6 54.8
SESS 41.2 48.6 52.0 55.4 58.7 59.3
Improv.(%) 27.55 14.62 15.56 11.47 11.60 8.21
Table 1: Comparison with VoteNet on SUN RGB-D val set and ScanNetV2 val set with varying ratios of labeled data.
Dataset    DSS [20]  COG [15]  2D-driven [5]  F-PointNet [12]  GSPN [28]  3D-SIS [4]  VoteNet [11]  SESS
SUN RGB-D  42.1      47.6      45.1           54.0             -          -           57.7          61.1
ScanNetV2  15.2      -         -              19.8             30.6       40.2        58.6          62.1
Table 2: Comparison with fully-supervised methods on SUN RGB-D and ScanNetV2 val sets with 100% training labels.

4 Experiments

4.1 Datasets

We evaluate our SESS on ScanNet and SUN RGB-D for semi-supervised 3D object detection.

SUN RGB-D

[19] is an indoor benchmark dataset for 3D object detection. It contains 10,335 single-view RGB-D images, officially split into 5,285 training samples and 5,050 validation samples, with 3D bounding box annotations available for hundreds of object classes. Following the standard evaluation protocol [5, 11, 12, 15, 21], we perform evaluation on the 10 most common categories for comparison with previous methods. Using the provided camera parameters, the depth images are converted to point clouds as our inputs.


ScanNetV2

[2] contains 1,513 reconstructed meshes from 707 unique indoor scenes, officially split into 1,201 training samples and 312 validation samples. Each scene is well annotated with semantic segmentation masks. Since there are no amodal or oriented 3D bounding boxes in the ScanNetV2 dataset, we derive axis-aligned bounding boxes from the point-level labeling as in [4, 11]. We adopt the same 18 object classes out of the 21 semantic classes as proposed in [4, 11]. The input point clouds are generated by sampling vertices from the meshes.

For both datasets, we evaluate on different proportions of labeled data randomly sampled from all the training data. We ensure that all classes are present, otherwise we re-sample until all classes are covered in the set. We keep the remaining data as unlabeled data for training in our semi-supervised framework.

4.2 Implementation Details

Framework Details

We feed training batches of point clouds with 5,000 points each to our framework. To construct a batch, we randomly sample B_l labeled samples from the labeled set and B_u unlabeled samples from the unlabeled set. In the experiments, B_l is set to 2 and B_u to 8. During the perturbation step, the number of randomly sub-sampled points is 4,000; the rotation angle bound θ_0 is set to 30 on SUN RGB-D and 5 on ScanNetV2; and the random scale S is sampled from a bounded range [s_min, s_max]. The weights λ₁, λ₂, and λ₃ in the consistency loss function control the relative importance of the three terms. As suggested in [22], we ramp up the coefficient of the consistency cost from 0 to its maximum value of 10 during the first 30 epochs, using the sigmoid-shaped function e^{−5(1−t)²}, where t increases linearly from 0 to 1 during the ramp-up period. In terms of the EMA decay α, we set α = 0.99 during the ramp-up period and α = 0.999 for the rest of the training, following [22].
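The ramp-up schedule follows the sigmoid-shaped function of [22] (a sketch; epoch-based scheduling and the default arguments are assumptions):

```python
import numpy as np

def consistency_weight(epoch, ramp_epochs=30, w_max=10.0):
    """Ramp the consistency coefficient from ~0 up to w_max over ramp_epochs.

    Uses w(t) = w_max * exp(-5 * (1 - t)^2), with t rising linearly 0 -> 1
    during the ramp-up period and staying at 1 afterwards.
    """
    t = np.clip(epoch / ramp_epochs, 0.0, 1.0)
    return w_max * float(np.exp(-5.0 * (1.0 - t) ** 2))
```

Starting with a near-zero consistency weight lets the supervised loss dominate early on, before the teacher's proposals are reliable enough to serve as targets.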


Training Details

We adopt the exact network structure of VoteNet [11] as the structure of our student and teacher networks. We pre-train VoteNet with all the labeled samples, then initialize the student and teacher networks with the pre-trained VoteNet, and train the student network on both the labeled and unlabeled data by minimizing the supervised loss as well as the consistency loss. The student network is trained with an ADAM optimizer with an initial learning rate of 0.001. The learning rate is decayed by a factor of 0.1 at epochs 80 and 120, respectively. In general, the model converges at around 120 epochs. The number of generated 3D proposals is 128.


Testing Details

During inference, we forward the point cloud of an entire scene to the student network to generate proposals. (Note that the teacher network can also be used to detect objects; in our experiments, the student and teacher networks give similar performance.) Following the same protocol as described in [11], we post-process the predicted proposals with a 3D NMS module using a 3D Intersection-over-Union (IoU) threshold of 0.25. For the evaluation metric, we adopt the widely-used mean average precision (mAP). By default, mAP@0.25 (3D IoU threshold 0.25) is reported in the following experiments.
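For completeness, here is a minimal sketch of greedy 3D NMS over axis-aligned boxes (illustrative only; VoteNet's actual post-processing may differ in its details):

```python
import numpy as np

def iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))   # zero if the boxes do not overlap
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter + 1e-12)

def nms_3d(boxes, scores, iou_thresh=0.25):
    """Greedily keep the highest-scoring boxes, suppressing heavy overlaps."""
    order = list(np.argsort(scores)[::-1])  # indices sorted by descending score
    keep = []
    while order:
        i = order.pop(0)
        keep.append(int(i))
        order = [j for j in order if iou_3d(boxes[i], boxes[j]) <= iou_thresh]
    return keep
```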

4.3 Comparison with Fully-supervised Methods


To the best of our knowledge, there are no other 3D object detection approaches sharing the same semi-supervised setting as ours. Consequently, we compare our semi-supervised SESS to the state-of-the-art fully-supervised 3D object detection method, VoteNet [11], which can be considered an upper bound for our semi-supervised method since we share the same network backbone. Drawing varying ratios of labeled data from the entire training set, we train VoteNet with the available labeled data in a fully-supervised way, and SESS with the available labeled data as well as the remaining unlabeled data in a semi-supervised way. Additionally, we also evaluate our semi-supervised SESS in a wide-ranging comparison with existing fully-supervised 3D object detection methods. Deep Sliding Shapes (DSS) [20] and Clouds of Oriented Gradients (COG) [15] are both sliding window based methods, where DSS is a 3D extension of the Faster R-CNN pipeline [14], and COG designs a 3D HoG-like feature to model 3D geometry and appearance. 2D-driven [5] and F-PointNet [12] both depend on 2D detection in associated RGB images to reduce the search space of 3D localization. GSPN [28] and 3D-SIS [4] both target the 3D instance segmentation task but incorporate 3D object detection as an auxiliary task. Note that all the aforementioned methods use both point clouds and RGB images as inputs, except VoteNet and our SESS, which only require point clouds.


Table 1 lists the comparison results against VoteNet under different ratios of labeled data on the two datasets. SESS significantly outperforms VoteNet under every ratio setting, which verifies the effectiveness of our proposed semi-supervised framework. On both datasets, as the proportion of labeled samples decreases, the performance gap between our SESS and the fully-supervised VoteNet becomes larger. Given 10% labeled data, our SESS gains 26.01% and 27.55% relative improvement over VoteNet on SUN RGB-D and ScanNetV2, respectively. This indicates that our framework is able to learn from unlabeled data, and the benefit is larger when labeled data is scarce. Table 2 shows the performance against recent state-of-the-art methods on the two datasets when using all the training samples. It is interesting to see from Tables 1 and 2 that by using only 50% labeled samples, our SESS achieves better than (on ScanNetV2) or close to (on SUN RGB-D) the upper-bound performance obtained by the fully-supervised VoteNet with 100% labeled samples. Furthermore, it is worth pointing out that when given all the labeled training data, our SESS is able to improve the performance beyond the upper bound of VoteNet. This indicates that our consistency losses are complementary to the supervised loss, and our framework could be integrated with any supervised 3D object detector to enhance detection accuracy.

4.4 Transductive Semi-supervised Learning

Generally, semi-supervised learning may refer to either inductive learning or transductive learning. In inductive learning, the goal is to generalize correct labels to new unseen data; in transductive learning, the goal is to infer the labels of the given unlabeled data. Our previous experiments, conducted on an unseen validation set, can be considered inductive learning. In Table 3 we show that our SESS is also effective in transductive learning on both datasets. SESS consistently outperforms the fully-supervised VoteNet under different numbers of labeled samples. This demonstrates that our proposed SESS is a general framework that is not specific to the inductive or transductive setting.

Dataset Model 10% 20% 30% 40% 50% 70%
SUNRGB-D VoteNet 33.5 39.8 47.5 49.7 51.6 55.2
SESS 40.7 46.1 53.3 54.3 55.1 59.0
ScanNetV2 VoteNet 37.8 47.7 52.1 56.9 61.2 64.3
SESS 46.7 55.4 59.5 63.9 67.5 69.6
Table 3: Transductive learning on SUN RGB-D and ScanNetV2 unlabeled training sets, compared with fully-supervised VoteNet. The percentage indicates the ratio of labeled data used for training.

4.5 Ablation Studies

In this section, we explore the effects of the perturbations and consistency losses. The ablation experiments are trained on SUN RGB-D with 10% labeled data and on ScanNetV2 with 30% labeled data. The evaluation is on the corresponding validation set.


Perturbations

We study the effect of each perturbation by removing it individually from the framework and reporting the performance after the removal. We also evaluate the extreme case that removes the perturbation scheme altogether. Figure 3 illustrates the resultant performances. The performance drops greatly on both datasets when the entire perturbation scheme is removed, while the effect of each individual perturbation varies between the datasets. For example, the rotation perturbation contributes less to performance on ScanNet than on SUN RGB-D, as the bounding boxes of objects in ScanNet are axis-aligned. The scaling perturbation gives less improvement on SUN RGB-D than on ScanNet. We suspect that this is because the partial scenes in SUN RGB-D all have similar scales and are thus less sensitive to the scaling perturbation, whereas the scales of the scenes in ScanNet are quite diverse.

Consistency Losses

We further investigate the effects of our three consistency losses by experimenting with different combinations. The comparison is reported in Table 4. From the perspective of individual consistency loss, the center-aware and class-aware consistency losses contribute more than the size-aware consistency loss. However, the combination of center-aware or class-aware with size-aware consistency loss helps to improve the performance to some extent. Finally, the integration of the three consistency losses gives us the best performance on both datasets. It indicates that the requirement of representing the predicted bounding boxes with correct geometries (i.e. center, size) as well as semantics (i.e. class) regularizes the model towards a better performance.

Figure 3: Effects of different perturbations.
center  class  size   SUN RGB-D  ScanNetV2
✓       -      -      38.2       50.0
-       ✓      -      39.2       50.2
-       -      ✓      38.1       49.2
✓       ✓      -      40.3       50.7
✓       -      ✓      38.9       50.5
-       ✓      ✓      40.0       51.5
✓       ✓      ✓      40.7       52.0
Table 4: Ablation study on consistency losses. ✓ indicates the consistency terms used.

4.6 Qualitative Results and Analysis

Figure 4: Qualitative comparison between the fully-supervised VoteNet and the proposed SESS on SUN RGB-D val set.
Figure 5: Qualitative comparison between the fully-supervised VoteNet and the proposed SESS on ScanNetV2 val set.

Figure 4 and Figure 5 show visualizations of the predictions by VoteNet and SESS with 30% labeled training data and 100% labeled training data on the SUN RGB-D and ScanNetV2 scenes, respectively. As can be seen, the partial scene obtained by single-view scanning in SUN RGB-D is very challenging, where some objects are only partly visible yet have amodal ground-truth bounding boxes (such as the "sofa" in Figure 4). Surprisingly, both our method and the strongly supervised VoteNet successfully detect the target objects in such a challenging scene. Similar to the strongly supervised VoteNet, our SESS is able to detect more objects than are provided by the ground truth, such as the partial table in front of the sofa and the heavily occluded chairs behind the sofa. With 30% labeled data, our SESS gives more accurate predictions than VoteNet on unannotated objects; we attribute this to the exploitation of unlabeled data in our proposed approach. Our SESS detects even more unannotated objects when 100% labeled data is used in training, and the predicted 3D bounding boxes are consistent with human perception.

In contrast to the partial scenes in SUN RGB-D, the scenes in ScanNet are more complete and cover larger areas with cluttered objects. An example is shown in Figure 5: this scene contains 7 tables and 27 chairs. Our SESS correctly recognizes the 7 tables and 26 of the chairs with 30% labeled data, while the strongly supervised VoteNet only detects 6 tables and 24 chairs correctly. We argue that the proposed consistency losses, which guide the model with encoded geometric and semantic information, contribute to better localization of the 3D bounding boxes. All 34 objects are detected with precise bounding boxes when our model is trained with 100% labeled data.

5 Conclusion

In this paper, we propose SESS, a novel self-ensembling semi-supervised method that learns point cloud-based 3D object detection from both labeled and unlabeled data. To this end, we adapt the Mean Teacher paradigm to 3D object detection by designing a perturbation scheme specific to point-based data together with three consistency losses. Experimental results on two datasets validate the effectiveness and advantages of our method, and show that it is a general framework applicable to both inductive and transductive semi-supervised 3D object detection.
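At the core of the Mean Teacher paradigm that SESS adapts, the teacher network is not trained by gradient descent; its weights are an exponential moving average (EMA) of the student's weights. A minimal sketch of this update (the function name `ema_update` and the decay value are illustrative, not taken from the SESS code base):

```python
def ema_update(teacher_params, student_params, alpha=0.99):
    """Mean Teacher weight update: each teacher parameter becomes an
    exponential moving average of the corresponding student parameter,
    t <- alpha * t + (1 - alpha) * s.
    Here parameters are plain floats for illustration; in practice this
    runs over every tensor in the two networks after each training step."""
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_params, student_params)]
```

Because the teacher averages the student over many steps, its predictions are more stable and serve as the consistency targets for the three consistency losses.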


  • [1] Y. Chen, S. Liu, X. Shen, and J. Jia (2019) Fast point r-cnn. In ICCV, pp. 9775–9784. Cited by: §1, §2.1.
  • [2] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner (2017) Scannet: richly-annotated 3d reconstructions of indoor scenes. In CVPR, pp. 5828–5839. Cited by: §1, §4.1.
  • [3] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, Cited by: §1.
  • [4] J. Hou, A. Dai, and M. Nießner (2019) 3d-sis: 3d semantic instance segmentation of rgb-d scans. In CVPR, pp. 4421–4430. Cited by: Table 2, §4.1, §4.3.
  • [5] J. Lahoud and B. Ghanem (2017) 2d-driven 3d object detection in rgb-d images. In ICCV, pp. 4622–4630. Cited by: §2.1, Table 2, §4.1, §4.3.
  • [6] S. Laine and T. Aila (2017) Temporal ensembling for semi-supervised learning. In ICLR, Cited by: §2.2, §3.3.
  • [7] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom (2019) PointPillars: fast encoders for object detection from point clouds. In CVPR, pp. 12697–12705. Cited by: §1, §2.1.
  • [8] B. Li, W. Ouyang, L. Sheng, X. Zeng, and X. Wang (2019) GS3D: an efficient 3d object detection framework for autonomous driving. In CVPR, pp. 1019–1028. Cited by: §2.1.
  • [9] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun (2019) Multi-task multi-sensor fusion for 3d object detection. In CVPR, pp. 7345–7353. Cited by: §1, §2.1.
  • [10] T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence 41 (8), pp. 1979–1993. Cited by: §2.2.
  • [11] C. R. Qi, O. Litany, K. He, and L. J. Guibas (2019) Deep hough voting for 3d object detection in point clouds. In ICCV, Cited by: §1, §2.1, §3.2, Table 1, Table 2, §4.1, §4.1, §4.2, §4.2, §4.3.
  • [12] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2018) Frustum pointnets for 3d object detection from rgb-d data. In CVPR, pp. 918–927. Cited by: §1, §2.1, §2.1, Table 2, §4.1, §4.3.
  • [13] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko (2015) Semi-supervised learning with ladder networks. In Advances in neural information processing systems, pp. 3546–3554. Cited by: §2.2.
  • [14] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §4.3.
  • [15] Z. Ren and E. B. Sudderth (2016) Three-dimensional object detection and layout prediction using clouds of oriented gradients. In CVPR, pp. 1525–1533. Cited by: §2.1, Table 2, §4.1, §4.3.
  • [16] Z. Ren and E. B. Sudderth (2018) 3d object detection with latent support surfaces. In CVPR, pp. 937–946. Cited by: §1, §2.1.
  • [17] S. Shi, X. Wang, and H. Li (2019) Pointrcnn: 3d object proposal generation and detection from point cloud. In CVPR, pp. 770–779. Cited by: §1, §2.1.
  • [18] M. Simon, S. Milz, K. Amende, and H. Gross (2018) Complex-yolo: an euler-region-proposal for real-time 3d object detection on point clouds. In ECCV, pp. 197–209. Cited by: §1, §2.1.
  • [19] S. Song, S. P. Lichtenberg, and J. Xiao (2015) Sun rgb-d: a rgb-d scene understanding benchmark suite. In CVPR, pp. 567–576. Cited by: §1, §4.1.
  • [20] S. Song and J. Xiao (2016) Deep sliding shapes for amodal 3d object detection in rgb-d images. In CVPR, pp. 808–816. Cited by: §2.1, Table 2, §4.3.
  • [21] Y. S. Tang and G. H. Lee (2019) Transferable semi-supervised 3d object detection from rgb-d data. In ICCV, Cited by: §1, §1, §2.1, §4.1.
  • [22] A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pp. 1195–1204. Cited by: §1, §2.2, §3.2, §3.3, §3.4, §4.2.
  • [23] H. Valpola (2015) From neural pca to deep unsupervised learning. In Advances in Independent Component Analysis and Learning Machines, pp. 143–171. Cited by: §2.2.
  • [24] Z. Wang and K. Jia (2019) Frustum convnet: sliding frustums to aggregate local point-wise features for amodal 3d object detection. In IROS, Cited by: §1, §2.1.
  • [25] Y. Yan, Y. Mao, and B. Li (2018) Second: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337. Cited by: §2.1.
  • [26] B. Yang, W. Luo, and R. Urtasun (2018) Pixor: real-time 3d object detection from point clouds. In CVPR, pp. 7652–7660. Cited by: §2.1.
  • [27] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia (2019) STD: sparse-to-dense 3d object detector for point cloud. In ICCV, Cited by: §1, §2.1.
  • [28] L. Yi, W. Zhao, H. Wang, M. Sung, and L. J. Guibas (2019) Gspn: generative shape proposal network for 3d instance segmentation in point cloud. In CVPR, pp. 3947–3956. Cited by: §2.1.
  • [29] Y. Zhou and O. Tuzel (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In CVPR, pp. 4490–4499. Cited by: §1, §2.1.

Appendix A Additional Evaluation Metric

We additionally evaluate mean average precision with an IoU threshold of 0.5 on SUN RGB-D and ScanNetV2 for both inductive (see Table 5) and transductive (see Table 6) semi-supervised 3D object detection. Consistent with the evaluation at an IoU threshold of 0.25, our SESS significantly outperforms the fully supervised VoteNet under different ratios of labeled data for both inductive and transductive learning.
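Under mAP@0.5 IoU, a prediction counts as a true positive only if it overlaps a same-class ground-truth box with 3D IoU of at least 0.5, a stricter criterion than the 0.25 threshold used in the main paper. The following is a minimal sketch of 3D IoU for the axis-aligned case (illustrative only; the benchmarks' official evaluation scripts are authoritative, and oriented boxes require an oriented intersection):

```python
def iou_3d_axis_aligned(box_a, box_b):
    """3D IoU of two axis-aligned boxes, each given as
    (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):  # overlap length along x, y, z
        lo = max(box_a[i], box_b[i])
        hi = min(box_a[i + 3], box_b[i + 3])
        if hi <= lo:          # no overlap along this axis
            return 0.0
        inter *= hi - lo

    def volume(b):
        return (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])

    union = volume(box_a) + volume(box_b) - inter
    return inter / union
```

For example, two 2x2x2 boxes shifted by one unit along x share a 1x2x2 intersection, giving IoU = 4 / 12 = 1/3, which would fail the 0.5 threshold despite passing 0.25.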

Appendix B Per-class Evaluation

We report per-class average precision on the 10 classes of SUN RGB-D and the 18 classes of ScanNetV2 in Tables 7 and 8, respectively, using all the training samples. With the assistance of the proposed perturbation scheme and consistency losses, our SESS is superior to the fully supervised VoteNet on every class of SUN RGB-D and on 14 classes of ScanNetV2.

Appendix C More Qualitative Results and Discussions

Figures 6 and 7 show additional qualitative results on the SUN RGB-D and ScanNetV2 val sets, respectively. As can be seen from the four examples in Figure 6, heavy occlusion (e.g. the chairs in the back rows of the classroom), partial visibility (e.g. the leftmost cabinet in the bedroom), and extreme sparsity (e.g. the rightmost chair in the study space) make detection on SUN RGB-D very difficult. Some objects are hard even for humans to recognize without the associated RGB images for reference, such as the leftmost chair in the second row of the classroom and the rightmost chair in the study space. Both VoteNet and our SESS fail to detect these extremely challenging objects, which have no or few representative points. However, it is interesting to see that our SESS successfully detects most of the objects in these challenging scenarios, including unannotated objects such as the chairs at the back of the classroom and the table in front of the bed in the bedroom.

In Figure 7, we show four more examples covering various scenarios in the ScanNetV2 dataset. Objects with strong geometric cues (e.g. table, chair, bed, desk) are easy to detect, since both the strongly supervised VoteNet and our SESS rely only on geometric data (i.e. XYZ coordinates). In contrast, objects without distinctive geometric features (e.g. door, picture, window) are difficult to recognize. Despite this challenge, our SESS detects most of the difficult objects, such as the bookshelves in the library and the doors in the lounge. We argue that the proposed consistency losses, which encode not only geometric but also semantic information, guide the model to better localization of the 3D bounding boxes.

Dataset     Model     10%    20%    30%    40%    50%    70%    100%
SUN RGB-D   VoteNet   10.6   14.7   23.3   25.6   27.2   30.0   31.1
SUN RGB-D   SESS      14.4   20.6   28.5   29.0   30.6   33.4   37.3
ScanNetV2   VoteNet   11.9   21.2   22.5   27.7   28.9   30.9   33.5
ScanNetV2   SESS      18.6   26.9   27.4   31.5   34.2   35.5   38.8
Table 5: Inductive learning on the SUN RGB-D and ScanNetV2 val sets compared with the fully supervised VoteNet, evaluated by mAP@0.5 IoU. The percentage indicates the ratio of labeled data used for training.
Dataset     Model     10%    20%    30%    40%    50%    70%
SUN RGB-D   VoteNet   10.3   15.3   23.4   25.5   25.0   29.9
SUN RGB-D   SESS      15.8   20.1   27.4   27.2   29.2   36.7
ScanNetV2   VoteNet   13.8   25.3   28.6   32.7   35.2   38.3
ScanNetV2   SESS      23.2   31.3   34.3   37.6   41.6   42.6
Table 6: Transductive learning on the SUN RGB-D and ScanNetV2 unlabeled training sets compared with the fully supervised VoteNet, evaluated by mAP@0.5 IoU. The percentage indicates the ratio of labeled data used for training.
Method bathtub bed bookshelf chair desk dresser nightstand sofa table toilet mAP
DSS 44.2 78.8 11.9 61.2 20.5 6.4 15.4 53.5 50.3 78.9 42.1
COG 58.3 63.7 31.8 62.2 45.2 15.5 27.4 51.0 51.3 70.1 47.6
2D-driven 43.5 64.5 31.4 48.3 27.9 25.9 41.9 50.4 37.0 80.4 45.1
F-PointNet 43.3 81.1 33.3 64.2 24.7 32.0 58.1 61.1 51.1 90.9 54.0
VoteNet 74.4 83.0 28.8 75.3 22.0 29.8 62.2 64.0 47.3 90.1 57.7
SESS 76.9 84.8 35.4 75.8 29.3 31.3 66.9 66.4 51.8 92.3 61.1
Table 7: Per-class mAP@0.25 IoU on the SUN RGB-D val set, with 100% training samples. The upper rows list the results of five fully-supervised methods, and the last row lists the results of our proposed semi-supervised method.
Method cabin. bed chair sofa table door wind. bkshf pic. cntr desk curt. fridg. showr. toilet sink bath ofurn. mAP
3DSIS 19.8 69.7 66.2 71.8 36.1 30.6 10.9 27.3 0.0 10.0 46.9 14.1 53.8 36.0 87.6 43.0 84.3 16.2 40.2
VoteNet 36.3 87.9 88.7 89.6 58.8 47.3 38.1 44.6 7.8 56.1 71.7 47.2 45.4 57.1 94.9 54.7 92.1 37.2 58.6
SESS 41.1 88.1 85.9 91.7 64.5 52.1 40.4 51.4 11.8 51.9 74.9 45.9 59.6 73.3 98.3 53.9 93.0 39.5 62.1
Table 8: Per-class mAP@0.25 IoU on the ScanNetV2 val set, with 100% training samples. The upper rows list the results of two fully-supervised methods, and the last row lists the results of our proposed semi-supervised method.
Figure 6: Additional qualitative comparison between the fully-supervised VoteNet and the proposed SESS on the SUN RGB-D val set, using 100% training samples. Four scene types are illustrated, from top to bottom: classroom, bedroom, study space, and living room.
Figure 7: Additional qualitative comparison between the fully-supervised VoteNet and the proposed SESS on the ScanNetV2 val set, using 100% training samples. Four scene types are illustrated, from top to bottom: library, kitchen, hotel, and lounge.