3DIoUMatch: Leveraging IoU Prediction for Semi-Supervised 3D Object Detection

12/08/2020 ∙ by He Wang, et al.

3D object detection is an important yet demanding task that heavily relies on difficult to obtain 3D annotations. To reduce the required amount of supervision, we propose 3DIoUMatch, a novel method for semi-supervised 3D object detection. We adopt VoteNet, a popular point cloud-based object detector, as our backbone and leverage a teacher-student mutual learning framework to propagate information from the labeled to the unlabeled train set in the form of pseudo-labels. However, due to the high task complexity, we observe that the pseudo-labels suffer from significant noise and are thus not directly usable. To that end, we introduce a confidence-based filtering mechanism. The key to our approach is a novel differentiable 3D IoU estimation module. This module is used for filtering poorly localized proposals as well as for IoU-guided bounding box deduplication. At inference time, this module is further utilized to improve localization through test-time optimization. Our method consistently improves state-of-the-art methods on both ScanNet and SUN-RGBD benchmarks by significant margins. For example, when training using only 10% labeled data on ScanNet, 3DIoUMatch achieves 7.7 absolute improvement on mAP@0.25 and 8.5 absolute improvement on mAP@0.5 upon the prior art.




1 Introduction

* indicates equal contributions. Project page: http://THU17cyz.github.io/3DIoUMatch

Object detection is a key task in 3D scene understanding. It provides a concise representation of raw sensor measurements in the form of semantically meaningful 3D bounding boxes. This low-dimensional representation can already serve numerous applications in AR/VR, as well as in robot navigation and manipulation. As a result, in recent years there has been a surge of interest in developing improved object detection pipelines and indeed current state-of-the-art methods show impressive performance. Yet, much of their success is attributed to the availability of large datasets of 3D scenes that are carefully annotated. While rapid advances in sensor technology facilitate the collection of 3D scenes at scale, annotating them remains the main bottleneck. This calls for detection methods that can leverage both labeled and unlabeled data at train time.

In this work, we aim to address this requirement by proposing a novel semi-supervised 3D object detection method which we dub 3DIoUMatch. As our backbone object detection network we adopt VoteNet [qi2019deep], a popular point-based object detector. To provide supervision to the unlabeled scenes, we leverage a teacher-student mutual learning framework [tarvainen2017mean] and use the bounding box predictions from the teacher network as pseudo-labels to supervise the student network on unlabeled data. However, unlike most pseudo-label techniques that were designed for classification, in the highly complex (joint regression and classification) task of object detection, we observe that the pseudo-labels suffer from significant noise, and using them directly is suboptimal.

Inspired by FixMatch [sohn2020fixmatch], the state-of-the-art semi-supervised learning (SSL) method for 2D image classification that proposed confidence-based filtering to improve pseudo-label quality, we adopt a pseudo-label filtering mechanism for 3D object detection by setting thresholds on predicted class probabilities and objectness scores, so as to filter out teacher proposals that have potentially erroneous semantic labels or that do not belong to the foreground. While effective, these two criteria alone are not sufficient to capture localization quality, and the pseudo-labels may still have large errors in the bounding box parameters. To that end, we further propose to leverage estimated IoU (intersection over union) as a localization quality measure for pseudo-label filtering. IoU estimation was first proposed in the context of 2D object detection as a localization confidence in the pioneering work IoU-Net [jiang2018acquisition], where estimated IoU proved successful as a replacement for class confidence in test-time Non-Maximum Suppression (NMS). To the best of our knowledge, leveraging IoU estimation for pseudo-label filtering is a novel idea for SSL on both 2D and 3D object detection. With our newly devised differentiable 3D IoU estimation module, we are able to filter out poorly localized pseudo-labels and leverage estimated IoU for both train-time and test-time NMS.

A key challenge when filtering based on IoU estimation is how to properly set the threshold. Unlike objectness and class confidence for which high threshold values (e.g. 0.9) work well, 3D IoU is more sensitive to small errors. Setting the threshold too high would reduce the number of pseudo-labels to very few, from which little could be learned. To balance between quality and coverage, we propose a two-stage filtering process: first, using a relatively low IoU threshold; then, an IoU-guided class-aware Lower-Half Suppression (LHS) that removes only half of the highly-overlapping boxes with low predicted IoU. Our proposed LHS thus naturally sets a threshold that is both dynamic and class-aware. Our experiments show that LHS outperforms IoU-guided NMS, which suppresses all but the top one during semi-supervised training. Beyond train-time filtering, we also leverage our differentiable 3D IoU module for test-time IoU-guided NMS and optimization-based bounding box refinement, which is not possible with previous non-differentiable 3D IoU modules [yang2019std].

Our method consistently improves upon the previous state-of-the-art method, SESS [zhao2020sess], on both ScanNet and SUN-RGBD benchmarks by significant margins. When using only 10% labeled data on ScanNet, 3DIoUMatch outperforms SESS by 7.7 absolute improvement on mAP@0.25 and by 8.5 absolute improvement on mAP@0.5. When using 5% labeled data on SUN-RGBD, 3DIoUMatch outperforms SESS by 4.8 absolute improvement on mAP@0.25 and by 8.0 absolute improvement on mAP@0.5.

Our main contributions can be summarized as follows:

  1. We propose a novel semi-supervised method for 3D object detection in point clouds based on pseudo-label propagation along with a carefully designed filtering mechanism.

  2. For the first time, we leverage predicted 3D IoU as a localization confidence score for pseudo-label filtering, and further propose IoU-guided Lower-Half Suppression for robust pseudo-label deduplication.

  3. We devise a 3D IoU module that enables our localization filtering, IoU-guided bounding box suppression, and IoU optimization for bounding box refinement.

  4. We achieve markedly improved performance over prior art on the two major indoor object detection benchmarks, ScanNet and SUN-RGBD.

2 Related Works

Figure 1: 3DIoUMatch pipeline at the semi-supervised training stage. We adopt as our backbone an extended version of VoteNet with an additional 3D IoU estimation module. For SSL, we utilize a teacher-student mutual learning framework, composed of a learnable student taking strongly augmented input data and an EMA teacher taking weakly augmented input samples. On labeled data, the student network is trained with ground-truth supervision. On unlabeled data, the student network takes pseudo-labels from its EMA teacher. To improve the quality of the pseudo-labels, we adopt a confidence-based filtering mechanism that keeps only the predictions that pass all thresholds on class probability, objectness, and 3D IoU. We further use IoU-guided Lower-Half Suppression to remove duplicated predictions. Using the filtered pseudo-labels, we selectively supervise the student predictions that lie around the bounding boxes in the pseudo-labels.

Semi-Supervised Learning (SSL)

Many of the recent SSL methods [berthelot2019mixmatch, xie2019unsupervised, berthelot2019remixmatch] leverage consistency regularization, first proposed in [sajjadi2016regularization, laine2016temporal], which forces the model to predict consistently across label-preserving data augmentations of different intensity. Borrowing the concept from Mean Teacher [tarvainen2017mean], the model with frozen weights can be viewed as the teacher model and the other as the student model. Some methods [berthelot2019mixmatch], following Mean Teacher, define the teacher model as the EMA of the student model for further regularization. Pseudo-labeling [lee2013pseudo] is another popular class of SSL methods, which can also be treated as a kind of consistency regularization: one output on the unlabeled data is made consistent with the other (the pseudo-labels) by being supervised with it. To improve the quality of pseudo-labels, FixMatch [sohn2020fixmatch], a state-of-the-art SSL work on image classification, has shown that the student network can improve significantly by setting a classification confidence threshold and filtering out low-confidence predictions from the teacher. With the filtered pseudo-labels, the student model only gets supervised on the unlabeled data whose pseudo-labels are kept. Another key factor in the success of these methods is strong data augmentation, which has been shown to be crucial to many SSL works [sajjadi2016regularization, laine2016temporal, xie2019unsupervised]. Recent works [berthelot2019remixmatch, sohn2020fixmatch] proposed to adopt even more powerful augmentation such as RandAugment [cubuk2020randaugment] and Cutout [devries2017improved].

Semi-Supervised Object Detection

Since the beginning of the deep learning era, tremendous progress has been made in 2D object detection, e.g. region-based detectors [girshick2014rich, girshick2015fast, ren2015faster] and single-stage detectors [liu2016ssd, redmon2016you, tian2019fcos]. Similarly in 3D object detection, a number of deep learning methods have been proposed for different 3D data modalities, e.g. RGBD-based detectors [qi2018frustum, qi2020imvotenet], point-based detectors [yi2019gspn, shi2019pointrcnn, lang2019pointpillars, qi2019deep], voxel-based detectors [zhou2018voxelnet], point-voxel-based detectors [shi2020pv], etc.

Despite the great progress in both 2D and 3D object detection, most works focused on a fully-supervised setting. A few works [hoffman2014lsda, gao2019note] have proposed to leverage unlabeled data or weakly-annotated data for 2D object detection. Under a standard SSL setting as we follow, CSD [jeong2019consistency] proposed a consistency regularization method to enforce the consistency between predictions from an image and its flipped version. STAC [sohn2020simple] adopts a two-stage scheme for training Faster R-CNN [ren2015faster]: in the first stage it pre-trains a detector with labeled data only and then predicts the pseudo labels for the unlabeled data; in the second stage, STAC leverages asymmetric data augmentation and the pseudo-label filtering mechanism to remove object proposals with low confidence. Note that the pseudo-labels are only generated once at the end of the first stage.

The only prior work on semi-supervised point-based 3D object detection is SESS [zhao2020sess]. SESS is built upon VoteNet [qi2019deep] and adopts a two-stage training scheme. It leverages a mutual learning framework composed of an EMA teacher and a student, uses asymmetric data augmentation, and enforces three kinds of consistency losses between the teacher and student outputs. Although SESS brings noticeable improvements upon a vanilla VoteNet when using only a small portion of labeled data, we find its consistency regularization suboptimal, as it is uniformly enforced on all the student and teacher predictions. In this work, we instead propose to apply confidence-based filtering to improve the quality of pseudo-labels from the teacher predictions, and we are the first (in both 2D and 3D object detection) to introduce IoU estimation for localization filtering.

IoU Estimation

IoU estimation was first proposed in a 2D object detection work IoU-Net [jiang2018acquisition], which proposed an IoU head that runs in parallel to bounding box refinement and is differentiable w.r.t. bounding box parameters. IoU-Net adds an IoU estimation head to several off-the-shelf 2D detectors and uses IoU estimation instead of classification confidence to guide NMS, which improves the performance consistently over different backbones. Thanks to its differentiability, IoU-Net can perform IoU optimization on bounding box parameters for iterative refinement, which further brings noticeable performance improvement.

For 3D object detection, STD [yang2019std] follows IoU-Net by adding a simple IoU estimation branch in parallel with the box estimation branch and guiding NMS with the estimated IoU. This IoU estimation module design, unfortunately, is not suitable for IoU optimization, as the features fed to the IoU estimation branch are not differentiable w.r.t. the bounding box size. This is because STD concatenates the canonical coordinates w.r.t. the box center with the features of interior points to create the initial features for each bounding box proposal, which are unaware of the box size. Moreover, the bounding box location is regressed along with the IoU estimation, so the bounding box being estimated and optimized is the one before regression. This is the feature offset problem [zhu2019iou], which also affects IoU-Net. In contrast, our 3D IoU estimation module is simple yet effective, and it supports IoU optimization while avoiding the feature offset problem.

3 Method

In this section, we describe our solution in detail. After formulating the problem in 3.1, since our backbone detection module is based on VoteNet, we summarize it in 3.2. We then explain our newly devised 3D IoU estimation module in 3.3. These set the stage for a detailed description of our proposed 3DIoUMatch pipeline in 3.4. We further explain how we use the estimated 3D IoU for pseudo-label filtering and deduplication in 3.5. Finally, we illustrate how we leverage the pseudo-labels for supervision in 3.6.

3.1 Problem Definition

Given a 3D point cloud representation of a scene containing a set of objects, we aim at detecting the amodal oriented 3D bounding boxes of all objects in the scene, along with their semantic class labels. In particular, we are interested in accomplishing this task under challenging conditions of limited supervision, where we have access to a (small) set of labeled scenes and a set of unlabeled scenes. For a labeled scene, the label comprises the bounding box parameters and semantic class labels of all ground truth objects.

3.2 VoteNet for 3D Object Detection

VoteNet [qi2019deep] is a single-stage object detector for 3D point clouds. Built upon a PointNet++ [qi2017pointnet++] backbone, VoteNet first processes the input point cloud to generate a sub-sampled set of seed points enriched with high-dimensional features. Next, each seed point votes for the center of the object it belongs to, and the votes are grouped into clusters. Finally, each of the K vote clusters is aggregated to predict the parameters of a 3D bounding box, a corresponding objectness score, and a probability distribution over possible semantic classes. The bounding box parameters are its center location, scale, and orientation around the upright axis.

At train time, VoteNet jointly minimizes a weighted combination of the following target losses: vote coordinate regression, objectness score binary classification, box center regression, bin classification and residual regression for heading angle, scale regression, and category classification. We refer the readers to [qi2019deep] for a detailed description of each loss term.

At test time, VoteNet applies Non-Maximum Suppression (NMS) based on objectness score to remove duplicated bounding boxes. Here, we instead rely on a 3D IoU estimation module which we describe next.

Figure 2: The 3D IoU module takes as inputs the seed points with their features and a bounding box along with its predicted class label, and estimates the 3D IoU between the box and its maximally overlapping ground truth. The module constructs a regular 3D grid of virtual grid points spanning the bounding box. We then perform a 3D grid feature pooling that applies distance-weighted interpolation to propagate features from the seed points to the grid points. Then the local coordinates of these grid points, along with their features, are passed through a PointNet to regress class-aware 3D IoU. Finally, we use the input class label for output selection.

3.3 3D IoU Estimation Module

As will be made clear in the full pipeline description, a key contribution of this work is an IoU-guided filtering scheme. To facilitate the rejection of poorly localized proposals, as well as to guide deduplication and test-time refinement, we devise a new 3D IoU estimation module that is differentiable w.r.t. the bounding box parameters.

In detail, for each predicted bounding box, we wish to estimate its 3D IoU with respect to its corresponding ground-truth box. VoteNet does not have intermediate region proposals and only outputs bounding box parameters at the final stage. The features used for bounding box parameter regression are gathered from vote points in a fixed-radius ball around each vote cluster and are therefore unaware of the final bounding box prediction. So, unlike IoU-Net [jiang2018acquisition], which runs bounding box refinement and IoU estimation in parallel, we run them serially by pooling features again, specifically for 3D IoU estimation, using the final predicted bounding box.

This feature pooling layer takes a bounding box as input and should generate continuous features with respect to the change in bounding box parameters. Existing RoI pooling methods proposed in GSPN [yi2019gspn] and PointRCNN [shi2019pointrcnn] and 3D IoU module proposed in STD [yang2019std] simply set a hard cropping boundary at the bounding box surface, taking the point features inside the proposal and ignoring any points outside. These designs have poor differentiability and cause discontinuities whenever a change in the box parameters modifies the point population inside the box, thus are not suitable for 3D IoU optimization (see Table 3).

Here, for the first time, we devise a 3D pooling layer, 3D Grid Pooling, that is differentiable with respect to the change in all bounding box parameters. Inspired by RoIAlign [he2017mask] in 2D object detection, we propose to construct virtual grid points spanning the space of the bounding box and their features are obtained by distance-weighted interpolation from real points not restricted inside the box.

Network architecture

Taking as inputs the seed points, a predicted bounding box, and a predicted class label, our 3D IoU module estimates the largest 3D IoU between the predicted box and all ground truth bounding boxes. Following IoU-Net [jiang2018acquisition], the IoU estimation is class-aware.

To build a differentiable 3D IoU module, we first construct a grid that exactly spans the space of the predicted box and evenly divides its width, length, and height. For each grid point, we find its nearest neighbours among all seed points and interpolate their features, with weights determined by the L2 distance between the grid point and each neighbour. Ideally, if the number of neighbours equals the number of all seed points, the IoU module is continuously differentiable. Due to the computational cost, we instead use a smaller number of neighbours, which we empirically find sufficient for accurate 3D IoU estimation with smooth gradients. We then concatenate the local coordinates and the interpolated features of each grid point to form a grid feature set. The feature set is passed through a PointNet to predict class-aware 3D IoU. A final 3D IoU selection is performed using the input class label.
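The grid construction and feature interpolation described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the choice to place the outermost grid points on the box surface, and the inverse-distance weighting are all assumptions, and the grid resolution `n` and neighbour count `k` are left as free parameters.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def make_grid(center: Point, size: Point, heading: float, n: int) -> List[Point]:
    """Return n*n*n grid points evenly spanning an oriented, upright box."""
    # Fractions in [-0.5, 0.5]; the outermost points sit on the box surface (assumption).
    fracs = [i / (n - 1) - 0.5 for i in range(n)] if n > 1 else [0.0]
    c, s = math.cos(heading), math.sin(heading)
    grid = []
    for fx in fracs:
        for fy in fracs:
            for fz in fracs:
                lx, ly, lz = fx * size[0], fy * size[1], fz * size[2]
                # Rotate around the upright (z) axis, then translate to the box center.
                grid.append((center[0] + c * lx - s * ly,
                             center[1] + s * lx + c * ly,
                             center[2] + lz))
    return grid


def interpolate(grid_pt: Point, seeds: Sequence[Point],
                feats: Sequence[Sequence[float]], k: int,
                eps: float = 1e-8) -> List[float]:
    """Inverse-distance-weighted average of the k nearest seed features."""
    nearest = sorted((math.dist(grid_pt, s), i) for i, s in enumerate(seeds))[:k]
    weights = [1.0 / (d + eps) for d, _ in nearest]
    total = sum(weights)
    dim = len(feats[0])
    return [sum(w * feats[i][j] for w, (_, i) in zip(weights, nearest)) / total
            for j in range(dim)]
```

Because the neighbours are not restricted to the inside of the box, a small change in the box parameters moves the grid points smoothly, and the interpolated features change smoothly with them as long as the set of nearest neighbours stays the same.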

Training IoU Estimation Module

To train the 3D IoU estimation branch in both stages, we generate on-the-fly training data via jittering the bounding box predictions, i.e. adding Gaussian noise to the box center and size. As a way of data augmentation, this jittering is essential for the generalization of IoU estimation to unlabeled data. We use an L1 loss to supervise the IoU estimation module.
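As a rough illustration of how jittered training pairs could be generated, the sketch below perturbs a box with Gaussian noise and computes the IoU target against a ground-truth box. It is a simplification under stated assumptions: boxes are treated as axis-aligned (the heading angle is ignored), the noise scales `center_sigma` and `size_sigma` are hypothetical parameters, and all function names are illustrative.

```python
import random
from typing import Tuple

# (cx, cy, cz, dx, dy, dz): center and full extent along each axis.
Box = Tuple[float, float, float, float, float, float]


def iou_3d_axis_aligned(a: Box, b: Box) -> float:
    """3D IoU of two axis-aligned boxes (heading ignored for simplicity)."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis] - a[axis + 3] / 2, b[axis] - b[axis + 3] / 2)
        hi = min(a[axis] + a[axis + 3] / 2, b[axis] + b[axis + 3] / 2)
        inter *= max(0.0, hi - lo)
    union = a[3] * a[4] * a[5] + b[3] * b[4] * b[5] - inter
    return inter / union if union > 0 else 0.0


def jitter_box(box: Box, rng: random.Random,
               center_sigma: float, size_sigma: float) -> Box:
    """Add Gaussian noise to the box center and size (sizes kept positive)."""
    center = [c + rng.gauss(0.0, center_sigma) for c in box[:3]]
    size = [max(1e-3, s + rng.gauss(0.0, size_sigma)) for s in box[3:]]
    return tuple(center + size)


def l1_loss(pred_iou: float, target_iou: float) -> float:
    """L1 loss between the predicted IoU and the IoU target."""
    return abs(pred_iou - target_iou)
```

Each jittered box, paired with its IoU against the ground truth, gives one regression target for the IoU head, so a single prediction can yield many training samples spread across the IoU range.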

3.4 3DIoUMatch for SSL on 3D object detection

With the incorporation of 3D IoU module into VoteNet, we construct an IoU-aware VoteNet for SSL on 3D object detection. Our proposed solution is comprised of two training stages: a pre-training stage, where we train our IoU-aware VoteNet on the labeled data, followed by an SSL stage where the entire data is utilized by pseudo-labeling the unlabeled scenes.


We start by training our IoU-aware VoteNet in a supervised manner, using the labeled set. The training loss is the sum of the original VoteNet losses and the 3D IoU loss. Once converged, we clone the network to create a pair of student and teacher networks.

Semi-supervised training through a teacher-student framework.

We follow a teacher-student mutual learning framework [tarvainen2017mean] and train our networks on both labeled and unlabeled data. Each training batch contains a mixture of labeled samples and unlabeled samples.

For labeled samples, we supervise the student network using ground truth supervision (as done in the pre-training stage), whereas for unlabeled samples, the student network is supervised using pseudo-labels generated from the teacher network. The final loss is formed as:

$$ L = L_l + \lambda_u L_u $$

where $L_l$ is the supervised loss on labeled samples, $L_u$ is the pseudo-label loss on unlabeled samples, and $\lambda_u$ is the unsupervised loss weight.

To succeed in semi-supervised learning, it is crucial for the teacher network to generate high-quality pseudo-labels and maintain a reliable performance margin over the student network throughout the training. As commonly used in SSL literature, e.g. Mean Teacher [tarvainen2017mean] and SESS [zhao2020sess], we adopt an EMA teacher. We further leverage asymmetric data augmentation and pseudo-label filtering (see Sec. 3.5).
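For readers unfamiliar with EMA teachers, the update rule from Mean Teacher can be sketched as follows. Parameters are represented here as a plain dictionary of floats for brevity; the function name and the `decay` value used in the usage note are illustrative assumptions, not values taken from this paper.

```python
from typing import Dict


def ema_update(teacher: Dict[str, float], student: Dict[str, float],
               decay: float) -> None:
    """In-place exponential moving average of the student weights."""
    for name, s in student.items():
        # The teacher moves a small step toward the current student value.
        teacher[name] = decay * teacher[name] + (1.0 - decay) * s
```

Calling `ema_update(teacher, student, decay=0.99)` after every student optimizer step keeps the teacher as a smoothed copy of the student; the teacher itself receives no gradient updates.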

To give the teacher an advantage, the teacher network takes input data with weak augmentation only, while the student network uses stronger data augmentation. We share the same data augmentation strategy with SESS. The input point clouds to our teacher network are augmented only by random sub-sampling, while the inputs to the student network further undergo a set of stochastic transformations, including random flip, random rotation around the upright axis, and random uniform scaling.
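The asymmetric augmentation can be sketched as below. This is only an illustration: the function names, the choice of flipping along the x-axis, the order flip, rotate, scale, and the sampling ranges passed in by the caller are assumptions rather than the settings used by SESS or this paper.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, float, float]


def weak_augment(points: List[Point], num_samples: int,
                 rng: random.Random) -> List[Point]:
    """Teacher input: random sub-sampling only."""
    return rng.sample(points, num_samples)


def sample_strong_transform(rng: random.Random, max_angle: float,
                            scale_range: Tuple[float, float]
                            ) -> Tuple[bool, float, float]:
    """Draw (flip_x, rotation angle about the upright axis, uniform scale)."""
    flip_x = rng.random() < 0.5
    angle = rng.uniform(-max_angle, max_angle)
    scale = rng.uniform(scale_range[0], scale_range[1])
    return flip_x, angle, scale


def apply_transform(points: List[Point], flip_x: bool,
                    angle: float, scale: float) -> List[Point]:
    """Student input: flip, then rotate about z, then scale uniformly."""
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for x, y, z in points:
        if flip_x:
            x = -x
        x, y = c * x - s * y, s * x + c * y
        out.append((x * scale, y * scale, z * scale))
    return out
```

Keeping the sampled transform as an explicit tuple matters later, because the same transform has to be applied to the teacher's pseudo-label boxes before they can supervise the student.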

| Dataset | Model | 5% mAP@0.25 | 5% mAP@0.5 | 10% mAP@0.25 | 10% mAP@0.5 | 20% mAP@0.25 | 20% mAP@0.5 | 100% mAP@0.25 | 100% mAP@0.5 |
|---|---|---|---|---|---|---|---|---|---|
| ScanNet | VoteNet | 27.9±0.5 | 10.8±0.6 | 36.9±1.6 | 18.2±1.0 | 46.9±1.9 | 27.5±1.2 | 57.8 | 36.0 |
| ScanNet | SESS reported | - | - | 39.7±0.9 | 18.6 | 47.9±0.4 | 26.9 | 62.1 | 38.8 |
| ScanNet | SESS | 32.0±0.7 | 14.4±0.7 | 39.5±1.8 | 19.8±1.3 | 49.6±1.1 | 29.0±1.0 | 61.3 | 39.0 |
| ScanNet | Ours | 40.0±0.9 | 22.5±0.5 | 47.2±0.4 | 28.3±1.5 | 52.8±1.2 | 35.2±1.1 | 62.9 | 42.1 |
| ScanNet | Abs. improve. | +8.0 | +8.1 | +7.7 | +8.5 | +3.2 | +6.2 | +1.6 | +3.1 |
| SUN-RGBD | VoteNet | 29.9±1.5 | 10.5±0.5 | 38.9±0.8 | 17.2±1.3 | 45.7±0.6 | 22.5±0.8 | 58.0 | 33.4 |
| SUN-RGBD | SESS reported | - | - | 42.9±1.0 | 14.4 | 47.9±0.5 | 20.6 | 61.1 | 37.3 |
| SUN-RGBD | SESS | 34.2±2.0 | 13.1±1.0 | 42.1±1.1 | 20.9±0.3 | 47.1±0.7 | 24.5±1.2 | 60.5 | 38.1 |
| SUN-RGBD | Ours | 39.0±1.9 | 21.1±1.7 | 45.5±1.5 | 28.8±0.7 | 49.7±0.4 | 30.9±0.2 | 61.5 | 41.3 |
| SUN-RGBD | Abs. improve. | +4.8 | +8.0 | +3.4 | +7.9 | +2.6 | +6.4 | +1.0 | +3.2 |

Table 1: Comparison with VoteNet and SESS on the ScanNet val set and the SUN RGB-D val set under different ratios of labeled data. We report mAP@0.25 and mAP@0.5 as mean±standard deviation across 3 runs under different random data splits. Due to the randomness of the data splits and our better pre-training protocol, the SESS results provided by us are higher than those reported in the paper on mAP@0.5, and the mAP@0.25 results differ a little (the only difference is the pre-trained weights and data splits). The final improvement is the absolute improvement of our method over the SESS results provided by us. Following SESS, we also report the results with 100% labeled data, where we simply make a copy of the full dataset as unlabeled data and train our method.

3.5 Pseudo-Label Filtering and Deduplication

In the teacher-student framework, the performance gap between the teacher and the student is usually marginal, given that the two models differ only in the EMA applied to the weights and in data augmentation strength. Hence, it is not always true that the teacher prediction is more accurate than the student's on a specific training sample. On unlabeled data, the student model will only benefit from the pseudo-labels that are more accurate than its own predictions. Therefore, we should filter out low-quality predictions from the teacher model and only supervise the student model with the rest of the teacher model predictions.

Jointly filtering based on class, objectness, localization confidences

For VoteNet, we propose to set an objectness threshold and filter out bounding box predictions whose objectness score falls below it. We further propose to set a classification confidence threshold for filtering out predictions that are likely to contain a wrong class label.

Note that neither of these two confidence measures captures the accuracy of the bounding box parameter predictions. We therefore predict a 3D IoU for each predicted bounding box, use the estimated 3D IoU as a localization confidence, and set a localization threshold to filter out poorly localized predictions. Formally, we keep a prediction only if it passes all three confidence thresholds: objectness, classification, and localization.
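The joint filter amounts to a conjunction of three threshold tests, as the following sketch shows. The `Proposal` container, the parameter names `tau_obj`, `tau_cls`, and `tau_iou`, and the use of strict inequalities are illustrative assumptions; the threshold values are left to the caller.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Proposal:
    objectness: float         # predicted probability of being foreground
    class_probs: List[float]  # predicted distribution over semantic classes
    pred_iou: float           # IoU predicted by the 3D IoU module


def filter_proposals(proposals: List[Proposal], tau_obj: float,
                     tau_cls: float, tau_iou: float) -> List[Proposal]:
    """Keep only the proposals that pass all three confidence thresholds."""
    return [p for p in proposals
            if p.objectness > tau_obj
            and max(p.class_probs) > tau_cls
            and p.pred_iou > tau_iou]
```

A proposal with confident objectness and class scores but a low predicted IoU is dropped, which is exactly the case the first two criteria alone cannot catch.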

IoU-guided lower-half suppression for deduplication

After the confidence-based filtering, there are still many duplicated bounding box predictions that may introduce harmful noise into our pseudo-labels. NMS is a standard process in object detection for duplicate removal before evaluation: it takes a set of highly overlapping bounding box predictions that share the same class prediction, ranks them according to a confidence score, and removes all but the top-1 prediction. STAC [sohn2020simple] applies class-confidence-based NMS to teacher predictions during pseudo-label generation.

The default NMS used in VoteNet is based on objectness confidence. Given that the objectness score doesn't capture localization quality, a train-time IoU-guided NMS, which uses the product of predicted IoU and predicted objectness as the ranking metric, naturally performs better (see Table 2). However, keeping only the top box selected by IoU-guided NMS can still be suboptimal, since the predicted IoU will inevitably carry some error. We argue that, unlike at test time, pseudo-labels do not need to be fully deduplicated. Consider the following situation: a bounding box predicted by the student lies to the left of its corresponding ground truth, so it is a foreground object and would receive bounding box supervision in VoteNet. However, if the pseudo-label that survives non-maximum suppression happens to lie too far to the right of the ground truth, this predicted bounding box may lose supervision and be treated as a background box. This example shows that strict non-maximum suppression can reduce the number of student predictions that receive supervision. Since we cannot know the best pseudo-label among a set of highly overlapping ones, it is acceptable to be less strict. To this end, we propose a novel Lower-Half Suppression (LHS), which discards only the half of the overlapping proposals with lower predicted IoU. Since LHS suppresses bounding boxes sharing the same class label, this suppression can be seen as a second-step, class-aware, self-adjusted filtering, which sets dynamic thresholds among the overlapping bounding boxes to keep the ones with higher confidence and hence finds a better balance between pseudo-label quality and the amount of supervision. As in IoU-guided NMS, we use the product of predicted IoU and predicted objectness as the confidence metric.
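One plausible reading of Lower-Half Suppression is sketched below: proposals are processed greedily as in NMS, but within each cluster of same-class, highly overlapping boxes the upper half by score is kept instead of only the top one. The clustering procedure, the rounding for odd-sized clusters, the axis-aligned IoU (heading ignored), and all names are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple

# (cx, cy, cz, dx, dy, dz): center and full extent along each axis.
Box = Tuple[float, float, float, float, float, float]


def iou_3d_axis_aligned(a: Box, b: Box) -> float:
    """3D IoU of two axis-aligned boxes (heading ignored for simplicity)."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis] - a[axis + 3] / 2, b[axis] - b[axis + 3] / 2)
        hi = min(a[axis] + a[axis + 3] / 2, b[axis] + b[axis + 3] / 2)
        inter *= max(0.0, hi - lo)
    union = a[3] * a[4] * a[5] + b[3] * b[4] * b[5] - inter
    return inter / union if union > 0 else 0.0


def lower_half_suppression(boxes: List[Box], labels: List[int],
                           pred_ious: List[float], objectness: List[float],
                           overlap_thresh: float) -> List[int]:
    """Return indices of kept boxes; each cluster loses its lower-scoring half."""
    scores = [i * o for i, o in zip(pred_ious, objectness)]
    remaining = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    while remaining:
        top = remaining[0]
        # Same-class boxes that overlap heavily with the current top box.
        cluster = [i for i in remaining
                   if i == top or (labels[i] == labels[top] and
                                   iou_3d_axis_aligned(boxes[top], boxes[i]) > overlap_thresh)]
        # `cluster` inherits the score ordering, so its first half is the upper half.
        keep.extend(cluster[:(len(cluster) + 1) // 2])
        members = set(cluster)
        remaining = [i for i in remaining if i not in members]
    return sorted(keep)
```

With four heavily overlapping boxes of one class, this keeps the two highest-scoring ones, whereas standard NMS would keep only one. The effective score cut-off therefore adapts to each cluster rather than being a fixed global threshold.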

Final-step pseudo-label processing

After the filtering and IoU-guided LHS, we have a set of high-quality predictions from the teacher network. Because the student model inputs go through a stronger augmentation that includes an additional geometric transformation, the bounding box parameters of the pseudo-labels need to go through the same transformation to stay in sync with the student model inputs. We further convert the predicted class probability distribution into a semantic class label by taking the most probable class. This yields the filtered pseudo-labels.
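A minimal sketch of this final step is given below, assuming the student transformation consists of an optional flip along the x-axis, a rotation about the upright z-axis, and a uniform scaling, applied in that order. The heading is taken to be the counter-clockwise angle of the box's local x-axis in the ground plane; this convention, like the function names, is an assumption and may differ from the one used in VoteNet.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def transform_box(center: Vec3, size: Vec3, heading: float,
                  flip_x: bool, angle: float, scale: float
                  ) -> Tuple[Vec3, Vec3, float]:
    """Apply the student's geometric transformation to a pseudo-label box."""
    x, y, z = center
    if flip_x:
        # Mirroring x maps direction (cos h, sin h) to (cos(pi - h), sin(pi - h)).
        x = -x
        heading = math.pi - heading
    c, s = math.cos(angle), math.sin(angle)
    x, y = c * x - s * y, s * x + c * y
    heading = (heading + angle) % (2.0 * math.pi)
    new_center = (x * scale, y * scale, z * scale)
    new_size = (size[0] * scale, size[1] * scale, size[2] * scale)
    return new_center, new_size, heading


def to_class_label(class_probs: List[float]) -> int:
    """Convert a class probability distribution into a hard label (argmax)."""
    return max(range(len(class_probs)), key=class_probs.__getitem__)
```

Only the center, size, and heading change under the transformation; the hard class label is unaffected by it.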

3.6 Selective Supervision using Pseudo-Labels

For our generated pseudo-labels, there is no guarantee that the labels cover all the ground truth objects in the scene, due to the filtering and potentially inaccurate teacher predictions. Given this incompleteness, we are relatively confident about the bounding boxes in the filtered set, but student predictions far away from all of our pseudo-labels are not necessarily negative. Our experiments show that supervising objectness on unlabeled data using the pseudo-labels seriously hurts performance. For similar reasons, we do not supervise the vote loss, which is unique to VoteNet and does not appear in other detectors. For more analysis and experimental evidence, we refer the readers to the supplementary materials. We therefore only supervise the bounding boxes in the vicinity of the pseudo bounding boxes and aim to improve their bounding box quality. More specifically, we follow the way VoteNet selects foreground objects for bounding box parameter supervision: we supervise bounding box parameters and class for a prediction only if the vote that generates this prediction is within a distance threshold of any bounding box in the pseudo-labels. For this set of pseudo-foreground predictions, we establish associations in the same way as VoteNet and enforce the original VoteNet losses except for the objectness loss and the vote loss.
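The selection rule can be sketched as a mask over student predictions. In this illustration the distance is measured from each vote to the pseudo-label box centers, and `radius` is a free parameter; both choices, along with the function names, are assumptions made for the sketch.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def pseudo_foreground_mask(votes: Sequence[Point],
                           pseudo_centers: Sequence[Point],
                           radius: float) -> List[bool]:
    """True for predictions whose vote lies within `radius` of a pseudo-label."""
    return [any(math.dist(v, c) <= radius for c in pseudo_centers)
            for v in votes]


def selective_mean_loss(per_prediction_loss: Sequence[float],
                        mask: Sequence[bool]) -> float:
    """Average the box/class loss over the selected predictions only."""
    selected = [l for l, m in zip(per_prediction_loss, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0
```

Predictions outside the mask contribute nothing to the unlabeled loss, so they are neither pushed toward a pseudo-label nor penalized as background.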

4 Experiments

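| Train-time filtering | Train-time dedup. | Test-time IoU opt. | Test-time NMS | ScanNet 10% mAP@0.25 | ScanNet 10% mAP@0.5 | SUN-RGBD 5% mAP@0.25 | SUN-RGBD 5% mAP@0.5 |
|---|---|---|---|---|---|---|---|
| none | - | - | Obj-NMS | 38.4 | 19.8 | 32.9 | 12.5 |
| cls + obj | - | - | Obj-NMS | 44.5 | 24.7 | 36.9 | 17.5 |
| cls + obj | Obj-NMS | - | Obj-NMS | 44.2 | 25.2 | 37.1 | 17.4 |
| cls + obj + IoU | IoU-NMS | - | Obj-NMS | 45.9 | 26.8 | 37.4 | 18.7 |
| cls + obj + IoU | IoU-LHS | - | Obj-NMS | 46.5 | 26.9 | 37.9 | 18.5 |
| cls + obj + IoU | IoU-LHS | - | IoU-NMS | 47.0 | 28.2 | 38.8 | 20.8 |
| cls + obj + IoU | IoU-LHS | ✓ | IoU-NMS | 47.2 | 28.3 | 39.0 | 21.1 |

Table 2: Effects of the different components, including train-time filtering and deduplication, and test-time improvements.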

4.1 Datasets and Evaluation Metrics

We evaluate our 3DIoUMatch on ScanNet [dai2017scannet] and SUN RGB-D [song2015sun]. ScanNet is an indoor dataset consisting of 1,513 reconstructed meshes, among which 1,201 are training samples and the rest are validation samples. SUN RGB-D contains 10,335 RGB-D images of indoor scenes, which are split into 5,285 training samples and 5,050 validation samples. For both datasets, we follow [qi2019deep, zhao2020sess] for preprocessing data and labels to train our method, and we report mAP@0.25 (mean average precision with 3D IoU threshold 0.25) and mAP@0.5 in the following experiments.

4.2 Implementation Details


For the pre-training stage, we train with a batch size of 8 and follow the same data augmentation as SESS [zhao2020sess]. We then use those pre-trained weights to initialize the student and teacher networks. For the SSL stage, we construct each batch by taking 4 labeled samples and 8 unlabeled samples, with the same data augmentation. The weights of different loss terms (e.g. center regression loss, size regression loss, etc.) are the same as VoteNet and we set . The student network is trained for 1000 epochs (the labeled data is traversed in one epoch), optimized by an Adam optimizer with an initial learning rate of 0.002, and the learning rate is decayed by 0.3, 0.3, 0.1, 0.1 at the 400th, 600th, 800th and 900th epoch, respectively. The number of generated 3D proposals is 128. We use for the IoU module. The three thresholds are set to be . For more details, we refer the readers to the supplementary materials.
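The learning rate schedule can be written as a small step function. The sketch assumes that the listed decay factors are multiplicative and cumulative, and that each factor takes effect from the start of its milestone epoch (0-indexed); the text does not state either detail, so both are assumptions, as is the function name.

```python
from typing import Sequence


def learning_rate(epoch: int, base_lr: float = 0.002,
                  milestones: Sequence[int] = (400, 600, 800, 900),
                  factors: Sequence[float] = (0.3, 0.3, 0.1, 0.1)) -> float:
    """Step schedule: multiply by each factor once its milestone is reached."""
    lr = base_lr
    for milestone, factor in zip(milestones, factors):
        if epoch >= milestone:
            lr *= factor
    return lr
```

Under these assumptions the learning rate is 0.002 until epoch 400, then 6e-4, 1.8e-4, 1.8e-5, and finally 1.8e-6 from epoch 900 onward.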


At test time, we forward the input to the student network to generate proposals. We first apply IoU optimization to refine the bounding box parameters following IoU-Net [jiang2018acquisition], followed by an IoU-guided NMS with a 3D IoU threshold of 0.25.
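The idea behind IoU optimization is gradient ascent on the predicted IoU with respect to the box parameters. The sketch below illustrates it with central finite differences standing in for backpropagation through the differentiable IoU module; `iou_estimator` is a hypothetical callable, and the step size and iteration count are free parameters rather than values from this paper.

```python
from typing import Callable, List, Sequence


def refine_box(box: Sequence[float],
               iou_estimator: Callable[[Sequence[float]], float],
               step: float, iters: int, h: float = 1e-4) -> List[float]:
    """Gradient ascent on the estimated IoU w.r.t. the box parameters."""
    params = list(box)
    for _ in range(iters):
        grad = []
        for j in range(len(params)):
            up, down = params.copy(), params.copy()
            up[j] += h
            down[j] -= h
            # Central difference approximates d(IoU)/d(param_j).
            grad.append((iou_estimator(up) - iou_estimator(down)) / (2.0 * h))
        params = [p + step * g for p, g in zip(params, grad)]
    return params
```

This only works if the estimator responds smoothly to changes in the box parameters, which is why a pooling layer with hard cropping boundaries is unsuitable for this kind of refinement.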

4.3 Result Comparison

Table 1 shows the results of our method compared to SESS and VoteNet under different ratios of labeled data on ScanNet and SUN RGB-D, respectively. The results illustrate that, with our effective train-time filtering and test-time improvement leveraging IoU estimation, we significantly outperform the current state of the art, SESS, under all labeled ratio settings. With 5% labeled data, our method outperforms SESS by 8.1 and 8.0 on mAP@0.5 on ScanNet and SUN RGB-D, respectively. Note that our method gains more improvement on mAP@0.5, thanks to the high quality of pseudo-labels and the IoU guidance for test-time NMS.

4.4 Ablation Study

Filtering and Deduplication Mechanism.

We study the effect of each component of the filtering and deduplication mechanism. In Table 2, the second row shows the results of naive pseudo-labeling, which takes all predictions from the teacher model for supervision. As expected, the results are unsatisfying, only slightly higher than VoteNet. Simply applying the dual filtering of classification and objectness confidence gives a significant improvement, as the filtering picks out the teacher model proposals that are very likely to be close to true objects and have the correct class. The conventional objectness-based NMS in VoteNet, however, fails to improve further, since the remaining proposals already have high objectness scores and the objectness-based NMS cannot pick out the ones with higher localization accuracy.

As shown in the fifth and sixth rows, after we introduce IoU during train time, IoU filtering and train-time IoU-guided NMS contribute to better performance under both settings. Our proposed IoU-guided LHS improves over IoU-guided NMS on mAP@0.25, since LHS finds a better balance between quality and coverage. With better filtering and deduplication leveraging IoU estimation during train time, we gain 2.3 and 1.7 absolute improvement over the without-IoU version on mAP@0.25 and mAP@0.5, respectively, on ScanNet 10%. This verifies that considering localization confidence is important for getting high-quality pseudo-labels. With test-time improvements, our method gains in total 3.0 and 3.1 absolute improvement, respectively.

We set both the classification and objectness confidence thresholds to 0.9 following STAC [sohn2020simple] and investigate the effect of different IoU thresholds on ScanNet 10%, as shown in Figure 3. The performance (with test-time improvements) is higher than the without-IoU baseline by large margins when . Note that the performance peaks at for mAP@0.25 while peaking at 0.5 for mAP@0.5, simply because mAP@0.5 prefers a stronger filtering on localization quality. When , further increasing may lead to a drastic drop in pseudo-label coverage and hence is detrimental to the performance.
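The confidence-based filtering can be sketched as three threshold tests applied to each teacher proposal. This is a minimal sketch: the dictionary keys, the function name, and the use of strict inequalities are assumptions, and the IoU threshold is left as a required argument because it is the quantity being swept in Figure 3.

```python
def filter_pseudo_labels(proposals, iou_thresh, cls_thresh=0.9, obj_thresh=0.9):
    """Keep teacher proposals that pass all three confidence tests.

    Each proposal is a dict with hypothetical keys 'cls_conf',
    'objectness' and 'pred_iou', all in [0, 1].
    """
    return [
        p for p in proposals
        if p["cls_conf"] > cls_thresh
        and p["objectness"] > obj_thresh
        and p["pred_iou"] > iou_thresh
    ]


proposals = [
    {"cls_conf": 0.95, "objectness": 0.97, "pred_iou": 0.60},  # passes all tests
    {"cls_conf": 0.95, "objectness": 0.50, "pred_iou": 0.90},  # low objectness
    {"cls_conf": 0.99, "objectness": 0.99, "pred_iou": 0.10},  # poorly localized
]
# The IoU threshold 0.25 here is only an illustrative value.
print(len(filter_pseudo_labels(proposals, iou_thresh=0.25)))
```

The third proposal shows the case the IoU test is meant to catch: a box that is confidently classified and confidently an object, yet poorly localized.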

Test-time IoU-guided NMS and IoU optimization.

We then evaluate the improvement brought by using IoU estimations at test time. The last two rows in Table 2 show that IoU-guided NMS and IoU optimization improve the performance further.

As mentioned in 3.3, some region-based 3D detectors, e.g. STD [yang2019std], crop the features inside a predicted bounding box and regress the offset. To capture their core characteristics under IoU optimization, we build a simple IoU estimation module, which we call box query, that only queries points inside the predicted box and passes the queried point features through a PointNet to predict the 3D IoU. In principle, the differentiability of this module is the same as that in STD, whose authors have not released code and whose paper does not include an IoU optimization step. For a fair comparison, we train another IoU-aware VoteNet with box query as the IoU estimation module and compare it with our proposed method on the full sets of ScanNet and SUN RGB-D. The results in Table 3 show that our method is more effective than box query for both IoU-guided NMS and IoU optimization.

Method ScanNet mAP@0.25 ScanNet mAP@0.5 SUN RGB-D mAP@0.25 SUN RGB-D mAP@0.5
Obj-NMS [qi2019deep] 57.84 35.99 58.01 33.44
Box query, IoU-NMS only 57.56 37.07 58.16 34.81
Box query, IoU-NMS + Optim. 57.62 37.17 58.19 34.90
Ours, IoU-NMS only 57.92 37.01 58.82 36.22
Ours, IoU-NMS + Optim. 58.46 37.43 59.11 36.71
Table 3: Comparison of our IoU module with box query on ScanNet 100% and SUN RGB-D 100%.

4.5 Result Analysis

In this section, we examine how 3DIoUMatch behaves during training on ScanNet 10%. The upper two curves in Figure 4 show that, as training proceeds, the performance on unlabeled data and on test data increases in step. The increasing performance on unlabeled data indicates the increasing quality of pseudo-labels. We also show how the coverage of the pseudo-labels on the unlabeled data changes over the course of training. Here, coverage at a certain threshold simply means the class-agnostic recall: the percentage of ground-truth objects that can find a pseudo-label with an IoU larger than the threshold. As we can see from the lower two curves in Figure 4, at the beginning the coverage of the pseudo-labels is relatively low due to the strict filtering mechanism. As semi-supervised learning goes on, the improving detection performance leads to a higher passing rate of the filter and hence a higher coverage of the pseudo-labels, which in turn fuels the semi-supervised learning. By the end of training, the coverage at 0.25 and at 0.5 both increase by about 10%.
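The coverage metric described above can be computed directly from pairwise IoUs. The sketch below assumes a precomputed matrix `iou_matrix[g][p]` holding the IoU between ground-truth object `g` and pseudo-label `p` in a scene; the function name and the input layout are hypothetical.

```python
def coverage(iou_matrix, thresh):
    """Class-agnostic recall: fraction of ground-truth objects that have
    at least one pseudo-label with IoU larger than `thresh`."""
    if not iou_matrix:
        return 0.0
    covered = sum(1 for row in iou_matrix if any(v > thresh for v in row))
    return covered / len(iou_matrix)


# Three ground-truth objects, two pseudo-labels.
ious = [
    [0.6, 0.1],
    [0.3, 0.0],
    [0.1, 0.2],
]
print(coverage(ious, 0.25), coverage(ious, 0.5))
```

In this toy scene two of three objects are covered at threshold 0.25 and one of three at threshold 0.5, mirroring the gap between the two lower curves in Figure 4.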

Figure 3: 3DIoUMatch results with different IoU thresholds on ScanNet 10%.
Figure 4: The performance improvements and pseudo-label coverage over the semi-supervised learning stage on ScanNet 10%.

4.6 Limitations

As discussed in 3.6, we do not supervise objectness loss and vote loss on unlabeled data due to the uncertainty about negative samples. Also, we do not supervise the 3D IoU module on unlabeled data. We expect further performance improvements if the unlabeled data can be utilized to improve their training.

5 Conclusion

In this paper, we propose 3DIoUMatch, a novel semi-supervised 3D object detection method leveraging IoU estimation. Built upon a teacher-student mutual learning framework, we use an EMA teacher, asymmetric data augmentation, and pseudo-label filtering and deduplication to make the student effectively learn from the teacher. With our IoU estimation module, we make filtering and deduplication aware of localization confidence and apply test-time IoU-guided NMS and IoU optimization, leading to further improvement. Experimental results on two real-world datasets verify the effectiveness of our method, and we achieve significant gains over the prior art under each setting. We believe our idea of leveraging IoU estimation is generally helpful and can be coupled with different kinds of 2D and 3D object detectors to improve their semi-supervised learning.

Acknowledgement: This research is supported by a grant from the SAIL-Toyota Center for AI Research, NSF grant CHS-1528025, a Vannevar Bush Faculty fellowship, and gifts from the Adobe, Amazon AWS, and Snap corporations.

Appendix A Implementation Details

Network Design

Our IoU-aware VoteNet shares the same structure as VoteNet [qi2019deep] except for the IoU estimation module. We provide a more detailed description of the IoU estimation module here. The IoU estimation module is appended after the proposal generation module of VoteNet and takes the bounding box proposals as input. For each bounding box proposal, we create virtual grid points. We obtain the relative coordinates of the grid points by subtracting the coordinates of the bounding box center. For every grid point we find its nearest neighbours among all seed points and interpolate their features to get , where and is the L2 distance. The interpolated features of every grid point are then concatenated with the relative coordinates and forwarded into an MLP with channel dimensions of [256+3, 128, 128, 128] to learn a new feature. The features of all grid points then go through a global max pooling, and the pooled feature goes through another MLP with channels [128, 128, 128, C], where C is the number of classes, to make the IoU prediction class-aware. Finally, we select the per-box IoU estimation by using the class label (during training) or class prediction (during inference).
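The per-grid-point input to the first MLP can be sketched as follows. This is a rough illustration under stated assumptions: the number of neighbours `k` and the inverse-distance weighting are placeholders in the style of PointNet++ feature propagation, not values or formulas taken from the text, and both function names are hypothetical.

```python
import math


def interpolate_feature(query, seed_xyz, seed_feat, k=3, eps=1e-8):
    """Interpolate seed features at `query` from its k nearest seeds.

    Assumption: weights are inverse L2 distances, normalized to sum to 1.
    """
    nearest = sorted((math.dist(query, s), i) for i, s in enumerate(seed_xyz))[:k]
    weights = [1.0 / (d + eps) for d, _ in nearest]
    total = sum(weights)
    dim = len(seed_feat[0])
    return [
        sum(w * seed_feat[i][c] for w, (_, i) in zip(weights, nearest)) / total
        for c in range(dim)
    ]


def grid_point_input(grid_xyz, box_center, feature):
    """Concatenate the interpolated feature with box-relative coordinates."""
    relative = [g - c for g, c in zip(grid_xyz, box_center)]
    return list(feature) + relative


seeds = [(1, 0, 0), (-1, 0, 0), (10, 0, 0)]
feats = [[0.0], [2.0], [100.0]]
f = interpolate_feature((0, 0, 0), seeds, feats, k=2)
print(grid_point_input((0, 0, 0), (0.5, 0, 0), f))
```

With 256-dimensional seed features, the concatenated vector has 256+3 entries, which matches the input width of the first MLP described above. Because the grid points are defined from the box center and size and are not restricted to points already inside the box, the prediction can vary smoothly with the box parameters.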


For the pre-training stage, we find that the network does not converge using the same protocol as fully-supervised VoteNet. We instead use a new protocol, where the network is trained for 900 epochs, optimized by an Adam optimizer with an initial learning rate of 0.001, and the learning rate is decayed by 0.1, 0.1, 0.1 at the 400th, 600th and 800th epoch, respectively. We observe convergence using this protocol on all ratios of labeled data.

Inspired by IoU-Net [jiang2018acquisition], for both stages, we generate on-the-fly training data for the IoU estimation module by jittering the bounding box predictions. Specifically, we add to each bounding box size prediction and add to each bounding box center prediction to obtain more training samples. The final IoU estimation loss is the L1 loss averaged over all IoU training samples, whether original predictions or jittered ones. The IoU estimation loss weight is 1.
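The jittering and the loss can be sketched as below. The noise model is an assumption made for illustration only: the noise magnitudes are left as required arguments, and applying uniform noise scaled by the box size is a guess rather than the distribution used in the experiments. The function names are hypothetical.

```python
import random


def jitter_box(center, size, center_noise, size_noise, rng):
    """Return a perturbed copy of a box (assumed noise model, see lead-in)."""
    new_center = [c + rng.uniform(-center_noise, center_noise) * s
                  for c, s in zip(center, size)]
    new_size = [s * (1.0 + rng.uniform(-size_noise, size_noise)) for s in size]
    return new_center, new_size


def l1_iou_loss(pred_ious, target_ious):
    """Mean absolute error between predicted and ground-truth IoUs."""
    return sum(abs(p - t) for p, t in zip(pred_ious, target_ious)) / len(pred_ious)


rng = random.Random(0)
center, size = jitter_box([0.0, 0.0, 0.0], [1.0, 2.0, 3.0], 0.1, 0.1, rng)
print(center, size)
print(l1_iou_loss([0.5, 0.8], [0.7, 0.8]))
```

Each jittered box is paired with its true IoU against the ground-truth box, so the module sees a wider range of localization quality than the raw predictions alone would provide.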


As IoU-Net [jiang2018acquisition] did not release code, we implemented a simple version of test-time IoU optimization.

  1. We obtain the original bounding box proposals.

  2. We calculate the gradients of the IoU estimation w.r.t. the bounding box size and center, , and update the bounding box size and center by adding to the box size and center, respectively, where is the optimization step size.

  3. We repeat the second step for times.

We find setting to 10 yields noticeable improvement while not slowing inference speed too much. Choosing from the range of has similar performance.
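The three steps above amount to gradient ascent on the predicted IoU. The sketch below shows the loop with a central finite-difference gradient standing in for backpropagation through the IoU module, and a toy smooth estimator standing in for the network; both substitutions are assumptions made so the example runs without a deep learning framework. The step size is left as a required argument, and all names are hypothetical.

```python
import math


def numerical_grad(f, params, h=1e-4):
    """Central finite-difference gradient of scalar f at params."""
    grad = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def optimize_box(iou_estimator, box, step_size, num_steps=10):
    """Gradient ascent on the estimated IoU w.r.t. (cx, cy, cz, dx, dy, dz)."""
    box = list(box)
    for _ in range(num_steps):
        grad = numerical_grad(iou_estimator, box)
        box = [b + step_size * g for b, g in zip(box, grad)]
    return box


# Toy estimator: peaks when the box equals a hidden target box.
target = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
toy_iou = lambda b: math.exp(-sum((x - t) ** 2 for x, t in zip(b, target)))

start = [0.5, 0.0, 0.0, 1.0, 1.0, 1.0]
refined = optimize_box(toy_iou, start, step_size=0.1)
print(toy_iou(start), toy_iou(refined))
```

The loop only helps if the estimator is differentiable with respect to both center and size, which is the property the grid-point design of the IoU module is meant to provide.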

Appendix B Overhead of the IoU module

Our lightweight IoU estimation module adds moderate overhead to the network, as shown in Table 4. The memory reported in the second column is the memory consumed by training with batch size 8 on a single GTX 1080Ti GPU. The last two columns give the time consumed by a full pass (forward and backward) over a batch of 8 on a single GTX 1080Ti GPU, when training on ScanNet and SUNRGB-D respectively. Note that regardless of the network design, there is overhead introduced by calculating the ground-truth IoU for supervision.

Method Mem. (GB) ScanNet (s) SUNRGB-D (s)
VoteNet 6.56 0.282 0.316
Ours 6.60 0.325 0.377
Table 4: Memory and time overhead of the IoU module.
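As a quick reading aid for Table 4, the relative overhead can be computed from the reported numbers. This short sketch only restates the table; the dictionary keys are hypothetical labels.

```python
def relative_overhead(baseline, ours):
    """Relative increase over the baseline, in percent."""
    return (ours - baseline) / baseline * 100.0


table4 = {
    "memory (GB)": (6.56, 6.60),
    "ScanNet time (s)": (0.282, 0.325),
    "SUNRGB-D time (s)": (0.316, 0.377),
}
for name, (votenet, ours) in table4.items():
    print(f"{name}: +{relative_overhead(votenet, ours):.1f}%")
```

The memory overhead is under 1%, while the per-batch time grows by roughly 15% on ScanNet and 19% on SUNRGB-D.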

Method Metric cab bed chair sofa table door wind bkshf pic cntr desk curt fridg showr toil sink bath ofurn
VoteNet mAP@0.25 17.9 74.7 74.5 75.3 45.6 18.3 11.7 21.7 0.7 28.4 49.4 21.5 23.2 18.5 79.6 25.7 66.3 11.7
SESS mAP@0.25 20.5 75.1 76.2 76.4 48.1 20.0 14.4 19.4 1.2 30.0 51.8 25.0 30.0 26.4 82.2 29.2 72.3 14.1
Without IoU mAP@0.25 22.6 79.5 77.8 77.8 49.6 25.4 18.6 27.7 3.3 41.4 56.2 27.4 30.4 53.6 81.3 28.5 74.5 18.8
3DIoUMatch mAP@0.25 26.6 82.6 80.9 83.3 52.1 28.0 19.9 29.4 3.7 45.0 61.9 29.2 34.1 51.2 85.7 32.3 82.8 21.5
VoteNet mAP@0.5 3.2 64.6 43.4 49.3 25.1 2.8 1.1 8.7 0.0 2.4 14.7 3.9 7.6 1.1 46.8 11.9 39.4 1.5
SESS mAP@0.5 3.7 61.2 48.0 44.8 29.5 3.2 2.8 8.4 0.2 7.5 19.2 5.0 12.2 1.8 48.7 15.3 40.8 2.2
Without IoU mAP@0.5 3.9 66.1 52.7 50.7 35.1 7.9 5.0 13.1 0.9 14.5 26.1 10.3 17.5 7.0 63.9 11.7 62.1 4.9
3DIoUMatch mAP@0.5 5.9 72.0 60.5 56.6 39.7 10.3 5.2 18.1 0.7 16.0 35.3 8.3 21.4 6.2 67.5 13.2 67.6 5.2

Table 5: Per class mAP@0.25 and mAP@0.5 on ScanNet val set, with 10% labeled data.

Method Metric bathtub bed bookshelf chair desk dresser nightstand sofa table toilet
VoteNet mAP@0.25 67.8 32.2 39.4 58.5 53.5 8.0 1.9 14.7 3.2 20.3
SESS mAP@0.25 70.8 34.7 41.9 60.4 63.0 9.8 3.7 25.2 4.0 28.0
Without IoU mAP@0.25 75.1 33.5 43.0 59.5 76.9 6.8 5.1 33.0 3.5 34.8
3DIoUMatch mAP@0.25 75.4 37.7 45.2 64.2 77.0 6.0 5.7 34.6 4.5 39.4
VoteNet mAP@0.5 31.2 6.2 15.5 29.6 14.6 0.5 0.2 2.0 0.3 5.2
SESS mAP@0.5 36.7 7.2 19.2 31.8 20.4 0.7 0.5 7.0 0.4 7.1
Without IoU mAP@0.5 41.5 9.7 25.7 34.5 40.8 0.8 0.8 8.3 0.8 11.4
3DIoUMatch mAP@0.5 45.2 14.4 27.8 43.6 47.2 0.8 1.9 15.7 0.6 13.4

Table 6: Per class mAP@0.25 and mAP@0.5 on SUNRGB-D val set, with 5% labeled data.

Appendix C More on IoU Module Comparison

We provide more explanation of why an IoU estimation network design like that in STD [yang2019std] is less effective in IoU estimation and is not differentiable. Given a bounding box proposal, STD concatenates the canonized coordinates and features of the points inside the bounding box to form new features of the points. Therefore, the new feature of a point can be denoted as a function of the point coordinates , the original point feature , the bounding box center and the bounding box heading angle . Then STD voxelizes the bounding box and samples points in each voxel to produce the voxel feature . The process of producing the voxel feature from the points in a voxel involves no parameters other than the point features and point coordinates , so is still a function of . As all voxel features are flattened and fed to an MLP, which outputs the final IoU, we can conclude that the IoU estimation is not differentiable w.r.t. the bounding box size.

We also argue that for VoteNet, since the number of seed points with features is small (1024), box query methods may have difficulty querying points inside a bounding box, especially if the bounding box is small. Our method, by contrast, does not suffer from this, as we are not confined to points inside the bounding box.

Although STD did not release code, we still implemented an IoU estimation module according to the paper for a better comparison. However, some caveats need to be stated. First, since the backbone of STD is very different from VoteNet, comparing the IoU estimation modules alone is inherently problematic. Second, STD aims at outdoor object detection, where the task is slightly different. Third, we adopted most of the parameters of STD from the paper, but changed the number of voxels (to 27) and the number of points sampled per voxel (to 6) due to memory concerns and the small number of seed points in VoteNet. The results in Table 7 show the better performance of our IoU module. We also observe serious overfitting using the STD IoU module, suggesting that it may not be suitable for our problem.

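Module Step ScanNet 100% mAP@0.25 ScanNet 100% mAP@0.5 SUNRGB-D 100% mAP@0.25 SUNRGB-D 100% mAP@0.5
Ours IoU-NMS +0.08 +1.02 +0.81 +2.78
Ours IoU Opt. +0.54 +0.42 +0.29 +0.49
STD IoU-NMS -0.03 +0.22 +0.20 +1.32
STD IoU Opt. +0.04 +0.00 +0.00 -0.01
Table 7: Effectiveness of the IoU module in STD compared with our method.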
Figure 5: Qualitative results on ScanNet, with 10% labeled data. Here green bounding boxes have an IoU while red bounding boxes are with an IoU .
Figure 6: Qualitative results on SUNRGB-D, with 5% labeled data.

Appendix D Why not Supervise Votes and Objectness?

As we mentioned, we supervise all VoteNet loss terms on unlabeled data except for the vote regression loss and the objectness binary classification loss. We observe that supervising votes or objectness with pseudo-labels degrades performance. The main reason is that, after rigorous filtering and deduplication, we can only be highly confident that a true object is close to a pseudo bounding box; we cannot tell whether there is a true object where no pseudo bounding box is nearby. If we supervised objectness on unlabeled data with the pseudo-labels in the same way as VoteNet, regions containing undetected true objects would be treated as negatives, and the network would become increasingly biased against detecting objects. In Table 8, our experiments on ScanNet 10% and SUNRGB-D 5% show that the performance drops after supervising objectness on unlabeled data.

Vote prediction is a unique component of VoteNet. For a point, the label for its vote is the center of the object it belongs to. To generate pseudo vote labels, the straightforward way is to count every point inside a pseudo bounding box as a vote. However, since this pseudo vote label set is also far from complete, we face a similar problem when supervising with it. In Table 8, our experiments on ScanNet 10% and SUNRGB-D 5% also show that the performance drops after supervising vote prediction on unlabeled data.
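The straightforward pseudo vote labeling described above can be sketched as follows. It assumes axis-aligned pseudo boxes stored as `(cx, cy, cz, dx, dy, dz)` and assigns a point to the first box that contains it; both choices, and the function name, are assumptions for illustration.

```python
def pseudo_vote_labels(points, pseudo_boxes):
    """For each point, return the center of the pseudo box containing it,
    or None when the point lies in no pseudo box."""
    labels = []
    for p in points:
        vote = None
        for box in pseudo_boxes:
            center, size = box[:3], box[3:6]
            if all(abs(p[i] - center[i]) <= size[i] / 2 for i in range(3)):
                vote = tuple(center)
                break
        labels.append(vote)
    return labels


points = [(0.1, 0.0, 0.0), (3.0, 3.0, 3.0)]
boxes = [(0.0, 0.0, 0.0, 1.0, 1.0, 1.0)]
print(pseudo_vote_labels(points, boxes))
```

The second point receives no label, yet it may still belong to a real object whose pseudo box was filtered out. Treating such points as non-voting is exactly the incompleteness problem that makes this supervision harmful.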

Method ScanNet 10% mAP@0.25 ScanNet 10% mAP@0.5 SUNRGB-D 5% mAP@0.25 SUNRGB-D 5% mAP@0.5
3DIoUMatch 47.2 28.3 39.0 21.1
+vote sup. on unlabeled 45.4 28.3 37.9 20.9
+obj. sup. on unlabeled 40.1 26.0 38.2 20.4
Table 8: Objectness & vote supervision on unlabeled data using pseudo-labels.

Appendix E Per-class Evaluation

We report per-class average precision on ScanNet with 10% labeled data and SUNRGB-D with 5% labeled data, respectively. The bold numbers are the highest per class. The results in Tables 5 and 6 show that our method improves the average precision on nearly all classes over SESS. Our 3DIoUMatch also has better performance on most classes than the without-IoU version.

Appendix F Qualitative Results

We show qualitative results on the ScanNet val set with 10% labeled training data in Figure 5, and on the SUNRGB-D val set with 5% labeled training data in Figure 6. For the results of our method, SESS and VoteNet, green bounding boxes are the predicted bounding boxes whose IoU , and the red bounding boxes are those with an IoU . As can be seen in both figures, our method gives more accurate predictions and significantly reduces the number of false positives.