Transferable Semi-supervised 3D Object Detection from RGB-D Data

04/23/2019 · Yew Siang Tang et al., National University of Singapore

We investigate the direction of training a 3D object detector for new object classes from only 2D bounding box labels of these new classes, while simultaneously transferring information from 3D bounding box labels of the existing classes. To this end, we propose a transferable semi-supervised 3D object detection model that learns a 3D object detector network from training data with two disjoint sets of object classes: a set of strong classes with both 2D and 3D box labels, and another set of weak classes with only 2D box labels. In particular, we suggest a relaxed reprojection loss, a box prior loss and a Box-to-Point Cloud Fit network that allow us to effectively transfer useful 3D information from the strong classes to the weak classes during training, and consequently, enable the network to detect 3D objects of the weak classes during inference. Experimental results show that our proposed algorithm outperforms baseline approaches and achieves promising results compared to fully-supervised approaches on the SUN-RGBD and KITTI datasets. Furthermore, we show that our Box-to-Point Cloud Fit network improves the performance of fully-supervised approaches on both datasets.


1 Introduction

Figure 1: Illustration of fully-supervised 3D object detection, where 3D box labels of all classes (weak and strong classes) are available, and cross-category semi-supervised 3D object detection, where 3D box labels of the inference classes (weak classes, e.g. Table and Night stand) are not available. Conventional semi-supervised learning (which is in-category) requires strong annotations from the inference classes (3D box labels in the case of 3D object detection), while cross-category semi-supervised learning requires only weak annotations from the inference classes (2D box labels in the case of 3D object detection).

Figure 2: Overall architecture with an RGB-D input (a pair of image and point cloud). (1) Frustum point clouds are extracted from the input point cloud and the 2D object detection boxes on the image. (2) The 3D instance segmentation network takes a frustum point cloud as input and outputs a class-agnostic segmentation used to mask it. (3) The 3D box estimation network predicts an initial 3D box from the masked point cloud. (4) The pretrained BoxPC Fit network refines the initial box into the final box according to the predicted box correction, and predicts the BoxPC fit probability used to supervise the box estimation network.

3D object detection refers to the problem of classification and estimation of a tight oriented 3D bounding box for each object in a 3D scene represented by a point cloud. It plays an essential role in many real-world applications such as home robotics, augmented reality and autonomous driving, where machines must actively interact and engage with their surroundings. In recent years, significant advances in 3D object detection have been achieved with fully supervised deep learning-based approaches

[32, 6, 5, 7, 17, 41, 36, 13, 21]. However, these strongly supervised approaches rely on 3D ground truth datasets [26, 8, 33, 34] that are tedious and time-consuming to label, even with good customised 3D annotation tools such as those in [26, 8]. In contrast, calculations based on [18] show that it can be 3-16 times faster (depending on the annotation tool used) to label 2D than 3D bounding boxes (details in the supplementary material). This observation inspires us to investigate the direction of training a 3D object detector for new object classes from only 2D bounding box labels of these new classes, while simultaneously transferring information from 3D bounding box labels of the existing classes. We hope that this direction of research leads to cost reductions, and improves the efficiency and practicality of learning 3D object detectors in new applications with new classes.

More specifically, the objective of this paper is to train a 3D object detector network with data from two disjoint sets of object classes: a set of classes with both 2D and 3D box labels (strong classes), and another set of classes with only 2D box labels (weak classes). In other words, our goal is to transfer useful 3D information from the strong classes to the weak classes (the new classes of a new 3D object detection application correspond to these weak classes). We refer to this as cross-category semi-supervised 3D object detection, as illustrated in Fig. 1.

To this end, we propose a novel transferable semi-supervised 3D object detection network. Our network uses the state-of-the-art Frustum PointNets [21] as its backbone. We train the backbone network to make class-agnostic segmentations and class-conditioned initial 3D box predictions on the strong classes in a fully-supervised manner. We also supervise the network with the 2D box labels of the weak classes using a relaxed reprojection loss, which corrects 3D box predictions that violate the boundaries specified by the 2D box labels, and a box prior loss that encodes prior knowledge of object sizes. To transfer knowledge from the 3D box labels of the strong classes, we first train a Box-to-Point Cloud (BoxPC) Fit network on 3D box and point cloud pairs to reason about how well 3D boxes fit their corresponding point clouds. More specifically, we propose an effective and differentiable method to encode a 3D box and point cloud pair. Finally, the 3D box predictions on the weak classes are supervised and refined by the BoxPC Fit network.

The contributions of our paper are as follows:

  • We propose a network to perform 3D object detection on weak classes, where only 2D box labels are available. This is achieved using relaxed reprojection and box prior losses on the weak classes, and transferred knowledge from the strong classes.

  • A differentiable BoxPC Fit network is designed to effectively combine a 3D box and its point cloud into a joint BoxPC representation. It is able to supervise and improve the 3D object detector on the weak classes after training on the strong classes.

  • Our transferable semi-supervised 3D object detection model outperforms baseline approaches and achieves promising results compared to fully-supervised methods on both the SUN-RGBD and KITTI datasets. Additionally, we show that our BoxPC Fit network can be used to improve the performance of fully-supervised 3D object detectors.

2 Related Work

3D Object Detection

3D object detection approaches have advanced significantly in recent years [25, 24, 14, 32, 6, 5, 7, 17, 41, 36, 13, 21]. However, most approaches are fully-supervised and highly dependent on 3D box labels that are difficult to obtain. To the best of our knowledge, there are no existing weakly- or semi-supervised 3D object detection approaches. Recently, [29] proposed a self-supervised Augmented Autoencoder that trains on CAD models, removing the need for pose-annotated training data. However, it is unclear whether the method generalizes to general 3D object detection datasets with high intra-class shape variations and noisy depth data.

Weakly- and Semi-Supervised Learning

There is growing interest in weakly- and semi-supervised learning in many problem areas [39, 40, 1, 3, 2, 38, 30, 28] because it is tedious and labor-intensive to label large amounts of data for fully supervised deep learning. There is a rich literature of weakly-supervised [12, 20, 31, 42] and semi-supervised [19, 37, 9, 16] learning approaches for semantic segmentation. Conventional semi-supervised learning requires both strong and weak labels on the inference classes; hence, it remains expensive for applications with new classes. [35, 10, 11] proposed cross-category semi-supervised semantic segmentation, a more general form of semi-supervised segmentation where the model learns from strong labels provided for classes outside of the inference classes. They outperformed weakly- and semi-supervised methods, and showed performance on par with fully-supervised methods. Inspired by the effectiveness of the transferred knowledge, we propose to tackle the same problem in the 3D object detection domain.

3 Problem Formulation

Let $\mathcal{C} = \mathcal{C}_w \cup \mathcal{C}_s$, where $\mathcal{C}_w$ is the set of classes with only 2D box labels (weak classes) and $\mathcal{C}_s$ is the disjoint set of classes with both 2D and 3D box labels (strong classes). We tackle the cross-category semi-supervised 3D object detection problem, where 3D object detection is performed on the object classes in $\mathcal{C}_w$ while training is done on strong labels from $\mathcal{C}_s$ and weak labels from $\mathcal{C}_w$. In 3D object detection with RGB-D data, our goal is to classify and predict amodal 3D bounding boxes for objects in the 3D space. Depth data can be provided by various depth sensors, e.g. RGB-D cameras and LiDAR. Each 3D bounding box $B$ is parameterized by its center $(c_x, c_y, c_z)$, size $(h, w, l)$ and orientation $\theta$ along the vertical axis.

4 Method

4.1 Frustum PointNets

In this section, we briefly discuss Frustum PointNets (FPN), the 3D object detection framework that we adopt as our backbone. Refer to [21] and our supplementary material for more details.

4.1.1 Frustum Proposal

As seen in Fig. 2, the inputs to the network are an image with a 2D box for each object and a 3D point cloud. During inference, we obtain the 2D boxes from a 2D object detector trained on all classes $\mathcal{C}$. We project the point cloud onto the image plane using the camera projection matrix and select only the points that lie in the 2D box of each object. This reduces the search space to the points within a 3D frustum, which we refer to as the frustum point cloud. Next, the variance of the points in each frustum point cloud is reduced by rotating the frustum about the vertical axis of the camera to face the front. Formally, we denote a frustum point cloud by $P = \{p_i\}_{i=1}^{n}$, where $n$ is the number of points and $p_i \in \mathbb{R}^{c}$ (i.e. x, y, z and RGB or reflectance) is the $i$-th point in the point cloud.
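To make the frustum proposal step concrete, below is a minimal NumPy sketch of the extraction described above; the 3x4 projection matrix, the y-up axis convention and the function name are illustrative assumptions, not the authors' implementation.

import numpy as np

def extract_frustum_point_cloud(points, box2d, proj_mat):
    """Select the points that project into a 2D box and rotate the frustum to face front.

    points:   (N, 3+C) array of x, y, z (+ extra features) in camera coordinates.
    box2d:    (4,) array [x1, y1, x2, y2] in pixel coordinates.
    proj_mat: (3, 4) camera projection matrix.
    """
    # Project the 3D points onto the image plane.
    pts_hom = np.hstack([points[:, :3], np.ones((points.shape[0], 1))])   # (N, 4)
    uvw = pts_hom @ proj_mat.T                                            # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:3]                                         # pixel coordinates

    # Keep only the points whose projections lie inside the 2D box (and in front of the camera).
    x1, y1, x2, y2 = box2d
    in_box = (uv[:, 0] >= x1) & (uv[:, 0] <= x2) & \
             (uv[:, 1] >= y1) & (uv[:, 1] <= y2) & (uvw[:, 2] > 0)
    frustum = points[in_box].copy()

    # Rotate the frustum about the camera's vertical (y) axis so that it faces the front,
    # which reduces the variance of point coordinates across frustums.
    u_c, v_c = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    ray = np.linalg.lstsq(proj_mat[:, :3], np.array([u_c, v_c, 1.0]), rcond=None)[0]
    angle = np.arctan2(ray[0], ray[2])            # angle of the frustum centre ray around y
    c, s = np.cos(-angle), np.sin(-angle)
    rot_y = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    frustum[:, :3] = frustum[:, :3] @ rot_y.T
    return frustum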

4.1.2 3D Instance Segmentation

The frustum point cloud contains foreground points of the object we are interested in and background points from the surroundings. To isolate the foreground object and simplify the subsequent 3D box estimation task, the frustum point cloud is fed into a 3D instance segmentation network, which predicts a class-agnostic foreground probability for each point. These probabilities are used to mask $P$ and retrieve the points with high values, giving a masked point cloud. The segmentation network is trained by minimizing

$L_{seg} = \frac{1}{n} \sum_{i=1}^{n} L_{bce}(m_i, m_i^{*})$   (1)

on $\mathcal{C}_s$, where $L_{bce}$ is the point-wise binary cross-entropy loss, $m_i$ is the predicted foreground probability of point $p_i$ and $m_i^{*}$ is the ground truth mask.

Figure 3: Illustration of 3D bounding box and point cloud pairs with a good BoxPC fit (center) and a bad BoxPC fit (right). The ground truth 3D bounding box of the chair is given on the left.
Figure 4: Different feature representations for a pair of 3D bounding box and point cloud. On the left, the inputs are processed independently and concatenated in the feature space. On the right, the inputs are combined in the input space and processed as a whole.

4.1.3 3D Box Estimation

The 3D box estimation network predicts an initial 3D box $B$ of the object given the masked point cloud and a one hot class vector from the 2D detector. Specifically, we compute the centroid of the masked point cloud and use it to translate the masked point cloud to the origin to reduce translational variance. Next, the translated point cloud is given to two PointNet-based [22] networks that predict the initial 3D box $B$. The center is directly regressed by the network, while an anchor-based approach is used to predict the size and rotation: there are pre-defined size and rotation anchors, and the network predicts anchor scores together with size and rotation residuals. The network outputs are trained by minimizing

$L_{box} = L_{c1} + L_{c2} + L_{\theta\text{-}cls} + L_{\theta\text{-}reg} + L_{s\text{-}cls} + L_{s\text{-}reg} + L_{corner}$   (2)

on $\mathcal{C}_s$. $L_{c1}$ and $L_{c2}$ are regression losses for the box center predictions, $L_{\theta\text{-}cls}$ and $L_{\theta\text{-}reg}$ are the rotation anchor classification and regression losses, $L_{s\text{-}cls}$ and $L_{s\text{-}reg}$ are the size anchor classification and regression losses, and $L_{corner}$ is the regression loss for the 8 vertices of the 3D box. Note that each loss term in $L_{box}$ is weighted by a coefficient, which we omit in Eq. 2 and all subsequent losses for brevity. Up to this point, the network is fully supervised by 3D box labels available only for classes $\mathcal{C}_s$. In the subsequent sections, we describe our contributions to train on the weak classes in $\mathcal{C}_w$ by leveraging the BoxPC Fit Network (Sec. 4.2) and weak losses (Sec. 4.3).
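To make the anchor-based parameterization concrete, the sketch below decodes raw network outputs into a 3D box by selecting the highest-scoring size and heading anchors and adding their residuals. The anchor tables, the output layout and the function name are illustrative assumptions, not the exact values used by FPN.

import numpy as np

# Illustrative anchor tables: mean sizes (h, w, l) and equally-spaced heading bins.
SIZE_ANCHORS = np.array([[0.8, 0.9, 0.9],
                         [1.0, 1.6, 2.0],
                         [0.6, 0.5, 0.5]])
NR = 12
ROT_ANCHORS = np.arange(NR) * (2 * np.pi / NR)

def decode_box(center_delta, size_scores, size_residuals, rot_scores, rot_residuals, centroid):
    """Convert network outputs into a 3D box (center, size, heading).

    center_delta:   (3,)    regressed offset from the masked point cloud centroid.
    size_scores:    (NS,)   classification scores over size anchors.
    size_residuals: (NS, 3) residuals for each size anchor.
    rot_scores:     (NR,)   classification scores over heading anchors.
    rot_residuals:  (NR,)   residuals for each heading anchor.
    centroid:       (3,)    centroid used to translate the point cloud to the origin.
    """
    center = centroid + center_delta              # undo the translation to the origin
    s = np.argmax(size_scores)
    size = SIZE_ANCHORS[s] + size_residuals[s]    # anchor size + residual
    r = np.argmax(rot_scores)
    heading = ROT_ANCHORS[r] + rot_residuals[r]   # anchor angle + residual
    return center, size, heading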

4.2 Box to Point Cloud (BoxPC) Fit Network

The initial 3D box predictions on classes $\mathcal{C}_w$ are not reliable because there are no 3D box labels for these classes. As a result, the 3D box surfaces of the initial predictions for classes $\mathcal{C}_w$ are likely to cut the frustum point cloud at unnatural places, as illustrated on the right of Fig. 3. We utilize our BoxPC Fit network $F$ with a novel training method to transfer the knowledge of a good BoxPC fit from the 3D box labels of classes $\mathcal{C}_s$ to classes $\mathcal{C}_w$. The input to $F$ is a pair of a 3D box $B$ and a frustum point cloud $P$. The outputs of $F$ are the BoxPC fit probability, i.e. the goodness-of-fit

$f = F_{cls}(B, P)$   (3)

between $B$ and the object in $P$, and the correction

$\Delta B = F_{reg}(B, P)$   (4)

required on $B$ to improve the fit between $B$ and the object in $P$. $f$ is high when $B$ encloses the object in $P$ tightly, i.e. there is a high overlap between $B$ and the 3D box label.

A pretrained $F$ (see Sec. 4.2.1 for the training procedure) is used to supervise and improve 3D box predictions for classes $\mathcal{C}_w$. Specifically, we train the box estimation network to make better initial 3D box predictions by maximizing $f$. By minimizing the loss

$L_{fit} = -\log F_{cls}(B, P),$   (5)

i.e. maximizing $f$, our network learns to predict boxes $B$ that fit the objects in their respective point clouds well. Finally, we obtain the final 3D box prediction by correcting the initial 3D box prediction with $\Delta B$, i.e. $B^{*} = B + \Delta B$.
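The way a frozen, pretrained BoxPC Fit network could supervise and refine the box estimator is sketched below in PyTorch. The interface boxpc_net(box, pc) returning the fit probability and the correction, and the tensor layouts, are assumptions for illustration rather than the authors' API.

import torch

def weak_fit_loss_and_refine(boxpc_net, pred_box, frustum_pc):
    """Supervise and refine initial boxes B on the weak classes with a frozen BoxPC Fit net.

    boxpc_net:  pretrained BoxPC Fit network; boxpc_net(box, pc) -> (fit_prob, delta_box).
    pred_box:   (B, 7) initial boxes (cx, cy, cz, h, w, l, theta) from the box estimator.
    frustum_pc: (B, N, C) frustum point clouds.
    """
    for p in boxpc_net.parameters():      # keep the pretrained network fixed
        p.requires_grad_(False)

    fit_prob, delta_box = boxpc_net(pred_box, frustum_pc)

    # Eq. 5: maximize the fit probability f by minimizing -log f. Gradients flow
    # through the frozen BoxPC net into the box estimator's predictions.
    loss_fit = -(fit_prob.clamp_min(1e-6)).log().mean()

    # Final prediction B* = B + dB (the correction is applied here, not trained).
    refined_box = pred_box + delta_box.detach()
    return loss_fit, refined_box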

4.2.1 Pretraining BoxPC Fit Network

We train the BoxPC Fit network on the classes in $\mathcal{C}_s$ with the 3D box labels $B^{gt}$ and their corresponding point clouds $P$. We sample perturbations $\delta$ of varying degree on the 3D box labels to get perturbed 3D boxes $\tilde{B} = B^{gt} + \delta$. We define two sets of perturbations, $\Delta_g$ and $\Delta_b$, which are sets of small and large perturbations, respectively. Formally, $\Delta_g = \{\delta \mid u_g \le \mathrm{IoU}_{3D}(B^{gt} + \delta,\, B^{gt}) \le v_g\}$, where $\mathrm{IoU}_{3D}$ refers to the 3D Intersection-over-Union. $\Delta_b$ is defined similarly but with $u_b$ and $v_b$ as the bounds. The bounds are set such that the 3D boxes perturbed by $\Delta_g$ have high overlaps with, or a good fit to, their point clouds, and vice versa for $\Delta_b$. The loss function to train $F$ is given by

$L_{boxpc} = L_{boxpc\text{-}cls} + L_{boxpc\text{-}reg},$   (6)

where $L_{boxpc\text{-}cls}$ is the classification loss

$L_{boxpc\text{-}cls} = L_{bce}\big(F_{cls}(\tilde{B}, P),\ \mathbb{1}[\delta \in \Delta_g]\big)$   (7)

to predict the fit of a perturbed 3D box and its frustum point cloud, and $L_{boxpc\text{-}reg}$ is the regression loss

$L_{boxpc\text{-}reg} = L_{sl1}\big(F_{reg}(\tilde{B}, P),\ B^{gt} - \tilde{B}\big)$   (8)

to predict the perturbation required to correct the perturbed 3D box. $L_{bce}$ is the binary cross-entropy and $L_{sl1}$ is the Smooth L1 loss, with the correction $B^{gt} - \tilde{B} = -\delta$ as the regression target.
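A minimal PyTorch sketch of the pretraining objective in Eq. 6-8, assuming the perturbed boxes have already been sampled and labeled as good or bad fits (see Sec. D.1); the function and tensor names are illustrative.

import torch
import torch.nn.functional as F

def boxpc_pretrain_loss(boxpc_net, perturbed_box, frustum_pc, is_good_fit, correction_target):
    """Classification + regression loss for pretraining the BoxPC Fit network (Eq. 6-8).

    perturbed_box:     (B, 7) ground truth boxes perturbed by sampled deltas.
    frustum_pc:        (B, N, C) corresponding frustum point clouds.
    is_good_fit:       (B,) 1.0 if the perturbation came from the small-perturbation set, else 0.0.
    correction_target: (B, 7) correction that maps each perturbed box back to its ground truth box.
    """
    fit_prob, delta_box = boxpc_net(perturbed_box, frustum_pc)

    # Eq. 7: binary cross-entropy on whether the box and point cloud pair is a good fit.
    loss_cls = F.binary_cross_entropy(fit_prob, is_good_fit)

    # Eq. 8: Smooth L1 on the predicted correction towards the ground truth box.
    loss_reg = F.smooth_l1_loss(delta_box, correction_target)

    return loss_cls + loss_reg   # Eq. 6 (per-term weights omitted)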

4.2.2 BoxPC Feature Representations

A possible way to encode the feature representations for the BoxPC network is illustrated on the left of Fig. 4. The features of the input 3D box and frustum point cloud are learned using several MLPs and PointNets [22], respectively. These features are then concatenated and fed into several layers of MLPs to get the BoxPC fit probability $f$ and the box correction term $\Delta B$. However, this neglects the intricate information of how the 3D box surfaces cut the point cloud, leading to poor performance as shown in our ablation studies in Sec. 6.3. Hence, we propose another method to combine the 3D box and point cloud in the input space in a differentiable manner that also exploits the relationship between them. Specifically, we use the box center to translate the point cloud to the origin to enhance translational invariance to the center positions. Next, we use the box size and rotation to form six planes representing each of the six box surfaces. We compute a feature vector of the perpendicular distance (with direction) from each point to each of the six planes. This feature vector is concatenated with the original frustum point cloud to give the box-combined frustum point cloud, as illustrated on the right of Fig. 4. This representation allows the network to easily reason about regions where the box surfaces cut the point cloud, since the perpendicular distance will be close to 0. Additionally, it is possible to perform feature learning on the 3D box and the point cloud jointly with a single PointNet-based [22] network, improving performance.
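A sketch of this box-combined representation: each point is expressed in the box's canonical frame and its signed distances to the six box faces are appended as extra per-point features (NumPy; the y-up axis convention and box parameterization are assumptions).

import numpy as np

def box_combined_point_cloud(points, center, size, heading):
    """Append per-point signed distances to the six box surfaces (right of Fig. 4).

    points:  (N, 3+C) frustum point cloud (xyz + extra features).
    center:  (3,) box center; size: (3,) as (h, w, l); heading: rotation about the vertical y axis.
    """
    h, w, l = size
    # Translate to the box center and rotate into the box's canonical frame.
    xyz = points[:, :3] - center
    c, s = np.cos(-heading), np.sin(-heading)
    rot_y = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    xyz = xyz @ rot_y.T

    # Signed perpendicular distance to each of the six face planes
    # (x = +-l/2, y = +-h/2, z = +-w/2); a value near 0 means a face cuts through the point.
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    dists = np.stack([l / 2 - x, x + l / 2,
                      h / 2 - y, y + h / 2,
                      w / 2 - z, z + w / 2], axis=1)           # (N, 6)

    # Concatenate with the original frustum point cloud features.
    return np.concatenate([points, dists], axis=1)             # (N, 3+C+6)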

4.3 Weak Losses

We supervise the initial 3D box predictions with 2D box labels and additional priors.

Figure 5: Illustration of the relaxed reprojection loss.
bathtub bed toilet chair desk dresser nstand sofa table bkshelf mAP
Fully-Supervised:
DSS [27] 44.2 78.8 78.9 61.2 20.5 6.4 15.4 53.5 50.3 11.9 42.1
COG [24] 58.3 63.7 70.1 62.2 45.2 15.5 27.4 51.0 51.3 31.8 47.6
2D-driven [14] 43.5 64.5 80.4 48.3 27.9 25.9 41.9 50.4 37.0 31.4 45.1
LSS [25] 76.2 73.2 73.7 60.5 34.5 13.5 30.4 60.4 55.4 32.9 51.0
FPN [21] 43.3 81.1 90.9 64.2 24.7 32.0 58.1 61.1 51.1 33.3 54.0
FPN* 46.0 80.3 82.8 58.5 24.2 36.9 47.6 54.9 42.0 41.6 51.5
FPN* + BoxPC Refine 51.5 81.0 85.0 59.0 24.5 38.6 52.2 55.3 43.8 40.9 53.2
CC Semi-Supervised:
FPN* 6.1 69.5 28.1 12.4 18.1 17.5 37.4 25.8 23.5 6.9 24.5
FPN* w/o OneHot 24.4 69.5 30.5 15.9 22.6 19.4 39.0 37.3 29.0 12.8 30.1
Ours + R 29.5 60.9 65.3 36.0 20.2 27.3 50.9 46.4 28.4 6.7 37.2
Ours + BoxPC 28.4 67.9 73.3 32.3 23.3 31.0 50.9 48.9 33.7 16.4 40.6
Ours + BoxPC + R 28.4 68.1 77.9 32.9 23.3 30.6 51.1 49.9 34.7 13.7 41.1
Ours + BoxPC + R + P 30.2 70.7 76.3 33.6 24.0 32.5 52.0 49.8 34.2 14.8 41.8
Table 1: 3D object detection AP on SUN-RGBD val set. Fully-supervised methods are trained on 2D and 3D box labels of all classes, while cross-category (CC) semi-supervised methods are trained on 3D box labels of classes in $\mathcal{C}_s$ and 2D box labels of all classes. BoxPC, R, P refer to the BoxPC Fit network, relaxed reprojection loss and box prior loss, respectively. * refers to our implementation.

4.3.1 Relaxed Reprojection Loss

Despite the correlation, the 2D box that encloses all the points of a projected 3D box label does not coincide with the 2D box label. We show in Sec. 6.3 that performance deteriorates if we simply minimize the reprojection loss between the 2D box label and the 2D box enclosing the projection of the predicted 3D box onto the image plane. Instead, we propose a relaxed reprojection loss, where boundaries close to the 2D box labels are not penalized. Let $b = (x_1, y_1, x_2, y_2)$ (green box in Fig. 5), which consists of the left, top, right and bottom image coordinates, be the 2D box label with the top-left corner of the image as origin. We define an upper bound box $b^{u}$ and a lower bound box $b^{l}$ (outer and inner blue boxes in Fig. 5), where $\epsilon^{u}$ and $\epsilon^{l}$ are vectors used to adjust the size of $b$. Given an initial 3D box prediction $B$, we project it onto the image plane (red box in Fig. 5) and obtain an enclosing 2D box $h(B)$ around the projected points (yellow box in Fig. 5), where $h(\cdot)$ is the function that returns the enclosing 2D box. We penalize $h(B)$ for violating the bounds specified by $b^{u}$ and $b^{l}$. In the example shown in Fig. 5, the bounds on the left and right are not violated because the left and right sides of the 2D box $h(B)$ stay within the bounds specified by $b^{u}$ and $b^{l}$ (blue boxes). However, the parts of the top and bottom of $h(B)$ that go beyond the bounds are penalized. More formally, the relaxed reprojection loss is given by

$L_{reproj} = \sum_{j=1}^{4} L_{rsl1}\big(h(B)_j;\ b^{l}_j,\ b^{u}_j\big),$   (9)

$L_{rsl1}(x;\, l, u) = \begin{cases} 0, & \text{if } \min(l, u) \le x \le \max(l, u), \\ L_{sl1}\big(\min(|x - l|,\, |x - u|)\big), & \text{otherwise}, \end{cases}$   (10)

where $L_{rsl1}$ is a relaxed Smooth L1 loss such that there is no penalty on $x$ if it lies between the two bounds, and $(\cdot)_j$ retrieves the $j$-th component of a 2D box.
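Under our reconstruction of Eq. 9-10, the relaxed reprojection loss can be sketched as follows (PyTorch): the enclosing 2D box of the projected prediction is penalized with a Smooth L1 loss only where it falls outside the band between the inner and outer bound boxes. The margin parameterization is an assumption.

import torch

def smooth_l1(x, beta=1.0):
    absx = x.abs()
    return torch.where(absx < beta, 0.5 * absx ** 2 / beta, absx - 0.5 * beta)

def relaxed_reprojection_loss(pred_box2d, label_box2d, eps_inner, eps_outer):
    """Relaxed reprojection loss on the enclosing 2D box of the projected 3D prediction.

    pred_box2d:  (B, 4) enclosing boxes h(B) = (x1, y1, x2, y2) of the projected 3D predictions.
    label_box2d: (B, 4) 2D box labels b.
    eps_inner, eps_outer: (4,) margins that shrink / enlarge the label box to form b^l and b^u.
    """
    # Outer (upper bound) and inner (lower bound) boxes around the label.
    grow = torch.tensor([-1.0, -1.0, 1.0, 1.0])   # enlarging moves x1, y1 outwards (-) and x2, y2 outwards (+)
    outer = label_box2d + grow * eps_outer
    inner = label_box2d - grow * eps_inner

    lo = torch.minimum(inner, outer)
    hi = torch.maximum(inner, outer)

    # Zero loss inside [lo, hi]; Smooth L1 on the violation outside the bounds.
    violation = torch.clamp(lo - pred_box2d, min=0) + torch.clamp(pred_box2d - hi, min=0)
    return smooth_l1(violation).sum(dim=1).mean()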

4.3.2 Box Prior Loss

We use prior knowledge of the object volume and size to regularize the training loss. More specifically, we add the prior loss

$L_{prior} = L_{vol} + L_{var}$   (11)

to train our network, where

$L_{vol} = \sum_{k \in \mathcal{C}_w} \sum_{i} \max\big(0,\ V_k - \mathrm{vol}(s_i^{k})\big)$   (12)

penalizes predictions with volumes that are below class-specific thresholds derived from prior knowledge about the scale of objects, and

$L_{var} = \sum_{k \in \mathcal{C}_w} \frac{1}{N_k} \sum_{i} \big\| s_i^{k} - \bar{s}^{k} \big\|^{2}$   (13)

penalizes the size variance of the predictions within each class in each minibatch, since there should not be excessive size variations within a class. $s_i^{k}$ are the size predictions from the box estimation network for class $k$, $\bar{s}^{k}$ is the average size prediction in the minibatch for class $k$, $N_k$ is the number of predictions of class $k$ in the minibatch, and $V_k$ is the volume threshold for class $k$.
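A sketch of the box prior loss as reconstructed in Eq. 11-13 (PyTorch): a hinge on the predicted volume against a class-specific threshold, plus a penalty on the within-class size spread in the minibatch. The exact normalization is an assumption.

import torch

def box_prior_loss(pred_sizes, pred_classes, volume_thresholds):
    """Volume-threshold and size-variance priors for weak-class predictions (Eq. 11-13).

    pred_sizes:        (B, 3) predicted (h, w, l) of the boxes in the minibatch.
    pred_classes:      (B,)   class index of each prediction.
    volume_thresholds: dict mapping class index -> minimum plausible volume V_k.
    """
    volumes = pred_sizes.prod(dim=1)                                    # h * w * l
    loss_vol = torch.zeros((), dtype=pred_sizes.dtype, device=pred_sizes.device)
    loss_var = torch.zeros((), dtype=pred_sizes.dtype, device=pred_sizes.device)

    for k, v_k in volume_thresholds.items():
        mask = pred_classes == k
        if mask.sum() == 0:
            continue
        # Eq. 12: penalize volumes below the class-specific threshold.
        loss_vol = loss_vol + torch.clamp(v_k - volumes[mask], min=0).mean()
        # Eq. 13: penalize size variance within the class in this minibatch.
        mean_size = pred_sizes[mask].mean(dim=0, keepdim=True)
        loss_var = loss_var + ((pred_sizes[mask] - mean_size) ** 2).sum(dim=1).mean()

    return loss_vol + loss_var                                          # Eq. 11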

5 Training

We train a Faster RCNN-based [23] 2D object detector with the 2D box labels from all classes $\mathcal{C}$, and train the BoxPC Fit network $F$ by minimizing $L_{boxpc}$ from Eq. 6 with the 3D box labels from classes $\mathcal{C}_s$. Finally, when we train the 3D object detector, we alternate between optimizing the losses for classes in $\mathcal{C}_s$ and classes in $\mathcal{C}_w$. Specifically, when optimizing for $\mathcal{C}_w$, we train the box estimation network to minimize

$L_{weak} = L_{fit} + L_{reproj} + L_{prior},$   (14)

where $L_{fit}$, $L_{reproj}$ and $L_{prior}$ are from Eq. 5, 9 and 11 respectively. When optimizing for $\mathcal{C}_s$, we train the instance segmentation and box estimation networks to minimize $L_{seg}$ and $L_{box}$ from Eq. 1 and 2 respectively. Hence, the loss functions for $\mathcal{C}_w$ and $\mathcal{C}_s$ are respectively given by

$L_{\mathcal{C}_w} = L_{weak},$   (15)
$L_{\mathcal{C}_s} = L_{seg} + L_{box}.$   (16)
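The alternating schedule can be sketched as follows (PyTorch), where the weak-class and strong-class losses of Eq. 15 and 16 are passed in as callables; the even/odd alternation and the optimizer choice are illustrative assumptions.

import itertools
import torch

def train_detector(networks, strong_loader, weak_loader, strong_loss_fn, weak_loss_fn,
                   num_steps, lr=1e-3):
    """Alternate between the strong-class loss (Eq. 16) and the weak-class loss (Eq. 15).

    strong_loss_fn(networks, batch): returns L_seg + L_box for a strong-class minibatch.
    weak_loss_fn(networks, batch):   returns L_fit + L_reproj + L_prior for a weak-class minibatch.
    """
    params = [p for net in networks for p in net.parameters()]
    optimizer = torch.optim.Adam(params, lr=lr)
    strong_iter, weak_iter = itertools.cycle(strong_loader), itertools.cycle(weak_loader)

    for step in range(num_steps):
        optimizer.zero_grad()
        if step % 2 == 0:
            loss = strong_loss_fn(networks, next(strong_iter))   # Eq. 16
        else:
            loss = weak_loss_fn(networks, next(weak_iter))       # Eq. 15
        loss.backward()
        optimizer.step()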

6 Experiments

Method Cars Pedestrians Cyclists
Easy Moderate Hard Easy Moderate Hard Easy Moderate Hard
Fully-Supervised:
FPN [21] 96.69 96.29 88.77 85.40 79.20 71.61 86.59 67.60 63.95
CC Semi-Supervised:
FPN 0.00 0.00 0.00 4.31 4.37 3.74 1.05 0.71 0.71
FPN w/o OneHot 0.08 0.07 0.07 55.66 47.33 41.10 54.35 36.46 34.85
Ours + BoxPC + R 14.71 12.19 11.21 75.13 62.46 57.18 61.25 41.96 40.07
Ours + BoxPC + R + P 69.78 58.66 51.40 76.85 66.92 58.71 63.29 47.54 44.66
Table 2: 3D object detection AP on KITTI val set, evaluated at 3D IoU threshold of 0.25 for all classes.
Method Cars Pedestrians Cyclists
Easy Moderate Hard Easy Moderate Hard Easy Moderate Hard
Fully-Supervised:
VeloFCN [15] 15.20 13.66 15.98 - - - - - -
MV3D [7] 71.29 62.68 56.56 - - - - - -
Voxelnet [41] 89.60 84.81 78.57 65.95 61.05 56.98 74.41 52.18 50.49
FPN [21] 84.40 71.37 63.38 65.80 56.40 49.74 76.11 56.88 53.17
FPN + BoxPC Refine 85.63 72.10 64.25 68.83 59.41 52.08 78.77 58.41 54.52
Table 3: 3D object detection AP on KITTI val set, evaluated at 3D IoU threshold of 0.7 for Cars and 0.5 for Pedestrians and Cyclists.

6.1 Datasets

We evaluate our model on the SUN-RGBD and KITTI benchmarks for 3D object detection. For the SUN-RGBD benchmark, we evaluate on 10 classes as in [21]. To test the performance of cross-category semi-supervised 3D object detection (CS3D), we randomly split the 10 classes into two subsets $\mathcal{A}$ and $\mathcal{B}$ of 5 classes each (the 5 classes on the left and on the right of Tab. 1). First, we let $\mathcal{A}$ be the strong classes and $\mathcal{B}$ be the weak classes by setting $\mathcal{C}_s = \mathcal{A}$ and $\mathcal{C}_w = \mathcal{B}$ to obtain the evaluation results for $\mathcal{B}$. Next, we reverse $\mathcal{C}_s$ and $\mathcal{C}_w$ to obtain the evaluation results for $\mathcal{A}$. We follow the same train/val split and performance measure of average precision (AP) with 3D Intersection-over-Union (IoU) of 0.25 as in [21].

The KITTI benchmark evaluates on the Cars, Pedestrians and Cyclists classes. To test the CS3D setting, we set any 2 classes to $\mathcal{C}_s$ and the remaining class to $\mathcal{C}_w$ to obtain the evaluation results on $\mathcal{C}_w$. For example, to obtain the evaluation results on Cyclists, we set $\mathcal{C}_s$ = {Cars, Pedestrians} and $\mathcal{C}_w$ = {Cyclists}. We follow the same train/val split as in [21] and measure performance using AP with 3D IoU of 0.25 for all classes. For the fully-supervised setting, we use 3D IoU of 0.7 for Cars and 0.5 for Pedestrians and Cyclists.

6.2 Comparison with Baselines

To the best of our knowledge, there are no other methods that demonstrate weakly- or semi-supervised 3D object detection; therefore, we design baselines (details in the supplementary material) with the state-of-the-art FPN [21], which trains on strong labels from $\mathcal{C}_s$ and tests on $\mathcal{C}_w$. Specifically, we use the original network (“FPN*”) and a network without the one hot class vector (“FPN* w/o OneHot”). The latter performs better since there are no 3D box labels for classes $\mathcal{C}_w$. Finally, we train fully-supervised BoxPC Fit networks to further improve the fully-supervised FPN [21].

SUN-RGBD

In Tab. 1, BoxPC, R, P refer to the usage of the proposed BoxPC Fit network, relaxed reprojection loss and box prior loss. The fully-supervised “FPN*” serves as an upper-bound performance for the semi-supervised methods. The semi-supervised baselines “FPN*” and “FPN* w/o OneHot” perform poorly because they are not able to reason about the weak classes and predict boxes that are only roughly correct. Adding R and BoxPC allows the network to reason about its 3D box predictions on the weak classes, improving performance significantly from 30.1% to 41.1%. Finally, the addition of the prior knowledge P allows our model to achieve a good performance of 41.8% mAP, which is 81.2% of the fully-supervised “FPN*” (vs 58.4% for the baseline “FPN* w/o OneHot”). We consider this a promising result for semi-supervised methods, and it demonstrates the effectiveness of our proposed model in transferring knowledge to unseen classes. In addition, we show the effectiveness of the BoxPC Fit network in fully-supervised settings by training the BoxPC Fit network on all classes and using it to refine the 3D box predictions of the state-of-the-art FPN [21]. As seen in Tab. 1, “FPN* + BoxPC Refine” outperforms the vanilla fully-supervised “FPN*” in every class except bookshelf, showing the usefulness of the BoxPC Fit network even in fully-supervised settings.

KITTI

In Tab. 2, the baselines are unable to achieve any meaningful performance on Cars when trained on Pedestrians and Cyclists because of the huge difference in object sizes: the network simply assumes a Car instance is small and makes small predictions. With the addition of a volume prior, it becomes possible to make predictions on Cars. We observe that adding “BoxPC + R + P” significantly improves performance over the baseline, from a mean AP of 0.07% to 59.95% for Cars. It also improves the baseline’s performance from 48.03% to 67.49% for Pedestrians and from 41.89% to 51.83% for Cyclists. Similarly, we show that a fully-supervised BoxPC Fit network is able to refine and improve the 3D box predictions made by “FPN” in Tab. 3. In this case, “FPN + BoxPC Refine” improves AP for all classes at all difficulty levels over the original “FPN”.

6.3 Ablation Studies

bathtub bed toilet chair desk dresser nstand sofa table bkshelf mAP
17.0 11.9 60.1 24.3 11.2 24.0 46.8 30.3 19.2 11.3 25.6
29.5 60.9 65.3 36.0 20.2 27.3 50.9 46.4 28.4 6.7 37.2
16.9 23.4 55.8 35.0 19.6 25.1 49.2 48.3 31.3 3.5 30.8
Table 4: 3D object detection AP on SUN-RGBD val set with varying lower and upper bound boxes $(b^{l}, b^{u})$ for the relaxed reprojection loss function.
Features Cls Reg bathtub bed toilet chair desk dresser nstand sofa table bkshelf mAP
Combined ✓ ✗ 25.9 67.3 71.8 30.6 23.6 31.8 48.8 48.1 32.6 17.3 39.8
Combined ✗ ✓ 27.4 69.4 39.8 17.7 23.1 18.5 39.9 41.7 30.2 12.9 32.1
Independent ✓ ✓ 39.6 68.1 24.5 26.8 22.3 34.4 38.8 46.7 30.3 12.3 34.4
Combined ✓ ✓ 28.4 67.9 73.3 32.3 23.3 31.0 50.9 48.9 33.7 16.4 40.6
Table 5: 3D object detection AP on SUN-RGBD val set when BoxPC Fit Network is trained on different representations and objectives.
Figure 6: Qualitative comparisons between the baseline, fully-supervised and proposed models on the SUN-RGBD val set.
Relaxed Reprojection Loss

To illustrate the usefulness of using a relaxed version of the reprojection loss, we vary the lower and upper bound boxes $b^{l}$ and $b^{u}$. We set them to different scales of the 2D box label $b$, i.e. a scale of 1.5 refers to a 2D box centered at the same point as $b$ but with 1.5 times its size. By setting $b^{l} = b^{u} = b$, the loss reverts from the relaxed version to the direct version of the reprojection loss. We train with similar settings as “Ours + R” in Tab. 1, and show in Tab. 4 that mean AP improves from 25.6% (direct reprojection loss) to 37.2% (relaxed reprojection loss).

BoxPC Fit Network Objectives

The BoxPC Fit network is trained on classification and regression objectives derived from the 3D box labels. To show the effectiveness of having both objectives, we adopt different settings in which only the classification or only the regression objective is active. We train with similar settings as “Ours + BoxPC” in Tab. 1. Without the classification objective for the BoxPC Fit network, the 3D object detector is unable to maximize the fit probability $f$ during training. Omission of the regression objective prevents the initial box predictions $B$ from being refined further by $\Delta B$ into the final box predictions $B^{*}$. From Tab. 5, we see that training without the classification or the regression objective causes performance to drop from 40.6% to 32.1% or 39.8%, respectively.

BoxPC Representation

To demonstrate the importance of a joint Box to Point Cloud input representation as seen on the right of Fig. 4, we train on the “Ours + BoxPC” setting in Tab. 1 with different BoxPC representations. As shown in Tab. 5, the performance of combined learning of BoxPC features exceeds the performance of independent learning of BoxPC features by 6.2%.

6.4 Qualitative Results

We visualize some of the predictions by the baseline, fully-supervised and proposed models in Fig. 6. We used 2D box labels for frustum proposals to reduce noise due to 2D detections. We observe that the proposed model’s predictions are closer to the fully-supervised model than the baseline’s. The baseline’s predictions tend to be inaccurate since it is not trained to fit the objects from weak classes.

7 Conclusion

We propose a transferable semi-supervised model that is able to perform 3D object detection on weak classes with only 2D box labels. We achieve strong performance over the baseline approaches, and in the fully-supervised setting, we improve the performance of existing detectors. In conclusion, our method improves the practicality and usefulness of 3D object detectors in new applications.

References

  • [1] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. Netvlad: Cnn architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5297–5307, 2016.
  • [2] Y. Cai, L. Ge, J. Cai, and J. Yuan. Weakly-supervised 3d hand pose estimation from monocular rgb images. In European Conference on Computer Vision (ECCV), 2018.
  • [3] L.-C. Chen, S. Fidler, A. L. Yuille, and R. Urtasun. Beat the mturkers: Automatic image labeling from weak 3d supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3198–3205, 2014.
  • [4] X. Chen and A. Gupta. An implementation of faster rcnn with study for region sampling. arXiv preprint arXiv:1702.02138, 2017.
  • [5] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun. Monocular 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2147–2156, 2016.
  • [6] X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun. 3d object proposals for accurate object class detection. In Advances in Neural Information Processing Systems, pages 424–432, 2015.
  • [7] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-view 3d object detection network for autonomous driving. In IEEE CVPR, volume 1, page 3, 2017.
  • [8] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3354–3361. IEEE, 2012.
  • [9] S. Hong, H. Noh, and B. Han. Decoupled deep neural network for semi-supervised semantic segmentation. In Advances in Neural Information Processing Systems, pages 1495–1503, 2015.
  • [10] S. Hong, J. Oh, H. Lee, and B. Han. Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3204–3212, 2016.
  • [11] R. Hu, P. Dollár, K. He, T. Darrell, and R. Girshick. Learning to segment every thing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4233–4241, 2018.
  • [12] A. Kolesnikov and C. H. Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In European Conference on Computer Vision (ECCV), pages 695–711. Springer, 2016.
  • [13] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. Waslander. Joint 3d proposal generation and object detection from view aggregation. arXiv preprint arXiv:1712.02294, 2017.
  • [14] J. Lahoud and B. Ghanem. 2d-driven 3d object detection in rgb-d images. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 4632–4640. IEEE, 2017.
  • [15] B. Li. 3d fully convolutional network for vehicle detection in point cloud. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 1513–1518. IEEE, 2017.
  • [16] Q. Li, A. Arnab, and P. H. Torr. Weakly- and semi-supervised panoptic segmentation. In European Conference on Computer Vision (ECCV), 2018.
  • [17] A. Mousavian, D. Anguelov, J. Flynn, and J. Košecká. 3d bounding box estimation using deep learning and geometry. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5632–5640. IEEE, 2017.
  • [18] D. P. Papadopoulos, J. R. Uijlings, F. Keller, and V. Ferrari. Extreme clicking for efficient object annotation. In Proceedings of the ICCV, pages 4940–4949. IEEE, 2017.
  • [19] G. Papandreou, L.-C. Chen, K. P. Murphy, and A. L. Yuille. Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1742–1750, 2015.
  • [20] D. Pathak, P. Krahenbuhl, and T. Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1796–1804, 2015.
  • [21] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 918–927, 2018.
  • [22] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.
  • [23] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [24] Z. Ren and E. B. Sudderth. Three-dimensional object detection and layout prediction using clouds of oriented gradients. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1525–1533, 2016.
  • [25] Z. Ren and E. B. Sudderth. 3d object detection with latent support surfaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 937–946, 2018.
  • [26] S. Song, S. P. Lichtenberg, and J. Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 567–576, 2015.
  • [27] S. Song and J. Xiao. Deep sliding shapes for amodal 3d object detection in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 808–816, 2016.
  • [28] D. Stutz and A. Geiger. Learning 3d shape completion from laser scan data with weak supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1955–1964, 2018.
  • [29] M. Sundermeyer, Z.-C. Marton, M. Durner, M. Brucker, and R. Triebel. Implicit 3d orientation learning for 6d object detection from rgb images. In European Conference on Computer Vision (ECCV), pages 712–729. Springer, 2018.
  • [30] P. Tang, X. Wang, A. Wang, Y. Yan, W. Liu, J. Huang, and A. Yuille. Weakly supervised region proposal network and object detection. In European Conference on Computer Vision (ECCV), pages 352–368, 2018.
  • [31] P. Vernaza and M. Chandraker. Learning random-walk label propagation for weakly-supervised semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 3, 2017.
  • [32] D. Z. Wang and I. Posner. Voting for voting in online point cloud object detection. In Robotics: Science and Systems, volume 1, page 5, 2015.
  • [33] Y. Xiang, W. Kim, W. Chen, J. Ji, C. Choy, H. Su, R. Mottaghi, L. Guibas, and S. Savarese. Objectnet3d: A large scale database for 3d object recognition. In European Conference Computer Vision (ECCV), 2016.
  • [34] Y. Xiang, R. Mottaghi, and S. Savarese. Beyond pascal: A benchmark for 3d object detection in the wild. In Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on, pages 75–82. IEEE, 2014.
  • [35] H. Xiao, Y. Wei, Y. Liu, M. Zhang, and J. Feng. Transferable semi-supervised semantic segmentation. In Association for the Advancement of Artificial Intelligence, 2017.
  • [36] D. Xu, D. Anguelov, and A. Jain. Pointfusion: Deep sensor fusion for 3d bounding box estimation. arXiv preprint arXiv:1711.10871, 2017.
  • [37] J. Xu, A. G. Schwing, and R. Urtasun. Learning to segment under various forms of weak supervision. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 3781–3790. IEEE, 2015.
  • [38] Z. J. Yew and G. H. Lee. 3dfeat-net: Weakly supervised local 3d features for point cloud registration. In European Conference on Computer Vision (ECCV), 2018.
  • [39] D. Zhang, J. Han, Y. Yang, and D. Huang. Learning category-specific 3d shape models from weakly labeled 2d images. In Proc. CVPR, pages 4573–4581, 2017.
  • [40] X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei. Towards 3d human pose estimation in the wild: a weakly-supervised approach. In IEEE International Conference on Computer Vision, 2017.
  • [41] Y. Zhou and O. Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. arXiv preprint arXiv:1711.06396, 2017.
  • [42] Y. Zhou, Y. Zhu, Q. Ye, Q. Qiu, and J. Jiao. Weakly supervised instance segmentation using class peak response. In Proceedings of the IEEE international conference on computer vision, pages 3791–3800, 2018.

Supplementary Materials

In Sec. A, we discuss the costs of different types of labels. In Sec. B, we observe the performance with varying amounts of 3D box labels for the weak classes $\mathcal{C}_w$. In Sec. C and Sec. D, we elaborate on the details of the networks and the training procedures, respectively. In Sec. E, we provide additional qualitative results and figures on the SUN-RGBD dataset and in Sec. F, qualitative results for the KITTI dataset.


A Time Costs of Labels

The SUN-RGBD dataset [26] required 2,051 hours to label 64,595 3D bboxes, which is an average of 114s per object. In contrast, the average time to label a 2D bbox according to [18] is 35s per object but can be as fast as 7s when using their proposed labeling method with no loss of accuracy. Hence, it is potentially 3-16 times faster to label 2D compared to 3D bboxes.
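For reference, the quoted figures work out as follows:

\[
\frac{2051 \times 3600\ \mathrm{s}}{64{,}595\ \text{boxes}} \approx 114\ \mathrm{s/box},
\qquad
\frac{114\ \mathrm{s}}{35\ \mathrm{s}} \approx 3.3,
\qquad
\frac{114\ \mathrm{s}}{7\ \mathrm{s}} \approx 16.3,
\]

which gives the quoted 3-16x range.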


B In-Category Semi-supervision Performance

In the main paper, we assumed that there are no 3D box labels for the weak classes $\mathcal{C}_w$, which is a cross-category semi-supervised setting. In Fig. 7, we train our model and the baseline “FPN*” on varying amounts of 3D box labels for $\mathcal{C}_w$ to understand the performance of our model in an in-category semi-supervised setting. When the percentage of 3D box labels used for $\mathcal{C}_w$ is 100%, the semi-supervised baseline “FPN*” (green line) becomes the fully-supervised “FPN*” (51.5% mAP) and our proposed semi-supervised method (blue line) becomes the fully-supervised “FPN* + BoxPC Refine” (53.2% mAP).

We observe that our proposed method always performs better than the baseline for different percentages of 3D box labels of the weak classes. This demonstrates the usefulness of our method even when 3D box labels are available for the weak classes. Additionally, we note that the baseline with 50% of the 3D box labels available for $\mathcal{C}_w$ achieves a similar performance to our proposed method with 0% of the 3D box labels for $\mathcal{C}_w$, which demonstrates the effectiveness of the knowledge that has been transferred from the strong classes $\mathcal{C}_s$.

Figure 7: 3D object detection mAP on SUN-RGBD val set of our proposed model with different percentages of 3D box labels for $\mathcal{C}_w$. The blue and green lines correspond to the “Ours + BoxPC + R + P” and “FPN*” semi-supervised settings, respectively.

C Network Details

C.1 Network Architecture of Baselines

Fig. 8 shows the network architecture of the original Frustum PointNets (FPN) [21]. The same architecture is used to obtain the results for “FPN” and the baseline “FPN*” in Tab. 1 of the main paper. We remove the one hot class vectors given as features to the instance segmentation and box estimation networks to get the stronger baseline “FPN* w/o OneHot”. The performance improves because, in the cross-category semi-supervised setting, the network does not train on the strong labels of the inference classes; hence, having class information does not help the network during inference.

Figure 8: Network architecture of the original Frustum PointNets for fully-supervised 3D object detection. This network corresponds to “FPN” and the baseline “FPN*” in Tab. 1 of the main paper. To obtain the stronger baseline “FPN* w/o OneHot”, we remove the one hot class vectors that are given as features to the instance segmentation and box estimation networks.
Figure 9: Network component details for our baseline and proposed models that are used in cross-category semi-supervised 3D object detection. In fully-supervised setting, a one hot class vector is added to the “Global Features” of the Instance Segmentation Network and the “BoxPC Features” of the BoxPC Fit Network.
Figure 10: Precision recall (PR) curves for 3D object detection on SUN-RGBD val set with different methods. “FPN (Fully-supervised)” is the fully-supervised model that gives an upper-bound performance for the rest of the models which are cross-category semi-supervised.

C.2 Network Component Details

In this section, we describe the details of the network components in the baseline and proposed models for the cross-category semi-supervised 3D object detection (CS3D) setting. The fully-supervised 3D object detection (FS3D) setting uses the same components except for minor changes.

The network components are given in Fig. 9. We use the v1 instance segmentation and v1 box estimation networks in FPN [21] as the instance segmentation and box estimation networks in our baseline and proposed models. The BoxPC Fit networks also use PointNet [22] and MLP layers to learn the BoxPC features.

In the CS3D setting, we remove the one hot class vector that is originally concatenated with the “Global Features” of the v1 instance segmentation network because we are performing class-agnostic instance segmentations. In the FS3D setting, the one hot class vector is added back.

The v1 box estimation network is composed of the T-Net and Box Estimation PointNet. The T-Net gives an initial prediction for the center of the box which is used to translate the input point cloud to reduce translational variance. Then, the Box Estimation PointNet makes the box predictions using the translated point cloud. Both CS3D and FS3D settings use the same box estimation networks. The details of the loss functions to train the box estimation networks can be found in [21].

In the CS3D setting, the BoxPC Fit network learns class-agnostic BoxPC Fit between 3D boxes and point clouds. In the FS3D setting, we concatenate a one hot class vector to the “BoxPC Features” to allow the network to learn BoxPC Fit that is specific to each class.

D Training Details

D.1 Training of BoxPC Fit Network

As discussed in the paper, we have to sample from the two sets of perturbations $\Delta_g$ and $\Delta_b$ to train the BoxPC Fit Network to understand what a good BoxPC fit is. To sample a perturbation from either $\Delta_g$ or $\Delta_b$, we uniformly sample center perturbations $\delta_c$, size perturbations $\delta_s$ and rotation perturbations $\delta_\theta$. We perturb a 3D box label $B^{gt}$ with a sampled perturbation $\delta = (\delta_c, \delta_s, \delta_\theta)$ to obtain a perturbed 3D box $\tilde{B}$. Next, we check whether $\tilde{B}$ has a 3D IoU with $B^{gt}$ that is within the range specified by the set $\Delta_g$ or $\Delta_b$, i.e. within $[u_g, v_g]$ or $[u_b, v_b]$, respectively. We accept $(\tilde{B}, P)$ and use it as a single input if it satisfies the IoU range of the set. We repeat the process until we have enough samples for a minibatch. Specifically, each minibatch has an equal number of samples from $\Delta_g$ and $\Delta_b$.
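A sketch of this rejection-sampling procedure (NumPy), assuming a (cx, cy, cz, h, w, l, theta) box parameterization and an available 3D IoU routine; the sampling ranges and the IoU bounds in the usage comment are placeholders, not the values used in our experiments.

import numpy as np

def sample_perturbed_box(gt_box, iou_range, iou3d_fn, max_tries=100,
                         center_std=0.3, size_std=0.2, angle_std=np.pi / 12):
    """Sample a perturbed box whose 3D IoU with the ground truth lies in iou_range.

    gt_box:    (7,) ground truth box (cx, cy, cz, h, w, l, theta).
    iou_range: (lo, hi) IoU bounds of the perturbation set (good or bad fits).
    iou3d_fn:  function(box_a, box_b) -> 3D IoU; assumed to be available.
    """
    lo, hi = iou_range
    for _ in range(max_tries):
        delta_center = np.random.uniform(-center_std, center_std, size=3)
        delta_size = np.random.uniform(-size_std, size_std, size=3)
        delta_angle = np.random.uniform(-angle_std, angle_std)
        delta = np.concatenate([delta_center, delta_size, [delta_angle]])
        perturbed = gt_box + delta
        if lo <= iou3d_fn(perturbed, gt_box) <= hi:
            return perturbed, delta                 # accepted sample and its perturbation
    return gt_box.copy(), np.zeros(7)               # fall back to the unperturbed box

# A minibatch with an equal number of samples from the good-fit and bad-fit sets
# (placeholder IoU bounds):
# good = [sample_perturbed_box(b, (0.7, 1.0), iou3d) for b in gt_boxes[:batch // 2]]
# bad  = [sample_perturbed_box(b, (0.0, 0.25), iou3d) for b in gt_boxes[batch // 2:]]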

D.2 Cross-category Semi-supervised Learning

Let $\mathcal{A}$ = {bathtub, bed, toilet, chair, desk} and $\mathcal{B}$ = {dresser, nightstand, sofa, table, bookshelf}. We train with $\mathcal{C}_s = \mathcal{A}$ and $\mathcal{C}_w = \mathcal{B}$, i.e. we train on the 2D and 3D box labels of $\mathcal{A}$ and the 2D box labels of $\mathcal{B}$, to obtain the evaluation results for CS3D on $\mathcal{B}$. We train with $\mathcal{C}_s = \mathcal{B}$ and $\mathcal{C}_w = \mathcal{A}$ to get the evaluation results for CS3D on $\mathcal{A}$.

SUN-RGBD

For our 2D object detector, we train a Faster RCNN [23, 4] network on all classes . When we train the BoxPC Fit network on classes , we set the perturbation parameters to . The loss weights for the BoxPC Fit network are set to . When we train the 3D object detector, we set the lower and upper bound boxes for the relaxed reprojection loss to and volume threshold for all classes to . The loss weights for the 3D detector are .

KITTI

For our 2D object detector, we use the released detections from FPN [21] for fair comparisons with FPN. We set the perturbation parameters to when we train the BoxPC Fit network on classes . The loss weights of the BoxPC Fit network are set to . We set the lower and upper bound boxes for the relaxed reprojection loss to for Pedestrians and Cyclists, and for Cars when we train the 3D object detector on the classes . We set the volume threshold for Cars to and 0 for the other classes. The loss weights for the 3D object detector are set to .

D.3 Fully-supervised Learning

Figure 11: Additional qualitative comparisons between the baseline, fully-supervised and proposed models on the SUN-RGBD val set. The baseline model’s predictions tend to be large and inaccurate because it is unable to understand how to fit objects from the weak classes. The last two examples (bottom right) are difficult scenes for the baseline and proposed models due to heavy occlusions.
Figure 12: Qualitative comparisons between the baseline, fully-supervised and proposed models on the KITTI val set. The proposed model is much closer to the fully-supervised model than the baseline model. The baseline model’s predictions for Cars tend to be inaccurate or excessively small due to the lack of a prior understanding on the scale of Cars.
Figure 13: Additional qualitative comparisons between the baseline, fully-supervised and proposed models on the KITTI val set. The last two examples (bottom right) are difficult scenes for the baseline and proposed models. This is due to heavy occlusions and poor understanding on how to fit Cars using transferred 3D information from Pedestrians and Cyclists, which have very different shapes from Cars.

In the fully-supervised setting, we train and evaluate on all classes $\mathcal{C}$.

SUN-RGBD

We use the same 2D detector as in the CS3D setting. When we train our BoxPC Fit network, we set the perturbation parameters to . The loss weights of the BoxPC Fit network are set to . For our 3D object detector, we set the parameters to . Next, we use the trained BoxPC Fit network to refine the 3D box predictions of this fully-supervised model without further training.

KITTI

We use the same 2D detections as the CS3D setting. We train one BoxPC Fit network for each of the 3 classes to allow each network to specialize on improving box predictions for a single class. For Cars, we set . For Pedestrians, we set . For Cyclists, we set . The loss weights of the BoxPC Fit network are set to . Next, we use the trained BoxPC Fit networks to refine the 3D box predictions of the 3D object detector model released by [21] without further training.

E Additional Results for SUN-RGBD

In Fig. 10, we plot the precision-recall curves for different methods to study the importance of each proposed component. All the methods are cross-category semi-supervised except for “FPN (Fully-supervised)”, which is the fully-supervised FPN [21] that gives an upper-bound performance for the methods. We observe that “Ours + BoxPC + R + P” (in yellow) has higher precision at every recall than the baseline “FPN w/o OneHot” (in blue) for almost all classes.

We also provide additional qualitative results on the SUN-RGBD dataset in Fig. 11. The predictions made by the baseline model tend to be large and inaccurate due to the lack of strong labels for the weak classes. In the third example of Fig. 11, we see that the predictions of the baseline model can also be highly unnatural, as one cuts through the wall. In contrast, we observe that the proposed model’s predictions tend to be more reasonable and closer to the fully-supervised model’s, despite not having strong labels for the weak classes. In the last two examples of Fig. 11, the heavy occlusions in the scenes make it difficult for the baseline and proposed models to make good predictions.

F Qualitative Results for KITTI

In Fig. 12 and Fig. 13, we provide qualitative comparisons between the baseline, fully-supervised and proposed models on the KITTI dataset. In both figures, we again observe that the proposed model is closer to the fully-supervised model than the baseline model. Notably, in the first example (top left) and ninth example (bottom right) of Fig. 12, the proposed model makes good predictions on Pedestrians despite the crowded scene.

For almost all of the Car instances, the baseline model makes excessively small predictions because it was trained on Pedestrian and Cyclist classes which are much smaller. Our proposed model is able to make better predictions on Cars but the orientation of the 3D box predictions can be improved further.

In the last two examples of Fig. 13, we show some of the difficult scenes for our baseline and proposed models. This is because there are heavy occlusions and poor understanding of Cars since the models were trained on Pedestrians and Cyclists, which have very different shapes and sizes from Cars.