Aug3D-RPN: Improving Monocular 3D Object Detection by Synthetic Images with Virtual Depth

07/28/2021 · by Chenhang He, et al.

Current geometry-based monocular 3D object detection models can efficiently detect objects by leveraging perspective geometry, but their performance is limited by the absence of accurate depth information. Although this issue can be alleviated in a depth-based model, where a depth estimation module is plugged in to predict depth information before 3D box reasoning, the introduction of such a module dramatically reduces the detection speed. Instead of training a costly depth estimator, we propose a rendering module that augments the training data by synthesizing images with virtual depths. The rendering module takes as input the RGB image and its corresponding sparse depth image and outputs a variety of photo-realistic synthetic images, from which the detection model can learn more discriminative features to adapt to the depth changes of the objects. Besides, we introduce an auxiliary module that improves the detection model by jointly optimizing it through a depth estimation task. Both modules work only at training time, and no extra computation is introduced to the detection model at inference. Experiments show that, equipped with our proposed modules, a geometry-based model can achieve leading accuracy on the KITTI 3D detection benchmark.


1 Introduction

3D object detection is one of the key technologies in autonomous driving. Detecting and localizing objects accurately in 3D space is critical for autonomous vehicles to run safely over long distances. While the use of high-end LiDAR for high-precision 3D object localization [52, 45, 36, 17, 46, 13, 35] has drawn much interest in academia and industry, the demand for more economical monocular alternatives is also increasing, and many monocular 3D object detection methods [2, 24, 30, 27, 51, 33, 1, 8, 15, 11, 26, 32, 40, 47, 34, 44] have been developed in recent years.

Figure 1: (a) Bounding box predictions by M3D-RPN [2]: the model can accurately predict the size and orientation of the objects, but fails to accurately localize objects at far distances. (b) Bounding box predictions by M3D-RPN learnt with synthetic virtual-depth images. The predicted and ground-truth boxes are shown in green and red, respectively.

Current research on monocular 3D object detection can be categorized into two streams, i.e., geometry-based methods and depth-based methods. Geometry-based methods [2, 30, 31, 24, 33, 51, 8, 19] mainly focus on predicting various geometric primitives based on perspective transformation and detecting the objects by fitting these primitives. Good geometric primitives can establish a robust connection between 3D object predictions and 2D image features. Figure 1(a) shows an example, where the geometric relationship between 2D and 3D anchors [2] is used to learn the object size and orientation from image semantics. However, it is difficult for such methods to estimate object depth from a single image. As a result, geometry-based methods are not able to accurately localize objects at long distances.

Figure 2: Examples of reference image (in red) and its corresponding synthetic virtual-depth images (in yellow) with relative virtual depth -1, +1, +2, +3 and +4 meters.

The above mentioned problem can be mitigated by depth-based methods [44, 10, 11, 40, 47, 32, 26, 1, 27, 15], where depth information is utilized to provide more cues for bounding-box reasoning. For instance, the depth can be transformed into pseudo-LiDAR representations [40, 47, 32, 27, 1, 41, 26], fused with image features [44], or used to change the receptive field of convolutional kernels [11] so that the detection model can capture more structural details of the objects. However, since depth information is not available at testing time, a depth estimator trained on a collection of depth images has to be plugged into the detection pipeline, making depth-based methods hard to run at real-time speed.

In this work, we investigate a novel learning approach to improve 3D object detection with depth images, without training a cumbersome depth estimator. By utilizing the depth images, we augment the input data by synthesizing a variety of images in the training stage. Unlike traditional data augmentation techniques that manipulate images in 2D space, we synthesize images at different virtual depths (Figure 2), so that the detection model can learn more robust features to adapt to the depth changes of the same semantics. Therefore, there is no need to introduce a depth estimator in the testing stage. As depicted in Figure 1(b), the model trained with synthesized images can better estimate object depth and achieve more accurate bounding-box localization.

To synthesize the images with virtual depth, we take advantage of novel-view synthesis methods [42, 7, 9, 49] and propose an efficient rendering module that uses the depth image to reconstruct the 3D scene of the reference image and synthesizes images with virtual depth through camera displacements. The ground-truth bounding-boxes are also shifted along the depth axis according to the displacement of the current scene, producing synthetic training pairs for the detection model to learn from. There are several challenges for the detection model to learn from those synthetic images. First, the depth images are very sparse, making the synthetic images contain many holes but few meaningful semantics. Second, the provided depth images are not well aligned with the RGB images due to calibration error; the unmatched depth pixels around the object boundaries distort the objects' appearance and cause ghosting artifacts in the synthetic image. To address these issues, we propose a multi-stage rendering module that generates high-quality synthesized images from coarse to fine.

In addition, we propose an auxiliary module that works in conjunction with the backbone of the detection model. The auxiliary module consumes the backbone features for depth estimation and jointly optimizes them with a pixel-wise loss. In this way, the detection model can learn more fine-grained details and achieve more reliable 3D reasoning.

To the best of our knowledge, this is the first attempt to successfully employ synthetic images for 3D object detection. Our proposed learning framework does not introduce additional building blocks (e.g., a depth estimator) or computation in the final deployment. The synthetic images can also be readily used to train other geometry-based detectors. Our contributions are summarized as follows:

  • We propose a novel data augmentation strategy to ensure the effective learning of monocular 3D object detection by generating synthetic images with virtual depth.

  • We propose an efficient rendering module that can synthesize photo-realistic images from coarse to fine.

  • We propose an auxiliary module to jointly optimize the detection features through a pixel-level depth estimation task, leading to more accurate bounding-box reasoning.

We evaluate our learned 3D object detector on the KITTI [12] 3D/BEV detection benchmark, where it demonstrates state-of-the-art performance among current monocular detectors.

Figure 3: (a) Overview of the proposed learning framework. The framework contains a detection model, a rendering module to synthesize virtual-depth images, and (b) a pyramidal auxiliary module to transform the backbone features for dense depth prediction.

2 Related work

Geometry-based monocular detection. The effective use of geometric priors can significantly improve the performance of monocular 3D object detection. Brazil et al. [2] proposed a 2D-3D anchoring mechanism, which employs 2D image features for 3D bounding-box detection. Mousavian et al. [30] and Naiden et al. [31] utilized the 2D-3D perspective constraint to stabilize the predictions of the 3D bounding-box. Liu et al. [24] and Li et al. [19] proposed to estimate the 3D bounding-box from a group of key-points. Qin et al. [33] and Zhou et al. [51] decoupled the detection task into multiple subtasks and optimized them in a global context. Chen et al. [8] utilized the spatial constraints between an object and its occluded neighbors, and proposed a pair-distance function to regularize the bounding-box prediction. Roddick et al. [34] converted 2D image features to a bird's-eye view using an orthographic transformation, providing a holistic view of image features in 3D space.

In this paper, we demonstrate that our proposed synthetic images work well with three geometry-based detectors: SMOKE [24], M3D-RPN [2] and RTM3D [19].

Depth-based monocular detection. Depth images can effectively compensate for the lack of 3D cues in 2D images. Wang et al. [40] and follow-up works [47, 41] explored converting the depth image estimated from monocular input into a pseudo-LiDAR representation, which can better capture the 3D structure of objects using existing point cloud detectors. Qian et al. [32] devised a soft-quantization technique to convert the depth image into the voxel input of a point cloud detector, forming an end-to-end learning scheme. Ma et al. [26] conducted an in-depth investigation of the pseudo-LiDAR representation and proposed an efficient scheme based on 2D CNNs with coordinate transformation. In addition to pseudo-LiDAR conversion, Xu et al. [44] introduced a multi-level fusion algorithm that progressively aggregates the features from RGB and depth images. Ding et al. [11] proposed a depth-aware model, where the convolutional receptive fields are dynamically adjusted by the depth image, distinctly processing image contents at different depth levels.

Unlike the above mentioned methods that require estimating depth information during inference, our proposed framework only uses depth information in the training phase, thus enabling more efficient detection.

Novel-view synthesis. Novel-view synthesis has been widely studied in both vision and graphics. Early approaches [4, 16, 39, 50] usually perform direct image-to-image transformation, enabling end-to-end view manipulation. Later approaches [42, 7, 9, 49, 43, 38] focus on learning the scene geometry, such as point clouds, multi-plane images, or implicit surfaces, from which continuous novel views with high degrees of freedom can be derived.

In this work, considering that monocular detectors are sensitive to depth changes, we synthesize images by moving the target view only along the z-axis, i.e., the depth direction, thereby avoiding the need to complete unknown content for extrapolated views.

Figure 4: The pipeline of rendering a virtual-depth image. For the background, we first apply sparse rendering to generate a virtual image with plausible semantics, and then use an in-painting network to fill in the missing areas. For foreground areas, we warp the 2D view-port of the 3D boxes based on the perspective transformation between the two camera poses.
Figure 5: The naive renderings (top) and the renderings from the contextual image (bottom).
Figure 6: (a) Rendering foreground pixels onto the virtual image with the bounding-box. (b) The foreground depth map, where the depth values are approximated by projecting the ground-truth 3D bounding-box (c) onto the image. (d) The foreground renderings on the virtual image.

3 Methodology

3.1 Overall learning framework

Our proposed learning framework is illustrated in Figure 3. It is composed of three primary components: a detection model, a rendering module and an auxiliary module. On one hand, the rendering module takes as input the RGB image and its corresponding sparse depth image and produces a set of synthetic images at various virtual depths. All these synthetic images are used to train the detector as augmented data. On the other hand, the auxiliary module augments the backbone features of the detection model by jointly optimizing them with a pixel-wise loss through a depth estimation task. Both the rendering and auxiliary modules are removed after training, and only the learned detector is deployed. In this paper, we employ an efficient detection model, termed Aug3D-RPN, which consists of a ResNet-101 [14] backbone and a 2D-3D detection head [11, 2, 3].
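To make the roles of the two train-time-only modules concrete, a minimal sketch of one training iteration is given below. The module interfaces (renderer, detector, aux_head) are illustrative placeholders under assumed APIs, not the released implementation.

```python
def train_step(image, sparse_depth, gt_boxes_3d,
               detector, renderer, aux_head, optimizer, depth_weight=1.0):
    """One training step: render virtual-depth views, then jointly optimize
    detection and the auxiliary depth estimation (placeholder interfaces)."""
    # 1) Rendering module (train-time only): synthesize extra views and the
    #    correspondingly shifted ground-truth boxes / depth maps.
    images, targets = renderer(image, sparse_depth, gt_boxes_3d)

    # 2) Forward the detector; keep backbone features for the auxiliary head.
    feats = detector.backbone(images)
    det_loss = detector.head(feats, targets)

    # 3) Auxiliary module (train-time only): pixel-wise depth loss on the
    #    backbone features.
    pred_depth = aux_head(feats)
    depth_loss = aux_head.loss(pred_depth, targets["depth"])

    loss = det_loss + depth_weight * depth_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At deployment, only `detector` is kept; `renderer` and `aux_head` are dropped entirely.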

In the following subsections, we introduce the above components and the training scheme in detail.

3.2 Synthesizing virtual-depth images for data augmentation

To improve the ability of the monocular detector to discriminate object depth, we augment the training data by generating synthetic images with virtual depth. Here we use $\Delta z$ to denote the camera displacement between the reference and virtual views along the depth axis. The overall pipeline is shown in Figure 4, in which we exploit the depth information to perform virtual-view synthesis and apply an in-painting network to fill in the pixels with empty values. We first map the reference image $I$ into a 3D point cloud based on the nonzero values of its corresponding depth image $D$. For each position $(u, v)$ of the reference image, its 3D coordinate $(x, y, z)$ in the point cloud can be calculated by:

$x = \dfrac{(u - c_u)\, z}{f_u}, \quad y = \dfrac{(v - c_v)\, z}{f_v}, \quad z = D(u, v)$  (1)

where $(f_u, f_v)$ and $(c_u, c_v)$ denote the focal lengths and the principal point of the reference camera, respectively. The virtual-depth image can then be synthesized by projecting the point cloud, shifted by $\Delta z$ along the depth axis, from the camera coordinates onto the image plane of the virtual view. Given the camera projection matrix $\mathbf{P}$, the position $(u', v')$ of each point after projection can be calculated by:

$z' \cdot [u',\; v',\; 1]^\top = \mathbf{P}\, [x,\; y,\; z + \Delta z,\; 1]^\top$  (2)

The rendering process can then be regarded as transporting each colored pixel from $(u, v)$ to $(u', v')$.
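As a concrete illustration of Eq. 1 and Eq. 2, the following NumPy sketch back-projects the pixels with valid depth, shifts them by a virtual displacement along the depth axis, and re-projects them to scatter colors into the virtual view. Occlusion handling is omitted and the function name is illustrative.

```python
import numpy as np

def sparse_render(rgb, depth, fu, fv, cu, cv, delta_z):
    """Warp nonzero-depth pixels of `rgb` to a virtual view displaced by
    `delta_z` meters along the camera z-axis (simplified Eqs. 1-2)."""
    H, W, _ = rgb.shape
    v, u = np.nonzero(depth)                 # pixel positions with valid depth
    z = depth[v, u]
    # Eq. (1): back-project to camera coordinates.
    x = (u - cu) * z / fu
    y = (v - cv) * z / fv
    # Eq. (2): shift along the depth axis and project back to the image plane.
    z_new = z + delta_z
    keep = z_new > 0.1                       # drop points behind the camera
    u2 = np.round(fu * x[keep] / z_new[keep] + cu).astype(int)
    v2 = np.round(fv * y[keep] / z_new[keep] + cv).astype(int)
    inside = (u2 >= 0) & (u2 < W) & (v2 >= 0) & (v2 < H)

    out_rgb = np.zeros_like(rgb)
    out_depth = np.zeros_like(depth)
    # Transport each colored pixel from (u, v) to (u', v').
    out_rgb[v2[inside], u2[inside]] = rgb[v[keep][inside], u[keep][inside]]
    out_depth[v2[inside], u2[inside]] = z_new[keep][inside]
    return out_rgb, out_depth
```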

However, as the depth image is extremely sparse, the output of the above sparse rendering is dominated by black pixels. The limited semantics in the resulting renderings make the in-painting network hard to train. In addition, some depth images are not well aligned with the reference images due to inevitable calibration errors. The unmatched pixels distort the objects' appearance and yield ghost artifacts at object boundaries.

To address the above issues, we perform a pre-rendering step to enhance the density of the renderings. Specifically, we "unfold" the reference image into a contextual image by stacking the neighboring pixels of each position into the channel dimension: given an image of shape H × W × 3, stacking a k × k neighborhood of every pixel yields a contextual image of shape H × W × 3k². We then apply the above rendering formula to obtain the sparse version of the contextual image and "fold" the output back into RGB space. As displayed in Figure 5, the renderings from the contextual image are much denser and full of meaningful content.
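This unfold/fold densification can be written compactly with standard tensor operations; the PyTorch sketch below is a simplified illustration in which the neighborhood size k is a placeholder parameter.

```python
import torch.nn.functional as F

def make_contextual(rgb, k=3):
    """Stack the k x k neighborhood of every pixel into the channel dimension:
    (N, 3, H, W) -> (N, 3*k*k, H, W)."""
    N, _, H, W = rgb.shape
    return F.unfold(rgb, kernel_size=k, padding=k // 2).view(N, 3 * k * k, H, W)

def fold_back(contextual, k=3):
    """Fold a (sparsely rendered) contextual image back to RGB, averaging only
    the non-empty contributions so that the result gets densified."""
    N, C, H, W = contextual.shape
    cols = contextual.view(N, C, H * W)
    summed = F.fold(cols, output_size=(H, W), kernel_size=k, padding=k // 2)
    # Count how many non-empty values contribute to each output pixel.
    mask = (cols != 0).float()
    count = F.fold(mask, output_size=(H, W), kernel_size=k, padding=k // 2)
    return summed / count.clamp(min=1.0)
```

In this sketch, each channel of the contextual tensor would be warped with the sparse rendering above before `fold_back` merges the overlapping neighborhoods.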

As for foreground objects, since they are mostly rigid bodies, we can use the surface of the ground-truth bounding-box to approximate their depth. As shown in Figure 6, we first calculate the depth values in the 2D "viewport" of these objects based on the geometries of their bounding-boxes. Then, based on these depth values, we calculate the appearance flow for each foreground pixel according to Eq. 1 and Eq. 2, and remap them onto the synthetic image.
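A simplified sketch of this foreground step is given below, where every pixel inside an object's 2D viewport is assigned a single depth (a coarse stand-in for the box-surface depth described above) and warped with the same Eq. 1/Eq. 2 flow; the names and the constant-depth simplification are illustrative.

```python
import numpy as np

def render_foreground(rgb, out_rgb, boxes_2d, box_depths, fu, fv, cu, cv, delta_z):
    """Warp foreground pixels using one approximate depth per object."""
    H, W, _ = rgb.shape
    for (x1, y1, x2, y2), z in zip(boxes_2d, box_depths):
        x1, x2 = max(0, int(x1)), min(W, int(x2))
        y1, y2 = max(0, int(y1)), min(H, int(y2))
        u, v = np.meshgrid(np.arange(x1, x2), np.arange(y1, y2))
        u, v = u.ravel(), v.ravel()
        # Appearance flow: back-project at the approximated depth, shift, re-project.
        x = (u - cu) * z / fu
        y = (v - cv) * z / fv
        z_new = z + delta_z
        if z_new <= 0.1:
            continue                         # object would fall behind the camera
        u2 = np.round(fu * x / z_new + cu).astype(int)
        v2 = np.round(fv * y / z_new + cv).astype(int)
        ok = (u2 >= 0) & (u2 < W) & (v2 >= 0) & (v2 < H)
        out_rgb[v2[ok], u2[ok]] = rgb[v[ok], u[ok]]
    return out_rgb
```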

After that, we employ a standard U-Net architecture with partial convolutions [21] to complete the color pixels in the missing areas. To train the in-painting network, we take the detection images as training targets and their pre-renderings as training inputs. We use a pixel-wise $\ell_1$ loss, a perceptual loss and an adversarial loss to supervise the in-painting network. Given a reference image $I$ and its pre-rendering $\tilde{I}$ completed by the network $G$, we have:

$\mathcal{L}_{inp} = \big\|G(\tilde{I}) - I\big\|_1 + \lambda_{per}\, \big\|\phi(G(\tilde{I})) - \phi(I)\big\|_1 + \lambda_{adv}\, \mathcal{L}_{adv}$  (3)

where $\phi(\cdot)$ is the hidden feature activation from a pre-trained VGG-16 classification network, and $\mathcal{L}_{adv}$ is a relativistic adversarial loss from [48], in which a global and a local discriminator are utilized for perception enhancement. We leave the configuration of our in-painting network and more training details to the supplementary material.
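For reference, the three loss terms can be assembled as in the sketch below; the VGG layer choice, the weighting coefficients and the standard (non-relativistic) generator loss used here are illustrative stand-ins rather than the exact settings of Eq. 3.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# Frozen VGG-16 feature extractor for the perceptual term (layer cut-off is an
# assumption; inputs are expected to be ImageNet-normalized).
_vgg_feat = vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in _vgg_feat.parameters():
    p.requires_grad = False

def inpainting_loss(completed, target, disc_fake_logits,
                    lambda_per=0.1, lambda_adv=0.01):
    """Pixel-wise L1 + VGG perceptual + adversarial (generator-side) losses.
    The lambda values are placeholders, not the paper's settings."""
    pix = F.l1_loss(completed, target)
    per = F.l1_loss(_vgg_feat(completed), _vgg_feat(target))
    # Generator wants the discriminator to classify completions as real;
    # a plain non-saturating loss stands in for the relativistic loss of [48].
    adv = F.binary_cross_entropy_with_logits(
        disc_fake_logits, torch.ones_like(disc_fake_logits))
    return pix + lambda_per * per + lambda_adv * adv
```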

3.3 Joint depth estimation for feature augmentation

One of the core ideas of this work is to utilize pixel-level supervision to improve an object-level task, which has been proven effective in many prior works [13, 20, 29]. Based on this, we propose a detachable auxiliary module that augments the backbone features by jointly optimizing the depth estimation and object detection tasks. In particular, we use a pyramidal architecture to deploy our auxiliary module, so that it can aggregate multi-scale detection features into a full-resolution representation. For ResNet-101, we fuse the outputs from the first convolutional layer Conv1 and the four residual blocks layer1, layer2, layer3 and layer4.

As depicted in Figure 3(b), we upsample the feature from the bottom stage with bilinear interpolation. The features across stages are first concatenated and then pass through a convolutional layer to generate a feature map with 256 channels. In the final stage, one additional 1×1 convolution is applied to produce the depth estimation $\hat{D}$. Since the foreground and background components have different depth distributions, we apply the following weighted loss to jointly optimize the detection and depth estimation tasks:

$\mathcal{L} = \mathcal{L}_{det} + \dfrac{\lambda_{fg}}{|\Omega_{fg}|} \sum_{p \in \Omega_{fg}} \ell\big(\hat{D}(p), D(p)\big) + \dfrac{\lambda_{bg}}{|\Omega_{bg}|} \sum_{p \in \Omega_{bg}} \ell\big(\hat{D}(p), D(p)\big)$  (4)

where $\mathcal{L}_{det}$ is the detection loss in [2], $D$ is the ground-truth depth image, and $\Omega_{fg}$ and $\Omega_{bg}$ respectively represent the nonzero foreground and background pixels in the ground-truth depth image. We found that, by properly balancing $\lambda_{fg}$ and $\lambda_{bg}$ and using the smooth-$L_1$ loss [23] for $\ell$, our detection model generally achieves the best performance.
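A compact sketch of how such a pyramidal auxiliary head and the weighted depth loss of Eq. 4 could be implemented on ResNet-101 features is shown below; the layer configuration, loss form and weights are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxDepthHead(nn.Module):
    """Aggregate multi-scale backbone features coarse-to-fine and predict a
    dense depth map (used at training time only)."""
    def __init__(self, in_channels=(64, 256, 512, 1024, 2048), mid=256):
        super().__init__()
        # One fusion conv per pyramid stage, deepest stage first.
        prev = (0,) + (mid,) * (len(in_channels) - 1)
        self.fuse = nn.ModuleList(
            nn.Conv2d(c_prev + c_in, mid, kernel_size=3, padding=1)
            for c_prev, c_in in zip(prev, reversed(in_channels)))
        self.predict = nn.Conv2d(mid, 1, kernel_size=1)   # final 1x1 conv -> depth

    def forward(self, feats):
        # feats: [conv1, layer1, layer2, layer3, layer4], high-res to low-res.
        x = None
        for f, conv in zip(reversed(feats), self.fuse):
            if x is not None:
                x = F.interpolate(x, size=f.shape[-2:], mode="bilinear",
                                  align_corners=False)
                f = torch.cat([x, f], dim=1)
            x = F.relu(conv(f))
        return self.predict(x)

def aux_depth_loss(pred, gt_depth, fg_mask, lambda_fg=1.0, lambda_bg=1.0):
    """Weighted depth loss over nonzero foreground/background pixels (Eq. 4);
    the lambda values and smooth-L1 form are placeholders."""
    valid = gt_depth > 0
    fg = valid & fg_mask
    bg = valid & ~fg_mask
    loss = 0.0
    if fg.any():
        loss = loss + lambda_fg * F.smooth_l1_loss(pred[fg], gt_depth[fg])
    if bg.any():
        loss = loss + lambda_bg * F.smooth_l1_loss(pred[bg], gt_depth[bg])
    return loss
```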

3.4 3D object reasoning with 2D-3D detection head

The detection head predicts both 2D and 3D geometries of the bounding-boxes based on a set of pre-defined 2D-3D anchors, and then associates these geometries according to the perspective relationship.

Specifically, we define a 2D-3D anchor by $[w_{2D}, h_{2D}, w_{3D}, h_{3D}, l_{3D}, z, \alpha]$, where $w_{2D}$ and $h_{2D}$ denote its width and height in image scale, and $[w_{3D}, h_{3D}, l_{3D}]$, $z$ and $\alpha$ represent its 3D dimensions, depth and observation angle in camera coordinates, respectively. Given the 2D predictions $[t_x, t_y, t_w, t_h]$, the 2D bounding-box at each anchor position $(x_a, y_a)$ can be written as:

$[x, y] = [x_a + t_x \cdot w_{2D},\; y_a + t_y \cdot h_{2D}]$  (5)
$[w, h] = [w_{2D} \cdot e^{t_w},\; h_{2D} \cdot e^{t_h}]$  (6)

Given the 3D predictions $[t_w^{3D}, t_h^{3D}, t_l^{3D}, t_z, t_\alpha]$, the 3D bounding box, in terms of its height, width, length and observation angle, is given as follows:

$[h', w', l'] = [h_{3D} \cdot e^{t_h^{3D}},\; w_{3D} \cdot e^{t_w^{3D}},\; l_{3D} \cdot e^{t_l^{3D}}]$  (7)
$\alpha' = \eta(\alpha + t_\alpha)$  (8)

With the predicted depth $z' = z + t_z$, its location $[X, Y, Z]$ in camera coordinates is therefore subject to

$z' \cdot [x,\; y,\; 1]^\top = \mathbf{P}\, [X,\; Y,\; Z,\; 1]^\top$  (9)

where $\mathbf{P}$ is the camera projection and $\eta(\cdot)$ serves to round the radian value within $[-\pi, \pi]$.
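To make the decoding of Eqs. 5-9 concrete, a small NumPy sketch for a single anchor is given below; the exact parametrization and helper names are illustrative rather than a reference implementation.

```python
import numpy as np

def decode_anchor(anchor, t2d, t3d, P):
    """Decode one 2D-3D anchor with its predicted offsets.

    anchor: dict with 2D size (w2d, h2d), position (xa, ya) and 3D priors
            (w3d, h3d, l3d, z, alpha).
    t2d:    (tx, ty, tw, th) 2D offsets.  t3d: (tw3, th3, tl3, tz, ta) 3D offsets.
    P:      3x4 camera projection matrix (left 3x3 assumed to be the intrinsics).
    """
    tx, ty, tw, th = t2d
    tw3, th3, tl3, tz, ta = t3d

    # Eqs. (5)-(6): 2D box center and size.
    x = anchor["xa"] + tx * anchor["w2d"]
    y = anchor["ya"] + ty * anchor["h2d"]
    w, h = anchor["w2d"] * np.exp(tw), anchor["h2d"] * np.exp(th)

    # Eqs. (7)-(8): 3D dimensions and observation angle (wrapped to [-pi, pi]).
    dims = np.array([anchor["h3d"], anchor["w3d"], anchor["l3d"]]) * \
           np.exp([th3, tw3, tl3])
    alpha = np.arctan2(np.sin(anchor["alpha"] + ta), np.cos(anchor["alpha"] + ta))

    # Eq. (9): back-project the box center at the predicted depth to get the
    # 3D location in camera coordinates (ignores any small translation in P).
    z_pred = anchor["z"] + tz
    K = P[:, :3]
    center = z_pred * np.linalg.inv(K) @ np.array([x, y, 1.0])
    return (x, y, w, h), dims, alpha, center
```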

4 Experiments

| Method | Conference | #Frames | 3D Easy | 3D Mod. | 3D Hard | BEV Easy | BEV Mod. | BEV Hard | Runtime (s/img) |
|---|---|---|---|---|---|---|---|---|---|
| FQNet [22] | CVPR19 | 1 | 2.77 | 1.51 | 1.01 | 5.40 | 3.23 | 2.46 | 0.5 |
| ROI-10D [28] | CVPR19 | 1 | 4.32 | 2.02 | 1.46 | 9.78 | 4.91 | 3.74 | 0.2 |
| GS3D [18] | CVPR19 | 1 | 4.47 | 2.90 | 2.47 | 8.41 | 6.08 | 4.94 | 2.3 |
| Shift R-CNN [31] | ICIP19 | 1 | 6.88 | 3.87 | 2.83 | 11.84 | 6.82 | 5.27 | 0.25 |
| MonoFENet [1] | TIP19 | 1 | 8.35 | 5.14 | 4.10 | 17.03 | 11.03 | 9.05 | 0.15 |
| MLF [44] | CVPR18 | 1 | 7.08 | 5.18 | 4.68 | - | - | - | 0.12 |
| MonoGRNet [33] | AAAI19 | 1 | 9.61 | 5.74 | 4.25 | 18.19 | 11.17 | 8.73 | 0.04 |
| MonoPSR [15] | CVPR19 | 1 | 10.76 | 7.25 | 5.85 | 18.33 | 12.58 | 9.91 | 0.2 |
| MonoPL [41] | CVPR19 | 1 | 10.76 | 7.50 | 6.10 | 21.27 | 13.92 | 11.25 | - |
| MonoDIS [37] | ICCV19 | 1 | 10.37 | 7.94 | 6.40 | 17.23 | 13.19 | 11.12 | - |
| M3D-RPN [2] | ICCV19 | 1 | 14.76 | 9.71 | 7.42 | 21.02 | 13.67 | 10.23 | 0.16 |
| SMOKE [24] | CVPRW20 | 1 | 14.03 | 9.76 | 7.84 | 20.83 | 14.49 | 12.75 | 0.03 |
| MonoPair [8] | CVPR20 | 1 | 13.04 | 9.99 | 8.65 | 19.28 | 14.83 | 12.89 | 0.06 |
| AM3D [27] | ICCV19 | 1 | 16.50 | 10.74 | 9.52 | 25.03 | 17.32 | 14.91 | 0.4 |
| RTM3D [19] | ECCV20 | 1 | 14.41 | 10.34 | 8.77 | 19.17 | 14.20 | 11.99 | 0.05 |
| PatchNet [26] | ECCV20 | 1 | 15.68 | 11.12 | 10.17 | 22.97 | 16.86 | 14.97 | 0.4 |
| D4LCN [11] | CVPR20 | 1 | 16.65 | 11.72 | 9.51 | 22.51 | 16.02 | 12.55 | 0.6* |
| Kinematic3D [3] | ECCV20 | 4 | 19.07 | 12.72 | 9.17 | 26.69 | 17.52 | 13.10 | 0.12 |
| Aug3D-RPN (ours) | - | 1 | 17.82 | 12.99 | 9.78 | 26.00 | 17.89 | 14.18 | 0.08 |

* The runtime reported on the KITTI leaderboard does not count the depth estimation; we approximate the overall runtime based on the official code.

Table 1: Performance comparison with state-of-the-art methods on the KITTI test set. The BEV and 3D object detection metrics are used, reported by AP40 (IoU ≥ 0.7). The top three performers are indicated by red, blue and green colors, respectively. The runtime is reported from the KITTI leaderboard with slight variances in hardware.

We use the KITTI dataset [12] to evaluate our proposed Aug3D-RPN. The dataset contains 7,481 training samples and 7,518 testing samples. To evaluate the effectiveness of different modules, we follow the split of [6], dividing the training set into 3,712 training samples and 3,769 validation samples.

There are two detection tasks, known as 3D detection (3D) and bird's-eye-view localization (BEV). Each task has three levels of difficulty, easy, moderate and hard, according to the object size, occlusion and truncation level, and the ranking is based on the average precision (AP) at the moderate level. The AP computed with 11-point [5] and 40-point [37] interpolation is denoted by AP11 and AP40, respectively.
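The two metrics differ only in the set of recall positions at which precision is sampled; a minimal sketch:

```python
import numpy as np

def interpolated_ap(recalls, precisions, num_points=40):
    """AP by sampling the max precision at fixed recall positions.
    AP11 uses {0, 0.1, ..., 1.0} [5]; AP40 uses {1/40, 2/40, ..., 1.0} [37]."""
    if num_points == 11:
        sample_points = np.linspace(0.0, 1.0, 11)
    else:
        sample_points = np.linspace(1.0 / 40, 1.0, 40)
    ap = 0.0
    for r in sample_points:
        mask = recalls >= r
        ap += precisions[mask].max() if mask.any() else 0.0
    return ap / num_points
```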

4.1 Implementation details

We first extract the depth images following [47, 32]. Both the depth images and RGB images are padded to a fixed resolution, and their mirrored versions are randomly sampled with a probability of 0.25 during training. For each ground-truth 3D box, we calculate its 2D box by projecting it onto the image plane. We ignore objects whose 2D boxes are less than 16 pixels or larger than 256 pixels in height, as well as those with visibility less than 0.5.

To build the anchor classification target, we calculate the IoU between the 2D anchors and the 2D ground-truths. An IoU threshold of 0.5 is used to label the anchors as positive or negative. Anchors that have an IoU greater than 0.5 with an ignore region are ignored. Besides, we apply online hard negative mining (OHNM) [23] to re-balance the positive and negative anchor targets with a ratio of 1:3.
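A possible implementation of this labeling and hard-negative-mining step is sketched below; the helper interface (per-anchor IoUs and classification losses as inputs) is an assumption.

```python
import torch

def assign_and_mine(anchor_ious, ignore_ious, cls_loss, pos_iou=0.5, neg_ratio=3):
    """Label anchors by IoU and keep a 1:neg_ratio positive:negative ratio via
    online hard negative mining on the per-anchor classification loss."""
    # anchor_ious:  (A,) best IoU of each anchor with any 2D ground-truth box.
    # ignore_ious:  (A,) best IoU of each anchor with any ignore region.
    # cls_loss:     (A,) non-negative classification loss used for mining.
    pos = anchor_ious >= pos_iou
    ignored = (~pos) & (ignore_ious >= pos_iou)
    neg = ~pos & ~ignored

    num_neg_keep = min(int(neg_ratio * pos.sum().item()), int(neg.sum().item()))
    neg_losses = cls_loss.clone()
    neg_losses[~neg] = -1.0                     # exclude non-negatives from mining
    hard_neg_idx = neg_losses.topk(num_neg_keep).indices
    keep_neg = torch.zeros_like(neg)
    keep_neg[hard_neg_idx] = True
    return pos, keep_neg                        # anchors contributing to the cls loss
```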

To create virtual-depth images, we ignore objects that are truncated by the camera or have visibility less than 0.6. By removing the depth values of these objects, our in-painting network completes their areas as background. For each training iteration, we sample two camera displacements from one uniform range and one additional displacement from another range chosen to increase the number of objects truncated by the camera.

Our Aug3D-RPN detector is trained for 60 epochs using the SGD optimizer. The batch size, learning rate and weight decay are set to 4, 0.004 and 0.0005, respectively. The learning rate is decayed with a cosine annealing strategy [25]. In the inference phase, we select the predicted boxes with confidence larger than 0.75 and apply non-maximum suppression (NMS) with an IoU threshold of 0.4 to remove duplicated boxes.
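This post-processing can be written in a few lines with torchvision's NMS; a sketch using the thresholds above:

```python
from torchvision.ops import nms

def postprocess(boxes_2d, scores, conf_thresh=0.75, nms_iou=0.4):
    """Keep confident predictions and suppress duplicates with 2D NMS."""
    keep = scores > conf_thresh
    boxes_2d, scores = boxes_2d[keep], scores[keep]
    idx = nms(boxes_2d, scores, nms_iou)      # indices of the retained boxes
    return boxes_2d[idx], scores[idx]
```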

| Method | 3D@0.7 E | 3D@0.7 M | 3D@0.7 H | BEV@0.7 E | BEV@0.7 M | BEV@0.7 H | 3D@0.5 E | 3D@0.5 M | 3D@0.5 H | BEV@0.5 E | BEV@0.5 M | BEV@0.5 H |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M3D-RPN [2] | 20.88 | 17.39 | 15.51 | 27.13 | 21.71 | 18.30 | 50.17 | 40.27 | 33.62 | 58.21 | 43.68 | 36.23 |
| M3D-RPN + Syn. | 24.41 | 20.52 | 17.13 | 31.86 | 26.12 | 20.12 | 54.56 | 43.72 | 35.25 | 63.68 | 47.99 | 36.79 |
| Delta | 3.53 | 3.13 | 1.62 | 4.73 | 4.41 | 1.82 | 4.39 | 3.45 | 1.63 | 5.47 | 4.31 | 0.56 |
| SMOKE [24] | 18.32 | 15.84 | 15.23 | 24.39 | 20.64 | 17.37 | 50.08 | 37.39 | 32.87 | 53.93 | 39.98 | 39.20 |
| SMOKE + Syn. | 21.46 | 18.31 | 16.60 | 28.11 | 24.12 | 18.99 | 54.23 | 41.02 | 35.22 | 56.98 | 44.55 | 40.47 |
| Delta | 3.14 | 2.47 | 1.37 | 3.72 | 3.48 | 1.62 | 4.15 | 3.63 | 2.35 | 3.05 | 4.57 | 1.27 |
| RTM3D [19] | 19.19 | 16.70 | 16.14 | 25.96 | 21.88 | 18.88 | 54.97 | 42.68 | 36.95 | 60.98 | 45.74 | 42.93 |
| RTM3D + Syn. | 22.12 | 19.29 | 17.25 | 30.21 | 25.37 | 20.28 | 58.72 | 45.77 | 38.27 | 64.62 | 49.91 | 44.15 |
| Delta | 2.93 | 2.59 | 1.11 | 4.25 | 3.49 | 1.40 | 3.75 | 3.09 | 1.32 | 3.64 | 4.17 | 1.22 |

Table 2: Evaluation of 3D/BEV detection performance on the KITTI val set, reported by AP11 (E/M/H = easy/moderate/hard; "@0.7" and "@0.5" denote the IoU threshold). The abbreviation "Syn." indicates using synthetic images for training. Baseline results are reproduced with the official public codes and differ slightly from those reported in the original papers.

4.2 Comparison with the state-of-the-art

We compare our Aug3D-RPN detector with other state-of-the-art detectors by submitting the detection results to the KITTI server for evaluation. The evaluation is based on AP40 and the results are presented in Table 1. As can be seen, our method ranks higher than all previous single-frame methods in both the 3D and BEV metrics. The performance on the easy level is comparable to the best performer, Kinematic3D [3], which is a multi-frame detector. Besides, our method runs at only 80 ms per image, which is five or more times faster than recently proposed detectors such as AM3D (400 ms), D4LCN (600 ms) and PatchNet (400 ms); all these competitors are depth-based detectors and rely on an expensive dense depth input. In contrast, Aug3D-RPN is trained to be depth-aware without explicitly requiring depth information in the testing stage. The ResNet backbone and the detection head take only around 50 ms and 30 ms during inference, respectively.

Compared with geometry-based monocular 3D detectors such as SMOKE [24], M3D-RPN [2] and RTM3D [19], Aug3D-RPN presents significant advantages, outperforming these methods in both 3D and BEV detection. The performance on the KITTI val set is also shown in Table 3. Again, our method achieves state-of-the-art performance on most detection metrics, which demonstrates the effective learning with synthetic images and the auxiliary module.

| Method | Easy 3D / BEV | Moderate 3D / BEV | Hard 3D / BEV |
|---|---|---|---|
| MonoPSR [15] | 12.75 / 20.63 | 11.48 / 18.67 | 8.59 / 14.45 |
| MonoDIS [37] | 18.05 / 24.26 | 14.98 / 18.43 | 13.42 / 16.95 |
| SMOKE [24] | 14.76 / 19.99 | 12.85 / 15.61 | 11.50 / 15.28 |
| M3D-RPN [2] | 20.27 / 25.94 | 17.06 / 21.18 | 15.21 / 17.90 |
| RTM3D [19] | 19.19 / 25.56 | 16.70 / 22.12 | 16.14 / 20.91 |
| AM3D [27] | 32.23 / - | 21.09 / - | 17.26 / - |
| D4LCN [11] | 26.97 / 34.82 | 21.71 / 25.83 | 18.22 / 23.53 |
| Aug3D-RPN (ours) | 28.53 / 36.27 | 22.61 / 28.76 | 18.34 / 23.70 |

All entries are 3D / BEV AP at IoU ≥ 0.7.

Table 3: 3D object detection performance on KITTI val set, reported by AP11. “-” means that the results are not available. The top 2 performers are indicated by red and blue colors, respectively.

| Exp. | Syn. | Aux-fg. | Aux-bg. | 3D Easy | 3D Moderate | 3D Hard |
|---|---|---|---|---|---|---|
| 1 | | | | 22.62 | 17.44 | 15.48 |
| 2 | ✓ | | | 25.89 | 20.45 | 17.22 |
| 3 | | ✓ | | 24.51 | 18.91 | 16.63 |
| 4 | | | ✓ | 23.14 | 17.98 | 16.25 |
| 5 | | ✓ | ✓ | 25.16 | 19.48 | 17.31 |
| 6 | ✓ | ✓ | ✓ | 28.53 | 22.61 | 18.34 |

3D detection results are reported at IoU ≥ 0.7.

Table 4: Ablation Experiments on the KITTI val set using AP11. “Syn” means using synthetic images for training, “Aux-fg” and “Aux-bg” refer to imposing auxiliary loss on the foreground and background image areas, respectively.
Figure 7: The Pseudo-LiDAR generated by projecting the depth estimation from (a) Exp.5 and (b) Exp.6. The ground-truth bounding-boxes are shown in red.

4.3 Ablation study

We first demonstrate that our synthetic virtual-depth images work well with other geometry-based detectors. Here we choose three baseline detectors, M3D-RPN [2], SMOKE [24] and RTM3D [19], all of which achieve real-time efficiency with neat architectures. By running their public codes, we report their performance on the KITTI val set. As shown in Table 2, the models trained with virtual-depth images consistently achieve around 2–3 points (in AP11) of improvement on the easy and moderate levels. The smaller improvement on the hard level is because most objects in this category have very few depth values due to heavy occlusion and long distance, which makes them hard to render in the synthetic images.

We then perform an in-depth analysis of how different components contribute to Aug3D-RPN.

Figure 8: Examples of detection results on the KITTI val set: (a) predictions by the baseline detection model; (b) predictions by the proposed Aug3D-RPN model. The predicted and ground-truth bounding boxes are shown in green and red, respectively. The predictions are also plotted on the 3D point cloud for better visualization. Best viewed in color.

| Method | 0–10 m | 10–20 m | 20–30 m | 30–40 m |
|---|---|---|---|---|
| w/o Syn. | 2.05 | 3.29 | 5.07 | 9.41 |
| w/ Syn. | 1.47 | 2.18 | 4.23 | 8.80 |
| Relative Delta | -28.3% | -33.7% | -16.6% | -6.5% |

Table 5: The absolute depth error (in meters) in different depth intervals.

Effectiveness of synthetic virtual-depth images. As presented in Table 4, the model learning with synthetic images achieves clear improvements over the baseline model (Exp.2 vs. Exp.1) and over the auxiliary-driven model (Exp.6 vs. Exp.5). On top of the auxiliary-driven model, we also evaluate how virtual-depth images boost the overall pixel-level depth estimation. We calculate the absolute depth error within a reliable depth range of (0 m, 40 m); the errors (in meters) for each 10 m interval are reported in Table 5. As can be seen, the model trained with synthetic images exhibits more accurate depth estimation. Figure 7 presents a qualitative comparison by projecting the estimation into 3D space in the form of Pseudo-LiDAR. The model trained with synthetic data obtains more compact 3D points attached to the bounding-box boundaries, which means that the depth prediction of the objects is more accurate and consistent.
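The binned error in Table 5 can be computed as follows, assuming arrays of predicted and ground-truth depths at valid pixels:

```python
import numpy as np

def binned_abs_depth_error(pred, gt, bins=((0, 10), (10, 20), (20, 30), (30, 40))):
    """Mean absolute depth error (in meters) within each ground-truth depth bin."""
    errors = []
    for lo, hi in bins:
        mask = (gt > lo) & (gt <= hi)
        errors.append(np.abs(pred[mask] - gt[mask]).mean() if mask.any() else float("nan"))
    return errors
```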

Effectiveness of the auxiliary module. As shown in Table 4, the model guided by the auxiliary module earns additional points over the model trained only with synthetic images (Exp.6 vs. Exp.2). The depth estimation loss can also be applied separately on the foreground and background regions of the input image. As shown in Table 4, applying the loss on the foreground regions brings performance gains, and applying it on the background regions improves the performance as well. This reveals that both image components are important for reasoning about object depth. By discriminating the depth of a foreground object, the model becomes aware of its appearance and can better infer its orientation and local geometry. The background encodes the geometry of the scene, which provides additional contextual cues to estimate the object depth.

Qualitative results. In Figure 8, we show predictions from the baseline and full-setting models (Exp.1 vs. Exp.6). As can be observed, the detector resulting from our learning framework predicts more reliable 3D bounding boxes without requiring additional memory or runtime.

5 Conclusion

In this paper, we propose an effective learning framework, consisting of a rendering module and an auxiliary module, to improve monocular 3D object detection. The rendering module augments the training data by synthesizing images with virtual depths, from which the detector can learn to discriminate the depth changes of objects. The auxiliary module guides the detection model to learn better structural information about objects and scenes. Both modules significantly improve the detection accuracy and are removed in the final deployment, adding no computational cost. Experiments on the KITTI 3D/BEV detection benchmark demonstrate that the synthetic images work well with existing monocular detectors. The proposed Aug3D-RPN achieves leading accuracy and, more significantly, is about four times faster than recent state-of-the-art methods.

References

  • [1] W. Bao, B. Xu, and Z. Chen (2019) Monofenet: monocular 3d object detection with feature enhancement networks. IEEE Transactions on Image Processing 29, pp. 2753–2765. Cited by: §1, §1, Table 1.
  • [2] G. Brazil and X. Liu (2019) M3D-RPN: monocular 3d region proposal network for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9287–9296. Cited by: Figure 1, §1, §1, §2, §2, §3.1, §3.3, §4.2, §4.3, Table 1, Table 2, Table 3.
  • [3] G. Brazil, G. Pons-Moll, X. Liu, and B. Schiele (2020) Kinematic 3d object detection in monocular video. In European conference on computer vision, Cited by: §3.1, §4.2, Table 1.
  • [4] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180. Cited by: §2.
  • [5] X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun (2015) 3d object proposals for accurate object class detection. In Advances in Neural Information Processing Systems, pp. 424–432. Cited by: §4.
  • [6] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia (2017) Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1907–1915. Cited by: §4.
  • [7] X. Chen, J. Song, and O. Hilliges (2019) Monocular neural image based rendering with continuous view control. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4090–4100. Cited by: §1, §2.
  • [8] Y. Chen, L. Tai, K. Sun, and M. Li (2020) MonoPair: monocular 3d object detection using pairwise spatial relationships. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12093–12102. Cited by: §1, §1, §2, Table 1.
  • [9] I. Choi, O. Gallo, A. Troccoli, M. H. Kim, and J. Kautz (2019) Extreme view synthesis. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7781–7790. Cited by: §1, §2.
  • [10] H. Chu, W. Ma, K. Kundu, R. Urtasun, and S. Fidler (2018) Surfconv: bridging 3d and 2d convolution for rgbd images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3002–3011. Cited by: §1.
  • [11] M. Ding, Y. Huo, H. Yi, Z. Wang, J. Shi, Z. Lu, and P. Luo (2020-06) Learning depth-guided convolutions for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §1, §1, §2, §3.1, Table 1, Table 3.
  • [12] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237. Cited by: §1, §4.
  • [13] C. He, H. Zeng, J. Huang, X. Hua, and L. Zhang (2020) Structure aware single-stage 3d object detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §3.3.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.1.
  • [15] J. Ku, A. D. Pon, and S. L. Waslander (2019) Monocular 3d object detection leveraging accurate proposals and shape reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11867–11876. Cited by: §1, §1, Table 1, Table 3.
  • [16] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum (2015) Deep convolutional inverse graphics network. In Advances in neural information processing systems, pp. 2539–2547. Cited by: §2.
  • [17] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom (2019) PointPillars: fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12697–12705. Cited by: §1.
  • [18] B. Li, W. Ouyang, L. Sheng, X. Zeng, and X. Wang (2019) Gs3d: an efficient 3d object detection framework for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1019–1028. Cited by: Table 1.
  • [19] P. Li, H. Zhao, P. Liu, and F. Cao (2020) RTM3D: real-time monocular 3d detection from object keypoints for autonomous driving. arXiv preprint arXiv:2001.03343 2. Cited by: §1, §2, §2, §4.2, §4.3, Table 1, Table 2, Table 3.
  • [20] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun (2019-06) Multi-task multi-sensor fusion for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.3.
  • [21] G. Liu, F. A. Reda, K. J. Shih, T. Wang, A. Tao, and B. Catanzaro (2018) Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 85–100. Cited by: §3.2.
  • [22] L. Liu, J. Lu, C. Xu, Q. Tian, and J. Zhou (2019) Deep fitting degree scoring network for monocular 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1057–1066. Cited by: Table 1.
  • [23] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §3.3, §4.1.
  • [24] Z. Liu, Z. Wu, and R. Tóth (2020) SMOKE: single-stage monocular 3d object detection via keypoint estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 996–997. Cited by: §1, §1, §2, §2, §4.2, §4.3, Table 1, Table 2, Table 3.
  • [25] I. Loshchilov and F. Hutter (2016) SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §4.1.
  • [26] X. Ma, S. Liu, Z. Xia, H. Zhang, X. Zeng, and W. Ouyang (2020) Rethinking pseudo-lidar representation. In Proceedings of the European Conference on Computer Vision, Cited by: §1, §1, §2, Table 1.
  • [27] X. Ma, Z. Wang, H. Li, P. Zhang, W. Ouyang, and X. Fan (2019) Accurate monocular 3d object detection via color-embedded 3d reconstruction for autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6851–6860. Cited by: §1, §1, Table 1, Table 3.
  • [28] F. Manhardt, W. Kehl, and A. Gaidon (2019) Roi-10d: monocular lifting of 2d detection to 6d pose and metric shape. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2069–2078. Cited by: Table 1.
  • [29] T. Mordan, N. Thome, G. Henaff, and M. Cord (2018) Revisiting multi-task learning with ROCK: a deep residual auxiliary block for visual detection. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1310–1322. Cited by: §3.3.
  • [30] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka (2017) 3D bounding box estimation using deep learning and geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7074–7082. Cited by: §1, §1, §2.
  • [31] A. Naiden, V. Paunescu, G. Kim, B. Jeon, and M. Leordeanu (2019) Shift r-cnn: deep monocular 3d object detection with closed-form geometric constraints. In 2019 IEEE International Conference on Image Processing, pp. 61–65. Cited by: §1, §2, Table 1.
  • [32] R. Qian, D. Garg, Y. Wang, Y. You, S. Belongie, B. Hariharan, M. Campbell, K. Q. Weinberger, and W. Chao (2020) End-to-end pseudo-lidar for image-based 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5881–5890. Cited by: §1, §1, §2, §4.1.
  • [33] Z. Qin, J. Wang, and Y. Lu (2019) MonoGRNet: a geometric reasoning network for monocular 3d object localization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8851–8858. Cited by: §1, §1, §2, Table 1.
  • [34] T. Roddick, A. Kendall, and R. Cipolla (2019) Orthographic feature transform for monocular 3d object detection. In Proceedings of the British Machine Vision Conference, Cited by: §1, §2.
  • [35] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li (2020) PV-rcnn: point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • [36] S. Shi, X. Wang, and H. Li (2019) Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–779. Cited by: §1.
  • [37] A. Simonelli, S. R. Bulo, L. Porzi, M. López-Antequera, and P. Kontschieder (2019) Disentangling monocular 3d object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1991–1999. Cited by: Table 1, Table 3, §4.
  • [38] V. Sitzmann, M. Zollhöfer, and G. Wetzstein (2019) Scene representation networks: continuous 3d-structure-aware neural scene representations. In Advances in Neural Information Processing Systems, pp. 1121–1132. Cited by: §2.
  • [39] M. Tatarchenko, A. Dosovitskiy, and T. Brox (2016) Multi-view 3d models from single images with a convolutional network. In European Conference on Computer Vision, pp. 322–337. Cited by: §2.
  • [40] Y. Wang, W. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger (2019) Pseudo-lidar from visual depth estimation: bridging the gap in 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8445–8453. Cited by: §1, §1, §2.
  • [41] X. Weng and K. Kitani (2019-10) Monocular 3d object detection with pseudo-lidar point cloud. In IEEE International Conference on Computer Vision Workshops, Cited by: §1, §2, Table 1.
  • [42] O. Wiles, G. Gkioxari, R. Szeliski, and J. Johnson (2020) Synsin: end-to-end view synthesis from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7467–7477. Cited by: §1, §2.
  • [43] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow (2017) Interpretable transformations with encoder-decoder networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5726–5735. Cited by: §2.
  • [44] B. Xu and Z. Chen (2018) Multi-level fusion based 3d object detection from monocular images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2345–2353. Cited by: §1, §1, §2, Table 1.
  • [45] Y. Yan, Y. Mao, and B. Li (2018) Second: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337. Cited by: §1.
  • [46] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia (2019) STD: sparse-to-dense 3d object detector for point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Cited by: §1.
  • [47] Y. You, Y. Wang, W. Chao, D. Garg, G. Pleiss, B. Hariharan, M. Campbell, and K. Q. Weinberger (2019) Pseudo-lidar++: accurate depth for 3d object detection in autonomous driving. arXiv preprint arXiv:1906.06310. Cited by: §1, §1, §2, §4.1.
  • [48] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2018) Generative image inpainting with contextual attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5505–5514. Cited by: §3.2.
  • [49] T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely (2018) Stereo magnification: learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817. Cited by: §1, §2.
  • [50] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros (2016) View synthesis by appearance flow. In European conference on computer vision, pp. 286–301. Cited by: §2.
  • [51] X. Zhou, Y. Peng, C. Long, F. Ren, and C. Shi (2020) MoNet3D: towards accurate monocular 3d object localization in real time. ArXiv abs/2006.16007. Cited by: §1, §1, §2.
  • [52] Y. Zhou and O. Tuzel (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499. Cited by: §1.