Task-Aware Monocular Depth Estimation for 3D Object Detection

09/17/2019 ∙ by Xinlong Wang, et al. ∙ 9

Monocular depth estimation enables 3D perception from a single 2D image, thus attracting much research attention for years. Almost all methods treat foreground and background regions ("things and stuff") in an image equally. However, not all pixels are equal. Depth of foreground objects plays a crucial role in 3D object recognition and localization. To date how to boost the depth prediction accuracy of foreground objects is rarely discussed. In this paper, we first analyse the data distributions and interaction of foreground and background, then propose the foreground-background separated monocular depth estimation (ForeSeE) method, to estimate the foreground depth and background depth using separate optimization objectives and depth decoders. Our method significantly improves the depth estimation performance on foreground objects. Applying ForeSeE to 3D object detection, we achieve 7.5 AP gains and set new state-of-the-art results among other monocular methods.






Code Repositories


Task-Aware Monocular Depth Estimation for 3D Object Detection, AAAI2020


1 Introduction

Depth bridges the gap between 2D and 3D perception in computer vision. A precise depth map of an image provides rich 3D geometry information, such as the locations and shapes of objects and stuff in a scene, and has therefore attracted growing attention in both 2D and 3D understanding. Monocular depth estimation, which aims to predict the depth map from a single image, is an ill-posed problem, as an infinite number of 3D scenes can be projected to the same 2D image. With the development of deep convolutional neural networks [11, 34, 8], recent works have made great progress [42, 6, 17]. They typically consist of an encoder for feature extraction and a decoder for generating the depth of the whole scene, either by regressing the depth values or by predicting the depth range categories. Plausible results have been shown.

When the monocular methods are applied to other tasks focusing on foreground object analysis, e.g., 3D object detection, there are two main obstacles from the low precision of foreground depth: (1) Poor estimate of the object center location; (2) Distorted or faint object shapes. We show some examples in Figure 1. The inaccurate object location and shape make the downstream localization and recognition challenging. The above issues could be handled by enhancing the depth estimation performance on foreground regions. However, all these state-of-the-art methods treat foreground depth and background depth equally, which leads to sub-optimal performance on foreground objects.

Figure 1: Examples of low precision prediction of foreground depth. For each row, the left picture is the projected point cloud transformed from ground truth depth map and RGB image; the right picture is the bird’s-eye-view close-up to compare the depth (in green) predicted by the baseline depth estimation method with the ground truth (in white). The inaccurate object location and shape pose challenges for 3D recognition, localization and orientation estimation.

In fact, foreground depth and background depth show different data distributions. We make qualitative and quantitative comparisons in Figure 1, Figure 2 and Table 1. First, foreground pixels tend to gather into clusters, bring more and larger depth changes, and look like frustums in 3D space rather than flat surfaces such as roads and buildings. Second, foreground pixels account for only a small part of the whole scene. For instance, in the KITTI-Object dataset [7], most pixels belong to the background, while only a small fraction fall within the foreground. Furthermore, not all pixels are equal. As just described, foreground pixels play a more crucial role in downstream applications, e.g., autonomous driving and robotic grasping. For example, an estimation error on a car has very different consequences from the same error on a building. The inaccurate shape and location of the car could be catastrophic for 3D object detectors.

These observations make one wonder how to boost the estimation accuracy on foreground while doing no harm to background. First of all, it is neither a hard example mining problem nor a self-learned local attention problem. Unlike hard example mining, here we want to further enhance the performance on specific regions in the scene, which do not have to be harder examples. As we can see in Figure 3, foreground is indeed not harder than background. Attention mechanisms are widely used to focus on more discriminative local regions in semantic classification problems, e.g., semantic segmentation and fine-grained classification. But this is not the case for depth estimation. Given a close-up of a car in a scene, one could classify the semantic category, but could not tell the depth. Another choice is to train on the foreground and background regions separately, since the data distributions are different. However, we show that the foreground and background are interdependent for inferring the depth and boosting the performance.

Instead, we formulate it as a multi-objective optimization problem. The objective functions of foreground and background depth are separated, and so are the depth decoders. Thus, the foreground depth decoder can fit the foreground depth as well as possible while doing no harm to background.

To summarize, our contributions are as follows:

  • We conduct a pioneering discussion of the difference and interaction between foreground and background in monocular depth estimation. We show that the different patterns of foreground and background depth lead to sub-optimal results on foreground pixels.

  • We propose a new framework, termed ForeSeE, to learn and predict foreground and background depth separately. Specifically, it contains separate depth decoders for foreground and background regions, an objective-sensitive loss function to optimize the corresponding decoders, and a simple yet effective foreground-background merging strategy.

  • With the proposed ForeSeE, we are able to predict much better foreground depth, whereas background depth is not affected. Furthermore, utilizing the predicted depth maps, our model achieves 7.5 AP gains on the 3D object detection task, which effectively verifies our motivation.

2 Related Work

Monocular Depth Estimation. Monocular depth estimation is a long-standing problem in computer vision and robotics. Early works [30, 29] mainly leverage MRF formulations and non-parametric optimization to predict the depth from handcrafted features [9, 14, 19]. As large-scale RGB-D datasets [33, 7] have become available and high-dimensional features are constructed by deep convolutional neural networks (DCNN), numerous novel methods have been proposed to predict per-pixel depth. Most methods formulate depth estimation as a pixel-wise supervised learning problem. Eigen et al. [4] are the first to utilize multi-scale convolutional neural networks to regress the depth map from a single image. Then, various innovative network architectures [21, 20, 16, 41, 42] are proposed to leverage strong high-level features. Fu et al. [6] propose to use multi-scale dilated convolution and remove subsampling in the last few pooling layers of the encoder to obtain large receptive fields. Furthermore, several methods [5, 44] propose to explicitly enforce geometric constraints during optimization. Qi et al. [26] combine surface normal and depth prediction by embedding a depth-surface-normal mutual transformation module into a single network, which acts as a local geometric constraint. By contrast, Yin et al. [37] propose the virtual normal to leverage long-range high-order geometric relations. In this work, we focus on boosting the depth prediction accuracy of foreground objects with the proposed ForeSeE optimization strategy.

Not All Pixels are Equal. Some prior works noticed that it is sub-optimal to treat all pixels equally in dense prediction tasks. Sevilla et al. [31] tackle optical flow estimation by defining different models of image motion for different regions. Li et al. [18] use a deep layer cascade to first segment the easy pixels and then the harder ones. Sun et al. [35] select and weight synthetic pixels which are similar to real ones for learning semantic segmentation. Yuan et al. [43] introduce an instance-level adversarial loss for the video frame interpolation problem. Shen et al. [32] propose an instance-aware image-to-image translation framework. However, different from the above works, we focus on the depth estimation problem and aim at improving the accuracy of 3D object detection.

Monocular 3D Object Detection. The lack of depth information poses a substantial challenge for estimating 3D bounding boxes from a single image. Many works seek help from geometry priors and estimated depth information. Deep3DBox [24] proposes to generate 3D proposals based on a 2D-3D bounding box consistency constraint. ROI-10D [23] introduces RoI lifting to extract fused feature maps from the input image and estimated depth map before the 3D bounding box regression. MonoGRNet [27] estimates the depth of the target 3D bounding box's center to aid the 3D localization. Recently, some works [40, 36, 38] propose to convert the estimated depth map to a LiDAR-like point cloud to help localize 3D objects. Wang et al. [36] directly apply 3D object detection methods to the generated pseudo-LiDAR, and claim that a 3D point cloud is a much better representation than a 2D depth map for utilizing depth information. In these methods, a reliable depth map, especially precise foreground depth, is the key to a successful 3D object detection framework. We perform 3D object detection using the pseudo-LiDAR generated by our depth estimation model. The proposed method largely improves the performance and outperforms state-of-the-art methods, which clearly demonstrates its effectiveness.

Figure 2: Comparison of depth value distribution between foreground pixels and background pixels. Percentage of pixels with depth value within [-8, ] meters is reported.
Region | I | II | III
Foreground | 96.77 | 1.99 | 1.24
Background | 98.63 | 0.94 | 0.43
Table 1: Comparison of depth gradient distribution between foreground and background pixels. The gradients are uniformly discretized into three bins: I, II and III, from small to large. Percentage of pixels at each level is reported.

3 Analysis of Foreground and Background Depth

3.1 Preliminaries

Dataset. The KITTI dataset [7] has witnessed inspiring progress in the field of depth estimation. As most scenes in the KITTI-Raw data contain few foreground objects, we construct a new benchmark based on the KITTI-Object dataset. We collect the corresponding ground-truth depth map for each image in the KITTI-Object training set, and term the result the KITTI-Object-Depth (KOD) dataset. The image-depth pairs are divided into training and testing subsets following [3], which ensures that images in the two subsets belong to different video clips. 2D bounding boxes are used to distinguish the foreground and background pixels. Pixels that fall within the foreground bounding boxes are designated as foreground pixels, while all other pixels are assigned to the background.
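The box-based labelling rule can be sketched as follows. This is a minimal illustration, assuming boxes are given as (x1, y1, x2, y2) pixel coordinates with an exclusive lower-right corner; the function name and the box convention are hypothetical and not taken from the paper.

```python
import numpy as np

def foreground_mask(height, width, boxes):
    """Return a boolean (height, width) mask that is True inside any 2D box.

    `boxes` is an iterable of (x1, y1, x2, y2) tuples (assumed convention).
    """
    mask = np.zeros((height, width), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        # Clamp the box to the image so partially visible objects still count.
        left, top = max(int(np.floor(x1)), 0), max(int(np.floor(y1)), 0)
        right, bottom = min(int(np.ceil(x2)), width), min(int(np.ceil(y2)), height)
        if right > left and bottom > top:
            mask[top:bottom, left:right] = True
    return mask

# Every pixel outside the mask is treated as background.
fg = foreground_mask(4, 5, [(1, 1, 3, 3)])
bg = ~fg
```

Because the mask is derived from boxes rather than instance segments, some background pixels near object boundaries end up labelled as foreground.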

Baseline Method. We adopt the same DCNN-based baseline method [37] which has already shown state-of-the-art performance on several benchmarks. The main structure falls into the typical encoder-decoder style. Given an input image, the encoder extracts the dense features, then the decoder fetches the features and predicts the quantized depth range categories. Specifically, the depth values are discretized into discrete bins in the log space. The quantized labels are assigned to each of the pixels as their classification labels.
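A minimal sketch of log-space depth quantization is given below. The depth range, the number of bins and the function names are illustrative assumptions; the baseline [37] may differ in how it chooses the bin edges.

```python
import numpy as np

def depth_to_bins(depth, d_min, d_max, num_bins):
    """Map metric depths to integer labels of bins uniformly spaced in log space."""
    depth = np.clip(np.asarray(depth, dtype=np.float64), d_min, d_max)
    # Fractional position of log(depth) inside [log(d_min), log(d_max)].
    frac = (np.log(depth) - np.log(d_min)) / (np.log(d_max) - np.log(d_min))
    labels = np.floor(frac * num_bins).astype(np.int64)
    # depth == d_max would land in bin `num_bins`; fold it into the last bin.
    return np.minimum(labels, num_bins - 1)

def bin_centers(d_min, d_max, num_bins):
    """Depth value at the log-space midpoint of each bin."""
    edges = np.linspace(np.log(d_min), np.log(d_max), num_bins + 1)
    return np.exp(0.5 * (edges[:-1] + edges[1:]))

# Illustrative values only: this range and bin count are not from the paper.
labels = depth_to_bins([1.5, 3.0, 6.0, 12.0], d_min=1.0, d_max=16.0, num_bins=4)
```

Log-space bins are narrow for near depths and wide for far depths, so the classification labels are finer where small absolute errors matter most.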

Figure 3: Interaction of foreground and background samples. The depth estimation results (SILog) on foreground and background regions are reported (lower is better). The weight of the foreground objective is on the x-axis.
Figure 4: Illustration of the overall pipeline. (a) Foreground-background separated depth estimation. (b) 3D object detection.

3.2 Analysis on Data Distribution

Few works [10] have analysed the depth distribution, let alone the foreground and background depth distributions separately. Here we investigate two kinds of data distribution of foreground and background pixels in the training subset. Figure 2 shows the depth value distributions. As shown, more than foreground pixels have depth less than , while it is about for background. The foreground depth also shows a heavier long-tail distribution. Depth gradient distributions are shown in Table 1. We use the Laplacian of the depth images as the depth gradient, which calculates the second-order spatial derivatives. The Laplacian image highlights areas of rapid depth change. The outputs are scaled to [0, 255] and uniformly discretized into three bins: I, II and III, from small to large. In this way, all pixels are divided into three levels according to their gradient values. The foreground has a much higher proportion than the background at levels II and III. Besides the depth range and depth gradient, the difference in shapes should also be noted. Generally, depth provides two kinds of information: location and shape. The foreground objects share similar shapes and look like frustums in 3D space, as shown in Figure 1. Based on the above analysis, we propose to consider the foreground and background separately when estimating their depth.
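The gradient statistic described above can be sketched as follows. This is a minimal illustration under several assumptions that the text does not specify: a 4-neighbour Laplacian kernel with edge replication, use of the absolute response, and scaling by the per-image maximum. All function names are hypothetical.

```python
import numpy as np

def laplacian(depth):
    """4-neighbour discrete Laplacian with edge replication at the border."""
    d = np.pad(np.asarray(depth, dtype=np.float64), 1, mode="edge")
    return (d[:-2, 1:-1] + d[2:, 1:-1] + d[1:-1, :-2] + d[1:-1, 2:]
            - 4.0 * d[1:-1, 1:-1])

def gradient_level_percentages(depth, mask, num_levels=3):
    """Percentage of masked pixels whose scaled |Laplacian| falls in each bin."""
    mask = np.asarray(mask, dtype=bool)
    mag = np.abs(laplacian(depth))
    peak = mag.max()
    # Scale responses to [0, 255] (assumed: divide by the image maximum).
    scaled = mag / peak * 255.0 if peak > 0 else np.zeros_like(mag)
    # Uniform bins over [0, 255]; the top edge is folded into the last bin.
    level = np.minimum((scaled / 255.0 * num_levels).astype(np.int64),
                       num_levels - 1)
    counts = np.bincount(level[mask], minlength=num_levels)
    return 100.0 * counts / max(int(mask.sum()), 1)
```

Calling the function once with the foreground mask and once with its complement gives the two rows of a table in the style of Table 1.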

3.3 Separate Objectives

In dense prediction tasks, generally the loss function can be formulated as:

$$ \mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} E(\hat{y}_i, y_i), $$

where $N$ is the number of pixels, and $\hat{y}_i$ and $y_i$ are the prediction and ground truth of pixel $i$. $E$ is the error function, e.g., the widely used cross-entropy error function.

After the analysis in Section 3.2, we further investigate the interaction of foreground and background by splitting the optimization objective. The modified loss function is defined as:

$$ \mathcal{L} = \lambda \, \frac{1}{N_f} \sum_{i=1}^{N_f} E(\hat{y}_i, y_i) + (1 - \lambda) \, \frac{1}{N_b} \sum_{j=1}^{N_b} E(\hat{y}_j, y_j), $$

where $E(\hat{y}_i, y_i)$ is the error between the prediction and the ground truth of pixel $i$, the first sum runs over the $N_f$ foreground pixels, the second over the $N_b$ background pixels, and $\lambda$ acts as the weight to balance the two loss terms. Figure 3 shows our results with different settings of $\lambda$. When $\lambda$ is set to 0, which means only the background samples are used to supervise the training, the result on foreground becomes very poor. Similarly, the performance on background drops sharply when $\lambda$ is set to 1. This verifies that the foreground depth and background depth are distributed differently. Within a certain range, increasing the foreground weight improves the result on background, which indicates that the foreground and background can, to some extent, help each other. Further, it should be noted that the optimal values of $\lambda$ for foreground and background are different. For instance, at the value of $\lambda$ where the model shows its best performance on foreground, the result on background is much poorer. This indicates that the optimization objectives for foreground and background are not consistent. To address these issues, in Section 4, the foreground-background separated depth estimation method is proposed to achieve the optimum points at the same time.
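The split objective above can be sketched in a few lines. This is a minimal illustration operating on a precomputed per-pixel error map; the function and argument names are hypothetical, and `lam` stands for the balancing weight between the foreground and background terms.

```python
import numpy as np

def separated_loss(per_pixel_error, fg_mask, lam):
    """lam * mean error over foreground + (1 - lam) * mean error over background."""
    err = np.asarray(per_pixel_error, dtype=np.float64)
    fg = np.asarray(fg_mask, dtype=bool)
    # Each term is normalised by its own pixel count, so a small foreground
    # region is not drowned out by the much larger background.
    fg_term = err[fg].mean() if fg.any() else 0.0
    bg_term = err[~fg].mean() if (~fg).any() else 0.0
    return lam * fg_term + (1.0 - lam) * bg_term

err = np.array([2.0, 4.0, 1.0, 1.0, 1.0, 1.0])
fg = np.array([True, True, False, False, False, False])
only_background = separated_loss(err, fg, lam=0.0)  # background term only
only_foreground = separated_loss(err, fg, lam=1.0)  # foreground term only
```

With `lam=0.0` the foreground pixels contribute nothing to the loss, which matches the setting described above in which only background samples supervise training.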

3.4 Analysis Summary

We highlight three observations:

  • The foreground and background depth have different depth value distributions, depth gradient distributions and shape patterns;

  • The foreground and background depth reinforce each other due to their shared similarities;

  • The optimization objectives of foreground and background depth estimation are mismatched.

Method | FSL | SD | SO | Foreground absRel | Foreground SILog | Background absRel | Background SILog | Global absRel | Global SILog
ForeSeE |  |  |  | 0.118 | 0.205 | 0.141 | 0.210 | 0.138 | 0.210
 |  |  |  | 0.120 | 0.208 | 0.141 | 0.209 | 0.139 | 0.209
 |  |  |  | 0.120 | 0.205 | 0.147 | 0.217 | 0.144 | 0.216
Baseline |  |  |  | 0.129 | 0.216 | 0.143 | 0.210 | 0.141 | 0.211
Table 2: Ablation study of depth estimation on the KOD dataset. FSL refers to foreground-background sensitive loss; SD refers to separate decoders; SO means separate objectives.

4 ForeSeE

In this section we first introduce the network architecture of our method, then present the proposed loss function, and finally show how the mask used to distinguish foreground and background could be dropped during the inference. The whole pipeline is illustrated in Figure 4.

4.1 Separate Depth Decoders

We construct an additional decoder based on the baseline method [37], so that there are two parallel decoders with the same structure. One of the decoders is for foreground depth prediction, while the other aims to estimate the background depth. Specifically, for an image of size $H \times W$, each decoder outputs a tensor of size $H \times W \times C$, where $C$ is the number of depth range categories.

Foreground regions are cropped from the output of foreground depth decoder. The background depth range predictions are obtained in the same way. The global depth range predictions are generated by a seamless merge of foreground and background regions. Then the depth range predictions are converted to the final depth map using the soft-weighted-sum strategy [15].

4.2 Foreground-background Sensitive Loss Function

As observed in Section 3, although the foreground depth and background depth show different patterns, they do share some similarities and could reinforce each other under an appropriate ratio. Thus, we further weight the foreground and background samples. For either the foreground or the background branch, the loss function is a weighted average over foreground samples and background samples, but with a different bias. Here we define the loss function which supervises the foreground branch as:

$$ \mathcal{L}_f = \alpha \, \mathcal{L}_{ff} + (1 - \alpha) \, \mathcal{L}_{fb}, $$

where $\mathcal{L}_{ff}$ is the mean error of the foreground branch calculated on foreground pixels; $\mathcal{L}_{fb}$ is the mean error of the foreground branch calculated on background pixels; and $\alpha$ is the weight to balance the foreground and background samples during the training of the foreground branch. A larger $\alpha$ leads to more preference for foreground samples. Similarly, the loss function of the background branch is formulated as:

$$ \mathcal{L}_b = \beta \, \mathcal{L}_{bb} + (1 - \beta) \, \mathcal{L}_{bf}, $$

where $\beta$ is the weight; $\mathcal{L}_{bb}$ and $\mathcal{L}_{bf}$ are the mean errors of background predictions and foreground predictions on this background branch.
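The two branch losses can be sketched as below, given one per-pixel error map per branch. This is a minimal illustration, assuming each branch loss is a convex combination of its mean error on its own region and its mean error on the other region; the names `w_f` and `w_b` for the two weights are hypothetical.

```python
import numpy as np

def _masked_mean(values, mask):
    """Mean of `values` where `mask` is True (0.0 if the mask is empty)."""
    return float(values[mask].mean()) if mask.any() else 0.0

def branch_losses(err_fg_branch, err_bg_branch, fg_mask, w_f, w_b):
    """Return (foreground-branch loss, background-branch loss).

    err_fg_branch / err_bg_branch: per-pixel errors of each decoder's output.
    w_f: weight of foreground pixels in the foreground-branch loss (assumed).
    w_b: weight of background pixels in the background-branch loss (assumed).
    """
    ef = np.asarray(err_fg_branch, dtype=np.float64)
    eb = np.asarray(err_bg_branch, dtype=np.float64)
    fg = np.asarray(fg_mask, dtype=bool)
    loss_f = w_f * _masked_mean(ef, fg) + (1.0 - w_f) * _masked_mean(ef, ~fg)
    loss_b = w_b * _masked_mean(eb, ~fg) + (1.0 - w_b) * _masked_mean(eb, fg)
    return loss_f, loss_b
```

Both decoders see every pixel, so each branch still benefits from the other region's samples; the weights only bias each branch towards its own region.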

4.3 Inference without Mask

Here we propose a mask-free merge method such that the binary mask is no longer needed once the training is finished. A max-pooling operation is applied to the foreground and background outputs before the softmax operation, which represent the confidence scores of each range category. For each range category of each pixel, the higher confidence score between the foreground and background outputs is retained to serve as the final prediction. Formally, for the $H \times W \times C$ shaped outputs with per-pixel vectors $f_i$ and $b_i$ ($i = 1, \dots, H \times W$), where $C$ is the number of depth range categories, the final predictions are calculated as:

$$ p_i = \mathcal{M}(f_i, b_i), $$

where $f_i$ and $b_i$ represent the outputs of the foreground and background branches at pixel $i$, and $\mathcal{M}$ is an element-wise maximum operator which takes two vectors as input and outputs a new vector. Then the $H \times W \times C$ shaped output is fed to the softmax and soft-weighted-sum operations to produce the final depth map. The results drop only slightly in absRel compared with the mask-based merge method.
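A minimal sketch of the mask-free merge followed by decoding is shown below. The element-wise maximum and the softmax follow the description above; the decoding step assumes that the soft-weighted sum [15] takes the expectation of log-space bin centers and exponentiates it, which is an assumption about the exact form. Function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def merge_and_decode(fg_logits, bg_logits, log_bin_centers):
    """Merge two (H, W, C) logit tensors without a mask and decode to depth.

    log_bin_centers: (C,) array with the log-depth of each range category.
    """
    # Keep, per pixel and per category, the higher pre-softmax score.
    merged = np.maximum(fg_logits, bg_logits)
    prob = softmax(merged, axis=-1)
    # Assumed soft-weighted sum: expected log-depth, mapped back to depth.
    return np.exp((prob * log_bin_centers).sum(axis=-1))

# Two categories centred at depths 2 and 8: each branch votes for one of them.
fg = np.array([[[3.0, 0.0]]])
bg = np.array([[[0.0, 3.0]]])
depth = merge_and_decode(fg, bg, np.log(np.array([2.0, 8.0])))
```

Since the maximum is taken before the softmax, whichever branch is more confident for a category dominates that category's score, and no foreground mask is needed at inference.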

5 Experiments

5.1 Experiment Settings

Datasets. We carry out experiments on the KITTI dataset, which contains large-scale road scenes captured from driving cars and serves as a popular benchmark for many computer vision problems related to self-driving cars. Specifically, we construct the KITTI-Object-Depth (KOD) dataset for evaluating foreground depth estimation, as described in Section 3.1. The KOD dataset will be made publicly available for the convenience of future research. Besides, we also apply our method to the KITTI-Object dataset to perform monocular 3D object detection.

Metric Definition
% of s.t.
Table 3: Evaluation metrics of depth estimation. is the number of valid pixels in each test image, denotes the ground truth depth for pixel and is the estimated depth. Three different thresholds ( for ) are used. is a set of pixels which belong to the same image with pixel .

Evaluation Metrics. For evaluation of depth estimation, we follow common practice [17, 37] and use the mean absolute relative error (absRel) and scale invariant logarithmic error (SILog) as the main metrics. We also report mean relative squared error (sqRel), mean log10 error (log10) and accuracy under threshold. The definitions of these metrics are listed in Table 3. As for 3D object detection, we follow prior works [22, 28] and focus on the “car” class. We report the results of 3D and bird’s-eye-view (BEV) object detection on the validation set. The commonly used average precision (AP) with an IoU threshold of 0.7 is calculated. The results on the KITTI easy, moderate and hard difficulty levels are reported.
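For reference, the sketch below computes these metrics using their commonly used definitions from the depth estimation literature. It is an assumption that these match Table 3 exactly: in particular, the SILog form (standard deviation of the log-depth difference), the sqRel normalization, and the thresholds 1.25, 1.25^2 and 1.25^3 are the conventional choices, not values confirmed by the text.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Common depth metrics over pixels with valid (positive) ground truth."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0                       # pixels without ground truth are skipped
    pred, gt = pred[valid], gt[valid]
    ratio = np.maximum(pred / gt, gt / pred)
    log_diff = np.log(pred) - np.log(gt)
    # Variance of the log difference; clamp tiny negatives from rounding.
    silog_var = max(float(np.mean(log_diff ** 2) - np.mean(log_diff) ** 2), 0.0)
    return {
        "absRel": float(np.mean(np.abs(pred - gt) / gt)),
        "sqRel": float(np.mean((pred - gt) ** 2 / gt)),
        "SILog": float(np.sqrt(silog_var)),
        "log10": float(np.mean(np.abs(np.log10(pred) - np.log10(gt)))),
        "delta1": float(np.mean(ratio < 1.25)),
        "delta2": float(np.mean(ratio < 1.25 ** 2)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
    }
```

Restricting `pred` and `gt` to foreground or background pixels before the call gives region-level numbers in the style of Table 2. A prediction that is off by a constant scale factor yields a SILog of approximately zero, which is what "scale invariant" refers to.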

Implementation Details. For depth estimation, we follow most of the settings of the baseline method [37]. The ImageNet pretrained ResNeXt-101 [39] is used as the backbone model. We train the network for epochs, with batch size and base learning rate set to . The Stochastic Gradient Descent (SGD) solver is adopted to optimize the network on a single GPU.

and in foreground-background sensitive loss function are set to . Given a predicted depth map, the point cloud can be reconstructed based on the pinhole camera model. We transform each pixel $(u, v)$ with depth value $D(u, v)$ to a 3D point $(x, y, z)$ in the left camera coordinate system as follows:

$$ z = D(u, v), \qquad x = \frac{(u - c_u)\, z}{f_u}, \qquad y = \frac{(v - c_v)\, z}{f_v}, $$

where $f_u$ and $f_v$ are the focal lengths along the $u$ and $v$ coordinate axes; $c_u$ and $c_v$ are the 2D coordinates of the optical center. Following [36], we set the reflectance to a constant for each point and remove the points higher than 1 m above the LiDAR source. The resulting point cloud is termed pseudo-LiDAR. Afterwards, any existing LiDAR-based 3D object detection method can be applied.
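The pinhole back-projection can be sketched as follows. This is a minimal illustration that covers only the transformation into camera coordinates; the conversion to the LiDAR frame, the reflectance channel and the height filter are omitted, and the function and parameter names (`f_u`, `f_v`, `c_u`, `c_v`) are illustrative.

```python
import numpy as np

def depth_to_point_cloud(depth, f_u, f_v, c_u, c_v):
    """Back-project an (H, W) depth map to an (N, 3) array of camera-frame points."""
    depth = np.asarray(depth, dtype=np.float64)
    h, w = depth.shape
    # v indexes rows (image y), u indexes columns (image x).
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    z = depth
    x = (u - c_u) * z / f_u
    y = (v - c_v) * z / f_v
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Drop pixels without a valid (positive) depth.
    return points[points[:, 2] > 0]

# Toy intrinsics for illustration; real values come from the camera calibration.
cloud = depth_to_point_cloud(np.full((2, 2), 2.0), f_u=1.0, f_v=1.0, c_u=0.0, c_v=0.0)
```

Because `x` and `y` scale linearly with `z`, an error in the estimated depth moves a point along its viewing ray, which is why foreground depth errors translate directly into shifted object locations.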

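Region | Method | absRel | sqRel | SILog | log10 | δ1 | δ2 | δ3
Foreground | DenseDepth | 0.135 | 0.214 | 0.204 | 0.057 | 0.830 | 0.951 | 0.984
Foreground | ForeSeE | 0.118 | 0.193 | 0.205 | 0.053 | 0.851 | 0.952 | 0.982
Global | DenseDepth | 0.138 | 0.208 | 0.209 | 0.062 | 0.782 | 0.946 | 0.987
Global | ForeSeE | 0.138 | 0.213 | 0.210 | 0.061 | 0.793 | 0.949 | 0.987

Table 4: Depth estimation performance compared with DenseDepth [1]. δ1, δ2 and δ3 denote accuracy under the three thresholds.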

5.2 Depth Estimation

Quantitative Results. We show the ablation results in Table 2. Our ForeSeE matches or outperforms the baseline on every metric evaluated at the foreground, background and global levels. Specifically, when evaluated at the foreground level, our method improves the baseline absRel from 0.129 to 0.118. This is in accordance with our intention, as ForeSeE is specifically designed to enhance the ability to estimate foreground depth. We further analyse the effect of each component. When equipped with the separate objectives (SO) described in Section 3.3, the baseline achieves better results on foreground but performs worse when evaluated on background pixels. Directly using the separate decoders (SD) avoids the harm to background. Finally, the performance on foreground is further improved by applying the foreground-background sensitive loss (FSL).

To compare with other state-of-the-art methods, we apply DenseDepth [1] to the KOD benchmark. DenseDepth reports the best performance on the KITTI [7] and NYUv2 [33] datasets among the methods with publicly available training code. We obtain the results of DenseDepth using the code published by the authors on GitHub (https://github.com/ialhashim/DenseDepth). Apart from the dataset used, all settings and hyper-parameters are left unmodified. The results are shown in Table 4. Compared with DenseDepth, our ForeSeE shows a significant advantage at the foreground level. For instance, ForeSeE outperforms DenseDepth by 0.017 absolute absRel (from 0.135 to 0.118 absRel), which is a relative improvement of about 12.6%.

Figure 5: Qualitative comparison of the baseline and our ForeSeE on estimated depth maps.
Figure 6: Qualitative comparison of the baseline and our ForeSeE on converted pseudo-LiDAR signals. Signals in blue are converted from ground truth depth; the baseline pseudo-LiDAR is in red; our ForeSeE pseudo-LiDAR is in yellow.

Qualitative Results. Besides the quantitative comparison, we show some visualization results. The predicted depth maps are visualized in Figure 5. As shown, our ForeSeE estimates more precise depth on foreground regions. The contours of foreground objects are clearer and more accurate. Further, in Figure 6 we compare the estimated depth in the form of 3D point clouds. Although a 2D depth map is good at displaying differences in relative depth and front-view shape, it provides few clues about absolute depth and 3D shape (e.g., it is hard to tell the differences between the similar colors of predictions and ground truth in Figure 5). A 3D point cloud is a more intuitive and reasonable representation for visually comparing and debugging depth maps. As shown in Figure 6, our method shows fewer estimation errors and more accurate bird’s-eye-view (BEV) shapes.

Figure 7: Qualitative results of 3D object detection. The ground truth 3D bounding boxes are in red; the predictions are in green.

5.3 3D Object Detection

To further validate the effectiveness, we conduct experiments on 3D object detection problem. We convert the estimated image-based depth map to LiDAR-like point cloud (pseudo-LiDAR). Then the LiDAR-based algorithms can be applied to recognizing and localizing 3D objects. Here we adopt Frustum-PointNet (F-PointNet) [25] and AVOD [12], specifically the F-PointNet-v1 and AVOD-FPN, which are top-performing 3D object detection methods and both utilize the information from LiDAR and RGB images.

Brief Introduction of Detection Methods. Frustum-PointNet leverages a 2D detector to generate 2D object region proposals in an RGB image. Each 2D region corresponds to a 3D frustum in 3D space. PointNet-based networks are used to estimate a 3D bounding box from the points within the frustum. AVOD uses a multimodal feature fusion RPN which aggregates the front-view image features and BEV LiDAR features to generate 3D object proposals. Based on the proposals, bounding box regression and category classification are performed in the second subnetwork. We apply F-PointNet and AVOD to the pseudo-LiDAR generated by our depth estimation model during both training and inference. Hyper-parameters are not modified. More details about the 3D object detectors can be found in the original papers.

Comparisons with State-of-the-art Methods. It should be noted that some works [36, 38] use DORN [6] pre-trained on the KITTI-raw dataset for depth estimation, and this pre-training data includes images in the training and validation subsets of KITTI-Object. Wang et al. [36] claim that their results serve as the upper bound. If we also pre-train our baseline depth estimation model on KITTI-raw and use it to generate pseudo-LiDAR, we achieve a higher AP than the one they report when both use F-PointNet as the detector and are evaluated on the moderate level of the car class. However, we want to clearly set a baseline and compare fairly to other state-of-the-art monocular 3D detection methods.

We compare our method with other methods in Table 5. The compared 3D object detection methods include Mono3D [2], MLF-MONO [40], ROI-10D [23], MonoGRNet [27], MonoPSR [13], TLNet-Mono [28] and DFDSNet [22]. Our depth estimation models are trained on the KOD training subset, which does not contain the validation subset of KITTI-Object. With either F-PointNet or AVOD as the detection method, our ForeSeE-PL brings remarkable improvements over baseline-PL on all the metrics. Note that the 3D detection average precision (AP_3D) is the most widely used metric, on which our method achieves 7.5 AP gains (from 7.5 to 15.0 AP with AVOD on the easy level) and outperforms all the compared state-of-the-art methods. Another advantage of our method is that it is not limited to specific 3D object detection methods. With a stronger 3D object detector, we achieve larger improvements, e.g., 3.6 AP gains on F-PointNet and 7.5 AP gains on AVOD when evaluated on the easy level of AP_3D.

Method | Easy | Moderate | Hard
Mono3D | 5.2 / 2.5 | 5.2 / 2.3 | 4.1 / 2.3
MLF-MONO | 22.0 / 10.5 | 13.6 / 5.7 | 11.6 / 5.4
ROI-10D | 14.5 / 9.6 | 9.9 / 6.6 | 8.7 / 6.3
MonoGRNet | - / 13.9 | - / 10.2 | - / 7.6
MonoPSR | 20.6 / 12.8 | 18.7 / 11.5 | 14.5 / 8.6
TLNet-Mono | 21.9 / 13.8 | 15.7 / 9.7 | 14.3 / 9.3
DFDSNet | 9.5 / 6.0 | 8.0 / 5.5 | 7.7 / 4.8
F-PN (baseline-PL) | 17.3 / 9.6 | 11.8 / 5.4 | 10.4 / 5.0
F-PN (Our ForeSeE-PL) | 20.2 / 13.2 | 12.6 / 9.4 | 12.0 / 8.2
AVOD (baseline-PL) | 19.0 / 7.5 | 15.3 / 6.1 | 13.0 / 5.4
AVOD (Our ForeSeE-PL) | 23.4 / 15.0 | 17.4 / 12.5 | 15.9 / 12.0
Table 5: Monocular 3D object detection results on KITTI benchmark. We report AP_BEV / AP_3D (in %) of the car category. F-PN refers to Frustum-PointNet. PL refers to pseudo-LiDAR.

Qualitative Results. The visualization of detection results is shown in Figure 7. The 3D bounding boxes are projected into image space for better visualization. There are two obvious advantages of using ForeSeE-PL: fewer missed detections and more accurate localization. Inaccurate depth predictions result in shifted localization or rotated orientation, as in Figure 7(b). Even worse, objects cannot be detected if the depth estimation model treats foreground objects as background regions, which causes more missed detections. Our ForeSeE method largely alleviates these problems by predicting more accurate depth on foreground regions.

6 Conclusion

In this paper, we first analyse the data distributions of foreground and background depth and explicitly explore their interactions. Based on the observations, a simple and effective depth estimation pipeline, namely ForeSeE, is proposed to estimate foreground depth and background depth separately. We introduce a foreground depth estimation benchmark and set fair baselines to encourage future studies. The experiments on monocular depth estimation and 3D object detection demonstrate the effectiveness of ForeSeE. We expect wide application of the proposed method in depth estimation and related downstream problems, e.g., 3D object recognition and localization.


  • [1] I. Alhashim and P. Wonka (2018) High quality monocular depth estimation via transfer learning. arXiv: Comp. Res. Repository. Cited by: §5.2, Table 4.
  • [2] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun (2016) Monocular 3d object detection for autonomous driving. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §5.3.
  • [3] X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun (2015) 3D object proposals for accurate object class detection. In Proc. Advances in Neural Inf. Process. Syst., Cited by: §3.1.
  • [4] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. In Proc. Advances in Neural Inf. Process. Syst., Cited by: §2.
  • [5] X. Fei, A. Wong, and S. Soatto (2019) Geo-supervised visual depth prediction. IEEE Robot. Auto. Letters. Cited by: §2.
  • [6] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao (2018) Deep ordinal regression network for monocular depth estimation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §1, §2, §5.3.
  • [7] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the KITTI dataset. Int. J. Robot. Res.. Cited by: §1, §2, §3.1, §5.2.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §1.
  • [9] D. Hoiem, A. A. Efros, and M. Hebert (2007) Recovering surface layout from an image. Int. J. Comput. Vision. Cited by: §2.
  • [10] J. Jiao, Y. Cao, Y. Song, and R. Lau (2018) Look deeper into depth: monocular depth estimation with semantic booster and attention-driven loss. In Proc. Eur. Conf. Comp. Vis., Cited by: §3.2.
  • [11] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Proc. Advances in Neural Inf. Process. Syst., Cited by: §1.
  • [12] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander (2018) Joint 3d proposal generation and object detection from view aggregation. In Proc. IEEE/RSJ Int. Conf. Intelligent Robots & Systems, Cited by: §5.3.
  • [13] J. Ku, A. D. Pon, and S. L. Waslander (2019) Monocular 3d object detection leveraging accurate proposals and shape reconstruction. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §5.3.
  • [14] L. Ladicky, J. Shi, and M. Pollefeys (2014) Pulling things out of perspective. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2.
  • [15] B. Li, Y. Dai, and M. He (2018) Monocular depth estimation with hierarchical fusion of dilated CNNs and soft-weighted-sum inference. Pattern Recogn. Cited by: §4.1.
  • [16] J. Li, R. Klein, and A. Yao (2017) A two-streamed network for estimating fine-scaled depth maps from single RGB images. In Proc. IEEE Int. Conf. Comp. Vis., Cited by: §2.
  • [17] R. Li, K. Xian, C. Shen, Z. Cao, H. Lu, and L. Hang (2018) Deep attention-based classification network for robust depth prediction. In Proc. Asian Conf. Comp. Vis., Cited by: §1, §5.1.
  • [18] X. Li, Z. Liu, P. Luo, C. C. Loy, and X. Tang (2017) Not all pixels are equal: difficulty-aware semantic segmentation via deep layer cascade. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2.
  • [19] C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman (2008) SIFT flow: dense correspondence across different scenes. In Proc. Eur. Conf. Comp. Vis., Cited by: §2.
  • [20] F. Liu, C. Shen, G. Lin, and I. D. Reid (2016) Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans. Pattern Anal. Mach. Intell. Cited by: §2.
  • [21] F. Liu, C. Shen, and G. Lin (2015) Deep convolutional neural fields for depth estimation from a single image. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2.
  • [22] L. Liu, J. Lu, C. Xu, Q. Tian, and J. Zhou (2019) Deep fitting degree scoring network for monocular 3d object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §5.1, §5.3.
  • [23] F. Manhardt, W. Kehl, and A. Gaidon (2019) ROI-10d: monocular lifting of 2d detection to 6d pose and metric shape. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2, §5.3.
  • [24] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka (2017) 3D bounding box estimation using deep learning and geometry. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2.
  • [25] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2018) Frustum pointnets for 3d object detection from RGB-D data. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §5.3.
  • [26] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia (2018) GeoNet: geometric neural network for joint depth and surface normal estimation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2.
  • [27] Z. Qin, J. Wang, and Y. Lu (2019) MonoGRNet: A geometric reasoning network for monocular 3d object localization. In Proc. AAAI Conf. Artificial Intell., Cited by: §2, §5.3.
  • [28] Z. Qin, J. Wang, and Y. Lu (2019) Triangulation learning network: from monocular to stereo 3d object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §5.1, §5.3.
  • [29] A. Saxena, S. H. Chung, and A. Y. Ng (2005) Learning depth from single monocular images. In Proc. Advances in Neural Inf. Process. Syst., Cited by: §2.
  • [30] A. Saxena, M. Sun, and A. Y. Ng (2009) Make3D: learning 3d scene structure from a single still image. IEEE Trans. Pattern Anal. Mach. Intell. Cited by: §2.
  • [31] L. Sevilla-Lara, D. Sun, V. Jampani, and M. J. Black (2016) Optical flow with semantic segmentation and localized layers. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2.
  • [32] Z. Shen, M. Huang, J. Shi, X. Xue, and T. S. Huang (2019) Towards instance-level image-to-image translation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2.
  • [33] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus (2012) Indoor segmentation and support inference from RGBD images. In Proc. Eur. Conf. Comp. Vis., Cited by: §2, §5.2.
  • [34] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv: Comp. Res. Repository. Cited by: §1.
  • [35] R. Sun, X. Zhu, C. Wu, C. Huang, J. Shi, and L. Ma (2019) Not all areas are equal: transfer learning for semantic segmentation via hierarchical region selection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2.
  • [36] Y. Wang, W. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger (2019) Pseudo-lidar from visual depth estimation: bridging the gap in 3d object detection for autonomous driving. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2, §5.1, §5.3.
  • [37] Y. Wei, Y. Liu, C. Shen, and Y. Yan (2019) Enforcing geometric constraints of virtual normal for depth prediction. In Proc. IEEE Int. Conf. Comp. Vis., Cited by: §2, §3.1, §4.1, §5.1.
  • [38] X. Weng and K. Kitani (2019) Monocular 3d object detection with pseudo-lidar point cloud. arXiv: Comp. Res. Repository. Cited by: §2, §5.3.
  • [39] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §5.1.
  • [40] B. Xu and Z. Chen (2018) Multi-level fusion based 3d object detection from monocular images. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2, §5.3.
  • [41] D. Xu, E. Ricci, W. Ouyang, X. Wang, and N. Sebe (2019) Monocular depth estimation using multi-scale continuous CRFs as sequential deep networks. IEEE Trans. Pattern Anal. Mach. Intell. Cited by: §2.
  • [42] D. Xu, W. Wang, H. Tang, H. Liu, N. Sebe, and E. Ricci (2018) Structured attention guided convolutional neural fields for monocular depth estimation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §1, §2.
  • [43] L. Yuan, Y. Chen, H. Liu, T. Kong, and J. Shi (2019) Zoom-in-to-check: boosting video interpolation via instance-level discrimination. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2.
  • [44] S. Zhao, H. Fu, M. Gong, and D. Tao (2019) Geometry-aware symmetric domain adaptation for monocular depth estimation. arXiv: Comp. Res. Repository. Cited by: §2.