Task-Aware Monocular Depth Estimation for 3D Object Detection, AAAI2020
Monocular depth estimation enables 3D perception from a single 2D image, thus attracting much research attention for years. Almost all methods treat foreground and background regions ("things and stuff") in an image equally. However, not all pixels are equal. Depth of foreground objects plays a crucial role in 3D object recognition and localization. To date how to boost the depth prediction accuracy of foreground objects is rarely discussed. In this paper, we first analyse the data distributions and interaction of foreground and background, then propose the foreground-background separated monocular depth estimation (ForeSeE) method, to estimate the foreground depth and background depth using separate optimization objectives and depth decoders. Our method significantly improves the depth estimation performance on foreground objects. Applying ForeSeE to 3D object detection, we achieve 7.5 AP gains and set new state-of-the-art results among other monocular methods.
Depth bridges the gap between 2D and 3D perception in computer vision. A precise depth map of an image provides rich 3D geometry information like locations and shapes for objects and stuff in a scene, thus attracting more and more attention in both 2D and 3D understanding fields. Monocular depth estimation, which aims to predict the depth map from a single image, is an ill-posed problem, as an infinite number of 3D scenes can be projected to the same 2D image. With the development of deep convolutional neural networks [11, 34, 8], recent works have made great progress [42, 6, 17]. They typically consist of an encoder for feature extraction and a decoder for generating the depth of the whole scene, either by regressing the depth values or predicting the depth range categories. Plausible results have been shown.
When the monocular methods are applied to other tasks focusing on foreground object analysis, e.g., 3D object detection, there are two main obstacles from the low precision of foreground depth: (1) Poor estimate of the object center location; (2) Distorted or faint object shapes. We show some examples in Figure 1. The inaccurate object location and shape make the downstream localization and recognition challenging. The above issues could be handled by enhancing the depth estimation performance on foreground regions. However, all these state-of-the-art methods treat foreground depth and background depth equally, which leads to sub-optimal performance on foreground objects.
In fact, foreground depth and background depth show different data distributions. We make qualitative and quantitative comparisons in Figure 1, Figure 2 and Table 1. First, foreground pixels tend to gather into clusters, exhibit more and larger depth changes, and look like frustums in 3D space rather than flat surfaces like roads and buildings. Second, foreground pixels account for only a small part of the whole scene. For instance, in the KITTI-Object dataset, most pixels belong to background, while only a small share falls within foreground. Furthermore, not all pixels are equal. As just described, foreground pixels play a more crucial role in downstream applications, e.g., autonomous driving and robotic grasping. For example, an estimation error on a car matters far more than the same error on a building. The inaccurate shape and location of the car could be catastrophic for 3D object detectors.
The observations make one wonder how to boost the estimation accuracy of foreground while doing no harm to background. First of all, it is neither a hard example mining problem nor a self-learned local attention problem. Unlike hard example mining, here we want to further enhance the performance on specific regions in the scene, which do not have to be harder examples. As we can see in Figure 3, foreground is indeed not harder than background. Attention mechanisms are widely used to focus on more discriminative local regions in semantic classification problems, e.g., semantic segmentation and fine-grained classification. But this is not the case for depth estimation. Given a close-up of a car in a scene, one could classify the semantic categories, but could not tell the depth. Another choice is to train the foreground and background regions separately, since the data distributions are different. However, we show that the foreground and background are interdependent for inferring the depth and boosting the performance.
Instead, we formulate it as a multi-objective optimization problem. The objective functions of foreground and background depth are separated, and so are the depth decoders. Thus, the foreground depth decoder can fit the foreground depth as well as possible while doing no harm to background.
To summarize, our contributions are as follows:
We present a first discussion of the differences and interactions between foreground and background in monocular depth estimation. We show that the different patterns of foreground and background depth lead to sub-optimal results on foreground pixels.
We propose a new framework, termed ForeSeE, to learn and predict foreground and background depth separately. Specifically, it contains separate depth decoders for foreground and background regions, an objective-sensitive loss function to optimize the corresponding decoders, and a simple yet effective foreground-background merging strategy.
With the proposed ForeSeE, we are able to predict much better foreground depth, while background depth is not harmed. Furthermore, utilizing the predicted depth maps, our model achieves 7.5 AP gains on the 3D object detection task, which effectively verifies our motivation.
Monocular Depth Estimation. Monocular depth estimation is a long-standing problem in computer vision and robotics. Early works [30, 29] mainly leverage MRF formulations and non-parametric optimization to predict the depth from handcrafted features [9, 14, 19]. As large-scale RGB-D datasets [33, 7] became available and high-dimensional features could be constructed by deep convolutional neural networks (DCNN), numerous novel methods were proposed to predict per-pixel depth. Most methods formulate depth estimation as a pixel-wise supervised learning problem. Eigen et al. are the first to utilize multi-scale convolutional neural networks to regress the depth map from a single image. Then, various innovative network architectures [21, 20, 16, 41, 42] are proposed to leverage strong high-level features. Fu et al. propose to use multi-scale dilated convolution and remove subsampling in the last few pooling layers of the encoder to obtain large receptive fields. Furthermore, several methods [5, 44] propose to explicitly enforce geometric constraints in the optimization process. Qi et al. combine surface normal and depth prediction by embedding a depth-surface-normal mutual transformation module into a single network, which is a local geometric constraint. By contrast, Yin et al. propose the virtual normal to leverage long-range high-order geometric relations. In this work, we focus on boosting the depth prediction accuracy of foreground objects with the proposed ForeSeE optimization strategy.
Not All Pixels are Equal. Some prior works noticed that it is sub-optimal to treat all pixels equally in dense prediction tasks. Sevilla et al. tackle optical flow estimation by defining different models of image motion for different regions. Li et al. use a deep layer cascade to first segment the easy pixels, then the harder ones. Sun et al. select and weight synthetic pixels which are similar to real ones for learning semantic segmentation. Yuan et al. introduce an instance-level adversarial loss for the video frame interpolation problem. Shen et al. propose an instance-aware image-to-image translation framework. However, different from the above works, we focus on the depth estimation problem and aim at improving the accuracy of 3D object detection.
Monocular 3D Object Detection. The lack of depth information poses a substantial challenge for estimating 3D bounding boxes from a single image. Many works seek help from geometry priors and estimated depth information. Deep3DBox proposes to generate 3D proposals based on a 2D-3D bounding box consistency constraint. ROI-10D introduces RoI lifting to extract fused feature maps from the input image and the estimated depth map before the 3D bounding box regression. MonoGRNet estimates the depth of the target 3D bounding box's center to aid the 3D localization. Recently, some works [40, 36, 38] propose to convert the estimated depth map to a LiDAR-like point cloud to help localize 3D objects. Wang et al. directly apply 3D object detection methods to the generated pseudo-LiDAR, and claim that a 3D point cloud is a much better representation than a 2D depth map for utilizing depth information. In these methods, a reliable depth map, especially precise foreground depth, is the key to a successful 3D object detection framework. We perform 3D object detection using the pseudo-LiDAR generated by our depth estimation model. The proposed method largely improves the performance and outperforms state-of-the-art methods, which clearly demonstrates its effectiveness.
Dataset. The KITTI dataset has witnessed inspiring progress in the field of depth estimation. As most scenes in the KITTI-Raw data have limited foreground objects, we construct a new benchmark based on the KITTI-Object dataset. We collect the corresponding ground-truth depth map for each image in the KITTI-Object training set, and term the result the KITTI-Object-Depth (KOD) dataset. The image-depth pairs are divided into training and testing subsets such that images in the two subsets belong to different video clips. 2D bounding boxes are used to distinguish the foreground and background pixels. Pixels that fall within the foreground bounding boxes are designated as foreground pixels, while all other pixels are assigned to background.
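The box-based foreground/background split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `foreground_mask` and the inclusive `(x1, y1, x2, y2)` box format are assumptions made for the example.

```python
def foreground_mask(height, width, boxes):
    """Return a height x width grid of booleans; True marks foreground pixels.

    boxes: iterable of (x1, y1, x2, y2) pixel coordinates, inclusive.
    The box format is an assumption of this sketch.
    """
    mask = [[False] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        # Clip the box to the image so partially visible objects are handled.
        x1, y1 = max(0, x1), max(0, y1)
        x2, y2 = min(width - 1, x2), min(height - 1, y2)
        for y in range(y1, y2 + 1):
            for x in range(x1, x2 + 1):
                mask[y][x] = True
    return mask
```

Every pixel left as `False` is treated as background, which mirrors the rule that all pixels outside the foreground boxes are background.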
Baseline Method. We adopt a DCNN-based baseline method which has already shown state-of-the-art performance on several benchmarks. The main structure follows the typical encoder-decoder style. Given an input image, the encoder extracts dense features, then the decoder takes the features and predicts the quantized depth range categories. Specifically, the depth values are discretized into bins in log space. The quantized labels are assigned to the pixels as their classification labels.
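As a rough illustration of log-space discretization, the sketch below maps a depth value to a bin index and back to a representative depth. The depth range `d_min`, `d_max` and the bin count `num_bins` are placeholders chosen by the caller, since the values used in the paper are not given here, and the geometric bin center is an assumption of this example.

```python
import math


def depth_to_bin(depth, d_min, d_max, num_bins):
    """Map a metric depth to a bin index, with bins uniform in log space."""
    depth = min(max(depth, d_min), d_max)  # clamp into the valid range
    log_width = (math.log(d_max) - math.log(d_min)) / num_bins
    index = int((math.log(depth) - math.log(d_min)) / log_width)
    return min(index, num_bins - 1)  # d_max falls into the last bin


def bin_center(index, d_min, d_max, num_bins):
    """Return the geometric center of a bin as its representative depth."""
    log_width = (math.log(d_max) - math.log(d_min)) / num_bins
    return math.exp(math.log(d_min) + (index + 0.5) * log_width)
```

Because the bins are uniform in log depth, nearby depths get finer bins than far ones, which is the usual motivation for this kind of quantization.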
Few works have analysed the depth distribution, not to mention the foreground and background depth distributions. Here we investigate two kinds of data distribution of foreground and background pixels in the training subset. Figure 2 shows the depth value distributions. As shown, more than foreground pixels have depth less than , while it is about for background. The foreground depth also shows a heavier long-tail distribution. Depth gradient distributions are shown in Table 1. We use the Laplacian of the depth images as the depth gradient, which calculates the second-order spatial derivatives. The Laplacian image highlights areas of rapid depth change. The outputs are scaled to [0, 255] and uniformly discretized into three bins, from small to large. In this way, all pixels are divided into three levels according to their gradient values. The foreground has a much higher proportion than background at the higher levels. Besides the depth range and depth gradient, the difference of shapes should also be noted. Generally, depth provides two kinds of information: location and shape. The foreground objects share similar shapes and look like frustums in 3D space, as shown in Figure 1. Based on the above analysis, we propose to consider the foreground and background separately when estimating their depth.
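The gradient-level assignment can be sketched as below. Several details are assumptions of this example rather than facts from the paper: the 4-neighbour Laplacian kernel, taking the absolute response, scaling by the per-image maximum, and leaving border pixels at level 0. The bin edges at 85 and 170 follow from splitting [0, 255] uniformly into three bins.

```python
def laplacian_levels(depth):
    """Assign each pixel a gradient level in {0, 1, 2}; borders stay at 0."""
    h, w = len(depth), len(depth[0])
    lap = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # 4-neighbour discrete Laplacian (second-order spatial derivative).
            lap[y][x] = abs(depth[y - 1][x] + depth[y + 1][x]
                            + depth[y][x - 1] + depth[y][x + 1]
                            - 4.0 * depth[y][x])
    peak = max(max(row) for row in lap)
    scale = 255.0 / peak if peak > 0 else 0.0
    # Three uniform bins over [0, 255]: [0, 85), [85, 170), [170, 255].
    return [[min(int(v * scale // 85), 2) for v in row] for row in lap]
```

Counting the levels separately over foreground and background pixels would give per-region proportions of the kind reported in Table 1.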
In dense prediction tasks, generally the loss function can be formulated as:
$$L = \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, \hat{y}_i),$$
where $N$ is the number of pixels, and $y_i$ and $\hat{y}_i$ are the prediction and ground truth of pixel $i$. $\ell$ is the error function, e.g., the widely used cross-entropy error function.
After the analysis in Section 3.2, we further investigate the interaction of foreground and background by splitting the optimization objective. The modified loss function is defined as:
$$L = \lambda \, \frac{1}{N_f} \sum_{i=1}^{N_f} \ell(y_i, \hat{y}_i) + (1 - \lambda) \, \frac{1}{N_b} \sum_{j=1}^{N_b} \ell(y_j, \hat{y}_j),$$
where $N_f$ is the number of foreground pixels, $N_b$ is the number of background pixels and $\lambda$ acts as the weight to balance the two loss terms. Figure 3 shows our results with different settings of $\lambda$. When $\lambda$ is set to $0$, which means only the background samples are used to supervise the training, the result on foreground becomes very poor. Similarly, the performance on background drops sharply when $\lambda$ is set to $1$. This verifies that the foreground depth and background depth are distributed differently. When we increase the foreground weight within a certain range, the result on background improves, which indicates that the foreground and background can to some extent help each other. Further, it should be noted that the optimal values of $\lambda$ for foreground and background are different. For instance, the value of $\lambda$ that gives the best performance on foreground yields a much poorer result on background. This indicates that the optimization objectives for foreground and background are not consistent. To address these issues, in Section 4, the foreground-background separated depth estimation method is proposed to achieve the optimum points at the same time.
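A minimal sketch of the split objective follows, assuming the balance is a convex combination with foreground weight `lam` (so `lam = 0` uses only background samples and `lam = 1` only foreground samples). The function and argument names are illustrative, and the per-pixel errors are taken as precomputed numbers rather than tied to a particular error function.

```python
def split_loss(errors, is_foreground, lam):
    """Weighted sum of the mean foreground error and mean background error.

    errors: per-pixel error values (e.g. cross-entropy), flattened.
    is_foreground: booleans of the same length as errors.
    lam: foreground weight in [0, 1]; the background weight is 1 - lam.
    """
    fg = [e for e, f in zip(errors, is_foreground) if f]
    bg = [e for e, f in zip(errors, is_foreground) if not f]
    # Each region is averaged over its own pixel count, so the small
    # foreground region is not swamped by the much larger background.
    fg_mean = sum(fg) / len(fg) if fg else 0.0
    bg_mean = sum(bg) / len(bg) if bg else 0.0
    return lam * fg_mean + (1.0 - lam) * bg_mean
```

Sweeping `lam` over [0, 1] and evaluating foreground and background separately is the kind of experiment summarized in Figure 3.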
We highlight three observations:
The foreground and background depth have different depth value distributions, depth gradient distributions and shape patterns;
The foreground and background depth reinforce each other due to their shared similarities;
The optimization objectives of foreground and background depth estimation are mismatched.
In this section we first introduce the network architecture of our method, then present the proposed loss function, and finally show how the mask used to distinguish foreground and background could be dropped during the inference. The whole pipeline is illustrated in Figure 4.
We construct an additional decoder based on the baseline method, so there are two parallel decoders with the same structure. One of the decoders is for foreground depth prediction, while the other aims to estimate the background depth. Specifically, for an image of size $H \times W$, each decoder outputs a tensor of size $H \times W \times C$, where $C$ is the number of depth range categories.
Foreground regions are cropped from the output of foreground depth decoder. The background depth range predictions are obtained in the same way. The global depth range predictions are generated by a seamless merge of foreground and background regions. Then the depth range predictions are converted to the final depth map using the soft-weighted-sum strategy .
As observed in Section 3, although the foreground depth and background depth show different patterns, they do share some similarities and could reinforce each other under an appropriate ratio. Thus, we further weight the foreground and background samples. For either the foreground or the background branch, the loss function is a weighted average over foreground samples and background samples, but with a different bias. Here we define the loss function which supervises the foreground branch as:
$$L_f = \lambda_f \, L_f^{f} + (1 - \lambda_f) \, L_f^{b},$$
where $L_f^{f}$ represents the mean error calculated on the foreground predictions of the foreground branch; $L_f^{b}$ is the mean error of its background predictions; $\lambda_f$ is the weight to balance the foreground and background samples during the training of the foreground branch. Larger $\lambda_f$ leads to more preference for foreground samples. Similarly, the loss function of the background branch is formulated as:
$$L_b = \lambda_b \, L_b^{b} + (1 - \lambda_b) \, L_b^{f},$$
where $\lambda_b$ is the weight; $L_b^{b}$ and $L_b^{f}$ are the mean errors of background predictions and foreground predictions on this background branch.
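The two branch losses can be sketched together as below. The convex-combination form and the weight names `w_f` and `w_b` are assumptions of this example; the source only states that each branch uses a weighted average of foreground and background samples with a different bias.

```python
def _mean(values):
    return sum(values) / len(values) if values else 0.0


def branch_losses(fg_branch_errors, bg_branch_errors, is_foreground, w_f, w_b):
    """Return (foreground-branch loss, background-branch loss).

    fg_branch_errors / bg_branch_errors: per-pixel errors of each decoder.
    w_f: weight of foreground samples in the foreground branch.
    w_b: weight of background samples in the background branch.
    """
    def split(errors):
        fg = _mean([e for e, m in zip(errors, is_foreground) if m])
        bg = _mean([e for e, m in zip(errors, is_foreground) if not m])
        return fg, bg

    ff, fb = split(fg_branch_errors)  # foreground branch: fg / bg pixel errors
    bf, bb = split(bg_branch_errors)  # background branch: fg / bg pixel errors
    loss_f = w_f * ff + (1.0 - w_f) * fb
    loss_b = w_b * bb + (1.0 - w_b) * bf
    return loss_f, loss_b
```

With both weights above 0.5, each decoder is biased toward its own region while still seeing the other region's samples, which matches the observation that the two regions can reinforce each other.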
Here we propose a mask-free merge method such that the binary mask is no longer needed once the training is finished. A max-pooling operation is applied to the pre-softmax foreground and background outputs, which represent the confidence scores for each range category. For each range category of each pixel, the higher confidence score between the foreground and background outputs is retained to serve as the final prediction. Formally, for the $H \times W \times C$-shaped outputs $O^f$ and $O^b$, where $H \times W$ is the image size and $C$ is the number of depth range categories, the final predictions at each pixel $(i, j)$ are calculated as:
$$O_{i,j} = \max\left(O^f_{i,j},\, O^b_{i,j}\right),$$
where $O^f$ and $O^b$ represent the outputs of the foreground and background branches, and $\max$ is an element-wise maximum operator which takes two vectors as input and outputs a new vector. Then the $H \times W \times C$-shaped output is fed to the softmax and soft-weighted-sum operations to produce the final depth map. The absRel results drop only slightly compared with the mask-based merge method.
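For a single pixel, the mask-free merge and decoding can be sketched as follows. The element-wise maximum and the softmax follow the description above; the final step here is a plain expectation over assumed representative bin depths (`bin_depths`), whereas the cited soft-weighted-sum strategy may differ in detail, for example by averaging in log space.

```python
import math


def merge_and_decode(fg_logits, bg_logits, bin_depths):
    """Merge per-category scores of two branches and decode one depth value."""
    # Element-wise maximum over the two branches, per depth category.
    merged = [max(f, b) for f, b in zip(fg_logits, bg_logits)]
    # Numerically stable softmax over the depth categories.
    peak = max(merged)
    exps = [math.exp(v - peak) for v in merged]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Weighted sum: expectation of the bin depths under the probabilities.
    return sum(p * d for p, d in zip(probs, bin_depths))
```

No foreground mask appears in this function: whichever branch is more confident about a category at a pixel determines that category's score.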
Datasets. We carry out experiments on the KITTI dataset, which contains large-scale road scenes captured from driving cars and serves as a popular benchmark for many computer vision problems related to self-driving cars. Specifically, we construct the KITTI-Object-Depth (KOD) dataset for evaluating foreground depth estimation, as described in Section 3.1. The KOD dataset will be made publicly available for the convenience of future research. Besides, we also apply our method to the KITTI-Object dataset to perform monocular 3D object detection.
Evaluation Metrics. For evaluation of depth estimation, we follow common practice [17, 37] and use the mean absolute relative error (absRel) and scale invariant logarithmic error (SILog) as the main metrics. We also report mean relative squared error (sqRel), mean error and accuracy under threshold. The definitions of these metrics are listed in Table 3. As for 3D object detection, we follow prior works [22, 28] and focus on the “car” class. We report the results of 3D and bird’s-eye-view (BEV) object detection on the validation set. The commonly used average precision (AP) with an IoU threshold of 0.7 is calculated. The results on the KITTI easy, moderate and hard difficulty levels are reported.
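As a reference point, the sketch below computes commonly used forms of these depth metrics over a list of valid pixels. The exact definitions in Table 3 may differ (for instance in how SILog is scaled), and the default threshold of 1.25 is the conventional choice, assumed here rather than taken from the paper.

```python
import math


def depth_metrics(pred, gt, threshold=1.25):
    """Compute absRel, sqRel, SILog and accuracy under threshold.

    pred, gt: equal-length sequences of positive depths for valid pixels.
    """
    n = len(gt)
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / n
    # Scale-invariant log error: standard deviation of the log differences.
    log_diff = [math.log(p) - math.log(g) for p, g in zip(pred, gt)]
    mean_sq = sum(d * d for d in log_diff) / n
    mean = sum(log_diff) / n
    si_log = math.sqrt(max(mean_sq - mean * mean, 0.0))
    # Fraction of pixels whose ratio to ground truth is within the threshold.
    accuracy = sum(max(p / g, g / p) < threshold for p, g in zip(pred, gt)) / n
    return {"absRel": abs_rel, "sqRel": sq_rel, "SILog": si_log, "accuracy": accuracy}
```

To evaluate foreground and background separately, the same function can be applied to the pixels selected by a foreground mask and by its complement.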
Implementation Details. For depth estimation, we follow most of the settings of the baseline method. The ImageNet pretrained ResNeXt-101 is used as the backbone model. We train the network for epochs, with batch size and base learning rate set to . The Stochastic Gradient Descent (SGD) solver is adopted to optimize the network on a single GPU. The two weights in the foreground-background sensitive loss function are set to . Given a predicted depth map, the point cloud can be reconstructed based on the pinhole camera model. We transform each pixel $(u, v)$ with depth value $D(u, v)$ to a 3D point $(x, y, z)$ in the left camera coordinate system as follows:
$$z = D(u, v), \qquad x = \frac{(u - c_u)\, z}{f_u}, \qquad y = \frac{(v - c_v)\, z}{f_v},$$
where $f_u$ and $f_v$ are the focal lengths along the $u$ and $v$ coordinate axes; $c_u$ and $c_v$ are the 2D coordinates of the optical center. Following prior work, we set the reflectance to a constant value for each point and remove the points higher than 1 m above the LiDAR source. The resulting point cloud is termed pseudo-LiDAR. Afterwards, any existing LiDAR-based 3D object detection method can be applied.
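The pinhole back-projection can be sketched in a few lines. The function name and the convention of skipping non-positive depths are assumptions of this example; the reflectance channel and the height filter are omitted.

```python
def depth_to_points(depth, f_u, f_v, c_u, c_v):
    """Back-project a depth map to 3D points in the camera frame.

    depth: 2D list indexed as depth[v][u]; non-positive values are skipped.
    f_u, f_v: focal lengths in pixels; c_u, c_v: optical center in pixels.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # no valid depth at this pixel
            x = (u - c_u) * z / f_u
            y = (v - c_v) * z / f_v
            points.append((x, y, z))
    return points
```

The resulting list of `(x, y, z)` tuples plays the role of the pseudo-LiDAR point cloud that a LiDAR-based detector would consume after conversion to its expected coordinate frame.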
Quantitative Results. We show the ablation results in Table 2. Our ForeSeE outperforms the baseline over all metrics evaluated at the foreground, background and global levels. Specifically, when evaluated at the foreground level, our method improves over the baseline in absRel. This is in accordance with our intention that ForeSeE is specifically designed to enhance the ability to estimate foreground depth. We further analyse the effect of each component. When equipped with the separate objectives (SO) described in Section 3.3, the baseline achieves better results on foreground but performs worse when evaluated on background pixels. Directly using the separate decoders (SD) avoids the harm to background. Finally, the performance on foreground is further improved by applying the foreground-background sensitive loss (FSL).
To compare with other state-of-the-art methods, we apply DenseDepth to the KOD benchmark, which reports the best performance on the KITTI and NYUv2 datasets among the methods with publicly available training code. We obtain the results of DenseDepth using the code published by the authors on GitHub (https://github.com/ialhashim/DenseDepth). Apart from the dataset used, none of the settings or hyper-parameters are modified. The results are shown in Table 4. Compared with DenseDepth, our ForeSeE shows a significant advantage at the foreground level, for instance in absRel.
Qualitative Results. Besides the quantitative comparison, we show some visualization results. The predicted depth maps are visualized in Figure 5. As shown, our ForeSeE estimates more precise depth on foreground regions. The contours of foreground objects are clearer and more accurate. Further, in Figure 6 we compare the estimated depth in the form of 3D point clouds. Although a 2D depth map is good at displaying differences in relative depth and front-view shape, it provides few clues about absolute depth and 3D shape (e.g., it is hard to tell the differences between the similar colors of predictions and ground truth in Figure 5). A 3D point cloud is a more intuitive and reasonable representation for visually comparing and debugging depth maps. As shown in Figure 6, our method shows fewer estimation errors and more accurate bird’s-eye-view (BEV) shapes.
To further validate the effectiveness, we conduct experiments on 3D object detection problem. We convert the estimated image-based depth map to LiDAR-like point cloud (pseudo-LiDAR). Then the LiDAR-based algorithms can be applied to recognizing and localizing 3D objects. Here we adopt Frustum-PointNet (F-PointNet)  and AVOD , specifically the F-PointNet-v1 and AVOD-FPN, which are top-performing 3D object detection methods and both utilize the information from LiDAR and RGB images.
Brief Introduction of Detection Methods. Frustum-PointNet leverages a 2D detector to generate 2D object region proposals in an RGB image. Each 2D region corresponds to a 3D frustum in 3D space. PointNet-based networks are used to estimate a 3D bounding box from the points within the frustum. AVOD uses a multimodal feature fusion RPN which aggregates the front-view image features and BEV LiDAR features to generate 3D object proposals. Based on the proposals, bounding box regression and category classification are performed in the second subnetwork. We apply F-PointNet and AVOD to the pseudo-LiDAR generated by our depth estimation model during training and inference. Hyper-parameters are not modified. More details about the 3D object detectors can be found in the original papers.
Comparisons with State-of-the-art Methods. It should be noted that some works [36, 38] use DORN pre-trained on the KITTI-raw dataset for depth estimation, which includes the images in the training and validation subsets of KITTI-Object. Wang et al. claim that their results serve as the upper bound. If we also pre-train our baseline depth estimation model on KITTI-raw and use it to generate pseudo-LiDAR, we achieve a higher AP than they report when both use F-PointNet as the detector and evaluate on the moderate level of the car class. However, we want to clearly set a baseline and compare fairly with other state-of-the-art monocular 3D detection methods.
We compare our method with other methods in Table 5. The compared 3D object detection methods include Mono3D, MLF-MONO, ROI-10D, MonoGRNet, MonoPSR, TLNet-Mono and DFDSNet. Our depth estimation models are trained on the KOD training subset, which does not contain the validation subset of KITTI-Object. With either F-PointNet or AVOD as the detection method, our ForeSeE-PL brings remarkable improvements over baseline-PL on all the metrics. Note that the 3D detection average precision ($AP_{3D}$) is the most widely used metric, on which our method achieves 7.5 AP gains (from 7.5 to 15.0 AP) and outperforms all the state-of-the-art methods. Another advantage of our method is that it is not limited to specific 3D object detection methods. With a stronger 3D object detector, we achieve larger improvements, e.g., 3.6 AP gains on F-PointNet and 7.5 AP gains on AVOD when evaluated on the easy level of $AP_{3D}$.
Table 5: 3D object detection results for the car class on the KITTI validation set. Each cell reports $AP_{BEV}$ / $AP_{3D}$ at an IoU threshold of 0.7.
| Method | Easy | Moderate | Hard |
|---|---|---|---|
| Mono3D | 5.2 / 2.5 | 5.2 / 2.3 | 4.1 / 2.3 |
| MLF-MONO | 22.0 / 10.5 | 13.6 / 5.7 | 11.6 / 5.4 |
| ROI-10D | 14.5 / 9.6 | 9.9 / 6.6 | 8.7 / 6.3 |
| MonoGRNet | - / 13.9 | - / 10.2 | - / 7.6 |
| MonoPSR | 20.6 / 12.8 | 18.7 / 11.5 | 14.5 / 8.6 |
| TLNet-Mono | 21.9 / 13.8 | 15.7 / 9.7 | 14.3 / 9.3 |
| DFDSNet | 9.5 / 6.0 | 8.0 / 5.5 | 7.7 / 4.8 |
| F-PN (baseline-PL) | 17.3 / 9.6 | 11.8 / 5.4 | 10.4 / 5.0 |
| F-PN (Our ForeSeE-PL) | 20.2 / 13.2 | 12.6 / 9.4 | 12.0 / 8.2 |
| AVOD (baseline-PL) | 19.0 / 7.5 | 15.3 / 6.1 | 13.0 / 5.4 |
| AVOD (Our ForeSeE-PL) | 23.4 / 15.0 | 17.4 / 12.5 | 15.9 / 12.0 |
Qualitative Results. Visualizations of the detection results are shown in Figure 7. The 3D bounding boxes are projected into image space for better visualization. There are two obvious advantages of using ForeSeE-PL: fewer missed detections and more accurate localization. Inaccurate depth predictions result in shifted localization or rotated orientation, as in Figure 7(b). Even worse, objects cannot be detected if the depth estimation model treats foreground objects as background regions, which causes more missed detections. Our ForeSeE method largely alleviates these problems by predicting more accurate depth on foreground regions.
In this paper, we first analyse the data distribution of foreground and background depth and explicitly explore their interactions. Based on the observations, a simple and effective depth estimation pipeline, namely ForeSeE, is proposed to estimate foreground depth and background depth separately. We introduce a foreground depth estimation benchmark and set fair baselines to encourage future studies. The experiments on monocular depth estimation and 3D object detection demonstrate the effectiveness of ForeSeE. We expect wide application of the proposed method in depth estimation and related downstream problems, e.g., 3D object recognition and localization.
High quality monocular depth estimation via transfer learning. arXiv: Comp. Res. Repository.
3D bounding box estimation using deep learning and geometry. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.